## 1 Importing the Pre-Processed Dataset

The dataset was previously split into four parts, so we reload each one from its CSV file:

- X_train (training features)
- X_val (validation features)
- y_train (training labels)
- y_val (validation labels)

In [1]:
import pandas as pd

In [2]:
X_train = pd.read_csv('X_train.csv')
X_val = pd.read_csv('X_val.csv')
y_train = pd.read_csv('y_train.csv')
y_val = pd.read_csv('y_val.csv')

Because we are only trying to distinguish ```True``` from ```False``` based on the tweet text, we select the "cleaned_text" column from X_train and X_val, and the "target" column from y_train and y_val.

In [3]:
train_text = X_train['cleaned_text'].to_list()
train_label = y_train['target'].to_list()
val_text = X_val['cleaned_text'].to_list()
val_label = y_val['target'].to_list()
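
One detail worth checking when text columns are re-read from CSV: pandas parses an empty field as NaN, so a tweet that became empty during cleaning would come back as a float rather than a string. Whether this actually occurs in these files is not known. The sketch below uses a small in-memory CSV (hypothetical data, with an extra `id` column added for illustration) to show the effect and one possible guard using `fillna`.

```python
import io
import pandas as pd

# Hypothetical stand-in for X_train.csv; the second row has an empty text field.
csv_data = io.StringIO(
    "id,cleaned_text\n"
    "1,forest fire near town\n"
    "2,\n"
    "3,sunni day park\n"
)
demo = pd.read_csv(csv_data)

# The empty field is parsed as a missing value, which a text vectorizer cannot tokenize.
n_missing = int(demo['cleaned_text'].isna().sum())

# Replace missing texts with empty strings before converting to a list.
demo_text = demo['cleaned_text'].fillna('').to_list()
print(n_missing, demo_text)
```

If the real files do contain missing texts, the same `fillna('')` call could be applied to `X_train['cleaned_text']` and `X_val['cleaned_text']` before `to_list()`.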

In [4]:
train_text[:5]

['  jimmyfallon crush squirrel bone mortar pestl school  bio dept  realli sure whi worstsummerjob',
 ' mccainenl think spectacular look stonewal riot obliter white house ',
 'can t bloodi wait   soni set date stephen king       the dark tower    stephenk thedarktow    bdisgust',
 'protest ralli stone mountain  atleast they r burn build loot store like individu  protest ',
 ' rbcinsur quot websit   disaster  tri 3 browser  amp  3 machines  alway get  miss info  error due non exist drop down ']

In [5]:
train_label[:5]

[0, 1, 0, 0, 0]

## 2 Training and Validating Classifier

We will use TfidfVectorizer to tokenize the tweets and convert them into TF-IDF feature vectors, and the SVC classifier from sklearn.svm to classify them.

Thus, the pipeline would be:
1. TfidfVectorizer
2. SVC
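
To make the first step concrete, here is a minimal pure-Python sketch of the term weighting described in the scikit-learn documentation for TF-IDF: a term frequency multiplied by an inverse document frequency. The function names and the three-document corpus are illustrative assumptions, and the sketch leaves out the L2 normalization that the vectorizer applies to each row by default.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency, following the formulas in the scikit-learn docs."""
    if smooth:
        # smooth_idf=True: acts as if one extra document contained every term
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # smooth_idf=False
    return math.log(n_docs / doc_freq) + 1

def tf_weight(count, sublinear=False):
    """Term frequency; sublinear_tf=True replaces tf with 1 + log(tf)."""
    return 1 + math.log(count) if sublinear else float(count)

# Illustrative corpus (hypothetical, already tokenized)
docs = [["fire", "fire", "forest"], ["fire", "drill"], ["sunni", "day"]]
n_docs = len(docs)
df_fire = sum("fire" in d for d in docs)  # "fire" appears in 2 of the 3 documents

# Unnormalized weight of "fire" in the first document (count = 2) under two settings
w_default = tf_weight(2) * idf(n_docs, df_fire, smooth=True)
w_alt = tf_weight(2, sublinear=True) * idf(n_docs, df_fire, smooth=False)
print(round(w_default, 3), round(w_alt, 3))
```

The `smooth` and `sublinear` arguments here mirror the `smooth_idf` and `sublinear_tf` options of TfidfVectorizer.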

Before the final training, we use GridSearchCV to run an exhaustive, cross-validated search over selected parameters of TfidfVectorizer and SVC and find the combination that gives the best results.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(decode_error="ignore")),
    ('svc', SVC(random_state=888)),
])

parameters = {
    'tfidf__ngram_range': ((1,1), (1,2), (2,2)),
    'tfidf__use_idf': (True, False),
    'tfidf__smooth_idf': (True, False),
    'tfidf__sublinear_tf': (True, False),
    'svc__C': (1.3, 1.5, 1.7),
}

grid = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=3, cv=3)
grid_result = grid.fit(train_text, train_label)

print("Best score: " + str(grid_result.best_score_))
print("Best parameters: " + str(grid_result.best_params_))

Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best score: 0.8031198686371099
Best parameters: {'svc__C': 1.5, 'tfidf__ngram_range': (1, 2), 'tfidf__smooth_idf': False, 'tfidf__sublinear_tf': True, 'tfidf__use_idf': True}
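
The figure of 72 candidates in the log follows directly from the grid: GridSearchCV tries every combination of the listed values, and each combination is fitted once per fold. A short sketch of that arithmetic, using only the standard library and the same grid as above:

```python
from itertools import product

parameters = {
    'tfidf__ngram_range': ((1, 1), (1, 2), (2, 2)),
    'tfidf__use_idf': (True, False),
    'tfidf__smooth_idf': (True, False),
    'tfidf__sublinear_tf': (True, False),
    'svc__C': (1.3, 1.5, 1.7),
}
cv_folds = 3

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]
n_candidates = len(candidates)      # 3 * 2 * 2 * 2 * 3
n_fits = n_candidates * cv_folds    # one fit per candidate per fold

print(n_candidates, n_fits)  # 72 216
```

Since `smooth_idf` only matters when `use_idf` is True, some of these candidates are equivalent in practice, so the grid could likely be trimmed without changing the outcome.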


Based on the result above, we apply the best parameters in the final training.

In [20]:
tfidf_vectorizer = TfidfVectorizer(
    decode_error="ignore",
    ngram_range=(1,2),
    smooth_idf=False,
    sublinear_tf=True,
    use_idf=True
)
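# Learn the vocabulary and IDF weights from the training text only
train_text_vectorized = tfidf_vectorizer.fit_transform(train_text)
# Reuse the fitted vocabulary on the validation text
val_text_vectorized = tfidf_vectorizer.transform(val_text)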

model = SVC(random_state=888, C=1.5)
model.fit(train_text_vectorized, train_label)

SVC(C=1.5, random_state=888)
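
An alternative to rebuilding the vectorizer and classifier by hand is to reuse the fitted search object: with the default `refit=True`, GridSearchCV retrains the whole pipeline on all the training data using the best parameters and exposes it as `best_estimator_`. The sketch below shows this on a tiny made-up corpus. The texts, labels, reduced grid, and `demo_*` names are illustrative assumptions, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 1 = disaster-related, 0 = not
demo_text = [
    "forest fire near town", "flood warning issued today",
    "earthquake damage reported downtown", "storm destroys several homes",
    "love this sunny day", "great movie last night",
    "coffee with friends today", "new song is fire",
]
demo_label = [1, 1, 1, 1, 0, 0, 0, 0]

demo_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svc', SVC(random_state=888)),
])
demo_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'svc__C': [1.0, 1.5],
}

# refit=True (the default) refits the best pipeline on all of demo_text
search = GridSearchCV(demo_pipeline, demo_grid, cv=2)
search.fit(demo_text, demo_label)

# The refitted pipeline accepts raw strings, so no manual transform step is needed
best_model = search.best_estimator_
pred = best_model.predict(["wildfire near town"])
print(search.best_params_, pred)
```

In the notebook's setting this would correspond to calling `grid_result.best_estimator_.predict(val_text)`, which should behave like the manually rebuilt model provided the same parameters and random state are used.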

We then evaluate the predictions on the validation set:

In [21]:
from sklearn.metrics import classification_report

y_val_prediction = model.predict(val_text_vectorized)

print(classification_report(val_label, y_val_prediction))
model_score = model.score(val_text_vectorized, val_label)
print("Mean accuracy " + str(model_score))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       884
           1       0.78      0.66      0.71       639

    accuracy                           0.78      1523
   macro avg       0.78      0.76      0.77      1523
weighted avg       0.78      0.78      0.77      1523

Mean accuracy 0.7774130006565988
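
To read the report, it helps to recall how each column is derived from the counts of correct and incorrect predictions. The sketch below recomputes precision, recall, and F1 for the positive class on a small hypothetical label list (not the notebook's predictions), and then uses the support column above to compute the accuracy of always predicting the majority class.

```python
# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Majority-class baseline, taken from the support column of the report above
support = {0: 884, 1: 639}
baseline_accuracy = max(support.values()) / sum(support.values())

print(precision, recall, f1, accuracy, round(baseline_accuracy, 2))
```

Against that baseline of roughly 0.58, the validation accuracy of about 0.78 is a clear improvement, while the recall of 0.66 for class 1 indicates that about a third of the positive tweets are still missed.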
