# Experiment 01

### General description

<ul>
    <li>Encoding: TF-IDF</li>
    <li>Models: logistic regression vs. linear SVC vs. multinomial naive Bayes</li>
    <li>Training set: 100% jigsaw dataset</li>
    <li>Test set: CTEC dataset (sent by Alex)</li>
    <li>Metric: ROC AUC score</li>
</ul>
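ROC AUC measures how well a model ranks positive examples above negative ones, so it is normally computed from predicted probabilities (or decision scores) rather than from hard class labels. The following is a minimal sketch on hypothetical toy values (the arrays below are made up for illustration and are not taken from either dataset); it shows that thresholding the scores first can change the result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.7])

# AUC from probabilities: every positive is ranked above every negative
auc_from_prob = roc_auc_score(y_true, y_prob)

# AUC from hard labels: thresholding at 0.5 discards the ranking below the cut
y_hard = (y_prob >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_hard)

print(auc_from_prob, auc_from_labels)
```

In this toy case the ranking is perfect, but one positive example falls below the 0.5 threshold, so the label-based score is lower than the probability-based one.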

### Hyperparameter tuning 

#### Logistic regression

<ul>
    <li>inverse of regularization strength (<code>C</code>): 0.1, 1, 10</li>
</ul>
    
#### Multinomial naive Bayes

<ul>
    <li>additive smoothing parameter (<code>alpha</code>): 0.1, 1, 10</li>
</ul>

#### Linear SVC

<ul>
    <li>inverse of regularization strength (<code>C</code>): 0.1, 1, 10</li>
</ul>
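The grids above could be searched along the following lines. This is a minimal sketch on a hypothetical toy corpus (the texts and labels are invented for illustration). Wrapping the vectorizer and the classifier in a `Pipeline` is an assumption of this sketch, not something the notebook does; it makes the TF-IDF vocabulary get re-fit inside each cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 toxic (1) and 6 non-toxic (0) comments
texts = [
    "you are an idiot", "what a stupid idea", "shut up you fool",
    "hate you all", "you are so dumb", "get lost you moron",
    "thanks for the help", "have a nice day", "great point well made",
    "agree with this", "interesting article thanks", "good luck with that",
]
labels = [1] * 6 + [0] * 6

# TF-IDF encoding followed by logistic regression
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Search the same C grid as listed above, scored by ROC AUC
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=3,
)
search.fit(texts, labels)

print(search.best_params_)
```

The same pattern should apply to `MultinomialNB` with an `alpha` grid; only the final pipeline step and the parameter name change.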

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!unzip "drive/My Drive/Lambus Li Internship/train_preproc.zip"
!unzip "drive/My Drive/Lambus Li Internship/ctec_training_data_preproc.zip"

Archive:  drive/My Drive/Lambus Li Internship/train_preproc.zip
  inflating: train_preproc.csv       
Archive:  drive/My Drive/Lambus Li Internship/ctec_training_data_preproc.zip
  inflating: ctec_training_data_preproc.csv  


In [3]:
# Import modules 
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import StandardScaler

import warnings 
warnings.simplefilter('module')

## Part 1: Load data and set up models

The texts in the datasets are already preprocessed (uncensoring, lemmatizing, etc.).

### Load datasets

In [4]:
# Load data
toxic_df_train = pd.read_csv('train_preproc.csv')
toxic_df_test = pd.read_csv('ctec_training_data_preproc.csv')

# All inputs 
X_train_text = toxic_df_train['comment_text']
X_test_text = toxic_df_test['comment_text']

# target >= 0.5 --> toxic --> label = 1
# target < 0.5 --> non-toxic --> label = 0
toxic_df_train.loc[toxic_df_train['target'] >= 0.5, 'label'] = 1
toxic_df_train.loc[toxic_df_train['target'] < 0.5, 'label'] = 0

# All labels / outputs 
y_train = toxic_df_train['label']
y_test = toxic_df_test['label']


### Vectorize text data with TF-IDF

In [5]:
encoder = TfidfVectorizer(strip_accents = 'unicode', stop_words = 'english')
X_train_unscaled = encoder.fit_transform(X_train_text)
X_test_unscaled = encoder.transform(X_test_text)


### Normalize data

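Standardizing the features (zero mean, unit standard deviation) can improve the performance of training algorithms, especially linear SVC.

The sparse matrix `X_train_unscaled` cannot be centered to zero mean, however: subtracting the column means would turn almost every zero entry into a nonzero one, which amounts to building a dense matrix, and working with a dense matrix of this size is computationally expensive. The scaler is therefore created with `with_mean = False`, so each feature is only rescaled to unit variance.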
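The behaviour described above can be checked on a small example. This is a minimal sketch with a hypothetical 3x2 sparse matrix (the values are made up): `StandardScaler` refuses to center sparse input, while `with_mean = False` keeps the matrix sparse and scales each column to unit standard deviation.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical tiny sparse matrix standing in for the TF-IDF output
X = sparse.csr_matrix(np.array([[0.0, 1.0], [0.0, 3.0], [2.0, 0.0]]))

# Centering a sparse matrix is rejected by scikit-learn
try:
    StandardScaler(with_mean=True).fit(X)
    centering_failed = False
except ValueError:
    centering_failed = True

# Scaling without centering keeps the result sparse
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
still_sparse = sparse.issparse(X_scaled)

# Zeros stay zeros, and each column now has unit standard deviation
same_nnz = X_scaled.nnz == X.nnz
col_std = X_scaled.toarray().std(axis=0)

print(centering_failed, still_sparse, same_nnz, col_std)
```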


In [6]:
scaler = StandardScaler(with_mean = False)
X_train = scaler.fit_transform(X_train_unscaled)
X_test = scaler.transform(X_test_unscaled)

## Part 2: Training and testing

### Logistic regression

In [10]:
# Model and hyperparameter tuning
# The grid search is disabled here; C is fixed at 1 instead.
# clf = GridSearchCV(
#     LogisticRegression(max_iter = 2000), 
#     param_grid = {'C': [0.1, 1, 10]}, 
#     scoring = 'roc_auc'
# )

clf = LogisticRegression(C = 1, max_iter = 2000)

# Train 
%time clf.fit(X_train, y_train)
 
# Predict label
y_train_pred_class = clf.predict(X_train)
y_test_pred_class = clf.predict(X_test)
# Predict probability of being toxic
y_train_pred_prob = clf.predict_proba(X_train)[:,1]
y_test_pred_prob = clf.predict_proba(X_test)[:,1]

print('\nLogistic regression')
# print(f'best parameter = {clf.best_params_}')

# Store results
results = [
    [
        'accuracy', 
        metrics.accuracy_score(y_train, y_train_pred_class), 
        metrics.accuracy_score(y_test, y_test_pred_class)
    ], 
    [
        'confusion matrix', 
        str(metrics.confusion_matrix(y_train, y_train_pred_class).tolist()), 
        str(metrics.confusion_matrix(y_test, y_test_pred_class).tolist())
    ], 
    [
        'F1 score', 
        metrics.f1_score(y_train, y_train_pred_class), 
        metrics.f1_score(y_test, y_test_pred_class)
    ], 
    [
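        'ROC AUC score', 
        # ROC AUC is a ranking metric, so score the predicted probabilities.
        # (The output recorded below was computed from the hard class labels.)
        metrics.roc_auc_score(y_train, y_train_pred_prob), 
        metrics.roc_auc_score(y_test, y_test_pred_prob)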
    ]
]

colNames = ['metric', 'train set', 'test set']

# Show result 
pd.DataFrame(results, columns = colNames)

CPU times: user 10min 34s, sys: 3min 44s, total: 14min 18s
Wall time: 9min 48s

Logistic regression


| | metric | train set | test set |
|---|---|---|---|
| 0 | accuracy | 0.960642 | 0.165312 |
| 1 | confusion matrix | [[1641516, 19024], [52012, 92322]] | [[1556, 345], [19271, 2329]] |
| 2 | F1 score | 0.722168 | 0.191893 |
| 3 | ROC AUC score | 0.814092 | 0.46317 |


### Multinomial naive Bayes

In [7]:
# Model and hyperparameter tuning
clf = GridSearchCV(
    MultinomialNB(), 
    param_grid = {'alpha': [0.1, 1, 10]}, 
    scoring = 'roc_auc'
)

# Train 
%time clf.fit(X_train, y_train)
 
# Predict label
y_train_pred_class = clf.predict(X_train)
y_test_pred_class = clf.predict(X_test)
# Predict probability of being toxic
y_train_pred_prob = clf.predict_proba(X_train)[:,1]
y_test_pred_prob = clf.predict_proba(X_test)[:,1]

print('\nMultinomial naive Bayes')
print(f'best parameter = {clf.best_params_}')

# Store results
results = [
    [
        'accuracy', 
        metrics.accuracy_score(y_train, y_train_pred_class), 
        metrics.accuracy_score(y_test, y_test_pred_class)
    ], 
    [
        'confusion matrix', 
        str(metrics.confusion_matrix(y_train, y_train_pred_class).tolist()), 
        str(metrics.confusion_matrix(y_test, y_test_pred_class).tolist())
    ], 
    [
        'F1 score', 
        metrics.f1_score(y_train, y_train_pred_class), 
        metrics.f1_score(y_test, y_test_pred_class)
    ], 
    [
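        'ROC AUC score', 
        # ROC AUC is a ranking metric, so score the predicted probabilities.
        # (The output recorded below was computed from the hard class labels.)
        metrics.roc_auc_score(y_train, y_train_pred_prob), 
        metrics.roc_auc_score(y_test, y_test_pred_prob)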
    ]
]

colNames = ['metric', 'train set', 'test set']

# Show result 
pd.DataFrame(results, columns = colNames)

CPU times: user 16.8 s, sys: 81.3 ms, total: 16.9 s
Wall time: 17 s

Multinomial naive Bayes
best parameter = {'alpha': 10}


| | metric | train set | test set |
|---|---|---|---|
| 0 | accuracy | 0.746606 | 0.375516 |
| 1 | confusion matrix | [[1218955, 441585], [15759, 128575]] | [[867, 1034], [13642, 7958]] |
| 2 | F1 score | 0.359905 | 0.520267 |
| 3 | ROC AUC score | 0.812444 | 0.412251 |


### Linear SVC with calibrated probability  

I do not yet fully understand how the calibration works.
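As a rough outline: `CalibratedClassifierCV` splits the training data into folds, fits a copy of the base classifier on the training part of each fold, and then fits a calibrator on the held-out part. By default the calibrator is a sigmoid (Platt scaling) that maps the classifier's decision scores to probabilities. At prediction time the probabilities from the per-fold models are averaged. The following is a minimal sketch on hypothetical synthetic data (generated with `make_classification`, not the toxicity datasets), passing the base estimator positionally because the keyword name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Hypothetical synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# LinearSVC on its own only exposes decision_function(), not predict_proba()
svc_has_proba = hasattr(LinearSVC(), "predict_proba")

# Wrap LinearSVC; sigmoid calibration is fitted on held-out folds (cv=3)
calibrated = CalibratedClassifierCV(
    LinearSVC(C=0.1, max_iter=2000),
    method="sigmoid",
    cv=3,
)
calibrated.fit(X, y)

# The wrapper provides class probabilities; column 1 is the positive class
proba = calibrated.predict_proba(X)

print(svc_has_proba, proba.shape, proba[:3])
```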

In [11]:
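# LinearSVC does not implement predict_proba().
# Wrapping it in CalibratedClassifierCV() adds calibrated probability estimates.
# caliclf = CalibratedClassifierCV(base_estimator = LinearSVC(max_iter = 3000))

# Model and hyperparameter tuning (grid search disabled; C is fixed at 0.1)
# clf = GridSearchCV(
#     caliclf, 
#     param_grid = {'base_estimator__C': [0.1, 1, 10]}, 
#     scoring = 'roc_auc'
# )

# Bind the calibrated SVC to clf so that the calls below use it.
# The output recorded below matches the logistic regression run exactly,
# which suggests clf had not been rebound when that output was produced.
clf = CalibratedClassifierCV(LinearSVC(C = 0.1, max_iter = 2000))

# Train 
%time clf.fit(X_train, y_train)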
 
# Predict label
y_train_pred_class = clf.predict(X_train)
y_test_pred_class = clf.predict(X_test)
# Predict probability of being toxic
y_train_pred_prob = clf.predict_proba(X_train)[:,1]
y_test_pred_prob = clf.predict_proba(X_test)[:,1]

print('\nLinear SVC with calibrated classifier')
# print(f'best parameter = {clf.best_params_}')

# Store results
results = [
    [
        'accuracy', 
        metrics.accuracy_score(y_train, y_train_pred_class), 
        metrics.accuracy_score(y_test, y_test_pred_class)
    ], 
    [
        'confusion matrix', 
        str(metrics.confusion_matrix(y_train, y_train_pred_class).tolist()), 
        str(metrics.confusion_matrix(y_test, y_test_pred_class).tolist())
    ], 
    [
        'F1 score', 
        metrics.f1_score(y_train, y_train_pred_class), 
        metrics.f1_score(y_test, y_test_pred_class)
    ], 
    [
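        'ROC AUC score', 
        # ROC AUC is a ranking metric, so score the predicted probabilities.
        # (The output recorded below was computed from the hard class labels.)
        metrics.roc_auc_score(y_train, y_train_pred_prob), 
        metrics.roc_auc_score(y_test, y_test_pred_prob)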
    ]
]

colNames = ['metric', 'train set', 'test set']

# Show result 
pd.DataFrame(results, columns = colNames)

CPU times: user 10min 32s, sys: 3min 44s, total: 14min 16s
Wall time: 9min 47s

Linear SVC with calibrated classifier


| | metric | train set | test set |
|---|---|---|---|
| 0 | accuracy | 0.960642 | 0.165312 |
| 1 | confusion matrix | [[1641516, 19024], [52012, 92322]] | [[1556, 345], [19271, 2329]] |
| 2 | F1 score | 0.722168 | 0.191893 |
| 3 | ROC AUC score | 0.814092 | 0.46317 |


## Part 3: Evaluation

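Out of 23501 test examples, 21600 are positive and 1901 are negative. The null accuracy (always predicting the majority class) is therefore 21600 / 23501 = 0.9191, and any model with a test accuracy below 0.9191 is not acceptable. The recorded test accuracies (0.1653 for logistic regression and 0.3755 for multinomial naive Bayes) both fall far below this baseline. The linear SVC table is identical to the logistic regression table in every metric, which is consistent with the SVC cell having fitted the logistic regression model still bound to `clf`, so it should not be read as an independent result.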
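The null accuracy can be reproduced directly from the recorded test-set confusion matrix. This is a minimal sketch that takes the logistic regression matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Test-set confusion matrix recorded for logistic regression
# (rows: true class 0 / 1, columns: predicted class 0 / 1)
cm_test = np.array([[1556, 345], [19271, 2329]])

# Row sums give the number of true negatives and positives in the test set
class_counts = cm_test.sum(axis=1)

# Null accuracy: always predict the majority class
null_accuracy = class_counts.max() / class_counts.sum()

# Model accuracy: correct predictions lie on the diagonal
model_accuracy = np.trace(cm_test) / cm_test.sum()

print(class_counts, round(null_accuracy, 4), round(model_accuracy, 6))
```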