# Experiment 01

### General description

<ul>
    <li>Encoding: TF-IDF</li>
    <li>Models: logistic regression vs. linear SVC vs. multinomial naive Bayes</li>
    <li>Training set: 100% of the Jigsaw dataset</li>
    <li>Test set: CTEC dataset (sent by Alex)</li>
    <li>Metric: ROC AUC score</li>
</ul>

### Hyperparameter tuning 

#### Logistic regression

<ul>
    <li>inverse of regularization strength (<code>C</code>): 0.1, 1, 10</li>
</ul>
    
#### Multinomial naive Bayes

<ul>
    <li>additive smoothing parameter (<code>alpha</code>): 0.1, 1, 10</li>
</ul>

#### Linear SVC

<ul>
    <li>inverse of regularization strength (<code>C</code>): 0.1, 1, 10</li>
</ul>

In [12]:
# Import modules 
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import StandardScaler

import warnings 
warnings.simplefilter('module')

## Part 1: Load data and set up models

The texts in the datasets have already been pre-processed (uncensored, lemmatized, etc.).

### Load datasets

In [2]:
# Load data
toxic_df_train = pd.read_csv('train_preproc_shrk.csv')
toxic_df_test = pd.read_csv('ctec_training_data_preproc.csv')

# All inputs 
X_train_text = toxic_df_train['comment_text']
X_test_text = toxic_df_test['comment_text']

# target >= 0.5 --> toxic --> label = 1
# target < 0.5 --> non-toxic --> label = 0
toxic_df_train.loc[toxic_df_train['target'] >= 0.5, 'label'] = 1
toxic_df_train.loc[toxic_df_train['target'] < 0.5, 'label'] = 0

# All labels / outputs 
y_train = toxic_df_train['label']
y_test = toxic_df_test['label']


### Vectorize text data with TF-IDF

In [3]:
encoder = TfidfVectorizer(strip_accents = 'unicode', stop_words = 'english')
X_train_unscaled = encoder.fit_transform(X_train_text)
X_test_unscaled = encoder.transform(X_test_text)
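
To make the encoding step concrete, here is a minimal pure-Python sketch of smoothed TF-IDF with L2-normalized rows, which is the weighting `TfidfVectorizer` applies with its default settings. It is an illustration under simplifying assumptions, not the library's implementation: tokens are split on whitespace, no stop words are removed, and the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Smoothed TF-IDF with L2-normalized rows for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical toy corpus, not taken from either dataset
docs = ["you are nice", "you are rude", "rude rude comment"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

Terms that occur in fewer documents receive a larger weight, so in the first toy document "nice" outweighs "you".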


### Normalize data

Standardizing features to zero mean and unit standard deviation can improve the performance of training algorithms, especially SVC.

Centering to zero mean is not practical on the sparse matrix `X_train_unscaled`: subtracting the column means turns the implicit zeros into non-zero values, so the routine would have to build a dense matrix, which is computationally expensive. The scaler below therefore uses `with_mean = False` and only scales each feature to unit variance.
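
The effect on sparsity can be seen on a toy example. The sketch below is pure Python and purely illustrative: the matrix, stored as one `{column: value}` dict per row, is hypothetical, and the variable names are not from the notebook. It compares how many non-zero entries remain after scaling only versus after centering.

```python
import math

# Hypothetical toy sparse matrix: each row maps column index -> non-zero value
rows = [{0: 1.0}, {2: 3.0}, {0: 2.0, 3: 1.0}, {1: 4.0}]
n_rows, n_cols = len(rows), 4

# Column means and (population) standard deviations, counting implicit zeros
col_sum = [0.0] * n_cols
col_sq_sum = [0.0] * n_cols
for row in rows:
    for j, v in row.items():
        col_sum[j] += v
        col_sq_sum[j] += v * v
col_mean = [s / n_rows for s in col_sum]
col_std = [math.sqrt(sq / n_rows - m * m) for sq, m in zip(col_sq_sum, col_mean)]

stored_before = sum(len(row) for row in rows)

# Scaling only: zeros stay zero, so the sparse structure is unchanged
scaled = [{j: v / col_std[j] for j, v in row.items()} for row in rows]
stored_scaled = sum(len(row) for row in scaled)

# Centering: the mean is subtracted from every cell, including implicit zeros
centered = [[row.get(j, 0.0) - col_mean[j] for j in range(n_cols)] for row in rows]
stored_centered = sum(1 for r in centered for v in r if v != 0.0)

print(stored_before, stored_scaled, stored_centered)
```

On a TF-IDF matrix with a large vocabulary the same jump from a handful of stored values to every cell is what makes centering expensive.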


In [7]:
scaler = StandardScaler(with_mean = False)
X_train = scaler.fit_transform(X_train_unscaled)
X_test = scaler.transform(X_test_unscaled)

## Part 2: Training and testing

### Logistic regression

In [8]:
# Model and hyperparameterization
clf = GridSearchCV(
    LogisticRegression(), 
    param_grid = {'C': [0.1, 1, 10]}, 
    scoring = 'roc_auc'
)

# Train 
%time clf.fit(X_train, y_train)
 
# Predict label
y_train_pred_class = clf.predict(X_train)
y_test_pred_class = clf.predict(X_test)
# Predict probability of being toxic
y_train_pred_prob = clf.predict_proba(X_train)[:,1]
y_test_pred_prob = clf.predict_proba(X_test)[:,1]

print('\nLogistic regression')
print(f'best parameter = {clf.best_params_}')

# Store results
results = [
    [
        'accuracy', 
        metrics.accuracy_score(y_train, y_train_pred_class), 
        metrics.accuracy_score(y_test, y_test_pred_class)
    ], 
    [
        'confusion matrix', 
        str(metrics.confusion_matrix(y_train, y_train_pred_class).tolist()), 
        str(metrics.confusion_matrix(y_test, y_test_pred_class).tolist())
    ], 
    [
        'F1 score', 
        metrics.f1_score(y_train, y_train_pred_class), 
        metrics.f1_score(y_test, y_test_pred_class)
    ], 
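    [
        # Note: computed from the hard 0/1 predictions, not from y_*_pred_prob,
        # so this value equals balanced accuracy rather than a ranking-based AUC
        'ROC AUC score', 
        metrics.roc_auc_score(y_train, y_train_pred_class), 
        metrics.roc_auc_score(y_test, y_test_pred_class)
    ]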
]

colNames = ['metric', 'train set', 'test set']

# Show result 
pd.DataFrame(results, columns = colNames)

CPU times: user 15.8 s, sys: 13.2 s, total: 29 s
Wall time: 7.49 s

Logistic regression
best parameter = {'C': 0.1}


|   | metric | train set | test set |
|---|---|---|---|
| 0 | accuracy | 1 | 0.134292 |
| 1 | confusion matrix | [[9182, 0], [0, 818]] | [[1755, 146], [20199, 1401]] |
| 2 | F1 score | 1 | 0.121052 |
| 3 | ROC AUC score | 1 | 0.49403 |
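
As a cross-check, the test-set figures reported for logistic regression can be recomputed from the confusion matrix alone. The sketch below relies on scikit-learn's confusion matrix layout `[[TN, FP], [FN, TP]]`; the helper name `metrics_from_confusion` is hypothetical. Because `roc_auc_score` receives hard 0/1 predictions in this notebook, the reported value reduces to the mean of the true positive rate and the true negative rate.

```python
def metrics_from_confusion(cm):
    """Derive accuracy, F1 and label-based ROC AUC from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    tpr = tp / (tp + fn)  # true positive rate (recall on the toxic class)
    tnr = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "f1": 2 * tp / (2 * tp + fp + fn),
        # With 0/1 predictions the ROC curve has a single interior point,
        # so the area under it is (TPR + TNR) / 2
        "auc_from_labels": (tpr + tnr) / 2,
    }

# Test-set confusion matrix reported for logistic regression
test_cm = [[1755, 146], [20199, 1401]]
scores = metrics_from_confusion(test_cm)
for name, value in scores.items():
    print(f"{name}: {value:.6f}")
```

The recomputed values agree with the reported accuracy, F1 score and ROC AUC score to the printed precision.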


### Multinomial naive Bayes

In [9]:
# Model and hyperparameterization
clf = GridSearchCV(
    MultinomialNB(), 
    param_grid = {'alpha': [0.1, 1, 10]}, 
    scoring = 'roc_auc'
)

# Train 
%time clf.fit(X_train, y_train)
 
# Predict label
y_train_pred_class = clf.predict(X_train)
y_test_pred_class = clf.predict(X_test)
# Predict probability of being toxic
y_train_pred_prob = clf.predict_proba(X_train)[:,1]
y_test_pred_prob = clf.predict_proba(X_test)[:,1]

print('\nMultinomial naive Bayes')
print(f'best parameter = {clf.best_params_}')

# Store results
results = [
    [
        'accuracy', 
        metrics.accuracy_score(y_train, y_train_pred_class), 
        metrics.accuracy_score(y_test, y_test_pred_class)
    ], 
    [
        'confusion matrix', 
        str(metrics.confusion_matrix(y_train, y_train_pred_class).tolist()), 
        str(metrics.confusion_matrix(y_test, y_test_pred_class).tolist())
    ], 
    [
        'F1 score', 
        metrics.f1_score(y_train, y_train_pred_class), 
        metrics.f1_score(y_test, y_test_pred_class)
    ], 
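    [
        # Note: computed from the hard 0/1 predictions, not from y_*_pred_prob,
        # so this value equals balanced accuracy rather than a ranking-based AUC
        'ROC AUC score', 
        metrics.roc_auc_score(y_train, y_train_pred_class), 
        metrics.roc_auc_score(y_test, y_test_pred_class)
    ]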
]

colNames = ['metric', 'train set', 'test set']

# Show result 
pd.DataFrame(results, columns = colNames)

CPU times: user 186 ms, sys: 3.15 ms, total: 189 ms
Wall time: 195 ms

Multinomial naive Bayes
best parameter = {'alpha': 0.1}


|   | metric | train set | test set |
|---|---|---|---|
| 0 | accuracy | 0.9497 | 0.202077 |
| 1 | confusion matrix | [[8679, 503], [0, 818]] | [[1486, 415], [18337, 3263]] |
| 2 | F1 score | 0.764843 | 0.258169 |
| 3 | ROC AUC score | 0.972609 | 0.466379 |


### Linear SVC with calibrated probability  

Note: I do not yet fully understand how the calibration works.
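
For intuition, the default calibration method of `CalibratedClassifierCV` is sigmoid (Platt) scaling: the classifier is trained on part of the data, and a one-dimensional sigmoid is then fitted to map its decision scores on held-out samples to probabilities. The sketch below illustrates only that second step in pure Python. It is a simplified assumption-laden version, using plain gradient descent on the log loss rather than the optimizer and target smoothing a library would use, and the scores, labels and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p(y=1 | s) = sigmoid(a * s + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            error = sigmoid(a * s + b) - y
            grad_a += error * s
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrated_proba(scores, a, b):
    """Map raw decision scores to calibrated probabilities of the positive class."""
    return [sigmoid(a * s + b) for s in scores]

# Hypothetical held-out decision scores (e.g. from decision_function) and labels
held_out_scores = [-2.1, -1.3, -0.4, 0.2, 0.3, 0.9, 1.7, 2.5]
held_out_labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(held_out_scores, held_out_labels)
probs = calibrated_proba(held_out_scores, a, b)
for s, p in zip(held_out_scores, probs):
    print(f"score {s:+.1f} -> probability {p:.3f}")
```

Because the fitted sigmoid is monotonic, calibration changes the probability values but not the ranking of samples by decision score.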

In [14]:
# Linear SVC does not support predict_proba()
# However, we can bypass this problem by using CalibratedClassifierCV()
caliclf = CalibratedClassifierCV(base_estimator = LinearSVC(max_iter = 3000))

# Model and hyperparameterization
clf = GridSearchCV(
    caliclf, 
    param_grid = {'base_estimator__C': [0.1, 1, 10]}, 
    scoring = 'roc_auc'
)

# Train 
%time clf.fit(X_train, y_train)
 
# Predict label
y_train_pred_class = clf.predict(X_train)
y_test_pred_class = clf.predict(X_test)
# Predict probability of being toxic
y_train_pred_prob = clf.predict_proba(X_train)[:,1]
y_test_pred_prob = clf.predict_proba(X_test)[:,1]

print('\nLinear SVC with calibrated classifier')
print(f'best parameter = {clf.best_params_}')

# Store results
results = [
    [
        'accuracy', 
        metrics.accuracy_score(y_train, y_train_pred_class), 
        metrics.accuracy_score(y_test, y_test_pred_class)
    ], 
    [
        'confusion matrix', 
        str(metrics.confusion_matrix(y_train, y_train_pred_class).tolist()), 
        str(metrics.confusion_matrix(y_test, y_test_pred_class).tolist())
    ], 
    [
        'F1 score', 
        metrics.f1_score(y_train, y_train_pred_class), 
        metrics.f1_score(y_test, y_test_pred_class)
    ], 
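    [
        # Note: computed from the hard 0/1 predictions, not from y_*_pred_prob,
        # so this value equals balanced accuracy rather than a ranking-based AUC
        'ROC AUC score', 
        metrics.roc_auc_score(y_train, y_train_pred_class), 
        metrics.roc_auc_score(y_test, y_test_pred_class)
    ]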
]

colNames = ['metric', 'train set', 'test set']

# Show result 
pd.DataFrame(results, columns = colNames)


CPU times: user 15.3 s, sys: 270 µs, total: 15.3 s
Wall time: 15.5 s

Linear SVC with calibrated classifier
best parameter = {'base_estimator__C': 10}


|   | metric | train set | test set |
|---|---|---|---|
| 0 | accuracy | 1 | 0.11825 |
| 1 | confusion matrix | [[9182, 0], [0, 818]] | [[1811, 90], [20632, 968]] |
| 2 | F1 score | 1 | 0.0854444 |
| 3 | ROC AUC score | 1 | 0.498736 |
