In [1]:
import sqlite3 as sql
import pandas as pd
import numpy as np
import re

seed = 101

Load the dataset from the previous notebook.

In [2]:
with sql.connect('../data/toxic.db') as conn:
    df = pd.read_sql_query('select * from toxic', conn)
df.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,0.0,2.0,0.5,0
2,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,0.0,1.0,0.1,0
3,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,0.0,2.0,0.6,0
4,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,-1.0,1.0,0.2,0


Remember to isolate the train, dev, and test sets.

In [3]:
idx_train = df['split'] == 'train'
idx_dev = df['split'] == 'dev'
idx_test = df['split'] == 'test'

Let's start things off with a pretty basic model to serve as a baseline: a CountVectorizer feeding a logistic regression, which we fit with stochastic gradient descent via SGDClassifier(loss='log'). We'll start with the tokenizer we developed in the exploration step.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

def tokenizer(text):
    return re.findall(r'[a-z0-9]+', text.lower())

vect = CountVectorizer(tokenizer=tokenizer)
# loss='log' is logistic regression fit by SGD (newer scikit-learn releases name it 'log_loss')
clf = SGDClassifier(loss='log', max_iter=100, tol=1e-6, random_state=seed)

In [5]:
X_train = df.loc[idx_train, 'comment'].values
X_dev = df.loc[idx_dev, 'comment'].values

y_train = df.loc[idx_train, 'y'].values
y_dev = df.loc[idx_dev, 'y'].values

X_train_vect = vect.fit_transform(X_train)
X_dev_vect = vect.transform(X_dev)

In [6]:
clf.fit(X_train_vect, y_train)
y_pred = clf.predict(X_dev_vect)
np.mean(y_dev==y_pred)

0.8945468127490039

Nice, we got nearly 90% accuracy on the dev set. Is accuracy a good metric for this problem though?

In [7]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_dev, y_pred)

array([[25830,   420],
       [ 2968,  2910]])
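To put the accuracy in context, here is a small sketch that recomputes it from the confusion matrix above and compares it with a classifier that always predicts the negative class. The variable names are illustrative; the counts are copied from the output above.

```python
# Counts copied from the confusion matrix above (rows = true class, columns = predicted class).
(tn, fp), (fn, tp) = [[25830, 420], [2968, 2910]]
total = tn + fp + fn + tp

# Accuracy of the fitted baseline model.
accuracy = (tn + tp) / total

# Accuracy of a trivial model that labels every comment as non-toxic.
majority_baseline = (tn + fp) / total

# Share of truly toxic comments that the model actually flags.
recall_pos = tp / (tp + fn)

print(f"accuracy={accuracy:.4f} majority={majority_baseline:.4f} recall={recall_pos:.4f}")
```

Because most dev comments are negative, the always-negative model already scores above 80% accuracy, so the gap to our baseline is smaller than the headline number suggests.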

In [8]:
from sklearn.metrics import classification_report

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.98      0.94     26250
           1       0.87      0.50      0.63      5878

   micro avg       0.89      0.89      0.89     32128
   macro avg       0.89      0.74      0.79     32128
weighted avg       0.89      0.89      0.88     32128



Recall for the positive class is poor: the model finds only about half of the toxic comments. Let's look at one comment that was labeled toxic but that we predicted as non-toxic.

In [9]:
idx_error = (y_dev != y_pred) & (y_dev == 1)
print(X_dev[idx_error][1])

Removed from article: 
:An extra-large common pork sausage was named after him.  

Nah.  


That's strange. What was the average toxicity score for this instance?

In [10]:
df['hash'] = df['comment'].map(hash)
df[df['hash'] == hash(X_dev[idx_error][1])]

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y,hash
445,2943468.0,Removed from article: \n:An extra-large common...,2003,1,article,random,dev,-1.0,1.0,-0.1,1,-4185677943264876719
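A possible alternative to the hash lookup, sketched on made-up data: since the error mask lines up row for row with the dev subset, it can be applied to the dev rows of the DataFrame directly. The `toy` frame and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real DataFrame.
toy = pd.DataFrame({
    'comment': ['a', 'b', 'c', 'd', 'e'],
    'split': ['train', 'dev', 'dev', 'test', 'dev'],
    'avg': [0.4, -0.1, 0.2, 0.1, -0.3],
    'y': [0, 1, 0, 0, 1],
})

# Dev rows, in the same order as the arrays used for prediction.
toy_dev = toy[toy['split'] == 'dev']
toy_y_dev = toy_dev['y'].values
toy_y_pred = np.array([0, 0, 0])  # pretend the model predicted "non-toxic" everywhere

# False negatives: labeled toxic, predicted non-toxic.
false_neg = (toy_y_dev != toy_y_pred) & (toy_y_dev == 1)

# The boolean mask has one entry per dev row, so it selects whole rows with all columns.
missed = toy_dev[false_neg]
print(missed)
```

This keeps every column (including `avg`) next to the misclassified comment without adding a helper column.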


This may indicate problems in the dataset (or in how we aggregate the scores): the average score is only -0.1, yet the label is 1. At this point, I would normally revisit the EDA step to review the labeling function, but let's forge ahead for now. Since the classes are heavily imbalanced and the positive class is the rare one, let's use F1 as the evaluation metric.

In [11]:
from sklearn.metrics import f1_score

f1_score(y_dev, y_pred)

0.6320590790616855
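As a sanity check, the F1 score can be reproduced by hand from the first confusion matrix. This is a minimal sketch with illustrative variable names; the counts are copied from the earlier output.

```python
# Counts from the baseline confusion matrix (positive class = toxic).
tp, fp, fn = 2910, 420, 2968

# Precision: of the comments we flagged, how many were truly toxic?
precision = tp / (tp + fp)

# Recall: of the truly toxic comments, how many did we flag?
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Unlike accuracy, F1 ignores the true negatives, so a model that never predicts the positive class gets no credit for the large negative class.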

So how do we efficiently tune our hyperparameters? We can use GridSearchCV or RandomizedSearchCV, but since we have a pre-defined dev set, we need a trick to override their normal cross-validation behavior. This is actually pretty standard for large-scale NLP problems: cross-validation is preferred, but it is often not feasible for large datasets. The scikit-learn documentation describes the tool for this, PredefinedSplit:

```
For some datasets, a pre-defined split of the data into training- and validation fold or into several cross-validation folds already exists. Using PredefinedSplit it is possible to use these folds e.g. when searching for hyperparameters.

For example, when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
```

In [12]:
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.pipeline import Pipeline

df_train_dev = df[df['split'].isin(['train', 'dev'])].copy()
# -1 = always in the training set, 0 = member of the single validation fold (see documentation above)
idx = np.where(df_train_dev['split'] == 'dev', 0, -1)

ps = PredefinedSplit(idx)

In [13]:
ps.get_n_splits()

1
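To see what the splitter actually yields, here is a toy illustration of the rule quoted above. The six-element `toy_split` array is made up; only `PredefinedSplit` and its `split()` method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Made-up split labels standing in for df['split'].
toy_split = np.array(['train', 'dev', 'train', 'test', 'dev', 'train'])

# Drop the test rows, then mark dev rows as fold 0 and train rows as -1.
keep = np.isin(toy_split, ['train', 'dev'])
toy_test_fold = np.where(toy_split[keep] == 'dev', 0, -1)

toy_ps = PredefinedSplit(toy_test_fold)

# Each item is (training indices, validation indices) into the filtered rows.
folds = [(train_idx.tolist(), dev_idx.tolist()) for train_idx, dev_idx in toy_ps.split()]
print(folds)
```

Rows marked -1 never appear in a validation fold, which is why there is exactly one split: fit on train, score on dev.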

In [14]:
pipe = Pipeline([('vect', CountVectorizer(tokenizer=tokenizer)),
                 ('clf', SGDClassifier(loss='log', max_iter=10, tol=1e-6, penalty='elasticnet', random_state=seed))])
param_grid = {'vect__ngram_range':[(1,1), (1,2)], 'vect__min_df':[1, 2, 5, 10, 20],
              'clf__l1_ratio':[0.0, 0.1, 0.2], 'clf__class_weight':[{0:1,1:1}, {0:1,1:2}]}
gs = GridSearchCV(pipe, param_grid, scoring='f1', n_jobs=6, cv=ps, verbose=2)
gs.fit(df_train_dev['comment'].values, df_train_dev['y'].values)
print(gs.best_params_)
print(gs.best_score_)

Fitting 1 folds for each of 60 candidates, totalling 60 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed:  2.1min
[Parallel(n_jobs=6)]: Done  60 out of  60 | elapsed:  4.1min finished


{'clf__class_weight': {0: 1, 1: 2}, 'clf__l1_ratio': 0.0, 'vect__min_df': 2, 'vect__ngram_range': (1, 2)}
0.7021696252465484
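The 60 candidates reported in the log are the Cartesian product of the grid values. A minimal sketch of that count using only the standard library (the name `toy_grid` is illustrative, and this is not how GridSearchCV is implemented internally):

```python
from itertools import product

# Same values as the grid searched above.
toy_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__min_df': [1, 2, 5, 10, 20],
    'clf__l1_ratio': [0.0, 0.1, 0.2],
    'clf__class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}],
}

# One candidate per combination of values, one value taken from each list.
candidates = [dict(zip(toy_grid, combo)) for combo in product(*toy_grid.values())]
n_candidates = len(candidates)

print(n_candidates)
```

With a single predefined fold, each candidate is fit once, so the number of fits equals the number of candidates. Adding one more value to any list multiplies the run time accordingly.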




Since this step took so long, I am going to persist the results to disk using joblib (cooking-show style).

In [15]:
from joblib import dump, load

# dump(gs, '../results/gs_cv_sgd.joblib')
gs = load('../results/gs_cv_sgd.joblib')
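Instead of commenting the dump line in and out by hand, one option is a small caching helper. The sketch below is hypothetical: `load_or_fit` is not part of joblib, and the demo uses a temporary file and a dummy "fit" function in place of the real grid search.

```python
import os
import tempfile
from joblib import dump, load

def load_or_fit(path, fit_fn):
    """Return the object cached at `path`, or build it with `fit_fn` and cache it."""
    if os.path.exists(path):
        return load(path)
    obj = fit_fn()
    dump(obj, path)
    return obj

calls = []

def expensive_fit():
    # Stand-in for a long-running gs.fit(...) call.
    calls.append(1)
    return {'best_score': 0.5}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'demo.joblib')
    first = load_or_fit(cache_path, expensive_fit)   # runs expensive_fit and writes the file
    second = load_or_fit(cache_path, expensive_fit)  # reads the file, no refit

print(len(calls))
```

In this notebook the call might look like `gs = load_or_fit('../results/gs_cv_sgd.joblib', lambda: gs.fit(df_train_dev['comment'].values, df_train_dev['y'].values))`, since `fit` returns the fitted search object.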

Now let's look at the results.

In [16]:
y_pred = gs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[25657,   593],
       [  814,  5064]])

In [17]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97     26250
           1       0.90      0.86      0.88      5878

   micro avg       0.96      0.96      0.96     32128
   macro avg       0.93      0.92      0.93     32128
weighted avg       0.96      0.96      0.96     32128



In [18]:
np.mean(y_dev==y_pred)

0.9562064243027888

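Not a bad result at first glance, but these numbers are optimistic. With the default refit=True, GridSearchCV refits best_estimator_ on everything passed to fit, which here includes the dev set, so the model has already seen the comments it is predicting. The fairer dev number is the best_score_ from the search (an F1 of 0.70). Let's see if adding a TfidfTransformer can make a difference. Remember, TF-IDF down-weights words that appear in many documents and up-weights words that appear in few.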
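To make the reweighting concrete, here is a minimal sketch of the smoothed inverse document frequency that TfidfTransformer uses with its default settings, idf = ln((1 + n) / (1 + df)) + 1. The document counts below are made up, and the sketch leaves out the row normalization that the transformer also applies by default.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000  # hypothetical corpus size

# A word found in 900 of 1000 documents gets a weight close to 1.
idf_common = smoothed_idf(n_docs, 900)

# A word found in only 5 documents gets a much larger weight.
idf_rare = smoothed_idf(n_docs, 5)

print(f"common={idf_common:.3f} rare={idf_rare:.3f}")
```

Each raw count is multiplied by the IDF of its term, so a single occurrence of a rare word can outweigh several occurrences of a common one.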

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer

pipe = Pipeline([('vect', CountVectorizer(tokenizer=tokenizer)),
                 ('tfidf', TfidfTransformer()),
                 ('clf', SGDClassifier(loss='log', max_iter=10, tol=1e-6, penalty='elasticnet', n_jobs=2, random_state=seed))])
param_grid = {'vect__ngram_range':[(1,1), (1,2)], 'vect__min_df':[1, 2, 5, 10, 20],
              'clf__l1_ratio':[0.0, 0.1, 0.2], 'clf__class_weight':[{0:1,1:1}, {0:1,1:2}]}
gs = GridSearchCV(pipe, param_grid, scoring='f1', n_jobs=4, cv=ps, verbose=2)
gs.fit(df_train_dev['comment'].values, df_train_dev['y'].values)
print(gs.best_params_)
print(gs.best_score_)

Fitting 1 folds for each of 60 candidates, totalling 60 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  2.6min
[Parallel(n_jobs=4)]: Done  60 out of  60 | elapsed:  4.9min finished


{'clf__class_weight': {0: 1, 1: 2}, 'clf__l1_ratio': 0.0, 'vect__min_df': 20, 'vect__ngram_range': (1, 1)}
0.6791372399312846


Again, let's store the result.

In [20]:
# dump(gs, '../results/gs_cv_tfidf_sgd.joblib')
gs = load('../results/gs_cv_tfidf_sgd.joblib')

In [21]:
y_pred = gs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[25284,   966],
       [ 2282,  3596]])

In [22]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94     26250
           1       0.79      0.61      0.69      5878

   micro avg       0.90      0.90      0.90     32128
   macro avg       0.85      0.79      0.81     32128
weighted avg       0.89      0.90      0.89     32128



In [23]:
np.mean(y_dev==y_pred)

0.8989043824701195

So not quite as good: the best F1 found by the search dropped from 0.70 to 0.68. Now let's put together a model that uses character ngrams.

In [24]:
def strip_junk(text):
    return ' '.join(re.findall(r'[a-z0-9]+', text))

vect = CountVectorizer(preprocessor=strip_junk, analyzer='char_wb', ngram_range=(2,4))

vect.fit(X_train[:10])
vect.vocabulary_

{' h': 221,
 'hi': 1838,
 'is': 2104,
 's ': 3376,
 ' hi': 236,
 'his': 1857,
 'is ': 2105,
 ' his': 241,
 'his ': 1858,
 ' n': 347,
 'ne': 2532,
 'e ': 1305,
 ' ne': 357,
 'ne ': 2533,
 ' ne ': 358,
 ' c': 82,
 'ca': 1026,
 'an': 773,
 'n ': 2488,
 ' ca': 83,
 'can': 1035,
 'an ': 774,
 ' can': 85,
 'can ': 1036,
 ' m': 318,
 'ma': 2383,
 'ak': 727,
 'ke': 2193,
 ' ma': 320,
 'mak': 2390,
 'ake': 728,
 'ke ': 2194,
 ' mak': 323,
 'make': 2391,
 'ake ': 729,
 ' a': 0,
 ' an': 26,
 ' an ': 27,
 'na': 2492,
 'al': 733,
 'lo': 2341,
 'og': 2730,
 'gy': 1787,
 'y ': 4064,
 'ana': 775,
 'nal': 2493,
 'alo': 753,
 'log': 2342,
 'ogy': 2738,
 'gy ': 1788,
 ' ana': 28,
 'anal': 776,
 'nalo': 2495,
 'alog': 754,
 'logy': 2344,
 'ogy ': 2739,
 ' i': 248,
 'in': 2028,
 ' in': 265,
 'in ': 2029,
 ' in ': 266,
 'at': 886,
 'th': 3632,
 'he': 1812,
 'em': 1437,
 'ti': 3656,
 'ic': 1926,
 'l ': 2223,
 'mat': 2398,
 'ath': 894,
 'the': 3637,
 'hem': 1818,
 'ema': 1439,
 'ati': 898,
 'tic': 3658,
 'ica

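Notice that char_wb pads each word with a space on both sides, so the ngrams at word boundaries include that space. Also notice that "This" shows up as "his" and "One" as "ne": passing a custom preprocessor replaces CountVectorizer's built-in lowercasing, and strip_junk only matches lowercase letters and digits, so capital letters are silently dropped. Lowercasing the text inside strip_junk would fix this; the results below were produced without that change.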
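Both effects can be reproduced without scikit-learn. The sketch below is a simplified re-implementation written for illustration: `char_wb_ngrams` mimics the behaviour of the char_wb analyzer as I understand it, and `strip_junk_lower` is a hypothetical variant of strip_junk that lowercases first.

```python
import re

def strip_junk(text):
    # Same as the notebook version: capitals never match [a-z0-9], so they vanish.
    return ' '.join(re.findall(r'[a-z0-9]+', text))

def strip_junk_lower(text):
    # Hypothetical fix: lowercase before matching so capitals are kept.
    return ' '.join(re.findall(r'[a-z0-9]+', text.lower()))

def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken inside word boundaries, each word padded with spaces."""
    grams = []
    for word in text.split():
        padded = ' ' + word + ' '
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # A word shorter than n is emitted once, whole.
                grams.append(padded)
                break
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

sample = "This:\n:One can make"
cleaned = strip_junk(sample)
cleaned_lower = strip_junk_lower(sample)
his_grams = char_wb_ngrams("his", 2, 4)

print(cleaned)
print(cleaned_lower)
print(his_grams)
```

The ngrams for "his" match the first entries of the vocabulary printed above, which is consistent with the leading capital of "This" having been dropped before vectorization.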

In [25]:
X_train_vect = vect.fit_transform(X_train)
X_dev_vect = vect.transform(X_dev)
X_train_vect

<95692x114053 sparse matrix of type '<class 'numpy.int64'>'
	with 41360832 stored elements in Compressed Sparse Row format>

In [26]:
clf = SGDClassifier(loss='log', max_iter=100, tol=1e-6, random_state=seed)
clf.fit(X_train_vect, y_train)
y_pred = clf.predict(X_dev_vect)
confusion_matrix(y_dev, y_pred)

array([[25540,   710],
       [ 3247,  2631]])

In [27]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.97      0.93     26250
           1       0.79      0.45      0.57      5878

   micro avg       0.88      0.88      0.88     32128
   macro avg       0.84      0.71      0.75     32128
weighted avg       0.87      0.88      0.86     32128



In [28]:
np.mean(y_dev==y_pred)

0.8768364043824701

The result isn't great, but it isn't terrible either. Let's try a little tuning.

In [29]:
pipe = Pipeline([('vect', CountVectorizer(preprocessor=strip_junk, analyzer='char_wb', ngram_range=(2,4))),
                 ('clf', SGDClassifier(loss='log', max_iter=10, tol=1e-6, penalty='elasticnet', n_jobs=2, random_state=seed))])
param_grid = {'vect__ngram_range':[(2,3), (2,4), (2,5)], 'vect__min_df':[1, 2, 5],
              'clf__l1_ratio':[0.0, 0.1, 0.2], 'clf__class_weight':[{0:1,1:1}, {0:1,1:2}]}
gs = GridSearchCV(pipe, param_grid, scoring='f1', n_jobs=4, cv=ps, verbose=2)
gs.fit(df_train_dev['comment'].values, df_train_dev['y'].values)
print(gs.best_params_)
print(gs.best_score_)

Fitting 1 folds for each of 54 candidates, totalling 54 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed: 13.1min
[Parallel(n_jobs=4)]: Done  54 out of  54 | elapsed: 21.5min finished


{'clf__class_weight': {0: 1, 1: 2}, 'clf__l1_ratio': 0.0, 'vect__min_df': 2, 'vect__ngram_range': (2, 5)}
0.6320825515947467


In [30]:
# dump(gs, '../results/gs_cv__char_wb_sgd.joblib')
gs = load('../results/gs_cv__char_wb_sgd.joblib')

In [31]:
y_pred = gs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[21750,  4500],
       [ 1169,  4709]])

In [32]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.83      0.88     26250
           1       0.51      0.80      0.62      5878

   micro avg       0.82      0.82      0.82     32128
   macro avg       0.73      0.81      0.75     32128
weighted avg       0.87      0.82      0.84     32128



In [33]:
np.mean(y_dev==y_pred)

0.8235495517928287

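Of the three tuned models, the word-level CountVectorizer with bigrams had the best dev F1 during the search (0.70), ahead of the TF-IDF variant (0.68) and the character ngram model (0.63). We'll evaluate model performance more carefully in the next notebook.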