# 2A-jkk-naive-bayes

In this notebook, we will train a [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) classifier using a bag-of-words representation of text. The goal of this notebook is to introduce the following scikit-learn classes: [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer), [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer), [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB), [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV), and [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV).

Note that the documentation linked above may refer to a newer version of scikit-learn.

In [1]:
import sqlite3 as sql
import pandas as pd
import numpy as np
import re

seed = 101

Load the dataset from the previous notebook.

In [2]:
with sql.connect('../data/toxic.db') as conn:
    df = pd.read_sql_query('select * from toxic', conn)
df.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,10.0,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,10.0,0.0,2.0,0.5,0
2,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,10.0,0.0,1.0,0.1,0
3,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,10.0,0.0,2.0,0.6,0
4,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,10.0,-1.0,1.0,0.2,0


Remember to isolate the train, dev, and test sets.

In [3]:
idx_train = df['split'] == 'train'
idx_dev = df['split'] == 'dev'
idx_test = df['split'] == 'test'

Let's start off by fitting a CountVectorizer, a scikit-learn class that transforms raw text into a sparse vector of token counts.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

In [5]:
X_train = df.loc[idx_train, 'comment'].values
X_train_vect = vect.fit_transform(X_train)
X_train_vect

<95692x124461 sparse matrix of type '<class 'numpy.int64'>'
	with 4277363 stored elements in Compressed Sparse Row format>

Note that the fit_transform method allows us to fit the CountVectorizer (build a vocabulary using the selected parameters) and transform the training set into a sparse matrix of token counts in one step. Each row is an instance from the training set, and each column is associated with a token. Each value is the number of times that token occurred in that instance.
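
To make the row/column layout concrete, here is a minimal sketch on a toy corpus. The two sentences and the names `docs` and `toy_vect` are made up for illustration and are not part of the toxic comments dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = ["the cat sat", "the cat saw the dog"]

toy_vect = CountVectorizer(token_pattern=r"[a-z]+")
toy_counts = toy_vect.fit_transform(docs)  # learn the vocabulary, then count tokens

# Recover the column order from the fitted vocabulary (token -> column index)
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())  # one row per document, one column per token

# transform() reuses the fitted vocabulary; the unseen token "bird" is ignored
new_counts = toy_vect.transform(["the bird saw the cat"])
print(new_counts.toarray())
```

Tokens that were not seen during fitting have no column, so they are silently dropped at transform time. This is why the dev set is only ever passed through transform.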

In [6]:
X_train[0]

"This:\n:One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve. We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard deviations of the mean opinion. \nsounds arbitrary and ad hoc.  Does it really belong in n encyclopedia article?  I don't see that it adds anything useful.\n\nThe paragraph that follows seems much more useful.  Are there any political theorists out there who can clarify the issues?  It seems to me that this is an issue that Locke, Rousseau, de Toqueville, and others must have debated...  SR\n"

In [7]:
X_train_vect[0]

<1x124461 sparse matrix of type '<class 'numpy.int64'>'
	with 83 stored elements in Compressed Sparse Row format>

We can visualize the output by using instance variables created by the fit/fit_transform method.

In [8]:
id2token = {v:k for k,v in vect.vocabulary_.items()}
counter = 0
for i, value in enumerate(X_train_vect[0].toarray()[0]):
    if value != 0:
        print(f"{counter}\t{value}\t{id2token[i]}")
        counter += 1

0	3	a
1	1	ad
2	1	adds
3	2	an
4	1	analogy
5	2	and
6	1	any
7	1	anything
8	1	arbitrary
9	1	are
10	1	article
11	1	as
12	1	be
13	1	belong
14	1	by
15	2	can
16	1	clarify
17	1	consensus
18	1	curve
19	1	de
20	1	debated
21	1	deviations
22	1	distribution
23	1	does
24	1	don
25	1	encyclopedia
26	1	envisioning
27	1	follows
28	1	gaussian
29	1	have
30	1	hoc
31	1	i
32	3	in
33	1	is
34	1	issue
35	1	issues
36	3	it
37	1	locke
38	1	make
39	1	mathematical
40	1	me
41	1	mean
42	1	more
43	1	much
44	1	must
45	1	n
46	3	of
47	1	one
48	1	opinion
49	2	opinions
50	1	others
51	1	out
52	1	paragraph
53	1	perhaps
54	1	political
55	1	population
56	1	range
57	1	really
58	1	represents
59	1	rousseau
60	1	say
61	1	see
62	2	seems
63	1	sounds
64	1	sr
65	1	standard
66	1	statement
67	1	t
68	1	terms
69	6	that
70	6	the
71	1	then
72	1	theorists
73	2	there
74	2	this
75	1	three
76	1	to
77	1	toqueville
78	2	useful
79	1	we
80	1	who
81	1	within
82	2	would


Now that we have a vectorized training set, we should be able to train a Naive Bayes model and test performance against the dev set.

In [9]:
from sklearn.naive_bayes import MultinomialNB

y_train = df.loc[idx_train, 'y'].values

clf = MultinomialNB() # Leave everything at default for now
clf.fit(X_train_vect, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
X_dev = df.loc[idx_dev, 'comment'].values
X_dev_vect = vect.transform(X_dev) # Note that the vectorizer is already fit, so we only use the transform method.
y_dev = df.loc[idx_dev, 'y'].values

In [11]:
y_pred = clf.predict(X_dev_vect)
np.mean(y_dev==y_pred)

0.8967255976095617

Nice, we nearly got 90% on the dev set. Is accuracy a good metric for this problem though?

In [12]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_dev, y_pred)

array([[25186,  1064],
       [ 2254,  3624]], dtype=int64)

In [13]:
from sklearn.metrics import classification_report

print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94     26250
           1       0.77      0.62      0.69      5878

   micro avg       0.90      0.90      0.90     32128
   macro avg       0.85      0.79      0.81     32128
weighted avg       0.89      0.90      0.89     32128



The recall is pretty poor for the positive class. Let's take a look at a situation where we incorrectly classified a comment.

In [14]:
idx_error = (y_dev != y_pred) & (y_dev == 1)
print(X_dev[idx_error][1])

Removed from article: 
:An extra-large common pork sausage was named after him.  

Nah.  


That's strange. What was the average toxicity score for this instance?

In [15]:
df['hash'] = df['comment'].map(hash) # Using hash to speed up lookup
df[df['hash'] == hash(X_dev[idx_error][1])]

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y,hash
445,2943468.0,Removed from article: \n:An extra-large common...,2003,1,article,random,dev,10.0,-1.0,1.0,-0.1,1,-7357645735245327406


This may indicate problems in the dataset (or in how we aggregate the scores). At this point, I would normally revisit the EDA step to review the labeling function, but let's forge ahead for now. Since the classes are heavily imbalanced and the positive class is the minority, let's use F1 as the evaluation metric.

In [16]:
from sklearn.metrics import f1_score

f1_score(y_dev, y_pred)

0.6859738784781374
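
As a sanity check, the metrics above can be reproduced by hand from the confusion matrix printed earlier. The sketch below uses only the four counts from that matrix. The "majority baseline" (always predicting class 0) is an illustrative comparison added here, not something computed in the original notebook.

```python
# Counts copied from the dev-set confusion matrix of the first model
tn, fp = 25186, 1064   # true class 0: predicted 0, predicted 1
fn, tp = 2254, 3624    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy          = {accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
print(f"precision         = {precision:.4f}")
print(f"recall            = {recall:.4f}")
print(f"f1                = {f1:.4f}")
```

Always predicting the negative class would already score about 82% accuracy, which is why accuracy alone makes the model look better than it is on the positive class.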

## Hyperparameter tuning

So how do we efficiently tune our hyperparameters? We can use GridSearchCV or RandomizedSearchCV, but we have a pre-defined dev set so we need to use some tricks to override the normal behavior. This is actually pretty standard for large-scale NLP problems. Cross-validation is preferred, but often not feasible for large datasets.

```
For some datasets, a pre-defined split of the data into training- and validation fold or into several cross-validation folds already exists. Using PredefinedSplit it is possible to use these folds e.g. when searching for hyperparameters.

For example, when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
```

With that in mind, we can set up a PredefinedSplit. Note that the following code is wasteful from a memory standpoint, since it copies the train and dev arrays into new combined arrays. We are simply doing it this way for clarity.

In [17]:
X_train = df.loc[idx_train, "comment"].values
y_train = df.loc[idx_train, "y"].values

X_dev = df.loc[idx_dev, "comment"].values
y_dev = df.loc[idx_dev, "y"].values

X = np.hstack([X_train, X_dev])
y = np.hstack([y_train, y_dev])

In [18]:
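idx = np.zeros(shape=y.shape)  # 0 marks rows in the validation (dev) fold
idx[:y_train.shape[0]] = -1    # -1 marks rows that are only ever used for training
pd.Series(idx).value_counts()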

-1.0    95692
 0.0    32128
dtype: int64

In [19]:
from sklearn.model_selection import PredefinedSplit, GridSearchCV

ps = PredefinedSplit(idx)
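
To see what the splitter actually yields, here is a minimal sketch on a hypothetical layout of 5 training rows followed by 3 dev rows. The names `test_fold` and `toy_ps` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical toy layout: 5 training rows followed by 3 dev rows
test_fold = np.array([-1, -1, -1, -1, -1, 0, 0, 0])
toy_ps = PredefinedSplit(test_fold)

print(toy_ps.get_n_splits())  # number of folds the search will evaluate

# Each split is a pair of index arrays: (training rows, validation rows)
for train_idx, test_idx in toy_ps.split():
    print(train_idx, test_idx)
```

Rows marked -1 never appear in a validation fold, so a search using this splitter fits on the training rows and scores on the dev rows exactly once per candidate.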

Before we proceed, let's talk about pipelines. Pipeline is a helper class in scikit-learn that allows you to daisy-chain transformers and classifiers together into a single object. This can vastly simplify training and deployment. As an example, let's rebuild our earlier model as a pipeline:

In [20]:
from sklearn.pipeline import Pipeline

vect = CountVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

clf = MultinomialNB()

pipe = Pipeline([("vect", vect), ("clf", clf)])
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[a-z]+', tokenizer=None,
        vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [21]:
pipe.predict(X_dev[:10])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
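
The step names given to the Pipeline ("vect" and "clf") also act as prefixes for addressing the parameters of each step, using the form step name, two underscores, parameter name. A minimal sketch with a fresh, unfitted pipeline (the name `toy_pipe` is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Nested parameters are addressed as <step name>__<parameter name>
toy_pipe.set_params(vect__min_df=2, clf__fit_prior=False)

print(toy_pipe.get_params()["vect__min_df"])
print(toy_pipe.named_steps["clf"].fit_prior)

# A whole step can be replaced by using the step name itself as the key
toy_pipe.set_params(clf=MultinomialNB(alpha=0.5))
print(toy_pipe.named_steps["clf"].alpha)
```

The same naming scheme is what allows a parameter grid to refer to settings inside individual pipeline steps.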

Using Pipeline and GridSearchCV, we can define a grid of parameter values to search. GridSearchCV will keep track of the parameters used and the target metric (F1 here), and report the best-performing combination. Note that since we defined the basic pipeline above, we do not need to redefine it before tuning: each candidate overrides the pipeline's parameters with its own values from the grid.

In [22]:
param_grid = {
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10, 20],
    'clf__fit_prior':[False, True]
}

gs = GridSearchCV(pipe, param_grid, scoring='f1', n_jobs=6, cv=ps, verbose=2)
gs.fit(X, y)
print(gs.best_params_)
print(gs.best_score_)

Fitting 1 folds for each of 30 candidates, totalling 30 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  30 out of  30 | elapsed:  5.3min finished


{'clf__fit_prior': True, 'vect__min_df': 2, 'vect__ngram_range': (1, 1)}
0.690830822212656


Since this step took so long, I am going to persist the results to disk using joblib (cooking-show style).

In [23]:
from joblib import dump, load

dump(gs, '../results/gs_cv_nb.joblib')
# gs = load('../results/gs_cv_nb.joblib')

['../results/gs_cv_nb.joblib']
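
For reference, a full dump/load round trip looks like the sketch below. It uses a tiny hypothetical classifier and a temporary directory instead of the fitted search object and the ../results path, so it can run on its own.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical model standing in for the fitted search object
toy_clf = MultinomialNB().fit([[1, 0], [0, 1]], [0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_nb.joblib")
    dump(toy_clf, path)    # serialize the fitted estimator to disk
    restored = load(path)  # read it back into a new object

# The restored estimator predicts exactly like the original
print(restored.predict([[2, 0], [0, 3]]))
```

A joblib file should generally be loaded with the same scikit-learn version that wrote it, and only from sources you trust, since loading executes pickled code.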

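Now let's look at the results. Keep in mind that with the default refit=True, best_estimator_ is refit on all of X, which includes the dev set, so the dev metrics below are optimistic relative to best_score_.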

In [24]:
y_pred = gs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[24515,  1735],
       [ 1473,  4405]], dtype=int64)

In [25]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.93      0.94     26250
           1       0.72      0.75      0.73      5878

   micro avg       0.90      0.90      0.90     32128
   macro avg       0.83      0.84      0.84     32128
weighted avg       0.90      0.90      0.90     32128



In [26]:
np.mean(y_dev==y_pred)

0.9001494023904383

Not a bad result! Let's see whether swapping in a TfidfVectorizer makes a difference by adding it to the parameter search. TF-IDF reduces the weight of tokens that appear in many documents and increases the weight of tokens that appear in few (via inverse document frequency, or IDF).
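
The IDF effect is easy to see on a toy corpus. In the sketch below (three made-up sentences, illustrative names), "the" appears in every document, "sat" in two, and "cat" in one. With the default smooth_idf=True, scikit-learn computes idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = ["the cat sat", "the dog sat", "the bird flew"]

toy_tfidf = TfidfVectorizer(token_pattern=r"[a-z]+")
toy_tfidf.fit(docs)

# Map each token to its learned inverse document frequency
idf = {tok: toy_tfidf.idf_[col] for tok, col in toy_tfidf.vocabulary_.items()}

for tok in ["the", "sat", "cat"]:
    print(f"{tok}\t{idf[tok]:.4f}")
```

A token that occurs in every document gets the minimum IDF of 1.0, and the weight grows as the token becomes rarer across documents.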

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect_1 = CountVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

vect_2 = TfidfVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

clf = MultinomialNB()

# vect_1 is only a placeholder here; the 'vect' entry in the parameter grid overrides it
pipe = Pipeline([("vect", vect_1), ("clf", clf)])

In [28]:
param_grid = {
    'vect':[vect_1, vect_2],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10, 20],
    'clf__fit_prior':[False, True]
}

gs = GridSearchCV(pipe, param_grid, scoring='f1', n_jobs=6, cv=ps, verbose=2)
gs.fit(X, y)
print(gs.best_params_)
print(gs.best_score_)

Fitting 1 folds for each of 60 candidates, totalling 60 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed:  4.9min
[Parallel(n_jobs=6)]: Done  60 out of  60 | elapsed: 10.3min finished


{'clf__fit_prior': True, 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[a-z]+', tokenizer=None,
        vocabulary=None), 'vect__min_df': 2, 'vect__ngram_range': (1, 1)}
0.690830822212656


Same result! TF-IDF did not help here, which is an important lesson when it comes to machine learning (i.e., no free lunch). 

In [29]:
y_pred = gs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[24515,  1735],
       [ 1473,  4405]], dtype=int64)

In [30]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.93      0.94     26250
           1       0.72      0.75      0.73      5878

   micro avg       0.90      0.90      0.90     32128
   macro avg       0.83      0.84      0.84     32128
weighted avg       0.90      0.90      0.90     32128



In [31]:
np.mean(y_dev==y_pred)

0.9001494023904383

Finally, you'll note that GridSearchCV is an exhaustive search over every combination of values in the parameter grid. Rather than define a rigid grid of points to examine, we can use RandomizedSearchCV to set a budget and sample from a parameter space. Let's add another transformation and more hyperparameters to tune as an example.

In [32]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectPercentile, chi2

vect_1 = CountVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

vect_2 = TfidfVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

select = SelectPercentile(score_func=chi2)

clf = MultinomialNB()

# vect_1 is only a placeholder here; the 'vect' entry in the parameter grid overrides it
pipe = Pipeline([("vect", vect_1), ("select", select), ("clf", clf)])

Note that while there are 2\*3\*5\*6\*2=360 parameter combinations, we are only going to run 30 trials drawn randomly from the possibility space.
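
The combination count can be checked with scikit-learn's ParameterGrid, and ParameterSampler shows the kind of random draw a randomized search makes. In this sketch the strings 'count' and 'tfidf' are placeholders for the two vectorizer objects, and the name `space` and the random_state value are illustrative choices.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same shape as the search space, with strings standing in for the vectorizers
space = {
    'vect': ['count', 'tfidf'],
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__min_df': [1, 2, 5, 10, 20],
    'select__percentile': [1, 2, 5, 10, 20, 50],
    'clf__fit_prior': [False, True],
}

n_total = len(ParameterGrid(space))  # size of the exhaustive grid
print(n_total)

# Draw a budget of 30 candidates; with list-only spaces, sampling is without replacement
sampled = list(ParameterSampler(space, n_iter=30, random_state=101))
print(len(sampled))
print(sampled[0])
```

With a budget of 30 out of 360 combinations, the randomized search evaluates about 8% of the grid, so the best candidate it reports is the best of that sample rather than of the whole space.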

In [33]:
param_grid = {
    'vect':[vect_1, vect_2],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10, 20],
    'select__percentile':[1, 2, 5, 10, 20, 50],
    'clf__fit_prior':[False, True]
}

rs = RandomizedSearchCV(pipe, param_grid, n_iter=30, scoring='f1', n_jobs=6, cv=ps, verbose=2)
rs.fit(X, y)
print(rs.best_params_)
print(rs.best_score_)

Fitting 1 folds for each of 30 candidates, totalling 30 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  30 out of  30 | elapsed:  6.8min finished


{'vect__ngram_range': (1, 2), 'vect__min_df': 2, 'vect': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='[a-z]+', tokenizer=None, use_idf=True,
        vocabulary=None), 'select__percentile': 5, 'clf__fit_prior': False}
0.7009715725080965


In [34]:
dump(rs, '../results/rs_cv_nb.joblib')  # persist the randomized search (rs), not the earlier grid search
# rs = load('../results/rs_cv_nb.joblib')

['../results/rs_cv_nb.joblib']

In [35]:
y_pred = rs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[25047,  1203],
       [ 1392,  4486]], dtype=int64)

In [36]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95     26250
           1       0.79      0.76      0.78      5878

   micro avg       0.92      0.92      0.92     32128
   macro avg       0.87      0.86      0.86     32128
weighted avg       0.92      0.92      0.92     32128



In [37]:
np.mean(y_dev==y_pred)

0.9192293326693227