# 2C-jkk-random-forest

Building on the lessons from the previous notebook (2A-jkk-naive-bayes), we will train a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier) classifier using a bag-of-words representation of text.

Note that the documentation linked above may refer to a newer version of scikit-learn.

In [1]:
import sqlite3 as sql
import pandas as pd
import numpy as np
import re

seed = 101

Load the dataset from the previous notebook.

In [2]:
with sql.connect('../data/toxic.db') as conn:
    df = pd.read_sql_query('select * from toxic', conn)
df.head()

| Unnamed: 0 | rev_id | comment | year | logged_in | ns | sample | split | num | min | max | avg | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2232.0 | `This:\n:One can make an analogy in mathematica...` | 2002 | 1 | article | random | train | 10.0 | -1.0 | 1.0 | 0.4 | 0 |
| 1 | 4216.0 | `"\n\n:Clarification for you (and Zundark's ri...` | 2002 | 1 | user | random | train | 10.0 | 0.0 | 2.0 | 0.5 | 0 |
| 2 | 8953.0 | `Elected or Electoral? JHK` | 2002 | 0 | article | random | test | 10.0 | 0.0 | 1.0 | 0.1 | 0 |
| 3 | 26547.0 | `"This is such a fun entry. Devotchka\n\nI on...` | 2002 | 1 | article | random | train | 10.0 | 0.0 | 2.0 | 0.6 | 0 |
| 4 | 28959.0 | `Please relate the ozone hole to increases in c...` | 2002 | 1 | article | random | test | 10.0 | -1.0 | 1.0 | 0.2 | 0 |


Remember to isolate the train, dev, and test sets.

In [3]:
idx_train = df['split'] == 'train'
idx_dev = df['split'] == 'dev'
idx_test = df['split'] == 'test'

## Hyperparameter tuning

So how do we efficiently tune our hyperparameters? We can use `GridSearchCV` or `RandomizedSearchCV`, but both run k-fold cross-validation by default. Since we have a pre-defined dev set, we need a custom splitter to override that behavior. This setup is fairly standard for large-scale NLP problems: cross-validation is preferred, but it is often too expensive for large datasets. The scikit-learn user guide describes the approach:

```
For some datasets, a pre-defined split of the data into training- and validation fold or into several cross-validation folds already exists. Using PredefinedSplit it is possible to use these folds e.g. when searching for hyperparameters.

For example, when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
```

With that in mind, we can set up a `PredefinedSplit`. Note that the following code is wasteful from a memory standpoint, because it copies the train and dev data into new arrays. We are simply doing it this way for clarity.

In [4]:
X_train = df.loc[idx_train, "comment"].values
y_train = df.loc[idx_train, "y"].values

X_dev = df.loc[idx_dev, "comment"].values
y_dev = df.loc[idx_dev, "y"].values

X = np.hstack([X_train, X_dev])
y = np.hstack([y_train, y_dev])

In [5]:
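# PredefinedSplit convention: -1 = always in the training set, 0 = validation fold
idx = np.zeros(shape=y.shape)
idx[:y_train.shape[0]] = -1
# pd.value_counts(idx) is deprecated in recent pandas; use the Series method instead
pd.Series(idx).value_counts()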

-1.0    95692
 0.0    32128
dtype: int64
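
As a lower-memory alternative, the fold labels can be derived directly from the `split` column, which avoids the intermediate train/dev arrays and the `np.hstack` copies. The sketch below uses a made-up frame named `toy` (an assumption for illustration, not the real table). `PredefinedSplit` does not need the training rows to come first.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the rows are made up for illustration.
toy = pd.DataFrame({
    "comment": ["a", "b", "c", "d", "e", "f"],
    "split":   ["train", "dev", "test", "train", "dev", "train"],
    "y":       [0, 1, 0, 0, 1, 0],
})

# Keep train and dev rows in their original order (test rows are dropped).
keep = toy["split"].isin(["train", "dev"])
X_toy = toy.loc[keep, "comment"].to_numpy()
y_toy = toy.loc[keep, "y"].to_numpy()

# -1 marks rows that are always used for training, 0 marks the validation fold.
test_fold = np.where(toy.loc[keep, "split"] == "dev", 0, -1)
print(test_fold)
```

The resulting `test_fold` can be passed to `PredefinedSplit` in the same way as `idx`.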

In [6]:
from sklearn.model_selection import PredefinedSplit, GridSearchCV

ps = PredefinedSplit(idx)
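
To see what such a splitter produces, here is a minimal sketch on a toy label array (`toy_fold` and `toy_ps` are hypothetical names). It yields exactly one split, with the `-1` rows as the training indices and the `0` rows as the validation indices.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy fold labels: four training rows (-1) followed by two validation rows (0).
toy_fold = np.array([-1, -1, -1, -1, 0, 0])
toy_ps = PredefinedSplit(toy_fold)

# One distinct non-negative label means one train/validation split.
print(toy_ps.get_n_splits())
for train_idx, val_idx in toy_ps.split():
    print(train_idx, val_idx)
```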

Let's continue our earlier experiments, simply swapping out the classifier for a Random Forest model. Note that tree-based methods are somewhat inefficient for sparse input, and it might be better to consider a method that produces a dense vector representation (e.g., [LSA](https://scikit-learn.org/0.22/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD), [NMF](https://scikit-learn.org/0.22/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF), or [LDA](https://scikit-learn.org/0.22/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation)). In addition, since the Random Forest is an ensemble of decision trees that can itself be trained in parallel, we have to divide the available cores between the classifier (`n_jobs` on `RandomForestClassifier`) and the search (`n_jobs` on `RandomizedSearchCV`). At one extreme, each pipeline trains quickly but the candidates are evaluated one after another; at the other, each pipeline trains slowly but many candidates are evaluated in parallel.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vect_1 = CountVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

vect_2 = TfidfVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

select = SelectPercentile(score_func=chi2)

clf = RandomForestClassifier(n_jobs=2)

pipe = Pipeline([("vect", vect_1), ("select", select), ("clf", clf)])
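
The search space for a pipeline like this relies on scikit-learn's parameter naming convention: a bare step name (such as `vect`) replaces the whole step, while `step__param` sets a parameter on that step. A minimal sketch with a hypothetical two-step pipeline named `demo_pipe`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10)),
])

# Swap the vectorizer and set nested parameters in a single call.
# The step replacement is applied first, so vect__min_df lands on the new step.
demo_pipe.set_params(vect=TfidfVectorizer(), vect__min_df=2, clf__max_depth=5)

print(type(demo_pipe.named_steps["vect"]).__name__)
print(demo_pipe.named_steps["vect"].min_df)
print(demo_pipe.named_steps["clf"].max_depth)
```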

In [8]:
param_grid = {
    'vect':[vect_1, vect_2],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10, 20],
    'select__percentile':[1, 2, 5, 10, 20, 50],
    'clf__n_estimators':[10, 20, 50, 100],
    'clf__max_depth':[1, 2, 5, 10],
    'clf__class_weight':[None, 'balanced', 'balanced_subsample']
}

# Pass the PredefinedSplit as cv so that each candidate is fit on the train rows and scored on the dev rows.
rs = RandomizedSearchCV(pipe, param_grid, n_iter=30, scoring='f1', n_jobs=3, cv=ps, verbose=2)
rs.fit(X, y)
print(rs.best_params_)
print(rs.best_score_)

Fitting 1 folds for each of 30 candidates, totalling 30 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  30 out of  30 | elapsed:  8.2min finished


{'vect__ngram_range': (1, 3), 'vect__min_df': 10, 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=10,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='[a-z]+', tokenizer=None,
        vocabulary=None), 'select__percentile': 1, 'clf__n_estimators': 100, 'clf__max_depth': 10, 'clf__class_weight': 'balanced_subsample'}
0.4800218938149972


In [9]:
from joblib import dump, load

dump(rs, '../results/rs_cv_rf.joblib')
# rs = load('../results/rs_cv_rf.joblib')  # reload the saved search instead of refitting

['../results/rs_cv_rf.joblib']
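
Beyond the single best candidate, a fitted search object keeps the score of every sampled candidate in `cv_results_`, which is handy for seeing how sensitive the score is to each hyperparameter. The sketch below runs a tiny search on synthetic data; the data, the parameter values, and the names `search` and `results` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

# Synthetic stand-in data: 150 training rows followed by 50 validation rows.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
toy_fold = [-1] * 150 + [0] * 50

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10, 20], "max_depth": [1, 2, 5]},
    n_iter=4, scoring="f1", cv=PredefinedSplit(toy_fold), random_state=0,
)
search.fit(X_toy, y_toy)

# One row per sampled candidate, best candidate first.
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_estimators", "param_max_depth", "mean_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```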

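Now we can examine the results. Keep in mind that `RandomizedSearchCV` refits the best pipeline on all of `X` (train and dev) by default (`refit=True`), so the dev-set metrics below are computed on data the refit model has already seen; `rs.best_score_` is the held-out estimate.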

In [10]:
from sklearn.metrics import confusion_matrix, classification_report

y_pred = rs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[18252,  7998],
       [ 1313,  4565]], dtype=int64)

In [11]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.70      0.80     26250
           1       0.36      0.78      0.50      5878

   micro avg       0.71      0.71      0.71     32128
   macro avg       0.65      0.74      0.65     32128
weighted avg       0.83      0.71      0.74     32128



In [12]:
np.mean(y_dev==y_pred)

0.7101904880478087
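
The figures in the classification report and the accuracy follow directly from the confusion matrix. As a quick check, this sketch recomputes them for class 1 from the four counts reported above:

```python
import numpy as np

# Dev-set confusion matrix from above (rows: true class, columns: predicted class).
cm = np.array([[18252, 7998],
               [1313, 4565]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```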

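Hm, not so great: accuracy is only 0.71, and precision for class 1 is 0.36 (F1 of 0.50). Let's try NMF, which gives the forest a dense representation in place of the chi-squared feature selection.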

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF

vect_1 = CountVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

vect_2 = TfidfVectorizer(
    token_pattern = r"[a-z]+", 
    ngram_range = (1,1),
    lowercase = True,
    min_df = 1,
    max_df = 1.0
)

nmf = NMF()

clf = RandomForestClassifier(n_jobs=2)

pipe = Pipeline([("vect", vect_1), ("nmf", nmf), ("clf", clf)])

In [14]:
param_grid = {
    'vect':[vect_1, vect_2],
    'vect__ngram_range':[(1,1), (1,2), (1,3)],
    'vect__min_df':[1, 2, 5, 10, 20],
    'nmf__n_components':[10, 20, 50, 100, 200],
    'nmf__init':['random', 'nndsvda', 'nndsvda'],  # 'nndsvda' is listed twice, so it is sampled twice as often as 'random'
    'clf__n_estimators':[10, 20, 50, 100],
    'clf__max_depth':[1, 2, 5, 10, 20],
    'clf__class_weight':[None, 'balanced', 'balanced_subsample']
}

# Pass the PredefinedSplit as cv so that each candidate is fit on the train rows and scored on the dev rows.
rs = RandomizedSearchCV(pipe, param_grid, n_iter=60, scoring='f1', n_jobs=3, cv=ps, verbose=2)
rs.fit(X, y)
print(rs.best_params_)
print(rs.best_score_)

Fitting 1 folds for each of 60 candidates, totalling 60 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed: 349.4min
[Parallel(n_jobs=3)]: Done  60 out of  60 | elapsed: 1397.1min finished


{'vect__ngram_range': (1, 3), 'vect__min_df': 20, 'vect': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='[a-z]+', tokenizer=None, use_idf=True,
        vocabulary=None), 'nmf__n_components': 100, 'nmf__init': 'nndsvda', 'clf__n_estimators': 100, 'clf__max_depth': 20, 'clf__class_weight': 'balanced_subsample'}
0.6257511575214264


In [15]:
dump(rs, '../results/rs_cv_nmf_rf.joblib')

['../results/rs_cv_nmf_rf.joblib']

In [16]:
y_pred = rs.best_estimator_.predict(X_dev)
confusion_matrix(y_dev, y_pred)

array([[25816,   434],
       [  572,  5306]], dtype=int64)

In [17]:
print(classification_report(y_dev, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98     26250
           1       0.92      0.90      0.91      5878

   micro avg       0.97      0.97      0.97     32128
   macro avg       0.95      0.94      0.95     32128
weighted avg       0.97      0.97      0.97     32128



In [18]:
np.mean(y_dev==y_pred)

0.968687749003984
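
The dev-set F1 of 0.91 for class 1 is far higher than the `best_score_` of 0.626 reported by the search. A likely explanation is that `RandomizedSearchCV` refits the best pipeline on all of `X` (train and dev) by default, so `rs.best_estimator_` has already seen the dev rows it is predicting here. One way to get a held-out number is to clone the best estimator and fit it on the training rows only. The sketch below shows the pattern on synthetic data; the data and all names in it are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

# Synthetic stand-in data; the sizes and parameters are arbitrary.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
n_train = 200
toy_fold = np.r_[np.full(n_train, -1), np.zeros(len(y_toy) - n_train, dtype=int)]

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [5, None]},
    n_iter=3, scoring="f1", cv=PredefinedSplit(toy_fold), random_state=0,
)
search.fit(X_toy, y_toy)  # refit=True: the best model is refit on train + dev

X_tr, y_tr = X_toy[:n_train], y_toy[:n_train]
X_dv, y_dv = X_toy[n_train:], y_toy[n_train:]

# Score of the refit model on dev rows it was trained on (optimistic).
f1_seen = f1_score(y_dv, search.best_estimator_.predict(X_dv))

# Same hyperparameters, fit on the training rows only (held-out estimate).
held_out = clone(search.best_estimator_).fit(X_tr, y_tr)
f1_held_out = f1_score(y_dv, held_out.predict(X_dv))

print(f1_seen, f1_held_out, search.best_score_)
```

With a fixed `random_state`, the held-out F1 matches `search.best_score_`, while the score of the refit model on rows it has seen is typically higher. Applied to this notebook, the same pattern would be `clone(rs.best_estimator_).fit(X_train, y_train)` followed by a prediction on `X_dev`.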