#### Step 3: Modeling

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [2]:
data = pd.read_csv('../data/cleaned_posts.csv')

In [3]:
X = data['title']
y = data['subreddit']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1120)

Before evaluating any classification models, it's important to establish a baseline accuracy for comparing results. This null model uses the most commonly occurring target value in the training data as the prediction for every sample in the test data.

#### Step 3A: Null Model

In [5]:
y_train.value_counts(normalize=True)

Coffee    0.50275
tea       0.49725
Name: subreddit, dtype: float64

In [6]:
y_test.value_counts(normalize=True)

Coffee    0.502604
tea       0.497396
Name: subreddit, dtype: float64

Guessing 'Coffee' for every sample gives us a null accuracy of 50.3%, only slightly off from the 50% we would have expected if all samples had been kept and the classes were exactly even.
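The null model can also be written out directly. The sketch below is a plain-Python illustration on made-up labels (the helper name `null_accuracy` and the toy lists are assumptions, not part of this notebook); scikit-learn's `DummyClassifier(strategy='most_frequent')` should give the same kind of baseline on the real split.

```python
from collections import Counter

def null_accuracy(y_train, y_test):
    """Accuracy from always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)

# Toy labels (made up for illustration, not the notebook's data)
y_train_toy = ["Coffee", "Coffee", "tea", "Coffee", "tea"]
y_test_toy = ["tea", "Coffee", "Coffee", "tea"]

label, acc = null_accuracy(y_train_toy, y_test_toy)
print(label, acc)  # Coffee 0.5
```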

#### Step 3B: Natural Language Processing & Vectorization

The first step in our modeling process will be to transform our text. This will include any preprocessing in the form of tokenization, stemming, or lemmatization, the actual transformer that we'll use to vectorize the tokens, and the hyperparameters that we'll feed into that transformer, like n-gram size, the minimum and maximum number of samples a token must occur in to be included, and the number of tokens we ultimately want to see used as features.

There are two main vectorizers to consider:
- CountVectorizer, which uses the raw count of each token in each sample as the matrix values
- TfidfVectorizer (TF-IDF: 'term frequency'-'inverse document frequency'), which weights each token's count in a sample by how rare that token is across all samples, so tokens that appear in nearly every sample count for less
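To make the difference concrete, here is a small hand-rolled sketch on a made-up three-title corpus. It assumes the smoothed IDF and l2 normalization that I understand to be TfidfVectorizer's defaults (`smooth_idf=True`, `norm='l2'`); the corpus and the helper name `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = [
    "espresso beans espresso",
    "green tea leaves",
    "iced tea espresso",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    """Smoothed TF-IDF with l2 normalization for one document."""
    counts = Counter(doc_tokens)
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

counts_row = Counter(tokenized[2])  # CountVectorizer-style: raw counts
tfidf_row = tfidf(tokenized[2])     # TF-IDF-style: rarer terms weigh more
print(counts_row)
print(tfidf_row)
```

In the third title every token has a count of 1, but 'iced' appears in only one document, so it ends up with a larger TF-IDF weight than 'tea' or 'espresso'.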

Both vectorizers require a tokenizer to break the text into words or phrases and to handle things like punctuation. This could be a simple tokenizer that just grabs the words as they appear in the sample text, or a tokenizer that includes lemmatization, the process of reducing a word to a base form so that all the forms a word might take count as the same feature (e.g. ideally 'run,' 'ran,' and 'running' would all become 'run').

For this model, I will be testing the default basic tokenizer as well as a custom tokenizer that incorporates the WordNetLemmatizer. 

Further work could be done to specify custom regular expressions ('regex') for tokenization, or a more powerful NLP library like spaCy could be used, but prior testing has shown that neither improves this model enough to justify the added complexity.

In [7]:
# this custom wordnetlemmatizer code has been taken from:
# https://stackoverflow.com/questions/47423854/sklearn-adding-lemmatizer-to-countvectorizer
# and is similar to code found in lesson 5.04 NLP II

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]
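To show what a lemmatizing tokenizer changes, here is a dependency-free sketch. The tiny lemma table is a hypothetical stand-in for WordNet, and the regular expression mirrors what I understand to be CountVectorizer's default token pattern (words of two or more characters); neither is the tokenizer actually used in this notebook.

```python
import re
from collections import Counter

# Hypothetical stand-in for WordNet: a tiny hand-written lemma table
TOY_LEMMAS = {"beans": "bean", "cups": "cup", "teas": "tea", "leaves": "leaf"}

def plain_tokenizer(text):
    """Lowercase and split into word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def lemma_tokenizer(text):
    """Same tokens, each mapped to its base form when the table knows it."""
    return [TOY_LEMMAS.get(tok, tok) for tok in plain_tokenizer(text)]

title = "Two cups of beans, one cup of bean"
plain_counts = Counter(plain_tokenizer(title))
lemma_counts = Counter(lemma_tokenizer(title))
print(plain_counts)
print(lemma_counts)
```

With the plain tokenizer 'cup' and 'cups' are separate features; after lemmatization they collapse into a single 'cup' feature with a count of 2, which is the effect the LemmaTokenizer class aims for.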

To test what kind of vectorizer to use, I'll use GridSearchCV to test out different combinations of vectorizers, tokenizers, and hyperparameters. For comparison, they will all use a naive Bayes classifier. 

In [8]:
vector_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('bayes', MultinomialNB())
])

vector_params = {
    'vect' : [CountVectorizer(), TfidfVectorizer()],
    'vect__tokenizer' : [None, LemmaTokenizer()],
    'vect__stop_words' : [None, 'english'],
    'vect__ngram_range' : [(1, 1), (1, 2), (2, 2)],
    'vect__min_df' : [1, 2, 5],
    'vect__max_features' : [None, 1000, 2500]
}

In [9]:
vector_grid = GridSearchCV(vector_pipe, param_grid=vector_params, n_jobs=-1)
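As a rough sense of scale, the number of fits this search performs can be counted from the grid. The sketch below copies the number of options per key from vector_params and assumes 5-fold cross-validation, which matches the five split columns in the saved results.

```python
from itertools import product

# Number of candidate values per hyperparameter in vector_params
grid_sizes = {
    "vect": 2,
    "vect__tokenizer": 2,
    "vect__stop_words": 2,
    "vect__ngram_range": 3,
    "vect__min_df": 3,
    "vect__max_features": 3,
}

# Enumerate every combination, as an exhaustive grid search does
n_candidates = len(list(product(*(range(n) for n in grid_sizes.values()))))
n_folds = 5  # assumption: matches the five split*_test_score columns
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)  # 216 1080
```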

Instead of rerunning the time-consuming search several times, I saved the results of the cell below before commenting out the code.

In [10]:
# vector_grid.fit(X_train, y_train)
# vector_search = pd.DataFrame(vector_grid.cv_results_).sort_values(by='mean_test_score', ascending=False)
# vector_search.to_csv('../grid_search/vectorizers.csv', index=False)

I can then load the saved results:

In [11]:
vector_search = pd.read_csv('../grid_search/vectorizers.csv')
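An alternative to commenting code in and out is a small load-or-compute helper. This is only a sketch of the pattern using the standard library's csv module rather than pandas; the names `load_or_compute` and `fake_search` and the temporary path are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def load_or_compute(path, compute):
    """Return cached rows from `path`, computing and saving them if absent."""
    path = Path(path)
    if not path.exists():
        rows = compute()
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    # Always read from disk so both code paths return string-valued rows
    with path.open(newline="") as f:
        return list(csv.DictReader(f))

calls = []

def fake_search():
    """Stand-in for an expensive grid search; records each invocation."""
    calls.append(1)
    return [{"param": "a", "mean_test_score": 0.9}]

cache_file = Path(tempfile.mkdtemp()) / "grid_search" / "demo.csv"
first = load_or_compute(cache_file, fake_search)
second = load_or_compute(cache_file, fake_search)
print(len(calls), first == second)
```

The expensive function only runs on the first call; the second call reads the saved file.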

In [12]:
vector_search.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_vect,param_vect__max_features,param_vect__min_df,param_vect__ngram_range,param_vect__stop_words,param_vect__tokenizer,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.307686,0.030241,0.290893,0.007107,"CountVectorizer(ngram_range=(1, 2), stop_words...",,1,"(1, 2)",english,<__main__.LemmaTokenizer object at 0x1a13778310>,"{'vect': CountVectorizer(ngram_range=(1, 2), s...",0.917511,0.914616,0.914616,0.931983,0.918958,0.919537,0.006446,1
1,1.261823,0.028666,0.285279,0.009024,"CountVectorizer(ngram_range=(1, 2), stop_words...",,2,"(1, 2)",english,<__main__.LemmaTokenizer object at 0x1a13778310>,"{'vect': CountVectorizer(ngram_range=(1, 2), s...",0.911722,0.910275,0.913169,0.937771,0.916064,0.9178,0.010167,2
2,0.143002,0.007003,0.02336,0.001405,"CountVectorizer(ngram_range=(1, 2), stop_words...",2500.0,2,"(1, 2)",english,,"{'vect': CountVectorizer(ngram_range=(1, 2), s...",0.914616,0.908828,0.905933,0.930535,0.924747,0.916932,0.009361,3
3,0.112515,0.0072,0.022423,0.002819,"CountVectorizer(ngram_range=(1, 2), stop_words...",,2,"(1, 2)",english,,"{'vect': CountVectorizer(ngram_range=(1, 2), s...",0.916064,0.910275,0.903039,0.931983,0.9233,0.916932,0.010051,3
4,1.203209,0.120937,0.289473,0.025417,"CountVectorizer(ngram_range=(1, 2), stop_words...",,1,"(1, 1)",english,<__main__.LemmaTokenizer object at 0x1a13778310>,"{'vect': CountVectorizer(ngram_range=(1, 2), s...",0.9233,0.904486,0.911722,0.930535,0.913169,0.916643,0.00918,5


The top five results all used the CountVectorizer with stop words removed and a minimum document frequency of 1 or 2. Four of the five combined unigrams and bigrams, four had no cap on the number of features, and three used the custom LemmaTokenizer.

I'll use the second-best vectorizer configuration moving forward. Requiring each feature to occur in at least 2 samples should substantially reduce the number of features used in calculations, and its mean cross-validation score (0.918) is nearly identical to the top result (0.920).

In [13]:
cvect = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=2, tokenizer=LemmaTokenizer())

#### Step 3C: Classification Model

For the classification itself, I'll be testing several different models as well as fine-tuning their respective hyperparameters. I'll begin with three single (non-ensemble) models: a naive Bayes classifier, logistic regression, and a support vector classifier.

In [14]:
bayes_pipe = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('bayes', MultinomialNB())
])

bayes_params = {
    'bayes__alpha' : [0.001, 1, 100, 100_000, 1_000_000]
}

In [15]:
bayes_grid = GridSearchCV(bayes_pipe, bayes_params, n_jobs=-1)

As with the vectorizers above, I'll be saving and loading the results of the grid searches to keep from running the searches more than necessary.

In [16]:
# bayes_grid.fit(X_train, y_train)
# bayes_search = pd.DataFrame(bayes_grid.cv_results_).sort_values(by='mean_test_score', ascending=False)
# bayes_search.to_csv('../grid_search/naivebayes.csv', index=False)

In [17]:
bayes_search = pd.read_csv('../grid_search/naivebayes.csv')

In [18]:
bayes_search.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bayes__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.986317,0.016954,0.44208,0.019782,100000.0,{'bayes__alpha': 100000},0.901592,0.892909,0.879884,0.910275,0.888567,0.894645,0.010492,1
1,2.104413,0.142098,0.45498,0.030158,100.0,{'bayes__alpha': 100},0.863965,0.868307,0.855282,0.888567,0.869754,0.869175,0.01093,2
2,2.43817,0.13144,0.553253,0.095937,1.0,{'bayes__alpha': 1},0.859624,0.845152,0.845152,0.875543,0.859624,0.857019,0.011299,3
3,7.190884,2.276147,0.580462,0.01211,0.001,{'bayes__alpha': 0.001},0.849493,0.837916,0.843705,0.872648,0.853835,0.85152,0.011849,4
4,1.818018,0.359644,0.626132,0.211996,1000000.0,{'bayes__alpha': 1000000},0.659913,0.65123,0.633864,0.625181,0.617945,0.637627,0.015736,5


Here we can see that a large alpha of 100,000 performs best in cross-validation, with a mean score of 0.895, while going up to 1,000,000 drops the score sharply to 0.638. We can also check how the best alpha performs on the full training set and the test set:
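For intuition about what alpha does, the sketch below applies additive smoothing to made-up token counts for a single class. It uses the standard formula (count + alpha) / (total + alpha * vocabulary size), which is my understanding of how MultinomialNB estimates its per-class token probabilities; the counts and the helper name `smoothed_probs` are illustrative only.

```python
def smoothed_probs(counts, alpha):
    """Additive (Lidstone) smoothing of per-class token counts."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {
        tok: (c + alpha) / (total + alpha * vocab_size)
        for tok, c in counts.items()
    }

# Toy token counts for one class (made up for illustration)
coffee_counts = {"espresso": 40, "grinder": 9, "teapot": 1}

results = {}
for alpha in (0.001, 1, 100, 1_000_000):
    results[alpha] = smoothed_probs(coffee_counts, alpha)
    print(alpha, {tok: round(p, 3) for tok, p in results[alpha].items()})
```

As alpha grows, every token's probability is pulled toward 1 / vocabulary size, so a very large alpha eventually washes out the differences between classes.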

In [19]:
bayes_pipe = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('bayes', MultinomialNB(alpha=100_000))
])

bayes_pipe.fit(X_train, y_train)


Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('bayes', MultinomialNB(alpha=100000))])

In [20]:
bayes_pipe.score(X_train, y_train)

0.9502170767004342

In [21]:
bayes_pipe.score(X_test, y_test)

0.8880208333333334

There's a gap of about 6 points between the train score (0.950) and the test score (0.888), indicating that the model is overfitting. This tends to be the norm for a naive Bayes classifier and isn't surprising for text data, where vocabulary choice can vary widely from sample to sample.

I'll search over logistic regression next:

In [22]:
lr_pipe = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('lr', LogisticRegression(solver='saga'))
])

lr_params= {
    'lr__penalty' : ['l1', 'l2'],
    'lr__C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'lr__max_iter' : [100, 1000]
}

In [23]:
lr_grid = GridSearchCV(lr_pipe, param_grid=lr_params, n_jobs=-1)

In [24]:
# lr_grid.fit(X_train, y_train)
# lr_search = pd.DataFrame(lr_grid.cv_results_).sort_values(by='mean_test_score', ascending=False)
# lr_search.to_csv('../grid_search/logisticregression.csv', index=False)

In [25]:
lr_search = pd.read_csv('../grid_search/logisticregression.csv')

In [26]:
lr_search.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lr__C,param_lr__max_iter,param_lr__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,4.745059,0.123086,0.384371,0.049956,0.1,100,l1,"{'lr__C': 0.1, 'lr__max_iter': 100, 'lr__penal...",0.918958,0.920405,0.907381,0.918958,0.910275,0.915195,0.005305,1
1,15.333631,1.740514,0.470107,0.176817,0.1,1000,l1,"{'lr__C': 0.1, 'lr__max_iter': 1000, 'lr__pena...",0.910275,0.908828,0.905933,0.903039,0.910275,0.90767,0.002806,2
2,11.666374,0.568221,0.366477,0.029431,1.0,100,l1,"{'lr__C': 1, 'lr__max_iter': 100, 'lr__penalty...",0.908828,0.890014,0.910275,0.913169,0.898698,0.904197,0.008606,3
3,2.028229,0.040023,0.434689,0.016125,0.001,1000,l2,"{'lr__C': 0.001, 'lr__max_iter': 1000, 'lr__pe...",0.904486,0.898698,0.903039,0.917511,0.894356,0.903618,0.007799,4
4,1.886967,0.025981,0.410863,0.011716,0.001,100,l2,"{'lr__C': 0.001, 'lr__max_iter': 100, 'lr__pen...",0.904486,0.898698,0.903039,0.917511,0.894356,0.903618,0.007799,4


Here we see our highest performance with an l1 penalty, a C value of 0.1, and a maximum of 100 iterations.

The highest mean test score was .915, and we can check performance on the full training set and the test set next: 

In [27]:
lr_pipe1 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('lr', LogisticRegression(penalty='l1', C=.1, max_iter=100, solver='saga'))
])

lr_pipe1.fit(X_train, y_train)



Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('lr', LogisticRegression(C=0.1, penalty='l1', solver='saga'))])

In [28]:
lr_pipe1.score(X_train, y_train)

0.9777134587554269

In [29]:
lr_pipe1.score(X_test, y_test)

0.9140625

We can see that the logistic regression outperforms the naive Bayes classifier on both the train data and the test data, but it also shows signs of overfitting.

With that in mind, we can check the scores with some smaller C values, corresponding with stronger regularization. 
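For reference, one common way to write the l1-penalized logistic regression objective is shown below. This is my understanding of how C enters the problem, up to a constant scaling; the symbols $w$ (weights), $b$ (intercept), $x_i$ (feature vector), and labels coded as $y_i \in \{-1, 1\}$ are notation introduced here.

$$\min_{w,\, b} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)$$

Dividing through by $C$ puts a weight of $1/C$ on the penalty term, so halving C from 0.1 to 0.05 doubles the relative strength of the l1 penalty.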

In [30]:
lr_pipe2 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('lr', LogisticRegression(solver='saga', penalty='l1', max_iter=100, C=0.05))
])

lr_pipe2.fit(X_train, y_train)



Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('lr',
                 LogisticRegression(C=0.05, penalty='l1', solver='saga'))])

In [31]:
lr_pipe2.score(X_train, y_train)

0.9623733719247467

In [32]:
lr_pipe2.score(X_test, y_test)

0.9210069444444444

After some trial and error, an l1 penalty with a C value of 0.05 appears to be the best choice: it narrows the gap between the train and test scores (0.962 vs 0.921) without sacrificing test accuracy.

The next model I'll try is the support vector classifier.

In [33]:
svc_pipe = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('svc', SVC())
])

svc_params= {
    'svc__kernel' : ['rbf', 'sigmoid', 'poly'],
    'svc__C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

In [34]:
svc_grid = GridSearchCV(svc_pipe, param_grid=svc_params, n_jobs=-1)

In [35]:
# svc_grid.fit(X_train, y_train)
# svc_search = pd.DataFrame(svc_grid.cv_results_).sort_values(by='mean_test_score', ascending=False)
# svc_search.to_csv('../grid_search/svc.csv', index=False)

In [36]:
svc_search = pd.read_csv('../grid_search/svc.csv')

In [37]:
svc_search.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_svc__C,param_svc__kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,2.830177,0.115086,0.699153,0.13584,1.0,sigmoid,"{'svc__C': 1, 'svc__kernel': 'sigmoid'}",0.907381,0.879884,0.890014,0.903039,0.904486,0.896961,0.010412,1
1,3.082071,0.137042,0.645653,0.027951,10.0,rbf,"{'svc__C': 10, 'svc__kernel': 'rbf'}",0.89725,0.859624,0.875543,0.872648,0.888567,0.878726,0.013054,2
2,3.189655,0.139689,0.641812,0.036743,1.0,rbf,"{'svc__C': 1, 'svc__kernel': 'rbf'}",0.898698,0.852388,0.871201,0.871201,0.87699,0.874096,0.014843,3
3,2.984897,0.064087,0.66547,0.058165,100.0,rbf,"{'svc__C': 100, 'svc__kernel': 'rbf'}",0.881331,0.846599,0.850941,0.861071,0.882779,0.864544,0.015056,4
4,2.836434,0.164419,0.501529,0.04266,10.0,sigmoid,"{'svc__C': 10, 'svc__kernel': 'sigmoid'}",0.859624,0.845152,0.862518,0.862518,0.871201,0.860203,0.008468,5


In [38]:
svc_pipe1 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('svc', SVC(kernel='sigmoid'))
])

svc_pipe1.fit(X_train, y_train)

Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('svc', SVC(kernel='sigmoid'))])

In [39]:
svc_pipe1.score(X_train, y_train)

0.9577424023154848

In [40]:
svc_pipe1.score(X_test, y_test)

0.8923611111111112

The best model appears to be the sigmoid kernel with a C value of 1. Trying different C values did not significantly reduce the gap between the train and test scores or improve the test accuracy. We can also compare the gap to the Gaussian (rbf) kernel:

In [41]:
svc_pipe2 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('svc', SVC(C=1))
])

svc_pipe2.fit(X_train, y_train)

Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('svc', SVC(C=1))])

In [42]:
svc_pipe2.score(X_train, y_train)

0.9748191027496382

In [43]:
svc_pipe2.score(X_test, y_test)

0.8715277777777778

Across different C values, the Gaussian kernel performed worse on the test data than the sigmoid kernel.
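For reference, the three kernels in the search grid can be written as short functions. This is a sketch of the textbook definitions as I understand them from the SVC documentation, with `gamma`, `coef0`, and `degree` passed explicitly; the two toy vectors are made up.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma):
    """Gaussian (rbf) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <u, v> + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)

def poly_kernel(u, v, gamma, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree

# Two toy count vectors (made up for illustration)
a = [2.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]

print(rbf_kernel(a, b, gamma=0.5))
print(sigmoid_kernel(a, b, gamma=0.5))
print(poly_kernel(a, b, gamma=0.5))
```

The rbf kernel depends only on the distance between two samples, while the sigmoid and polynomial kernels depend on their inner product, which may be part of why they behave differently on sparse count features.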

Next I'll look at an ensemble model, the random forest classifier.

In [44]:
rfc_pipe = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('rfc', RandomForestClassifier(n_jobs=-1))
])

rfc_params= {
    'rfc__criterion' : ['gini', 'entropy'],
    'rfc__n_estimators' : [10, 100, 1000],
    'rfc__max_depth' : [3, 5, 10, None]
}

In [45]:
rfc_grid = GridSearchCV(rfc_pipe, param_grid=rfc_params)

In [46]:
# rfc_grid.fit(X_train, y_train)
# rfc_search = pd.DataFrame(rfc_grid.cv_results_).sort_values(by='mean_test_score', ascending=False)
# rfc_search.to_csv('../grid_search/randomforest.csv', index=False)

In [47]:
rf_search = pd.read_csv('../grid_search/randomforest.csv')

In [48]:
rf_search.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfc__criterion,param_rfc__max_depth,param_rfc__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,3.297346,0.075845,0.321364,0.003114,gini,,1000,"{'rfc__criterion': 'gini', 'rfc__max_depth': N...",0.916,0.926667,0.906667,0.914667,0.905333,0.913867,0.007664,1
1,3.10244,0.037014,0.318379,0.003102,entropy,,1000,"{'rfc__criterion': 'entropy', 'rfc__max_depth'...",0.913333,0.932,0.901333,0.917333,0.904,0.9136,0.010911,2
2,0.397004,0.015792,0.114382,0.001129,gini,,100,"{'rfc__criterion': 'gini', 'rfc__max_depth': N...",0.914667,0.922667,0.902667,0.92,0.908,0.9136,0.007419,3
3,0.37396,0.013042,0.114435,0.002338,entropy,,100,"{'rfc__criterion': 'entropy', 'rfc__max_depth'...",0.913333,0.930667,0.902667,0.914667,0.904,0.913067,0.010028,4
4,0.088379,0.010437,0.113634,0.001857,gini,,10,"{'rfc__criterion': 'gini', 'rfc__max_depth': N...",0.918667,0.92,0.896,0.905333,0.892,0.9064,0.01142,5


The best parameters were found to be gini criterion, 1000 estimators, and no limit on depth. We can see the performance on train and test sets below.

In [49]:
rfc_pipe1 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('rfc', RandomForestClassifier(criterion='gini', n_estimators=1000, n_jobs=-1))
])

rfc_pipe1.fit(X_train, y_train)

Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('rfc', RandomForestClassifier(n_estimators=1000, n_jobs=-1))])

In [50]:
rfc_pipe1.score(X_train, y_train)

0.9939218523878437

In [51]:
rfc_pipe1.score(X_test, y_test)

0.8940972222222222

The training score is very high, a clear sign of overfitting. We can test whether other parameters reduce the overfitting or increase the accuracy on the test set:

In [52]:
rfc_pipe2 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('rfc', RandomForestClassifier(criterion='entropy', n_estimators=1000, n_jobs=-1))
])

rfc_pipe2.fit(X_train, y_train)

Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('rfc',
                 RandomForestClassifier(criterion='entropy', n_estimators=1000,
                                        n_jobs=-1))])

In [53]:
rfc_pipe2.score(X_train, y_train)

0.9939218523878437

In [54]:
rfc_pipe2.score(X_test, y_test)

0.8949652777777778

Entropy as a criterion scores only marginally better on the test set than gini (0.895 vs 0.894).
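The two criteria measure node impurity in closely related ways, which may help explain why they score so similarly. The sketch below uses the standard definitions on made-up class proportions; the function names are illustrative.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# Class proportions at a hypothetical tree node: balanced, skewed, pure
for split in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(split, round(gini(split), 3), round(entropy(split), 3))
```

Both measures peak for a 50/50 node and fall to zero for a pure node, so they tend to rank candidate splits in a similar order.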

In [55]:
rfc_pipe3 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('rfc', RandomForestClassifier(criterion='gini', n_estimators=100, n_jobs=-1))
])

rfc_pipe3.fit(X_train, y_train)

Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('rfc', RandomForestClassifier(n_jobs=-1))])

In [56]:
rfc_pipe3.score(X_train, y_train)

0.9939218523878437

In [57]:
rfc_pipe3.score(X_test, y_test)

0.8862847222222222

It seems the higher number of estimators didn't add much to the model: dropping from 1000 to 100 trees lowers the test score by less than a point (0.894 to 0.886).

In [58]:
rfc_pipe4 = Pipeline([
    ('cvect', cvect),
    ('sscaler', StandardScaler(with_mean=False)),
    ('rfc', RandomForestClassifier(criterion='gini', n_estimators=100, max_depth=75, n_jobs=-1))
])

rfc_pipe4.fit(X_train, y_train)

Pipeline(steps=[('cvect',
                 CountVectorizer(min_df=2, ngram_range=(1, 2),
                                 stop_words='english',
                                 tokenizer=<__main__.LemmaTokenizer object at 0x10300ea90>)),
                ('sscaler', StandardScaler(with_mean=False)),
                ('rfc', RandomForestClassifier(max_depth=75, n_jobs=-1))])

In [59]:
rfc_pipe4.score(X_train, y_train)

0.9583212735166425

In [60]:
rfc_pipe4.score(X_test, y_test)

0.9010416666666666

Across these parameter settings, the gap between the train score and the test score remains between about 6 and 11 points, but a max depth of 75 gives us the highest random forest test score, 90.1%.
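The train/test gaps can be tabulated directly from the scores reported in the random forest cells above. The dictionary keys are shorthand labels introduced here for the four pipelines.

```python
# (train accuracy, test accuracy) as reported in the random forest cells above
rf_scores = {
    "gini, 1000 trees": (0.9939218523878437, 0.8940972222222222),
    "entropy, 1000 trees": (0.9939218523878437, 0.8949652777777778),
    "gini, 100 trees": (0.9939218523878437, 0.8862847222222222),
    "gini, 100 trees, max_depth=75": (0.9583212735166425, 0.9010416666666666),
}

# Gap between train and test accuracy for each configuration
gaps = {name: train - test for name, (train, test) in rf_scores.items()}
best_name = max(rf_scores, key=lambda name: rf_scores[name][1])

for name, gap in gaps.items():
    print(f"{name}: test={rf_scores[name][1]:.3f} gap={gap:.3f}")
print("best test accuracy:", best_name)
```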