# Table of Contents (Code Part 4 of 6)

1. [Libraries](#libraries)
2. [Random Forest Classifier](#random-forest-classifier)
3. [RF with Countvectorizer](#rf-with-countvectorizer)
4. [RF with TFIDF Vectorizer](#rf-with-tfidf-vectorizer)
5. [RF Countvectorizer Optimization](#rf-countvectorizer-optimization)

# Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

In [4]:
# Read allposts csv file
allposts = pd.read_csv('../Data/allposts.csv')

In the previous section, we selected two models for optimization. Our goal is to increase the test score and to narrow the gap between the train score and the test score, which reduces overfitting. After that, we will look at the metrics and performance.

# Random Forest Classifier

## RF with Countvectorizer

Random forest classifier with a CountVectorizer transformer on the original combined text.

In the Model Selection section, we used the default hyperparameters and only the CountVectorizer transformer. In this section, we try different hyperparameters as well as the TF-IDF transformer to determine which gives the better CV score and test accuracy.

In [5]:
# Initialize X and y
X = allposts['combinedtext']
y = allposts['subreddit']

# Perform train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12)

In [6]:
# Set up pipeline with countvectorizer and random forest classifier
pipe_rf_c = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('rf', RandomForestClassifier())
])

In [7]:
# Pipeline parameters
pipe_rf_c_params = {
    'cvec__max_df': [0.9, 1.0],
    'rf__n_estimators': [100, 200]
}

In [8]:
# Perform Gridsearch with 3 fold cross validation
gs_rf_c = GridSearchCV(pipe_rf_c, pipe_rf_c_params, cv=3, verbose=10)
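
As a rough illustration of how much work this grid search schedules, the sketch below expands the same parameter grid by hand using only the standard library. It is not how `GridSearchCV` is implemented internally; it only reproduces the arithmetic: every combination of values is one candidate, and each candidate is fitted once per fold.

```python
from itertools import product

# Same grid as pipe_rf_c_params above
param_grid = {
    'cvec__max_df': [0.9, 1.0],
    'rf__n_estimators': [100, 200],
}
n_folds = 3  # matches cv=3

# One candidate per combination of parameter values
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is fitted once per cross-validation fold
n_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(f'{len(candidates)} candidates x {n_folds} folds = {n_fits} fits')
```

The totals agree with the first line of the fit log. With the default `refit=True`, `GridSearchCV` also refits the best candidate once more on the full training set after the search.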

In [9]:
# Train the model
gs_rf_c.fit(X_train, y_train);

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3; 1/4] START cvec__max_df=0.9, rf__n_estimators=100......................
[CV 1/3; 1/4] END cvec__max_df=0.9, rf__n_estimators=100;, score=0.849 total time=  17.8s
[CV 2/3; 1/4] START cvec__max_df=0.9, rf__n_estimators=100......................
[CV 2/3; 1/4] END cvec__max_df=0.9, rf__n_estimators=100;, score=0.836 total time=  18.6s
[CV 3/3; 1/4] START cvec__max_df=0.9, rf__n_estimators=100......................
[CV 3/3; 1/4] END cvec__max_df=0.9, rf__n_estimators=100;, score=0.854 total time=  19.4s
[CV 1/3; 2/4] START cvec__max_df=0.9, rf__n_estimators=200......................
[CV 1/3; 2/4] END cvec__max_df=0.9, rf__n_estimators=200;, score=0.842 total time=  35.5s
[CV 2/3; 2/4] START cvec__max_df=0.9, rf__n_estimators=200......................
[CV 2/3; 2/4] END cvec__max_df=0.9, rf__n_estimators=200;, score=0.833 total time=  34.9s
[CV 3/3; 2/4] START cvec__max_df=0.9, rf__n_estimators=200......................
[CV 

In [20]:
# Gridsearch CV best score
gs_rf_c.best_score_

0.8463760109506193

In [11]:
# Gridsearch best parameters
gs_rf_c.best_params_

{'cvec__max_df': 0.9, 'rf__n_estimators': 100}

The best CountVectorizer setting is max_df = 0.9, while the best n_estimators (100) is the same as the default value.
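
To make the effect of `max_df` concrete, here is a minimal pure-Python sketch of the filtering rule: a term is dropped when it appears in more than `max_df` (as a fraction) of the documents. The function `filter_by_max_df`, the toy corpus, and the whitespace tokenization are assumptions made for illustration; `CountVectorizer` uses its own tokenizer and would also remove English stop words in the pipeline above.

```python
from collections import Counter

def filter_by_max_df(docs, max_df):
    """Return the sorted vocabulary left after dropping terms that
    occur in more than a fraction max_df of the documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(
        term for doc in docs for term in set(doc.lower().split())
    )
    max_doc_count = max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if df <= max_doc_count)

# Hypothetical toy corpus; 'the' occurs in every document
docs = [
    'the model fits the data',
    'the forest has many trees',
    'the vectorizer counts words',
    'the grid search tunes parameters',
]

vocab_09 = filter_by_max_df(docs, max_df=0.9)  # 'the' exceeds 90% of docs
vocab_10 = filter_by_max_df(docs, max_df=1.0)  # nothing can exceed 100%
print(vocab_09)
print(vocab_10)
```

With `max_df=1.0` no term can be removed by this rule, so the two grid values compare "filter near-universal terms" against "no filtering".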

In [13]:
# Train score
print(f'train score: {gs_rf_c.score(X_train, y_train)}')
# Test score
print(f'test score: {gs_rf_c.score(X_test, y_test)}')

train score: 0.999125
test score: 0.838


The model is clearly overfitting: the train score (0.999) is far above the test score (0.838). Let's see whether the TF-IDF vectorizer gives a better result.

## RF with TFIDF Vectorizer

Random forest classifier with a TfidfVectorizer transformer on the original combined text.
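
For readers who want to see how TF-IDF weights differ from raw counts, the following is a small pure-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is what `TfidfVectorizer` applies with its default settings. The helper `tfidf_rows`, the toy corpus, and the whitespace tokenization are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, using smoothed IDF
    and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical toy corpus: 'chase' is in every document, 'mice' in one
docs = ['cats chase mice', 'dogs chase cats', 'dogs chase balls']
rows = tfidf_rows(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

In the first document every term has a raw count of 1, yet the rare term `mice` receives the largest weight and the ubiquitous `chase` the smallest. That reweighting is the only difference between the two vectorizers being compared here.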

In [15]:
# Set up pipeline with TFIDF Vectorizer and random forest classifier
pipe_rf_t = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('rf', RandomForestClassifier())
])

In [17]:
# Pipeline parameters
pipe_rf_t_params = {
    'tvec__max_df': [0.9,1.0],
    'rf__n_estimators': [100, 200]
}

In [18]:
# Perform Gridsearch with 3 fold cross validation
gs_rf_t = GridSearchCV(pipe_rf_t, pipe_rf_t_params, cv=3, verbose=10)

In [19]:
# Train the model
gs_rf_t.fit(X_train, y_train);

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3; 1/4] START rf__n_estimators=100, tvec__max_df=0.9......................
[CV 1/3; 1/4] END rf__n_estimators=100, tvec__max_df=0.9;, score=0.845 total time=  17.3s
[CV 2/3; 1/4] START rf__n_estimators=100, tvec__max_df=0.9......................
[CV 2/3; 1/4] END rf__n_estimators=100, tvec__max_df=0.9;, score=0.837 total time=  19.0s
[CV 3/3; 1/4] START rf__n_estimators=100, tvec__max_df=0.9......................
[CV 3/3; 1/4] END rf__n_estimators=100, tvec__max_df=0.9;, score=0.852 total time=  17.4s
[CV 1/3; 2/4] START rf__n_estimators=100, tvec__max_df=1.0......................
[CV 1/3; 2/4] END rf__n_estimators=100, tvec__max_df=1.0;, score=0.845 total time=  18.4s
[CV 2/3; 2/4] START rf__n_estimators=100, tvec__max_df=1.0......................
[CV 2/3; 2/4] END rf__n_estimators=100, tvec__max_df=1.0;, score=0.837 total time=  18.2s
[CV 3/3; 2/4] START rf__n_estimators=100, tvec__max_df=1.0......................
[CV 

In [21]:
# Gridsearch CV best score
gs_rf_t.best_score_

0.8453756202455188

In [22]:
# Gridsearch best parameters
gs_rf_t.best_params_

{'rf__n_estimators': 200, 'tvec__max_df': 0.9}

For this grid search with the TF-IDF vectorizer, the best parameters (max_df = 0.9, n_estimators = 200) both differ from the defaults (1.0 and 100).

In [23]:
# Train score
print(f'train score: {gs_rf_t.score(X_train, y_train)}')
# Test score
print(f'test score: {gs_rf_t.score(X_test, y_test)}')

train score: 0.999125
test score: 0.8335


The CV score (0.8454 vs. 0.8464) and the test score (0.8335 vs. 0.838) are slightly lower than those from the grid search with CountVectorizer. We will use CountVectorizer for further optimization to reduce overfitting.

## RF Countvectorizer Optimization

In [25]:
# CountVectorizer with max_df=0.9, the best value found by the grid search above.
# Note: unlike the pipeline above, stop_words='english' is not set here.
cvec = CountVectorizer(max_df=0.9)

X_train_c = cvec.fit_transform(X_train)
X_test_c = cvec.transform(X_test)

In [45]:
# Instantiate Random Forest Classifier
rf_c_opt = RandomForestClassifier(random_state=12)

In [1]:
# Set different regularization strengths (ccp_alpha) and tree depths in the new parameters
rf_c_opt_params = {
    'ccp_alpha': [0.0002, 0.0004, 0.0006, 0.0008, 0.001, 0.0012],
    'max_depth': [5, 10, 15]
}
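
The `ccp_alpha` values above control minimal cost-complexity pruning. As a simplified, single-node sketch of the idea (not scikit-learn's implementation, which recomputes these values repeatedly while it prunes each tree), the effective alpha of an internal node is the impurity saved by its subtree divided by the number of extra leaves it costs. A subtree is collapsed when that value does not exceed `ccp_alpha`. The function names and the impurity numbers below are hypothetical.

```python
def effective_alpha(node_impurity, subtree_impurity, n_leaves):
    """Impurity reduction per additional leaf for one internal node.

    node_impurity:    weighted impurity if the node were a single leaf
    subtree_impurity: summed weighted impurity of the subtree's leaves
    n_leaves:         number of leaves in the subtree (at least 2)
    """
    if n_leaves < 2:
        raise ValueError('an internal node has at least two leaves')
    return (node_impurity - subtree_impurity) / (n_leaves - 1)

def is_pruned(node_impurity, subtree_impurity, n_leaves, ccp_alpha):
    """True if the subtree is too weak to survive at this ccp_alpha."""
    return effective_alpha(node_impurity, subtree_impurity, n_leaves) <= ccp_alpha

# Hypothetical node: splitting into 5 leaves lowers impurity from 0.010 to 0.008
alpha_eff = effective_alpha(0.010, 0.008, 5)
print(f'effective alpha: {alpha_eff:.4f}')
for ccp_alpha in [0.0002, 0.0006, 0.0012]:
    print(ccp_alpha, is_pruned(0.010, 0.008, 5, ccp_alpha))
```

Larger `ccp_alpha` values therefore remove more subtrees and give smaller, more regularized trees, which is why the grid sweeps it together with `max_depth`.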

In [62]:
# Perform Gridsearch with 5 fold cross validation
gs_rf_c_opt = GridSearchCV(rf_c_opt, rf_c_opt_params, cv=5, verbose=1)

In [63]:
# Train the model with new parameters
gs_rf_c_opt.fit(X_train_c, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


In [64]:
# Gridsearch CV best score
gs_rf_c_opt.best_score_

0.8026250000000001

In [65]:
# Gridsearch CV best parameters
gs_rf_c_opt.best_params_

{'ccp_alpha': 0.0006, 'max_depth': 15}

In [66]:
# Train score
print(f'train score: {gs_rf_c_opt.score(X_train_c, y_train)}')
# Test score
print(f'test score: {gs_rf_c_opt.score(X_test_c, y_test)}')

train score: 0.815
test score: 0.7975


After optimization, the model overfits much less: the train score (0.815) and the test score (0.7975) are now close to each other, although the test score is lower than the 0.838 obtained before regularization.
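
To put the three runs side by side, the short sketch below computes the train-test gap from the scores printed in this notebook. The dictionary labels are descriptive names chosen here for illustration.

```python
# (train score, test score) as printed in the cells above
results = {
    'CountVectorizer + RF (grid search)': (0.999125, 0.838),
    'TfidfVectorizer + RF (grid search)': (0.999125, 0.8335),
    'CountVectorizer + RF (ccp_alpha, max_depth)': (0.815, 0.7975),
}

# Gap between train and test accuracy; a smaller gap means less overfitting
gaps = {name: train - test for name, (train, test) in results.items()}

for name, (train, test) in results.items():
    print(f'{name}: train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}')

smallest_gap = min(gaps, key=gaps.get)
print(f'smallest gap: {smallest_gap}')
```

The regularized model has by far the smallest gap, while the unregularized CountVectorizer model still has the highest test score, so the choice between them is a trade-off between generalization gap and raw accuracy.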

In [None]:
# This is the model with the best parameters
cvec_best = CountVectorizer(max_df=0.9)
rf_best = RandomForestClassifier(ccp_alpha=0.0006, max_depth=15, random_state=12)