# Random Forest

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
import warnings; warnings.simplefilter('ignore')
np.random.seed(42)  # seed NumPy's global generator; RandomState(42) alone only creates an unused object

In [2]:
X_train = pd.read_csv("../Data/X_train.csv",index_col=0)
y_train = pd.read_csv("../Data/y_train.csv",index_col=0)
with open('../Assets/custom_stop_words.pkl','rb') as f:
    custom_stop_words = pickle.load(f)

This cell instantiates a Pipeline that chains the two steps used in this notebook: a TfidfVectorizer, which turns the raw text into features, and the Random Forest Classifier, which is the model I train on those features.

In [3]:
pipe = Pipeline([
    ("tfidif", TfidfVectorizer()),
    ("RFC", RandomForestClassifier())
])
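
As a rough illustration of what fitting such a pipeline does, the sketch below runs the same two steps on a made-up corpus. The documents, labels, step names, and the `toy_pipe` variable are assumptions for illustration only and are not part of this project's data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels, only for illustration
toy_docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stocks fell on monday",
    "the market rallied on friday",
]
toy_labels = ["pets", "pets", "finance", "finance"]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rfc", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# fit() vectorizes the text first, then fits the forest on the TF-IDF matrix
toy_pipe.fit(toy_docs, toy_labels)

# Each step stays accessible by name after fitting
vocab_size = len(toy_pipe.named_steps["tfidf"].vocabulary_)
predictions = toy_pipe.predict(["the cat chased the dogs"])
print(vocab_size, predictions)
```

Because the vectorizer lives inside the pipeline, new text passed to `predict` is transformed with the vocabulary learned during `fit`, so no separate preprocessing call is needed.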

This dictionary lists the hyperparameters that I want my grid search to test. GridSearchCV fits a model for every combination, scores each one with cross-validation, and keeps the combination with the best mean accuracy.

In [4]:
params = {
    "tfidif__stop_words":[custom_stop_words],
    "tfidif__min_df":[5],
    'RFC__max_features':['auto','log2',5,10],
    "RFC__n_estimators":[10,50,100]
}
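
The grid above expands into every combination of the listed values. A small sketch using scikit-learn's `ParameterGrid` counts the candidates; the placeholder stop-word list is an assumption, since the real list is loaded from a pickle file.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder standing in for the pickled custom_stop_words list
placeholder_stop_words = ["the", "and"]

grid = ParameterGrid({
    "tfidif__stop_words": [placeholder_stop_words],
    "tfidif__min_df": [5],
    "RFC__max_features": ["auto", "log2", 5, 10],
    "RFC__n_estimators": [10, 50, 100],
})

# 1 * 1 * 4 * 3 candidate settings
n_candidates = len(grid)
print(n_candidates)
```

Each candidate is fit once per cross-validation fold, so the total number of fits is the candidate count multiplied by the number of folds.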

Max_features is the number of features (here, words) that each tree considers when looking for the best split at a node. N_estimators is the number of individual decision trees that our model will use to ultimately classify our data. Auto takes the square root of the total number of features, and log2 takes the base-2 logarithm of it, so max_features = log2(n_features). The min_df parameter says that a word must show up in at least 5 documents to be considered in the model. This is done to reduce the overall number of features. Most if not all of the words being eliminated should have little importance in determining each class, since each shows up in fewer than 5 posts.
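
To make the arithmetic concrete, here is a minimal sketch. The vocabulary size of 10,000 and the three-document corpus are hypothetical, and the toy example uses `min_df=2` rather than 5 so that the effect is visible on such a small corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vocabulary size after vectorizing
n_features = 10_000
sqrt_features = int(np.sqrt(n_features))  # features considered per split with the square-root rule
log2_features = int(np.log2(n_features))  # features considered per split with 'log2'
print(sqrt_features, log2_features)

# Toy corpus: "shared" appears in 3 documents, every other word in only 1
toy_docs = ["shared alpha", "shared beta", "shared gamma"]
kept_terms = sorted(TfidfVectorizer(min_df=2).fit(toy_docs).vocabulary_)
print(kept_terms)
```

With a large vocabulary, `log2` looks at far fewer words per split than the square-root rule, which makes the individual trees more varied.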

In [5]:
gs_RanFor = GridSearchCV(pipe,param_grid=params)

In [6]:
gs_RanFor.fit(X_train['Total_text'],y_train['subreddit'])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidif', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidif__stop_words': [['bill', 'however', 'move', 'whoever', 'becoming', 'detail', 'here', 'she', 'its', 'forty', 'sometimes', 'yours', 'describe', 'seem', 'thereby', 'may', 'and', 'these', 'along', 'still', 'yourselves', 'three', 'whether', 'none', 'themselves', 'first', 'two', 'before...dif__min_df': [5], 'RFC__max_features': ['auto', 'log2', 5, 10], 'RFC__n_estimators': [10, 50, 100]},
       pre_dispatch='2*n_jobs', refit=True, scoring=N

In [7]:
print(gs_RanFor.best_params_['RFC__max_features'])
print(gs_RanFor.best_params_['RFC__n_estimators'])

log2
100


For our random forest, the model performed best with max_features set to log2 and the number of estimators equal to 100.
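
Beyond the winning parameters, a fitted grid search also exposes the cross-validated score of every candidate. The sketch below shows one way to inspect them on a made-up corpus. The data, the step names, and the `toy_search` variable are assumptions, and `'sqrt'` is used in place of `'auto'` so that the example also runs on recent scikit-learn releases.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the subreddit posts
pet_docs = ["cats and dogs play", "my dog likes the cat", "a cat chased a dog"]
finance_docs = ["stocks and bonds fell", "the bond market rallied", "stock prices rose"]
toy_docs = pet_docs * 4 + finance_docs * 4
toy_labels = ["pets"] * 12 + ["finance"] * 12

toy_search = GridSearchCV(
    Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("rfc", RandomForestClassifier(random_state=42)),
    ]),
    param_grid={"rfc__max_features": ["sqrt", "log2"], "rfc__n_estimators": [10, 50]},
    cv=3,
)
toy_search.fit(toy_docs, toy_labels)

# One row per candidate, sorted so the best-ranked candidate comes first
results = pd.DataFrame(toy_search.cv_results_)
columns = ["param_rfc__max_features", "param_rfc__n_estimators",
           "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
print(toy_search.best_params_, toy_search.best_score_)
```

Looking at the full table rather than only `best_params_` shows how close the runner-up settings were, which helps judge whether the chosen values matter much.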

###### Saving our Random Forest Classifier to be evaluated later

In [8]:
with open('../Assets/Ranfor.pkl','wb+') as f:
    pickle.dump(gs_RanFor,f)
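
For the later evaluation step, the saved object can be read back with `pickle.load`. The following is a minimal round-trip sketch that uses a small stand-in model and a temporary file, both assumptions, rather than the real `gs_RanFor` object and `../Assets/Ranfor.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier

# Hypothetical small model standing in for gs_RanFor
toy_model = RandomForestClassifier(n_estimators=5, random_state=42)
toy_model.fit([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 0, 1])

# Temporary path used here instead of ../Assets/Ranfor.pkl
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored object should predict exactly as the original does
samples = [[1, 1], [0, 0]]
same_predictions = bool((restored.predict(samples) == toy_model.predict(samples)).all())
print(same_predictions)
```

A pickled scikit-learn object is generally only safe to load with the same scikit-learn version that wrote it, so the evaluation notebook should run in the same environment.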