<a href="https://colab.research.google.com/github/ojasnadkar96/cs273p_project/blob/master/randomForest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random Forest (168 Features)

Importing all the required libraries.<br>

In [0]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

The two functions below are to save and import pickle files.<br>

In [0]:
def save_pkl(df,name):
    fullname = name+'.pkl'
    output = open(fullname, 'wb')
    pickle.dump(df, output)
    output.close()

In [0]:
def import_pkl(df,name):
    fullname = name+'.pkl'
    df = pickle.load(open(fullname, 'rb'))
    return df

In [0]:
df_train = pd.DataFrame()
df_valid = pd.DataFrame()
df_test = pd.DataFrame()
df_train_l = pd.DataFrame()
df_valid_l = pd.DataFrame()
df_test_l = pd.DataFrame()

In [0]:
df_train = import_pkl(df_train,'train_x')
df_valid = import_pkl(df_valid,'valid_x')
df_test = import_pkl(df_test,'test_x')
df_train_l = import_pkl(df_train_l,'train_x_l')
df_valid_l = import_pkl(df_valid_l,'valid_x_l')
df_test_l = import_pkl(df_test_l,'test_x_l')

In [0]:
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)
print(df_train_l.shape)
print(df_valid_l.shape)
print(df_test_l.shape)

(77854, 168)
(13737, 168)
(10175, 168)
(77854, 1)
(13737, 1)
(10175, 1)


Pre-processed data with a total of 168 features has been imported into dataframes.

In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
from sklearn.metrics import accuracy_score

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(df_train,np.ravel(df_train_l))
score = rf1.score(df_train,np.ravel(df_train_l))
print(score*100)

98.02322295578904


The accuracy for training data for Random Forest is around 98%<br>

In [0]:
score = rf1.score(df_valid,np.ravel(df_valid_l))
print(score*100)

54.553395937977726


The accuracy for validation data for Random Forest is around 54.5%<br>
This looks like overfitting, we have to look at the hyperparameters now.<br>

In [0]:
from pprint import pprint
pprint(rf.get_params())

{'bootstrap': True,
 'class_weight': 'balanced',
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


These are a parameters being used.<br>

In [0]:
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [10,20,30,40,50,64,128]
max_features = ['auto', 'sqrt']
max_depth = [10,20,30,40,50]
max_depth.append(None)
min_samples_split = [2, 6, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 6, 10],
 'n_estimators': [10, 20, 30, 40, 50, 64, 128]}


Now, we take a set of values for each parameter and pass them through RandomizedCV for 100 iterations.<br>
This should give us a good range of better parameters.<br>

In [0]:
from sklearn.model_selection import PredefinedSplit
train_len = len(df_train)
valid_len = len(df_valid)
df_tv = pd.concat([df_train, df_valid], ignore_index = True)
df_tv_l = pd.concat([df_train_l, df_valid_l], ignore_index = True)
bound = np.array([(i < train_len) * -1 for i in range(train_len + valid_len)])
split = PredefinedSplit(bound)

In the RandomizedCV model, we define a custom split for validation.<br>
Our validation data is being used as the cross-validation split for RandomizedCV.<br>
Below, we have obtained a set of hyperparameters.<br>

In [0]:

rf2 = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = rf2, param_distributions = random_grid, n_iter = 100, n_jobs = 2, verbose = 1, cv = split)
rf_random.fit(df_tv,np.ravel(df_tv_l))
rf_random.best_params_

Fitting 1 folds for each of 100 candidates, totalling 100 fits


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  8.9min
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed: 18.5min finished


{'n_estimators': 128,
 'min_samples_split': 6,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 40,
 'bootstrap': False}

Using the hyperparameters obtained above, we will train our model.<br>

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf2 = RandomForestClassifier(class_weight='balanced',n_estimators=128,
 min_samples_split=6,
 min_samples_leaf=2,
 max_features='auto',
 max_depth=40,
 bootstrap=False)
rf2.fit(df_train,np.ravel(df_train_l))
score = rf2.score(df_train,np.ravel(df_train_l))
print(score*100)

96.9121689315899


In [0]:
score = rf2.score(df_valid,np.ravel(df_valid_l))
print(score*100)

56.1476304870059


The train accuracy is 96.9% and the validation accuracy is 56.1%<br>
There is less overfitting than before, but the validation accuracy is less than before.<br>
Now we will use GridSearchCV on a smaller range of values in hopes of finding good hyperparameter values.<br>

In [0]:
from sklearn.model_selection import GridSearchCV

n_estimators = [128]
max_features = ['auto']
max_depth = [40,50]
max_depth.append(None)
min_samples_split = [6, 10]
min_samples_leaf = [2, 4]
bootstrap = [False]

search_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(search_grid)

{'bootstrap': [False],
 'max_depth': [40, 50, None],
 'max_features': ['auto'],
 'min_samples_leaf': [2, 4],
 'min_samples_split': [6, 10],
 'n_estimators': [128]}


In [0]:
rf3 = RandomForestClassifier()
rf_grid = GridSearchCV(estimator = rf3, param_grid = search_grid, n_jobs = 2, verbose = 1, cv = split)
rf_grid.fit(df_tv, np.ravel(df_tv_l))
rf_grid.best_params_

Fitting 1 folds for each of 12 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:  5.9min finished


{'bootstrap': False,
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 6,
 'n_estimators': 128}

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf3 = RandomForestClassifier(class_weight='balanced',n_estimators=128,
 min_samples_split=6,
 min_samples_leaf=4,
 max_features='auto',
 max_depth=None,
 bootstrap=False)
rf3.fit(df_train,np.ravel(df_train_l))
score = rf3.score(df_train,np.ravel(df_train_l))
print(score*100)

86.27816168726076


In [0]:
score = rf.score(df_valid,np.ravel(df_valid_l))
print(score*100)

55.25223848001747


Again, we used GridSearchCV to get better hyperparameters.<br>
Now training accuracy is down to 86.2% and validation accuracy is increased to 55%%<br>
The model is now overfitting less than before.<br>
This seems to be a very good model.<br>
Therefore, we will finalize these set of hyperparameters and find the test accuracy.<br>

In [0]:
score = rf.score(df_test,np.ravel(df_test_l))
print(score*100)

55.10565110565111


The test accuracy on this model is 55.1%.<br>
So, by using hyperparameter tuning method we got the best rise in validation acuracy.<br>
It increase by around 1%<br>