# Random Forest Parameter Tuning
In this notebook, I tune the hyperparameters for a random forest and save the best model.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

%matplotlib inline

Read in the dataset of extracted features

In [2]:
df = pd.read_csv('ExtractedFinalDataset.csv')

Remove the bad features

In [3]:
bad_features = []
for i in range(8):
    langevin = str(i) + "__max_langevin_fixed_point__m_3__r_30"
    bad_features.append(langevin)
    for j in range(9):
        quantile = (j+1)*0.1
        if quantile != 0.5:
            feature_name = str(i) + "__index_mass_quantile__q_" + str(quantile)
            bad_features.append(feature_name)
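The loop above builds each quantile string from floating-point arithmetic. Under Python 3, `str(3 * 0.1)` yields `'0.30000000000000004'`, so names generated that way may not match columns such as `0__index_mass_quantile__q_0.3`. A minimal sketch that builds the names from integer tenths instead; the helper name `bad_feature_names` is hypothetical, and it assumes the column names use exactly one decimal place:

```python
def bad_feature_names(n_series=8):
    """Return the feature names to drop, built without float arithmetic."""
    names = []
    for i in range(n_series):
        names.append("%d__max_langevin_fixed_point__m_3__r_30" % i)
        for tenth in range(1, 10):
            if tenth != 5:  # the q_0.5 quantile is kept, as in the loop above
                names.append("%d__index_mass_quantile__q_0.%d" % (i, tenth))
    return names


bad_features = bad_feature_names()
print(len(bad_features))
```

Formatting the digit directly keeps the names identical across Python versions.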

Preprocess the data, add labels

In [4]:
df = df.drop(bad_features, axis=1)

In [5]:
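# Use column '9' as the index, then label each row by its index range
df.index = df['9']
df = df.drop(['9'], axis=1)
df['Label'] = "One"
# .loc avoids chained assignment, which can silently write to a copy
df.loc[df.index >= 2001.0, 'Label'] = "Two"
df.loc[df.index >= 4001.0, 'Label'] = "Three"
df.loc[df.index >= 6001.0, 'Label'] = "Four"
df.loc[df.index >= 8001.0, 'Label'] = "Five"
df.loc[df.index >= 10001.0, 'Label'] = "Six"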
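The label assignment above amounts to looking up each index value against a list of ascending thresholds. A minimal sketch of the same mapping using the standard library's `bisect` module; the names `THRESHOLDS`, `LABELS`, and `label_for` are hypothetical, and the threshold values are copied from the cell above:

```python
from bisect import bisect_right

THRESHOLDS = [2001.0, 4001.0, 6001.0, 8001.0, 10001.0]
LABELS = ["One", "Two", "Three", "Four", "Five", "Six"]


def label_for(index_value):
    """Return the label of the range that index_value falls into."""
    # bisect_right counts how many thresholds are <= index_value
    return LABELS[bisect_right(THRESHOLDS, index_value)]


print(label_for(1.0), label_for(2001.0), label_for(12000.0))
```

With the DataFrame from the cell above, this could be applied as `df['Label'] = [label_for(i) for i in df.index]`.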

In [6]:
df = df[1:]

In [7]:
df.columns = df.columns.map(lambda t: str(t))
df = df.sort_index(axis=1)

In [8]:
extracted_features = df

Take a 5% subsample of the dataset for initial parameter tuning, and create 70/30 train/test splits for both the subsample and the full dataset.

In [9]:
subsample = extracted_features.sample(frac=0.05).reset_index(drop=True)
subsample.shape

(599, 1705)

In [10]:
X = subsample.drop(['Label'], axis=1)
y = subsample['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [9]:
fullset = extracted_features.sample(frac=1).reset_index(drop=True)
X = fullset.drop(['Label'], axis=1)
y = fullset['Label']
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X, y, test_size=0.3, random_state=42)
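The `sample` calls above are unseeded, so the shuffled rows change on every run, and a purely random split can leave the six classes in slightly different proportions in the train and test sets. A hedged sketch of a reproducible, stratified split on synthetic stand-in data (the arrays are made up for illustration; `stratify` and `random_state` are existing `train_test_split` parameters):

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 600 rows, 100 per class
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(600, 5))
y_demo = np.repeat(["One", "Two", "Three", "Four", "Five", "Six"], 100)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

test_counts = Counter(y_te)
print(test_counts)
```

For the notebook's data, the equivalent change would be passing `stratify=y` to the existing calls and a `random_state` to `sample`.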

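Random forest grid search. The full dataset can be used since random forests train quickly. Note that n_jobs only sets the number of parallel workers, so it affects training time rather than accuracy; any differences between n_jobs values in the results reflect run-to-run randomness.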

In [40]:
param_grid = {'n_jobs': [7, 10, 12],
             'n_estimators': [10, 100, 50],
             'min_samples_split': [0.2, 0.5, 0.7, 2]}
model = GridSearchCV(RandomForestClassifier(), param_grid)
model.fit(X_train_full, y_train_full)

print('Training accuracy: {}'.format(model.score(X_train_full, y_train_full)))
print('Test accuracy: {}'.format(model.score(X_test_full, y_test_full)))
print(model.best_params_)

Training accuracy: 1.0
Test accuracy: 0.823692992214
{'min_samples_split': 2, 'n_estimators': 100, 'n_jobs': 10}
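Because `n_jobs` does not change the fitted model, a variant of the search above could set it once on the estimator and keep only model-shaping parameters in the grid. The sketch below runs on synthetic data from `make_classification` as a stand-in for the notebook's features; the grid values and `cv=3` are illustrative assumptions, not the settings used above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook's features are not reproduced here
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0)

# Only parameters that change the fitted model go in the grid
param_grid = {'n_estimators': [10, 50],
              'min_samples_split': [2, 0.2]}

# n_jobs is set once on the estimator; random_state makes runs repeatable
forest = RandomForestClassifier(n_jobs=2, random_state=0)
search = GridSearchCV(forest, param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best setting
```

`search.cv_results_` holds the mean score for every grid point, which makes it easier to judge whether the winning setting is meaningfully better than its neighbours.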


The best model used the largest n_estimators value in the grid, so more estimators might perform better. Try another grid search.

In [42]:
param_grid = {'n_jobs': [7, 10, 12],
             'n_estimators': [100, 200, 300],
             'min_samples_split': [0.2, 0.5, 0.7, 2]}
model = GridSearchCV(RandomForestClassifier(), param_grid)
model.fit(X_train_full, y_train_full)

print('Training accuracy: {}'.format(model.score(X_train_full, y_train_full)))
print('Test accuracy: {}'.format(model.score(X_test_full, y_test_full)))
print(model.best_params_)

Training accuracy: 1.0
Test accuracy: 0.823414905451
{'min_samples_split': 2, 'n_estimators': 300, 'n_jobs': 12}


Test accuracy is essentially unchanged, so it seems we're doing about as well as we can with these parameters. Try one more search.

In [44]:
param_grid = {'n_jobs': [10, 12, 15, 20],
             'n_estimators': [100, 300, 500, 1000]}
model = GridSearchCV(RandomForestClassifier(), param_grid)
model.fit(X_train_full, y_train_full)

print('Training accuracy: {}'.format(model.score(X_train_full, y_train_full)))
print('Test accuracy: {}'.format(model.score(X_test_full, y_test_full)))
print(model.best_params_)

Training accuracy: 1.0
Test accuracy: 0.835650723026
{'n_estimators': 1000, 'n_jobs': 10}


More estimators appear to help (test accuracy rose from about 0.823 to 0.836), which makes intuitive sense, but the model takes longer to train.

In [46]:
param_grid = {'n_jobs': [10, 12],
             'n_estimators': [2000, 4000]}
model = GridSearchCV(RandomForestClassifier(), param_grid)
model.fit(X_train_full, y_train_full)

print('Training accuracy: {}'.format(model.score(X_train_full, y_train_full)))
print('Test accuracy: {}'.format(model.score(X_test_full, y_test_full)))
print(model.best_params_)

Training accuracy: 1.0
Test accuracy: 0.836484983315
{'n_estimators': 4000, 'n_jobs': 10}
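Rerunning a full grid search for each larger forest is expensive. One cheaper way to see where accuracy levels off is to grow a single forest incrementally and track its out-of-bag score. The sketch below uses scikit-learn's `warm_start` and `oob_score` options on synthetic stand-in data; the tree counts are illustrative, not the values tried above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the notebook's features are not reproduced here
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=0)

# warm_start=True keeps the trees already built and only adds new ones on refit
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                bootstrap=True, random_state=0)

oob_scores = {}
for n_trees in [50, 100, 200]:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_demo, y_demo)
    oob_scores[n_trees] = forest.oob_score_  # accuracy on out-of-bag samples

for n_trees in sorted(oob_scores):
    print(n_trees, oob_scores[n_trees])
```

The out-of-bag score is only an estimate of held-out accuracy, so a final check against the test split is still worthwhile.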


Okay, it's apparent that we've plateaued with the number of estimators: going from 1000 to 4000 estimators only raised test accuracy from 0.8357 to 0.8365. Let's train a model on the whole dataset and save it. I'm only going to use 1000 estimators, since training will be much faster for barely any loss in accuracy.

In [12]:
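# Newer scikit-learn releases no longer ship sklearn.externals.joblib;
# on old releases use: from sklearn.externals import joblib
import joblib
model = RandomForestClassifier(n_jobs=10, n_estimators=1000)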
model.fit(X, y)

joblib.dump(model, 'bestrandomforest.pkl')

['bestrandomforest.pkl']
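A saved model can be read back with `joblib.load`. The sketch below shows a dump and load round trip on a small synthetic forest rather than the 1000-tree model above; the file name `demo_forest.pkl` and the data are placeholders. Pickled models should generally be loaded with the same scikit-learn version that wrote them.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the saved forest
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'demo_forest.pkl')
joblib.dump(forest, path)

# The restored model should make the same predictions as the original
restored = joblib.load(path)
same = np.array_equal(forest.predict(X_demo), restored.predict(X_demo))
print(same)
```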