$$\Large \color{green}{\textbf{Optimizing The Random Forest Algorithm}}$$



$$\large \color{blue}{\textbf{Phuong Van Nguyen}}$$
$$\small \color{red}{\textbf{ phuong.nguyen@summer.barcelonagse.eu}}$$

$\color{green}{\underline{\textbf{I. Introduction}}}$

The main purpose of this project is to show how to optimize the hyperparameters of the Random Forest algorithm. Note that the project presents the optimization pipeline rather than a finished analysis. In practice, one should run the final version of such a project on a server, such as Amazon Web Services, because hyperparameter search typically needs a machine with substantial computing power. For a simple demonstration like this one, however, renting a server is not necessary.

$\color{green}{\underline{\textbf{II. Data}}}$

To replicate the code below, download the data from my GitHub repository:

https://github.com/phuongvnguyen/Optimizing-Random-Forest-Algorithm/blob/master/data.csv


# Loading Necessary Libraries

In [255]:
from timeit import default_timer as timer
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz
from subprocess import call
import pydot
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

from pickle import dump
from pickle import load
Purple= '\033[95m'
Cyan= '\033[96m'
Darkcyan= '\033[36m'
Blue = '\033[94m'
Green = '\033[92m'
Yellow = '\033[93m'
Red = '\033[91m'
Bold = "\033[1m"
Reset = "\033[0;0m"
Underline= '\033[4m'
End = '\033[0m'
from pprint import pprint

# Loading Data

In [189]:
data=pd.read_csv('data.csv')

# Exploratory Data Analysis (EDA)
## Choosing sample

In [190]:
data.head(5)

Unnamed: 0,year,month,day,weekday,ws_1,prcp_1,snwd_1,temp_2,temp_1,average,actual,friend
0,2011,1,1,Sat,4.92,0.0,0,36,37,45.6,40,40
1,2011,1,2,Sun,5.37,0.0,0,37,40,45.7,39,50
2,2011,1,3,Mon,6.26,0.0,0,40,39,45.8,42,42
3,2011,1,4,Tues,5.59,0.0,0,39,42,45.9,38,59
4,2011,1,5,Wed,3.8,0.03,0,42,38,46.0,45,39


## Checking the dimensionality

In [191]:
print('The number of obs: %d. The number of cols: %d'%(data.shape))

The number of obs: 2191. The number of cols: 12


## Checking for missing values

In [192]:
print(Bold+'Missing data:'+End)
print(data.isnull().sum())

Missing data:
year       0
month      0
day        0
weekday    0
ws_1       0
prcp_1     0
snwd_1     0
temp_2     0
temp_1     0
average    0
actual     0
friend     0
dtype: int64


## Checking the datatype

In [193]:
print(Bold+'The datatype:'+End)
print(data.dtypes)

The datatype:
year         int64
month        int64
day          int64
weekday     object
ws_1       float64
prcp_1     float64
snwd_1       int64
temp_2       int64
temp_1       int64
average    float64
actual       int64
friend       int64
dtype: object


# Data Preparation
## Transforming data

In [194]:
dum_data = pd.get_dummies(data)
dum_data.head(5)

Unnamed: 0,year,month,day,ws_1,prcp_1,snwd_1,temp_2,temp_1,average,actual,friend,weekday_Fri,weekday_Mon,weekday_Sat,weekday_Sun,weekday_Thurs,weekday_Tues,weekday_Wed
0,2011,1,1,4.92,0.0,0,36,37,45.6,40,40,0,0,1,0,0,0,0
1,2011,1,2,5.37,0.0,0,37,40,45.7,39,50,0,0,0,1,0,0,0
2,2011,1,3,6.26,0.0,0,40,39,45.8,42,42,0,1,0,0,0,0,0
3,2011,1,4,5.59,0.0,0,39,42,45.9,38,59,0,0,0,0,0,1,0
4,2011,1,5,3.8,0.03,0,42,38,46.0,45,39,0,0,0,0,0,0,1
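
`pd.get_dummies` leaves numeric columns untouched and replaces each text column with one indicator column per category, which is how `weekday` becomes the seven `weekday_*` columns above. A minimal sketch on a made-up three-row frame (the name `toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy frame; only the text column gets expanded.
toy = pd.DataFrame({"temp_1": [37, 40, 39],
                    "weekday": ["Sat", "Sun", "Mon"]})

encoded = pd.get_dummies(toy)

# The numeric column is kept; 'weekday' becomes one indicator column per category.
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns hold 0/1 integers or booleans; both work as inputs to scikit-learn.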


## Extract Output and Input Data
### Output

In [195]:
Y = dum_data['actual']
Y.head(5)

0    40
1    39
2    42
3    38
4    45
Name: actual, dtype: int64

In [196]:
dum_data = dum_data.drop('actual', axis = 1)
dum_data.head(5)

Unnamed: 0,year,month,day,ws_1,prcp_1,snwd_1,temp_2,temp_1,average,friend,weekday_Fri,weekday_Mon,weekday_Sat,weekday_Sun,weekday_Thurs,weekday_Tues,weekday_Wed
0,2011,1,1,4.92,0.0,0,36,37,45.6,40,0,0,1,0,0,0,0
1,2011,1,2,5.37,0.0,0,37,40,45.7,50,0,0,0,1,0,0,0
2,2011,1,3,6.26,0.0,0,40,39,45.8,42,0,1,0,0,0,0,0
3,2011,1,4,5.59,0.0,0,39,42,45.9,59,0,0,0,0,0,1,0
4,2011,1,5,3.8,0.03,0,42,38,46.0,39,0,0,0,0,0,0,1


### Input
This step is optional; one could instead keep every column of `dum_data` as input. Here only six features are kept.

In [197]:
# Names of six features accounting for 95% of total importance
important_feature_names = ['temp_1', 'average', 'ws_1', 'temp_2', 'friend', 'year']

# Update feature list for visualizations
feature_list = important_feature_names[:]

X = dum_data[important_feature_names]
X.head(5)

Unnamed: 0,temp_1,average,ws_1,temp_2,friend,year
0,37,45.6,4.92,36,40,2011
1,40,45.7,5.37,37,50,2011
2,39,45.8,6.26,40,42,2011
3,42,45.9,5.59,39,59,2011
4,38,46.0,3.8,42,39,2011
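
The notebook does not show how the six names were obtained. One common way to arrive at such a list is to fit a forest on all features, sort the `feature_importances_`, and keep the smallest set whose cumulative importance reaches the threshold. The sketch below does this on synthetic data; the data set, the feature names `f0`..`f9`, and the 95% threshold are assumptions, not the author's actual procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 10 features, 4 of them informative.
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 n_informative=4, random_state=0)
names = ["f%d" % i for i in range(X_demo.shape[1])]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Sort importances from largest to smallest and accumulate them.
order = np.argsort(forest.feature_importances_)[::-1]
cumulative = np.cumsum(forest.feature_importances_[order])

# Keep the smallest prefix whose cumulative importance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
selected = [names[i] for i in order[:n_keep]]
print(selected, round(float(cumulative[n_keep - 1]), 3))
```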


## Convert to numpy arrays

In [198]:
X = np.array(X)
Y = np.array(Y)

## Training and Testing Sets

In [199]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size = 0.25,
                                                    random_state = 42)

In [200]:
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', Y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', Y_test.shape)

Training Features Shape: (1643, 6)
Training Labels Shape: (1643,)
Testing Features Shape: (548, 6)
Testing Labels Shape: (548,)


In [201]:
print('{:0.1f} years of data in the training set'.format(X_train.shape[0] / 365.))
print('{:0.1f} years of data in the test set'.format( X_test.shape[0] / 365.))

4.5 years of data in the training set
1.5 years of data in the test set


# Training Machine Learning
## Examine the Default Random Forest


In [202]:
myrf = RandomForestRegressor(random_state = 42)
print(Bold+'The configuration of the default algorithm:'+End)
pprint(myrf.get_params())

The configuration of the default algorithm:
{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


## Optimizing algorithm
### Random Search with Cross Validation
#### Creating the list of options

In [203]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(Bold+'List of options:'+End)
pprint(random_grid)

List of options:
{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
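
The option lists above span far more combinations than a random search will try. As a rough sketch, the block below counts the full grid and uses scikit-learn's `ParameterSampler` to draw ten candidates in the way a randomized search with `n_iter=10` would; the variable names `grid`, `total`, and `candidates` are introduced here for illustration.

```python
from math import prod
from sklearn.model_selection import ParameterSampler

# Same option lists as random_grid above, repeated so the block runs on its own.
grid = {'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
        'max_features': ['auto', 'sqrt'],
        'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False]}

# Size of the full grid: the product of the list lengths.
total = prod(len(values) for values in grid.values())

# Draw 10 distinct candidates at random from the grid.
candidates = list(ParameterSampler(grid, n_iter=10, random_state=42))

print(total, len(candidates))  # 4320 10
```

Ten candidates cover well under 1% of the 4320 possible combinations, which is why the random search is usually followed by a narrower grid search.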


#### Configuring model

In [204]:
ran_forest = RandomForestRegressor(random_state = 42)

#### Setting the K-Fold Cross Validation with the Random Searching

In [205]:
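# Using 3-fold cross validation. Without shuffling, random_state has no effect
# (recent scikit-learn versions raise an error for this combination, so drop it there).
kfold = KFold(n_splits=3, random_state=1)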
# search across 10 different combinations, and use all available cores
rf_ranSearch = RandomizedSearchCV(estimator=ran_forest, param_distributions=random_grid,
                              n_iter = 10, scoring='neg_mean_absolute_error', 
                              cv = kfold, verbose=2, random_state=42, n_jobs=-1,
                              return_train_score=True)
print(Bold+'The configuration of the Random Cross Validation:'+End)
print(rf_ranSearch)

The configuration of the Random Cross Validation:
RandomizedSearchCV(cv=KFold(n_splits=3, random_state=1, shuffle=False),
                   error_score='raise-deprecating',
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=...
                   param_d
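
Scores reported by this search are negative mean absolute errors: scikit-learn's search tools maximize the score, so error metrics are exposed with the sign flipped. A minimal sketch on made-up data (the arrays `X_demo`, `y_demo` and the linear model are assumptions chosen only to keep the example small) showing that the `neg_mean_absolute_error` scorer returns minus the MAE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# Tiny made-up data set, only to compare the two numbers.
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 2)
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

mae = mean_absolute_error(y_demo, model.predict(X_demo))
score = get_scorer('neg_mean_absolute_error')(model, X_demo, y_demo)

# The scorer returns the MAE with its sign flipped, so that larger is better.
print(mae, score)
```

A score closer to zero therefore means a smaller average error.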

#### Searching for the optimal model

In [216]:
start = timer()
# Fit the random search model
rf_random_res=rf_ranSearch.fit(X_train, Y_train);
print(Bold+"Time of searching %.2fs" % (timer() - start)+End)
print(Bold+"The best score (negative MAE): %f. With the configuration: %s" % (rf_random_res.best_score_,
                                                            rf_random_res.best_params_)+End)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.3min finished


Time of searching 81.76s
The best score (negative MAE): -3.845635. With the configuration: {'n_estimators': 1200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'auto', 'max_depth': 100, 'bootstrap': True}


##### Displaying result as Table

In [214]:
display(pd.DataFrame(rf_random_res.cv_results_)\
        .sort_values(by='rank_test_score'))

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_min_samples_leaf,param_max_features,param_max_depth,param_bootstrap,...,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
8,8.129508,0.216112,0.261476,0.013659,1200,2,4,auto,100,True,...,-3.759856,-3.919509,-3.845635,0.065722,1,-2.47434,-2.504242,-2.416524,-2.465035,0.03641
7,0.95504,0.012539,0.043425,0.001233,200,5,2,sqrt,10,True,...,-3.815214,-3.919729,-3.874046,0.043676,2,-2.40891,-2.436069,-2.360194,-2.401724,0.03139
0,0.788137,0.002074,0.038291,0.000889,200,10,2,sqrt,50,True,...,-3.81434,-3.918472,-3.874939,0.044202,3,-2.663432,-2.691844,-2.608703,-2.65466,0.034504
3,6.524325,0.151373,0.31284,0.020273,1400,5,1,sqrt,30,True,...,-3.821712,-3.920944,-3.876557,0.041186,4,-1.959789,-1.973688,-1.912303,-1.948593,0.026281
9,13.066182,0.611692,0.363456,0.033084,2000,5,2,auto,50,True,...,-3.805482,-3.947498,-3.879227,0.058104,5,-1.897369,-1.903537,-1.843069,-1.881325,0.027168
1,2.214839,0.023425,0.111359,0.003828,600,10,4,sqrt,90,False,...,-3.836444,-3.94042,-3.888569,0.042442,6,-2.293233,-2.291427,-2.227358,-2.270673,0.030637
5,1.803763,0.019429,0.087036,0.000887,400,10,1,sqrt,60,False,...,-3.870566,-3.958148,-3.913056,0.035794,7,-1.773999,-1.76368,-1.710786,-1.749488,0.027689
4,7.909009,0.137809,0.212746,0.010147,1000,10,1,auto,80,False,...,-4.680684,-4.739724,-4.64766,0.091598,8,-2.016371,-2.007436,-1.962682,-1.995496,0.023488
6,18.991412,0.192834,0.743197,0.046026,2000,2,2,auto,50,False,...,-5.161502,-4.934425,-4.986701,0.12695,9,-1.233286,-1.232015,-1.247548,-1.237616,0.007042
2,4.638523,0.072354,0.129124,0.008864,600,2,2,auto,60,False,...,-5.159866,-4.935824,-4.988282,0.124405,10,-1.233345,-1.231769,-1.247719,-1.237611,0.007177


In [217]:
means = rf_random_res.cv_results_['mean_test_score']
stds = rf_random_res.cv_results_['std_test_score']
params = rf_random_res.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

-3.874939 (0.044202) with: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 50, 'bootstrap': True}
-3.888569 (0.042442) with: {'n_estimators': 600, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': False}
-4.988282 (0.124405) with: {'n_estimators': 600, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 60, 'bootstrap': False}
-3.876557 (0.041186) with: {'n_estimators': 1400, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': True}
-4.647660 (0.091598) with: {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 80, 'bootstrap': False}
-3.913056 (0.035794) with: {'n_estimators': 400, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 60, 'bootstrap': False}
-4.986701 (0.126950) with: {'n_estimators': 2000, 'min
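
Because `return_train_score=True` was set, the train and test columns of the results table can be compared to spot overfitting. The sketch below copies three rows from the table above into a small frame (the column `gap` is a name introduced here) and computes how much worse each candidate does on the validation folds than on its own training folds:

```python
import pandas as pd

# Three rows copied from the results table above (index = candidate number).
scores = pd.DataFrame(
    {"mean_test_score": [-3.845635, -3.874046, -4.986701],
     "mean_train_score": [-2.465035, -2.401724, -1.237616],
     "param_bootstrap": [True, True, False]},
    index=[8, 7, 6])

# Scores are negative MAE, so train minus test is the extra error on unseen folds.
scores["gap"] = scores["mean_train_score"] - scores["mean_test_score"]
print(scores.sort_values("gap", ascending=False))
```

Candidate 6 has the lowest training error but the largest gap, the usual signature of a model that memorizes its training folds.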

#### Evaluation Function

In [218]:
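def evaluate(model, X_test, Y_test):
    # Caution: this line refits the model on the test data, so the numbers
    # printed below measure in-sample fit, not held-out performance.
    # Drop it (and pass an already-fitted model) to measure generalization.
    trainedModel=model.fit(X_test, Y_test)
    predictions = trainedModel.predict(X_test)
    # Absolute error of each prediction
    errors = abs(predictions - Y_test)
    # Mean absolute percentage error (assumes no zero targets)
    mape = 100 * np.mean(errors / Y_test)
    # "Accuracy" is defined here as 100% minus the MAPE
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))

    return accuracy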
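
Note that `evaluate` calls `model.fit(X_test, Y_test)` before predicting, so it reports how well a model fits the test set after training on it. A variant that scores an already-fitted model on held-out data might look like the sketch below; the function name `evaluate_holdout` and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def evaluate_holdout(fitted_model, X_test, Y_test):
    """Score an already-fitted model on held-out data (no refitting)."""
    predictions = fitted_model.predict(X_test)
    errors = np.abs(predictions - Y_test)
    mape = 100 * np.mean(errors / Y_test)   # assumes no zero targets
    return np.mean(errors), 100 - mape

# Made-up temperature-like data, only to exercise the function.
rng = np.random.RandomState(0)
X_demo = rng.uniform(30, 90, size=(400, 3))
y_demo = X_demo.mean(axis=1) + rng.normal(scale=2.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

mae, accuracy = evaluate_holdout(model, X_te, y_te)
print('Average Error: {:0.4f}. Accuracy = {:0.2f}%.'.format(mae, accuracy))
```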

##### Evaluate the Default Model

In [220]:
base_model = RandomForestRegressor(n_estimators=100)
base_model.fit(X_train, Y_train)
base_accuracy = evaluate(base_model, X_test, Y_test)

Model Performance
Average Error: 1.4011 degrees.
Accuracy = 97.65%.


##### Evaluate the Best Random Search Model

In [221]:
rf_random_res.best_params_

{'n_estimators': 1200,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 100,
 'bootstrap': True}

In [222]:
best_random= RandomForestRegressor(n_estimators=1200,
                                   min_samples_split=2,
                                   min_samples_leaf=4,
                                   max_features='auto',
                                   max_depth=100,
                                   bootstrap=True)
trained_best_random=best_random.fit(X_train, Y_train)

random_best_accuracy = evaluate(trained_best_random,  X_test, Y_test)

Model Performance
Average Error: 2.4172 degrees.
Accuracy = 95.92%.


##### Measure the improvement

In [223]:
print('Improvement of {:0.2f}% in comparison with the baseline model'.\
      format( 100 * (random_best_accuracy - base_accuracy) / base_accuracy))

Improvement of -1.76% in comparison with the baseline model


$$\textbf{Comments:}$$

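One should not worry about the result above. The tuned model appears worse than the baseline, but `evaluate` refits each model on the test set before predicting, so these figures measure in-sample fit rather than generalization, and the less constrained default model (`min_samples_leaf=1`) fits the data more tightly. This section is only a simple demonstration of how to optimize the Random Forest algorithm.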

### Grid Search 

We can now perform a grid search that builds on the result of the random search.
We will test a small range of hyperparameter values informed by those the random search returned.
#### Setting the list of options

In [224]:
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200]
}

In [225]:
print('The number of combinations: %d'%(len(param_grid['bootstrap'])*
                                      len(param_grid['max_depth'])*
                                      len(param_grid['max_features'])*
                                       len(param_grid['min_samples_leaf'])*
                                      len(param_grid['min_samples_split'])*
                                      len(param_grid['n_estimators'])))

The number of combinations: 72
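
The same count can be obtained without multiplying the lengths by hand. As a small sketch, scikit-learn's `ParameterGrid` expands a grid dictionary into every combination and supports `len()` and indexing; the variable names `grid` and `combos` are introduced here.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so the block runs on its own.
grid = {
    'bootstrap': [True],
    'max_depth': [80, 90],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200]
}

combos = ParameterGrid(grid)
print('The number of combinations: %d' % len(combos))
print(combos[0])  # one candidate, as a dict of parameter values
```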


#### Setting the K-Fold Cross Validation with Grid Searching 

In [226]:
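# No scoring argument is given, so candidates are ranked by the regressor's
# default score (R^2), not by the negative MAE used in the random search.
grid_search = GridSearchCV(estimator = ran_forest, param_grid = param_grid, 
                          cv = kfold, n_jobs = -1, verbose = 2, return_train_score=True)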
print(grid_search)

GridSearchCV(cv=KFold(n_splits=3, random_state=1, shuffle=False),
             error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=42,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
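
The `GridSearchCV` call above passes no `scoring` argument, so its scores are R^2 values and cannot be compared directly with the negative MAE reported by the random search. One way to put both searches on the same scale is to pass the metric explicitly. The sketch below does so on synthetic data with a deliberately tiny grid; the data, `small_grid`, and the forest size are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption), kept small so the search is quick.
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=5.0, random_state=0)

small_grid = {'max_depth': [3, None], 'min_samples_leaf': [1, 5]}

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid=small_grid,
    scoring='neg_mean_absolute_error',  # same metric as the random search
    cv=KFold(n_splits=3))
search.fit(X_demo, y_demo)

# best_score_ is now a negative MAE, comparable across both searches.
print(search.best_score_, search.best_params_)
```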


#### Grid Searching for the optimal model

In [227]:
start = timer()
# Fit the grid search to the data
rf_randomGrid_res=grid_search.fit(X_train, Y_train);
print(Bold+"The best score (R^2): %f. With the configuration: %s" % (rf_randomGrid_res.best_score_, 
                                                    rf_randomGrid_res.best_params_)+End)
print(Bold+"Time of searching %.2fs" % (timer() - start)+End)

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   19.9s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   43.8s
[Parallel(n_jobs=-1)]: Done 216 out of 216 | elapsed:   58.7s finished


The best score (R^2): 0.862637. With the configuration: {'bootstrap': True, 'max_depth': 80, 'max_features': 3, 'min_samples_leaf': 5, 'min_samples_split': 12, 'n_estimators': 100}
Time of searching 59.17s


##### Displaying result as Table

In [228]:
display(pd.DataFrame(rf_randomGrid_res.cv_results_)\
        .sort_values(by='rank_test_score'))

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bootstrap,param_max_depth,param_max_features,param_min_samples_leaf,param_min_samples_split,param_n_estimators,...,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
70,0.580002,0.030763,0.040317,0.012682,True,90,3,5,12,100,...,0.865480,0.849417,0.862637,0.009830,1,0.916011,0.917407,0.923577,0.918998,0.003287
34,0.524902,0.045770,0.027230,0.001935,True,80,3,5,12,100,...,0.865480,0.849417,0.862637,0.009830,1,0.916011,0.917407,0.923577,0.918998,0.003287
35,1.082895,0.020057,0.046092,0.002724,True,80,3,5,12,200,...,0.865032,0.848807,0.862108,0.009874,3,0.916448,0.917240,0.923865,0.919184,0.003325
71,1.045334,0.122135,0.033041,0.006910,True,90,3,5,12,200,...,0.865032,0.848807,0.862108,0.009874,3,0.916448,0.917240,0.923865,0.919184,0.003325
59,1.374349,0.030209,0.068933,0.013942,True,90,3,3,12,200,...,0.865020,0.848736,0.861674,0.009487,5,0.924130,0.925519,0.931651,0.927100,0.003268
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6,0.383846,0.008711,0.019286,0.000213,True,80,2,4,8,100,...,0.857042,0.848355,0.858229,0.008579,67,0.922171,0.922704,0.929429,0.924768,0.003303
41,0.943008,0.016797,0.045990,0.001880,True,90,2,3,12,200,...,0.859969,0.846631,0.858042,0.008629,69,0.918020,0.919461,0.925471,0.920984,0.003227
5,0.751495,0.012808,0.041725,0.007604,True,80,2,3,12,200,...,0.859969,0.846631,0.858042,0.008629,69,0.918020,0.919461,0.925471,0.920984,0.003227
40,0.458152,0.011897,0.022061,0.000645,True,90,2,3,12,100,...,0.859772,0.846652,0.857642,0.008233,71,0.916913,0.919232,0.925730,0.920625,0.003732


In [229]:
grid_means = rf_randomGrid_res.cv_results_['mean_test_score']
grid_stds = rf_randomGrid_res.cv_results_['std_test_score']
grid_params = rf_randomGrid_res.cv_results_['params']
for mean, stdev, param in zip(grid_means, grid_stds, grid_params):
    print("%f (%f) with: %r" % (mean, stdev, param))

0.858726 (0.008681) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 8, 'n_estimators': 100}
0.858946 (0.008244) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 8, 'n_estimators': 200}
0.858426 (0.008365) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 100}
0.858973 (0.008937) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 200}
0.857642 (0.008233) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 12, 'n_estimators': 100}
0.858042 (0.008629) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 12, 'n_estimators': 200}
0.858229 (0.008579) with: {'bootstrap': True, 'max_depth': 80, 'max_features': 2, 'min_samples_l

#### Evaluate the Best Model from Grid Search

In [230]:
rf_randomGrid_res.best_params_

{'bootstrap': True,
 'max_depth': 80,
 'max_features': 3,
 'min_samples_leaf': 5,
 'min_samples_split': 12,
 'n_estimators': 100}

In [231]:
best_gridModel=RandomForestRegressor(bootstrap=True,
                            max_depth=80,
                           max_features=3,
                           min_samples_leaf=5,
                           min_samples_split=12,
                           n_estimators=100)
print(best_gridModel)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=80,
                      max_features=3, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=5, min_samples_split=12,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)


In [232]:
best_gridModel_accuracy = evaluate(best_gridModel, X_test, Y_test)

Model Performance
Average Error: 2.8390 degrees.
Accuracy = 95.22%.


In [235]:
best_grid = grid_search.best_estimator_
grid_best_accuracy = evaluate(best_grid, X_test, Y_test)

Model Performance
Average Error: 2.8377 degrees.
Accuracy = 95.22%.


##### Measure the improvement

In [237]:
print('Improvement of {:0.2f}% in comparison with the baseline model'.\
      format( 100 * (grid_best_accuracy - base_accuracy) / base_accuracy))

Improvement of -2.49% in comparison with the baseline model


$$\textbf{Comments:}$$
Again, one should not read too much into the result above. This is only a simple demonstration of how to optimize the Random Forest algorithm.

## Finalizing Model

The final model from hyperparameter tuning is as follows.
### Configuration

In [144]:
final_model = best_gridModel

In [146]:
print('Final Model Parameters:\n')
pprint(final_model.get_params())
print('\n')
grid_final_accuracy = evaluate(final_model, X_test, Y_test)

Final Model Parameters:

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': 80,
 'max_features': 3,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 5,
 'min_samples_split': 12,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


Model Performance
Average Error: 2.8684 degrees.
Accuracy = 95.17%.


### Saving

In [239]:
filename = 'Phuong_Trained_Random_Forest.sav'
dump(final_model, open(filename, 'wb'))

## Visualizing One Tree in the Forest

My favorite part about the random forest in scikit-learn may be that you can actually look at any tree in the forest.
I'll pick one tree and visualize it as an image.

### Loading model

In [240]:
Phuong_TrainedML = load(open(filename, 'rb'))
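
The save and load steps above leave the file handles open until they are garbage-collected. A slightly more defensive sketch uses `with` blocks and checks that the restored model predicts the same values; the small model, the temporary directory, and the file name `demo_forest.sav` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small made-up model, standing in for final_model.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), 'demo_forest.sav')  # hypothetical file name

# 'with' closes the file even if pickling fails.
with open(path, 'wb') as fh:
    pickle.dump(model, fh)

with open(path, 'rb') as fh:
    restored = pickle.load(fh)

# The restored forest should reproduce the original predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled models should only be loaded from trusted sources, and preferably with the same scikit-learn version that wrote them.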

### Visualization
#### Extracting single Tree

In [246]:
visual_tree=Phuong_TrainedML.estimators_[12]
visual_tree

DecisionTreeRegressor(criterion='mse', max_depth=80, max_features=3,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=5,
                      min_samples_split=12, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=2116289759, splitter='best')

#### Creating dot file
Write the decision tree as a dot file

In [242]:
export_graphviz(visual_tree, out_file = 'best_tree.dot',
                feature_names = important_feature_names, 
                precision = 2, filled = True, rounded = True, max_depth = None)

In [252]:
export_graphviz(visual_tree, out_file='tree.dot', 
                feature_names = important_feature_names,
                class_names = None,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

In [257]:
#call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
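
Converting the dot file to an image requires the Graphviz `dot` program to be installed. When it is not available, a tree can still be inspected as plain text with `sklearn.tree.export_text`. The sketch below uses made-up data with the same six column names and a shallow forest so that the output stays short; the data and model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

feature_names = ['temp_1', 'average', 'ws_1', 'temp_2', 'friend', 'year']

# Made-up data with the same six column names, only for illustration.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, len(feature_names))
y_demo = 50 + 20 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

forest = RandomForestRegressor(n_estimators=5, max_depth=3,
                               random_state=0).fit(X_demo, y_demo)

# Text rendering of one tree; no Graphviz installation is needed.
tree_text = export_text(forest.estimators_[0], feature_names=feature_names)
print(tree_text)
```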