# Random Forest
Moving on to a more sophisticated tree-based example: using a Random Forest for regression.
### Lab prep

In [4]:
import pandas as pd
import pylab
import seaborn as sns
import numpy as np
import datetime
import copy
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor

In [2]:
import utils
df = utils.get_data()
len(df)

348

In [6]:
X = np.array(df[["month", "temp_2", "temp_1", "average", "week_Fri", "week_Mon", "week_Sat", "week_Sun", "week_Thurs", "week_Tues", "week_Wed", "unix_days"]])
y = np.array(df["actual"])
print(X.shape)
print(y.shape)

(348, 12)
(348,)


### Random Forest
From looking at the average in the dataset and at the performance of decision trees, the baseline was a mean absolute error of 5 degrees Fahrenheit, and a tuned decision tree gave a mean absolute error of just below 4 degrees Fahrenheit. A Random Forest with its out-of-the-box parameters also gives a mean absolute error of just below 4 degrees Fahrenheit.

In [10]:
regr = RandomForestRegressor()
scores = utils.get_cross_val_scores(regr, X, y, 5)
print((scores, np.mean(scores)))

(array([-3.489     , -3.75      , -4.01571429, -3.57173913, -4.04637681]), -3.7745660455486543)
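The helper `utils.get_cross_val_scores` is not shown in this lab. The negative values above suggest that it wraps scikit-learn's `cross_val_score` with the `neg_mean_absolute_error` scorer, but that is an assumption. A minimal sketch under that assumption, using a hypothetical function name and synthetic data in place of the lab's dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def get_cross_val_scores_sketch(model, X, y, cv):
    """Hypothetical stand-in for utils.get_cross_val_scores (assumed behaviour)."""
    # scikit-learn scorers follow "higher is better", so the MAE comes back negated.
    return cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")


# Synthetic data with the same shape as the lab's (348, 12) feature matrix.
X_demo, y_demo = make_regression(
    n_samples=348, n_features=12, noise=5.0, random_state=0
)

demo_scores = get_cross_val_scores_sketch(
    RandomForestRegressor(n_estimators=20, random_state=0), X_demo, y_demo, 5
)
print((demo_scores, np.mean(demo_scores)))
```

If the helper does work this way, a score of about -3.77 corresponds to a mean absolute error of about 3.77 degrees Fahrenheit. The numbers printed by the sketch come from synthetic data and are not comparable to the lab's results.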


Tweaking the default parameters does not seem to improve results dramatically, although they can be adjusted to reduce runtime (for example, by reducing n_estimators from the default of 100). This would be handy for grid-search parameter tuning.

In [323]:
regr = RandomForestRegressor(
    n_estimators=20,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=15,
    max_features=1.0,  # same as the old "auto" for regressors; "auto" was removed in scikit-learn 1.3
    max_leaf_nodes=12,
    ccp_alpha=0.0,
    max_samples=None
)
scores = utils.get_cross_val_scores(regr, X, y, 5)
print((scores, np.mean(scores)))

(array([-3.83492664, -3.51561701, -3.2848083 , -3.69237525, -4.02723049]), -3.6709915397548927)
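The grid-search tuning mentioned above could be sketched with scikit-learn's `GridSearchCV`, keeping `n_estimators` small so that each candidate trains quickly. The parameter grid below and the synthetic data are illustrative assumptions, not values taken from this lab:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the lab's (348, 12) feature matrix.
X_demo, y_demo = make_regression(
    n_samples=348, n_features=12, noise=5.0, random_state=0
)

# Illustrative grid: 3 x 3 = 9 candidates, each scored with 5-fold cross-validation.
param_grid = {
    "min_samples_leaf": [1, 5, 15],
    "max_leaf_nodes": [12, 50, None],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),  # few trees for speed
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Once a promising region of the grid is found, `n_estimators` can be raised again for the final model, since more trees generally reduce variance at the cost of runtime.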
