### Random Forests: countering the overfitting tendency of decision trees

As we have learned, decision trees are powerful tools for splitting data into subgroups for predictive purposes, but they tend to overfit. Random Forest is an effective way to counteract that tendency.

Random Forest trains many decision trees and aggregates their predictions (for regression, by averaging them). This alleviates the overfitting of individual trees through two primary mechanisms: 1) each tree in the forest is fit to a different sample of the training data, drawn with replacement, and the trees' predictions are then combined, a process known as Bootstrap Aggregation (bagging)*; 2) the forest uses subspace sampling, so only a random subset of the features is considered when choosing splits and different trees end up splitting on different features.

*Because bootstrap samples are drawn with replacement, two trees could in principle receive identical samples, but the chance of this is exceptionally low and should not be a concern.
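
The two mechanisms described above can be sketched with nothing but the standard library. This is an illustrative sketch, not scikit-learn's implementation: the helper names `bootstrap_sample` and `feature_subset`, the row count, and the feature names are all made up for the example. It also picks one feature subset per tree for simplicity, whereas scikit-learn draws a fresh subset of candidate features at every split (controlled by `max_features`).

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(feature_names, k, rng):
    """Pick k distinct features at random (subspace sampling)."""
    return sorted(rng.sample(feature_names, k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1000  # hypothetical training-set size
features = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

for tree_id in range(3):
    rows = bootstrap_sample(n_rows, rng)
    cols = feature_subset(features, 2, rng)
    # Sampling with replacement repeats some rows and leaves others out
    unique_fraction = len(set(rows)) / n_rows
    print(f"tree {tree_id}: {unique_fraction:.0%} unique rows, features {cols}")
```

Because rows are drawn with replacement, each bootstrap sample is expected to contain only about 63% of the distinct training rows (roughly 1 - 1/e), so every tree sees a noticeably different view of the data.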

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load & split the data
filename = 'data/Folds5x2_pp.xlsx'
df = pd.read_excel(filename)

X = df[df.columns.difference(['PE'])]
y = df['PE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [3]:
# fit random forest to data

rf = RandomForestRegressor()
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [4]:
# Check the R-squared score on the held-out test set
rf.score(X_test, y_test)

0.9581109778614281

As you can see, the random forest regressor performs very well even with default settings. I highly recommend tweaking the settings on your own and exploring the effectiveness of trees with different characteristics.
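
As a starting point for that kind of exploration, the sketch below compares a single fully grown decision tree against forests that consider different numbers of features at each split. It runs on synthetic data from `make_friedman1`, which is an assumption made so the example is self-contained; it is not the dataset loaded above, and the scores it prints will not match the result shown earlier. The particular `max_features` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the dataset used above)
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Baseline: one fully grown decision tree
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(f"single tree  train R^2={tree.score(X_train, y_train):.3f} "
      f"test R^2={tree.score(X_test, y_test):.3f}")

# Forests that consider all, half, or sqrt(n_features) features per split
scores = {}
for max_features in (1.0, 0.5, "sqrt"):
    rf = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=0
    ).fit(X_train, y_train)
    scores[max_features] = rf.score(X_test, y_test)
    print(f"forest max_features={max_features!s:<5} "
          f"test R^2={scores[max_features]:.3f}")
```

A large gap between the single tree's training and test scores is the overfitting discussed at the start of this section. Other settings worth varying in the same loop include `n_estimators`, `max_depth`, and `min_samples_leaf`.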