### Trees: Ensemble Methods - Bagging

Summary: Bagging trains many individual models in parallel, each on a random subset of the data, and then combines their predictions.

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging predictions).
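The two steps can be sketched by hand. This is a minimal illustration rather than how any library implements bagging: the sine-shaped data, the number of models, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (an assumption, for illustration only).
n_samples = 200
X_demo = rng.uniform(0, 6, size=(n_samples, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=n_samples)

n_models = 25
trees = []
unique_fractions = []
for i in range(n_models):
    # Bootstrapping: draw n_samples row indices with replacement.
    idx = rng.integers(0, n_samples, size=n_samples)
    unique_fractions.append(len(np.unique(idx)) / n_samples)
    trees.append(DecisionTreeRegressor(random_state=i).fit(X_demo[idx], y_demo[idx]))

# Aggregating: average the predictions of all trees.
all_preds = np.stack([tree.predict(X_demo) for tree in trees])  # (n_models, n_samples)
bagged_pred = all_preds.mean(axis=0)

unique_fraction = float(np.mean(unique_fractions))
print(f"Average share of distinct rows per bootstrap sample: {unique_fraction:.2f}")
```

Because rows are drawn with replacement, each bootstrap sample contains only part of the distinct rows (roughly two thirds for large samples), so every tree sees a slightly different dataset.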

A Random Forest goes one step further: besides training each tree on a random subset of the data, it considers only a random selection of features at each split rather than all features. An ensemble of many such randomized trees is called a Random Forest.
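In scikit-learn, the number of features considered at each split is controlled by the max_features parameter. The sketch below uses the iris data (4 features) with max_features="sqrt"; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)  # 4 features

forest = RandomForestClassifier(
    n_estimators=10,
    max_features="sqrt",  # candidate features per split: sqrt(4) = 2
    bootstrap=True,       # each tree is fit on a bootstrap sample
    random_state=0,
).fit(X_iris, y_iris)

n_trees = len(forest.estimators_)
# Each fitted tree stores the resolved number of candidate features per split.
features_per_split = forest.estimators_[0].max_features_
print(n_trees, features_per_split)
```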

With a Random Forest, the goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from all trees is used, which is more robust than the prediction of a single decision tree.

- Decision trees = high variance, low bias base learners
- Bagging decreases the model’s variance
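The variance claim can be checked empirically. The following sketch, on an assumed synthetic noisy sine, repeatedly draws a fresh training set, fits one fully grown tree and one small forest, and compares how much their predictions at fixed points change from one training set to the next. The data process, sample sizes, and names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_grid = np.linspace(0.5, 5.5, 50).reshape(-1, 1)  # fixed evaluation points


def draw_training_set(n=100):
    """Draw a fresh noisy sample from the same synthetic process."""
    X = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=n)
    return X, y


tree_preds, forest_preds = [], []
for rep in range(30):
    X_rep, y_rep = draw_training_set()
    tree = DecisionTreeRegressor(random_state=rep).fit(X_rep, y_rep)
    forest = RandomForestRegressor(n_estimators=30, random_state=rep).fit(X_rep, y_rep)
    tree_preds.append(tree.predict(X_grid))
    forest_preds.append(forest.predict(X_grid))

# Variance across repetitions at each grid point, averaged over the grid.
tree_var = float(np.var(tree_preds, axis=0).mean())
forest_var = float(np.var(forest_preds, axis=0).mean())
print(f"single tree variance: {tree_var:.3f}, forest variance: {forest_var:.3f}")
```

On data like this, the forest's predictions should fluctuate noticeably less across training sets than those of a single tree, which is the variance reduction that bagging aims for.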

<img src="./images/boostrap_aggregating.png" width="500" height="500" />

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in Sklearn, add one more step of randomization to the random forest algorithm.

Random forests will

1. compute the optimal split for each feature within the randomly selected subset and then choose the best feature to split on.
2. build each tree on a bootstrap sample, since bootstrap = True by default in Sklearn, which means the training rows are sampled with replacement.

ExtraTrees, on the other hand, will

1. choose a random split for each feature within that random subset and then choose the best feature to split on by comparing those randomly chosen splits (nodes are split on random splits, not best splits).
2. build multiple trees with bootstrap = False by default, which means each tree is trained on the whole training set rather than on a bootstrap sample.

Because they skip the search for the optimal split thresholds, extremely randomized trees are usually cheaper to train than random forests, and their performance is often comparable. In some cases, they may even perform better!
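A quick way to see the differences is to inspect both estimators in scikit-learn. This is a small sketch on the iris data, with illustrative variable names; the cross-validated accuracies depend on the dataset and should not be read as a general ranking.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
et_demo = ExtraTreesClassifier(n_estimators=100, random_state=0)

# Default resampling differs between the two estimators.
print("bootstrap:", rf_demo.bootstrap, et_demo.bootstrap)

# 5-fold cross-validated accuracy for both ensembles.
rf_scores = cross_val_score(rf_demo, X_iris, y_iris, cv=5)
et_scores = cross_val_score(et_demo, X_iris, y_iris, cv=5)
print(f"RandomForest accuracy: {rf_scores.mean():.3f}")
print(f"ExtraTrees accuracy:   {et_scores.mean():.3f}")

# The individual trees use different split strategies.
rf_demo.fit(X_iris, y_iris)
et_demo.fit(X_iris, y_iris)
print("splitter:", rf_demo.estimators_[0].splitter, et_demo.estimators_[0].splitter)
```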

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [17]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [18]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [19]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

Exercise: The random forest used at https://shoe-size-predict.herokuapp.com/ sets the baseline. Can you get a better (lower) mean absolute error?

In [20]:
import pandas as pd

In [21]:
df = pd.read_csv('shoesizes.csv', index_col=0)

In [22]:
df

Unnamed: 0,height,sex_no,shoe_size
0,160.0,2,40
1,171.0,2,39
2,174.0,2,39
3,176.0,2,40
4,195.0,1,46
...,...,...,...
176,185.0,1,45
177,185.0,1,46
178,170.0,1,42
179,170.0,1,42


In [30]:
print('Does the df contain nulls?:', df.isnull().any().any())

Does the df contain nulls?: False


In [31]:
df.loc[:, "height"].value_counts().head()

175.0    14
168.0    13
171.0    10
162.0    10
165.0    10
Name: height, dtype: int64

In [32]:
df.loc[:, "sex_no"].value_counts().head()

2    113
1     66
0      2
Name: sex_no, dtype: int64

In [33]:
df.loc[:, "shoe_size"].value_counts().head()

38    31
39    24
42    20
37    19
40    17
Name: shoe_size, dtype: int64

In [35]:
#drop the two rows where sex_no is 0
df = df[df['sex_no'] != 0]
df

Unnamed: 0,height,sex_no,shoe_size
0,160.0,2,40
1,171.0,2,39
2,174.0,2,39
3,176.0,2,40
4,195.0,1,46
...,...,...,...
176,185.0,1,45
177,185.0,1,46
178,170.0,1,42
179,170.0,1,42


In [54]:
X = df.drop('shoe_size', axis = 1)
y = df['shoe_size']

In [55]:
#train,test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42, test_size=0.2, shuffle=True)

In [56]:
from sklearn.ensemble import RandomForestRegressor

#random forest regressor; random_state is not set, so results vary between runs
rfr = RandomForestRegressor(n_estimators=250, max_depth=7, n_jobs=-1)

rfr.fit(X_train, y_train)

#R^2 on the training data, which is an optimistic estimate of performance on unseen data
scoreOfModel = rfr.score(X_train, y_train)

print("Score is calculated as: ",scoreOfModel)

Score is calculated as:  0.8396511755462269


In [57]:
pred = rfr.predict(X_test)

In [58]:
pred

array([38.1609336 , 37.94000491, 43.10776186, 37.01125287, 37.94000491,
       37.23967013, 43.81674904, 38.1609336 , 39.55520243, 43.10776186,
       38.1609336 , 44.12469856, 38.20417884, 40.10752381, 42.14264762,
       44.12469856, 42.63303501, 38.20417884, 38.19637501, 37.22883156,
       45.10148571, 37.71314237, 41.14409957, 38.10999017, 39.68815238,
       37.71314237, 39.28225036, 39.28276454, 39.16894855, 42.63303501,
       38.10999017, 41.14409957, 37.49598528, 38.20417884, 39.06151417,
       42.63303501])

In [59]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, pred)

2.620178802807926
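The mean absolute error is the average of the absolute differences between true and predicted values. A small worked example with made-up numbers (not taken from the shoe size data) shows that the manual computation agrees with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration only.
y_true_demo = np.array([40, 39, 46])
y_pred_demo = np.array([38.0, 39.5, 44.0])

# Absolute errors are 2, 0.5 and 2, so the mean is 4.5 / 3 = 1.5.
manual_mae = float(np.mean(np.abs(y_true_demo - y_pred_demo)))
sklearn_mae = float(mean_absolute_error(y_true_demo, y_pred_demo))
print(manual_mae, sklearn_mae)
```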

For comparison, the mean absolute error to beat is currently 0.789.
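One possible way to approach the exercise is a cross-validated hyperparameter search that optimizes the mean absolute error directly. The sketch below runs on synthetic stand-in data, because shoesizes.csv is not included here: the height distribution, the coding of sex_no, the linear relation, and the parameter grid are all assumptions, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for shoesizes.csv (made-up relation, illustration only).
n = 180
height = rng.normal(172, 9, size=n).round()
sex_no = rng.integers(1, 3, size=n)  # values 1 or 2
base_size = np.where(sex_no == 1, 43, 39)
shoe_size = np.round(0.25 * (height - 172) + base_size + rng.normal(0, 1, size=n))
X_syn = np.column_stack([height, sex_no])

# Hypothetical grid: shallower trees and larger leaves regularize the forest.
param_grid = {"max_depth": [2, 4, 7], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_syn, shoe_size)

best_cv_mae = -search.best_score_  # flip the sign back to a positive MAE
print(search.best_params_, round(best_cv_mae, 3))
```

With the real data, the same pattern would apply to X_train and y_train, and the selected model could then be scored once on X_test with mean_absolute_error.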