#**RANDOM FORESTS**

One of the most productive ideas in machine learning is to build a meta-model from
an ensemble of simpler models. This simple idea has led to powerful models.
Random forest is a prominent example: it uses an ensemble of decision trees
and performs very well, especially in classification problems.

Key points about random forests:

* they work for both classification and regression problems;

* they are especially good at classification;

* they use multiple trees;

* they may be computationally heavy.




#**1. ALGORITHM**

The algorithm of random forest is schematized in the following figure (see notes).

It works by constructing a multitude of decision trees during the training phase and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The key concept behind random forest is that combining multiple trees reduces overfitting and improves the overall accuracy and stability of the model.

Random forests use a technique called `bootstrapping` to create several different datasets from the original dataset. Each bootstrap dataset is built by sampling with replacement, meaning some data points may be repeated within a single bootstrap dataset, while others may be left out.
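The sampling step can be sketched with NumPy alone. This is only an illustration on hypothetical row indices (the names `row_ids`, `boot_ids` and `oob_ids` are assumptions, not part of the examples below). On average roughly a third of the rows are left out of each bootstrap dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000
row_ids = np.arange(n_rows)  # stand-in for the rows of the original dataset

# one bootstrap dataset: draw n_rows indices with replacement
boot_ids = rng.choice(row_ids, size=n_rows, replace=True)

# rows that were never drawn are "out of bag" for this bootstrap dataset
oob_ids = np.setdiff1d(row_ids, boot_ids)

print(f'distinct rows drawn: {np.unique(boot_ids).size}')
print(f'rows left out: {oob_ids.size}')
```

Repeating the draw with a different seed gives a different bootstrap dataset, which is what makes the trees differ from one another.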

For each bootstrap dataset, a decision tree is created. During the construction of each decision tree, a random subset of predictors is selected at each node to determine the best split. This random feature selection introduces diversity among the trees and helps reduce overfitting.
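In scikit-learn, the size of the random subset of predictors is controlled by the `max_features` parameter, and bootstrapping by `bootstrap`. The sketch below uses synthetic data from `make_classification` (an assumption, not the datasets of this chapter) to show how many predictors each tree considers at a split when `max_features='sqrt'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data with 16 predictors (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=20,       # number of trees
    max_features='sqrt',   # predictors considered at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample
    random_state=0)
forest.fit(X_demo, y_demo)

# every fitted tree stores the number of predictors it samples per split
first_tree = forest.estimators_[0]
print(first_tree.max_features_)
```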

Once all the decision trees have been constructed, the final prediction is made by aggregating the predictions of each tree. In the case of classification, this is usually done by taking a majority vote of the classes predicted by the individual trees. In the case of regression, it is done by averaging the predicted values.
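For regression, the aggregation can be checked directly: averaging the predictions of the individual trees should reproduce the forest's own prediction. The sketch below uses synthetic data (an assumption). Note that for classification, scikit-learn averages the class probabilities of the trees rather than counting hard votes, so the same check for a classifier would have to use `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# predictions of each individual tree for the first five rows
tree_preds = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

manual_mean = tree_preds.mean(axis=0)   # average over the trees
forest_pred = forest.predict(X_demo[:5])

print(np.allclose(manual_mean, forest_pred))
```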

#**EX1 - CLASSIFICATION**

**Example: Bank Marketing Campaign**

In [73]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('/content/bank_mark_campaign.csv', sep=';')

#treat the 'unknown' label as a missing value
df = df.replace('unknown', np.nan)

#columns with missing values, numeric columns, and the remaining categorical columns
col_nan = df.columns[df.isna().any(axis=0)].to_list()
col_num = df.describe().columns.to_list()
col_cat = df.columns.difference(col_nan + col_num + ['y']).to_list()

na_treat = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oneh', OneHotEncoder(drop='first'))])

preprocessor = ColumnTransformer([
    ('na_tr', na_treat, col_nan),
    ('cat_tr', OneHotEncoder(drop='first'), col_cat),
    ('scale_tr', StandardScaler(), col_num)], 
    remainder='passthrough')

Let's create the model itself. Don't forget to correct for the class imbalance of the dataset, as we saw in the trees chapter.
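As a rough illustration of what `class_weight='balanced'` does, the weights can be computed by hand as `n_samples / (n_classes * class_count)`. The class counts below (3700 'yes', 29250 'no') are taken from the training-set confusion matrix reported later in this example; the array `y_demo` is a hypothetical stand-in for `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# stand-in for y_train with the same class counts (assumption)
y_demo = np.array(['yes'] * 3700 + ['no'] * 29250)
classes = np.array(['no', 'yes'])

# weights as computed by scikit-learn
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

# same weights by hand: n_samples / (n_classes * count of each class)
manual = len(y_demo) / (len(classes) * np.array([29250, 3700]))

print(dict(zip(classes, weights.round(3))))
```

The minority class 'yes' receives the larger weight, so errors on 'yes' customers cost more during training.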

In [81]:
from sklearn.ensemble import RandomForestClassifier

#creating the final pipeline
pipe = Pipeline([
    ('pre', preprocessor),
    ('rf', RandomForestClassifier(class_weight='balanced'))])

X = df.drop('y', axis=1)
y = df['y']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

pipe.fit(X_train, y_train)

In [82]:
#predictive performance of the model in the train set
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
y_pred = pipe.predict(X_train)

acur = accuracy_score(y_train, y_pred)
print(f'Accuracy= {acur}')
cm = confusion_matrix(y_train, y_pred, labels=['yes', 'no'])
print(cm)
recall = recall_score(y_train, y_pred, pos_label='yes')
print(f'Recall= {recall}')

Accuracy= 1.0
[[ 3700     0]
 [    0 29250]]
Recall= 1.0


In [83]:
#predictive performance of the model in the test set
y_pred2 = pipe.predict(X_test)

acur = accuracy_score(y_test, y_pred2)
print(f'Accuracy= {acur}')
cm = confusion_matrix(y_test, y_pred2, labels=['yes', 'no'])
print(cm)
recall = recall_score(y_test, y_pred2, pos_label='yes')
print(f'Recall= {recall}')

Accuracy= 0.910536537994659
[[ 400  540]
 [ 197 7101]]
Recall= 0.425531914893617


This model is overfitting: it fits the training set perfectly (accuracy and recall of 1.0), noise included, but performs much worse on unseen data, where the noise is different. On the test set, accuracy drops to 0.91 and recall to 0.43.

In fact, the model is poor for this task. Recall on the test set should be much closer to 1, but only 400 of the 940 'yes' customers are identified.

**Open question:** we want to increase recall, but we only tune `ccp_alpha`, the complexity of the trees. Shouldn't we also change the goal of the grid search, for example to ROC AUC, so that it improves the recall score instead of accuracy, which is the default option?
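One possible way to do this, sketched here on synthetic data rather than on the bank dataset, is to pass a custom scorer to the `scoring` parameter of `GridSearchCV`. The data, the grid values and the name `recall_yes` are assumptions made for illustration. Built-in scorer names such as `'roc_auc'` or `'f1'` could be tried as well, since recall alone can be maximized by predicting 'yes' for everyone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data with 'yes'/'no' labels (illustrative only)
X_demo, y_num = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=0)
y_demo = np.where(y_num == 1, 'yes', 'no')

# scorer that measures recall of the 'yes' class
recall_yes = make_scorer(recall_score, pos_label='yes')

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', n_estimators=50, random_state=0),
    {'ccp_alpha': [0.001, 0.01, 0.1]},
    scoring=recall_yes,   # the grid search now selects on recall, not accuracy
    cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```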

In [84]:
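from sklearn.model_selection import GridSearchCV

#candidate values for the cost-complexity pruning parameter
hyper = {
    'ccp_alpha': [0.001, 0.01, 0.1, 0.2, 0.5]
}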

pipe = Pipeline([
    ('pre', preprocessor),
    ('grid', GridSearchCV(RandomForestClassifier(class_weight='balanced'), hyper, cv=5))])

X = df.drop('y', axis=1)
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

pipe.fit(X_train, y_train)

In [85]:
#predictive performance of the model in the train set
y_pred = pipe.predict(X_train)

acur = accuracy_score(y_train, y_pred)
print(f'Accuracy= {acur}')
cm = confusion_matrix(y_train, y_pred, labels=['yes', 'no'])
print(cm)
recall = recall_score(y_train, y_pred, pos_label='yes')
print(f'Recall= {recall}')

Accuracy= 0.8283459787556905
[[ 3530   170]
 [ 5486 23764]]
Recall= 0.9540540540540541


In [86]:
#predictive performance of the model in the test set
y_pred2 = pipe.predict(X_test)

acur = accuracy_score(y_test, y_pred2)
print(f'Accuracy= {acur}')
cm = confusion_matrix(y_test, y_pred2, labels=['yes', 'no'])
print(cm)
recall = recall_score(y_test, y_pred2, pos_label='yes')
print(f'Recall= {recall}')

Accuracy= 0.8227725176013595
[[ 887   53]
 [1407 5891]]
Recall= 0.9436170212765957


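The tuned model generalizes much better: accuracy (0.83 vs 0.82) and recall (0.95 vs 0.94) are now very similar on the training and test sets. The price is more false positives: 1407 'no' customers in the test set are now predicted as 'yes'.

Since tuning `ccp_alpha` alone already gives consistent performance and high recall, we do not tune other hyperparameters here.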
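The figures for the tuned model can be recomputed from its test-set confusion matrix, which also makes the precision visible. This is plain arithmetic on the numbers reported above; only the variable names are new.

```python
import numpy as np

# test-set confusion matrix of the tuned model:
# rows = true class, columns = predicted class, both ordered as ['yes', 'no']
cm = np.array([[887, 53],
               [1407, 5891]])

tp, fn = cm[0]   # true 'yes': predicted 'yes' / predicted 'no'
fp, tn = cm[1]   # true 'no':  predicted 'yes' / predicted 'no'

recall = tp / (tp + fn)        # share of 'yes' customers that are found
precision = tp / (tp + fp)     # share of 'yes' predictions that are correct
accuracy = (tp + tn) / cm.sum()

print(f'Recall= {recall:.3f}')
print(f'Precision= {precision:.3f}')
print(f'Accuracy= {accuracy:.3f}')
```

Whether the lower precision is acceptable depends on how costly it is to contact a customer who will not subscribe.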

#**EX2 - REGRESSION**

**Example: Defective Car Radios**

In [66]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor

import datetime

df = pd.read_excel('/content/data_carradios.xlsx')

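def get_ages(col):
  #age in (fractional) years; dividing the day count by 365.25 avoids casting to timedelta64[Y], which recent pandas versions reject
  now = pd.Timestamp.now()
  result = col.apply(lambda dates: (now - dates).dt.days / 365.25)
  return result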

ager = Pipeline([
    ('ages', FunctionTransformer(get_ages, feature_names_out='one-to-one')),
    ('scale', StandardScaler())
])

def get_weekdays(col):
  result = col.iloc[:,0].dt.weekday
  result = pd.DataFrame(result)
  return result

weeker = Pipeline([
    ('weekd', FunctionTransformer(get_weekdays, feature_names_out='one-to-one')),
    ('oneh', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer([
    ('ages_tr', ager, ['bdate']),
    ('weekd_tr', weeker, ['datep']),
    ('team_tr', OneHotEncoder(drop='first'), ['team']),
    ('scaler', StandardScaler(), ['prized', 'prizeq'])],
    remainder='passthrough')

X = df.drop('perc_defec', axis=1)
y = df['perc_defec']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

pipe = Pipeline([
    ('pre', preprocessor),
    ('rf', RandomForestRegressor())])

pipe.fit(X_train, y_train)

#train set
y_pred = pipe.predict(X_train)

mae = mean_absolute_error(y_train, y_pred)
#RMSE as the square root of the MSE; the `squared=False` argument is deprecated in newer scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)

print(f'MAE= {mae}')
print(f'RMSE= {rmse}')
print(f'R2= {r2}')

#test set
y_pred = pipe.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f'MAE= {mae}')
print(f'RMSE= {rmse}')
print(f'R2= {r2}')

MAE= 2.7598933740599882
RMSE= 3.722021969388369
R2= 0.9390793271021662
MAE= 3.765669768562014
RMSE= 5.0788333152600655
R2= 0.8904468168432652


As before, there are signs of mild overfitting: R2 drops from 0.939 on the training set to 0.890 on the test set, and MAE rises from 2.76 to 3.77.

In [67]:
hyper = ({
    'ccp_alpha': [0.001, 0.003, 0.1, 0.3, 0.5],
    'n_estimators': [10, 50, 100, 150]
})

In [68]:
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('pre', preprocessor),
    ('grid', GridSearchCV(RandomForestRegressor(), hyper, cv=5))])

In [69]:
pipe.fit(X_train, y_train)

In [70]:
#train set
y_pred = pipe.predict(X_train)

mae = mean_absolute_error(y_train, y_pred)
#RMSE as the square root of the MSE; the `squared=False` argument is deprecated in newer scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)

print(f'MAE= {mae}')
print(f'RMSE= {rmse}')
print(f'R2= {r2}')

#test set
y_pred = pipe.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f'MAE= {mae}')
print(f'RMSE= {rmse}')
print(f'R2= {r2}')

MAE= 3.289608816933123
RMSE= 4.266938261894273
R2= 0.9199355830345625
MAE= 3.72056413366291
RMSE= 4.927206212054344
R2= 0.8968905290216066


Which values of `ccp_alpha` and `n_estimators` did the grid search select?

In [71]:
pipe.named_steps['grid'].best_params_

{'ccp_alpha': 0.5, 'n_estimators': 150}
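Both selected values are the largest candidates in their lists. A small helper can flag this situation, which may suggest extending the grid in that direction. The function `on_grid_edge` is hypothetical, written for this illustration, and is applied to the grid and the result shown above.

```python
def on_grid_edge(param_grid, best_params):
    """Return the parameters whose selected value is the smallest or largest candidate."""
    return [name for name, values in param_grid.items()
            if best_params[name] in (min(values), max(values))]

# grid and selected values from the regression example
hyper_grid = {
    'ccp_alpha': [0.001, 0.003, 0.1, 0.3, 0.5],
    'n_estimators': [10, 50, 100, 150]
}
best = {'ccp_alpha': 0.5, 'n_estimators': 150}

edge_params = on_grid_edge(hyper_grid, best)
print(edge_params)
```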