
## Bagging

- One of the biggest drawbacks of decision trees is that they suffer from high variance:
 - If we randomly split our data into two parts and fit an independent decision tree to each, the results are likely to be quite different.
- **B**ootstrap **agg**regation (i.e., bagging) is a procedure that reduces the variance of a statistical learning method; it is frequently used with trees.
- Bagging relies on the bootstrap, i.e., resampling the training set with replacement, so each resampled set has the same size as the original but may repeat some observations and omit others.
- For example, we can train three tree-based models **independently**, one on each resampled dataset, and the final prediction would be the average of the predictions from these 3 models.
```
Original training dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Resampled training set 1: [2, 3, 3, 5, 6, 1, 8, 10, 9, 1]
Resampled training set 2: [1, 1, 5, 6, 3, 8, 9, 10, 2, 7]
Resampled training set 3: [1, 5, 8, 9, 2, 10, 9, 7, 5, 4]
```
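
The resampling step can be sketched with NumPy as follows. This is a minimal illustration, not the code used to produce the sets above; the helper name `bootstrap_sample` and the seed are assumptions made for the example, so the printed sets will differ from the ones listed.

```python
import numpy as np

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    data = np.asarray(data)
    idx = rng.integers(0, len(data), size=len(data))  # indices may repeat
    return data[idx]

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
original = np.arange(1, 11)     # the toy training set [1, ..., 10]

resamples = [bootstrap_sample(original, rng) for _ in range(3)]
for i, sample in enumerate(resamples, start=1):
    print(f"Resampled training set {i}: {sample.tolist()}")
```

Because sampling is with replacement, each resampled set typically repeats some observations and leaves others out.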

- Recall that given a set of $n$ **independent** observations $X_1, X_2, \ldots, X_n$, each drawn from a distribution with variance $\sigma^2$, the variance of their mean $\bar{X}$ is $\sigma^2/n$.
- In other words, averaging a set of observations reduces the variance.
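
A quick simulation can make the $\sigma^2/n$ result concrete. The sketch below assumes normally distributed observations and arbitrary values for `sigma`, `n`, and the number of trials; none of these come from the Lending Club data.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
sigma, n, n_trials = 2.0, 25, 20_000      # illustrative values

# Each row is one set of n independent draws with variance sigma**2.
samples = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n))
sample_means = samples.mean(axis=1)

empirical_var = sample_means.var()        # variance of the mean across trials
theoretical_var = sigma**2 / n            # sigma^2 / n

print(f"empirical variance of the mean: {empirical_var:.4f}")
print(f"theoretical sigma^2 / n:        {theoretical_var:.4f}")
```

The two printed values should be close, and both are much smaller than the variance `sigma**2` of a single observation.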

- Let's use the Lending Club dataset to illustrate the idea.

In [None]:
from sklearn.impute import KNNImputer
import pandas as pd

df = pd.read_csv("https://drive.google.com/uc?id=1Ijs6Quta_ZAd3dsKWMvI6pxaHjpXgFoU")

y = df['loan_outcome']
X = df.drop('loan_outcome', axis=1)

# One-hot encode the categorical column
X = pd.get_dummies(X)

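# Impute the missing values (note: the imputer is fit on all rows, before the
# train/test split, so it also sees the rows that later form the test set)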
imputer = KNNImputer(n_neighbors=5)
X = imputer.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

In [None]:
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.9996504311349336
0.7186735525958141


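- Using a bagged tree classifier (`BaggingClassifier` uses a decision tree as its default base estimator), we get a better result on the test set: accuracy rises from about 0.72 to about 0.80.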

In [None]:
from sklearn.ensemble import BaggingClassifier
bagging = BaggingClassifier()
bagging.fit(X_train, y_train)

In [None]:
print(bagging.score(X_train, y_train))
print(bagging.score(X_test, y_test))

0.9757632253553951
0.8048382712693667
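
To show what bagging does internally, here is a hand-rolled sketch: fit one tree per bootstrap sample, then combine the trees by majority vote. It runs on synthetic data from `make_classification` (a stand-in, not the Lending Club data), and the variable names, seed, and number of trees are assumptions for the example. Majority voting is also a simplification and need not match how `BaggingClassifier` combines its estimators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative stand-in).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: sample training rows with replacement.
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(Xd_train[idx], yd_train[idx]))

# Majority vote: labels are 0/1, so threshold the mean vote at 0.5.
votes = np.stack([tree.predict(Xd_test) for tree in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == yd_test).mean()

single_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
print("single tree test accuracy: ", single_tree.score(Xd_test, yd_test))
print("bagged trees test accuracy:", bagged_acc)
```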


## Random Forest

- In the bagging example, we assumed that the outputs of the trees are **independent** of each other. However, this is **NOT** true in general. Even though each tree is fit on a different resampled dataset, the outputs are still correlated.
- Thus, when we average the outputs of $B$ trees, each with variance $\sigma^2$, the variance of the final prediction is $\rho\sigma^2 + (1 - \rho)\sigma^2 / B$, where $\rho$ is the pairwise correlation between trees. The first term does not shrink as $B$ grows.
- In other words, correlated observations are not as effective at reducing the uncertainty of the mean as uncorrelated, independent observations.
- If we could generate trees that are not correlated with one another, we could improve upon the bagging procedure.
 - **Random forests** help us by decorrelating the trees.
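
The variance formula for correlated outputs can be checked numerically. In the sketch below the pairwise correlation is written `rho`, and the outputs are simulated as equicorrelated normal variables built from a shared component plus independent noise; this construction, the helper name `variance_of_average`, and all parameter values are assumptions chosen for illustration, not a model of actual trees.

```python
import numpy as np

def variance_of_average(sigma2, rho, B):
    """Variance of the mean of B variables that each have variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

rng = np.random.default_rng(0)                   # arbitrary seed
sigma2, rho, B, n_trials = 1.0, 0.5, 50, 20_000  # illustrative values

# A shared component induces correlation rho between every pair of columns.
common = rng.normal(size=(n_trials, 1))
noise = rng.normal(size=(n_trials, B))
outputs = np.sqrt(sigma2) * (np.sqrt(rho) * common + np.sqrt(1 - rho) * noise)

empirical = outputs.mean(axis=1).var()
theoretical = variance_of_average(sigma2, rho, B)

print(f"empirical variance of the average: {empirical:.4f}")
print(f"formula:                           {theoretical:.4f}")
print(f"if independent (rho = 0):          {variance_of_average(sigma2, 0.0, B):.4f}")
```

With `rho = 0.5`, the variance of the average stays above `rho * sigma2` no matter how large `B` is, whereas in the independent case it shrinks toward zero.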

### Details
- Similar to bagging, we first build a number of decision trees on bootstrapped training samples, but we split internal nodes in a special way.
 - Each time a split is considered during the construction of a tree, only a random subset of $m$ of the $p$ predictors are allowed to be candidates.
- In other words, only those $m$ predictors can be chosen as the splitting variable, and a fresh subset is drawn at every split.
- A general rule of thumb for the subset size is $m \approx \sqrt{p}$.
- **What happens if we choose $m = p$?**
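
The per-split feature subsetting can be sketched as below. The helper `candidate_features` and the value `p = 16` are hypothetical and only illustrate the sampling step; a real implementation would then search for the best split among the returned candidates. In scikit-learn, the size of this subset is controlled by the `max_features` parameter of `RandomForestClassifier`.

```python
import numpy as np

def candidate_features(p, m, rng):
    """Return m distinct predictor indices out of p, drawn without replacement."""
    return np.sort(rng.choice(p, size=m, replace=False))

rng = np.random.default_rng(0)  # arbitrary seed
p = 16                          # illustrative number of predictors
m = int(np.sqrt(p))             # rule of thumb: m is about sqrt(p)

# A fresh subset is drawn every time a split is considered.
subsets = [candidate_features(p, m, rng) for _ in range(3)]
for split, subset in enumerate(subsets):
    print(f"split {split}: candidate predictors = {subset.tolist()}")

# With m = p, every predictor is a candidate at every split.
all_features = candidate_features(p, p, rng)
print("m = p:", all_features.tolist())
```

When `m = p` no predictor is ever excluded, so the procedure reduces to ordinary bagging of trees.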

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)

In [None]:
print(rf.score(X_train, y_train))
print(rf.score(X_test, y_test))

0.8204381263108832
0.8206034248437075


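- The random forest gives a better result on the test set (about 0.82), and with `max_depth=5` the training and test accuracy are nearly identical, so the large train/test gap seen for the single tree and the bagged trees is gone.
- We can use grid search with cross-validation (`GridSearchCV`) to look for the set of hyperparameters that gives the best cross-validated accuracy.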

In [None]:
from sklearn.model_selection import GridSearchCV
import numpy as np

grid_para_forest = [{
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, 15]
    # "min_samples_leaf": range(1, 10),
    # "min_samples_split": np.linspace(start=2, stop=30, num=15, dtype=int)
    }]
grid_search_forest = GridSearchCV(rf, grid_para_forest, scoring='accuracy', cv=5, n_jobs=-1)
%time grid_search_forest.fit(X_train, y_train)

CPU times: user 1.44 s, sys: 238 ms, total: 1.68 s
Wall time: 1min 22s


In [None]:
grid_search_forest.best_params_

{'max_depth': 5, 'n_estimators': 100}

In [None]:
grid_search_forest.best_score_

0.8203216022959762