# 1. Random Forest
*Reference: [Scikit Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)*

## 4.1. The algorithm
Random Forest is an ensemble method consists of a large number of Decision Trees, they operate as a committee outperforms any individual weak model. This wonderful effect - the wisdom of crowds - can be explained that trees protect each other from their individual errors. If trees share the same behaviors, they also make the same mistakes. So, the low correlation between trees is the key. But how Random Forest ensures the trees are independent?

#### Bootstrap aggregating
Bootstrap aggregating (bagging) is a resampling technique that randomly selects instances from the original data to form a bootstrap sample, which becomes input for an individual tree. There are two points should be noticed:
- Sampling is done with replacement, meaning an observation can appear more than once in a bootstrap sample. This makes every single tree is completely random.
- The size for all bootstrap samples is fixed, it can be a number of observations or a ratio of the original size.

#### Variable randomness
In a normal Decision Tree, every input variable is considered to find the best split at a node (so greedy). In contrast, each tree of a Random Forest only picks a random subset of variables, this forces even more variation amongst trees. Compare variable randomness to bootstrap aggregating, they both do sampling, but one selects rows, the other selects columns.

## 4.2. Implementation
Notable hyperparameters:

Hyperparameter|Meaning|Default value|Common values|
:---|:---|:---|:---|
`n_estimators`|Number of trees in the forest|`100`||
`criterion`|Measure of impurity for each tree|`gini`|`entropy` `gini`|
`max_features`|Number or ratio of variables used in each tree|`auto`|`0.8` `0.9`|
`max_samples`|Number or ratio of instances used in each tree|`None`|`0.5`|

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
credit = pd.read_excel('data/credit_scoring.xlsx')
credit = credit.dropna().reset_index()
credit.head()

Unnamed: 0,index,bad_customer,credit_balance_percent,age,num_of_group1_pastdue,debt_ratio,income,num_of_loans,num_of_times_late_90days,num_of_estate_loans,num_of_group2_pastdue,num_of_dependents
0,0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [3]:
y = credit.bad_customer.values
x = credit.drop(columns='bad_customer')

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [5]:
params = {
    'max_features': [0.8, 0.9],
    'max_samples': [0.5]
}

forest = RandomForestClassifier()
forest = GridSearchCV(forest, params, cv=5)
forest = forest.fit(x_train, y_train)
forest.best_params_

{'max_features': 0.8, 'max_samples': 0.5}

In [7]:
y_train_pred = forest.predict(x_train)
y_test_pred = forest.predict(x_test)

In [8]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98     89543
           1       0.99      0.58      0.73      6672

    accuracy                           0.97     96215
   macro avg       0.98      0.79      0.86     96215
weighted avg       0.97      0.97      0.97     96215



In [9]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     22369
           1       0.55      0.17      0.26      1685

    accuracy                           0.93     24054
   macro avg       0.75      0.58      0.61     24054
weighted avg       0.91      0.93      0.92     24054

