# Random Forest Classifier

<h3>Big Problem with Decision Trees!!!</h3>
<p>Every split a tree makes at each node is optimized for the dataset it is fit to.<br>
A fully grown tree can keep splitting until it memorizes the training set, so its splits rarely generalize well to new data.<br>
The algorithm overfits!!!
</p>
<p style="color:red; font-weight:bold">A model that fits the training dataset perfectly usually does not predict well on a new test dataset.<br>
WE WANT A MODEL THAT CAPTURES THE RELATIONSHIPS WITHIN THE TRAINING DATASET AND GENERALIZES WELL TO UNSEEN DATA.<br>
Balancing these two goals is called the bias-variance trade-off.
</p>
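A minimal sketch of this overfitting, assuming scikit-learn is available. The synthetic dataset from `make_classification` (with noisy labels) and every parameter value here are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration): flip_y assigns a random class
# to 20% of the samples, so part of the training signal is pure noise.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# With default settings the tree has no depth limit and no pruning,
# so it keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The gap between the two numbers is the overfitting: the tree has memorized the noisy labels of the training set, which does not help on unseen data.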

# What is bias and variance?

<strong>BIAS</strong>
<ul>
    <li>Error from erroneous assumptions in the learning algorithm.</li>
    <li>High Bias : The algorithm misses relevant information between features and targets (<em>underfitting</em>).</li>
    <li><strong>ERROR DUE TO MODEL MISMATCH!</strong></li>
</ul>
<strong>VARIANCE</strong>
<ul>
    <li>Error from sensitivity to small changes in the training set.</li>
    <li>High Variance: The algorithm models noise (<em>overfitting</em>).</li>
    <li><strong>VARIANCE DUE TO TRAINING SAMPLE AND RANDOMIZATION.</strong></li>
</ul>
<strong style="color:red">BIAS / VARIANCE Trade-Off</strong>
<ul>
    <li>We usually cannot minimize both variance and bias at the same time.</li>
    <li>Low bias -> High variance</li>
    <li>Low variance -> High bias</li>
</ul>
    <img src = "Images/bias_variance_to.png"/>
    <p style="text-align:center"><em>Image from -> <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">http://scott.fortmann-roe.com/docs/BiasVariance.html</a></em></p>
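
For squared-error loss the trade-off can be written down exactly. As a sketch, assume the data follow $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and let $\hat{f}$ be the model fitted on a random training set (these symbols are introduced here for illustration and are not in the original notes). The expected test error at a point $x$ then splits into three terms:

$$
E\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

The decomposition is stated for regression; for classification the same intuition applies, although the algebra is less clean.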

# Pruning 

<p>Decision trees are likely to overfit the data, leading to poor test performance.<br>
Trees are also unstable classifiers: if you perturb the data a little, the tree might change significantly (a low-bias but high-variance model).</p>
<p>Smaller Tree + Fewer Splits : lower variance and often a better predictor, at the cost of a little extra bias.<br>
    <strong style="color:green">Better solution : grow a large tree and then prune it back to a smaller subtree.</strong>
</p>
<img src = "Images/bfpruning.png" style="width:450px"/>
<img src = "Images/afpruning.png" style="width:450px"/>
<p style="text-align:center"><em>Image from Holczer Balazs's ML course on Udemy.</em></p>
<p>But we prefer other methods like Bagging and Random Forest Classifier, as pruning introduces a little extra bias.</p>
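
One way to try "grow, then prune" is scikit-learn's cost-complexity pruning, controlled by the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below is only an illustration: the synthetic data and the value `ccp_alpha=0.01` are arbitrary assumptions, and the course images above may show a different pruning criterion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fully grown tree: no pruning at all.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0: the tree is grown first and then pruned back, removing
# subtrees whose impurity reduction does not justify their size.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
pruned_tree.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(f"{name:6s} nodes={model.tree_.node_count:4d} "
          f"test accuracy={model.score(X_test, y_test):.3f}")
```

A larger `ccp_alpha` prunes more aggressively; in practice it would be chosen by cross-validation rather than fixed by hand.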

# Bagging 

<p>Bagging stands for <strong>Bootstrap aggregation.</strong></p>
A rather counter-intuitive idea: a single weak learner cannot make good predictions, but many of them combined can.<br>
<ul>
    <li>A weak learner is just a little better than someone who guesses randomly (like a coin flip). For example: decision trees with depth 1.</li>
    <li>Combining weak learners can produce an extremely powerful classifier !!!</li>
    <li>The hedging idea behind the Black-Scholes model is loosely analogous: two risky positions taken together can effectively cancel each other's risk.</li>
</ul>

<ul>
    <li>Bagging reduces variance of a learning algorithm.</li>
    <li>If we have $n$ independent random variables $x_1,x_2,...,x_n$, each with variance $V$, then the variance of their mean $\bar{x}$ is $\frac{V}{n}$.<br>
        <strong style="color:red">We can reduce the variance by averaging a set of observations.</strong>
    </li><br>
    <li><strong>The idea: </strong> Have multiple training sets and construct a decision tree (without pruning) on every single dataset !!!</li>
    <li><strong>Problem: </strong>We do not have several training sets.</li>
    <li><strong style="color:green">Solution: </strong>We take repeated samples with replacement (bootstrap samples) from the single dataset, construct a tree on each sample, and finally aggregate all predictions at the end.<ul><li>All trees are fully grown, unpruned decision trees.</li></ul></li>
    <li><strong>Regression Problem: </strong>We take the average.</li>
    <li><strong>Classification Problem: </strong>We take the majority vote.</li>
</ul> 
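
The $\frac{V}{n}$ claim above can be checked numerically. This is a small simulation sketch using only the standard library; the helper name `variance_of_mean`, the Gaussian draws with $V = 1$, and the number of trials are assumptions chosen for illustration.

```python
import random
import statistics

def variance_of_mean(n, trials=20000, seed=0):
    """Empirical variance of the mean of n independent draws with variance 1."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(draws) / n)
    return statistics.pvariance(means)

# Compare the simulated variance with the theoretical value V / n (V = 1).
for n in (1, 4, 16):
    print(f"n={n:2d}  empirical={variance_of_mean(n):.4f}  theory={1 / n:.4f}")
```

The result relies on the draws being independent. Bagged trees are trained on overlapping samples of the same data, so they are not independent and the reduction is smaller in practice.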

<strong>DISADVANTAGES</strong>
<ul>
    <li>The constructed trees are highly correlated because every tree considers all the features at every split.</li>
    <li>Many datasets have one strong predictor/feature amongst all the features. All the bagged trees tend to make the same splits on it, as they all share the same features !!! <br>
        <strong>Because of this all the trees look similar.</strong></li>
    <li>We can do better.</li>
</ul>
<img src = "Images/bagging.png"/>
<p style="text-align:center"><em>Image from Holczer Balazs's ML course on Udemy.</em></p>
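
The two building blocks of bagging described in this section, bootstrap sampling and aggregation, can be sketched in plain Python. The helper names `bootstrap_sample`, `majority_vote` and `average` are hypothetical and only illustrate the idea; the tree-fitting step itself is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows from rows, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Classification: the most common label among the trees' predictions."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Regression: the mean of the trees' predictions."""
    return sum(values) / len(values)

rng = random.Random(0)
rows = list(range(10))  # stand-in for the indices of 10 training rows

# One bootstrap sample per tree; some rows repeat, others are left out.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))

print(majority_vote(["cat", "dog", "cat"]))  # cat
print(average([2.0, 3.0, 4.0]))              # 3.0
```

Each tree would be fit on one of the samples, and the per-tree predictions for a new point would then go through `majority_vote` or `average`.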

# Random Forest Classifier 

<ul>
    <li>This algorithm <strong>decorrelates</strong> those single decision trees that were constructed.</li>
    <li>This reduces variance even more when averaging trees.</li>
    <li>Similar to bagging:<strong style="color:red"> We keep constructing decision trees on the training data. But at every split in the tree, a random selection of features / predictors is chosen from the full feature set.</strong><br><br>
        <ul>
            <li>The number of features considered at a given split is approximately equal to the square root of the total number of features.</li>
            <li><strong>Bagging: </strong>algorithm searches over all $N$ features to find the best feature that splits the data at that node.</li>
            <li><strong>Random Forest Classifier: </strong>algorithm searches over a random subset of about $\sqrt{N}$ features to find the best one.</li>
        </ul>
    </li>
</ul>
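
The per-split feature selection can be sketched as follows. The helper `candidate_features` is hypothetical and only mimics the "about $\sqrt{N}$ features" rule; it is not scikit-learn's internal implementation.

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)

# 4 features as in the Iris dataset, 64 pixel features as in the digits dataset.
for n_features in (4, 64):
    for split in range(3):
        chosen = candidate_features(n_features, rng)
        print(f"N={n_features:2d} split {split}: {chosen}")
```

Every split draws a fresh subset, so even the strongest feature is unavailable at many splits, and different trees end up with different structures.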
<img src = "Images/RFC.png" style="width:700px"/>
<p style="text-align:center"><em>Image from Holczer Balazs's ML course on Udemy.</em></p><br>
<strong style="color:green">Why is this good?</strong>
<ul>
    <li>If one or a few features are very strong predictors for the response variable (TARGET), bagged trees would split on them in almost every tree, which makes the trees correlated. With a random subset of features at each split, many splits cannot use the strong features, so the trees differ more and are less correlated.</li>
    <li>At some point the variance stops decreasing no matter how many more trees we add to our random forest, but adding more trees does not make the forest overfit!!!</li>
</ul>
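
The benefit of decorrelation can be made precise with a standard result. As a sketch, assume the $n$ tree predictions $x_1,x_2,...,x_n$ are identically distributed with variance $V$ and pairwise correlation $\rho$ (the symbol $\rho$ is introduced here and is not in the original notes). The variance of their average is

$$
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \rho V + \frac{1-\rho}{n} V
$$

With $\rho = 0$ this reduces to the $\frac{V}{n}$ of independent variables. As $n$ grows the second term vanishes and the variance levels off at $\rho V$, so beyond some point only a lower correlation between trees, not more trees, reduces the variance further.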

# <span style="color:purple">CODE (IRIS DATASET)</span>

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import datasets

In [4]:
iris_data = datasets.load_iris()

features = iris_data.data
targets = iris_data.target

In [12]:
feature_train, feature_test, target_train, target_test = train_test_split(features, targets, test_size=.2)

#max_features: The number of features to consider when looking for the best split.
#n_estimators: The number of trees in the forest.
model = RandomForestClassifier(n_estimators=1000, max_features='sqrt')

fitted_model = model.fit(feature_train, target_train)
predictions = fitted_model.predict(feature_test)

In [13]:
print("Confusion matrix for Iris dataset : ")
print(confusion_matrix(target_test, predictions))
print()
print(f"Accuracy score for Iris dataset : {accuracy_score(target_test, predictions)}")

Confusion matrix for Iris dataset : 
[[11  0  0]
 [ 0  5  0]
 [ 0  1 13]]

Accuracy score for Iris dataset : 0.9666666666666667


# <span style="color:purple">CODE (CREDIT DATASET)</span>

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_validate

In [15]:
# logistic regression accuracy: 93%
# we do better with knn: 97.5% !!!
# 84% simple kNN without normalizing the dataset
# we can achieve ~ 99% with random forests

credit_data = pd.read_csv("../LogisticRegression/credit_data.csv")

features_credit = credit_data[["income", "age", "loan"]]
targets_credit = credit_data.default

In [17]:
X = np.array(features_credit).reshape(-1, 3)
y = np.array(targets_credit)

model_credit = RandomForestClassifier(n_estimators=100)
predicted = cross_validate(model_credit, X, y, cv=10)

print(f"Mean accuracy over 10 folds is : {np.mean(predicted['test_score'])}")

Mean accuracy over 10 folds is : 0.9894923373084328


# <span style="color:purple">CODE (DIGIT DATASET and PARALLEL COMPUTING)</span>

In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
digit_data = datasets.load_digits()

image_features = digit_data.images.reshape((len(digit_data.images), -1))
image_targets = digit_data.target

In [21]:
#n_jobs :The number of jobs to run in parallel. 
#fit, predict, decision_path and apply are all parallelized over the trees. 
#-1 means using all processors. 
random_forest_model = RandomForestClassifier(n_jobs=-1, max_features='sqrt')

feature_train_digit, feature_test_digit, target_train_digit, target_test_digit = train_test_split(image_features, image_targets, test_size=.2)

In [22]:
#min_samples_leaf : The minimum number of samples required to be at a leaf node.

param_grid = {
    "n_estimators": [10, 100, 500, 1000],
    "max_depth": [1, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4, 10, 15, 30, 50]
}

grid_search = GridSearchCV(estimator=random_forest_model, param_grid=param_grid, cv=10)
grid_search.fit(feature_train_digit, target_train_digit)
print("Best parameters are : ")
print(grid_search.best_params_)

Best parameters are : 
{'max_depth': 15, 'min_samples_leaf': 1, 'n_estimators': 100}


In [23]:
optimal_estimators = grid_search.best_params_.get("n_estimators")
optimal_depth = grid_search.best_params_.get("max_depth")
optimal_leaf = grid_search.best_params_.get("min_samples_leaf")

print(f"Optimal n_estimators: {optimal_estimators}")
print(f"Optimal max_depth: {optimal_depth}")
print(f"Optimal min_samples_leaf: {optimal_leaf}")

Optimal n_estimators: 100
Optimal max_depth: 15
Optimal min_samples_leaf: 1


In [26]:
grid_predictions = grid_search.predict(feature_test_digit)
print("Confusion matrix for the digit dataset is : ")
print(confusion_matrix(target_test_digit, grid_predictions))
print()
print(f"Accuracy score for digit dataset is : {accuracy_score(target_test_digit, grid_predictions)}")

Confusion matrix for the digit dataset is : 
[[32  0  0  0  2  0  0  0  0  0]
 [ 0 29  0  0  0  0  0  0  0  0]
 [ 1  0 39  0  0  0  0  0  0  0]
 [ 0  0  0 42  0  0  0  1  2  0]
 [ 0  0  0  0 35  0  0  0  0  0]
 [ 0  0  0  0  0 32  0  0  0  0]
 [ 0  0  0  0  0  1 37  0  0  0]
 [ 0  0  0  0  0  0  0 30  0  0]
 [ 0  1  1  0  0  0  0  0 38  0]
 [ 0  0  0  1  0  0  0  0  0 36]]

Accuracy score for digit dataset is : 0.9722222222222222


**The train/test split and the forest itself are random, so the score changes from run to run. Running the algorithm a few more times we can get an accuracy of approximately 99.5%!!!**
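
A small sketch of this run-to-run variation, assuming scikit-learn is available. The seeds `0` to `4` and the fixed `n_estimators=100` are arbitrary choices for illustration, and the grid search is skipped here to keep the example short.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

accuracies = []
for seed in range(5):  # arbitrary seeds (assumption for illustration)
    # A different seed gives a different split and a different forest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                   random_state=seed)
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

print([round(a, 4) for a in accuracies])
print(f"min={min(accuracies):.4f}  max={max(accuracies):.4f}")
```

Fixing `random_state` makes a single run reproducible; reporting the spread over several seeds gives a more honest picture than quoting the best run.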

# <span style="color:darkgreen">END</span>