## What is random in Random Forest?

> Random feature selection is the primary reason why random forests (RFs) are so called.

### Random Forest

* The random forest algorithm is an extension of the bagging method (also known as bootstrap aggregation): it uses both bagging and feature randomness to create a forest of largely uncorrelated decision trees.

* Feature randomness, also known as feature bagging or “the random subspace method”, generates a random subset of features, which ensures low correlation among decision trees. This is a key difference between decision trees and random forests. 

* Random forest algorithms have three main hyperparameters, which need to be set before training: node size, the number of trees, and the number of features sampled. From there, the random forest can be used to solve regression or classification problems.
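
As a rough illustration of the three hyperparameters above, here is a minimal sketch using scikit-learn's RandomForestClassifier on the breast cancer dataset. The specific values (100 trees, "sqrt" features, leaf size 1) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees (assumed value)
    max_features="sqrt",  # number of features sampled at each split
    min_samples_leaf=1,   # node size: minimum number of samples in a leaf
    oob_score=True,       # also score the forest on out-of-bag rows
    random_state=12,
)
rf.fit(X, y)

rf_oob_accuracy = rf.oob_score_
print(f"OOB accuracy: {rf_oob_accuracy:.3f}")
```

For a regression problem, RandomForestRegressor accepts the same three settings.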

### Decision Tree vs Random Forest

* Training decision trees (DTs) has always been a challenge because they tend to overfit.
* While a decision tree considers every feature when choosing a split, each tree in a random forest considers only a random subset of the features.
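
One way to see the overfitting point is to compare a single tree's training accuracy with its cross-validated accuracy, and then cross-validate a forest on the same data. This is a sketch under assumed settings (5 folds, 100 trees); the exact numbers depend on the scikit-learn version and the seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=12)
forest = RandomForestClassifier(n_estimators=100, random_state=12)

# A fully grown tree can memorise its training data ...
tree_train_acc = tree.fit(X, y).score(X, y)
# ... so the honest comparison is on held-out folds.
tree_cv_acc = cross_val_score(tree, X, y, cv=5).mean()
forest_cv_acc = cross_val_score(forest, X, y, cv=5).mean()

print(f"tree   train accuracy: {tree_train_acc:.3f}")
print(f"tree   CV accuracy:    {tree_cv_acc:.3f}")
print(f"forest CV accuracy:    {forest_cv_acc:.3f}")
```

The gap between the tree's training and cross-validated accuracy is the overfitting; the forest typically narrows it.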

### Random Subspace & Random Forest

* The random subspace method is the general technique of choosing a random subset of features from the data on which to fit an estimator.
* The random subspace method, when combined with bagged decision trees, gives rise to random forests.
* There could be more sophisticated extensions of the random subspace method, for example creating a subspace that is a linear combination of the original features. But RFs work remarkably well with just a random selection of features.
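
The combination described above can be sketched by hand: each tree gets a bootstrap sample of the rows (bagging) and a random subset of the columns (random subspace), and the ensemble predicts by majority vote. The tree count, subset size, and train/test split below are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12, stratify=y
)

rng = np.random.default_rng(12)
n_rows, n_cols = X_train.shape
n_trees, n_sub = 25, 10  # assumed: 25 trees, 10 of the 30 features per tree

ensemble = []
for _ in range(n_trees):
    rows = rng.integers(0, n_rows, size=n_rows)           # rows, with replacement
    cols = rng.choice(n_cols, size=n_sub, replace=False)  # features, without replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[rows][:, cols], y_train[rows])
    ensemble.append((tree, cols))

# Majority vote: labels are 0/1, so the mean vote can be thresholded at 0.5.
votes = np.array([tree.predict(X_test[:, cols]) for tree, cols in ensemble])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
subspace_accuracy = (y_pred == y_test).mean()
print(f"Hold-out accuracy: {subspace_accuracy:.3f}")
```

This sketch draws one feature subset per tree. scikit-learn's RandomForestClassifier instead redraws the candidate features at every split.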

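> NOTE: RFs are insensitive to feature scales and fairly robust to noisy or correlated data. Averaging many trees also makes them far less prone to overfitting than a single tree, though not immune to it.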
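
The scale claim in the note above can be checked informally: tree splits depend on the ordering of feature values, not their units, so rescaling the inputs should leave accuracy essentially unchanged. The factor of 1000 and the forest settings here are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
X_rescaled = X * 1000.0  # arbitrary change of units for every feature

forest = RandomForestClassifier(n_estimators=50, random_state=12)
acc_raw = cross_val_score(forest, X, y, cv=5).mean()
acc_rescaled = cross_val_score(forest, X_rescaled, y, cv=5).mean()

print(f"raw features:      {acc_raw:.3f}")
print(f"rescaled features: {acc_rescaled:.3f}")
```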

### Implementation with code

We can build further intuition by trying these ideas out in Python. Starting from the root idea of randomness, we will implement a bagged tree. Scikit-learn implements bagged trees through the generic sklearn.ensemble.BaggingClassifier API.

In [19]:
import pandas as pd
import numpy as np

In [9]:
# Import data

from sklearn.datasets import load_breast_cancer

In [22]:
dataset = load_breast_cancer()
dataset

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [61]:
X = dataset["data"]
y = dataset["target"]
print(X.shape)
print(y.shape)
print(X.dtype)
print(y.dtype)

(569, 30)
(569,)
float64
int64


In [33]:
dataset["target_names"]

array(['malignant', 'benign'], dtype='<U9')

In [88]:
df = pd.DataFrame(data= np.c_[X, y], columns=list(dataset.feature_names) + ["target"])
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0.0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0.0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0.0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0.0 |


In [90]:
# Check for missing values
df.isna().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

## Bagged Decision Trees Implementation
The following three parameters are important in a bagged tree classifier:
* max_samples - the number (int) or fraction (float) of training rows drawn for each tree in the ensemble
* max_features - the number (int) or fraction (float) of features drawn for each tree
* bootstrap - the sampling strategy for rows; ‘bootstrap=True’ means sampling with replacement.
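
To make the row sampling concrete, here is a small NumPy sketch of what a single tree's bootstrap draw looks like with max_samples=0.3 on 569 rows. It mimics the idea only; truncating 0.3 * 569 to 170 draws is an assumption about the rounding, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12)
n_rows = 569                 # rows in the breast cancer dataset
n_draws = int(0.3 * n_rows)  # max_samples=0.3 -> 170 draws (assumed truncation)

# bootstrap=True: draw row indices with replacement, so duplicates can occur.
in_bag = rng.integers(0, n_rows, size=n_draws)

# Rows that were never drawn are "out of bag" (OOB) for this tree and can be
# used to score it without a separate validation set.
oob_mask = np.ones(n_rows, dtype=bool)
oob_mask[in_bag] = False

n_unique = np.unique(in_bag).size
n_oob = int(oob_mask.sum())
print(f"draws={n_draws}, unique in-bag rows={n_unique}, OOB rows={n_oob}")
```

Because of the duplicates, the number of distinct in-bag rows is at most the number of draws, and every remaining row is out of bag for that tree.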

In [92]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

In [94]:
# Note: scikit-learn 1.2 renamed base_estimator to estimator (base_estimator was removed in 1.4).
bgc = BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=12)

In [97]:
bgc.set_params(
    n_estimators = 10,  # Number of trees in the ensemble
    max_samples = 0.3,  # The basic idea of bagging: each tree sees only 30% of the rows; default = 1.0
    max_features = 1.0, # Keep the default of 1.0 for now: every tree sees all features
    bootstrap = True,   # Sample rows with replacement
    verbose = 0,
    oob_score = True,   # Score the ensemble on out-of-bag rows (newer scikit-learn rejects the integer 1)
    n_jobs = 3          # Fit the trees in 3 parallel jobs
)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=0.3,
                  n_jobs=3, oob_score=True, random_state=12)

In [98]:
bgc_model = bgc.fit(X, y)

In [107]:
print(f"Here we have an error of {round((1 - bgc_model.oob_score_) * 100,2)}% on Out of Bag (OOB) data using the bagged decision trees.")

Here we have an error of 5.45% on Out of Bag (OOB) data using the bagged decision trees.
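
The OOB estimate above can be cross-checked against a conventional hold-out set. The sketch below refits the same kind of ensemble on a 70/30 split (the split and seed are assumptions) and reports both errors. The estimator is passed positionally so the call does not depend on whether the scikit-learn version names the argument base_estimator or estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12, stratify=y
)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner, passed positionally
    n_estimators=10,
    max_samples=0.3,
    bootstrap=True,
    oob_score=True,
    random_state=12,
)
bag.fit(X_train, y_train)

oob_error = 1 - bag.oob_score_            # estimated from training rows only
test_error = 1 - bag.score(X_test, y_test)  # measured on unseen rows
print(f"OOB error:      {oob_error * 100:.2f}%")
print(f"Hold-out error: {test_error * 100:.2f}%")
```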


> The Random Subspace method, when combined with bagged decision trees, gives rise to Random Forests.

In [118]:
# Let's apply feature subsampling and see (on scikit-learn >= 1.2, pass estimator= instead of base_estimator=)
bgc2 = BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=12)
bgc2.set_params(
    n_estimators = 10,  # Number of trees in the ensemble
    max_samples = 0.3,  # The basic idea of bagging: each tree sees only 30% of the rows; default = 1.0
    max_features = 0.7, # Let each tree see only 70% of the features
    bootstrap = True,   # Sample rows with replacement
    verbose = 0,
    oob_score = True,   # Score the ensemble on out-of-bag rows
    n_jobs = 3          # Fit the trees in 3 parallel jobs
)
bgc_model2 = bgc2.fit(X, y)
print(f"Here we have an error of {round((1 - bgc_model2.oob_score_) * 100,2)}% on Out of Bag (OOB) data using the bagged decision trees.")

Here we have an error of 5.98% on Out of Bag (OOB) data using the bagged decision trees.
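
With only 10 trees, a single run's OOB error is noisy, so a gap of about half a percentage point between the two settings may not be meaningful. One hedged way to gauge this is to repeat both fits over several seeds and compare the spread. The helper `oob_errors` and the choice of 10 seeds are hypothetical, introduced only for this sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def oob_errors(max_features, seeds=range(10)):
    """Return the OOB error (%) of a 10-tree bagged ensemble for each seed."""
    errors = []
    for seed in seeds:
        model = BaggingClassifier(
            DecisionTreeClassifier(),  # base learner, passed positionally
            n_estimators=10,
            max_samples=0.3,
            max_features=max_features,
            bootstrap=True,
            oob_score=True,
            random_state=seed,
        ).fit(X, y)
        errors.append((1 - model.oob_score_) * 100)
    return np.array(errors)

errors_all = oob_errors(max_features=1.0)  # plain bagging
errors_sub = oob_errors(max_features=0.7)  # bagging + random subspace

print(f"all features: {errors_all.mean():.2f}% +/- {errors_all.std():.2f}")
print(f"70% features: {errors_sub.mean():.2f}% +/- {errors_sub.std():.2f}")
```

If the two ranges overlap heavily, more trees or more repetitions are needed before preferring one setting.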


## Random Forest Implementation

The RF implementation allows us to control two sets of parameters:
* The first set is used to create the forest.
    * n_estimators - the number of trees the random forest will have

* The second set is specific to the individual trees in the forest.
    * max_features - the number (or fraction) of features considered at each split of a tree