# Random Forest

## What will you learn in this course? 🧐🧐

Random Forest is a widely used way to reduce overfitting and build a powerful, robust algorithm on top of decision trees. In this course you'll learn:

* Interpretation
* Random Forest
    * Bagging Principle
    * Bootstrap
    * Aggregating
* Intuitive understanding 
* Randomness ?
* Feature importance
    * Mean decrease Accuracy
    * Mean decrease Gini
    * Variable importance

## Interpretation

Because the final prediction aggregates many models, a Random Forest is difficult to interpret directly. In practice, however, two methods are used to assess the importance of each variable for the prediction.

## Random Forest 🌲🌲🌲🌲

As its name implies, a Random Forest is nothing more than a set of random trees that we make cooperate to obtain better regression or classification results.

### Bagging Principle

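Bagging is a contraction of two words: Bootstrap and Aggregating.
The principle of Bagging is very simple. Let $Y$ be the target variable, linked to the explanatory variables $X_1,...,X_p$ by a function $f$ such that $Y=f(X)+\epsilon$, and let $n$ be the number of observations. Bagging draws several bootstrap samples from these $n$ observations, fits one model per sample, and aggregates the predictions of these models.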

### Bootstrap

Bootstrapping is a resampling process that artificially generates new data samples from a single dataset without changing the distribution of the variables in the dataset. The principle is simple: given a dataset containing $n$ observations, we want to create a sample of size $m$. We draw $m$ observations with replacement from the original dataset, each observation having a probability of $\frac{1}{n}$ of being selected at each draw (a uniform draw with replacement). The equiprobability of the draw is essential so that the sample follows the same distribution as the original dataset.
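As an illustration, the draw described above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the function name `bootstrap_sample` and the toy data are assumptions made for the example.

```python
import random


def bootstrap_sample(observations, m, seed=None):
    """Draw m observations with replacement (hypothetical helper).

    Each observation has probability 1/n of being picked at every draw.
    """
    rng = random.Random(seed)
    # Without weights, choices() samples uniformly and with replacement.
    return rng.choices(observations, k=m)


data = [2.1, 3.4, 1.8, 5.0, 4.2]
sample = bootstrap_sample(data, m=len(data), seed=0)
print(sample)  # some values may appear several times, others not at all
```

Because the draw is made with replacement, a sample of size $m=n$ usually contains duplicates and leaves some of the original observations out.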

### Aggregating

By drawing $B$ independent samples $\{z_b\}_{b\in[[1,B]]}$ from the set of observations, we can train $B$ different models. The aggregated prediction given by the $B$ models derived from the $B$ samples is written:

* $Y$ quantitative: the aggregated model is the mean of the functions estimated by the models, i.e. the mean of the predicted values of $Y$ for a given observation.

$$
\hat{f}_B(.)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}_{z_b}(.)
$$

* $Y$ qualitative: the aggregated model is the majority vote among the functions estimated by the models, i.e. the category of $Y$ most represented among the responses of the different models for a given observation.

$$
\hat{f}_B(.)=\arg\max_{j}\mathrm{Card}\{b \mid \hat{f}_{z_b}(.)=j\}
$$
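The two aggregation rules can be sketched as follows. This is a minimal illustration, assuming the predictions of the $B$ models for a single observation are already available as a list; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(predictions):
    """Quantitative Y: average the predictions of the B models."""
    return mean(predictions)


def aggregate_classification(predictions):
    """Qualitative Y: return the category predicted by the most models."""
    # most_common(1) gives the most frequent label; on a tie, the label
    # seen first in the list wins.
    return Counter(predictions).most_common(1)[0][0]


print(aggregate_regression([2.0, 3.0, 4.0]))            # 3.0
print(aggregate_classification(["cat", "dog", "cat"]))  # cat
```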

The main difficulty of bagging is building $B$ independent samples. Unless the dataset contains a very large number of observations, this constraint is difficult to satisfy in most cases.

## Intuitive understanding 🤓🤓

The first idea behind Random Forest is to apply bagging to several random trees. Several strategies are possible for pruning the trees built this way:

* One can keep the complete trees and possibly limit the minimum number of observations at the terminal nodes.
* Keep at most $q$ leaves or limit the depth of the tree to $q$ node levels.
* Adopt the method seen above for a single tree, i.e. build the complete tree then prune by cross validation.

In general, the first strategy is retained because it is a good compromise between quality of estimation and computational cost. Each tree built this way has a very low bias and a large variance, but aggregating the models reduces this variance.

This algorithm is very simple to implement, which is a great advantage. However, the number of models that must be computed before the test error (also called validation error) stabilizes can be very large. The final model also takes up a lot of disk space, because the complete structure of every tree must be stored in order to make predictions. Finally, the growing number of trees participating in the model makes it more difficult, if not impossible, to interpret the model as was possible with a single tree.

The second idea is to improve the bagging method so that Random Forests are built on data samples that are as "independent" as possible. Chance intervenes not only when selecting the observations that make up each training sample, but also in the choice of the explanatory variables retained for each sample on which a random tree is built.

This double randomness in the selection of observations and explanatory variables has several advantages: it brings the samples closer to the hypothesis of independence, it reduces the number of calculations required to build each tree, and it reduces the risk of errors linked to possible correlations between explanatory variables.
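To see why samples that are closer to independent help, consider the following standard result, stated here as a sketch under simplifying assumptions. Suppose the $B$ tree predictions at a point $x$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (both symbols are introduced for this illustration only). The variance of the aggregated prediction is then:

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_{z_b}(x)\right)=\rho\sigma^2+\frac{1-\rho}{B}\sigma^2
$$

The second term vanishes as $B$ grows, but the first one does not. Lowering the correlation $\rho$ between trees, which is what the random selection of variables aims for, is therefore what allows the variance to keep decreasing.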

One last remark: let $Y$ be the target variable and $X_1,...,X_p$ the $p$ explanatory variables at our disposal. In general, the number of variables kept per tree is $\sqrt{p}$ for a classification and $\frac{p}{3}$ for a regression.
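The rule of thumb above can be sketched as follows. The function names and the rounding convention (nearest integer for $\sqrt{p}$, integer division for $\frac{p}{3}$, at least one variable) are assumptions made for this example, not a reference implementation.

```python
import math
import random


def n_features_per_tree(p, task):
    """Rule-of-thumb number of variables: sqrt(p) or p/3 (hypothetical helper)."""
    if task == "classification":
        return max(1, round(math.sqrt(p)))
    return max(1, p // 3)


def draw_feature_subset(feature_names, task, seed=None):
    """Randomly pick the variables retained for one tree."""
    rng = random.Random(seed)
    k = n_features_per_tree(len(feature_names), task)
    # sample() draws without replacement: a variable is kept at most once.
    return rng.sample(feature_names, k)


features = [f"X{i}" for i in range(1, 17)]  # p = 16 toy variable names
print(n_features_per_tree(16, "classification"))  # 4
print(n_features_per_tree(16, "regression"))      # 5
print(draw_feature_subset(features, "classification", seed=0))
```

Note that some libraries draw this random subset again at every split rather than once per tree. scikit-learn's `max_features` parameter, for instance, is documented as the number of features considered when looking for the best split.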

## Randomness ?

At this point you might ask yourself: how is a random forest random?
It is a fair question, since a decision tree is by definition not random: it is entirely deterministic by construction. The first source of randomness in random forests is the bootstrap, which generates a collection of slightly different training sets that produce different models. The second source is what we call bootstrap on the columns: not all the explanatory variables need to be used for each decision tree in the forest, and a good practice is to randomly select a subset of variables for each tree, which makes the trees in the forest all the more diverse. In addition, randomly selecting variables for each individual tree allows all explanatory variables to contribute to the final prediction, which is not necessarily the case for a single decision tree.

## Feature importance 📊📊

### Mean decrease Accuracy

This method consists in randomly permuting the values of an explanatory variable. The difference between the validation error before and after the permutation is then measured. The more the validation error increases after the permutation, the more important the variable is considered to be for predicting the target variable.
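A minimal sketch of this permutation procedure is given below. It assumes the fitted model is available as a plain Python function and that validation rows are lists of values. The names `error_rate`, `permutation_importance` and `toy_model`, as well as the toy data, are hypothetical.

```python
import random


def error_rate(model, X, y):
    """Share of validation observations the model gets wrong."""
    wrong = sum(1 for row, target in zip(X, y) if model(row) != target)
    return wrong / len(y)


def permutation_importance(model, X, y, column, n_repeats=10, seed=0):
    """Average increase in validation error after shuffling one column."""
    rng = random.Random(seed)
    baseline = error_rate(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        increases.append(error_rate(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


def toy_model(row):
    # Stand-in for a fitted model: it only looks at the first variable.
    return int(row[0] > 0.5)


X = [[0.1, 7.0], [0.9, 3.0], [0.2, 5.0], [0.8, 1.0], [0.3, 9.0], [0.7, 2.0]]
y = [0, 1, 0, 1, 0, 1]
imp0 = permutation_importance(toy_model, X, y, column=0)
imp1 = permutation_importance(toy_model, X, y, column=1)
print(imp0, imp1)
```

Since `toy_model` ignores the second variable, permuting it leaves the error unchanged and its importance is zero, while permuting the first variable degrades the predictions.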

### Mean decrease Gini

This method evaluates the importance of a variable at the level of a node: it measures the decrease of the heterogeneity function (here, the Gini impurity) obtained when the explanatory variable used at the node is replaced by the one we want to evaluate. The overall importance of the variable is then the sum of these decreases in heterogeneity, weighted by the number of observations at each node.
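To make the quantities concrete, here is a minimal sketch of the weighted decrease in Gini impurity produced by one candidate split at one node. The function names and the toy labels are assumptions for the example, and the weight is taken as the share of all observations that reach the node.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini_decrease(parent, left, right, n_total):
    """Decrease in impurity for one split, weighted by the node size."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


parent = ["a", "a", "b", "b"]
decrease = weighted_gini_decrease(parent, ["a", "a"], ["b", "b"], n_total=4)
print(decrease)  # 0.5
```

Here the split separates the two classes perfectly, so the impurity drops from 0.5 to 0 and the whole parent impurity is credited to the variable used for the split.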

### Variable importance

A very useful perk of working with decision trees and random forests is that they make it possible to quantify variable importance. Contrary to F-test or Chi2 scores, the variable importance measured by decision trees and random forests is able to capture nonlinear dependencies between the explanatory variables and the target variable.

Variable importance is calculated as follows: first, the importance of each variable is initialized to zero. Then, every time a variable is selected to perform a split, the resulting reduction in heterogeneity is added to that variable's importance. For a decision tree, the importance of a variable is the sum of its contributions to the construction of the tree. For Random Forests, variable importance is the average of the variable importances calculated for each tree.
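The accumulation described above can be sketched as follows. This is an illustration only: each tree is represented by a hypothetical list of `(variable, heterogeneity_decrease)` pairs, one per split, and the variable names and values are made up.

```python
from collections import defaultdict


def tree_importance(splits):
    """Sum, per variable, the heterogeneity decreases of one tree."""
    importance = defaultdict(float)  # every variable starts at zero
    for variable, decrease in splits:
        importance[variable] += decrease
    return dict(importance)


def forest_importance(trees, variables):
    """Average the per-tree importances over all trees of the forest."""
    per_tree = [tree_importance(splits) for splits in trees]
    return {
        v: sum(imp.get(v, 0.0) for imp in per_tree) / len(trees)
        for v in variables
    }


trees = [
    [("age", 0.5), ("income", 0.25), ("age", 0.25)],  # splits of tree 1
    [("income", 0.25)],                               # splits of tree 2
]
result = forest_importance(trees, ["age", "income"])
print(result)  # {'age': 0.375, 'income': 0.25}
```

A variable that is never used in a given tree simply contributes zero for that tree, which is why `age` ends up with $(0.75+0)/2=0.375$.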

