# Random Forest Algorithm

Random forest, like a decision tree, is a supervised machine learning algorithm used for classification and regression problems. It builds decision trees on different samples of the data and combines them by majority vote for classification or by averaging for regression.

It is an **ensemble** machine learning algorithm. Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions rather than an individual model.

*Ensemble learning uses two main types of methods:*
1. **Bagging** – It creates different training subsets from the training data by sampling with replacement, and the final output is based on majority voting or averaging. For example, Random Forest.

2. **Boosting** – It combines weak learners into a strong learner by building models sequentially, so that each new model tries to correct the errors of the previous ones. For example, AdaBoost, XGBoost.
![image.png](attachment:image.png)
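
To make the two families concrete, here is a minimal sketch that fits one bagging model and one boosting model on the same data. It assumes scikit-learn is installed; the synthetic data set and the settings (50 estimators, fixed seeds) are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: trees are trained independently on bootstrap samples.
bagging_model = RandomForestClassifier(n_estimators=50, random_state=0)
# Boosting: weak learners are trained one after another, each one
# focusing on the examples the previous ones got wrong.
boosting_model = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging_model), ("boosting", boosting_model)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data

print(scores)
```

Which of the two scores is higher depends on the data, so the comparison should not be read as a general ranking.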

# Bagging
**Bagging**, also known as **Bootstrap Aggregation**, is the ensemble technique used by random forest. 

1. Bagging draws a random sample of rows from the data set for each model.
2. Each sample (a bootstrap sample) is drawn from the original data with replacement, which is known as row sampling. This step of **row sampling with replacement** is called **bootstrap**.
3. Each model is then **trained independently** on its own bootstrap sample and generates its own results.
4. The final output is based on **majority voting** (classification) or **averaging** (regression) **after combining the results of all models**. This step of combining all the results into a single output is known as **aggregation**.

![image.png](attachment:image.png)
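
The bootstrap and aggregation steps above can be sketched by hand. This is a simplified illustration of plain bagging with scikit-learn decision trees, not how `RandomForestClassifier` is implemented internally; in particular it does no feature subsampling. The data set, the seed, and the number of models are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (labels are 0 and 1), used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n_models = 25  # odd, so a majority vote can never tie

models = []
for _ in range(n_models):
    # Bootstrap: draw row indices with replacement (row sampling).
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[idx], y[idx])  # each model is trained independently
    models.append(tree)

# Aggregation: collect every model's prediction and take the majority vote.
all_preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
votes_for_one = all_preds.sum(axis=0)
bagged_pred = (votes_for_one > n_models / 2).astype(int)

bagged_accuracy = (bagged_pred == y).mean()
print("training accuracy of the bagged vote:", bagged_accuracy)
```

For regression, the vote would be replaced by the mean of the individual predictions.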

# Important Characteristics
1. **Diversity** - Not all attributes/variables/features are considered while making an individual tree, so each tree is different.
    
2. **Less sensitive to the curse of dimensionality** - Since each tree does not consider all the features, the feature space it has to search is reduced.

3. **Parallelization** - Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.
   
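4.  **Train-Test split** - In a random forest we don’t necessarily have to set aside separate test data, because each tree is trained on a bootstrap sample and therefore never sees some of the rows (its out-of-bag rows). With bootstrap=True and the default max_samples=None, roughly 37% of the rows are left out of each tree. Setting max_samples=0.7 for the RandomForestClassifier draws fewer rows per tree, which leaves roughly 50% of the rows out, because the rows are still sampled with replacement.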
    
5.  **Stability** - Stability arises because the result is based on majority voting or averaging over many trees, so no single tree dominates the output.
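
The share of rows that a single bootstrap sample misses can be estimated with a quick simulation. This is a minimal NumPy sketch, assuming rows are drawn with replacement as in bagging; the helper `unseen_fraction` and the row count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # hypothetical data set size


def unseen_fraction(sample_fraction):
    """Fraction of rows missing from one bootstrap sample of the given size."""
    n_draws = int(sample_fraction * n_rows)
    drawn = rng.integers(0, n_rows, size=n_draws)  # sampling with replacement
    return 1 - len(np.unique(drawn)) / n_rows


full = unseen_fraction(1.0)     # as many draws as rows
partial = unseen_fraction(0.7)  # 70% as many draws as rows

# Theory: the unseen share approaches exp(-sample_fraction) for large data sets.
print(round(full, 3), round(float(np.exp(-1.0)), 3))
print(round(partial, 3), round(float(np.exp(-0.7)), 3))
```

Drawing fewer rows per tree increases the unseen share, and because rows can repeat within a sample, the unseen share is larger than one minus the sampling fraction.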

# Difference Between Decision Tree & Random Forest
Although random forest is a collection of decision trees, there are a lot of differences in their behavior.
<img src="attachment:%E8%9E%A2%E5%B9%95%E6%93%B7%E5%8F%96%E7%95%AB%E9%9D%A2%20%28536%29.png" alt="drawing" width="500"/>


# Important Hyperparameters

### Hyperparameters that increase the predictive power:

1. n_estimators – **number of trees** that the algorithm builds before averaging the predictions (default: 100)

2. max_features – maximum number of features that random forest considers when splitting a node (seen in a decision tree too)

3. min_samples_leaf – the minimum number of samples required at a leaf node; a split is only considered if it leaves at least this many samples in each branch (seen in a decision tree too)

4. oob_score – OOB means out-of-bag. It is a random forest **validation method** that works much like cross-validation. Each example $X_i$ is predicted using only the trees whose bootstrap sample did not contain $X_i$, and the oob_score is the accuracy of these predictions (for classification). The examples that a tree did not see during training are called its out-of-bag samples. Only available if bootstrap=True. (default: False)
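
As a rough check on the out-of-bag idea, the OOB accuracy can be compared with the accuracy on a separate held-out split. This is a minimal sketch assuming scikit-learn; the synthetic data and the settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,   # required for out-of-bag scoring
    oob_score=True,   # compute the OOB accuracy while fitting
    random_state=0,
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_              # estimated from training rows only
test_accuracy = forest.score(X_test, y_test)  # measured on unseen rows
print("OOB accuracy:", oob_accuracy)
print("test accuracy:", test_accuracy)
```

The two numbers are usually close but not identical, since the OOB estimate uses only a subset of the trees for each row.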

### Hyperparameters that increase the speed:

1. n_jobs – it tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor, but if the value is -1 it uses all available processors. (default: None, which normally means 1)

2. n_estimators - fewer trees train and predict faster. Reduce this parameter only if doing so does not deteriorate the model accuracy.

Note that **it is recommended to leave max_depth at its default: None**. This is because **multiple decorrelated overfitted trees as a whole can result in lower variance** in the resulting forest. This means the training process should be **started by growing the trees large and complex with minimal regularisation**. 
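
The hyperparameters above can be set together when constructing the model. The following is a minimal sketch assuming scikit-learn; the values are illustrative, and the depth inspection is only there to show that trees grown with max_depth=None become deep.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,      # number of trees
    max_features="sqrt",  # features considered at each split
    min_samples_leaf=1,   # minimum samples required at a leaf
    max_depth=None,       # grow each tree fully
    n_jobs=-1,            # use all available processors
    random_state=0,
)
forest.fit(X, y)

# Depth of every fitted tree in the ensemble.
depths = [tree.get_depth() for tree in forest.estimators_]
train_accuracy = forest.score(X, y)
print("tree depths range from", min(depths), "to", max(depths))
print("training accuracy:", train_accuracy)
```

A high training accuracy is expected here because the trees are fully grown; it says nothing about generalisation, which should be judged with the OOB score or held-out data.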

# Advantages and Disadvantages
### Advantages:
1. Able to handle numeric as well as categorical features (in scikit-learn, categorical features must first be encoded as numbers).

2. Can perform well even if the data contains null/missing values, although depending on the implementation the missing values may need to be imputed first.

3. Reduces the problem of overfitting, as the output is based on majority voting or averaging over many trees. Indeed, random forest ensembles are very unlikely to overfit in general.

4. Each decision tree is created independently of the others, so training can be parallelized.

5. Highly stable model, as the answers given by a large number of trees are averaged.

6. Maintains diversity, as not all attributes are considered while making each decision tree, though this is not true in all cases (for example, when max_features is set to use every feature).

7. Less sensitive to the curse of dimensionality. Since each tree does not consider all the attributes, the feature space it has to search is reduced.


### Disadvantages:

1. Random forest is much more complex than a single decision tree, where a decision can be explained by following a path through the tree.

2. Training takes longer than for simpler models such as a single decision tree, because many trees have to be built. Prediction is slower too: whenever the forest has to make a prediction, each decision tree has to generate output for the given input data.

3. Redundant features can be misleading when interpreting the feature importance (without affecting the accuracy)
 - Solution: regularized random forest (RRF) - In the tree building process, RRF remembers the features used in previous tree nodes and prefers these features when splitting future tree nodes, thereby avoiding redundant features in the trees
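
The effect of a redundant feature on feature importance can be seen by duplicating a column. This is a minimal sketch assuming scikit-learn; the synthetic data and the choice to copy column 0 are made up for the example, and it uses the standard random forest, not RRF.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in columns 0 to 2.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
# Append an exact copy of column 0 as a deliberately redundant feature.
X_dup = np.hstack([X, X[:, [0]]])

base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
dup = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_dup, y)

base_importances = base.feature_importances_
dup_importances = dup.feature_importances_
print("without the copy:", base_importances.round(3))
print("with the copy:   ", dup_importances.round(3))
```

With the copy present, column 0 and its duplicate typically share the importance that column 0 had on its own, so each one looks less important than the underlying signal really is.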