# **Random Forests**

**What will you learn?**
1. **Disadvantages of Decision Trees:** The need for Random Forests
2. **Random Forest:** Basic intuition and introduction
3. **Data Bagging:** A powerful ensemble method
4. **Extra Trees:** Extremely Randomised Trees
5. **Implementation using Sklearn:** RandomForestClassifier

## **Disadvantages of Decision Trees**

As we have already seen with decision trees, **overfitting** is a major problem. We used pruning to reduce overfitting and obtain a better decision tree.
Decision trees keep expanding until they run out of features or reach a pure node, so there is a very high chance of overfitting the training data. A decision tree will split on features that are not actually important whenever the split looks favourable on the training data. This causes decision trees to perform perfectly on training data but sometimes fail on testing data. Pruning helps to some extent, but it does not consider how important a feature is. A tree may branch on some useless features near the root, and this affects every decision made below them. Pruning may reduce the error, but some error will still remain.

For example, in our previous example of predicting whether a candidate gets an interview call based on their resume, we also included features such as whether the resume has a picture of the candidate, the colours used in the resume and the number of pages it has. These features are comparatively less useful than our earlier features (projects, college and internships). But our data-set may happen to contain positive results for resumes that are blue and carry the applicant's picture. Our decision tree may then treat these as deciding features and predict true as soon as both are present in a particular resume.
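
The gap between training and testing performance described above can be reproduced with a small sketch. The synthetic data-set, its noise level and every parameter value below are assumptions chosen only for illustration; they are not taken from the resume example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 20 features, only 3 of them
# informative, with label noise injected through flip_y.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# An unpruned tree keeps splitting until every leaf is pure.
tree_clf = DecisionTreeClassifier(random_state=0)
tree_clf.fit(X_train, y_train)

train_acc = tree_clf.score(X_train, y_train)
test_acc = tree_clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The tree memorises the training split, noise included, so its training accuracy says little about how it behaves on unseen data.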

## **Random Forest**

Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Random forest is a way to reduce overfitting in decision trees, and it can also be used to find the importance of the features we are using.
Each decision tree is built from a randomly selected set of features and a randomly selected set of data points. The number of features in each decision tree of the forest is smaller than the total number of features in our dataset. So a feature 'A' may appear in some decision trees of the forest and not in others. Duplication is generally allowed when selecting data-points from our data-set for the trees in the forest. (Implementations differ in the details: sklearn's RandomForestClassifier, for instance, draws the random feature subset afresh at every split rather than once per tree.)

<img src="https://files.codingninjas.in/rf1-7334.png" width="700">

As shown above, some of the trees have features F1 and F2 but not F3. There is no point in repeating a feature within a decision tree of the forest, but as shown, a data point may be duplicated. This randomness in selecting the features and data-points helps reduce overfitting. Note that we build many decision trees, so there is very little chance of a feature or data-point being missed out entirely.

A random forest consists of many decision trees. The final prediction is the majority of the answers given by the decision trees of the forest.
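
The majority rule can be sketched in a few lines. The helper name `majority_vote` and the example votes are hypothetical, introduced only for illustration; ties go to whichever label was seen first.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single resume.
votes = ["call", "no call", "call", "call", "no call"]
print(majority_vote(votes))
```

Library implementations may combine the trees differently. sklearn's RandomForestClassifier, for example, averages the class probabilities predicted by the trees rather than counting hard votes.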

There are a few important ways to build such a forest; let's have a look.

### **Data Bagging and Feature Selection**

**B**ootstrap **Agg**regation or Bagging is a simple but very powerful ensemble method.

An ensemble method is a technique that combines the predictions from multiple machine learning models to make more accurate predictions than any individual model.

Bagging is used when our goal is to reduce the variance of a decision tree. It is a general-purpose procedure for reducing the variance of a predictive model. When applied to trees, the basic idea is to grow multiple trees which are then combined to give a single prediction. Combining multiple trees improves precision and accuracy at the expense of interpretability. <br>
In bagging, we build multiple smaller data-sets in which data points may repeat, and we randomly select some features for each. Strictly speaking, the term bagging refers to the sampling of data-points.


<img src="https://files.codingninjas.in/bagging-7335.png" width="700">

As shown in the diagram above, we have created multiple data-sets from the original dataset, shown as D1, D2, etc. Classifiers C1, C2, etc. are the individual trees in our forest. To find the final answer, we take the majority of the answers given by these trees.<br> These smaller data-sets are obtained by choosing the data-points and the features in the following manner:

1.  <b>Features</b> are selected at random <b>without</b> repetition<br>
2.  <b>Data-points</b> are selected at random <b>with</b> repetition (which is actually bagging)
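
The two sampling rules above can be sketched with NumPy. The toy sizes and the number of features per tree are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_samples, n_features = 10, 4   # toy sizes, chosen only for illustration
n_features_per_tree = 2         # assumed size of each feature subset

# Data points: sampled WITH replacement, so duplicates can appear.
row_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Features: sampled WITHOUT replacement, so no feature repeats.
col_idx = rng.choice(n_features, size=n_features_per_tree, replace=False)

# Toy feature matrix; X_boot is the data-set one tree would be trained on.
X = np.arange(n_samples * n_features).reshape(n_samples, n_features)
X_boot = X[row_idx][:, col_idx]

print("rows used:    ", sorted(row_idx.tolist()))
print("features used:", sorted(col_idx.tolist()))
print("shape:", X_boot.shape)
```

Repeating this once per tree gives each classifier its own data-set, such as D1, D2 and so on in the diagram.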


While doing bagging, we want to be sure that no data point is left out of the forest altogether. Increasing the number of trees in the random forest significantly reduces the chance of missing any data point. The same goes for the features. As already discussed, selecting features in a random forest helps us learn the relative importance of each feature.
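
How quickly the chance of missing a data point shrinks can be worked out directly. The sketch below assumes each bootstrap sample draws as many points as the training set contains, uniformly and with replacement. The function name `prob_left_out` is hypothetical.

```python
def prob_left_out(n_samples, n_trees=1):
    """Probability that one given data point appears in none of the bootstrap samples.

    A single draw misses the point with probability (1 - 1/n_samples), and each
    bootstrap sample consists of n_samples independent draws.
    """
    miss_one_tree = (1 - 1 / n_samples) ** n_samples
    return miss_one_tree ** n_trees


n = 90  # size of the training split used later in this notebook
for trees in (1, 5, 10):
    print(trees, prob_left_out(n, trees))
```

A single tree misses any given point roughly 37% of the time, but the probability of being missed by every tree falls off exponentially as trees are added.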


### **Extra Trees**

The Extra-Trees method stands for <b>ext</b>remely <b>ra</b>ndomized <b>trees</b>. Compared with random forests, the method drops the idea of using bootstrap copies of the learning sample, and instead of trying to find an optimal cut-point for each one of the K randomly chosen features at each node, it selects a cut-point at random.
<br>In the implementation of Random Forest, we randomly select some features from the main data-set and build a decision tree on them. At each node, the tree calculates the cost of splitting on each feature and selects the feature that gives the least cost. Multiple trees are built in the same way to form a forest.
<br>In the Extra Trees approach, we do not choose the features randomly to form a tree. We take all the features to form a tree. This is the first difference. The second is in how the data points are split at each node. Rather than searching for the lowest-cost split, Extra Trees pick the split at random. So any two trees will usually differ in the splits used to partition the data-points at each node. In both approaches it is nevertheless possible that a pair of trees turns out exactly the same.
<br>We can also combine the Extra Trees approach with Random Forests. This is done by randomly selecting some features, as Random Forest does, and then building Extra Trees on this feature subset from the bootstrapped data-sets.


<img src="https://files.codingninjas.in/extra_trees-7336.png" width="700">

As shown in the image above, the data-set is first converted into several bootstrap data-sets and then Extra Trees are made for each of these bootstrap data-sets.
<br><br>sklearn has a built-in classifier for Extra Trees, sklearn.ensemble.ExtraTreesClassifier. More information about it is available at [sklearn.ensemble.ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
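
A minimal sketch of the sklearn Extra Trees ensemble on the Iris data follows. The split settings mirror the ones used later in this notebook, while the number of trees is an assumption chosen to keep the example fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=14
)

# bootstrap=False (the default) trains every tree on the full training set;
# bootstrap=True would combine Extra Trees with bagging.
extra_clf = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=14)
extra_clf.fit(X_train, y_train)

extra_score = extra_clf.score(X_test, y_test)
print(f"test accuracy: {extra_score:.3f}")
```

Setting `bootstrap=True` would give the combined approach pictured above, where each tree is grown on its own bootstrap data-set.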

### **Random Forest Using Sklearn**

Feature selection means picking only the useful features that make a major contribution to the output.
Advantages of feature selection are as follows:

1. Reduces Overfitting
2. Improves accuracy of the model
3. Reduces training time

If we have too many irrelevant features, **the accuracy of our classifier decreases.** <br/>
Random Forests can be used for determining the importance of each feature and then picking only the important and useful features.

**Dataset Used** - Iris Dataset

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets,tree
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
import pandas as pd
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image  
import pydotplus
import graphviz



In [None]:
iris = datasets.load_iris()
features = iris.feature_names
X = iris.data
Y = iris.target

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state = 14)

In [None]:
clf = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state = 14)

In [None]:
# Train the classifier
clf.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10000,
                       n_jobs=-1, oob_score=False, random_state=14, verbose=0,
                       warm_start=False)

In [None]:
clf.score(X_test, Y_test)

0.95

In [None]:
feature_importances = pd.DataFrame(clf.feature_importances_, index = features, columns=['importance']).sort_values('importance', ascending=False)
feature_importances

| | importance |
|---|---|
| petal width (cm) | 0.459124 |
| petal length (cm) | 0.400622 |
| sepal length (cm) | 0.11811 |
| sepal width (cm) | 0.022145 |


This shows that <b>petal length</b> and <b>petal width</b> are much more important features than the other two, i.e. <b>sepal length</b> and <b>sepal width</b>.

In [None]:
# Build a selector that keeps only the important features,
# i.e. those whose importance value is at least 0.15
sfm = SelectFromModel(clf, threshold = 0.15)

In [None]:
sfm.fit(X_train, Y_train)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                                 class_weight=None,
                                                 criterion='gini',
                                                 max_depth=None,
                                                 max_features='auto',
                                                 max_leaf_nodes=None,
                                                 max_samples=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=10000, n_jobs=-1,
                                                 oob_score=False,

In [None]:
# Create a data subset picking only important features out of all the features.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

In [None]:
X_important_train.shape

(90, 2)

In [None]:
X_important_test.shape

(60, 2)

In [None]:
# New random forest classifier with only important features
clf_important = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state = 14)

In [None]:
clf_important.fit(X_important_train, Y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10000,
                       n_jobs=-1, oob_score=False, random_state=14, verbose=0,
                       warm_start=False)

In [None]:
clf_important.score(X_important_test, Y_test)

0.9666666666666667

As you can see, even after removing the two least significant features from our dataset, we are able to predict the answers with an **increased score** (from 0.95 to about 0.967, i.e. one more correct prediction among the 60 test samples).

Thus, using Random Forest, we can easily find out what features to focus on.
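
To see which columns a selector actually kept, its boolean mask can be mapped back to the feature names. The sketch below is self-contained and uses 100 trees instead of 10000 to stay fast, so the importances (and potentially the selection) may differ slightly from the run above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=14
)

# 100 trees (an assumption for speed) instead of the 10000 used in the notebook.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=14), threshold=0.15
)
selector.fit(X_train, y_train)

# get_support() returns a boolean mask over the original feature columns.
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected)
```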

In [None]:
# All the estimators
len(clf_important.estimators_)

10000

It can be seen that this forest contains 10000 Decision Trees.