# Random forests

#### Introduction:

Random forest is a machine learning method with applications ranging from healthcare to insurance. It can be used to model the impact of marketing on customer acquisition, retention, and churn, or to predict disease risk and susceptibility in patients.

Random forests (Breiman, 2001) is a substantial modification of [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) that builds a large collection of de-correlated trees, and then averages them. On many problems the performance of random forests is very similar to [boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning)), and they are simpler to train and tune. As a consequence, random forests are popular.

Random forest is capable of regression and classification. It can handle a large number of features, and it's helpful for estimating which of your variables are important in the underlying data being modeled.

#### What is a Random Forest?

A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea behind random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results. This reduction in overfitting, while retaining the predictive power of the trees, can be shown using rigorous mathematics.

To implement this strategy, we need to build many decision trees. Each tree should do an acceptable job of predicting the target, and should also be different from the other trees. Random forests get their name from injecting randomness into the tree building to ensure each tree is different. There are two ways in which the trees in a random forest are randomized: by selecting the data points used to build a tree and by selecting the features in each split test. Let’s go into this process in more detail.

#### Building random forests:  

To build a random forest model, you need to decide on the number of trees to build (the `n_estimators` parameter of `RandomForestRegressor` or `RandomForestClassifier`). Let’s say we want to build 10 trees. These trees will be built completely independently from each other, and the algorithm will make different random choices for each tree to make sure the trees are distinct. To build a tree, we first take what is called a bootstrap sample of our data. That is, from our `n_samples` data points, we repeatedly draw an example randomly with replacement (meaning the same sample can be picked multiple times), `n_samples` times. This will create a dataset that is as big as the original dataset, but some data points will be missing from it (approximately one third), and some will be repeated.
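
The "approximately one third" figure can be checked with a short sketch. The probability that a given point is never drawn in `n_samples` draws is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ as $n$ grows. The variable names, the seed, and the dataset size below are assumptions chosen for illustration; they are not part of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
n_samples = 10_000              # hypothetical dataset size

# Draw n_samples row indices with replacement: one bootstrap sample.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are missing from this tree's training set.
in_bag = np.unique(bootstrap_idx)
left_out_fraction = 1 - in_bag.size / n_samples

print(f"distinct rows in the bootstrap sample: {in_bag.size}")
print(f"fraction of rows left out: {left_out_fraction:.3f}")
```

The left-out rows are often called "out-of-bag" samples for that tree.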

To illustrate, let’s say we want to create a bootstrap sample of the list [1, 2, 3, 4]. A possible bootstrap sample would be [2, 4, 4, 3]. Another possible sample would be [4, 1, 4, 1].

Next, a decision tree is built based on this newly created dataset, with one modification to the usual tree-building algorithm. Instead of looking for the best test over all features at each node, the algorithm randomly selects a subset of the features in each node and looks for the best possible test involving one of these features. The number of features that are selected is controlled by the `max_features` parameter. This selection of a subset of features is repeated separately in each node, so that each node in a tree can make a decision using a different subset of the features.

The bootstrap sampling leads to each decision tree in the random forest being built on a slightly different dataset. Because of the selection of features in each node, each split in each tree operates on a different subset of features. Together, these two mechanisms ensure that all the trees in the random forest are different.

A critical parameter in this process is `max_features`. If we set `max_features` to `n_features`, that means that each split can look at all features in the dataset, and no randomness will be injected in the feature selection (the randomness due to the bootstrapping remains, though). If we set `max_features` to 1, that means that the splits have no choice at all on which feature to test, and can only search over different thresholds for the feature that was selected randomly. Therefore, a high `max_features` means that the trees in the random forest will be quite similar, and they will be able to fit the data easily, using the most distinctive features. A low `max_features` means that the trees in the random forest will be quite different, and that each tree might need to be very deep in order to fit the data well.
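
One way to explore this trade-off is to fit forests with different `max_features` settings and compare how deep the trees grow. The sketch below assumes a synthetic dataset from `make_classification` (not part of the original text) and uses `None` for "all features"; the exact numbers depend on the data and the seed, so no particular output is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for max_features in (1, "sqrt", None):  # None means "consider all features"
    forest = RandomForestClassifier(n_estimators=50,
                                    max_features=max_features,
                                    random_state=0)
    forest.fit(X_train, y_train)
    # Depth of every fitted tree in the forest.
    depths = [tree.get_depth() for tree in forest.estimators_]
    results[max_features] = {
        "mean_depth": sum(depths) / len(depths),
        "test_accuracy": forest.score(X_test, y_test),
    }

for setting, stats in results.items():
    print(setting, stats)
```
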

To make a prediction using the random forest, the algorithm first makes a prediction for every tree in the forest. For regression, we can average these results to get our final prediction. For classification, a “soft voting” strategy is used. This means each tree makes a “soft” prediction, providing a probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class with the highest probability is predicted.
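
The soft-voting step can be reproduced by hand from a fitted scikit-learn forest. This is a minimal sketch on a synthetic dataset (an assumption for illustration): it averages the per-tree probabilities from `estimators_` and compares the result with the forest's own `predict_proba` and `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Soft voting by hand: average each tree's class probabilities...
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
# ...then pick the class with the highest averaged probability.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(mean_proba, forest.predict_proba(X)))
print("predictions match:  ", np.array_equal(manual_pred, forest.predict(X)))
```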


#### Features of Random forests:

Some of the features that make random forests so useful are:

- Its accuracy is competitive with the best current algorithms on many problems.
- It runs efficiently on large datasets.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of what variables are important in the classification.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
- It has methods for balancing error in data sets with unbalanced class populations.
- Generated forests can be saved for future use on other data.
- Prototypes are computed that give information about the relation between the variables and the classification.
- It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
- The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
- It offers an experimental method for detecting variable interactions.
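- Adding more trees does not make a random forest overfit, so you can run as many trees as you want. (The individual trees can still fit noise in the data, so the forest as a whole is not immune to overfitting.)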
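
Two of the features listed above, the variable importance estimates and the internal estimate of generalization error, are exposed directly by scikit-learn. The sketch below assumes a synthetic dataset in which only the first three columns carry signal; `oob_score=True` asks the forest to score each sample using only the trees that did not see it during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration). With shuffle=False the
# first 3 columns are the informative ones; the remaining 5 are noise.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(X, y)

# Accuracy measured on the samples each tree did NOT see during training.
print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")

# Impurity-based importances, one per feature, normalized to sum to 1.
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(f"feature {i}: {forest.feature_importances_[i]:.3f}")
```

The other items in the list (proximities, prototypes, missing-value handling) come from Breiman's original implementation and are not shown here.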

#### Random Forest algorithm:

A random forest is essentially bootstrap aggregation applied to decision trees. Say we have 1,000 observations with 10 variables. The random forest builds multiple CART models, each from a different sample of the observations and a different random subset of the variables. As a simplified picture, it might take a random sample of 100 observations and 5 randomly chosen variables to build one CART model. (In the standard algorithm described above, each bootstrap sample is as large as the training set and the feature subset is redrawn at every split.) It repeats the process (say) 10 times and then makes a final prediction for each observation. The final prediction is a function of the individual trees’ predictions; for regression it can simply be their mean.
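
The steps above can be sketched by hand with plain decision trees. This is an illustrative approximation, not scikit-learn's `RandomForestRegressor` implementation: the synthetic dataset, the train/test split, and the choice of 10 trees with 5 candidate features per split are all assumptions made to mirror the numbers in the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the 1,000 observations x 10 variables in the text.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0,
                       random_state=0)
X_train, y_train = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]

n_trees = 10
tree_preds = []
for _ in range(n_trees):
    # 1. Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Grow a tree that considers 5 random features at each split.
    tree = DecisionTreeRegressor(max_features=5,
                                 random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    tree_preds.append(tree.predict(X_test))

tree_preds = np.array(tree_preds)       # shape: (n_trees, n_test)
forest_pred = tree_preds.mean(axis=0)   # 3. Average the trees' predictions.

mean_tree_mse = np.mean((tree_preds - y_test) ** 2)
forest_mse = np.mean((forest_pred - y_test) ** 2)
print(f"average single-tree MSE: {mean_tree_mse:.1f}")
print(f"averaged-forest MSE:     {forest_mse:.1f}")
```

For squared error, the averaged prediction can never do worse than the average of the individual trees' errors, which is the variance-reduction effect the forest relies on.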

![](http://slideplayer.com/slide/7835984/25/images/26/Random+Forest:+Algorithm+Steps.jpg)  
  
#### Analysing Random forests with an example:

For this example we’ll use the Python package scikit-learn and its implementation of the random forest classifier. As you shall see, it is extremely easy to use.

Let’s start by creating our dataset. Let us create a dataset of furniture consisting of seats and benches; the task will be to classify them appropriately.
We’ll work with the following characteristics:

1. Color
2. Length of Legs
3. Area of top surface

We could add as many other features as we like, such as number of legs, age, location, and so on, but we’ll keep it simple.

We will represent all possibilities of each attribute as a number. The colors of our benches and seats will be drawn from the same uniform distribution of, say, integers 0 through 5, where you can think of 0 as being brown, 1 as black, etc. We’ll pull the length of the legs from different uniform distributions of real numbers: between 2 and 5 for seats, and between 4 and 10 for benches. Finally, the area of the top surface will be drawn from a normal distribution: for seats, we’ll use mean 2 and standard deviation 0.25; for benches, mean 5 and standard deviation 1.

Let’s construct our data in Python. The classifier expects two array-like inputs: one containing the data and the other containing the corresponding labels, so we shall build these separately:

```python
from random import randint, uniform, gauss

from numpy import asarray
from sklearn.ensemble import RandomForestClassifier
# cross_val_score lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed.
from sklearn.model_selection import cross_val_score

seats = beanchs = 5000  # number of examples per class

data = [[randint(0, 5),   # color code for a seat (six possible colors)
         uniform(2, 5),   # leg length for a seat
         gauss(2, 0.25)]  # top surface area for a seat
        for i in range(seats)] \
     + [[randint(0, 5),   # color code for a bench (six possible colors)
         uniform(4, 10),  # leg length for a bench
         gauss(5, 1)]     # top surface area for a bench
        for i in range(beanchs)]

# One label per row of data: seats first, then benches.
labels = asarray(['seat'] * seats + ['beanch'] * beanchs)
```


Now that we have 10,000 labelled records of our furniture, we can initialize a random forest classifier and train it on this data. I have arbitrarily chosen to use 100 estimators (trees in our forest). Unless you’re seeking the most concise representation, as can often be the goal in the sciences, the more estimators the merrier, as that tends to lead to better performance, but training and testing time can be a limiting factor.

```python
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_classifier.fit(data, labels)
```

The notebook then printed the fitted estimator with its parameters (this output comes from an older scikit-learn release, so the parameter list differs in current versions):

```
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
```

Now we can perform cross-validation to see how well this classifier distinguishes seats from benches. This technique splits the data into training and testing samples, fits the classifier on the training data, and evaluates it on the held-out testing data. Repeating this several times and averaging the results gives a sense of how well the classifier would perform on as yet unseen data. Below I have (again, arbitrarily) chosen 100-fold cross-validation (`cv=100`), so each of the 100 folds serves once as the test set.

```python
# cv=100 runs 100-fold cross-validation and returns one accuracy per fold.
scores = cross_val_score(rf_classifier, data, labels, cv=100)
print("Accuracy: %0.2f (+/- %0.2f)"
      % (scores.mean(), scores.std() * 2))
```

```
Accuracy: 1.00 (+/- 0.01)
```


This seems a little too good: it gets nearly all of them right all the time! Let’s make our data noisy so that we can see how robust the classifier is. Let’s change every 100th bench to a seat and vice versa.

```python
# Relabel every 100th seat as a bench and every 100th bench as a seat.
labels[:seats][::100] = 'beanch'
labels[seats:seats + beanchs][::100] = 'seat'

rf_classifier.fit(data, labels)
scores = cross_val_score(rf_classifier, data, labels, cv=100)
print("Accuracy: %0.2f (+/- %0.2f)"
      % (scores.mean(), scores.std() * 2))
```

```
Accuracy: 0.99 (+/- 0.14)
```


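And there we have it. The classifier shows it’s robust to the mislabeled data we introduced: we flipped 1% of the labels (100 of 10,000), and the cross-validated accuracy, which is scored against those noisy labels, fell by about the same amount.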
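
A more direct way to probe robustness is to train on the noisy labels but score against the original, clean ones. The sketch below is a self-contained variation on the example, not the notebook's own code: it regenerates the data with NumPy using the same distributions, encodes seats as 0 and benches as 1, and uses a single 75/25 split instead of cross-validation. All of those choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000  # items per class, as in the example above

# Same distributions as above: color code, leg length, top-surface area.
seat_rows = np.column_stack([rng.integers(0, 6, n), rng.uniform(2, 5, n),
                             rng.normal(2, 0.25, n)])
bench_rows = np.column_stack([rng.integers(0, 6, n), rng.uniform(4, 10, n),
                              rng.normal(5, 1, n)])
X = np.vstack([seat_rows, bench_rows])
y_true = np.array([0] * n + [1] * n)  # 0 = seat, 1 = bench

# Flip every 100th label within each class to simulate mislabeling.
y_noisy = y_true.copy()
y_noisy[::100] = 1 - y_noisy[::100]

idx_train, idx_test = train_test_split(np.arange(2 * n), test_size=0.25,
                                       stratify=y_true, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[idx_train], y_noisy[idx_train])  # train on the noisy labels

pred = clf.predict(X[idx_test])
acc_vs_noisy = np.mean(pred == y_noisy[idx_test])
acc_vs_true = np.mean(pred == y_true[idx_test])
print(f"accuracy against noisy labels: {acc_vs_noisy:.3f}")
print(f"accuracy against true labels:  {acc_vs_true:.3f}")
```

If the forest is ignoring the flipped labels rather than memorizing them, the accuracy against the true labels should be at least as high as the accuracy against the noisy ones.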

#### Closing Notes: 

Machine learning tools like random forests, SVMs, and neural networks are all used for their high performance. They do deliver it, but users generally don’t understand how they actually work. Not knowing the statistical details of a model is not a serious concern; however, not knowing how to tune the model to fit the training data well prevents the user from using the algorithm to its full potential.

Random forests are more accurate in many cases than CART, CHAID, or other single-model regression approaches. These cases generally involve a large number of predictive variables and a large sample size. This is because a random forest captures the variance of several input variables at the same time and lets a large number of observations participate in the prediction.