
# Introduction to Random Forest Algorithm with Python

![](https://cdn-images-1.medium.com/max/2000/1*iIx75bixLRbE45mXkH2oCg.png/)

Random forest has become one of the most commonly used algorithms in ML competitions such as those on Kaggle. If you search for an easy-to-use and accurate ML algorithm, random forest will almost certainly appear among the top results.
To understand the random forest algorithm, you first have to be familiar with decision trees.


## What are Decision Trees?
*	Decision trees are predictive models that use a set of binary rules to calculate a target value.
*	There are two types of decision trees: classification trees and regression trees.
*	Classification trees predict categorical targets, such as land cover classes.
*	Regression trees predict continuous targets, such as biomass or percent tree cover.
*	Each individual tree is a fairly simple model made of branches, nodes and leaves.
*	The nodes contain the attributes that the objective function depends on.
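
To make the idea concrete, here is a minimal sketch that fits a small classification tree with scikit-learn and prints its binary rules. The toy data and the feature names `age` and `fare` are made up for illustration; they are not the preprocessed Titanic data used later.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): two features per sample and a binary target.
X_toy = [[22, 7.25], [38, 71.28], [26, 7.92], [35, 53.10], [35, 8.05], [54, 51.86]]
y_toy = [0, 1, 0, 1, 0, 1]

# A shallow tree keeps the learned rules easy to read.
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Each internal node is a binary rule of the form "feature <= threshold".
print(export_text(toy_tree, feature_names=["age", "fare"]))

# Predict the class of a new, unseen sample.
toy_prediction = toy_tree.predict([[30, 60.0]])
print(toy_prediction)
```

In this toy data the `fare` column alone separates the two classes, so the tree should need only a single split.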



Now that you are familiar with decision trees, you are ready to understand random forest.

## What is Random Forest?
As Leo Breiman defined it in the [research paper](https://medium.com/r/?url=https%3A%2F%2Fwww.stat.berkeley.edu%2F~breiman%2Frandomforest2001.pdf), “Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.”

A more formal definition: “A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x,Θk), k=1, ...} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.”

Briefly, random forest builds multiple decision trees and merges their predictions to get a more accurate and stable result.
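
The definition can be sketched by hand: train each tree on its own bootstrap sample, let every tree vote, and take the most popular class. This is only an illustration on synthetic data (the names `X_demo`, `y_demo` and the choice of 25 trees are assumptions), not how the rest of this article trains its model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

rng = np.random.default_rng(0)
trees = []
for k in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw n row indices with replacement.
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=k)
    tree.fit(X_demo[rows], y_demo[rows])
    trees.append(tree)

# Each tree casts one vote per sample; the forest predicts the most popular class.
votes = np.stack([tree.predict(X_demo) for tree in trees])  # shape: (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

train_accuracy = (forest_pred == y_demo).mean()
print("accuracy on the training data:", train_accuracy)
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so its results can differ slightly from this sketch.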


## Advantages of Random Forests
*	It can be used for both classification and regression problems.
*	Reduced overfitting: averaging many trees gives a significantly lower risk of overfitting than a single tree.
*	In binary classification with majority voting, the forest makes a wrong prediction only when more than half of the base classifiers are wrong.
*	It is easy to measure the relative importance of each feature for the prediction. Sklearn exposes this through the `feature_importances_` attribute of a fitted model.

Because of that, it is often more accurate than a single decision tree and competitive with many other algorithms.
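
As a small sketch of the feature importance point, the example below fits a forest on synthetic data where only the first two of five features carry signal, then prints the importance scores. The data and variable names are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): 5 features, only the first 2 are informative.
# shuffle=False keeps the informative features in columns 0 and 1.
X_imp, y_imp = make_classification(n_samples=500, n_features=5, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_imp, y_imp)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1.
importances = forest.feature_importances_
for index in importances.argsort()[::-1]:
    print(f"feature {index}: {importances[index]:.3f}")
```

On a pandas DataFrame, the same scores can be paired with the column names to see which features the model relies on most.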

## Disadvantages of Random Forests 
*	Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.
*	It is more complex and computationally expensive than a single decision tree.


## Important Terminology related to Decision Trees [[1]](https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/)

Let's look at the basic terminology used with decision trees and random forests:
1.	Root Node: It represents the entire population or sample, which is then divided into two or more homogeneous sets.
2.	Splitting: The process of dividing a node into two or more sub-nodes.
3.	Decision Node: A sub-node that splits into further sub-nodes.
4.	Leaf / Terminal Node: A node that does not split.
5.	Pruning: Removing the sub-nodes of a decision node. It is the opposite of splitting.
6.	Branch / Sub-Tree: A subsection of the entire tree.
7.	Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children.
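
These terms map directly onto a fitted scikit-learn tree. The sketch below uses the bundled iris dataset (an assumption for illustration, not the dataset of this article) and counts root, decision and leaf nodes through the `tree_` attribute.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

structure = clf.tree_
n_nodes = structure.node_count

# In scikit-learn's tree arrays, a leaf has identical left and right child ids (both -1).
is_leaf = structure.children_left == structure.children_right
n_leaves = int(is_leaf.sum())
n_decision = n_nodes - n_leaves  # the root plus all other nodes that split

# Node 0 is the root node and holds the entire training sample.
print("samples at the root node:", structure.n_node_samples[0])
print("decision nodes:", n_decision, "leaf nodes:", n_leaves)
```

Limiting `max_depth` here acts as a simple form of pre-pruning: the tree stops splitting before every leaf is pure.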


Now that we know the essentials of random forests, let us apply the algorithm to a dataset. We will use [Kaggle's Titanic survivors dataset](https://www.kaggle.com/c/titanic/data), which I preprocessed beforehand.

We will then train a neural network to compare the results.

## Import needed dependencies :

In [0]:
# Data handling
import numpy as np
import pandas as pd
# Random forest, data splitting and evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Neural network used for the comparison
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense
from keras.models import Sequential

Using TensorFlow backend.


## Load the preprocessed dataset:

Download the preprocessed dataset [here](https://drive.google.com/file/d/1rzbDYv3tYLQ7J-P3cgG7mHVwxWzgBwdr/view?usp=sharing).

In [0]:
dataset = pd.read_csv('TitanicPreprocessed.csv')

In [0]:
dataset.head()

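| Unnamed: 0 | Sex | Age | SibSp | Parch | Fare | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | ... | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | Ticket_XXX | FamilySize | Singleton | SmallFamily | LargeFamily | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
| 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 |
| 2 | 0 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 1 |
| 4 | 1 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |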


In [0]:
y = dataset['Survived']
X = dataset.drop(['Survived'], axis = 1)

# Split the dataset into train and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
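
To see what `test_size=0.25` does to the row counts, here is a small sketch on stand-in arrays. It assumes 891 rows, the size of Kaggle's Titanic training file; the `*_stub` names are hypothetical and replace the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays (hypothetical): 891 rows with one dummy feature and a dummy target.
X_stub = np.arange(891).reshape(-1, 1)
y_stub = np.arange(891) % 2

stub_train_X, stub_test_X, stub_train_y, stub_test_y = train_test_split(
    X_stub, y_stub, test_size=0.25, random_state=0)

# A quarter of the rows (rounded up) is held out for testing.
print(len(stub_train_X), len(stub_test_X))

# No row ends up in both parts.
overlap = set(stub_train_X.ravel()) & set(stub_test_X.ravel())
print("rows in both parts:", len(overlap))
```

The 668 training rows agree with the Keras log further down (534 for training plus 134 for validation). Passing `stratify=y` to `train_test_split` would additionally keep the share of survivors equal in both parts.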

## Set the parameters for the random forest model :

In [0]:
parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 50, 
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

## Hyperparameters of Sklearn Random forest classifier[[2]](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) :

*	**bootstrap** : boolean, optional (default=True)

> Whether bootstrap samples are used when building trees.

*	**min_samples_leaf** : int, float, optional (default=1)

> The minimum number of samples required to be at a leaf node:

> - If int, then consider min_samples_leaf as the minimum number.

> - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

* **n_estimators** : integer, optional (default=10 in the scikit-learn version used here; 100 since version 0.22)
> The number of trees in the forest.

* 	**min_samples_split** :  int, float, optional (default=2)
> The minimum number of samples required to split an internal node:

> - If int, then consider min_samples_split as the minimum number.
> -	If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

*	**max_features** : int, float, string or None, optional (default=“auto”)
> The number of features to consider when looking for the best split:

> -	If int, then consider max_features features at each split.
> -	If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
> -	If “auto”, then max_features=sqrt(n_features). Recent scikit-learn releases no longer accept “auto”; use “sqrt” instead.
> -	If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
> -	If “log2”, then max_features=log2(n_features).
> -	If None, then max_features=n_features.


*	**max_depth** :  integer or None, optional (default=None)
> The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.


*	**max_leaf_nodes** : int or None, optional (default=None)
> Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.


If you want to learn more about the remaining hyperparameters, check out [sklearn.ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
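
The effect of these hyperparameters can be checked on a fitted forest. The sketch below reuses the values from the `parameters` dictionary above on synthetic data (the data and the `hp_` names are assumptions for illustration) and inspects the individual trees.

```python
from math import ceil
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): 400 rows, 16 features.
X_hp, y_hp = make_classification(n_samples=400, n_features=16, random_state=0)

hp_forest = RandomForestClassifier(n_estimators=20, max_depth=6, min_samples_leaf=3,
                                   min_samples_split=10, max_features="sqrt",
                                   random_state=0).fit(X_hp, y_hp)

# n_estimators: the forest holds one fitted tree per estimator.
n_trees = len(hp_forest.estimators_)
print("trees:", n_trees)

# max_depth: no tree grows deeper than the limit.
deepest = max(t.get_depth() for t in hp_forest.estimators_)
print("deepest tree:", deepest)

# min_samples_leaf: every leaf (child id -1) keeps at least 3 samples.
smallest_leaf = min(
    t.tree_.n_node_samples[t.tree_.children_left == -1].min()
    for t in hp_forest.estimators_)
print("smallest leaf:", smallest_leaf)

# A float value is read as a fraction of the training rows.
leaf_from_fraction = ceil(0.01 * 400)
print("min_samples_leaf=0.01 on 400 rows ->", leaf_from_fraction, "samples")
```

Lower `max_depth` and higher `min_samples_leaf` values give smaller, more regularized trees, at the possible cost of underfitting.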

## Define the model :

In [0]:
RF_model = RandomForestClassifier(**parameters)

## Train the model :

In [0]:
RF_model.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
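
The printed parameters show `random_state=None`, so every run draws different bootstrap samples and the accuracy reported below can change slightly between runs. As a sketch on synthetic data (the data and the helper `fit_and_predict` are assumptions for illustration), fixing `random_state` makes the result reproducible:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical), used only to compare two training runs.
X_rs, y_rs = make_classification(n_samples=300, n_features=12, random_state=1)

def fit_and_predict(seed):
    """Train a forest with the given seed and return its class probabilities."""
    model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=seed)
    return model.fit(X_rs, y_rs).predict_proba(X_rs)

# Two runs with the same seed build identical forests.
same_seed_match = np.array_equal(fit_and_predict(42), fit_and_predict(42))
print("same seed, identical probabilities:", same_seed_match)
```

To apply this to the model above, one option is to add a `'random_state'` entry to the `parameters` dictionary.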

## Test the trained model on test data :

In [0]:
RF_predictions = RF_model.predict(test_X)

In [0]:
score = accuracy_score(test_y, RF_predictions)
print(score)

0.8251121076233184


The model reaches an accuracy of about 82.5% on the test set, which is not bad at all.
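
Accuracy is simply the fraction of test samples that were predicted correctly. The sketch below shows this on made-up labels (not the Titanic predictions) and adds a confusion matrix, which separates the two kinds of errors.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration (hypothetical values).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy computed by hand and with scikit-learn.
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
```

Calling `confusion_matrix(test_y, RF_predictions)` on the real predictions would show whether the model misses more survivors or more non-survivors.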

## Using Neural Networks:


## Define the model :

In [0]:
# Build a neural network :
NN_model = Sequential()

NN_model.add(Dense(128, input_dim=train_X.shape[1], activation='relu'))  # one input per feature column (68 here)
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(1, activation='sigmoid'))
NN_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
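
To get a feeling for the size of this network, the number of trainable parameters can be worked out by hand. This is a plain Python sketch that assumes the 68 input features used in the model above; the variable names are made up for the example.

```python
# Layer widths of the network above: 68 inputs, four hidden layers, one output unit.
layer_sizes = [68, 128, 256, 256, 256, 1]

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases.
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

With only 534 training samples (see the training log below), the network has far more parameters than samples, which is one reason to watch the validation accuracy during training.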


Define a checkpoint callback :

In [0]:
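# Save a checkpoint whenever the validation accuracy improves.
# Note: recent Keras versions log this metric as 'val_accuracy' instead of 'val_acc'.
checkpoint_name = 'Weights-{epoch:03d}-{val_acc:.5f}.hdf5'
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_acc', verbose=1, save_best_only=True, mode='max')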
callbacks_list = [checkpoint]
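
The placeholders in the checkpoint name are filled in with the epoch number and the logged metrics. The sketch below reproduces the formatting with plain `str.format`, using the values of the first epoch from the training log further down; `checkpoint_pattern` and `example_name` are names made up for this example.

```python
checkpoint_pattern = 'Weights-{epoch:03d}-{val_acc:.5f}.hdf5'

# {epoch:03d} pads the epoch number to three digits,
# {val_acc:.5f} keeps five decimals of the validation accuracy.
example_name = checkpoint_pattern.format(epoch=1, val_acc=0.62687)
print(example_name)
```

Because the accuracy is part of the file name, the best checkpoint is easy to pick out of the saved files later.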

## Train the model :

In [0]:
NN_model.fit(train_X, train_y, epochs=150, batch_size=64, validation_split = 0.2, callbacks=callbacks_list)

Train on 534 samples, validate on 134 samples
Epoch 1/150

Epoch 00001: val_acc improved from -inf to 0.62687, saving model to Weights-001-0.62687.hdf5
Epoch 2/150

Epoch 00002: val_acc improved from 0.62687 to 0.67164, saving model to Weights-002-0.67164.hdf5
Epoch 3/150

Epoch 00003: val_acc improved from 0.67164 to 0.70149, saving model to Weights-003-0.70149.hdf5
Epoch 4/150

Epoch 00004: val_acc did not improve from 0.70149
Epoch 5/150

Epoch 00005: val_acc did not improve from 0.70149
Epoch 6/150

Epoch 00006: val_acc improved from 0.70149 to 0.71642, saving model to Weights-006-0.71642.hdf5
Epoch 7/150

Epoch 00007: val_acc improved from 0.71642 to 0.73881, saving model to Weights-007-0.73881.hdf5
Epoch 8/150

Epoch 00008: val_acc did not improve from 0.73881
Epoch 9/150

Epoch 00009: val_acc improved from 0.73881 to 0.79851, saving model to Weights-009-0.79851.hdf5
Epoch 10/150

Epoch 00010: val_acc did not improve from 0.79851
Epoch 11/150

Epoch 00011: val_acc improved from 0

<keras.callbacks.History at 0x7f47cb549320>

In [0]:
# Load the weights file of the best model:
weights_file = './Weights-016-0.88060.hdf5'  # choose the best checkpoint
NN_model.load_weights(weights_file)  # load it
NN_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

## Test the trained model on test data :

In [0]:
predictions = NN_model.predict(test_X)

In [0]:
# Threshold the sigmoid outputs at 0.5 to get class labels (0 or 1)
predictions = (predictions > 0.5).astype(int).ravel()

In [0]:
score = accuracy_score(test_y, predictions)
print(score)

0.8116591928251121


The neural network reaches an accuracy of about 81%, so on this test split the random forest gives us a slightly higher accuracy.

## References :
* [The Random Forest Algorithm](https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd)
*	[What are some advantages of using a random forest over a decision tree given that a decision tree is simpler?](https://www.quora.com/What-are-some-advantages-of-using-a-random-forest-over-a-decision-tree-given-that-a-decision-tree-is-simpler)
*	[A Complete Tutorial on Tree Based Modeling from Scratch](https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/)
*	[Sklearn documentation of random forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [ImageSource](https://cdn5.vectorstock.com/i/1000x1000/88/99/city-and-forest-vector-1078899.jpg)


