### Bagging, Boosting, Bias-Variance Tradeoff

In [30]:
### Random Forest Algorithm

- Random forest is a bagging (ensemble) technique: its decision trees are built independently of each other, so they can be trained in parallel.
- It builds decision trees on different samples of the data and combines them by majority vote for classification or by averaging for regression.

In [1]:
### Ensemble Learning/Techniques

Ensemble simply means combining multiple models.
Thus a collection of models is used to make predictions rather than an individual model.

Ensemble uses two types of methods:

1. Bagging
2. Boosting

In [2]:
# Bagging

- Also known as bootstrap aggregation.
- Each model is trained on a random sample (a bootstrap sample) drawn from the original data with replacement; this is known as row sampling.
- This step of row sampling with replacement is called bootstrapping.
- Because sampling is with replacement, some observations may be repeated in a new training data set while others are left out.
- In bagging, every observation has the same probability of appearing in a new data set.
- Each model is then trained independently and generates its own results.
- The final output combines (aggregates) the results of all models: majority voting for classification, averaging for regression.

1. Multiple subsets of equal size are created from the original data set by selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is trained on its own subset, in parallel with and independently of the others.
4. The final predictions are determined by combining the predictions from all the models.
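
The four steps above can be sketched by hand with scikit-learn decision trees. This is only a minimal illustration: the synthetic data from `make_classification`, the number of models and the helper name `bagging_predict` are assumptions made for the example and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 majority vote can never tie
models = []

for _ in range(n_models):
    # Steps 1-2: draw a bootstrap sample (rows chosen with replacement,
    # same size as the original data) and fit a base model on it.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 3: each tree only sees its own sample, so the fits are independent.
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

def bagging_predict(models, X):
    """Step 4: aggregate the individual predictions by majority vote."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_rows)
    # Labels are 0/1, so the majority class is 1 when the mean vote is >= 0.5.
    return (votes.mean(axis=0) >= 0.5).astype(int)

y_pred = bagging_predict(models, X)
print("training accuracy of the bagged ensemble:", (y_pred == y).mean())
```

scikit-learn's `BaggingClassifier` wraps the same idea in a single estimator; the explicit loop is used here only to make each step visible.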

In [29]:
# Boosting

First, a model is built from the training data. Then a second model is built that tries to correct the errors of the first model. This procedure continues, and models are added until either the complete training data set is predicted correctly or the maximum number of models has been added.

1. Initialise the data set and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points.
4. Normalise the weights of all data points.
5. If the required results have been achieved, stop; otherwise go back to step 2.
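
One round of steps 1 to 4 can be sketched in plain Python. The steps above do not say how large the weight changes should be; the sketch below assumes the classic AdaBoost update, where the model importance `alpha` is derived from the weighted error rate. The function name `reweight` and the five-point example are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def reweight(weights, is_wrong):
    """One AdaBoost-style weight update (steps 3 and 4)."""
    # Weighted error rate of the current model.
    eps = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # Model importance: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Step 3: raise the weights of misclassified points, lower the rest.
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, is_wrong)]
    # Step 4: normalise so that the weights sum to 1 again.
    total = sum(updated)
    return [w / total for w in updated], alpha

# Step 1: five data points with equal weights.
weights = [0.2] * 5
# Step 2: suppose the first model misclassifies only the last point.
is_wrong = [False, False, False, False, True]

new_weights, alpha = reweight(weights, is_wrong)
print(new_weights)
```

In this example the misclassified point ends up with about half of the total weight (0.5) and the four correct points share the rest (about 0.125 each), so the next model is pushed to get that point right.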

In [22]:
# Bias and Variance 

There are 3 types of prediction error: 
- bias, 
- variance, and 
- irreducible error. 

Irreducible error, also known as “noise,” can’t be reduced by the choice of algorithm. 

The other two types of errors, however, can be reduced because they stem from your algorithm choice.

In [24]:
# Bias

Bias is the error introduced by the simplifying assumptions a model makes so that the target function is easier to learn. Models with high bias are less flexible and are not fully able to learn from the training data.

For example, when a linear regression model is fitted to data whose relationship is not linear, the regression line fails to fit the majority of the data points; such a model has high bias and low learning power. Generally, models with low bias are preferred.

In [25]:
# Variance 

- Variance describes how much the predictions of a model change when it is trained on a different data set.
- High variance typically shows up as a large gap between training and test performance: the model follows the training data, including its noise, too closely.
- Ideally, we want a model with low variance, so that its predictions stay stable across different samples of the data.
- There is, however, a tradeoff between bias and variance, known as the bias-variance tradeoff: reducing one typically increases the other.
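
For squared-error loss, the three error types listed earlier add up exactly. The notation here is introduced only for illustration: assume the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}$ be the model fitted on a randomly drawn training set. The expected error at a point $x$ then splits into squared bias, variance and irreducible error:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

The expectations are taken over training sets (and the noise), which is why variance measures the sensitivity of the model to the particular sample it was trained on.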

In [26]:
# Ensemble methods can be divided into two groups:

Parallel ensemble methods:
- or Bagging techniques
- base learners are generated in parallel simultaneously

Sequential ensemble methods:
- or Boosting techniques
- Learners are trained sequentially, with early learners fitting simple models to the data. The data is then analysed for errors, and each subsequent learner aims to reduce the net error of the prior model.

In [28]:
# homogeneous & heterogeneous ensembling

Most ensemble methods use a single base learning algorithm to produce homogeneous base learners i.e. learners of the same type, leading to homogeneous ensembles.

For example, random forests and AdaBoost both use homogeneous ensembling.

Some methods use learners of different types as base learners.
In scikit-learn, VotingClassifier is an example: it combines heterogeneous learners into a single model.
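
A heterogeneous ensemble might look like the sketch below, which combines three different learner types with scikit-learn's `VotingClassifier`. The synthetic data, the choice of base learners and their settings are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Three base learners of different types: a heterogeneous ensemble.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="hard",  # each learner casts one vote; the majority class wins
)
voter.fit(X, y)
print(voter.predict(X[:5]))
```

With `voting="soft"` the predicted class probabilities are averaged instead, which requires every base learner to implement `predict_proba`.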

In [3]:
#### Steps involved in random forest algorithm:


1. 'n' random records are taken (with replacement) from the data set of 'k' records; each tree also considers only a random subset of the features.
2. An individual decision tree is constructed for each sample.
3. Each decision tree generates an output.
4. The final output is based on majority voting for classification or on averaging for regression.
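
Steps 3 and 4 can be made visible by querying the individual trees of a fitted forest. One detail worth knowing: scikit-learn's `RandomForestClassifier` aggregates by averaging the trees' predicted class probabilities rather than counting hard votes. The synthetic data and the small forest size below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Steps 1-2: bootstrap rows for each tree, random feature subset per split.
forest = RandomForestClassifier(
    n_estimators=15, max_features="sqrt", bootstrap=True, random_state=0
).fit(X, y)

# Step 3: every fitted tree produces its own output.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Step 4: average the outputs across trees and take the most probable class.
manual = forest.classes_[per_tree.mean(axis=0).argmax(axis=1)]

agreement = (manual == forest.predict(X)).mean()
print("agreement with forest.predict:", agreement)
```

The manual aggregation should agree with `forest.predict`, since it repeats what the forest does internally.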

In [4]:
### Important Features of Random Forest

1. Diversity: Not all attributes/variables/features are considered while building an individual tree, so each tree is different.

2. Less sensitive to the curse of dimensionality: Since each tree does not consider all the features, the feature space it works in is reduced.

3. Parallelization: Each tree is created independently from different data and attributes. This means that we can make full use of the CPU to build random forests.

4. Train-test split: Each tree's bootstrap sample leaves out roughly one third (about 37%) of the rows, so these out-of-bag rows can be used to estimate performance without a separate validation set. A held-out test set is still useful for a final check.

5. Stability: The result is stable because it is based on majority voting or averaging.
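
The share of rows that a single tree never sees can be checked with a quick simulation. When `n` rows are drawn with replacement from `n` rows, the probability that a given row is never picked is `(1 - 1/n) ** n`, which approaches `1/e` (about 0.368) as `n` grows. The row count and the seed below are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)
n_rows = 10_000

# One bootstrap sample: row indices drawn with replacement,
# same size as the original data.
sample = [random.randrange(n_rows) for _ in range(n_rows)]

# Rows that were never drawn are "out of bag" for this tree.
oob_fraction = 1 - len(set(sample)) / n_rows
theoretical = (1 - 1 / n_rows) ** n_rows

print("simulated out-of-bag fraction:", round(oob_fraction, 3))
print("theoretical (1 - 1/n)^n:", round(theoretical, 3))
print("limit 1/e:", round(math.exp(-1), 3))
```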

In [5]:
### Difference Between Decision Tree & Random Forest

- A decision tree normally suffers from overfitting if it is allowed to grow without any control.
- In a random forest, the final output is based on averaging or majority voting across many trees, which reduces overfitting.
- A single decision tree is faster to compute.
- A random forest is comparatively slower.

In [6]:
### Important Hyperparameters Of Random Forest Algorithm

In [7]:
# The following hyperparameters increase the predictive power:

1. n_estimators: the number of trees the algorithm builds before averaging the predictions.
2. max_features: the maximum number of features the random forest considers when splitting a node.
3. min_samples_leaf: the minimum number of samples required to be at a leaf node.

In [2]:
# The following hyperparameters control speed, reproducibility and validation:

1. n_jobs: tells the engine how many processors it is allowed to use. A value of 1 uses a single processor, while -1 uses all available processors.

2. random_state: controls the randomness of the sampling. The model will always produce the same results for a fixed value of random_state, given the same hyperparameters and the same training data.

3. oob_score: OOB means out-of-bag. It is a validation method built into random forests: the rows left out of each tree's bootstrap sample (roughly one third of the data) are not used to train that tree and are instead used to evaluate performance. These rows are called out-of-bag samples.

### Python Implementation of Random Forest

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [9]:
df = pd.read_csv('heart_v2.csv')
df.head()

| | age | sex | BP | cholestrol | heart disease |
|---|---|---|---|---|---|
| 0 | 70 | 1 | 130 | 322 | 1 |
| 1 | 67 | 0 | 115 | 564 | 0 |
| 2 | 57 | 1 | 124 | 261 | 1 |
| 3 | 64 | 1 | 128 | 263 | 0 |
| 4 | 74 | 0 | 120 | 269 | 0 |


In [10]:
X = df.drop('heart disease',axis=1)
# Putting response variable to y
y = df['heart disease']

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape

((189, 4), (81, 4))

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
rf=RandomForestClassifier(n_estimators=100,n_jobs=-1,max_depth=5,random_state=42,oob_score=True)

    max_depth : int, default=None
        The maximum depth of the tree. If None, then nodes are expanded until
        all leaves are pure or until all leaves contain less than
        min_samples_split samples.

In [15]:
%%time
rf.fit(X_train, y_train)

Wall time: 372 ms


RandomForestClassifier(max_depth=5, n_jobs=-1, oob_score=True, random_state=42)

In [16]:
rf.oob_score_

0.656084656084656

In [12]:
# Let’s do hyperparameter tuning for Random Forest using GridSearchCV and fit the data.

In [17]:
# make another instance of the classifier
rfc = RandomForestClassifier(random_state=42,n_jobs=-1)

In [18]:
# define a variable to store parameters listings
params={
    'max_depth':[2,3,5,10,20],
    'min_samples_leaf':[5,10,20,50,100,200],
    'n_estimators':[10,25,30,50,100,200]
}

In [19]:
# import GridsearchCV
from sklearn.model_selection import GridSearchCV

In [20]:
# make an instance of GridsearchCV
gridcv=GridSearchCV(estimator=rfc,param_grid=params,cv=4,n_jobs=-1,verbose=1,scoring='accuracy')

In [21]:
# train the data on GridsearchCV
gridcv.fit(X_train,y_train)

Fitting 4 folds for each of 180 candidates, totalling 720 fits


GridSearchCV(cv=4, estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
             n_jobs=-1,
             param_grid={'max_depth': [2, 3, 5, 10, 20],
                         'min_samples_leaf': [5, 10, 20, 50, 100, 200],
                         'n_estimators': [10, 25, 30, 50, 100, 200]},
             scoring='accuracy', verbose=1)
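
The "180 candidates, totalling 720 fits" in the log follows directly from the grid: 5 values of max_depth, 6 of min_samples_leaf and 6 of n_estimators give 5 × 6 × 6 = 180 combinations, each fitted once per fold. A small check using scikit-learn's `ParameterGrid`, with the grid copied from the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100, 200],
    'n_estimators': [10, 25, 30, 50, 100, 200],
}

n_candidates = len(ParameterGrid(params))  # every combination of the values
cv = 4                                     # folds passed to GridSearchCV
print(n_candidates, n_candidates * cv)     # 180 720
```

Adding more values to any list multiplies the number of fits, so it can help to keep the grid coarse at first and refine it around the best region.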

In [20]:
gridcv.best_score_

0.6985815602836879

In [21]:
rf_best = gridcv.best_estimator_
rf_best

RandomForestClassifier(max_depth=5, min_samples_leaf=10, n_estimators=10,
                       n_jobs=-1, random_state=42)

The output above shows the best estimator found by the grid search, together with its tuned hyperparameter values.