# Ensemble models with random forests

Ensemble methods are machine learning methods that combine several base models to produce one optimal predictive model.  
They combine decisions from multiple models to improve the overall performance.  
Ensemble learning involves creating a collection, or ensemble, of multiple algorithms for the purpose of generating a single model that's far more powerful and reliable than its component parts.  

## Types of ensemble methods

- majority voting
- averaging
- weighted averaging
- bagging
- boosting



## The majority voting method 
 
Picks the result based on the majority of votes from different models.  
This method is generally used in classification problems.  
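
As a minimal sketch of majority voting, the example below combines class predictions from three models. The labels and the `majority_vote` helper are made up for illustration; they do not come from any library.

```python
from collections import Counter

# Hypothetical class predictions from three models for the same five samples.
model_predictions = [
    ["cat", "dog", "dog", "cat", "bird"],   # model A
    ["cat", "dog", "cat", "cat", "dog"],    # model B
    ["dog", "dog", "cat", "bird", "bird"],  # model C
]


def majority_vote(predictions):
    """Return the most frequent label per sample across all models."""
    voted = []
    # zip(*predictions) groups the votes that belong to the same sample
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


ensemble_prediction = majority_vote(model_predictions)
print(ensemble_prediction)  # ['cat', 'dog', 'cat', 'cat', 'bird']
```

With an even number of models a tie is possible; in this sketch a tie goes to the label that was seen first.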

## The averaging method 

Is quite similar to majority voting. Multiple models are run, and their predictions are averaged.  
The averaging method can be used in both classification (by averaging predicted class probabilities) and regression problems.
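
A small sketch of averaging for a regression problem, assuming three models have already produced predictions for the same four samples. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical regression outputs: one row per model, one column per sample.
predictions = np.array([
    [2.0, 3.5, 5.0, 7.0],  # model A
    [2.4, 3.1, 5.2, 6.6],  # model B
    [1.9, 3.3, 4.8, 7.1],  # model C
])

# Average over the model axis to get one ensemble prediction per sample.
averaged_prediction = predictions.mean(axis=0)
print(averaged_prediction)
```

For classification, the same call can be applied to predicted class probabilities, followed by picking the class with the highest average probability.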

## The weighted average method

Uses multiple models to make predictions. The method assigns a weight to each model's predictions, typically reflecting how much that model is trusted, and computes their weighted average.
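
The sketch below shows a weighted average with NumPy. Both the predictions and the weights are hypothetical; in practice the weights might be chosen from each model's validation score.

```python
import numpy as np

# Hypothetical regression outputs: one row per model, one column per sample.
predictions = np.array([
    [2.0, 3.5, 5.0, 7.0],  # model A
    [2.4, 3.1, 5.2, 6.6],  # model B
    [1.9, 3.3, 4.8, 7.1],  # model C
])

# Assumed weights, one per model (they sum to 1 here, but np.average
# normalizes by the sum of the weights in any case).
weights = np.array([0.5, 0.3, 0.2])

weighted_prediction = np.average(predictions, axis=0, weights=weights)
print(weighted_prediction)
```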

## Bagging

Is a method wherein the results from multiple models are combined to get a final result.  
Decision trees are used frequently with bagging.  
The main idea of bagging is to create random subsets of the original data (bootstrap samples, drawn with replacement) and train a separate model on each subset.  
Finally, the results are aggregated. Because the models do not depend on each other, bagging can train them in parallel.  
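
A hand-rolled sketch of bagging on the iris data, assuming decision trees as the models and a majority vote as the aggregation step. The number of models, the split, and the seeds are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_models = 25  # arbitrary choice for this sketch

models = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate: majority vote over the models for every test sample
all_preds = np.array([m.predict(X_test) for m in models])  # (n_models, n_test)
bagged_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)
accuracy = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {accuracy:.2f}")
```

scikit-learn also ships `sklearn.ensemble.BaggingClassifier`, which wraps the same idea behind a single estimator.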

## Boosting

Is a sequential ensemble method. Unlike bagging, where the models are trained independently, each new model in boosting tries to correct the errors of the previous ones.  
The six main steps of boosting are:

- create a subset of the data
- run a model on the subset of the data and get the predictions
- calculate errors on those predictions
- assign higher weights to the incorrectly predicted observations
- create the next subset of data using those weights, and train another model on it
- repeat the cycle until a strong learner is created
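
The loop above can be sketched with an AdaBoost-style procedure. This is one boosting variant, chosen here as an assumption. Instead of drawing a new subset each round, it passes the observation weights straight to the learner, which has the same effect of emphasizing earlier mistakes. The binary task (two iris species), the decision stumps, and the number of rounds are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Binary problem: versicolor (-1) vs. virginica (+1)
X, y = load_iris(return_X_y=True)
mask = y != 0
X, y = X[mask], np.where(y[mask] == 1, -1, 1)

n_rounds = 10
weights = np.full(len(X), 1 / len(X))  # start with equal weights
learners, alphas = [], []

for _ in range(n_rounds):
    # Fit a weak learner (a decision stump) on the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner, clipped to avoid division by zero
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # say of this learner in the vote

    # Increase weights of misclassified rows, decrease the others
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    learners.append(stump)
    alphas.append(alpha)

# Strong learner: sign of the alpha-weighted sum of weak predictions
scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
boosted_pred = np.where(scores >= 0, 1, -1)
train_accuracy = (boosted_pred == y).mean()
print(f"training accuracy: {train_accuracy:.2f}")
```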


## Random forest

Is an ensemble model which follows the bagging method.  
This model uses decision trees to form ensembles.  
This approach is useful for both classification and regression problems.  

### How random forest works

When predicting a new value for a target feature, each tree uses either regression or classification to come up with a value that serves as a vote.  
For regression, the random forest algorithm takes the average of the votes from all trees in the ensemble; for classification, it takes the majority vote (or averages the predicted class probabilities).  
The result is the predicted value of the target feature for the observation in question.
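
To make the voting concrete, the sketch below looks inside a fitted scikit-learn `RandomForestClassifier`. It assumes the scikit-learn behavior of averaging the class probabilities of the individual trees (available through `estimators_`) and picking the most probable class. The chosen sample and the forest size are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

sample = X[[100]]  # a single observation, kept 2-D for scikit-learn

# One probability vector ("vote") per tree in the ensemble
per_tree_proba = np.array(
    [tree.predict_proba(sample)[0] for tree in forest.estimators_]
)

# Average the votes and pick the class with the highest mean probability
mean_proba = per_tree_proba.mean(axis=0)
manual_class = forest.classes_[mean_proba.argmax()]

print(mean_proba)
print(manual_class, forest.predict(sample)[0])
```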

### There are five main steps in random forest

- create a random subset from the original data
- randomly select a subset of features at each node in the decision tree
- decide the best split among the selected features
- for each subset of data, create a separate model (this is called a *base learner*)
- compute the final prediction by averaging the predictions from all the individual models
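
The five steps can be sketched by hand with plain decision trees. This is a simplified illustration, not how scikit-learn implements `RandomForestClassifier` internally. The number of trees and the seeds are arbitrary, and `max_features="sqrt"` is used as an assumed setting to get a random feature subset at each node.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
n_trees = 30  # arbitrary choice for this sketch

base_learners = []
for i in range(n_trees):
    # Step 1: random subset of the rows (bootstrap sample, with replacement)
    rows = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: max_features="sqrt" makes the tree look at a random subset
    # of features at each node and pick the best split among them
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    # Step 4: one base learner per subset
    base_learners.append(tree.fit(X[rows], y[rows]))

# Step 5: average the predicted class probabilities of all base learners
avg_proba = np.mean([t.predict_proba(X) for t in base_learners], axis=0)
forest_pred = avg_proba.argmax(axis=1)
print(f"training accuracy: {(forest_pred == y).mean():.2f}")
```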

## The advantages of random forest are

- Easy to understand
- Useful for data exploration
- Reduced data cleaning (scaling not required)
- Highly flexible
- Gives good accuracy
- Works well on large datasets
- Handles multiple data types
- Overfitting is reduced (due to averaging)
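
As one example of using a random forest for data exploration, a fitted scikit-learn forest exposes `feature_importances_`, which ranks the predictors by how much they contribute to the splits. The sketch below assumes the iris data and an arbitrary forest size.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Pair each feature name with its importance score (scores sum to 1)
importances = dict(zip(iris.feature_names, forest.feature_importances_))

# Print the features from most to least important
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```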

## The disadvantages of random forest are

- Does not work well with sparse datasets
- Requires more computational resources than a single decision tree
- Limited interpretability compared with a single decision tree
- Regression predictions cannot extrapolate beyond the range of target values seen in training

In [2]:
import numpy as np 
import pandas as pd 

import sklearn.datasets as datasets 
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [3]:
from sklearn.ensemble import RandomForestClassifier

In [5]:
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target)

y.columns=['labels']

print(df.head())
y[0:5]

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


   labels
0       0
1       0
2       0
3       0
4       0


In [6]:
# Sepal length, sepal width, petal length, and petal width are the predictors;
# the species type is the label.

# Check each column for missing values
df.isnull().any()

sepal length (cm)    False
sepal width (cm)     False
petal length (cm)    False
petal width (cm)     False
dtype: bool

In [7]:
y.labels.value_counts()

0    50
1    50
2    50
Name: labels, dtype: int64

## Preparing the data for training the model

In [9]:
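# Note: train_size=.2 keeps only 20% of the rows (30) for training;
# the remaining 120 rows form the test set.
x_train, x_test, y_train, y_test = train_test_split(df, y, train_size=.2, random_state=17)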

## Build a random forest model

In [11]:
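# Build a forest of 200 trees; random_state makes the result reproducible
classifier = RandomForestClassifier(n_estimators=200, random_state=0)

# fit() expects a 1-D label array, so flatten the single-column DataFrame
y_train_array = np.ravel(y_train)

classifier.fit(x_train, y_train_array)

# Predict the species for the held-out test rows
y_pred = classifier.predict(x_test)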

## Evaluating the model on the test data

In [15]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       0.95      0.95      0.95        43
           2       0.95      0.95      0.95        41

    accuracy                           0.97       120
   macro avg       0.97      0.97      0.97       120
weighted avg       0.97      0.97      0.97       120



In [16]:
y_test_array = np.ravel(y_test)
print(y_test_array)
print(y_pred)

[0 1 2 1 2 2 1 2 1 2 2 0 1 0 2 0 0 2 2 2 2 0 2 1 1 1 1 1 0 1 0 1 0 0 1 1 1
 2 1 0 1 1 0 1 2 1 1 2 1 0 2 1 1 1 1 0 1 2 2 0 0 2 0 2 2 0 2 0 0 1 2 0 0 1
 0 2 2 0 0 1 2 2 0 0 2 0 0 2 2 2 2 0 2 1 0 1 0 0 1 1 1 2 1 2 2 1 1 2 2 1 0
 2 2 1 2 1 0 1 0 1]
[0 1 2 1 2 2 1 2 1 2 2 0 1 0 2 0 0 2 2 2 1 0 2 1 1 1 1 1 0 1 0 1 0 0 1 1 1
 2 1 0 1 1 0 1 2 1 2 2 1 0 2 1 1 2 1 0 1 2 2 0 0 2 0 1 2 0 2 0 0 1 2 0 0 1
 0 2 2 0 0 1 2 2 0 0 2 0 0 2 2 2 2 0 2 1 0 1 0 0 1 1 1 2 1 2 2 1 1 2 2 1 0
 2 2 1 2 1 0 1 0 1]
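
Comparing the two arrays by eye is tedious. A confusion matrix tabulates the true labels (rows) against the predicted labels (columns). The sketch below repeats the split and model settings used in this notebook so that it runs on its own. The exact counts can depend on the installed library versions, so no output is shown.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split and model settings as in the cells above
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=.2, random_state=17
)
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Off-diagonal entries show which species are mistaken for each other.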
