# Ensemble Models


Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or by using different training data sets. The ensemble model then aggregates the predictions of the base models and produces one final prediction for the unseen data. \
Source ~ https://www.sciencedirect.com/topics/computer-science/ensemble-modeling#:~:text=Ensemble%20modeling%20is%20a%20process,prediction%20for%20the%20unseen%20data.
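
The aggregation step can be illustrated with a simple majority vote. This is only a minimal sketch: the helper name `majority_vote` and the three lists of predictions are hypothetical, made up for illustration, and real ensembles may instead average probabilities or weight the base models.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several base models by majority vote.

    predictions_per_model: list of lists, one inner list per base model,
    each holding that model's predicted label for every sample.
    """
    n_samples = len(predictions_per_model[0])
    final = []
    for i in range(n_samples):
        # Collect the label each base model predicted for sample i
        votes = [preds[i] for preds in predictions_per_model]
        # most_common(1) returns [(label, count)] for the most frequent label
        final.append(Counter(votes).most_common(1)[0][0])
    return final

# Hypothetical predictions from three base models on four unseen samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble_prediction = majority_vote([model_a, model_b, model_c])
print(ensemble_prediction)  # [1, 0, 1, 1]
```

Each base model is wrong on a different sample here, yet the vote recovers a single consistent prediction, which is the intuition behind combining diverse models.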

# The Need for Ensemble Models

## The problem of High Bias and Variance

![](https://miro.medium.com/max/700/1*9hPX9pAO3jqLrzt0IE3JzA.png) \
Source ~ https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

As a rule of thumb, Bias is reflected by the training error, while Variance shows up as the gap between the training error and the testing error.
<hr>
Imagine training error is 1% and testing error is 13%. This is an <b>Overfitting</b> condition, also termed <b>High Variance</b>.
<hr>
Imagine training error is 19% and testing error is 21%. This is an <b>Underfitting</b> condition, also termed <b>High Bias</b>.
<hr>
Imagine training error is 20% and testing error is 34%. This is a <b>High Bias</b> and <b>High Variance</b> condition, since both the training error and the gap to the testing error are large.
<hr>
Imagine training error is 1% and testing error is 2%. This is a <b>Generalized</b> model.
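
The cases above can be turned into a small rule-of-thumb helper. This is a minimal sketch: the function name `diagnose_fit` and both thresholds (10% training error, 5-point gap between testing and training error) are assumptions chosen for illustration, not standard values.

```python
def diagnose_fit(train_error, test_error, bias_threshold=0.10, gap_threshold=0.05):
    """Label a model from its error rates (fractions, e.g. 0.13 for 13%).

    Both thresholds are illustrative assumptions, not standard values.
    """
    # A large training error suggests the model is too simple (high bias)
    high_bias = train_error > bias_threshold
    # A large train-to-test gap suggests the model memorized the training data (high variance)
    high_variance = (test_error - train_error) > gap_threshold

    if high_bias and high_variance:
        return "High Bias and High Variance"
    if high_bias:
        return "High Bias (underfitting)"
    if high_variance:
        return "High Variance (overfitting)"
    return "Generalized"

print(diagnose_fit(0.01, 0.13))  # High Variance (overfitting)
print(diagnose_fit(0.19, 0.21))  # High Bias (underfitting)
print(diagnose_fit(0.01, 0.02))  # Generalized
```

In practice the acceptable error levels depend on the problem, so the thresholds would need to be chosen per task.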

# Why these concepts?

A **Decision Tree** tends to run into the **High Variance** problem because it can keep splitting until it reaches its **maximum depth**, at which point it has essentially memorized the training data. This can largely be mitigated by using a **Random Forest**.
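
A quick way to see this is to compare training and testing accuracy for a fully grown tree and for a forest. The following is a sketch under stated assumptions: it uses scikit-learn's breast cancer dataset, a 70/30 split, and fixed `random_state` values picked only for reproducibility. Exact numbers will vary with the split and the library version, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

# max_depth is left at its default (None), so the tree grows until its leaves are pure
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_train_acc = forest.score(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)

# A large drop from train to test accuracy is the high-variance symptom
print(f"Decision Tree  train={tree_train_acc:.3f}  test={tree_test_acc:.3f}")
print(f"Random Forest  train={forest_train_acc:.3f}  test={forest_test_acc:.3f}")
```

The size of the gap between the train and test columns is what to compare between the two models.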

# But what does an Ensemble Model have to do with Random Forest?

There are two main types of **Ensemble Methods**.\
One is known as **Bagging** and the other is **Boosting**.

The **Bagging** Method is also known as **Bootstrap Aggregation**.

## Random Forest is a type of Bagging Ensemble Model.

![](https://miro.medium.com/max/700/1*-PXzSlXtFEGTxgcmCyMkjQ.png) \
Source ~ https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c

## Bagging 
![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Ensemble_Bagging.svg/440px-Ensemble_Bagging.svg.png)\
Source ~ https://en.wikipedia.org/wiki/Bootstrap_aggregating
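
The two steps of bagging, bootstrapping and aggregating, can be sketched by hand. This is an illustrative sketch rather than a reference implementation: the number of trees (25), the seeds, and the use of the breast cancer dataset are assumptions. In practice scikit-learn's `BaggingClassifier` or `RandomForestClassifier` would normally be used instead.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10
)

rng = np.random.default_rng(0)
n_models = 25  # illustrative choice; odd, so a binary vote cannot tie

trees = []
for _ in range(n_models):
    # Bootstrap: draw as many row indices as there are training rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate: every tree votes on every test row, and the majority class (0 or 1) wins
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_models, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_acc = (bagged_pred == y_test).mean()
print(f"Bagged trees test accuracy: {bagged_acc:.3f}")
```

A Random Forest follows the same recipe but additionally considers only a random subset of features at each split, which makes the individual trees less correlated with each other.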

## Random Forest
![](https://editor.analyticsvidhya.com/uploads/74060RF%20image.jpg) \
Source ~ https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/

In [6]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [7]:
import pandas as pd


In [8]:
X = pd.DataFrame(data.data, columns = data.feature_names)

In [10]:
y = pd.Series(data = data.target, name = 'Target')

In [11]:
X.shape

(569, 30)

In [12]:
y.shape

(569,)

In [13]:
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [14]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

In [15]:
from sklearn.ensemble import RandomForestClassifier
# Default hyperparameters; no random_state is set, so results can differ slightly between runs
model = RandomForestClassifier()

In [16]:
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [17]:
model.score(X_train, y_train)

1.0

In [18]:
from sklearn.metrics import accuracy_score
# Accuracy on the held-out test set
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9824561403508771

In [19]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[ 59,   0],
       [  3, 109]], dtype=int64)
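
The confusion matrix above can be read by hand to recover the accuracy and a few related metrics. The sketch below hard-codes the matrix shown in the output and assumes the scikit-learn convention (rows are true classes, columns are predicted classes) together with this dataset's encoding, where class 0 is malignant and class 1 is benign. Treating class 1 as the "positive" class is a choice made here for illustration.

```python
# Confusion matrix copied from the output above
cm = [[59, 0],
      [3, 109]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all test rows classified correctly
precision = tp / (tp + fp)     # of rows predicted as class 1, how many truly are class 1
recall = tp / (tp + fn)        # of rows that truly are class 1, how many were found

print(f"total={total} accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# total=171 accuracy=0.9825 precision=1.0000 recall=0.9732
```

The accuracy matches the `accuracy_score` result, and the three errors are all class 1 (benign) rows predicted as class 0 (malignant).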