<img src="AV_Logo.png" style="width: 200px;height: 75px"/>

Table of Contents:
------------
* [Ensemble Model - example and intuition](#Ensemble-Model---example-and-intuition)
* [What is Ensemble Modeling?](#What-is-Ensemble-Modeling?)
* [Ensemble Techniques](#Ensemble-Techniques)
* [Advantages and Disadvantages of ensemble models](#Advantages-and-Disadvantages-of-ensemble-models)
* [Implementation of Ensemble modeling](#Implementation-of-Ensemble-modeling)

### Ensemble Model - example and intuition

Let’s try to understand ensemble models using an example. Suppose we are trying to solve a classification problem: setting the rules for classifying spam emails.

We can generate various rules for classifying spam emails.

Let’s look at some of them:

**Spam**
* Has fewer than 20 words
* Contains only images (promotional images)
* Contains specific keywords like “make money and grow” and “reduce your fat”
* Contains many misspelled words

**Not Spam**
* Email from Analytics Vidhya domain
* Email from family members or anyone in the e-mail address book

These are some common rules for filtering spam e-mails. Do you think any one of these rules can predict the correct class on its own?

Most of us would say no, and that’s true! Combining these rules gives more robust predictions than any individual rule. This is the principle of ensemble modeling: an ensemble model combines multiple ‘individual’ (diverse) models and delivers superior predictive power.

To relate this to real life, a group of people is likely to make better decisions than individuals, especially when the group members come from diverse backgrounds. The same is true in machine learning. Basically, an ensemble is a supervised learning technique that combines multiple weak learners/models to produce a strong learner. Ensembles work better when the models being combined have low correlation, i.e. when they tend to make different errors.
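The benefit of combining diverse, individually weak rules can be illustrated with a small simulation. This is only a sketch under assumed values: five hypothetical rules, each correct 70% of the time, whose errors are fully independent (an idealisation that real models rarely satisfy).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

n_samples = 10_000    # hypothetical number of emails
n_rules = 5           # hypothetical number of independent rules
rule_accuracy = 0.7   # assumed accuracy of every single rule

# True labels: 1 = spam, 0 = not spam
y_true = rng.integers(0, 2, size=n_samples)

# Each rule reports the true label with probability rule_accuracy,
# independently of the other rules (this is the "low correlation" assumption).
is_correct = rng.random((n_rules, n_samples)) < rule_accuracy
rule_preds = np.where(is_correct, y_true, 1 - y_true)

# Majority vote across the rules for every email
ensemble_pred = (rule_preds.sum(axis=0) > n_rules / 2).astype(int)

individual_acc = (rule_preds == y_true).mean(axis=1)
ensemble_acc = (ensemble_pred == y_true).mean()

print("individual accuracies:", individual_acc.round(3))
print("majority-vote accuracy:", round(ensemble_acc, 3))
```

Under these assumptions the majority vote is right whenever at least three of the five rules are right, which happens about 84% of the time. If the rules made the same mistakes (high correlation), the vote would gain little over a single rule.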

## What is Ensemble Modeling?

In general, ensembling is a technique for combining two or more algorithms of similar or dissimilar types, called base learners. This is done to build a more robust system that incorporates the predictions from all the base learners. It can be understood as a conference-room meeting between multiple traders to decide whether the price of a stock will go up or not.

Each of them has a different understanding of the stock market and thus a different mapping function from the problem statement to the desired outcome. They are therefore expected to make varied predictions about the stock price, based on their own understanding of the market.

Now we can take all of these predictions into account while making the final decision. This makes our final decision more robust, more accurate and less likely to be biased. The final decision might have been different if one of these traders had made it alone.

You can consider another example of a candidate going through multiple rounds of job interviews. The final decision on the candidate’s ability is generally based on the feedback of all the interviewers. A single interviewer might not be able to test the candidate for each required skill and trait, but the combined feedback of multiple interviewers usually gives a better assessment of the candidate.

### Ensemble Techniques

Ensemble techniques can be as simple as averaging and can go on to the more complicated bagging and stacking. We'll see these as we go further today.

**Averaging** - Averaging works well for a wide range of problems (both classification and regression). There is not much to it: we simply take the mean of the individual models' predictions. Averaging predictions often reduces overfitting. You ideally want a smooth separation between classes, and a single model’s predictions can be a little rough around the edges. Basically, all we're doing is building multiple models of the same or different types and averaging their results for a final output. For example - 

<img src="average.png" style="width: 300px;height: 50px">
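In code, simple averaging is just a mean over the model axis. The sketch below uses made-up probabilities from three hypothetical models for four samples; the 0.5 threshold for turning the averaged probability into a class label is an assumption.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class,
# one row per model and one column per sample (values are made up).
model_probs = np.array([
    [0.90, 0.40, 0.20, 0.60],  # model 1
    [0.80, 0.55, 0.10, 0.45],  # model 2
    [0.70, 0.35, 0.30, 0.75],  # model 3
])

# Simple averaging: every model contributes equally
avg_probs = model_probs.mean(axis=0)

# Turn the averaged probabilities into class labels (assumed 0.5 threshold)
avg_labels = (avg_probs >= 0.5).astype(int)

print(avg_probs)
print(avg_labels)
```

For regression the same `mean(axis=0)` applies directly to the predicted values, without the thresholding step.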


**Weighted Averaging** - When we simply average the results from different models, we treat all the models and their predictions as equally stable and confident. However, that is often not the case: some models are more stable and more accurate than others. We can give more weight to the more stable predictions than to the less stable ones, resulting in a combined output that is more stable and confident than with simple averaging.

<img src="waverage2.png" style="width: 350px;height: 80px">
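A weighted average can be sketched with `np.average`. The probabilities and the weights below are made up for illustration; in practice the weights would come from something like each model's validation score, which is an assumption here rather than a prescribed method.

```python
import numpy as np

# Hypothetical predicted probabilities: one row per model, one column per sample
model_probs = np.array([
    [0.90, 0.40, 0.20, 0.60],  # model 1 (assumed to be the most reliable)
    [0.80, 0.55, 0.10, 0.45],  # model 2
    [0.70, 0.35, 0.30, 0.75],  # model 3
])

# Assumed weights; they sum to 1 so the result stays a valid probability
weights = np.array([0.5, 0.3, 0.2])

# Weighted averaging: more reliable models pull the result towards their prediction
weighted_probs = np.average(model_probs, axis=0, weights=weights)

print(weighted_probs)
```

Note how the last sample moves from 0.60 under simple averaging to 0.585 here, because the most heavily weighted model is less extreme than model 3 on that sample.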


**Voting** - This is mostly used for classification problems. Multiple classification models are built, so we have multiple predictions for each instance. The final prediction is the one that receives more than half of the votes. If no prediction gets more than half of the votes, we generally take the most-voted prediction. 

<img src="voting.png" style="width: 400px;height: 50px">
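The voting rule described above can be sketched in a few lines. The function name `vote` is hypothetical; it returns the most-voted label together with a flag telling whether that label has a strict majority.

```python
from collections import Counter


def vote(predictions):
    """Return (winning label, whether it received more than half of the votes)."""
    counts = Counter(predictions)
    # most_common(1) gives the label with the highest count; ties go to the
    # label that was encountered first in the input.
    label, n_votes = counts.most_common(1)[0]
    has_majority = n_votes > len(predictions) / 2
    return label, has_majority


# Three classifiers, clear majority for "spam"
print(vote(["spam", "spam", "not spam"]))

# Seven classifiers, no label has more than half: fall back to the most-voted one
print(vote(["a", "b", "c", "a", "b", "a", "c"]))
```

How ties are broken is a design choice; here the first-seen label wins, which may not be what a given library does.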

**Weighted Voting** - Similar to weighted averaging, we can give weights to different predictions here as well. The more stable predictions have more say in the final prediction than the less confident ones. When the weights reflect how reliable each model is, weighted voting tends to be more stable than normal voting.

<img src="wvoting.png" style="width: 400px;height: 80px">
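Weighted voting simply sums the weights behind each label instead of counting votes. The helper `weighted_vote` and the weights below are hypothetical, chosen so that the weighted and unweighted results differ.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting models have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


predictions = ["spam", "not spam", "not spam"]  # one prediction per model
weights = [0.6, 0.25, 0.15]                     # assumed model weights

# A plain majority vote would pick "not spam" (2 votes against 1),
# but the first model carries more weight than the other two together.
print(weighted_vote(predictions, weights))
```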


**Bagging** - Bagging is also referred to as bootstrap aggregation. To understand bagging, we first need to understand bootstrapping. Bootstrapping is a sampling technique in which we draw ‘n’ observations (rows) from an original dataset that also has ‘n’ rows. The key is that each row is selected with replacement from the original dataset, so every row is equally likely to be selected at each draw. Let’s say we have 3 rows numbered 1, 2 and 3.

<img src="bagging1.png" style="width:250px;height: 100px">

For the bootstrapped sample, we choose one of these three rows at random. Say we chose Row 2.

<img src="bagging2.png" style="width: 300px;height: 100px">

You see that even though Row 2 has been copied from the data into the bootstrap sample, it’s still present in the data. Now, each of the three rows has the same probability of being selected again:

<img src="bagging3.png" style="width: 300px;height: 100px">

Let’s say we choose Row 1 this time.

Again, each row in the data has the same probability of being chosen for the bootstrapped sample. Let’s say we randomly choose Row 1 again.

<img src="bagging4.png" style="width: 300px;height: 100px">
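Drawing a bootstrapped sample like the one above takes one call in NumPy. The sketch below uses the same three rows; the seed and the number of samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

rows = np.array([1, 2, 3])

# Draw n rows out of n, with replacement, a few times over
bootstrap_samples = [
    rng.choice(rows, size=len(rows), replace=True) for _ in range(4)
]
for i, sample in enumerate(bootstrap_samples, start=1):
    print(f"bootstrap sample {i}: {sample}")

# On a larger dataset, a bootstrap sample contains only part of the distinct rows
n = 10_000
idx = rng.integers(0, n, size=n)            # row indices drawn with replacement
unique_fraction = np.unique(idx).size / n   # share of distinct rows in the sample
print(f"fraction of distinct rows: {unique_fraction:.3f}")
```

Because rows are drawn with replacement, some rows appear several times and others not at all. For large `n` the expected share of distinct rows approaches 1 - 1/e, roughly 63%.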

Thus, we can have multiple bootstrapped samples from the same data. Once we have these bootstrapped samples, we can fit the algorithm on each of them and use majority voting or averaging to get the final prediction. This is how bagging works.
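The whole procedure can be sketched by hand: draw bootstrapped samples, fit one model per sample, then take a majority vote. The synthetic dataset from `make_classification`, the choice of decision trees as the base algorithm and the number of models are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Synthetic binary classification data (a stand-in for a real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

n_models = 25  # odd, so a majority vote on binary labels cannot tie
models = []
for _ in range(n_models):
    # Bootstrapped sample: n row indices drawn with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_demo[idx], y_demo[idx])
    models.append(tree)

# One row of predictions per model, then a majority vote per sample
all_preds = np.array([model.predict(X_demo) for model in models])
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("bagged accuracy on the training data:", (bagged_pred == y_demo).mean())
```

scikit-learn's `BaggingClassifier` and `RandomForestClassifier` package this pattern with additional options, so the manual loop is mainly useful for seeing the steps.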

One important thing to note here is that bagging is done mainly to reduce variance.

There are many more techniques that can be used when ensembling models, boosting and stacking being the most commonly used ones. You can read about them [here](https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/). 

### Advantages and Disadvantages of ensemble models

#### Advantages

Below are the advantages of ensemble models - 

* Ensembling is a proven method for improving model accuracy and works in most cases.
* It is the key ingredient in winning almost all machine learning hackathons.
* Ensembling makes the model more robust and stable, thus ensuring decent performance on the test cases in most scenarios.
* You can use ensembling to capture both simple linear and complex non-linear relationships in the data. This can be done by using two different models and forming an ensemble of the two.

#### Disadvantages

Below are the disadvantages of ensemble models - 

* Ensembling reduces the model interpretability and makes it very difficult to draw any crucial business insights at the end.
* It is time-consuming and thus might not be the best idea for real-time applications.
* The selection of models for creating an ensemble is an art which is really hard to master.

Ensemble techniques are used in every DataHack problem. Choosing the right ensemble is more of an art than a straightforward science. With experience, you will develop a knack for which ensemble learner to use with different kinds of scenarios and base learners.

## Implementation of Ensemble modeling

In [1]:
## code for Averaging

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv('winequality.csv')

In [4]:
data.head()

Unnamed: 0,ID,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,W0001,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,2
1,W0002,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,,9.5,2
2,W0003,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,,10.1,2
3,W0004,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,2
4,W0005,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,2


In [5]:
data.loc[data.quality == 2, 'quality'] = 0

In [6]:
# first separate dependent and independent variables
X = data.drop(['ID', 'quality'], axis=1)
y = data.quality

In [7]:
# fill missing values
X.fillna(X.mean(), inplace=True)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.0,0.270,0.360000,20.70,0.045,45.0,170.0,1.00100,3.000000,0.450000,8.800000
1,6.3,0.300,0.340000,1.60,0.049,14.0,132.0,0.99400,3.300000,0.490158,9.500000
2,8.1,0.280,0.400000,6.90,0.050,30.0,97.0,0.99510,3.260000,0.490158,10.100000
3,7.2,0.230,0.320000,8.50,0.058,47.0,186.0,0.99560,3.190000,0.400000,9.900000
4,7.2,0.230,0.320000,8.50,0.058,47.0,186.0,0.99560,3.190000,0.400000,9.900000
5,8.1,0.280,0.400000,6.90,0.050,30.0,97.0,0.99510,3.260000,0.440000,10.100000
6,6.2,0.320,0.334031,7.00,0.045,30.0,136.0,0.99490,3.188762,0.470000,9.600000
7,7.0,0.270,0.360000,20.70,0.045,45.0,170.0,1.00100,3.000000,0.450000,8.800000
8,6.3,0.300,0.340000,1.60,0.049,14.0,132.0,0.99400,3.300000,0.490158,9.500000
9,8.1,0.220,0.430000,1.50,0.044,28.0,129.0,0.99380,3.220000,0.450000,11.000000


In [8]:
from sklearn.naive_bayes import GaussianNB

In [9]:
nb = GaussianNB()

In [10]:
# train model
nb.fit(X, y)

GaussianNB(priors=None)

In [11]:
from sklearn.metrics import roc_auc_score

In [12]:
roc_auc_score(y, nb.predict_proba(X)[:, 1])

0.75977172513437852

In [13]:
from sklearn.linear_model import LogisticRegression

In [14]:
logReg = LogisticRegression()

In [15]:
logReg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [16]:
roc_auc_score(y, logReg.predict_proba(X)[:, 1])

0.7966601910494242

In [17]:
pred = (nb.predict_proba(X)[:, 1] + logReg.predict_proba(X)[:, 1]) / 2

In [18]:
roc_auc_score(y, pred)

0.78461535582206665

In [19]:
## code for Weighted Averaging

In [20]:
pred = nb.predict_proba(X)[:, 1]*0.1 + logReg.predict_proba(X)[:, 1] * 0.9

In [21]:
roc_auc_score(y, pred)

0.79807696626690017

In [22]:
## code for Voting (soft voting: averages the predicted class probabilities)

In [23]:
from sklearn.ensemble import VotingClassifier

In [24]:
vt = VotingClassifier([('nb', nb), ('logReg', logReg)], voting='soft')

In [25]:
vt.fit(X, y)

VotingClassifier(estimators=[('nb', GaussianNB(priors=None)), ('logReg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))],
         n_jobs=1, voting='soft', weights=None)

In [26]:
roc_auc_score(y, vt.predict_proba(X)[:, 1])

0.78461535582206665

In [27]:
## code for Weighted Voting (unweighted hard voting first, as a baseline)

In [28]:
vt = VotingClassifier([('nb', nb), ('logReg', logReg)])

In [29]:
vt.fit(X, y)

VotingClassifier(estimators=[('nb', GaussianNB(priors=None)), ('logReg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))],
         n_jobs=1, voting='hard', weights=None)

In [30]:
from sklearn.metrics import accuracy_score
print('naive bayes score:', accuracy_score(y, nb.predict(X)))
print('logReg score:', accuracy_score(y, logReg.predict(X)))
print('ensemble score:', accuracy_score(y, vt.predict(X)))

naive bayes score: 0.70641077991
logReg score: 0.750918742344
ensemble score: 0.738260514496


In [31]:
vt = VotingClassifier([('nb', nb), ('logReg', logReg)], weights=[0.1, 0.9])

In [32]:
vt.fit(X, y)

VotingClassifier(estimators=[('nb', GaussianNB(priors=None)), ('logReg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))],
         n_jobs=1, voting='hard', weights=[0.1, 0.9])

In [33]:
print('naive bayes score:', accuracy_score(y, nb.predict(X)))
print('logReg score:', accuracy_score(y, logReg.predict(X)))
print('ensemble score:', accuracy_score(y, vt.predict(X)))

naive bayes score: 0.70641077991
logReg score: 0.750918742344
ensemble score: 0.750918742344
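The weighted ensemble score above is identical to the logReg score, and a small sketch suggests why: with only two voters and weights of 0.1 and 0.9, the heavier voter wins every disagreement. The function below is a hand-written illustration for binary 0/1 labels, not scikit-learn's internal implementation.

```python
import numpy as np


def weighted_hard_vote(pred_a, pred_b, weight_a, weight_b):
    """Weighted vote between two arrays of binary (0/1) labels."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    # Total weight voting for class 1; class 1 wins if it holds more than half
    votes_for_1 = weight_a * pred_a + weight_b * pred_b
    return (votes_for_1 > (weight_a + weight_b) / 2).astype(int)


# All four possible combinations of two binary predictions
pred_a = np.array([0, 0, 1, 1])  # stands in for the lightly weighted model
pred_b = np.array([0, 1, 0, 1])  # stands in for the heavily weighted model

combined = weighted_hard_vote(pred_a, pred_b, 0.1, 0.9)
print(combined)
```

The combined prediction always equals `pred_b`, so weighted hard voting only becomes interesting with three or more models, or with soft voting, where the weights act on probabilities.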


In [34]:
## code for Bagging (Random Forest)

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
rmf = RandomForestClassifier(n_estimators=100, max_depth=10)

In [37]:
rmf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [38]:
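# note: predict() gives hard 0/1 labels, not the predict_proba(X)[:, 1] scores used for the AUC values above
roc_auc_score(y, rmf.predict(X))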

0.8817269685127791
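All the scores in this section are computed on the same rows the models were trained on, so they are likely to be optimistic. A held-out split gives a fairer estimate. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine dataset; the 70/30 split and the seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the real features and target)
X_demo, y_demo = make_classification(n_samples=1000, n_features=11, random_state=0)

# Keep 30% of the rows aside; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
forest.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class
train_auc = roc_auc_score(y_train, forest.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

print("train AUC:", train_auc)
print("test AUC:", test_auc)
```

The gap between the two numbers gives a rough idea of how much the training score overstates performance on unseen data.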

**Exercise**:

Q1. Apply your learnings on [Loan Prediction practice problem](https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/) 

Q2. Apply your learnings on [Big Mart Sales practice problem](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/) 

That's all for today!
----------------
-------------------------------
<img src="AV_Datafest_logo.png" style="width: 200px;height: 200px"/>
[www.analyticsvidhya.com](https://www.analyticsvidhya.com)

DATAFEST 2017