# Ensemble Methods

So far we have learnt how to create a classifier and how to improve its accuracy with preprocessing methods and hyperparameter tuning. In this lab, we shall look at ensemble techniques, which combine multiple classifiers to achieve better results than any single one:

- Bagging
- Boosting
- Stacking

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import sklearn
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

## Loading Dataset

In [2]:
df = pd.read_csv('train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 11 columns):
id           120 non-null int64
chem_0       120 non-null float64
chem_1       120 non-null float64
chem_2       120 non-null float64
chem_3       120 non-null float64
chem_4       120 non-null float64
chem_5       120 non-null float64
chem_6       120 non-null float64
chem_7       120 non-null float64
attribute    120 non-null float64
class        120 non-null int64
dtypes: float64(9), int64(2)
memory usage: 10.4 KB


In [3]:
df.head()

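| | id | chem_0 | chem_1 | chem_2 | chem_3 | chem_4 | chem_5 | chem_6 | chem_7 | attribute | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 4.21 | 3.82 | 1.1 | 11.77 | 4.7 | 0.0 | 0.0 | 9.57 | 2.213 | 1 |
| 1 | 81 | 3.71 | 3.93 | 5.4 | 11.81 | 15.4 | 1.5 | 0.0 | 8.21 | 1.8 | 2 |
| 2 | 32 | 5.79 | 1.83 | 3.1 | 10.43 | 13.1 | 0.0 | 16.8 | 8.61 | 2.365 | 7 |
| 3 | 170 | 2.87 | 3.56 | 6.5 | 13.14 | 16.4 | 0.0 | 0.0 | 7.99 | 1.674 | 2 |
| 4 | 48 | 2.84 | 3.5 | 5.6 | 13.27 | 11.4 | 0.0 | 0.0 | 8.55 | 1.747 | 1 |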
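
The split in the next section expects a feature matrix `X` and a label vector `y`. A minimal sketch of one way to build them follows, assuming that `class` is the target and that `id` is only a row identifier (neither is stated explicitly in this lab). The five rows printed by `df.head()` are rebuilt by hand so the sketch runs without `train.csv`.

```python
import pandas as pd

# The five rows shown by df.head(), rebuilt so the sketch runs without train.csv.
df = pd.DataFrame({
    "id": [80, 81, 32, 170, 48],
    "chem_0": [4.21, 3.71, 5.79, 2.87, 2.84],
    "chem_1": [3.82, 3.93, 1.83, 3.56, 3.5],
    "chem_2": [1.1, 5.4, 3.1, 6.5, 5.6],
    "chem_3": [11.77, 11.81, 10.43, 13.14, 13.27],
    "chem_4": [4.7, 15.4, 13.1, 16.4, 11.4],
    "chem_5": [0.0, 1.5, 0.0, 0.0, 0.0],
    "chem_6": [0.0, 0.0, 16.8, 0.0, 0.0],
    "chem_7": [9.57, 8.21, 8.61, 7.99, 8.55],
    "attribute": [2.213, 1.8, 2.365, 1.674, 1.747],
    "class": [1, 2, 7, 2, 1],
})

# Assumption: `class` is the target and `id` carries no predictive information.
X = df.drop(columns=["id", "class"])
y = df["class"]

print(X.shape, y.shape)
```

With the full dataset the same two lines apply unchanged to the `df` loaded from `train.csv`.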


## Splitting the dataset

Split the dataset into training and validation sets. Here `X` is the feature matrix and `y` is the vector of class labels.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

## Model Creation and Evaluation

Create a classifier and train it.

In [None]:
clf = None

Generate predictions on the validation data and print the accuracy of the model on it.

In [None]:
y_pred = None
accuracy = None

print(accuracy)
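
As a point of reference, here is one possible way to fill in the two cells above. It is only a sketch: the decision tree is an arbitrary choice of baseline, and the data is a synthetic stand-in produced by `make_classification` (8 features, 3 classes) so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Train a single decision tree as the baseline classifier.
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Predict on the validation set and measure the fraction of correct labels.
y_pred = clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)

print(accuracy)
```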

How were the results? We will now try to use some additional techniques to improve the accuracy.

## Bagging

Use sklearn's BaggingClassifier, with the model you trained above as its base estimator. Bagging fits many copies of that model on bootstrap samples of the training data and combines their predictions. Compute the new accuracy.

In [None]:
#BaggingClassifier
from sklearn.ensemble import BaggingClassifier

bag_clf = None
y_pred_bag = None
bag_acc = None

print(bag_acc)
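
A hedged sketch of the bagging step, again on synthetic stand-in data and with a decision tree assumed as the base model. The base estimator is passed positionally because its keyword name changed between sklearn releases (`base_estimator` in older versions, `estimator` in newer ones). The value `n_estimators=50` is illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 50 trees, each fitted on a bootstrap sample of the training set.
bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                            n_estimators=50, random_state=1)
bag_clf.fit(X_train, y_train)

y_pred_bag = bag_clf.predict(X_val)
bag_acc = accuracy_score(y_val, y_pred_bag)

print(bag_acc)
```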

The random forest algorithm applies bagging to decision trees and, in addition, considers only a random subset of the features at each split. Use the RandomForestClassifier from sklearn and print its accuracy.

In [None]:
#RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rf_clf = None
y_pred_rf = None
rf_acc = None

print(rf_acc)
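
The random forest cell could be completed along these lines. This is a sketch on synthetic stand-in data; `n_estimators=100` is an illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 100 trees; each split looks at a random subset of the features.
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

y_pred_rf = rf_clf.predict(X_val)
rf_acc = accuracy_score(y_val, y_pred_rf)

print(rf_acc)
```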

## Boosting

### Weight-based

Use the AdaBoost classifier to generate predictions on the validation data and print the accuracy.

In [None]:
#AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

ab_clf = None
y_pred_ab = None
ab_acc = None

print(ab_acc)
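
A minimal AdaBoost sketch, under the same assumptions as before (synthetic stand-in data, illustrative `n_estimators`). With no base estimator given, sklearn boosts depth-1 decision trees, reweighting the training samples after each round so that later trees focus on the examples earlier ones got wrong.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Default base learner is a decision stump (a tree of depth 1).
ab_clf = AdaBoostClassifier(n_estimators=100, random_state=1)
ab_clf.fit(X_train, y_train)

y_pred_ab = ab_clf.predict(X_val)
ab_acc = accuracy_score(y_val, y_pred_ab)

print(ab_acc)
```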

### Residual-based

Using sklearn's gradient-boosted decision trees, generate predictions on the validation data and print the accuracy.

In [None]:
#GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = None
y_pred_gb = None
gb_acc = None

print(gb_acc)
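
One way the gradient boosting cell might look, as a sketch on synthetic stand-in data. The values of `n_estimators` and `learning_rate` are sklearn's defaults, written out only to make them visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Each new tree is fitted to the gradient of the loss of the ensemble so far.
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    random_state=1)
gb_clf.fit(X_train, y_train)

y_pred_gb = gb_clf.predict(X_val)
gb_acc = accuracy_score(y_val, y_pred_gb)

print(gb_acc)
```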

Using the xgboost classifier, generate predictions on the validation data and print the new accuracy.

You can use the following commands to install xgboost.

`conda install -c conda-forge xgboost` (Linux and OSX)

`conda install -c anaconda py-xgboost` (All)

In [None]:
#XGBClassifier
from xgboost import XGBClassifier

xgb_clf = None
y_pred_xgb = None
xgb_acc = None

print(xgb_acc)
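
One practical point: the labels in this dataset are not consecutive (the rows shown earlier contain classes 1, 2 and 7), and recent xgboost releases may reject class labels that are not integers running from 0 upwards. If that happens, a possible workaround is to encode the labels before fitting and decode the predictions afterwards. The sketch below shows only the encoding step with sklearn's `LabelEncoder`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Labels taken from the five rows shown by df.head().
y_raw = np.array([1, 2, 7, 2, 1])

# Map the sorted distinct labels (1, 2, 7) to 0, 1, 2.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)

# After predicting with a model trained on y_encoded,
# inverse_transform recovers the original class labels.
y_back = encoder.inverse_transform(y_encoded)

print(y_encoded, y_back)
```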

## Stacking

In stacking, a few base models each predict the output, and a meta model is then trained on the outputs of these base models.

We'll split the training dataset into two equal parts, A and B. The base models will be trained on A, and their predictions on B will be used to train the meta model.

In [None]:
X_A = None
y_A = None
X_B = None
y_B = None

Train the base models on dataset A and generate their predictions on dataset B.

In [None]:
clf_1 = None
y_pred_1 = None
clf_2 = None
y_pred_2 = None
clf_3 = None
y_pred_3 = None

Create a new dataset C whose features are the base models' predictions on B and whose labels are the true labels of B.

In [None]:
X_C = None
y_C = None

X_C.head()

Combine the predictions made by the base models on the validation set to create a dataset D, using the validation labels as its labels.

In [None]:
X_D = None
y_D = None

Train a meta model on C and print its accuracy on D.

In [None]:
meta_clf = None
y_pred_meta = None
meta_acc = None

print(meta_acc)
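
The whole stacking procedure above can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the synthetic stand-in data, the three base models (decision tree, k-nearest neighbours, Gaussian naive Bayes) and the shallow decision tree used as meta model. The meta model's features are the predicted labels themselves. Using predicted probabilities instead is a common variation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Split the training set into two equal halves, A and B.
X_A, X_B, y_A, y_B = train_test_split(
    X_train, y_train, test_size=0.5, random_state=1)

# Train every base model on A only.
base_models = [DecisionTreeClassifier(random_state=1),
               KNeighborsClassifier(),
               GaussianNB()]
for model in base_models:
    model.fit(X_A, y_A)

# Dataset C: one column per base model, holding its predicted label on B.
X_C = np.column_stack([model.predict(X_B) for model in base_models])
y_C = y_B

# Dataset D: the same construction on the validation set.
X_D = np.column_stack([model.predict(X_val) for model in base_models])
y_D = y_val

# The meta model learns how to combine the base models' votes.
meta_clf = DecisionTreeClassifier(max_depth=3, random_state=1)
meta_clf.fit(X_C, y_C)

y_pred_meta = meta_clf.predict(X_D)
meta_acc = accuracy_score(y_D, y_pred_meta)

print(meta_acc)
```

Training the meta model on B rather than A matters: the base models have already seen A, so their predictions there would look better than they really are.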

## Majority Voting Techniques

Instead of just using one classifier, you can gather predictions from different classifiers, and let them 'vote' for the most appropriate label. This can be done by using sklearn's VotingClassifier.

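Build a list of different classifiers and use it to instantiate two VotingClassifiers, one with hard voting and one with soft voting. Hard voting picks the label predicted by most classifiers. Soft voting averages the predicted class probabilities and picks the most probable label.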

In [None]:
from sklearn.ensemble import VotingClassifier

estimators = None

soft_voter = None
hard_voter = None

Fit the voting classifiers, and compute their accuracies on the validation data.

In [None]:
soft_acc = None
hard_acc = None

print("Acc of soft voting classifier:{}".format(soft_acc))
print("Acc of hard voting classifier:{}".format(hard_acc))
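
A possible completion of the two voting cells, offered as a sketch. The three estimators are arbitrary picks that all implement `predict_proba` (which soft voting needs), and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# (name, estimator) pairs; VotingClassifier fits clones of these.
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("dt", DecisionTreeClassifier(random_state=1)),
              ("nb", GaussianNB())]

soft_voter = VotingClassifier(estimators=estimators, voting="soft")
hard_voter = VotingClassifier(estimators=estimators, voting="hard")

soft_voter.fit(X_train, y_train)
hard_voter.fit(X_train, y_train)

soft_acc = accuracy_score(y_val, soft_voter.predict(X_val))
hard_acc = accuracy_score(y_val, hard_voter.predict(X_val))

print("Acc of soft voting classifier:{}".format(soft_acc))
print("Acc of hard voting classifier:{}".format(hard_acc))
```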

Apply hyperparameter tuning to the voting classifier by trying different weights for the estimators.
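
One way to try different weights is a grid search over the `weights` parameter of `VotingClassifier`. The sketch below assumes a soft voter over three arbitrary estimators, a small hand-picked grid of weight vectors, 3-fold cross-validation on the training set, and synthetic stand-in data. None of these choices come from the lab itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption: 8 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("dt", DecisionTreeClassifier(random_state=1)),
              ("nb", GaussianNB())]
voter = VotingClassifier(estimators=estimators, voting="soft")

# Candidate weight vectors, one weight per estimator in the list above.
param_grid = {"weights": [[1, 1, 1], [2, 1, 1], [1, 2, 1], [1, 1, 2]]}

# Pick the weights with the best cross-validated accuracy on the training set.
search = GridSearchCV(voter, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# score() uses the best voter, refitted on the whole training set.
val_acc = search.score(X_val, y_val)

print(search.best_params_, val_acc)
```

Selecting the weights by cross-validation on the training set keeps the validation set untouched for the final accuracy estimate.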