<a href="https://colab.research.google.com/github/salma71/blog_post/blob/master/Evaluate_ML_models_with_ensamble.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q kaggle

In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"salmaeng","key":"<redacted>"}'}

In [3]:
# Create the ~/.kaggle directory and copy the kaggle.json file there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [4]:
# Restrict the file's permissions so that only the owner can read and write it.
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
!kaggle datasets download -d uciml/pima-indians-diabetes-database

Downloading pima-indians-diabetes-database.zip to /content
  0% 0.00/8.91k [00:00<?, ?B/s]
100% 8.91k/8.91k [00:00<00:00, 3.44MB/s]


In [6]:
# unzip the downloaded archive into the train directory
!mkdir train
!unzip pima-indians-diabetes-database.zip -d train

Archive:  pima-indians-diabetes-database.zip
  inflating: train/diabetes.csv      


In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [8]:
# load the data
data = pd.read_csv('/content/train/diabetes.csv')
data.head()

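|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |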


# Improve performance using ensembles

Ensembles can boost the accuracy of a machine learning model.

In this section, we will use the bagging ensemble method. We will also use boosting ensemble methods such as AdaBoost and stochastic gradient boosting. Additionally, we will use voting ensemble methods to combine the predictions from multiple algorithms.

So, let's dig in!


# Combine models into ensemble predictions

There are three popular methods to combine the predictions from different models:

- Bagging: also known as [bootstrap aggregating](https://en.wikipedia.org/wiki/Bootstrap_aggregating). **B**ootstrap **agg**regat**ing** builds multiple models (usually of the same type) from different subsamples of the training dataset.

- [Boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning)): another technique that builds multiple models (also of the same type); however, each model learns to fix the prediction errors of the previous model in the sequence. It is mainly used to balance bias and variance in supervised machine learning models, and it converts a set of weak learners into a strong one.


- Voting: builds multiple models **of different types** and then uses simple statistics (the mean, for example) to combine their predictions.



For this tutorial, I will use the Pima Indians Diabetes dataset, originally from the UCI Machine Learning Repository. In the Colab notebook, I fetched the data via the Kaggle API.

# 1. Bagging Algorithm 
Bootstrap aggregation, or bagging for short, draws multiple samples from the training set with replacement and trains a separate model on each sample.

The final prediction is calculated by averaging the predictions generated by the submodels.
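The two ingredients of bagging, sampling with replacement and averaging, can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `bootstrap_indices` and `average_prediction` are hypothetical, and the probabilities are made-up values rather than outputs of a trained model.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def average_prediction(probabilities, threshold=0.5):
    """Average the submodels' class-1 probabilities and threshold the mean."""
    mean_proba = sum(probabilities) / len(probabilities)
    return 1 if mean_proba >= threshold else 0

rng = random.Random(42)
n_rows = 768  # same number of rows as the diabetes dataset
sample = bootstrap_indices(n_rows, rng)

# Because rows are drawn with replacement, some repeat and others are left out.
unique_fraction = len(set(sample)) / n_rows
print(f"distinct rows in one bootstrap sample: {unique_fraction:.2f}")

# Made-up class-1 probabilities from three submodels for a single patient.
print(average_prediction([0.8, 0.6, 0.3]))
```

On average, roughly 63% of the rows are distinct in a bootstrap sample of the same size as the original data, which is why each submodel sees a slightly different view of the training set.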

There are three bagging models we might use: 
- Bagging decision trees
- Random Forest
- Extra Trees

Bagging works best with algorithms that have high variance, such as unpruned decision trees. In the following example, we wrap the classification and regression trees algorithm (`DecisionTreeClassifier()`) in a `BaggingClassifier()` from the sklearn package.

In [11]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [12]:
# separate the features (every column except the last) from the target
X = data.iloc[:, :-1].values

In [13]:
y = data.iloc[:, -1].values

In [14]:
print(X.shape, y.shape)

(768, 8) (768,)


In [15]:
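# set the seed for reproducibility
seed = 42
# 10-fold cross-validation without shuffling; recent scikit-learn versions
# raise an error if random_state is passed while shuffle is False
kfold = KFold(n_splits=10)
cart = DecisionTreeClassifier()
num_trees = 100
# model building: the base estimator is passed positionally because its keyword
# was renamed from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(
    cart,
    n_estimators = num_trees,
    random_state = seed
)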

res = cross_val_score(model, X, y, cv=kfold)
print(res.mean())



0.7720608339029391


As shown, bagging reaches a mean accuracy of about 0.77 across the 10 folds, which compares well with the results we obtained earlier using standalone models.

Now, let's try the random forest model. It works like bagged decision trees, but it also reduces the correlation between the individual trees: at each split it considers only a random subset of the features instead of greedily searching all of them for the best split point.

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
# number of features considered when looking for the best split
num_feat = 4
model = RandomForestClassifier(n_estimators=num_trees, max_features=num_feat)
res = cross_val_score(model, X, y, cv=kfold)
print(res.mean())


0.7642344497607655


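The accuracy is slightly lower (about 0.764 versus 0.772 for the bagged trees). One possible reason is that each split only considers a random subset of `num_feat` features when choosing the best split point. The gap is small, and because no `random_state` is set here the score varies from run to run, so it is worth trying other values of `num_feat`. Note that omitting `max_features` in `RandomForestClassifier()` falls back to the library default (the square root of the number of features), whereas setting it to the total number of features (8 here) makes every split consider all features, which is essentially the bagged decision tree approach.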
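To see how `max_features` can affect a random forest, the following sketch compares a few values with cross-validation. It is only an illustration: the data comes from `make_classification` (a synthetic stand-in, not the diabetes dataset), and the names `X_demo`, `y_demo`, and `scores` are introduced here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with 8 features (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
cv = KFold(n_splits=5)

scores = {}
for max_feat in (2, 4, 8):
    # Fixing random_state makes the comparison repeatable between runs.
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_feat, random_state=42
    )
    scores[max_feat] = cross_val_score(forest, X_demo, y_demo, cv=cv).mean()

for max_feat, score in scores.items():
    print(f"max_features={max_feat}: mean accuracy = {score:.3f}")
```

Which setting wins depends on the data, so treat the printed numbers as a quick comparison rather than a general rule.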

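# Extra Trees
Extra Trees (extremely randomized trees) is a modification of bagging in which the trees are randomized further: the split thresholds are drawn at random instead of being optimized. Note that `ExtraTreesClassifier()` in sklearn fits each tree on the whole training set by default (`bootstrap=False`) rather than on a bootstrap sample.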

In [27]:
from sklearn.ensemble import ExtraTreesClassifier

In [32]:
num_feat = 3
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=num_feat)
res = cross_val_score(model, X, y, cv = kfold)
print(res.mean())

0.7708133971291866




---
# 2. Boosting 
Boosting creates a sequence of models in which each model tries to correct the errors of the models preceding it. Once built, the models make predictions that may be weighted by their accuracy, and the results are combined to produce the final prediction. Two of the most common boosting ensemble algorithms are:

- AdaBoost 
It works by weighting the instances in the dataset by how difficult they are to classify, so that the algorithm pays more attention to the hard instances when building the next model in the sequence.


In [33]:
from sklearn.ensemble import AdaBoostClassifier

In [38]:
num_trees = 30
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
res = cross_val_score(model, X, y, cv = kfold)
print(res.mean())

0.760457963089542
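The reweighting idea behind AdaBoost can be sketched with a single round of the classic discrete AdaBoost update. This is a simplified illustration, not the exact procedure inside `AdaBoostClassifier()`; the function name `adaboost_round` and the toy labels are assumptions made for the example, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost: return (alpha, new normalized weights)."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    # The learner's say in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase the weight of misclassified instances, decrease the others.
    raw = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(raw)
    return alpha, [w / total for w in raw]

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]   # the third instance is misclassified
weights = [0.2] * 5        # start from uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After this round the misclassified instance carries half of the total weight, so the next weak learner is pushed to get it right.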


- Stochastic Gradient Boosting (Gradient Boosting Machines) 

It is one of the most sophisticated ensemble techniques, and it is also widely regarded as one of the most effective ways to improve machine learning performance with ensembles.

In [39]:
from sklearn.ensemble import GradientBoostingClassifier

In [40]:
num_trees = 100
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
res = cross_val_score(model, X, y, cv=kfold)
print(res.mean())

0.7642857142857143
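With the default `subsample=1.0`, every tree in `GradientBoostingClassifier()` is fit on the full training set. The "stochastic" variant fits each tree on a random fraction of the rows, which is what a `subsample` value below 1.0 does. A minimal sketch, assuming synthetic data from `make_classification` in place of the diabetes dataset and an arbitrary fraction of 0.8:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# subsample < 1.0 fits each tree on a random 80% of the rows.
sgb = GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=42)
sgb_scores = cross_val_score(sgb, X_demo, y_demo, cv=5)
print(sgb_scores.mean())
```

Whether subsampling helps depends on the dataset, so it is worth comparing against the default before adopting it.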


# 3. Voting Ensemble
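Voting is one of the simplest ensemble methods to implement. First, two or more standalone models are created from the training dataset. Then a voting classifier wraps the models and combines the predictions of the submodels when asked to predict on new data, either by majority vote over the predicted labels (hard voting, the `VotingClassifier()` default) or by averaging the predicted probabilities (soft voting).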

The predictions of the submodels can be weighted (`VotingClassifier()` accepts a `weights` parameter), but good weights are hard to pick by hand. A more advanced approach, **[Stacked Aggregation](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)** (also called stacking), trains another model to learn how to best combine the submodel predictions. Since version 0.22, the sklearn library provides this as `StackingClassifier`, so you no longer need to develop the algorithm from scratch.

> The [H2O library](http://docs.h2o.ai/h2o-tutorials/latest-stable/tutorials/ensembles-stacking/index.html) also offers a decent API for stacked ensembles in R; you can check the documentation [here](http://docs.h2o.ai/h2o-tutorials/latest-stable/tutorials/ensembles-stacking/index.html). 
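For readers who want to try stacking in Python, scikit-learn 0.22 and later include `StackingClassifier`. The following is a minimal sketch rather than a tuned setup: the data is synthetic (`make_classification`), and the choice of base models and of logistic regression as the final estimator is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

base_models = [
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("svm", SVC()),
]
# The final estimator learns how to weight the base models' predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack_scores = cross_val_score(stack, X_demo, y_demo, cv=5)
print(stack_scores.mean())
```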

In the following snippet, we combine three different ML algorithms with the `VotingClassifier()` class from sklearn and check the accuracy of the combined model.

In [41]:
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [43]:
estimators = []
model1 = LogisticRegression(max_iter=1000)
estimators.append(('Logistic regression', model1))

model2 = DecisionTreeClassifier()
estimators.append(('Decision trees', model2))

model3 = SVC()
estimators.append(('SVM', model3))

ensa = VotingClassifier(estimators=estimators)
res = cross_val_score(ensa, X, y, cv=kfold)
print(res.mean())

0.7617224880382775
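The snippet above uses the `VotingClassifier()` default, hard voting, where each model casts one vote for a class label. Soft voting averages predicted probabilities instead, and the two can disagree. A small pure-Python sketch of the difference, with hypothetical helper names (`hard_vote`, `soft_vote`) and made-up probabilities:

```python
def hard_vote(probas, threshold=0.5):
    """Each model votes for a label; the majority label wins."""
    votes = [1 if p >= threshold else 0 for p in probas]
    return 1 if sum(votes) > len(votes) / 2 else 0

def soft_vote(probas, weights=None, threshold=0.5):
    """Average the class-1 probabilities (optionally weighted), then threshold."""
    if weights is None:
        weights = [1.0] * len(probas)
    avg = sum(w * p for w, p in zip(weights, probas)) / sum(weights)
    return 1 if avg >= threshold else 0

# Made-up class-1 probabilities from three models for one sample:
# one model is very confident, the other two lean slightly the other way.
probas = [0.9, 0.4, 0.45]

print(hard_vote(probas))  # two of the three models vote for class 0
print(soft_vote(probas))  # the mean probability is about 0.58, so class 1
```

In sklearn, soft voting is selected with `voting='soft'` and requires every submodel to expose `predict_proba`, which for `SVC` means constructing it with `probability=True`.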
