
Ensembles combine the predictions of several models and can give you a boost in accuracy on your dataset.

In [0]:
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.ensemble import BaggingClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [0]:
# load data 
filename = '/content/diabetes_moddd.csv' 
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names) 
array = dataframe.values 
X = array[:,0:8] 
Y = array[:,8] 

In [5]:
seed = 7 
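# random_state only takes effect when shuffle=True, and recent scikit-learn
# releases raise an error if it is set without shuffling, so it is omitted here
kfold = KFold(n_splits=10)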



Bagged Decision Trees


In [0]:
# Bagged Decision Trees for Classification 
cart = DecisionTreeClassifier() 
num_trees = 100 
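# the base model is passed positionally because its keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)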
results = cross_val_score(model, X, Y, cv=kfold)

In [7]:
results.mean()

0.770745044429255
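
To make the idea behind bagging concrete, here is a hand-rolled sketch: each tree is fit on a bootstrap sample (rows drawn with replacement) and the trees then vote. It runs on synthetic data from `make_classification` rather than the diabetes CSV, and the names `X_demo`, `y_demo`, `n_trees` and `all_votes` are made up for this illustration. It is not how `BaggingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7)

rng = np.random.default_rng(7)
n_trees = 25  # odd, so a binary majority vote cannot tie
all_votes = []
for _ in range(n_trees):
    # bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=7).fit(X_train[rows], y_train[rows])
    all_votes.append(tree.predict(X_test))

# labels are 0/1, so the mean over trees is the share of trees voting for class 1
vote_share = np.mean(all_votes, axis=0)
bagged_prediction = (vote_share > 0.5).astype(int)
bagged_accuracy = np.mean(bagged_prediction == y_test)
print(bagged_accuracy)
```

Because every tree sees a different resample, their individual errors differ, and the vote tends to smooth them out.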

Random Forest

In [8]:
num_trees = 100 
max_features = 3 
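kfold = KFold(n_splits=10)  # random_state is only valid together with shuffle=True
# no random_state is set on the forest, so the score varies slightly between runs
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)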
results = cross_val_score(model, X, Y, cv=kfold) 



In [9]:
results.mean()

0.7668489405331511

Extra Trees

In [10]:
num_trees = 100 
max_features = 7 
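kfold = KFold(n_splits=10)  # random_state is only valid together with shuffle=True
# no random_state is set on the model, so the score varies slightly between runs
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)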
results = cross_val_score(model, X, Y, cv=kfold) 



In [11]:
results.mean()

0.7551093643198906

# Boosting Algorithms

Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence. Once created, the models make predictions which may be weighted by their demonstrated accuracy, and the results are combined to create a final output prediction. The two most common boosting ensemble machine learning algorithms are:


1.   AdaBoost
2.   Stochastic Gradient Boosting
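
The sequential nature of boosting can be observed with a small sketch. It fits AdaBoost on synthetic data from `make_classification` (an assumption, used so the cell runs without the CSV) and uses `staged_score` to report held-out accuracy after each model is added to the sequence. The names `X_demo`, `y_demo`, `booster` and `staged_accuracy` are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7)

booster = AdaBoostClassifier(n_estimators=30, random_state=7)
booster.fit(X_train, y_train)

# staged_score yields the test accuracy after each boosting round,
# i.e. using only the first 1, 2, 3, ... models of the sequence
staged_accuracy = list(booster.staged_score(X_test, y_test))
print("after the first model:", staged_accuracy[0])
print("after all", len(staged_accuracy), "models:", staged_accuracy[-1])
```

Comparing the first and last entries gives a rough view of how much the later models in the sequence contribute on this particular data.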



In [12]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
seed = 7
kfold = KFold(n_splits=10)  # random_state is only valid together with shuffle=True



AdaBoost

In [0]:
num_trees = 30
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed) 
results = cross_val_score(model, X, Y, cv=kfold)

In [14]:
results.mean()

0.760457963089542

Stochastic Gradient Boosting

In [0]:
num_trees = 100
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed) 
results = cross_val_score(model, X, Y, cv=kfold)

In [16]:
results.mean()

0.7681989063568012

# Voting Ensemble


Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and combine the predictions of the sub-models when asked to make predictions for new data: by default, scikit-learn's VotingClassifier takes a majority vote over the predicted class labels, and with voting='soft' it averages the predicted class probabilities. The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult.
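
As a rough illustration of the two ways a voting ensemble can combine its sub-models, the sketch below compares majority voting with weighted probability averaging. It uses synthetic data from `make_classification`, swaps in `GaussianNB` as a third model because soft voting needs probability estimates, and the weights `[2, 1, 1]` are arbitrary values chosen only for the example. The helper `make_estimators` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=7)

def make_estimators():
    # hypothetical helper: returns a fresh list of (name, model) pairs
    return [
        ('logistic', LogisticRegression(max_iter=1000)),
        ('cart', DecisionTreeClassifier(random_state=7)),
        ('nb', GaussianNB()),
    ]

# hard voting: each sub-model casts one vote for a class label
hard_vote = VotingClassifier(make_estimators(), voting='hard')
# soft voting: predicted probabilities are averaged, here with arbitrary weights
soft_vote = VotingClassifier(make_estimators(), voting='soft', weights=[2, 1, 1])

hard_scores = cross_val_score(hard_vote, X_demo, y_demo, cv=5)
soft_scores = cross_val_score(soft_vote, X_demo, y_demo, cv=5)
print("hard voting:", hard_scores.mean())
print("soft voting:", soft_scores.mean())
```

Which variant scores higher depends on the data and on how well calibrated the sub-models' probabilities are, so treat the weights as something to tune rather than a recommendation.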

In [0]:
from pandas import read_csv 
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC 
from sklearn.ensemble import VotingClassifier

In [0]:
# create the sub models 
estimators = [] 
model1 = LogisticRegression(max_iter=200) 
estimators.append(('logistic', model1)) 
model2 = DecisionTreeClassifier() 
estimators.append(('cart', model2))
model3 = SVC() 
estimators.append(('svm', model3))

In [24]:
# create the ensemble model 
ensemble = VotingClassifier(estimators) 
results = cross_val_score(ensemble, X, Y, cv=kfold)
results.mean()

0.7604237867395763