## Ensemble on Diabetes

### Ensemble Methods
The three most prominent ensemble methods are:

- **Bagging**
- **Boosting**
    - *AdaBoost*
    - *GradientBoost*
    - *XGBoost*
- **Stacking**  


### Bagging

#### Importing the libraries

In [160]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets


from warnings import filterwarnings
filterwarnings('ignore')

sns.set_style('darkgrid')

In [161]:
df = pd.read_csv('Diabetes.csv')
df.head()

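| Unnamed: 0 | age | gender | cp | trestbps | chol | fps | restecg | thalach | exang | oldpeak | slope | ca | thal | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |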


In [172]:
for col in df.columns:
    print(col)
    print(df[col].unique())
    print()

age
[63 67 37 41 56 62 57 53 44 52 48 54 49 64 58 60 50 66 43 40 69 59 42 55
 61 65 71 51 46 45 39 68 47 34 35 29 70 77 38 74 76]

gender
[1 0]

cp
[1 4 3 2]

trestbps
[145 160 120 130 140 172 150 110 132 117 135 112 105 124 125 142 128 170
 155 104 180 138 108 134 122 115 118 100 200  94 165 102 152 101 126 174
 148 178 158 192 129 144 123 136 146 106 156 154 114 164]

chol
[233 286 229 250 204 236 268 354 254 203 192 294 256 263 199 168 239 275
 266 211 283 284 224 206 219 340 226 247 167 230 335 234 177 276 353 243
 225 302 212 330 175 417 197 198 290 253 172 273 213 305 216 304 188 282
 185 232 326 231 269 267 248 360 258 308 245 270 208 264 321 274 325 235
 257 164 141 252 255 201 222 260 182 303 265 309 307 249 186 341 183 407
 217 288 220 209 227 261 174 281 221 205 240 289 318 298 564 246 322 299
 300 293 277 214 207 160 394 184 315 409 244 195 196 126 313 259 200 262
 215 228 193 271 210 327 149 295 306 178 237 218 223 242 319 166 180 311
 278 342 169 187 157 176 241 131]

fps

Replace each '?' placeholder with NaN so that the affected rows can be dropped easily.

In [163]:
df = df.replace('?', np.nan)  # mark the '?' placeholders as missing values
df = df.dropna()              # drop every row that contains a missing value

In [171]:
df['ca'] = df['ca'].astype('int')
df['thal'] = df['thal'].astype('int')
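The effect of these cleaning steps can be checked on a small frame. The sketch below uses a hypothetical `toy` DataFrame (not the notebook's data) and assumes missing entries appear as the string '?', which makes pandas read the whole column as strings until the placeholders are removed and the column is cast back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: '?' marks a missing value, so both columns hold strings
toy = pd.DataFrame({"ca": ["0", "3", "?", "2"], "thal": ["6", "?", "7", "3"]})

# Same steps as in the notebook: '?' -> NaN, drop incomplete rows, cast to int
cleaned = toy.replace("?", np.nan).dropna()
cleaned = cleaned.astype({"ca": "int", "thal": "int"})

print(cleaned)
print(cleaned.dtypes)
```

Only the rows without any '?' survive, and both columns end up with an integer dtype.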

#### Splitting the dataset

In [173]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [174]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, shuffle = True, random_state = 101)

#### Instantiating the Classifier Models

We use two individual base classifiers:

- Decision Tree
- K Nearest Neighbors

In [175]:
from sklearn.tree import DecisionTreeClassifier
classifier1 = DecisionTreeClassifier(criterion= 'gini', max_depth= 1)

In [176]:
from sklearn.neighbors import KNeighborsClassifier
classifier2 = KNeighborsClassifier(n_neighbors= 2)

In [177]:
from sklearn.ensemble import BaggingClassifier
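# Each ensemble trains 10 copies of the base model, each on a sample of 80% of the training rows.
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `estimator=` on newer versions.
bagging1 = BaggingClassifier(base_estimator= classifier1, n_estimators= 10, max_samples=0.8)
bagging2 = BaggingClassifier(base_estimator= classifier2, n_estimators= 10, max_samples=0.8)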

In [178]:
from sklearn.model_selection import cross_val_score

In [179]:
scores = cross_val_score(classifier1, X_train, y_train, scoring= 'accuracy')
print("Accuracy for Single Decision Tree  : {:.2f} %".format(scores.mean()*100))
scores = cross_val_score(classifier2, X_train, y_train, scoring= 'accuracy')
print("Accuracy for Single K-Nearest Neighbors: {:.2f} %".format(scores.mean()*100))
scores = cross_val_score(bagging1, X_train, y_train, scoring= 'accuracy')
print("Accuracy for Bagged Decision Tree: {:.2f} %".format(scores.mean()*100))
scores = cross_val_score(bagging2, X_train, y_train, scoring= 'accuracy')
print("Accuracy for Bagged K-Nearest Neighbors: {:.2f} %".format(scores.mean()*100))

Accuracy for Single Decision Tree  : 48.64 %
Accuracy for Single K-Nearest Neighbors: 48.18 %
Accuracy for Bagged Decision Tree: 53.64 %
Accuracy for Bagged K-Nearest Neighbors: 45.47 %
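To make the bagging step more concrete, here is a minimal hand-rolled sketch of what a bagged ensemble of decision stumps does: draw bootstrap samples, fit one stump per sample, and combine the predictions. It runs on synthetic data from `make_classification` (an assumption, not the notebook's dataset), and it uses a hard majority vote for simplicity, whereas scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_estimators, max_samples = 10, 0.8
n_draw = int(max_samples * len(X_demo))  # 80% of the rows per bootstrap sample

stumps = []
for _ in range(n_estimators):
    # Bootstrap: sample row indices with replacement
    idx = rng.integers(0, len(X_demo), size=n_draw)
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X_demo[idx], y_demo[idx]))

# One row of predictions per stump, shape (n_estimators, n_samples)
votes = np.stack([stump.predict(X_demo) for stump in stumps])

# Hard majority vote over the 0/1 labels
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy of the hand-rolled ensemble:", (bagged_pred == y_demo).mean())
```

Because every stump sees a different resample, their errors are partly decorrelated, which is where bagging's variance reduction comes from.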


### Boosting 

Using **AdaBoost**

In [180]:
from sklearn.ensemble import AdaBoostClassifier

In [181]:
estimators_names = ['Estimators = 1', 'Estimators = 3', 'Estimators = 5', 'Estimators = 7']
numEst = [1, 3, 5, 7]

for num_est, est_name in zip(numEst, estimators_names):
    
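    # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`
    estimator = AdaBoostClassifier(base_estimator= classifier1, n_estimators= num_est, algorithm= 'SAMME')
    estimator.fit(X_train,y_train)  # not required for the score below: cross_val_score refits clones on each fold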
    
    scores = cross_val_score(estimator, X_train, y_train, scoring= 'accuracy')
    print("Accuracy for {}: {:.5f}".format(est_name, scores.mean()))

Accuracy for Estimators = 1: 0.48636
Accuracy for Estimators = 3: 0.49091
Accuracy for Estimators = 5: 0.53586
Accuracy for Estimators = 7: 0.59889
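The `SAMME` algorithm selected above reweights the training samples after every round so that the next weak learner concentrates on the examples that were misclassified. The following is a minimal sketch of a single round under simplifying assumptions: the helper `samme_round` and the five-sample toy input are hypothetical, and a learning rate of 1 is assumed.

```python
import numpy as np

def samme_round(weights, misclassified, n_classes):
    """One SAMME update: returns the learner weight alpha and the new sample weights."""
    # Weighted error rate of the current weak learner
    err = np.sum(weights * misclassified) / np.sum(weights)
    # Learner weight; the log(n_classes - 1) term is the multi-class correction
    alpha = np.log((1.0 - err) / err) + np.log(n_classes - 1.0)
    # Increase the weight of misclassified samples, then renormalise
    new_weights = weights * np.exp(alpha * misclassified)
    return alpha, new_weights / new_weights.sum()

# Toy round: five equally weighted samples, the third one misclassified
weights = np.full(5, 0.2)
misclassified = np.array([0, 0, 1, 0, 0])
alpha, new_weights = samme_round(weights, misclassified, n_classes=2)
print(alpha, new_weights)
```

In this toy round the error rate is 0.2, so alpha equals ln 4 and the misclassified sample ends up carrying half of the total weight.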


Using **Gradient Boosting** and **Extreme Gradient Boosting (XGBoost)**

In [182]:
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [183]:
estimators = [GradientBoostingClassifier, XGBClassifier]
est_names = ['GradientBoostingClassifier', 'XGBClassifier']
for est, est_name in zip(estimators, est_names):

    estimator = est()  # instantiate with default hyperparameters
    estimator.fit(X_train,y_train)
    scores = cross_val_score(estimator, X_train, y_train, scoring= 'accuracy')
    print("Accuracy for {}: {:.5f}".format(est_name, scores.mean()))    

Accuracy for GradientBoostingClassifier: 0.55414
Accuracy for XGBClassifier: 0.54485
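Both classifiers above build their ensemble stage by stage, with each new tree fitted to the gradient of the loss of the ensemble so far. A minimal sketch of that idea follows. It is deliberately simplified: it uses squared-error regression on synthetic data (an assumption made for readability, since the notebook's task is classification), where the negative gradient is simply the residual.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y_demo, y_demo.mean())  # stage 0: constant prediction
mse_history = [np.mean((y_demo - pred) ** 2)]

for _ in range(50):
    residuals = y_demo - pred  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    pred += learning_rate * tree.predict(X_demo)  # small step along the fitted tree
    mse_history.append(np.mean((y_demo - pred) ** 2))

print("Training MSE before and after boosting:", mse_history[0], mse_history[-1])
```

The classifiers follow the same loop with a classification loss in place of the squared error; XGBoost additionally adds regularisation and second-order information to each stage.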


### Stacking 

In [184]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from mlxtend.classifier import StackingClassifier

In [185]:
classifier1 = KNeighborsClassifier(n_neighbors= 10)
classifier2 = RandomForestClassifier()
classifier3 = LogisticRegression()
nb = GaussianNB()
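# Despite the variable name, the meta-classifier here is Gaussian Naive Bayes, not logistic regression
stackedLR = StackingClassifier(classifiers= [classifier1, classifier2, classifier3], meta_classifier= nb)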

In [186]:
estimators_names = ['K Neighbors', 'Random Forest', 'Logistic Regression', 'Stacked with Naive Bayes']
estimators_list = [classifier1, classifier2, classifier3, stackedLR ]

stacking_mean = []
stacking_std = []

for estimator, est_name in zip(estimators_list, estimators_names):

    scores = cross_val_score(estimator, X_train, y_train, scoring= 'accuracy')
    print("Accuracy for {}: {:.5f}".format(est_name, scores.mean()))    
    
    estimator.fit(X_train, y_train)

    stacking_mean.append(scores.mean())
    stacking_std.append(scores.std())

Accuracy for K Neighbors: 0.49990
Accuracy for Random Forest: 0.55838
Accuracy for Logistic Regression: 0.59030
Accuracy for Stacked with Naive Bayes: 0.55434
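As a point of comparison, the same stack can be sketched with scikit-learn's own `StackingClassifier` (available since version 0.22), which trains the meta-classifier on cross-validated predictions of the base models. This is an alternative to the `mlxtend` class used above, not the notebook's method. The data is synthetic (`make_classification`), and the `random_state`, `n_estimators` and `max_iter` values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=10)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=GaussianNB(),  # meta-classifier, as in the notebook
    cv=5,  # base-model predictions for the meta-classifier come from 5-fold CV
)

scores = cross_val_score(stack, X_demo, y_demo, scoring="accuracy")
print("Accuracy for stacked model: {:.5f}".format(scores.mean()))
```

Feeding the meta-classifier out-of-fold predictions reduces the risk that it simply learns to trust base models that have memorised the training set.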


#### *You're welcome!*