# Ensemble Methods - Diabetes Dataset

### Models Created 

1. Decision Tree

2. Random Forest

3. Support Vector Machines

4. Logistic Regression

5. Bagging Model

6. Adaboost Model

7. Gradient Boosting Method

8. Voting Classifier

The Voting Classifier gave the best result: accuracy of about 80% and Cohen's kappa of about 0.53.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
# Importing the Dataset
diabetes = pd.read_csv("~/Downloads/diabetes.csv")

In [13]:
diabetes.head()

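| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |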


In [14]:
# Scaling is required because the data is in different scales of measurement
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [15]:
# Define the features (x) and the target (y)
x = diabetes.drop("Outcome", axis = 1)
y = diabetes.Outcome

In [19]:
# Apply Standard Scaling on X
scaled_x = pd.DataFrame(sc.fit_transform(x), columns=x.columns)



In [20]:
scaled_x.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.639947 | 0.848324 | 0.149641 | 0.90727 | -0.692891 | 0.204013 | 0.468492 | 1.425995 |
| 1 | -0.844885 | -1.123396 | -0.160546 | 0.530902 | -0.692891 | -0.684422 | -0.365061 | -0.190672 |
| 2 | 1.23388 | 1.943724 | -0.263941 | -1.288212 | -0.692891 | -1.103255 | 0.604397 | -0.105584 |
| 3 | -0.844885 | -0.998208 | -0.160546 | 0.154533 | 0.123302 | -0.494043 | -0.920763 | -1.041549 |
| 4 | -1.141852 | 0.504055 | -1.504687 | 0.90727 | 0.765836 | 1.409746 | 5.484909 | -0.020496 |


The purpose of scaling is to put features that were measured in different units onto a single common scale.

Standard scaling is the z-score transformation, the same standardization used with continuous probability distributions: each value has its column mean subtracted and is then divided by the column standard deviation.

As a result, every transformed feature has mean 0 and standard deviation 1.
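The z-score transformation can be checked by hand. The snippet below is a minimal sketch on a small made-up matrix (`toy` and its values are assumptions for illustration, not part of the diabetes data); it compares `StandardScaler` with the manual formula and shows the difference between the population and sample standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 900.0]])

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: z = (x - column mean) / population SD (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

col_means = scaled.mean(axis=0)         # approximately 0 for each column
pop_sd = scaled.std(axis=0, ddof=0)     # 1 for each column
sample_sd = scaled.std(axis=0, ddof=1)  # sqrt(n / (n - 1)), slightly above 1

print(col_means, pop_sd, sample_sd)
```

`StandardScaler` divides by the population standard deviation, while pandas `describe()` reports the sample standard deviation (ddof=1). For 768 standardized rows that gives sqrt(768/767), roughly 1.000652, rather than exactly 1.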



In [21]:
scaled_x.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 2.544261e-17 | 1.000652 | -1.141852 | -0.844885 | -0.250952 | 0.639947 | 3.906578 |
| Glucose | 768.0 | 3.614007e-18 | 1.000652 | -3.783654 | -0.685236 | -0.121888 | 0.605771 | 2.444478 |
| BloodPressure | 768.0 | -1.327244e-17 | 1.000652 | -3.572597 | -0.367337 | 0.149641 | 0.563223 | 2.734528 |
| SkinThickness | 768.0 | 7.994184e-17 | 1.000652 | -1.288212 | -1.288212 | 0.154533 | 0.719086 | 4.921866 |
| Insulin | 768.0 | -3.556183e-17 | 1.000652 | -0.692891 | -0.692891 | -0.428062 | 0.412008 | 6.652839 |
| BMI | 768.0 | 2.295979e-16 | 1.000652 | -4.060474 | -0.595578 | 0.000942 | 0.584771 | 4.455807 |
| DiabetesPedigreeFunction | 768.0 | 2.398978e-16 | 1.000652 | -1.189553 | -0.688969 | -0.300128 | 0.466227 | 5.883565 |
| Age | 768.0 | 1.8576e-16 | 1.000652 | -1.041549 | -0.786286 | -0.360847 | 0.660206 | 4.063716 |


Decision Tree Model

1. It chooses its splits using an impurity measure, either entropy or the Gini index.

2. Decision tree models are easy to understand and interpret.

3. The disadvantage of a decision tree is that it tends to overfit the training data.

Overfitting means that the model learns the patterns in the TRAIN data very well, 
but does not perform well when that learning is applied to the TEST data.

This means that the variance error of the model is very high.

Our objective then becomes reducing the variance error. If this error is reduced, the model becomes more stable.
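One way to see overfitting is to compare accuracy on the training split with accuracy on the test split. The sketch below uses synthetic stand-in data from `make_classification` (an assumption, not the diabetes dataset) and compares an unrestricted tree with a depth-limited one; the names `deep_tree` and `shallow_tree` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=4, flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.30, random_state=0)

# An unrestricted tree keeps splitting until every training row is classified
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting the depth is one simple way to constrain the tree
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:.3f}")
```

A large gap between train and test accuracy is the usual symptom of high variance. Whether the constrained tree scores better on the test split depends on the data, so the numbers here should be read as a diagnostic rather than a result.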

How can I tell whether the problem lies in the data or in the model?

In classification problems, the first thing to check is the class counts of the target variable.

The 0 and 1 values of the target should be roughly evenly distributed; otherwise the data is said to suffer from class imbalance. Such datasets are called imbalanced datasets.

1. Check whether the features are measured on different scales.

2. Also check whether the data has many categorical features and hence many dummy variables (0 and 1) created by one-hot encoding.

3. In such cases, it is recommended to scale the dataset.

4. The target classes should be roughly evenly distributed, with no strong imbalance between them.
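The class-balance check in the list above can be sketched with `value_counts`. The `target` series here is made up for illustration; with the real data the same calls would be applied to the Outcome column. The `stratify` argument of `train_test_split` is shown as one option for keeping the class shares similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target column: 6 negatives and 4 positives
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 1], name="Outcome")
features = pd.DataFrame({"f": range(10)})  # placeholder feature column

counts = target.value_counts()                # rows per class
shares = target.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares)

# stratify=target keeps the class proportions the same in train and test
f_tr, f_te, t_tr, t_te = train_test_split(
    features, target, test_size=0.5, stratify=target, random_state=0
)
print(t_tr.value_counts().to_dict(), t_te.value_counts().to_dict())
```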




In [25]:
# Importing the required library
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()

In [26]:
# Splitting the Data in Train and Test
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(scaled_x, y, test_size = 0.30, random_state = 0)

In [27]:
# Fit the Model on Training Dataset
dtree.fit(xtrain,ytrain)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [29]:
# Make Predictions
model_tree = dtree.predict(xtest)

In [34]:
# Check the Accuracy of the Model
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score, confusion_matrix

In [31]:
accuracy_score(ytest, model_tree)

0.7229437229437229

In [33]:
print(classification_report(ytest, model_tree))

              precision    recall  f1-score   support

           0       0.81      0.78      0.79       157
           1       0.56      0.61      0.58        74

   micro avg       0.72      0.72      0.72       231
   macro avg       0.69      0.69      0.69       231
weighted avg       0.73      0.72      0.73       231



In [35]:
confusion_matrix(ytest, model_tree)

array([[122,  35],
       [ 29,  45]])

In [37]:
cohen_kappa_score(ytest, model_tree)

0.3770961489845791

Why Kappa is a good indicator for classification models

1. Kappa measures how much better the model agrees with the true labels than chance agreement alone would, which makes it a realistic gauge of how useful the model would be in real-life scenarios.

2. With imbalanced classes, accuracy can look high simply because the model predicts the majority class. Kappa corrects for this chance agreement, so it is a more neutral metric to rely on.

3. The Kappa score ranges from -1 to 1. Values closer to 1 indicate a better model, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.
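Kappa can be computed directly from a confusion matrix as (observed agreement - chance agreement) / (1 - chance agreement). The sketch below applies that formula to the decision tree's confusion matrix reported earlier; the variable names are illustrative.

```python
import numpy as np

# Decision tree confusion matrix from above (rows: actual, columns: predicted)
cm = np.array([[122, 35],
               [29, 45]])

n = cm.sum()
# Observed agreement is the accuracy: correct predictions / all predictions
observed = np.trace(cm) / n
# Chance agreement: sum over classes of (actual share * predicted share)
expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
kappa = (observed - expected) / (1 - expected)

print(round(observed, 4), round(expected, 4), round(kappa, 4))
```

The result agrees with the accuracy (about 0.7229) and the `cohen_kappa_score` value (about 0.3771) shown above. More than half of the observed agreement is what chance alone would produce, which is why kappa is so much lower than accuracy.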


Talking about Decision Trees

If a decision tree is overfitting, a common way to reduce the variance error is to build a Random Forest model.

An RF model is a collection of decision trees. It builds many similar trees, each on a bootstrap sample of the training data with a random subset of features considered at each split, and combines their predictions. Averaging over many trees reduces the variance error.

So the RF is usually a better model than a single tree: it reduces the variance error, and the overall performance improves.

In a nutshell, the RF model is a type of ensemble that takes multiple decision trees into account and assigns labels based on the combined vote (or averaged class probabilities) of all the trees behind it.
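The idea behind a random forest can be sketched by hand: train each tree on a bootstrap sample and take a majority vote. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally (scikit-learn averages the trees' predicted class probabilities); `n_trees`, `votes`, and `forest_pred` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)
rng = np.random.default_rng(0)

n_trees = 25  # odd, so a 0/1 majority vote cannot tie
votes = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    votes.append(tree.predict(X_demo))

votes = np.array(votes)                               # (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote
print(forest_pred[:10])
```

Because each tree sees a different resample and different feature subsets, their individual errors are only partly correlated, and combining them smooths out the instability of any single tree.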



In [38]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [40]:
# Fit the Random Forest Model
rf.fit(xtrain,ytrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [41]:
# Make predictions with the Random Forest Model
rf_model = rf.predict(xtest)

In [42]:
accuracy_score(ytest, rf_model)

0.7835497835497836

In [43]:
cohen_kappa_score(ytest, rf_model)

0.4687701223438506

In [54]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()

In [55]:
lg.fit(xtrain,ytrain)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [56]:
lg_model = lg.predict(xtest)

In [57]:
print(accuracy_score(ytest,lg_model))
print(cohen_kappa_score(ytest,lg_model))

0.7792207792207793
0.4560690705942102


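### Ensembling the Models with a Voting Classifier

This means combining several models into one: the ensemble is fitted on the training data, and its predictions come from the votes of the individual models.

This is done with the VotingClassifier in scikit-learn. The gb and bagg estimators passed to it are created in the gradient boosting and bagging cells (In [140] and In [141]) that appear later in this notebook.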
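With hard voting, the ensemble predicts the class that most of its member models predict. The following sketch, on synthetic data with three illustrative estimators (an assumption, chosen to avoid ties), checks that a manual majority vote over the fitted members matches `VotingClassifier.predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

voter = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("lg", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=10, random_state=0))],
    voting="hard",
)
voter.fit(X_demo, y_demo)

# Predictions of each fitted member model, one row per model
individual = np.array([est.predict(X_demo) for est in voter.estimators_])
# With three models and 0/1 labels, class 1 wins when at least two vote for it
manual_vote = (individual.sum(axis=0) >= 2).astype(int)
ensemble_pred = voter.predict(X_demo)

print(np.array_equal(manual_vote, ensemble_pred))
```

Setting `voting="soft"` instead averages the predicted class probabilities, which requires every member model to support `predict_proba`.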

In [150]:
from sklearn.ensemble import VotingClassifier
v = VotingClassifier(estimators=[("Tree", dtree), ("LG", lg), 
                                 ("RF", rf), ("GBM", gb),("Bag", bagg)])

In [151]:
v.fit(xtrain, ytrain)



VotingClassifier(estimators=[('Tree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_le...imators=10, n_jobs=None, oob_score=False, random_state=None,
         verbose=0, warm_start=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [152]:
vott_model = v.predict(xtest)

In [153]:
accuracy_score(ytest, vott_model)

0.8051948051948052

In [154]:
cohen_kappa_score(ytest, vott_model)

0.5343367826904986

In [78]:
# Apply SVM
from sklearn.svm import SVC
svm = SVC()

In [65]:
svm.fit(xtrain,ytrain)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [66]:
svm_model = svm.predict(xtest)

In [67]:
print(accuracy_score(ytest,svm_model))
print(cohen_kappa_score(ytest,svm_model))

0.7532467532467533
0.3920771965464702


In [140]:
# Import the boosting and bagging classifiers; create the AdaBoost and Gradient Boosting models
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
ada = AdaBoostClassifier()
gb = GradientBoostingClassifier()

In [89]:
ada.fit(xtrain,ytrain)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [90]:
ada_model = ada.predict(xtest)

In [91]:
print(accuracy_score(ytest,ada_model))
print(cohen_kappa_score(ytest,ada_model))

0.7489177489177489
0.3930415873878771


In [92]:
gb.fit(xtrain, ytrain)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [93]:
gb_model = gb.predict(xtest)

In [94]:
print(accuracy_score(ytest,gb_model))
print(cohen_kappa_score(ytest,gb_model))

0.7878787878787878
0.5003751931141028


In [141]:
bagg = BaggingClassifier()

In [142]:
bagg.fit(xtrain,ytrain)

BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=None, oob_score=False, random_state=None,
         verbose=0, warm_start=False)

In [143]:
bagg_model = bagg.predict(xtest)

In [144]:
print(accuracy_score(ytest,bagg_model))
print(cohen_kappa_score(ytest,bagg_model))

0.7705627705627706
0.451552210724365
