**ENSEMBLE TECHNIQUE**

What are ensemble methods? An ensemble method is a machine learning technique that combines several base models to produce one predictive model that performs better than its individual members. To better understand this definition, let's take a step back and consider the ultimate goal of machine learning and model building.

The most popular ensemble methods are boosting, bagging, and stacking. They apply to both regression and classification, where they can reduce bias, variance, or both, and so improve the accuracy of models.
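
To make the idea of combining base models concrete, here is a minimal sketch of the simplest combination rule, a majority vote. The labels and the three sets of predictions are made up for illustration, and the models are constructed so that they err on different samples, which is the favourable case for an ensemble.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-model label lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels and predictions (not taken from the diabetes data).
y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [0, 1, 1, 1, 0, 1, 0, 0]  # wrong on samples 0 and 1
model_b = [1, 0, 0, 0, 0, 1, 0, 0]  # wrong on samples 2 and 3
model_c = [1, 0, 1, 1, 1, 0, 0, 0]  # wrong on samples 4 and 5

ensemble = majority_vote(model_a, model_b, model_c)

for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("vote", ensemble)]:
    print(name, accuracy(y_true, preds))
```

Each model alone is right on 6 of 8 samples, while the vote is right on all 8 because no two models are wrong on the same sample. When base models make correlated errors the gain is smaller.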

AdaBoost is an ensemble machine learning algorithm for classification problems. It belongs to a group of ensemble methods called boosting, which add new models in sequence so that each subsequent model attempts to fix the prediction errors made by the prior models.

**PROBLEM STATEMENT**

Apply AdaBoost and Extreme Gradient Boosting (XGBoost) to the following diabetes dataset.

In [1]:
#Importing the required packages and libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

In [2]:
#loading the dataset and exploring the columns
data = pd.read_csv(r"C:\Users\D\Desktop\New Assignments  Keys\Datasets\Diabeted_Ensemble.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0    Number of times pregnant      768 non-null    int64  
 1    Plasma glucose concentration  768 non-null    int64  
 2    Diastolic blood pressure      768 non-null    int64  
 3    Triceps skin fold thickness   768 non-null    int64  
 4    2-Hour serum insulin          768 non-null    int64  
 5    Body mass index               768 non-null    float64
 6    Diabetes pedigree function    768 non-null    float64
 7    Age (years)                   768 non-null    int64  
 8    Class variable                768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


**DATA UNDERSTANDING**

The Pima Indian Diabetes Dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, contains information on 768 women from a population near Phoenix, Arizona, USA. The outcome tested was diabetes: 268 tested positive and 500 tested negative. There is therefore one target (dependent) variable, plus the following attributes (TYNECKI, 2018):

  Pregnancies (number of times pregnant),

  Oral glucose tolerance test - OGTT (two-hour plasma glucose concentration after 75 g anhydrous glucose, in mg/dl),

  Blood Pressure (Diastolic Blood Pressure in mmHg),

  Skin Thickness (Triceps skin fold thickness in mm),

  Insulin (2 h serum insulin in µU/ml),

  BMI (Body Mass Index in kg/m²),

  Age (years),

  Pedigree Diabetes Function ('function that represents how likely they are to get the disease by extrapolating from their ancestor’s history')

**DATA PRE PROCESSING**

There are no null values in the dataset.

All the independent variable columns have numerical values.

The target (output) column is categorical, so we converted it to numbers using label encoding.

In [3]:
# using label encoder to convert the categorical(Class Variable) column to numerical
lb = LabelEncoder()
data[" Class variable"] = lb.fit_transform(data[" Class variable"])
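
To show what the label encoding step does, the sketch below reproduces the mapping in plain Python: `LabelEncoder` assigns integer codes to the class labels in sorted order. The label strings `"YES"` and `"NO"` are an assumption made for illustration, since the actual values of the class column are not shown above.

```python
# Hypothetical class labels; the real values in " Class variable" may differ.
labels = ["YES", "NO", "NO", "YES", "NO"]

# Like LabelEncoder, sort the distinct classes and number them from 0.
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(classes)   # ['NO', 'YES']
print(mapping)   # {'NO': 0, 'YES': 1}
print(encoded)   # [1, 0, 0, 1, 0]
```

Under this assumption the positive class would be encoded as 1, which matters later when reading the rows and columns of the confusion matrix.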

**AdaBoost**   
AdaBoost, or Adaptive Boosting, was the first practical boosting ensemble algorithm. The method adapts to the data based on the actual performance in the current iteration: both the weights used to re-weight the training samples and the weights used for the final aggregation are recomputed at every iteration.

In practice, this boosting technique is used with simple classification trees or stumps as base learners, which results in improved performance compared with classification by one tree or another single base learner.
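
The re-weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration of discrete AdaBoost with decision stumps on a made-up one-dimensional dataset, not the implementation inside `AdaBoostClassifier`. The helper names (`fit_stump`, `adaboost_fit`, and so on) are hypothetical, and the labels are coded as +1 and -1.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity

def adaboost_fit(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-10)         # guard against division by zero
        if err >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this stump in the vote
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost_fit(X, y, n_rounds=1)
boosted = adaboost_fit(X, y, n_rounds=3)
single_acc = sum(adaboost_predict(single, r) == t for r, t in zip(X, y)) / len(y)
boosted_acc = sum(adaboost_predict(boosted, r) == t for r, t in zip(X, y)) / len(y)
print(single_acc, boosted_acc)
```

On this toy data one stump classifies 7 of 10 points correctly, while the weighted vote of three stumps classifies all 10, because each round concentrates on the points the previous stumps got wrong.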

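**Gradient Boosting**

Gradient boosting is a robust machine learning algorithm that combines gradient descent with boosting. The word ‘gradient’ refers to the gradient of the loss function: each new weak learner is fitted to the negative gradient of the loss with respect to the current ensemble's predictions, which for squared error is simply the residual. Gradient boosting has three main components: an additive model, a loss function, and a weak learner.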

The technique yields a direct interpretation of boosting methods from the perspective of numerical optimisation in a function space and generalises them by allowing optimisation of an arbitrary loss function.
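
As a rough illustration of the three components (additive model, loss function, weak learner), the sketch below runs gradient boosting for regression with squared loss, where the negative gradient is the residual. It is a simplified plain-Python example on made-up data with hypothetical helper names. XGBoost itself is considerably more elaborate (it adds regularisation and uses second-order gradient information), so this should not be read as its implementation.

```python
def fit_regression_stump(x, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best = None
    for thr in sorted(set(x))[:-1]:        # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def gradient_boost_fit(x, y, n_rounds, learning_rate=0.3):
    base = sum(y) / len(y)                 # initial model: the mean of the targets
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual y - F.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        thr, left_val, right_val = fit_regression_stump(x, residuals)
        stumps.append((thr, left_val, right_val))
        # Additive update, shrunk by the learning rate.
        preds = [pi + learning_rate * (left_val if xi <= thr else right_val)
                 for xi, pi in zip(x, preds)]
    return base, stumps, learning_rate

def gradient_boost_predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(l if xi <= thr else r for thr, l, r in stumps)

def mse(model, x, y):
    return sum((yi - gradient_boost_predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy regression data: y = x squared.
x = list(range(10))
y = [xi ** 2 for xi in x]

mse_0 = mse(gradient_boost_fit(x, y, n_rounds=0), x, y)
mse_1 = mse(gradient_boost_fit(x, y, n_rounds=1), x, y)
mse_20 = mse(gradient_boost_fit(x, y, n_rounds=20), x, y)
print(mse_0, mse_1, mse_20)
```

The training error shrinks as stumps are added, since every round takes a small step that reduces the squared loss. Swapping in a different loss function only changes how the `residuals` line is computed.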



**Model Building**

Split the data into predictors and target.

Then split the data into train and test datasets.

Applied the AdaBoost and XGBoost classification models.

Created the confusion matrix.

Calculated the accuracy score on the test and train datasets.

In [4]:
#Splitting the dataset to predictors(input columns) and the target(output)
predictors = data.loc[:, data.columns!=" Class variable"]
type(predictors)

pandas.core.frame.DataFrame

In [5]:
target = data[" Class variable"]
type(target)

pandas.core.series.Series

In [6]:
# splitting the predictors and target to train and test dataset
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.2, random_state=0)
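
As a quick check on the split sizes, the arithmetic below works out how many rows land in each set for `test_size = 0.2`. It assumes, as scikit-learn does for a fractional `test_size`, that the test count is rounded up and the remainder goes to training.

```python
import math

n_rows = 768          # rows reported by data.info()
test_size = 0.2

n_test = math.ceil(test_size * n_rows)   # fractional test size is rounded up
n_train = n_rows - n_test

print(n_train, n_test)   # 614 154
```

The 154 test rows match the total of the entries in each confusion matrix in this notebook.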

In [7]:
# Applying Adaboost Classification Technique
ada_clf = AdaBoostClassifier(learning_rate = 0.02, n_estimators = 5000)

In [8]:
#fitting the model and train the data
ada_clf.fit(x_train, y_train)

AdaBoostClassifier(learning_rate=0.02, n_estimators=5000)

In [9]:
#creating confusion matrix for the target 
confusion_matrix(y_test, ada_clf.predict(x_test))

array([[91, 16],
       [15, 32]], dtype=int64)

In [10]:
#calculating the accuracy on the test set
accuracy_score(y_test, ada_clf.predict(x_test))

0.7987012987012987
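
The accuracy above can be recovered directly from the AdaBoost confusion matrix, along with precision and recall, which say more about the positive class than accuracy alone. The sketch assumes that class 1 is the diabetic (positive) class, which depends on how the label encoder ordered the classes. Rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
# AdaBoost confusion matrix on the test set, copied from the output above.
cm = [[91, 16],
      [15, 32]]

(tn, fp), (fn, tp) = cm          # assumes class 1 is the positive class
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)       # of the predicted positives, how many are right
recall = tp / (tp + fn)          # of the actual positives, how many are found
baseline = (tn + fp) / total     # accuracy of always predicting class 0

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(baseline, 4))
```

Under that assumption the model finds about 68% of the positive cases, and always predicting the majority class would already score about 69% accuracy, so the headline accuracy is best read against that baseline.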

In [11]:
#calculating the accuracy on the training set
accuracy_score(y_train, ada_clf.predict(x_train))

0.8420195439739414

In [12]:
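#Applying XGBoost Classification Technique
#Note: "max_depths" is a misspelling of "max_depth"; XGBoost ignores it (see the
#warning in the output below) and keeps the default max_depth=6
xgb_clf = xgb.XGBClassifier(max_depths = 5, n_estimators = 10000, learning_rate = 0.3, n_jobs = -1)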

In [13]:
#fitting the model and train the data
xgb_clf.fit(x_train, y_train)



Parameters: { "max_depths" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.3, max_delta_step=0,
              max_depth=6, max_depths=5, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=10000, n_jobs=-1,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [14]:
#creating confusion matrix for the target
confusion_matrix(y_test, xgb_clf.predict(x_test))

array([[84, 23],
       [12, 35]], dtype=int64)

In [15]:
#calculating the accuracy on the test set
accuracy_score(y_test, xgb_clf.predict(x_test))

0.7727272727272727

In [16]:
#calculating the accuracy on the training set
accuracy_score(y_train,xgb_clf.predict(x_train))

1.0

**Results**

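The AdaBoost classification model correctly classifies about 79.9 percent of the patients in the test set (84.2 percent on the training set).

The XGBoost classification model correctly classifies about 77.3 percent of the patients in the test set, while scoring 100 percent on the training set, which suggests that it overfits the training data.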

**Hyperparameter Tuning**

In [17]:
#Applying XGBoost again with gamma and min_child_weight set by hand
#("max_depths" is still the ignored misspelling of "max_depth")
xgb1 = xgb.XGBClassifier(max_depths = 5, n_estimators = 10000, learning_rate = 0.3, n_jobs = -1,gamma = 5,min_child_weight=0)

In [18]:
#fitting the model and train the data
xgb1.fit(x_train, y_train)



Parameters: { "max_depths" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=5, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.3, max_delta_step=0,
              max_depth=6, max_depths=5, min_child_weight=0, missing=nan,
              monotone_constraints='()', n_estimators=10000, n_jobs=-1,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [19]:
#creating confusion matrix for the target
confusion_matrix(y_test, xgb1.predict(x_test))

array([[97, 10],
       [16, 31]], dtype=int64)

In [20]:
#calculating the accuracy on the test set
accuracy_score(y_test, xgb1.predict(x_test))

0.8311688311688312

In [21]:
#calculating the accuracy on the training set
accuracy_score(y_train,xgb1.predict(x_train))

0.8517915309446255
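
The tuning above changes two parameters by hand and judges the result on the test set. A common alternative is to search a small grid with cross-validation on the training data only and keep the test set for a final check. The sketch below shows one way this could look with scikit-learn's `GridSearchCV`. It uses synthetic data from `make_classification` as a stand-in for the diabetes table, an `AdaBoostClassifier` as the model, and an arbitrary grid, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in (assumption): 768 rows and 8 features, like the diabetes table.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Arbitrary illustrative grid; real searches usually cover more values.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid,
    cv=5,                  # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))              # mean cross-validated accuracy
print(round(search.score(X_test, y_test), 3))    # accuracy on the held-out test set
```

The same pattern should apply to `xgb.XGBClassifier`, since it follows the scikit-learn estimator interface, with parameters such as `gamma`, `min_child_weight`, and `max_depth` in the grid.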