# Classification - Adaptive Boosting

## Why use Adaptive Boosting?

- Boosting Technique: AdaBoost builds a strong classifier by combining weak classifiers that are trained sequentially. After each round, the algorithm increases the weights of the misclassified data points, so the next weak classifier focuses on the difficult-to-classify instances.

- Handling Imbalanced Data: If the diabetes dataset is imbalanced (i.e., one class is more prevalent than the other), AdaBoost's reweighting gives more attention to misclassified instances, which often include minority-class examples. The reweighting does not rebalance the classes by itself, so this notebook also oversamples the training set with SMOTE.

- Robustness to Overfitting: With simple weak learners, AdaBoost often generalizes well in practice, because each round concentrates on the instances that earlier rounds misclassified. It is, however, sensitive to noisy labels and outliers, since those points keep gaining weight, so it is not guaranteed to overfit less than other ensemble methods such as Random Forest.

- Simple Weak Learners: AdaBoost works well with simple weak learners (e.g., decision stumps, which are decision trees of depth one). These are computationally cheap and individually unlikely to overfit. This simplicity can be advantageous for the diabetes dataset, especially if the dataset is not extremely complex.

- Predictive Performance: AdaBoost often achieves strong predictive performance because each weak learner is fitted to a differently weighted version of the training data. The resulting diversity among weak learners can help capture different aspects of the diabetes dataset and improve classification accuracy.

- Feature Importance: Although less direct to interpret than in a single decision tree, AdaBoost provides feature importance information. With tree-based weak learners, scikit-learn's AdaBoostClassifier exposes feature_importances_, an average of the weak learners' feature importances weighted by each learner's weight in the ensemble.
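
The reweighting idea described in the bullets above can be sketched in a few lines. This is a minimal illustration of classical discrete AdaBoost with decision stumps on toy data, assuming labels encoded as -1/+1; it is not scikit-learn's internal implementation, and the helper names `adaboost_fit` and `adaboost_predict` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost for labels in {-1, +1}, using depth-1 trees (stumps)."""
    n_samples = len(y)
    weights = np.full(n_samples, 1.0 / n_samples)  # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=weights)
        miss = stump.predict(X) != y
        # Weighted error, clipped to keep the logarithm finite.
        err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
        # Misclassified points are scaled up, correctly classified ones down.
        weights *= np.exp(np.where(miss, alpha, -alpha))
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(stumps, alphas, X):
    """Weighted vote of all stumps; the sign of the sum is the predicted label."""
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)


# Toy data with a diagonal boundary that a single stump cannot represent.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

stumps, alphas = adaboost_fit(X_toy, y_toy, n_rounds=20)
train_acc = np.mean(adaboost_predict(stumps, alphas, X_toy) == y_toy)
print(f"Training accuracy after 20 rounds: {train_acc:.3f}")
```

Each stump alone can only split on one feature, but the weighted vote of several stumps approximates the diagonal boundary.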

In [1]:
# Import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from imblearn.over_sampling import SMOTE

## Normalised - Without Outliers

In [3]:
df_no_outlier = pd.read_csv('../Final_Data_Set/Original Dataset without Outliers Normalized.csv')
df_no_outlier

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes,smoking_history_encoded,gender_encoded
0,1.713008,0,1,-0.286437,1.134061,0.126046,0,-0.633042,-0.841116
1,0.560337,0,0,0.072849,1.134061,-1.523079,0,-0.633042,-0.841116
2,-0.592335,0,0,0.072849,0.232946,0.620784,0,-0.633042,1.188683
3,-0.237667,0,0,-0.579938,-0.467921,0.538328,0,1.579675,-0.841116
4,1.535674,1,1,-1.138266,-0.668169,0.538328,0,1.579675,1.188683
...,...,...,...,...,...,...,...,...,...
96303,1.713008,0,0,0.072849,0.733566,-1.248225,0,-0.633042,-0.841116
96304,-1.745006,0,0,-1.605507,1.033937,-0.973371,0,-0.633042,-0.841116
96305,1.092339,0,0,0.158875,0.232946,0.538328,0,1.579675,1.188683
96306,-0.769669,0,0,1.439149,-1.469161,-0.973371,0,-0.633042,-0.841116


In [11]:
# Split the dataset into features and target variable
X = df_no_outlier.drop(columns=['diabetes'])
y = df_no_outlier['diabetes']
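
The preview above shows an `Unnamed: 0` column, which typically appears when a CSV was saved together with its index. Because `X` keeps every column except `diabetes`, that column is passed to the model as a feature. If that is not intended, one of the two options sketched below could be used. The three-row CSV is a hypothetical stand-in, not the actual dataset, and the results reported in this notebook were produced with the column included.

```python
import io

import pandas as pd

# Hypothetical three-row CSV mimicking a file that was saved with its index.
csv_text = """,age,bmi,diabetes
0,80.0,25.19,0
1,54.0,27.32,0
2,28.0,27.32,1
"""

df_raw = pd.read_csv(io.StringIO(csv_text))
print(df_raw.columns.tolist())  # the saved index comes back as 'Unnamed: 0'

# Option 1: read the first column as the index so it never becomes a feature.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the index column explicitly together with the target.
X_demo = df_raw.drop(columns=["Unnamed: 0", "diabetes"])
y_demo = df_raw["diabetes"]
print(X_demo.columns.tolist())
```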

In [12]:
# Split the data into training and testing sets
# Note: this run holds out 50% for testing, while the later runs hold out 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [13]:
# Use SMOTE for oversampling
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training dataset
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [14]:
# Train an Adaptive Boosting classifier
# n_estimators, default=50
# learning_rate, default=1.0
ada_classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada_classifier.fit(X_train_resampled, y_train_resampled)
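
Once an `AdaBoostClassifier` is fitted, its `feature_importances_` attribute can be inspected to see which features the weak learners rely on. The sketch below uses synthetic stand-in data from `make_classification` rather than the diabetes dataset, and the feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data; shuffle=False keeps the two informative
# features in the first two columns.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # placeholders

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
model.fit(X_demo, y_demo)

# One non-negative score per feature; higher means more influence on the splits.
importances = model.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With the notebook's own model, the same loop could be run over `X_train_resampled.columns` and `ada_classifier.feature_importances_`.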



In [15]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)
roc_auc = roc_auc_score(y_test, y_pred_ada)

In [16]:
print("Adaptive Boosting Evaluation Report (Normalised w/o outlier):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC AUC:", roc_auc)

Adaptive Boosting Evaluation Report (Normalised w/o outlier):
Accuracy: 0.9178469078373551
Precision: 0.40579458709229704
Recall: 0.8149825783972126
F1-score: 0.5418114431318045
ROC AUC: 0.8696744002312005
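
The ROC AUC above is computed from the output of `predict`, i.e. hard 0/1 labels. With hard labels the ROC curve has a single operating point, so the score reduces to the mean of sensitivity and specificity. A threshold-independent AUC would use continuous scores such as `predict_proba`. The sketch below illustrates the difference on synthetic imbalanced data, not on the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

y_hard = model.predict(X_te)               # 0/1 labels at a single threshold
y_score = model.predict_proba(X_te)[:, 1]  # continuous score for class 1

auc_from_labels = roc_auc_score(y_te, y_hard)
auc_from_scores = roc_auc_score(y_te, y_score)

# Sensitivity is recall on class 1, specificity is recall on class 0.
sensitivity = recall_score(y_te, y_hard, pos_label=1)
specificity = recall_score(y_te, y_hard, pos_label=0)
balanced = (sensitivity + specificity) / 2

print("AUC from hard labels:        ", auc_from_labels)
print("Mean of sens. and spec.:     ", balanced)
print("AUC from predicted scores:   ", auc_from_scores)
```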


## Normalised - With Outliers

In [17]:
df_outlier = pd.read_csv('../Final_Data_Set/Original Dataset with Outliers Included Normalized.csv')
df_outlier

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes,smoking_history_encoded,gender_encoded
0,1.692577,0,1,-0.321051,1.001692,0.047709,0,-0.640425,-0.841175
1,0.537899,0,0,-0.000114,1.001692,-1.426157,0,-0.640425,-0.841175
2,-0.616779,0,0,-0.000114,0.161089,0.489869,0,-0.640425,1.188813
3,-0.261494,0,0,-0.583225,-0.492714,0.416175,0,1.561464,-0.841175
4,1.514935,1,1,-1.081957,-0.679515,0.416175,0,1.561464,1.188813
...,...,...,...,...,...,...,...,...,...
99977,1.692577,0,0,-0.000114,0.628091,-1.180513,0,-0.640425,-0.841175
99978,-1.771458,0,0,-1.499326,0.908292,-0.934869,0,-0.640425,-0.841175
99979,1.070828,0,0,0.076730,0.161089,0.416175,0,1.561464,1.188813
99980,-0.794422,0,0,1.220350,-1.426718,-0.934869,0,-0.640425,-0.841175


In [18]:
# Split the dataset into features and target variable
X = df_outlier.drop(columns=['diabetes'])
y = df_outlier['diabetes']

In [19]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
# Use SMOTE for oversampling
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training dataset
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [21]:
# Train an Adaptive Boosting classifier
# n_estimators, default=50
# learning_rate, default=1.0
ada_classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada_classifier.fit(X_train_resampled, y_train_resampled)



In [22]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)
roc_auc = roc_auc_score(y_test, y_pred_ada)


In [23]:
print("Adaptive Boosting Evaluation Report (Normalised with outliers):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC AUC:", roc_auc)

Adaptive Boosting Evaluation Report (Normalised with outliers):
Accuracy: 0.9313897084562685
Precision: 0.5772919064058305
Recall: 0.847887323943662
F1-score: 0.6869009584664537
ROC AUC: 0.8937054883355672
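
As a quick consistency check, the F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The values below are copied from the report above.

```python
# Values copied from the report above (normalised data, outliers included).
precision = 0.5772919064058305
recall = 0.847887323943662
reported_f1 = 0.6869009584664537

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed: {f1:.6f}")
print(f"F1 reported:   {reported_f1:.6f}")
```

Because the harmonic mean is pulled towards the smaller value, the high recall cannot fully compensate for the lower precision.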


## Not Normalised - Without Outliers

In [25]:
df_without_outlier_notnorm = pd.read_csv('../Final_Data_Set/Original Dataset without Outliers.csv')
df_without_outlier_notnorm

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes,smoking_history_encoded,gender_encoded
0,80.0,0,1,25.19,6.6,140,0,-0.247356,-0.128959
1,54.0,0,0,27.32,6.6,80,0,-0.247356,-0.128959
2,28.0,0,0,27.32,5.7,158,0,-0.247356,0.160772
3,36.0,0,0,23.45,5.0,155,0,0.452953,-0.128959
4,76.0,1,1,20.14,4.8,155,0,0.452953,0.160772
...,...,...,...,...,...,...,...,...,...
96303,80.0,0,0,27.32,6.2,90,0,-0.247356,-0.128959
96304,2.0,0,0,17.37,6.5,100,0,-0.247356,-0.128959
96305,66.0,0,0,27.83,5.7,155,0,0.452953,0.160772
96306,24.0,0,0,35.42,4.0,100,0,-0.247356,-0.128959


In [26]:
# Split the dataset into features and target variable
X = df_without_outlier_notnorm.drop(columns=['diabetes'])
y = df_without_outlier_notnorm['diabetes']

In [27]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [28]:
# Use SMOTE for oversampling
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training dataset
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [29]:
# Train an Adaptive Boosting classifier
# n_estimators, default=50
# learning_rate, default=1.0
ada_classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada_classifier.fit(X_train_resampled, y_train_resampled)



In [30]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)
roc_auc = roc_auc_score(y_test, y_pred_ada)

In [31]:
print("Adaptive Boosting Evaluation Report (Not Normalised w/o outlier):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC AUC:", roc_auc)

Adaptive Boosting Evaluation Report (Not Normalised w/o outlier):
Accuracy: 0.9584155331741252
Precision: 0.6522911051212938
Recall: 0.6368421052631579
F1-score: 0.644474034620506
ROC AUC: 0.8077434232308505


## Not Normalised - With Outliers

In [33]:
df_outlier_notnorm = pd.read_csv('../Final_Data_Set/Original Dataset with Outliers Included.csv')
df_outlier_notnorm

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes,smoking_history_encoded,gender_encoded
0,80.0,0,1,25.19,6.6,140,0,-0.246527,-0.119227
1,54.0,0,0,27.32,6.6,80,0,-0.246527,-0.119227
2,28.0,0,0,27.32,5.7,158,0,-0.246527,0.150651
3,36.0,0,0,23.45,5.0,155,0,0.450465,-0.119227
4,76.0,1,1,20.14,4.8,155,0,0.450465,0.150651
...,...,...,...,...,...,...,...,...,...
99977,80.0,0,0,27.32,6.2,90,0,-0.246527,-0.119227
99978,2.0,0,0,17.37,6.5,100,0,-0.246527,-0.119227
99979,66.0,0,0,27.83,5.7,155,0,0.450465,0.150651
99980,24.0,0,0,35.42,4.0,100,0,-0.246527,-0.119227


In [34]:
# Split the dataset into features and target variable
X = df_outlier_notnorm.drop(columns=['diabetes'])
y = df_outlier_notnorm['diabetes']

In [35]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [36]:
# Use SMOTE for oversampling
# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to the training dataset
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


In [37]:
# Train an Adaptive Boosting classifier
# n_estimators, default=50
# learning_rate, default=1.0
ada_classifier = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada_classifier.fit(X_train_resampled, y_train_resampled)



In [38]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)
roc_auc = roc_auc_score(y_test, y_pred_ada)

In [39]:
print("Adaptive Boosting Evaluation Report (Not Normalised with outliers):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC AUC:", roc_auc)

Adaptive Boosting Evaluation Report (Not Normalised with outliers):
Accuracy: 0.9588938340751113
Precision: 0.770892552586697
Recall: 0.763943661971831
F1-score: 0.767402376910017
ROC AUC: 0.8709137693022364
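
To compare the four runs side by side, the reported metrics can be collected in one place. The sketch below copies the values from the reports above, rounded to four decimals. The first run used a 50% test split while the others used 20%, so the comparison is indicative only.

```python
# Metrics copied from the four evaluation reports, rounded to four decimals.
results = {
    "Normalised w/o outlier": {
        "accuracy": 0.9178, "precision": 0.4058, "recall": 0.8150,
        "f1": 0.5418, "roc_auc": 0.8697, "test_size": 0.5,
    },
    "Normalised with outliers": {
        "accuracy": 0.9314, "precision": 0.5773, "recall": 0.8479,
        "f1": 0.6869, "roc_auc": 0.8937, "test_size": 0.2,
    },
    "Not Normalised w/o outlier": {
        "accuracy": 0.9584, "precision": 0.6523, "recall": 0.6368,
        "f1": 0.6445, "roc_auc": 0.8077, "test_size": 0.2,
    },
    "Not Normalised with outliers": {
        "accuracy": 0.9589, "precision": 0.7709, "recall": 0.7639,
        "f1": 0.7674, "roc_auc": 0.8709, "test_size": 0.2,
    },
}

# Rank the runs by F1-score, highest first.
ranked = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)

print(f"{'Run':<30}{'Precision':>10}{'Recall':>10}{'F1':>10}")
for name, metrics in ranked:
    print(f"{name:<30}{metrics['precision']:>10.4f}"
          f"{metrics['recall']:>10.4f}{metrics['f1']:>10.4f}")

best_f1 = ranked[0][0]
best_recall = max(results, key=lambda name: results[name]["recall"])
```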
