# Classification - Adaptive Boosting

Why use Adaptive Boosting?

- Boosting Technique: AdaBoost combines multiple weak classifiers into a strong classifier. The weak classifiers are trained sequentially; after each iteration the algorithm increases the weights of the misclassified data points, so the next classifier focuses on the difficult-to-classify instances.

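- Handling Imbalanced Data: If the diabetes dataset is imbalanced (i.e., one class is more prevalent than the other), AdaBoost's reweighting of misclassified instances can direct more attention to the minority class when those are the instances being misclassified. It is not a dedicated imbalance technique, and the datasets used below have already been resampled with SMOTE, as their file names indicate.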

- Robustness to Overfitting: With simple weak learners, AdaBoost is often fairly resistant to overfitting in practice. However, because it keeps increasing the weights of misclassified instances, it can be sensitive to noisy labels and outliers, so generalisation should be checked on held-out data rather than assumed.

- Simple Weak Learners: AdaBoost works well with simple weak learners such as decision stumps (decision trees of depth one), which are computationally cheap and individually unlikely to overfit. scikit-learn's `AdaBoostClassifier` uses a depth-one decision tree by default. This simplicity can be advantageous for the diabetes dataset, especially if the dataset is not extremely complex.

- Predictive Performance: AdaBoost often achieves strong predictive performance. Because each weak learner is trained on a differently weighted version of the data, the learners can capture different aspects of the diabetes dataset, which can improve classification accuracy.

- Feature Importance: With tree-based weak learners, AdaBoost provides information about feature importance. In scikit-learn this is exposed through the fitted classifier's `feature_importances_` attribute, which combines the importances of the individual weak learners, weighted by each learner's weight in the ensemble.
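The reweighting described in the first point can be sketched in plain Python. This is a minimal illustration on a made-up one-dimensional dataset with threshold stumps as weak learners. The data, the helper names (`stump_predict`, `best_stump`, `adaboost`, `predict`) and the number of rounds are assumptions for illustration; it follows the textbook discrete AdaBoost update and is not scikit-learn's implementation.

```python
import math

# Made-up toy data: one feature, labels in {-1, +1}.
# No single threshold separates the two classes.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1, 1, -1, -1, 1, 1]


def stump_predict(threshold, polarity, x):
    """Decision stump: `polarity` below the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity


def best_stump(X, y, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in [x + 0.5 for x in [0.0] + X]:
        for polarity in (1, -1):
            err = sum(w for x, t, w in zip(X, y, weights)
                      if stump_predict(threshold, polarity, x) != t)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(X, y, n_rounds):
    """Discrete AdaBoost; returns the ensemble and the weights after each round."""
    weights = [1.0 / len(X)] * len(X)
    ensemble, history = [], [weights[:]]
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this learner's vote
        # Misclassified points (t * prediction == -1) are scaled up by
        # exp(alpha); correctly classified points are scaled down.
        weights = [w * math.exp(-alpha * t * stump_predict(threshold, polarity, x))
                   for x, t, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, threshold, polarity))
        history.append(weights[:])
    return ensemble, history


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(th, pol, x) for a, th, pol in ensemble)
    return 1 if score >= 0 else -1


ensemble, history = adaboost(X, y, n_rounds=3)
print("weights after round 1:", [round(w, 3) for w in history[1]])
print("predictions:", [predict(ensemble, x) for x in X])
```

After the first round the two misclassified points carry a weight of 0.25 each, against 0.125 for the others, which is what pushes the second stump towards them.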

In [27]:
# Import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Normalised - Without Outliers

In [28]:
# We will be using the new_df_without_outliers_copy_smote_resampled.xlsx
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
df_without_outlier = pd.read_excel(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_without_outliers_copy_smote_resampled.xlsx')
df_without_outlier
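The path above mixes single and doubled backslashes because, in a normal Python string, `\n` would be read as a newline. A minimal sketch of two ways to avoid that, using only the standard library: a raw string, or `pathlib`. The shortened file name `new_df.xlsx` is hypothetical; the directory and the full file name are the ones used in this notebook.

```python
from pathlib import PureWindowsPath

# A raw string keeps every backslash literally, so no doubling is needed.
escaped = 'Data_Set\\new_df.xlsx'   # hypothetical, shortened file name
raw = r'Data_Set\new_df.xlsx'
print(escaped == raw)

# pathlib builds the path from its parts; PureWindowsPath is used here so the
# sketch behaves the same on any operating system.
data_dir = PureWindowsPath(r'C:\wamp64\www\IS424-Data-Mining\Data_Set')
path = data_dir / 'new_df_without_outliers_copy_smote_resampled.xlsx'
print(path)
```

On the machine that holds the files, `pathlib.Path` could be used in the same way and the result passed directly to `pd.read_excel`.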

In [None]:
# Split the dataset into features and target variable
X = df_without_outlier.drop(columns=['diabetes'])
y = df_without_outlier['diabetes']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
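The split above draws rows at random. As an optional variation that is not used in this notebook, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in the training and test sets. The sketch below uses made-up labels (80 negatives, 20 positives) purely to show the effect.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positives.
X_demo = [[i] for i in range(100)]
y_demo = [0] * 80 + [1] * 20

# stratify=y_demo preserves the 80/20 class ratio in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train:", Counter(yd_train))
print("test: ", Counter(yd_test))
```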

In [None]:
# Train an Adaptive Boosting (AdaBoost) classifier
ada_classifier = AdaBoostClassifier(random_state=42)
ada_classifier.fit(X_train, y_train)

AdaBoostClassifier(random_state=42)
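Once the classifier is fitted, the feature importance mentioned in the introduction can be read from its `feature_importances_` attribute. The sketch below uses synthetic data from `make_classification` as a stand-in, because the diabetes files are not available here; the feature names `f0` to `f4` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data: 5 features, 2 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

demo_clf = AdaBoostClassifier(random_state=42).fit(X_demo, y_demo)

# One importance value per feature, listed from most to least important.
importances = demo_clf.feature_importances_
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {value:.3f}")
```

On the notebook's data, the same attribute could be read from `ada_classifier` and paired with `X_train.columns`.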

In [None]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)

In [None]:
print("Adaptive Boosting Evaluation Report (Normalised w/o outlier):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Adaptive Boosting Evaluation Report (Normalised w/o outlier):
Accuracy: 0.9589279306632807
Precision: 0.9622232113052186
Recall: 0.9553137428192665
F1-score: 0.9587560286046898
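For reference, the four metrics reported above are all derived from the counts of true and false positives and negatives. The following is a minimal pure-Python sketch on made-up labels; the helper `binary_metrics` is hypothetical and only mirrors the standard definitions. The last lines check that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up example: 3 true positives, 1 false negative, 1 false positive, 5 true negatives.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
demo_metrics = binary_metrics(y_true_demo, y_pred_demo)
print(demo_metrics)

# Consistency check on the values printed above.
reported_precision = 0.9622232113052186
reported_recall = 0.9553137428192665
f1_from_report = (2 * reported_precision * reported_recall
                  / (reported_precision + reported_recall))
print(f1_from_report)
```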


Normalised - Outliers Only

In [None]:
# We will be using the new_df_outliers_only_copy_smote_resampled.xlsx
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
df_outlier = pd.read_excel(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_outliers_only_copy_smote_resampled.xlsx')
df_outlier

| Unnamed: 0 | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | smoking_history_encoded | gender_encoded | diabetes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.174516 | 0 | 0 | 0.930849 | -0.357602 | -1.154539 | 1.806645 | -0.777719 | 0 |
| 1 | -1.018012 | 0 | 0 | 1.073804 | -0.233789 | 0.206884 | -0.280453 | -0.777719 | 0 |
| 2 | 0.949660 | 0 | 0 | 1.656370 | 1.375782 | -0.405756 | -0.280453 | -0.777719 | 1 |
| 3 | -0.779506 | 0 | 0 | 1.006045 | -0.048069 | -0.746112 | -0.280453 | 1.417943 | 0 |
| 4 | -1.256518 | 0 | 0 | 0.553214 | -0.357602 | -1.154539 | -1.553924 | -0.777719 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6163 | 0.449954 | 0 | 0 | 0.397771 | -0.410728 | -0.364914 | -0.280453 | -0.777719 | 0 |
| 6164 | -0.874241 | 0 | 0 | 0.341732 | -1.100480 | -0.786955 | -1.030285 | -0.777719 | 0 |
| 6165 | -2.108582 | 0 | 0 | 0.585866 | -0.225281 | -1.399595 | -1.009532 | -0.777719 | 0 |
| 6166 | -0.640560 | 0 | 0 | 0.629500 | -0.108843 | -0.405756 | -1.133316 | -0.052526 | 0 |


In [None]:
# Split the dataset into features and target variable
X = df_outlier.drop(columns=['diabetes'])
y = df_outlier['diabetes']
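The preview of the data shows an `Unnamed: 0` column, which pandas typically creates when a frame was saved together with its index and read back without `index_col`. With `drop(columns=['diabetes'])` alone, that column stays in `X` and is used as a feature. Whether to keep it is a modelling decision; if it should be excluded, one possible sketch is shown below on a tiny made-up frame that mimics the layout. The results reported in this notebook were produced with the code as written above.

```python
import pandas as pd

# Tiny made-up frame with the same kind of columns as the preview.
demo_df = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3],
    'age': [0.17, -1.02, 0.95, -0.78],
    'bmi': [0.93, 1.07, 1.66, 1.01],
    'diabetes': [0, 0, 1, 0],
})

# Drop the target and the saved row index; errors='ignore' keeps the call
# safe for files that have no 'Unnamed: 0' column.
X_demo = demo_df.drop(columns=['diabetes', 'Unnamed: 0'], errors='ignore')
y_demo = demo_df['diabetes']

print(list(X_demo.columns))
```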

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train an Adaptive Boosting (AdaBoost) classifier
ada_classifier = AdaBoostClassifier(random_state=42)
ada_classifier.fit(X_train, y_train)

AdaBoostClassifier(random_state=42)

In [None]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)

In [None]:
print("Adaptive Boosting Evaluation Report (Normalised outliers only):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Adaptive Boosting Evaluation Report (Normalised outliers only):
Accuracy: 0.9821717990275527
Precision: 0.9918433931484503
Recall: 0.9728
F1-score: 0.9822294022617124


Not Normalised - Without Outliers

In [None]:
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
df_without_outlier_notnorm = pd.read_csv(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_without_outliers_copy_smote_resampled_noNormalised.csv')
df_without_outlier_notnorm

| Unnamed: 0 | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | smoking_history_encoded | gender_encoded | diabetes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 80.0 | 0 | 1 | 25.190000 | 6.600000 | 140 | 0.131757 | -0.128959 | 0 |
| 1 | 54.0 | 0 | 0 | 27.320000 | 6.600000 | 80 | -0.797024 | -0.128959 | 0 |
| 2 | 28.0 | 0 | 0 | 27.320000 | 5.700000 | 158 | 0.131757 | 0.160772 | 0 |
| 3 | 36.0 | 0 | 0 | 23.450000 | 5.000000 | 155 | 0.165669 | -0.128959 | 0 |
| 4 | 76.0 | 1 | 1 | 20.140000 | 4.800000 | 155 | 0.165669 | 0.160772 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 181139 | 80.0 | 0 | 0 | 27.191924 | 6.949298 | 126 | 0.554826 | 0.160772 | 1 |
| 181140 | 80.0 | 0 | 0 | 27.320000 | 5.800000 | 155 | 0.313504 | -0.052799 | 1 |
| 181141 | 35.0 | 0 | 0 | 27.320357 | 6.995535 | 159 | 0.383605 | 0.160772 | 1 |
| 181142 | 58.0 | 0 | 0 | 27.320000 | 6.017092 | 155 | -0.797024 | 0.160772 | 1 |


In [None]:
# Split the dataset into features and target variable
X = df_without_outlier_notnorm.drop(columns=['diabetes'])
y = df_without_outlier_notnorm['diabetes']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train an Adaptive Boosting (AdaBoost) classifier
ada_classifier = AdaBoostClassifier(random_state=42)
ada_classifier.fit(X_train, y_train)

AdaBoostClassifier(random_state=42)

In [None]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)

In [None]:
print("Adaptive Boosting Evaluation Report (Not Normalised w/o outlier):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Adaptive Boosting Evaluation Report (Not Normalised w/o outlier):
Accuracy: 0.9589279306632807
Precision: 0.9622232113052186
Recall: 0.9553137428192665
F1-score: 0.9587560286046898


Not Normalised - Outliers Only 

In [None]:
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
df_outlier_notnorm = pd.read_csv(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_outliers_only_copy_smote_resampled_noNormalised.csv')
df_outlier_notnorm

| Unnamed: 0 | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | smoking_history_encoded | gender_encoded | diabetes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 54.000000 | 0 | 0 | 54.700000 | 6.000000 | 100 | 0.310171 | -0.310005 | 0 |
| 1 | 34.000000 | 0 | 0 | 56.430000 | 6.200000 | 200 | -0.052540 | -0.310005 | 0 |
| 2 | 67.000000 | 0 | 0 | 63.480000 | 8.800000 | 155 | -0.052540 | -0.310005 | 1 |
| 3 | 38.000000 | 0 | 0 | 55.610000 | 6.500000 | 130 | -0.052540 | 0.561670 | 0 |
| 4 | 30.000000 | 0 | 0 | 50.130000 | 6.000000 | 100 | -0.273853 | -0.310005 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6163 | 58.619390 | 0 | 0 | 48.248887 | 5.914183 | 158 | -0.052540 | -0.310005 | 0 |
| 6164 | 36.411190 | 0 | 0 | 47.570729 | 4.800000 | 127 | -0.182851 | -0.310005 | 0 |
| 6165 | 15.709946 | 0 | 0 | 50.525147 | 6.213743 | 82 | -0.179245 | -0.310005 | 0 |
| 6166 | 40.330285 | 0 | 0 | 51.053191 | 6.401829 | 155 | -0.200757 | -0.022104 | 0 |


In [None]:
# Split the dataset into features and target variable
X = df_outlier_notnorm.drop(columns=['diabetes'])
y = df_outlier_notnorm['diabetes']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train an Adaptive Boosting (AdaBoost) classifier
ada_classifier = AdaBoostClassifier(random_state=42)
ada_classifier.fit(X_train, y_train)

AdaBoostClassifier(random_state=42)

In [None]:
# Evaluate the Adaptive Boosting classifier
y_pred_ada = ada_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_ada)
precision = precision_score(y_test, y_pred_ada)
recall = recall_score(y_test, y_pred_ada)
f1 = f1_score(y_test, y_pred_ada)

In [None]:
print("Adaptive Boosting Evaluation Report (Not Normalised outliers only):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Adaptive Boosting Evaluation Report (Not Normalised outliers only):
Accuracy: 0.9821717990275527
Precision: 0.9918433931484503
Recall: 0.9728
F1-score: 0.9822294022617124
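The four experiments above repeat the same steps: separate features and target, split 80/20, fit `AdaBoostClassifier`, and compute four metrics. One possible way to avoid the repetition is a small helper, sketched below. The function name `evaluate_adaboost` is hypothetical, and the demonstration uses synthetic data from `make_classification` rather than the diabetes files, so its numbers say nothing about the results reported here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_adaboost(df, target='diabetes', test_size=0.2, random_state=42):
    """Split df, fit AdaBoost with default settings, return the four metrics."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    clf = AdaBoostClassifier(random_state=random_state).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
    }


# Synthetic stand-in for one of the four data frames.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
demo_df = pd.DataFrame(X_demo, columns=[f'f{i}' for i in range(6)])
demo_df['diabetes'] = y_demo

report = evaluate_adaboost(demo_df)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's data, the helper could be called once per frame, for example `evaluate_adaboost(df_outlier_notnorm)`.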
