# Classification - Gradient Boosting Classifier

Why use a Gradient Boosting Classifier?

- Handles Non-linearity: Diabetes classification often involves complex, non-linear relationships between features and the target variable. Gradient boosting can effectively model such relationships.

- Feature Importance: It provides insights into feature importance, which can help identify the most relevant features for diabetes prediction, aiding in feature selection and interpretation.

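- Control over Overfitting: Gradient boosting implementations such as XGBoost and LightGBM, as well as the scikit-learn GradientBoostingClassifier used here, offer regularization parameters (learning rate, tree depth, subsampling) and early stopping. These reduce the risk of overfitting on large, high-dimensional datasets, although they do not eliminate it.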

- Ensemble of Weak Learners: By combining multiple weak learners (typically decision trees), gradient boosting creates a strong ensemble model capable of capturing intricate patterns in the data, leading to improved predictive performance.

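- Handling Missing Data: Some gradient boosting implementations (XGBoost, LightGBM, scikit-learn's HistGradientBoostingClassifier) handle missing values natively, which reduces the need for extensive preprocessing. The GradientBoostingClassifier used in this notebook has traditionally required missing values to be imputed first, so this benefit applies only if one of those implementations is swapped in.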

- High Predictive Accuracy: Gradient boosting often yields high predictive accuracy on a variety of datasets, making it a reliable choice for diabetes classification tasks where accuracy is crucial for patient diagnosis and treatment.
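
Two of the points above, feature importance and early stopping, can be illustrated with a minimal sketch. It runs on synthetic data from `make_classification` rather than the diabetes dataset, and the parameter values (`n_estimators=300`, `validation_fraction=0.1`, `n_iter_no_change=5`) are illustrative assumptions, not settings used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# Early stopping: hold out 10% of the training data and stop adding trees
# once the validation score fails to improve for 5 consecutive iterations
demo_model = GradientBoostingClassifier(
    n_estimators=300,
    validation_fraction=0.1,
    n_iter_no_change=5,
    random_state=42,
)
demo_model.fit(X_demo, y_demo)

# n_estimators_ is the number of trees actually fitted
print("Trees fitted:", demo_model.n_estimators_)

# Impurity-based importances: one value per feature, normalised to sum to 1
importances = demo_model.feature_importances_
print("Feature importances:", importances.round(3))
```

Impurity-based importances are a quick first look; they describe how the fitted trees used each feature, not a causal effect on the outcome.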

In [43]:
# Import packages
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Normalised - Without Outliers

In [44]:
# Load new_df_without_outliers_copy_smote_resampled.xlsx (normalised, outliers removed).
# A raw string keeps the backslashes in the Windows path literal.
df_without_outlier = pd.read_excel(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_without_outliers_copy_smote_resampled.xlsx')
df_without_outlier

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,smoking_history_encoded,gender_encoded,diabetes
0,1.349487,0,1,-0.580455,0.629251,-0.317312,0.263730,-0.974068,0
1,0.149555,0,0,-0.241118,0.629251,-1.649552,-1.579747,-0.974068,0
2,-1.050377,0,0,-0.241118,-0.272192,0.082360,0.263730,1.211318,0
3,-0.681167,0,0,-0.857661,-0.973313,0.015748,0.331039,-0.974068,0
4,1.164882,1,1,-1.384988,-1.173634,0.015748,0.331039,1.211318,0
...,...,...,...,...,...,...,...,...,...
181139,1.349487,0,0,-0.261522,0.979109,-0.628168,1.103451,1.211318,1
181140,1.349487,0,0,-0.241118,-0.172031,0.015748,0.624467,-0.399610,1
181141,-0.727318,0,0,-0.241061,1.025419,0.104564,0.763605,1.211318,1
181142,0.334160,0,0,-0.241118,0.045409,0.015748,-1.579747,1.211318,1


In [45]:
# Split the dataset into features and target variable
X = df_without_outlier.drop(columns=['diabetes'])
y = df_without_outlier['diabetes']
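
The table above includes an `Unnamed: 0` column, which looks like a row index written out when the file was saved. The cell above keeps it in `X`, so the model can use it as a feature. If the resampled rows were appended at the end of the file (an assumption suggested by the displayed rows, not confirmed here), row position could correlate with the label. A minimal sketch of dropping it alongside the target, using a tiny hypothetical frame with made-up values:

```python
import pandas as pd

# Tiny hypothetical frame that mimics the layout of the loaded file
layout_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "age": [1.35, 0.15, -1.05, 1.35],
    "bmi": [-0.58, -0.24, -0.24, -0.26],
    "diabetes": [0, 0, 0, 1],
})

# Drop the saved row index together with the target column
X_layout = layout_df.drop(columns=["Unnamed: 0", "diabetes"])
y_layout = layout_df["diabetes"]

print(list(X_layout.columns))
```

Applying the same change to the real data would alter the scores reported in this notebook, so any comparison should use one treatment consistently.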

In [46]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
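
The split above is purely random. One optional variation is to pass `stratify=y` so that both partitions keep the same class proportions. The sketch below shows this on a toy array; the 80/20 class mix is a made-up assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positive class (hypothetical values)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

For data that is already balanced, stratification mostly guards against an unlucky split rather than changing the result much.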

In [47]:
# Train a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(X_train, y_train)

GradientBoostingClassifier(random_state=42)

In [48]:
# Evaluate the Gradient Boosting classifier
y_pred_gb = gb_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_gb)
precision = precision_score(y_test, y_pred_gb)
recall = recall_score(y_test, y_pred_gb)
f1 = f1_score(y_test, y_pred_gb)

In [49]:
print("Gradient Boosting Classifier Evaluation Report (Normalised w/o outlier):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Gradient Boosting Classifier Evaluation Report (Normalised w/o outlier):
Accuracy: 0.9735294929476387
Precision: 0.9868802180950759
Recall: 0.9597878921785241
F1-score: 0.9731455293887039
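
The same split, fit and score steps are repeated for each dataset in this notebook. One way to cut the repetition is a small helper; `evaluate_gb` below is a hypothetical name, and the sketch runs on synthetic data so that it is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_gb(df, target="diabetes", test_size=0.2, random_state=42):
    """Fit a default GradientBoostingClassifier and return test-set metrics."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = GradientBoostingClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }


# Synthetic stand-in data (an assumption; not the diabetes dataset)
features, labels = make_classification(n_samples=400, n_features=6, random_state=0)
toy_df = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])
toy_df["diabetes"] = labels

metrics = evaluate_gb(toy_df)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a helper like this, each of the four datasets could be scored with a single call after loading.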


## Normalised - Outliers Only

In [50]:
# Load new_df_outliers_only_copy_smote_resampled.xlsx (normalised, outliers only).
# A raw string keeps the backslashes in the Windows path literal.
df_outlier = pd.read_excel(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_outliers_only_copy_smote_resampled.xlsx')
df_outlier

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,smoking_history_encoded,gender_encoded,diabetes
0,0.174516,0,0,0.930849,-0.357602,-1.154539,1.806645,-0.777719,0
1,-1.018012,0,0,1.073804,-0.233789,0.206884,-0.280453,-0.777719,0
2,0.949660,0,0,1.656370,1.375782,-0.405756,-0.280453,-0.777719,1
3,-0.779506,0,0,1.006045,-0.048069,-0.746112,-0.280453,1.417943,0
4,-1.256518,0,0,0.553214,-0.357602,-1.154539,-1.553924,-0.777719,0
...,...,...,...,...,...,...,...,...,...
6163,0.449954,0,0,0.397771,-0.410728,-0.364914,-0.280453,-0.777719,0
6164,-0.874241,0,0,0.341732,-1.100480,-0.786955,-1.030285,-0.777719,0
6165,-2.108582,0,0,0.585866,-0.225281,-1.399595,-1.009532,-0.777719,0
6166,-0.640560,0,0,0.629500,-0.108843,-0.405756,-1.133316,-0.052526,0


In [51]:
# Split the dataset into features and target variable
X = df_outlier.drop(columns=['diabetes'])
y = df_outlier['diabetes']

In [52]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [53]:
# Train a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(X_train, y_train)

GradientBoostingClassifier(random_state=42)

In [54]:
# Evaluate the Gradient Boosting classifier
y_pred_gb = gb_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_gb)
precision = precision_score(y_test, y_pred_gb)
recall = recall_score(y_test, y_pred_gb)
f1 = f1_score(y_test, y_pred_gb)

In [55]:
print("Gradient Boosting Classifier Evaluation Report (Normalised outliers only):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Gradient Boosting Classifier Evaluation Report (Normalised outliers only):
Accuracy: 0.9813614262560778
Precision: 0.9966996699669967
Recall: 0.9664
F1-score: 0.9813160032493907
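
In the report above, precision is higher than recall, which means false positives are rarer than false negatives on the test set. The relationship between the two scores and the confusion matrix can be checked by hand. The labels below are a made-up toy example, not predictions from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: one positive case is missed, no negative case is flagged
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that are found

print("Precision:", precision_manual, precision_score(y_true_toy, y_pred_toy))
print("Recall:", recall_manual, recall_score(y_true_toy, y_pred_toy))
```

For a screening task such as diabetes prediction, a missed positive (false negative) is often the costlier error, so recall deserves at least as much attention as accuracy.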


## Not Normalised - Without Outliers

In [56]:
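# Load the not-normalised dataset with outliers removed.
# A raw string keeps the backslashes in the Windows path literal.
df_without_outlier_notnorm = pd.read_csv(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_without_outliers_copy_smote_resampled_noNormalised.csv')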
df_without_outlier_notnorm

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,smoking_history_encoded,gender_encoded,diabetes
0,80.0,0,1,25.190000,6.600000,140,0.131757,-0.128959,0
1,54.0,0,0,27.320000,6.600000,80,-0.797024,-0.128959,0
2,28.0,0,0,27.320000,5.700000,158,0.131757,0.160772,0
3,36.0,0,0,23.450000,5.000000,155,0.165669,-0.128959,0
4,76.0,1,1,20.140000,4.800000,155,0.165669,0.160772,0
...,...,...,...,...,...,...,...,...,...
181139,80.0,0,0,27.191924,6.949298,126,0.554826,0.160772,1
181140,80.0,0,0,27.320000,5.800000,155,0.313504,-0.052799,1
181141,35.0,0,0,27.320357,6.995535,159,0.383605,0.160772,1
181142,58.0,0,0,27.320000,6.017092,155,-0.797024,0.160772,1


In [57]:
# Split the dataset into features and target variable
X = df_without_outlier_notnorm.drop(columns=['diabetes'])
y = df_without_outlier_notnorm['diabetes']

In [58]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [59]:
# Train a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(X_train, y_train)

GradientBoostingClassifier(random_state=42)

In [60]:
# Evaluate the Gradient Boosting classifier
y_pred_gb = gb_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_gb)
precision = precision_score(y_test, y_pred_gb)
recall = recall_score(y_test, y_pred_gb)
f1 = f1_score(y_test, y_pred_gb)

In [61]:
print("Gradient Boosting Classifier Evaluation Report (Not Normalised w/o outlier):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Gradient Boosting Classifier Evaluation Report (Not Normalised w/o outlier):
Accuracy: 0.9735570951447735
Precision: 0.9869362717255481
Recall: 0.9597878921785241
F1-score: 0.973172780733688


## Not Normalised - Outliers Only

In [62]:
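# Load the not-normalised dataset containing outliers only.
# A raw string keeps the backslashes in the Windows path literal.
df_outlier_notnorm = pd.read_csv(r'C:\wamp64\www\IS424-Data-Mining\Data_Set\new_df_outliers_only_copy_smote_resampled_noNormalised.csv')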
df_outlier_notnorm

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,smoking_history_encoded,gender_encoded,diabetes
0,54.000000,0,0,54.700000,6.000000,100,0.310171,-0.310005,0
1,34.000000,0,0,56.430000,6.200000,200,-0.052540,-0.310005,0
2,67.000000,0,0,63.480000,8.800000,155,-0.052540,-0.310005,1
3,38.000000,0,0,55.610000,6.500000,130,-0.052540,0.561670,0
4,30.000000,0,0,50.130000,6.000000,100,-0.273853,-0.310005,0
...,...,...,...,...,...,...,...,...,...
6163,58.619390,0,0,48.248887,5.914183,158,-0.052540,-0.310005,0
6164,36.411190,0,0,47.570729,4.800000,127,-0.182851,-0.310005,0
6165,15.709946,0,0,50.525147,6.213743,82,-0.179245,-0.310005,0
6166,40.330285,0,0,51.053191,6.401829,155,-0.200757,-0.022104,0


In [63]:
# Split the dataset into features and target variable
X = df_outlier_notnorm.drop(columns=['diabetes'])
y = df_outlier_notnorm['diabetes']

In [64]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
# Train a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(X_train, y_train)

GradientBoostingClassifier(random_state=42)

In [66]:
# Evaluate the Gradient Boosting classifier
y_pred_gb = gb_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred_gb)
precision = precision_score(y_test, y_pred_gb)
recall = recall_score(y_test, y_pred_gb)
f1 = f1_score(y_test, y_pred_gb)

In [67]:
print("Gradient Boosting Classifier Evaluation Report (Not Normalised outliers only):")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Gradient Boosting Classifier Evaluation Report (Not Normalised outliers only):
Accuracy: 0.9813614262560778
Precision: 0.9966996699669967
Recall: 0.9664
F1-score: 0.9813160032493907
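
`GridSearchCV` is imported at the top of the notebook but is not used in the cells shown, and every model above runs with default hyperparameters. A possible next step is a small grid search. The sketch below uses synthetic data, and the grid values and the `f1` scoring choice are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_tune, y_tune = make_classification(n_samples=300, n_features=6, random_state=7)

# Hypothetical search grid, kept small so the example runs quickly
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# 3-fold cross-validation over every combination, scored by F1
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_tune, y_tune)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 4))
```

On the larger datasets in this notebook, the same search would take far longer, so a narrower grid or a subsample may be needed.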
