# 🩺 Diabetes Prediction using XGBoost

This notebook trains an XGBoost classifier on the Pima Indians Diabetes dataset to predict whether a patient has diabetes.  
We tune hyperparameters with RandomizedSearchCV, optimizing for recall, then evaluate the model on a held-out test set and save the trained model, the scaler, and the feature order for deployment.


In [70]:
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

In [71]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
import pickle

In [72]:
import numpy as np
import matplotlib.pyplot as plt

## 2. Load Dataset

In [74]:
diabetes = pd.read_csv('diabetes.csv')

In [75]:
diabetes.head()

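|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |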


In [76]:
diabetes['Pregnancies'].unique()

array([ 6,  1,  8,  0,  5,  3, 10,  2,  4,  7,  9, 11, 13, 15, 17, 12, 14],
      dtype=int64)

In [77]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [78]:
diabetes.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [79]:
# Separate the eight predictor columns from the binary target
X = diabetes.drop(columns=['Outcome'])
y = diabetes['Outcome']

## 3. Train-Test Split

In [81]:
# Hold out 20% of the rows; stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
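
As a rough check on what this split produces, the sizes can be worked out from the figures printed earlier (768 rows, and an `Outcome` mean of 0.348958). The sketch below is plain arithmetic, not notebook output. It assumes scikit-learn rounds a fractional test size up, and the per-class count is an approximation of what stratification yields.

```python
import math

n_samples = 768            # rows reported by diabetes.info()
test_size = 0.2            # same value passed to train_test_split
positive_rate = 0.348958   # mean of Outcome from diabetes.describe()

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Number of diabetic cases in the full dataset.
n_positive = round(positive_rate * n_samples)

# With stratify=y the test split keeps roughly the same positive rate.
expected_test_positive = round(n_test * n_positive / n_samples)
expected_train_positive = n_positive - expected_test_positive

print("train rows:", n_train, "test rows:", n_test)
print("positives overall:", n_positive)
print("approx. positives in test:", expected_test_positive)
print("approx. positives in train:", expected_train_positive)
```

Only about a third of the rows are positive, so without stratification a small test set could end up with noticeably more or fewer diabetic cases than the training set.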

In [82]:
X_train.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

## 4. Feature Scaling

In [84]:
# Create the scaler; it is fit on the training split and saved in the next cell
scaler = StandardScaler()

In [85]:
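diabetes_feature_order = X_train.columns.to_list()
# Fit the scaler on the training split only, then persist it.
# Note: X_train_scaled is not used again in this notebook; the search below is fit on the unscaled X_train.
X_train_scaled = scaler.fit_transform(X_train)
with open('diabetes_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)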


## 5. Define XGBoost Model & Hyperparameter Space

In [87]:
xgb = XGBClassifier(
    eval_metric="logloss",
    random_state=42
)


In [88]:
param_dist = {
    "n_estimators": np.arange(100, 600, 100),        
    "max_depth": np.arange(3, 10, 1),                
    "learning_rate": np.linspace(0.01, 0.3, 10),     
    "subsample": np.linspace(0.6, 1.0, 5),           
    "colsample_bytree": np.linspace(0.6, 1.0, 5),    
    "gamma": np.linspace(0, 0.4, 5),                 
    "reg_lambda": np.arange(1, 6, 1)             
}
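
To get a feel for how large this search space is, the sketch below counts the candidate values per hyperparameter and multiplies them. It rebuilds the same `param_dist` so it can run on its own. The names `sizes`, `n_combinations`, and `fraction_sampled` are illustrative and not part of the notebook.

```python
import math
import numpy as np

# Same search space as defined in the notebook cell above.
param_dist = {
    "n_estimators": np.arange(100, 600, 100),
    "max_depth": np.arange(3, 10, 1),
    "learning_rate": np.linspace(0.01, 0.3, 10),
    "subsample": np.linspace(0.6, 1.0, 5),
    "colsample_bytree": np.linspace(0.6, 1.0, 5),
    "gamma": np.linspace(0, 0.4, 5),
    "reg_lambda": np.arange(1, 6, 1),
}

# Number of candidate values for each hyperparameter.
sizes = {name: len(values) for name, values in param_dist.items()}

# Total number of distinct combinations an exhaustive grid search would try.
n_combinations = math.prod(sizes.values())

# The randomized search in this notebook samples 50 of them.
n_iter = 50
fraction_sampled = n_iter / n_combinations

print(sizes)
print("total combinations:", n_combinations)
print(f"fraction sampled with n_iter={n_iter}: {fraction_sampled:.5%}")
```

An exhaustive grid with 5-fold cross-validation would need more than a million fits, which is why a random sample of the grid is used instead.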


## 6. Hyperparameter Tuning with RandomizedSearchCV
We use RandomizedSearchCV to sample 50 random parameter combinations, each evaluated with 5-fold cross-validation.
The optimization metric is recall: in a medical diagnosis problem, false negatives are the costlier error, since we don't want to miss actual diabetic cases.
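
A small illustration of why recall rather than accuracy is the target: the two hypothetical sets of predictions below have the same accuracy, but one misses half of the positive cases. The labels and predictions are made up for this example and are unrelated to the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return true_pos / (true_pos + false_neg)


# Hypothetical ground truth: 4 diabetic patients, 6 non-diabetic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# "Cautious" predictions: no false alarms, but two diabetic patients are missed.
cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# "Sensitive" predictions: two false alarms, but no diabetic patient is missed.
sensitive = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

for name, y_pred in [("cautious", cautious), ("sensitive", sensitive)]:
    print(name, "accuracy:", accuracy(y_true, y_pred), "recall:", recall(y_true, y_pred))
```

Both score 0.8 on accuracy, so an accuracy-driven search could not tell them apart, while a recall-driven search prefers the one that catches every positive case.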

In [90]:

rand_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_dist,
    n_iter=50,          # number of random combinations to try
    scoring="recall",
    cv=5,
    verbose=2,
    random_state=42,
    n_jobs=-1
)


In [91]:
rand_search.fit(X_train, y_train)


Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [92]:
print("Best Parameters:", rand_search.best_params_)
print("Best Recall Score (CV):", rand_search.best_score_)

Best Parameters: {'subsample': 0.7, 'reg_lambda': 4, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.07444444444444444, 'gamma': 0.4, 'colsample_bytree': 0.9}
Best Recall Score (CV): 0.621594684385382


## 7. Evaluate Model on Test Data

In [94]:

best_model = rand_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

In [64]:
print("\nTest Set Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall (Sensitivity):", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_pred_proba))


Test Set Performance:
Accuracy: 0.7402597402597403
Precision: 0.6296296296296297
Recall (Sensitivity): 0.6296296296296297
F1 Score: 0.6296296296296297
ROC-AUC: 0.8255555555555556
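
The notebook imports `confusion_matrix` but never prints it. The reported accuracy, precision, recall, and F1 are consistent with the counts below, which are inferred here on the assumption that the stratified test split holds 154 rows with 54 positives. They are a reconstruction, not notebook output. Running `confusion_matrix(y_test, y_pred)` would confirm them.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption, see above).
true_pos = 34    # diabetic, predicted diabetic
false_neg = 20   # diabetic, predicted non-diabetic (missed cases)
false_pos = 20   # non-diabetic, predicted diabetic
true_neg = 80    # non-diabetic, predicted non-diabetic

total = true_pos + false_neg + false_pos + true_neg

accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

Precision, recall, and F1 coincide because the number of false positives equals the number of false negatives. Under these counts, 20 of the 54 diabetic patients in the test set are missed at the default decision threshold.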


## 8. Save Model

In [39]:
# Use a context manager so the file is closed once the model is written
with open('diabetes_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

In [41]:
# Also save the feature order so inputs can be arranged the same way at prediction time
with open('diabetes_feature_order.pkl', 'wb') as f:
    pickle.dump(diabetes_feature_order, f)

print("Model and scaler have been saved!")

Model and scaler have been saved!
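
At deployment time the saved feature order can be used to arrange an incoming record in the same column order as the training data. The sketch below shows one possible way to do that. The helpers `save_pickle`, `load_pickle`, and `to_feature_row` are hypothetical, and it round-trips the list through a temporary directory so that it runs without the notebook's files.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Write obj to path, closing the file afterwards."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def to_feature_row(record, feature_order):
    """Arrange a dict of feature values in the column order used for training."""
    missing = [name for name in feature_order if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[name] for name in feature_order]


# Column order as printed by X_train.columns in this notebook.
feature_order = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Round-trip through a temporary file instead of the real artifact.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "diabetes_feature_order.pkl"
    save_pickle(feature_order, path)
    restored_order = load_pickle(path)

# First row of diabetes.head(), with keys deliberately out of order.
record = {"Age": 50, "BMI": 33.6, "Glucose": 148, "Pregnancies": 6,
          "Insulin": 0, "SkinThickness": 35, "BloodPressure": 72,
          "DiabetesPedigreeFunction": 0.627}

row = to_feature_row(record, restored_order)
print(row)
```

With the real artifacts, the model would be loaded the same way from `diabetes_model.pkl` and given this row as a one-row DataFrame whose columns follow `restored_order`. Since the search in this notebook was fit on the unscaled `X_train`, the row should be passed unscaled unless the model is retrained on scaled features. Pickle files should only be loaded from a trusted source.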
