Dataset link: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

import pickle

import warnings
warnings.filterwarnings("ignore")

# **Task 1: Data Loading**

In [3]:
df = pd.read_csv("diabetes_prediction_dataset.csv")

print(df.shape)
df.head()

(100000, 9)


,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


# **Task 2: Data Preprocessing**

2.1 Missing Value Check

In [4]:
df.isnull().sum()

gender,0
age,0
hypertension,0
heart_disease,0
smoking_history,0
bmi,0
HbA1c_level,0
blood_glucose_level,0
diabetes,0


The dataset contains no missing values. The check above reports zero missing entries for every column, so no imputation is required before modelling.

2.2 Outlier Detection and Removal

In [5]:
num_cols = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)

IQR = Q3 - Q1

df = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]

print("Outliers removed using IQR, Shape:", df.shape)

Outliers removed using IQR, Shape: (90387, 9)
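To make the IQR rule used above concrete, here is a minimal sketch on a single made-up column. The values are invented for illustration and are not taken from the dataset; only the fence logic mirrors the cell above.

```python
import pandas as pd

# Toy data (hypothetical values, not from the diabetes dataset)
toy = pd.DataFrame({"bmi": [20.0, 22.0, 24.0, 26.0, 28.0, 95.0]})

q1 = toy["bmi"].quantile(0.25)
q3 = toy["bmi"].quantile(0.75)
iqr = q3 - q1

# Fences: anything strictly outside [lower, upper] counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = toy[(toy["bmi"] >= lower) & (toy["bmi"] <= upper)]
print(lower, upper)
print(kept["bmi"].tolist())
```

In this toy column only the value 95.0 falls outside the fences, so it is the only row dropped.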


2.3 Encoding (Logical Explanation)

Categorical features such as gender and smoking history must be converted to numbers before a machine learning model can use them. Encoding is not done in Task 2, however, because it is handled inside the Task 3 preprocessing pipeline using OneHotEncoder. Keeping the encoder in the pipeline ensures that the same encoding, learned from the training data, is applied during training, cross-validation, and testing, and it avoids repeating the transformation by hand at each step, where inconsistencies could creep in.

2.4 Scaling (Logical Explanation)

Numerical features such as age, BMI, HbA1c level, and blood glucose level are measured on different scales, which can affect how a model learns, so scaling matters. Scaling is not done in Task 2, however, because it is handled inside the Task 3 preprocessing pipeline using StandardScaler. Fitting the scaler inside the pipeline helps prevent data leakage, since its mean and standard deviation are learned from the training data only, and it ensures the same scaling is applied during training and testing.
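A minimal sketch of what "fit on training data only" means for scaling, using a single made-up feature. The numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical), shaped as one numeric feature
train_glucose = np.array([[80.0], [100.0], [120.0]])
test_glucose = np.array([[140.0]])

scaler = StandardScaler()
scaler.fit(train_glucose)  # mean and std come from the training rows only

# The test row is scaled with the training statistics, not its own
scaled_test = scaler.transform(test_glucose)

print(scaler.mean_)
print(scaled_test)
```

The test value never influences `scaler.mean_`, which is the property that keeps information about the test set out of training.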

2.5 Feature Engineering

In [6]:
df['health_risk'] = df['hypertension'] + df['heart_disease']
print("Feature 'health_risk' added")

Feature 'health_risk' added


# **Task 3: Pipeline Creation**

3.1 Preprocessing Pipeline

In [7]:
num_features = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level', 'health_risk']
cat_features = ['gender', 'hypertension', 'heart_disease', 'smoking_history']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(drop='first'), cat_features)
])

3.2 Full pipeline with RandomForestClassifier

In [8]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

# **Task 4: Primary Model Selection**

Selected Model: Random Forest Classifier

Justification: I chose the Random Forest Classifier as the primary model because it works well with a mix of numeric and encoded categorical features. It can capture complex, non-linear patterns in the data, making it suitable for predicting diabetes. Because it is built from decision trees, it is also fairly robust to outliers and does not depend on feature scaling. Additionally, it has a small set of hyperparameters that are easy to tune, making it a reliable choice for diabetes prediction.
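The claim that a random forest does not depend on feature scaling can be checked informally. The sketch below uses synthetic data from `make_classification` (not the diabetes dataset) and compares forests trained on raw and standardized copies of the same features. It is an illustration under those assumptions, not a proof.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the point
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same settings and seed for both forests; only the feature scale differs
rf_raw = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
rf_scaled = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo_scaled, y_demo)

# Share of rows on which the two forests give the same prediction
agreement = (rf_raw.predict(X_demo) == rf_scaled.predict(X_demo_scaled)).mean()
print(agreement)
```

Tree splits depend on the ordering of feature values rather than their magnitude, so the two forests are expected to agree on essentially all rows.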

# **Task 5: Model Training**

5.1 Train-Test Split

In [10]:
X = df.drop('diabetes', axis=1)
y = df['diabetes']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

5.2 Train the model

In [11]:
pipeline.fit(X_train, y_train)

# **Task 6: Cross-Validation**

In [12]:
cv_scores = cross_val_score(
    pipeline,
    X_train,
    y_train,
    cv=5,
    scoring='accuracy'
)
print("Average Score:", np.mean(cv_scores))
print("Standard Deviation:", np.std(cv_scores))

Average Score: 0.9729355914258285
Standard Deviation: 0.00027563890341307214


# **Task 7: Hyperparameter Tuning**

In [13]:
param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [10, 20, None],
    'model__min_samples_split': [2, 5]
}

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Score:", grid.best_score_)

Best Parameters: {'model__max_depth': 10, 'model__min_samples_split': 2, 'model__n_estimators': 50}
Best Score: 0.9745951403006542


# **Task 8: Best Model Selection**

In [14]:
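# GridSearchCV uses refit=True by default, so best_estimator_ has already been
# refitted on the full training set with the best parameters; no extra fit is needed.
best_model = grid.best_estimator_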

# **Task 9: Model Performance Evaluation**

In [15]:
y_pred = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.973227126894568
Precision: 1.0
Recall: 0.46162402669632924
F1-score: 0.6316590563165906
Confusion Matrix:
 [[17179     0]
 [  484   415]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99     17179
           1       1.00      0.46      0.63       899

    accuracy                           0.97     18078
   macro avg       0.99      0.73      0.81     18078
weighted avg       0.97      0.97      0.97     18078
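The reported metrics can be reproduced by hand from the confusion matrix printed above, which helps when reading them. The sketch below copies the four counts from that output and recomputes each score.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 17179, 0    # actual class 0: predicted 0, predicted 1
fn, tp = 484, 415    # actual class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The arithmetic shows why accuracy alone is misleading here: class 0 makes up 17179 of the 18078 test rows, so accuracy stays near 0.97 even though 484 of the 899 diabetic cases are missed (recall of about 0.46).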



In [16]:
filename = "diabetes_prediction_model.pkl"

with open(filename, "wb") as file:
    pickle.dump(best_model, file)
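A saved pipeline can be loaded back with `pickle.load` and used for prediction directly, since the preprocessing steps are stored with the model. The sketch below shows the round trip on a small stand-in pipeline. The toy data, the `toy_model` name, and the temporary file path are assumptions for illustration; in the notebook the object would be `best_model` and the file `diabetes_prediction_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline on toy data (hypothetical; not the notebook's best_model)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
]).fit(X_toy, y_toy)

# Save to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as file:
    pickle.dump(toy_model, file)

with open(path, "rb") as file:
    loaded_model = pickle.load(file)

# The loaded pipeline should predict exactly like the original
same = np.array_equal(toy_model.predict(X_toy), loaded_model.predict(X_toy))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.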