## Model Training - Diabetes Dataset
### Introduction
This notebook trains and compares several machine learning models for diabetes prediction. Using the engineered features from `03_feature_engineering.ipynb`, we train, tune, and evaluate classification models to identify the best-performing approach.

**Dataset:** Diabetes Dataset (Kaggle)

**Objective:** Train, tune, and compare classification models to predict diabetes with optimal performance.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Model Training Steps:**
1. Import Libraries and Load Data
2. Baseline Model
3. Compare Multiple Models
4. Cross Validation
5. Hyperparameter Tuning
6. Final Model Evaluation & Save Model

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_validate, GridSearchCV

plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

X_train = pd.read_csv("../data/processed/X_train_final.csv")
X_test = pd.read_csv("../data/processed/X_test_final.csv")

y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test = pd.read_csv("../data/processed/y_test.csv").squeeze()

print("Loaded processed datasets:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")

### 2. Baseline Model

In [None]:
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, digits=4))

plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title("Confusion Matrix – Logistic Regression")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()
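
The classification report above is derived from the four cells of the confusion matrix. As a minimal sketch, the headline metrics can be recomputed by hand. The counts below are hypothetical, not the notebook's results, and the helper name `metrics_from_confusion` is an illustrative assumption. For binary labels, `confusion_matrix(y_test, y_pred).ravel()` returns the cells in the order `tn, fp, fn, tp`.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the arithmetic
example = metrics_from_confusion(tn=50, fp=10, fn=5, tp=35)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

In a screening setting such as diabetes prediction, recall on the positive class is often worth reading alongside accuracy, since a false negative is a missed case.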

### 3. Compare Multiple Models

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "KNN": KNeighborsClassifier()
}

results = []

# Convert once, outside the loop, so every model is fitted and scored on the same arrays
X_tr = X_train.to_numpy()
X_te = X_test.to_numpy()

for name, model in models.items():
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_te)

    # AUC needs positive-class probabilities; skip it for models without predict_proba
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_te)[:, 1]
        auc = roc_auc_score(y_test, y_proba)
    else:
        auc = None

    acc = accuracy_score(y_test, y_pred)
    results.append({"Model": name, "Accuracy": acc, "AUC": auc})

# Rank models by test-set AUC, best first
results_df = pd.DataFrame(results).sort_values(by="AUC", ascending=False)
print(results_df)
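
The comparison ranks models by ROC AUC, which can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as half). A minimal pure-Python sketch of that definition follows. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not values from the dataset.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

The quadratic loop is only for intuition; for binary labels, `roc_auc_score` computes the same quantity far more efficiently.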

### 4. Cross Validation

In [None]:
model = RandomForestClassifier(random_state=42)

# Cross-validate on the training split only, so the test set stays untouched
cv_results = cross_validate(
    model,
    X_train, y_train,
    cv=5,
    scoring=['accuracy', 'roc_auc', 'f1'],
    return_train_score=False
)

cv_df = pd.DataFrame({
    'Fold': range(1, 6),
    'Accuracy': cv_results['test_accuracy'],
    'AUC': cv_results['test_roc_auc'],
    'F1': cv_results['test_f1']
})

print("Mean CV Scores:")
print(cv_df.mean(numeric_only=True).round(3))

plt.figure(figsize=(8, 5))
sns.boxplot(data=cv_df.drop(columns='Fold'), palette='Set2')
plt.title("Cross-Validation Score Distribution (5-Fold)")
plt.ylabel("Score")
plt.tight_layout()
plt.show()
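
When `cv` is an integer and the estimator is a classifier, scikit-learn splits the data with stratified folds, so each fold keeps roughly the same class ratio. The sketch below spells out that loop by hand for a single metric. It runs on synthetic stand-in data from `make_classification`, which is an assumption for illustration and not the diabetes dataset, and the names `X_demo`, `y_demo` and `fold_aucs` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, mildly imbalanced stand-in data (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.65, 0.35], random_state=42
)

skf = StratifiedKFold(n_splits=5)
fold_aucs = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Fit a fresh model on 4 folds, score it on the held-out fold
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    val_proba = clf.predict_proba(X_demo[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_demo[val_idx], val_proba))

fold_aucs = np.array(fold_aucs)
print(f"AUC per fold: {np.round(fold_aucs, 3)}")
print(f"Mean and std: {fold_aucs.mean():.3f}, {fold_aucs.std():.3f}")
```

Reporting the standard deviation next to the mean helps show whether a difference between two models is larger than the fold-to-fold noise.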

### 5. Hyperparameter Tuning

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best AUC Score:", round(grid.best_score_, 4))

# Use a separate name so the model comparison table in results_df is not overwritten
grid_results_df = pd.DataFrame(grid.cv_results_)
grid_results_df = grid_results_df.sort_values(by='mean_test_score', ascending=False)
top10 = grid_results_df.head(10)

plt.figure(figsize=(10, 6))
sns.barplot(
    data=top10,
    x='mean_test_score',
    y=top10['params'].astype(str),
    palette='viridis'
)
plt.title("Top 10 Hyperparameter Combinations – AUC Score")
plt.xlabel("Mean AUC (CV)")
plt.ylabel("Hyperparameters")
plt.tight_layout()
plt.show()
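
The cost of the grid search above follows directly from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A small standard-library sketch of that arithmetic is shown below. The grid values are copied from the cell above; the names `demo_param_grid`, `candidates` and `n_cv_fits` are illustrative.

```python
from itertools import product

# Same values as param_grid above, under a separate name to avoid clobbering it
demo_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 5

# Enumerate the Cartesian product of all parameter values
keys = list(demo_param_grid)
candidates = [
    dict(zip(keys, values))
    for values in product(*(demo_param_grid[k] for k in keys))
]

n_candidates = len(candidates)        # 3 * 4 * 3
n_cv_fits = n_candidates * cv_folds   # one fit per candidate per fold
print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set. If the grid grows much larger, `RandomizedSearchCV` is a common way to cap the number of candidates.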

### 6. Final Model Evaluation & Save Model

In [None]:
best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Final Model Evaluation")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"AUC Score: {roc_auc_score(y_test, y_prob):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, digits=4))

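# Make sure the output directory exists before saving the model
import os
os.makedirs("../models", exist_ok=True)
joblib.dump(best_model, "../models/final_model.pkl")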

plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Greens', cbar=False)
plt.title("Confusion Matrix – Final Model")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_prob):.4f}", color='darkgreen')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.title("ROC Curve – Final Model")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.tight_layout()
plt.show()

### Conclusion
Random Forest was selected from the model comparison and then tuned with GridSearchCV. On the test set, the tuned model achieved an AUC of 0.8289 and an accuracy of 74.0%, with balanced precision and recall. The final model was saved to `../models/final_model.pkl` for deployment and future predictions.
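
For later use, a model saved with `joblib.dump` can be restored with `joblib.load` and called like the original estimator. The sketch below shows the round trip on a small stand-in model trained on synthetic data, written to a temporary directory. The data, the file name `demo_model.pkl` and all variable names are assumptions for illustration; they are not the notebook's final model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model on synthetic data (assumption, not the tuned model above)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=42)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, model_path)   # save
    reloaded = joblib.load(model_path)    # restore

# The restored model should reproduce the original predictions
new_rows = X_demo[:5]
same_predictions = np.array_equal(
    demo_model.predict(new_rows), reloaded.predict(new_rows)
)
print("Predictions match after reload:", same_predictions)
```

New inputs need the same feature columns, in the same order and with the same preprocessing, as the training data. It is also safest to load the file with the same scikit-learn version that was used to save it.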