# 🧠 Diabetes Prediction using Support Vector Machines (SVM)

## 📝 Project Overview
This project predicts diabetes from patient health data using Support Vector Machines (SVM). After preprocessing (encoding, scaling, and handling class imbalance), both a linear and an RBF SVM were trained. The RBF model's hyperparameters were tuned with GridSearchCV on a stratified 10,000-sample subset of the training set. Because only ~9% of cases are diabetic, evaluation on the test set focuses on F1-score and ROC AUC rather than accuracy. The RBF SVM achieved an F1-score of 0.60 and a recall of 0.92 on the diabetic class, compared with 0.59 and 0.89 for the linear SVM, and reduced false negatives from 93 to 72. These initial results suggest the RBF model is somewhat more effective at identifying diabetic patients, making it a reasonable baseline for further improvements.

⚠️ *Note: This project is still in progress. Further model tweaks, feature engineering, and comparisons with other classifiers may be made to improve performance.*




In [1]:
# 📦 1. Import libraries
import pandas as pd
import time
import logging
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, classification_report, confusion_matrix, roc_curve
from pre_processing import load_and_clean_data


In [2]:
# 📝 2. Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

In [3]:
# 📥 3. Load and preprocess data
logger.info("Loading and preprocessing dataset...")
X_train, X_val, X_test, y_train, y_val, y_test = load_and_clean_data("diabetes_prediction_dataset.csv", split=True, standardize=False)


2025-05-13 11:52:56,893 - INFO - Loading and preprocessing dataset...


In [4]:
# 📊 4. Feature scaling
logger.info("Scaling features...")
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)


2025-05-13 11:52:57,183 - INFO - Scaling features...
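
The scaling cell above fits the scaler on the training split only and reuses those statistics for the validation and test splits. A minimal pure-Python sketch of that behavior for a single feature is shown here. The function names and the toy values are illustrative assumptions, not part of the project code.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training values only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Scale values with previously learned statistics: (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # Constant feature: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Toy values for one feature (hypothetical, for illustration only).
train_feature = [18.0, 25.0, 32.0, 40.0]
test_feature = [16.0, 29.0, 45.0]

lo, hi = fit_min_max(train_feature)
scaled_train = apply_min_max(train_feature, lo, hi)
scaled_test = apply_min_max(test_feature, lo, hi)

print("train:", scaled_train)
print("test: ", scaled_test)
```

Training values land in [0, 1] by construction. Test values that fall outside the training range map outside [0, 1], which is expected when the scaler is not refit on the test data.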


In [5]:
# 📏 5. Check class balance
print("Class distribution in training set:")
print(y_train.value_counts(normalize=True))


Class distribution in training set:
diabetes
0    0.911774
1    0.088226
Name: proportion, dtype: float64
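
With roughly 9% positive cases, the later cells rely on `class_weight='balanced'`. scikit-learn documents that setting as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python on toy labels. The helper name and the toy label list are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with an imbalance similar to the training set (about 9% positive).
toy_labels = [0] * 91 + [1] * 9
weights = balanced_class_weights(toy_labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

On these toy labels the minority class receives about ten times the weight of the majority class, so misclassifying a diabetic case costs correspondingly more during training.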


In [6]:
# 🧪 6. Subset for fast tuning (optional)
subset_size = 10000
X_train_small, _, y_train_small, _ = train_test_split(
    X_train, y_train, train_size=subset_size, stratify=y_train, random_state=42
)
logger.info(f"Using subset of {subset_size} samples for tuning")


2025-05-13 11:52:57,279 - INFO - Using subset of 10000 samples for tuning


In [7]:
# 🔍 7. GridSearchCV for RBF kernel with class_weight balanced
param_grid_rbf = {
    'C': [2**i for i in range(-5, 16, 2)],       # 2^-5, 2^-3, ..., 2^15 (11 values)
    'gamma': [2**j for j in range(-15, 4, 2)]    # 2^-15, 2^-13, ..., 2^3 (10 values)
}

grid_rbf = GridSearchCV(
    SVC(kernel='rbf', class_weight='balanced'),
    param_grid_rbf,
    cv=3,
    scoring='f1',
    verbose=2,
    n_jobs=-1
)

logger.info("Starting GridSearch for RBF SVM...")
start = time.time()
grid_rbf.fit(X_train_small, y_train_small)
logger.info("RBF GridSearch completed in {:.2f} seconds".format(time.time() - start))
logger.info(f"Best RBF parameters: {grid_rbf.best_params_}")


2025-05-13 11:52:57,291 - INFO - Starting GridSearch for RBF SVM...


Fitting 3 folds for each of 110 candidates, totalling 330 fits


2025-05-13 12:00:09,724 - INFO - RBF GridSearch completed in 432.43 seconds
2025-05-13 12:00:09,724 - INFO - Best RBF parameters: {'C': 32768, 'gamma': 0.03125}
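
The candidate and fit counts in the log follow directly from the grid definition, and the selected `C` can be checked against the grid boundaries. The sketch below does both in plain Python. The `on_grid_edge` helper is a hypothetical name introduced for illustration, and the best parameters are copied from the log above.

```python
# Same grids as in the GridSearchCV cell.
C_values = [2**i for i in range(-5, 16, 2)]
gamma_values = [2**j for j in range(-15, 4, 2)]

n_folds = 3
n_candidates = len(C_values) * len(gamma_values)
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates, {n_fits} fits")

# Best parameters as reported in the log output above.
best_params = {'C': 32768, 'gamma': 0.03125}


def on_grid_edge(value, grid):
    """Return True if the value is the smallest or largest entry of the grid."""
    return value in (min(grid), max(grid))


c_on_edge = on_grid_edge(best_params['C'], C_values)
gamma_on_edge = on_grid_edge(best_params['gamma'], gamma_values)
print("C on grid edge:", c_on_edge)
print("gamma on grid edge:", gamma_on_edge)
```

The selected `C` of 32768 equals 2^15, the largest value searched, while `gamma` sits inside its range. A best value on the boundary is a common hint, not a guarantee, that extending the grid in that direction may be worth trying.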


In [8]:
# 🚀 8. Train final RBF SVM on full training set
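best_rbf = SVC(kernel='rbf',
               C=grid_rbf.best_params_['C'],
               gamma=grid_rbf.best_params_['gamma'],
               class_weight='balanced',
               # probability=True runs an extra internal cross-validation to
               # calibrate probabilities, which lengthens training; the
               # evaluation cell only uses decision_function scores.
               probability=True)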

logger.info("Training best RBF SVM on full dataset...")
start = time.time()
best_rbf.fit(X_train, y_train)
logger.info("RBF training completed in {:.2f} seconds".format(time.time() - start))


2025-05-13 12:00:09,737 - INFO - Training best RBF SVM on full dataset...
2025-05-13 12:59:10,158 - INFO - RBF training completed in 3540.42 seconds


In [9]:
# 🚀 9. Train Linear SVM with class_weight balanced
svm_linear = SVC(kernel='linear', C=1.0, class_weight='balanced', probability=True, random_state=42)
logger.info("Training Linear SVM on full dataset...")
start = time.time()
svm_linear.fit(X_train, y_train)
logger.info("Linear SVM training completed in {:.2f} seconds".format(time.time() - start))


2025-05-13 12:59:10,201 - INFO - Training Linear SVM on full dataset...
2025-05-13 13:08:24,514 - INFO - Linear SVM training completed in 554.31 seconds


In [10]:
# ✅ 10. Evaluate both models
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = best_rbf.predict(X_test)

y_score_linear = svm_linear.decision_function(X_test)
y_score_rbf = best_rbf.decision_function(X_test)

print("=== Linear SVM ===")
print("F1 Score:", f1_score(y_test, y_pred_linear))
print("ROC AUC:", roc_auc_score(y_test, y_score_linear))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_linear))
print("\nClassification Report:\n", classification_report(y_test, y_pred_linear))

print("\n=== RBF SVM ===")
print("F1 Score:", f1_score(y_test, y_pred_rbf))
print("ROC AUC:", roc_auc_score(y_test, y_score_rbf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rbf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rbf))


=== Linear SVM ===
F1 Score: 0.5875486381322957
ROC AUC: 0.9667443280010158
Confusion Matrix:
 [[7800  967]
 [  93  755]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.89      0.94      8767
           1       0.44      0.89      0.59       848

    accuracy                           0.89      9615
   macro avg       0.71      0.89      0.76      9615
weighted avg       0.94      0.89      0.91      9615


=== RBF SVM ===
F1 Score: 0.6029526029526029
ROC AUC: 0.9699404095762194
Confusion Matrix:
 [[7817  950]
 [  72  776]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.89      0.94      8767
           1       0.45      0.92      0.60       848

    accuracy                           0.89      9615
   macro avg       0.72      0.90      0.77      9615
weighted avg       0.94      0.89      0.91      9615
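
The reported precision, recall, and F1 for the diabetic class can be re-derived from the confusion matrices alone. The sketch below does that arithmetic in plain Python, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]`. The helper name is an assumption for illustration, and the matrices are copied from the output above.

```python
def positive_class_metrics(confusion):
    """Compute precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrices copied from the evaluation output above.
linear_cm = [[7800, 967], [93, 755]]
rbf_cm = [[7817, 950], [72, 776]]

linear_precision, linear_recall, linear_f1 = positive_class_metrics(linear_cm)
rbf_precision, rbf_recall, rbf_f1 = positive_class_metrics(rbf_cm)

print(f"Linear: precision={linear_precision:.2f} recall={linear_recall:.2f} f1={linear_f1:.4f}")
print(f"RBF:    precision={rbf_precision:.2f} recall={rbf_recall:.2f} f1={rbf_f1:.4f}")
print("False negatives avoided by RBF:", linear_cm[1][0] - rbf_cm[1][0])
```

Both models catch most diabetic cases, but fewer than half of their positive predictions are correct, which is why the F1-scores stay near 0.6 despite ROC AUC values around 0.97.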

