## *🔹 Hyperparameter Tuning Setup*
pandas: For data manipulation and analysis.

matplotlib.pyplot: For plotting graphs such as ROC curves.

train_test_split, GridSearchCV (from sklearn.model_selection):

  train_test_split: Split data into training and testing sets.

  GridSearchCV: Perform hyperparameter tuning with cross-validation.

StandardScaler (from sklearn.preprocessing): Standardize features for models sensitive to scale.

Pipeline (from sklearn.pipeline): Streamline sequential steps like scaling + modeling.

LogisticRegression (from sklearn.linear_model): Classification model.

RandomForestClassifier (from sklearn.ensemble): Ensemble-based tree model.

SVC (from sklearn.svm): Support Vector Classifier for classification tasks.

accuracy_score, classification_report, roc_curve, roc_auc_score (from sklearn.metrics):

  Evaluate model performance (accuracy, precision, recall, F1-score).

  Compute ROC curves and AUC for probabilistic classifiers.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score
import pickle


## *🔹 Load and Split Data*

Load X_full (all features), X_pca (PCA features), and y (target).

Split both datasets 80/20 for training/testing with stratification:

X_train_full, X_test_full, y_train_full, y_test_full

X_train_pca, X_test_pca, y_train_pca, y_test_pca

In [3]:
X_full = pd.read_csv("../data/processed/processed_X.csv")   # All features
X_pca = pd.read_csv("../data/processed/X_pca_95.csv")       # PCA features
y = pd.read_csv("../data/processed/y.csv").values.ravel()

# Split data for testing baseline
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
    X_full, y, test_size=0.2, random_state=42, stratify=y
)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca, y, test_size=0.2, random_state=42, stratify=y
)



## *🔹 SVM (All Features) – Hyperparameter Tuning*

- Hyperparameter grid:

  - C: [0.1, 1, 10]

  - gamma: ['scale', 'auto']

  - kernel: ['rbf', 'poly']

- GridSearchCV with 5-fold cross-validation to find the best parameters.

- Best estimator used to predict on the test set.

### Analysis:

- The SVM achieves good overall accuracy, comparable to other classifiers.

- Class 0 shows slightly higher recall, meaning negative cases are detected more reliably than positive ones.

- Hyperparameter tuning improves performance by selecting the best kernel, C, and gamma, balancing bias and variance.

- Using all features ensures the model has access to full information, but feature selection or PCA could reduce complexity without much loss in accuracy.

In [4]:
param_grid_svm = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'poly']
}

svm = SVC(probability=True, random_state=42)
grid_svm = GridSearchCV(svm, param_grid_svm, cv=5, scoring='accuracy', n_jobs=-1)
grid_svm.fit(X_train_full, y_train_full)

best_svm = grid_svm.best_estimator_
y_pred_svm = best_svm.predict(X_test_full)
y_prob_svm = best_svm.predict_proba(X_test_full)[:,1]

print("🔹 SVM (All Features) - Best Hyperparameters:", grid_svm.best_params_)
print("Accuracy:", accuracy_score(y_test_full, y_pred_svm))
print(classification_report(y_test_full, y_pred_svm))

🔹 SVM (All Features) - Best Hyperparameters: {'C': 0.1, 'gamma': 'auto', 'kernel': 'poly'}
Accuracy: 0.819672131147541
              precision    recall  f1-score   support

           0       0.87      0.79      0.83        33
           1       0.77      0.86      0.81        28

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



## *🔹 Random Forest (PCA) – Hyperparameter Tuning*

- Hyperparameter grid:

  - n_estimators: [100, 200] → Number of trees in the forest

  - max_depth: [6, 7, 8, None] → Maximum depth of each tree

  - min_samples_split: [2, 5] → Minimum samples required to split a node, to prevent overfitting

  - min_samples_leaf: [1, 2] → Minimum samples per leaf, to protect small trees

  - max_features: ['sqrt', None] → Number of features to consider when looking for the best split

- GridSearchCV with 5-fold cross-validation to find the best parameters.

- Best estimator used to predict on the test set.

### Analysis:

- Hyperparameter tuning allows the Random Forest to balance bias and variance, improving generalization.

- min_samples_split and min_samples_leaf help prevent overfitting on small samples.

- Using PCA features reduces dimensionality while maintaining strong predictive performance.

Overall, the tuned Random Forest is robust, with good accuracy and balanced class performance.

In [5]:
param_grid_rf = {
    'n_estimators': [100, 200],       # أقل خيارات لتقليل وقت التدريب
    'max_depth': [6, 7,8, None],        # أهم القيم فقط
    'min_samples_split': [2, 5],      # الأساسيات لتجنب overfitting
    'min_samples_leaf': [1, 2],       # لحماية الأشجار الصغيرة
    'max_features': ['sqrt', None]    # خيارات شائعة
}


rf = RandomForestClassifier(random_state=42)
grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train_pca, y_train_pca)

best_rf = grid_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test_pca)
y_prob_rf = best_rf.predict_proba(X_test_pca)[:,1]

print("🔹 Random Forest (PCA) - Best Hyperparameters:", grid_rf.best_params_)
print("Accuracy:", accuracy_score(y_test_pca, y_pred_rf))
print(classification_report(y_test_pca, y_pred_rf))

🔹 Random Forest (PCA) - Best Hyperparameters: {'max_depth': 7, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Accuracy: 0.8688524590163934
              precision    recall  f1-score   support

           0       0.96      0.79      0.87        33
           1       0.79      0.96      0.87        28

    accuracy                           0.87        61
   macro avg       0.88      0.88      0.87        61
weighted avg       0.89      0.87      0.87        61

