# Support Vector Machine (SVM) with RBF Kernel

Support Vector Machines (SVM) are supervised learning models that aim to find the optimal hyperplane separating classes with the maximum margin.  
They are particularly powerful when combined with non-linear kernels, which allow the model to capture complex relationships between features.

In this project, we use the **Radial Basis Function (RBF) kernel**, as it can map the data into a higher-dimensional space and model non-linear decision boundaries. This is well-suited for biomedical datasets, where the relationships between features and outcomes are often non-linear.
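
For reference, the RBF kernel scores the similarity of two points as exp(-γ‖x − z‖²). The snippet below is a minimal sketch of that formula, assuming two made-up points and an arbitrary `gamma`. The helper `rbf_similarity` is hypothetical and is only checked against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_similarity(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two 1-D vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Illustrative points and gamma (not taken from the dataset)
a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])
gamma = 0.5

manual = rbf_similarity(a, b, gamma)  # squared distance is 2, so this is exp(-1)
reference = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]

print(f"manual={manual:.4f}, sklearn={reference:.4f}")
```

A larger `gamma` makes the similarity decay faster with distance, so each training point influences a smaller neighbourhood.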

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, RocCurveDisplay, ConfusionMatrixDisplay, precision_recall_curve
)

from sklearn.svm import SVC 

# 1. Load and prepare the data

In [6]:
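# Load the dataset and recode the label from {1, 2} to {0, 1}
data = pd.read_csv("dataR2.csv")
data["target"] = data["Classification"].map({1: 0, 2: 1})
# Drop HOMA, which is highly collinear with other features (see below)
data = data.drop(columns=["HOMA"])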

During our Exploratory Data Analysis, we found no missing values, and we saw that the feature **HOMA** was highly collinear with the other variables (VIF ≈ 18). Although an SVM with an RBF kernel is less sensitive to multicollinearity than linear models, a redundant feature adds little information while increasing model complexity.  
For this reason, we decided to **remove HOMA** to reduce redundancy and simplify the feature space.
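
As a reminder of how a VIF value like the one above can be obtained, here is a minimal sketch. It is not the EDA code from this project: the data is synthetic, the helper `vif_table` is hypothetical, and it relies on the identity that the VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")

# Synthetic stand-in data: 'combo' is almost a linear combination of 'a' and 'b'
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo["combo"] = demo["a"] + demo["b"] + rng.normal(scale=0.2, size=200)
demo["noise"] = rng.normal(size=200)

vif = vif_table(demo)
print(vif.round(2))
```

In this toy example the nearly redundant column `combo` gets a VIF well above the common rule-of-thumb cutoff of 10, while the independent column `noise` stays close to 1.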

We apply a log transformation to right-skewed features in order to reduce the impact of extreme outliers and make the feature distributions closer to normal, which helps improve the performance and stability of SVM models that are sensitive to scale and outliers.  


In [7]:
skewed_features = ["Insulin", "Leptin", "Resistin", "MCP.1", "Adiponectin"] # Features to log-transform (already dropped HOMA)

# Apply log(1+x) transformation to handle skewness & avoid log(0)
for col in skewed_features:
    data[col] = np.log1p(data[col])
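
To illustrate the effect of this transformation, the sketch below measures skewness before and after `np.log1p` on synthetic log-normal values. These values only stand in for a right-skewed biomarker and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed values standing in for a biomarker such as Insulin
raw = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=500))
transformed = np.log1p(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The same check can be run on each column in `skewed_features` to confirm that the transformation actually reduces skewness for that feature.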

We scale the features to ensure that all variables contribute equally to the SVM model. Since SVM relies on distance-based calculations in the feature space, unscaled variables with larger ranges could dominate the model and bias the decision boundary. Scaling standardizes the feature ranges, improving both training stability and model performance.  


In [10]:
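# Separate the features from the label (drop both the recoded and the original label column)
X = data.drop(columns=["target", "Classification"])
y = data["target"]

# Stratified 80/20 split keeps the class ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)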
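
Fitting the scaler on the training split only keeps test-set statistics out of the model. When cross-validation is used, one option (an alternative, not what this notebook does) is to wrap the scaler and the classifier in a `Pipeline`, so that the scaler is re-fit inside every fold. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training split (sizes are illustrative)
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# The scaler is re-fit on the training part of every fold
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="recall")
print(f"recall per fold: {scores.round(2)}, mean: {scores.mean():.2f}")
```

With `make_pipeline`, the SVM step is named `svc`, so grid-search parameters would be written as `svc__C` and `svc__gamma`.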

# 2. Preliminary Support Vector Machine

In [19]:
clf_svm = SVC(probability=True, random_state=42)
clf_svm.fit(X_train_scaled, y_train)

| Parameter    | Value   |
|--------------|---------|
| C            | 1.0     |
| kernel       | 'rbf'   |
| degree       | 3       |
| gamma        | 'scale' |
| coef0        | 0.0     |
| shrinking    | True    |
| probability  | True    |
| tol          | 0.001   |
| cache_size   | 200     |
| class_weight | None    |


In [22]:
y_proba = clf_svm.predict_proba(X_test_scaled)[:, 1]


thresholds = [0.5, 0.4, 0.3, 0.2, 0.1]

results = []

for thresh in thresholds:
    y_pred_thresh = (y_proba >= thresh).astype(int)
    results.append({
        'Threshold': thresh,
        'Accuracy': accuracy_score(y_test, y_pred_thresh),
        'Precision': precision_score(y_test, y_pred_thresh),
        'Recall': recall_score(y_test, y_pred_thresh),
        'F1 Score': f1_score(y_test, y_pred_thresh)
    })

df_results = pd.DataFrame(results)

print(df_results.to_string(index=False))


auc  = roc_auc_score(y_test, y_proba)
print(f"ROC AUC:   {auc:.3f}")

 Threshold  Accuracy  Precision   Recall  F1 Score
       0.5  0.750000   0.818182 0.692308  0.750000
       0.4  0.875000   0.857143 0.923077  0.888889
       0.3  0.791667   0.722222 1.000000  0.838710
       0.2  0.625000   0.590909 1.000000  0.742857
       0.1  0.541667   0.541667 1.000000  0.702703
ROC AUC:   0.874


- At the default threshold of **0.5**, accuracy and F1 are both 0.75, but recall is relatively low (0.69): 4 of the 13 positive cases in the test set are missed.  
- Lowering the threshold to **0.4** raises recall to 0.92 while keeping precision high (0.86), resulting in the best overall F1-score (0.89) and the highest accuracy (0.88).  
- Decreasing the threshold further (≤0.3) pushes recall to 1.0, but precision drops from 0.72 at 0.3 to 0.54 at 0.1, which means more false positives.  

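The ROC AUC score of **0.874** indicates good discriminative ability overall, although it is estimated on a test set of only 24 samples.  
Given the medical context, where minimizing false negatives is critical, a threshold around **0.3–0.4** seems most appropriate, as it ensures high recall while maintaining a reasonable precision. Because the thresholds were compared on the test set itself, these figures are likely to be somewhat optimistic.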
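
One way to choose a threshold without consulting the test set is to use out-of-fold probabilities from the training data. The sketch below is a hedged illustration on synthetic data: the 0.90 recall target is a hypothetical value, and the approach is an alternative to the fixed threshold list used above, not the method of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Out-of-fold probabilities: each sample is scored by a model that never saw it
oof_proba = cross_val_predict(
    SVC(probability=True, random_state=0), X_demo, y_demo,
    cv=5, method="predict_proba"
)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_demo, oof_proba)

# Highest threshold whose recall still meets the target (hypothetical target)
target_recall = 0.90
meets_target = recall[:-1] >= target_recall  # recall has one more entry than thresholds
chosen = thresholds[meets_target].max()

achieved = recall_score(y_demo, (oof_proba >= chosen).astype(int))
print(f"chosen threshold: {chosen:.3f}, recall at that threshold: {achieved:.3f}")
```

The chosen threshold can then be applied once to the held-out test set, which keeps the test metrics an honest estimate.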

In the next section, we try to improve the predictions by tuning the hyperparameters with cross-validation.

# 3. Tuning SVM Hyperparameters with GridSearchCV

An SVM with an RBF kernel depends on two key hyperparameters that strongly influence its performance:

- **Regularization parameter (C):**  
  Controls the trade-off between maximizing the margin and minimizing classification errors.  
  - A **large C** means the model tries to classify all training points correctly, which may lead to overfitting.  
  - A **small C** allows more misclassifications but can result in a simpler and more generalizable model.  

- **Gamma (γ):**  
  Defines how far the influence of a single training sample reaches in the RBF kernel. The reach shrinks as γ grows.  
  - A **large γ** gives each sample a short reach, which creates a complex, highly flexible decision boundary that risks overfitting.  
  - A **small γ** gives each sample a long reach, which produces a smoother, simpler boundary that risks underfitting.  

Since both parameters interact, finding the right combination is crucial.  
We use **GridSearchCV** with cross-validation to test multiple values of `C` and `gamma`, aiming to identify the configuration that best balances bias and variance, and maximizes performance on unseen data.
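
The effect of `gamma` can be seen on a small toy problem. The sketch below uses synthetic `make_moons` data (not the project dataset) with `C` held at 1.0 and three arbitrary `gamma` values, and compares training and test accuracy.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a non-linear boundary (illustrative only)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

scores = {}
for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    scores[gamma] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"gamma={gamma:>6}: train acc={scores[gamma][0]:.2f}, "
          f"test acc={scores[gamma][1]:.2f}, support vectors={model.n_support_.sum()}")
```

A training accuracy that keeps rising with `gamma` while the test accuracy stalls or falls is the usual sign of overfitting.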


In [91]:
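# Candidate values for C and gamma ('scale' and 'auto' are scikit-learn's built-in heuristics)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1],
    'kernel': ['rbf']
}

svm = SVC(kernel="rbf", probability=True, random_state=42)

# 5-fold cross-validation on the training set, selecting the combination with the highest mean recall
grid_search_svm = GridSearchCV(
    estimator=svm,
    param_grid=param_grid,
    scoring="recall",
    cv=5,
    n_jobs=-1,
    verbose=0
)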

grid_search_svm.fit(X_train_scaled, y_train)

best_svm = grid_search_svm.best_estimator_
print("Best params:", grid_search_svm.best_params_)

Best params: {'C': 0.1, 'gamma': 'scale', 'kernel': 'rbf'}


In [92]:
y_proba_best_svm = best_svm.predict_proba(X_test_scaled)[:, 1]


thresholds = [0.5, 0.4, 0.3, 0.2, 0.1]

results = []

for thresh in thresholds:
    y_pred_best_svm = (y_proba_best_svm >= thresh).astype(int)
    results.append({
        'Threshold': thresh,
        'Accuracy': accuracy_score(y_test, y_pred_best_svm),
        'Precision': precision_score(y_test, y_pred_best_svm),
        'Recall': recall_score(y_test, y_pred_best_svm),
        'F1 Score': f1_score(y_test, y_pred_best_svm)
    })

df_results = pd.DataFrame(results)

print(df_results.to_string(index=False))


auc  = roc_auc_score(y_test, y_proba_best_svm)
print(f"ROC AUC:   {auc:.3f}")

 Threshold  Accuracy  Precision   Recall  F1 Score
       0.5  0.708333   0.800000 0.615385  0.695652
       0.4  0.833333   0.800000 0.923077  0.857143
       0.3  0.666667   0.631579 0.923077  0.750000
       0.2  0.583333   0.565217 1.000000  0.722222
       0.1  0.541667   0.541667 1.000000  0.702703
ROC AUC:   0.846


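After running GridSearchCV, the results did not improve on the default SVM model.  
The ROC AUC decreased slightly (0.846 vs 0.874), and the threshold-level metrics were equal or lower (for example, F1 at threshold 0.4 fell from 0.89 to 0.86).  
This suggests that the parameter grid we defined was either too restrictive (missing better values) or not well adapted to the dataset. Selecting on recall alone may also have contributed, since that metric can favor strongly regularized models (here `C = 0.1`) that over-predict the positive class.  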
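
To see how the candidates trade off recall against precision, `GridSearchCV` can record several metrics at once. The sketch below is an assumption-laden illustration: it uses synthetic data and a reduced grid, and it refits on F1 instead of recall, which is a different selection criterion from the one used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

search = GridSearchCV(
    SVC(kernel="rbf", random_state=0),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    scoring={"recall": "recall", "precision": "precision", "f1": "f1"},
    refit="f1",  # assumption: select on F1 rather than on recall alone
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per candidate, best F1 first
cv_table = (
    pd.DataFrame(search.cv_results_)
    [["param_C", "param_gamma", "mean_test_recall",
      "mean_test_precision", "mean_test_f1", "rank_test_f1"]]
    .sort_values("rank_test_f1")
)
print(cv_table.to_string(index=False))
```

Candidates with very high recall but low precision are easy to spot in such a table, which helps explain why a recall-only search may settle on a weak model.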

A limitation of GridSearchCV is that it evaluates only the combinations in a predefined grid.  
- If the grid is too small, we risk missing the optimal region.  
- If the grid is too large, the computation becomes very expensive without necessarily yielding better results.  

To address this, we turn to **RandomizedSearchCV**. Instead of testing all combinations, it randomly samples parameter values from specified distributions.  
This approach has several advantages:
- It explores a much wider search space more efficiently.  
- We can control the computation time by setting the number of iterations.  
- It is particularly effective for SVM with RBF, since performance is very sensitive to `C` and `gamma`, which are best searched across a wide, often logarithmic, range.  

Using RandomizedSearchCV should give us a better chance of finding stronger hyperparameters without being limited by a manually fixed grid.
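
A logarithmic search range can be expressed with `scipy.stats.loguniform`. The sketch below is illustrative only: the data is synthetic, the bounds for `C` and `gamma` are assumptions rather than tuned values, and F1 is used as the selection metric for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Log-uniform ranges: each decade (0.01-0.1, 0.1-1, ...) is sampled equally often.
# The bounds below are illustrative assumptions, not tuned values.
param_dist_log = {
    "C": loguniform(1e-2, 1e3),
    "gamma": loguniform(1e-4, 1e1),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf", random_state=0),
    param_distributions=param_dist_log,
    n_iter=30,
    scoring="f1",  # assumption: F1 chosen for this example
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)
print("Best params:", search.best_params_)
```

Compared with a uniform distribution over the same interval, this spends as many draws on small values (such as 0.001 to 0.01) as on large ones, which matters when the useful range spans several orders of magnitude.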

# 4. Tuning SVM Hyperparameters with RandomizedSearchCV

In [85]:
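from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Candidate gamma values: 500 draws from Uniform(0.01, 1.0), plus the 'scale' heuristic
numeric_gamma = uniform(loc=0.01, scale=0.99).rvs(500, random_state=42)
gamma_values = list(numeric_gamma) + ['scale']

param_dist = {
    'C': uniform(loc=0.05, scale=10),  # Uniform over [0.05, 10.05]
    'gamma': gamma_values,             # a list is sampled uniformly, so 'scale' is rarely drawn
    'kernel': ['rbf']
}

svm = SVC(probability=True, random_state=42)

# Evaluate 200 random candidates with 5-fold cross-validation, selecting on mean recall
random_search = RandomizedSearchCV(
    estimator=svm,
    param_distributions=param_dist,
    n_iter=200,
    scoring='recall',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbose=1
)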

random_search.fit(X_train_scaled, y_train)
print("Best params:", random_search.best_params_)


Fitting 5 folds for each of 200 candidates, totalling 1000 fits
Best params: {'C': np.float64(0.6308361216819947), 'gamma': np.float64(0.641181896641661), 'kernel': 'rbf'}


In [86]:
random_svm = random_search.best_estimator_
y_proba_random_svm = random_svm.predict_proba(X_test_scaled)[:, 1]


thresholds = [0.5, 0.4, 0.3, 0.2, 0.1]

results = []

for thresh in thresholds:
    y_pred_random_svm = (y_proba_random_svm >= thresh).astype(int)
    results.append({
        'Threshold': thresh,
        'Accuracy': accuracy_score(y_test, y_pred_random_svm),
        'Precision': precision_score(y_test, y_pred_random_svm),
        'Recall': recall_score(y_test, y_pred_random_svm),
        'F1 Score': f1_score(y_test, y_pred_random_svm)
    })

df_results = pd.DataFrame(results)

print(df_results.to_string(index=False))


auc  = roc_auc_score(y_test, y_proba_random_svm)
print(f"ROC AUC:   {auc:.3f}")

 Threshold  Accuracy  Precision   Recall  F1 Score
       0.5  0.791667   0.833333 0.769231  0.800000
       0.4  0.791667   0.750000 0.923077  0.827586
       0.3  0.750000   0.684211 1.000000  0.812500
       0.2  0.708333   0.650000 1.000000  0.787879
       0.1  0.666667   0.619048 1.000000  0.764706
ROC AUC:   0.839


### SVM with RBF Kernel – RandomizedSearchCV Results

The RandomizedSearchCV run produced results that were broadly consistent with both the baseline and GridSearchCV models:

- At **threshold 0.5**, performance was balanced with  
  - Accuracy: 0.79  
  - Precision: 0.83  
  - Recall: 0.77  
  - F1-score: 0.80  

- At **threshold 0.4**, recall improved to 0.92 with an F1-score of 0.83.  
- As with the baseline, recall reached 1.0 for thresholds ≤ 0.3, but at the cost of lower precision.  
- The overall **ROC AUC was 0.839**, slightly below the baseline (0.874) and GridSearchCV (0.846).  

**Interpretation:**  
Exploring a wider hyperparameter space with RandomizedSearchCV did not lead to a clear improvement. Instead, it suggests that the default SVM with RBF kernel was already a reasonable choice for this dataset, and that test performance varies only modestly across search strategies. With only 24 test samples, differences of this size should be read with caution.


# 5. Comparison of SVM Models (RBF Kernel)

| Model                 | Best Threshold | Accuracy | Precision | Recall | F1-score | ROC AUC |
|------------------------|----------------|----------|-----------|--------|----------|---------|
| **Baseline (default)** | 0.4            | 0.88     | 0.86      | 0.92   | 0.89     | 0.874   |
| **GridSearchCV**       | 0.4            | 0.83     | 0.80      | 0.92   | 0.86     | 0.846   |
| **RandomizedSearchCV** | 0.4            | 0.79     | 0.75      | 0.92   | 0.83     | 0.839   |

- The **baseline SVM** performed best overall, with the highest ROC AUC and the highest F1-score at threshold 0.4.  
- **GridSearchCV** and **RandomizedSearchCV** reached the same recall (0.92) at that threshold but did not yield better results.  
- This suggests that the default hyperparameters of the RBF SVM are already well-suited for this dataset, although with a 24-sample test set the accuracy gaps between models correspond to only one or two predictions.
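
To judge whether ROC AUC differences of this size are meaningful on a small test set, a percentile bootstrap can give a rough confidence interval. The sketch below is a hedged illustration: the helper `bootstrap_auc_ci` is hypothetical, and the labels and scores are synthetic values that only mimic a 24-sample test set with 13 positives. They are not the model outputs reported above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ROC AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Synthetic labels and scores for a 24-sample test set (13 positives, 11 negatives)
rng = np.random.default_rng(1)
y_demo = np.array([1] * 13 + [0] * 11)
score_demo = np.clip(0.5 + 0.25 * (y_demo - 0.5) + rng.normal(scale=0.2, size=24), 0, 1)

lower, upper = bootstrap_auc_ci(y_demo, score_demo)
print(f"AUC = {roc_auc_score(y_demo, score_demo):.3f}, 95% CI ~ [{lower:.3f}, {upper:.3f}]")
```

Applied to `y_test` and each model's predicted probabilities, overlapping intervals would indicate that the three models cannot be reliably ranked on this test set.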

