# Support Vector Machines

## Motivation: Why Support Vector Machines?

Support Vector Machines (SVMs) are a powerful class of supervised learning algorithms used for classification and regression. Whereas models such as logistic regression are probabilistic, a standard SVM is non-probabilistic and relies on the geometry of the data. The core idea is to find the hyperplane that best separates the classes. "Best" here means maximizing the margin, which is the distance between the hyperplane and the closest data points of each class. These closest points are called **support vectors**; they are the only training points that determine the decision boundary.

- Maximizes the **margin**, which tends to improve generalization.
- Works in high-dimensional spaces (e.g., gene expression, spectroscopy).
- Flexible: kernels allow non-linear separation.
- Soft margins tolerate some outliers and mislabeled points.
- Remains usable when the number of features exceeds the number of samples (common in biochemistry and genomics).



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Generate linear data
X_lin, y_lin = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, class_sep=2.0, random_state=42)

# Generate non-linear data
X_nonlin, y_nonlin = make_circles(n_samples=100, factor=0.5, noise=0.1, random_state=42)

def plot_decision_boundary(clf, X, y, ax, title):
    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
    ax.set_title(title)

# Fit models
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for data, ax_row, kernel in zip([(X_lin, y_lin), (X_nonlin, y_nonlin)], axes, ['linear', 'rbf']):
    X, y = data
    log_reg = LogisticRegression().fit(X, y)
    svm = SVC(kernel=kernel).fit(X, y)
    plot_decision_boundary(log_reg, X, y, ax_row[0], 'Logistic Regression')
    plot_decision_boundary(svm, X, y, ax_row[1], f'SVM ({kernel} kernel)')
plt.tight_layout()
plt.show()

## Some applications to basic sciences

### **Biology & Bioinformatics**

1.  Ahmad, M., & Hayat, M. (2022). A comparative study of machine learning approaches for the prediction of anticancer peptides. *Briefings in Bioinformatics, 23*(5), bbac327. [https://doi.org/10.1093/bib/bbac327](https://doi.org/10.1093/bib/bbac327)

2.  Deo, R. C. (2023). Machine learning in medicine. *Circulation, 148*(20), 1637–1655. [https://doi.org/10.1161/CIRCULATIONAHA.123.064953](https://doi.org/10.1161/CIRCULATIONAHA.123.064953)

3.  El-Saadawy, A. A., & El-Bakry, H. M. (2023). A proposed model for breast cancer diagnosis using a hybrid feature selection and a support vector machine. *Scientific Reports, 13*(1), 17799. [https://doi.org/10.1038/s41598-023-45091-8](https://doi.org/10.1038/s41598-023-45091-8)

4.  Su, R., Liu, Y., Zhang, R., Liu, T., & Wang, X. (2024). Deep-Resp-Forest: A deep learning model for predicting respiratory system-related adverse drug reactions. *Computers in Biology and Medicine, 171*, 108130. [https://doi.org/10.1016/j.compbiomed.2024.108130](https://doi.org/10.1016/j.compbiomed.2024.108130)

### **Chemistry & Materials Science**

5.  Al-Janabi, A. H. (2024). Classification of crude oil using machine learning algorithms based on their physical properties. *Egyptian Journal of Chemistry, 67*(11), 385-391. [https://doi.org/10.21608/ejchem.2024.267815.9221](https://doi.org/10.21608/ejchem.2024.267815.9221)

6.  Gao, C., Wang, Z., Li, Y., Wang, C., Li, D., & Lu, S. (2024). Development and validation of machine learning models for predicting the oral bioavailability of drugs. *AAPS PharmSciTech, 25*(2), 52. [https://doi.org/10.1208/s12249-024-02758-1](https://doi.org/10.1208/s12249-024-02758-1)

7.  He, Y., Zhao, Y., & Ai, Q. (2024). A data-driven machine learning framework for performance prediction and parameter optimization of vanadium flow batteries. *Journal of Energy Storage, 81*, 110433. [https://doi.org/10.1016/j.est.2024.110433](https://doi.org/10.1016/j.est.2024.110433)

8.  Zhu, X., Li, S., Yuan, S., & Li, G. (2024). Machine learning prediction of single-atom catalysts for CO2 reduction reaction. *Journal of Materials Chemistry A, 12*(2), 652-661. [https://doi.org/10.1039/D3TA05770K](https://doi.org/10.1039/D3TA05770K)

### **Physics**

9.  Lei, B., Wang, Y., Jiang, G., Li, C., & Ding, Z. (2024). Prediction of rockburst intensity grade based on the K-means and support vector machine. *International Journal for Numerical and Analytical Methods in Geomechanics, 48*(8), 1999-2019. [https://doi.org/10.1002/nag.3729](https://doi.org/10.1002/nag.3729)

10. Singh, V. K., & Foufoula-Georgiou, E. (2024). A hierarchical machine learning model for predicting sub-grid scale atmospheric turbulence. *Journal of Advances in Modeling Earth Systems, 16*(1), e2023MS003889. [https://doi.org/10.1029/2023MS003889](https://doi.org/10.1029/2023MS003889)

11. Wrembel, M., & Bąk, K. (2023). A new combined model for predicting the top-of-atmosphere daily incoming solar radiation. *Astronomy and Computing, 44*, 100742. [https://doi.org/10.1016/j.ascom.2023.100742](https://doi.org/10.1016/j.ascom.2023.100742)

12. Xie, Y., & Yang, B. (2021). Identification of the physical parameters of the nonlinear pendulum by the support vector machine method. *Physical Review E, 104*(4), 044201. [https://doi.org/10.1103/PhysRevE.104.044201](https://doi.org/10.1103/PhysRevE.104.044201)


## SVM Fundamentals
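An SVM finds the hyperplane that maximizes the margin, that is, the distance to the nearest data points of each class. These nearest points are called support vectors.

Given a dataset $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$, the hard-margin SVM solves:
$$
\min_{\mathbf{w},b} \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \quad \text{for all } i
$$

- $\mathbf{w}$: normal vector to the hyperplane
- $b$: bias
- $y_i \in \{-1, +1\}$: class label
- Margin width: $\frac{2}{\|\mathbf{w}\|}$

The soft-margin SVM introduces slack variables $\xi_i \geq 0$, which allow individual points to violate the margin, and a penalty $C$ that sets the cost of those violations:
$$
\min_{\mathbf{w},b,\xi} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0
$$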
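The margin width $2/\|\mathbf{w}\|$ can be checked numerically on a fitted linear model. The sketch below is illustrative only: it assumes toy data from `make_blobs` and a large `C` to approximate the hard-margin case, and it reads $\mathbf{w}$ and $b$ from the `coef_` and `intercept_` attributes of scikit-learn's `SVC`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (assumption: chosen only for illustration)
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# A large C penalizes margin violations heavily, approximating a hard margin
clf = SVC(kernel='linear', C=1000).fit(X, y)

w = clf.coef_[0]                        # normal vector of the hyperplane
b = clf.intercept_[0]                   # bias
margin_width = 2.0 / np.linalg.norm(w)

# Functional margin y_i * (w . x_i + b) of each support vector, with labels mapped to -1/+1
signed_labels = 2 * y[clf.support_] - 1
functional_margin = signed_labels * (clf.support_vectors_ @ w + b)

print(f"margin width 2/||w|| = {margin_width:.3f}")
print("functional margins of support vectors:", np.round(functional_margin, 3))
```

Values close to 1 indicate support vectors lying on the margin; smaller values indicate points inside the margin, which the soft-margin formulation permits.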

:::{exercise}
Derive the dual form of the SVM optimization problem and explain the role of Lagrange multipliers.
:::

### Visualizing the Margin

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs

# Create linearly separable data
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# Fit the SVM model
clf = svm.SVC(kernel='linear', C=1000)
clf.fit(X, y)

# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# Plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

# Plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])
# Plot support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
           linewidth=1, facecolors='none', edgecolors='k')
plt.title('Linearly Separable Data with SVM')
plt.show()

```{tip}
The hyperplane is equidistant from the closest points of each class (the support vectors), and the margin is maximized.
```

:::{exercise} Exploring the Margin
1.  Modify the `C` parameter in the `svm.SVC` function in the code above. What do you observe about the margin and the number of support vectors when you use a very small `C` (e.g., 0.01) versus a very large `C` (e.g., 10000)?
2.  What does the `C` parameter control in the context of the SVM classifier? *Hint: Think about the trade-off between a smooth decision boundary and classifying training points correctly.*
:::

:::{exercise} Identify Support Vectors
Use `clf.support_vectors_` to extract the support vectors and plot them on the figure above.
:::

## Comparison with Other Methods

### SVM vs. Logistic Regression

| Feature | Support Vector Machine (SVM) | Logistic Regression |
|---|---|---|
| **Underlying Principle** | Geometric: finds the optimal hyperplane that maximizes the margin between classes. | Statistical: models the probability of a certain class or event existing. |
| **Decision Boundary** | Can be linear or non-linear using the kernel trick. | Fundamentally linear, though can be extended to non-linear with feature engineering. |
| **Sensitivity to Outliers** | Less sensitive: the hinge loss is zero for points correctly classified beyond the margin, so only points near or inside the margin shape the boundary. | Can be more sensitive: the log loss is non-zero for every point, so all training points influence the fit. |
| **Use Cases** | Effective in high-dimensional spaces and for both linear and non-linear problems. Works well with unstructured data like text and images. | Performs well on linearly separable data and when a probabilistic interpretation is needed. Often a good first model to try. |
| **Robustness to Overfitting** | Strong, especially in high dimensions; margin maximization acts as regularization, tuned through `C`. | Moderate; depends on the regularization strength. |

In general, if the number of features is much larger than the number of training examples, logistic regression or a linear SVM is recommended. If the number of examples is intermediate and the data is not linearly separable, an SVM with a non-linear kernel is a good choice.
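One way to see the difference between the two columns is through the loss each model minimizes. The sketch below is a minimal illustration, assuming labels in $\{-1, +1\}$ and a handful of hand-picked values of the functional margin $m = y \, f(x)$; it compares the hinge loss used by the soft-margin SVM with the logistic loss.

```python
import numpy as np

# Functional margin m = y * f(x): large positive = confidently correct,
# negative = misclassified (values chosen by hand for illustration)
m = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])

hinge = np.maximum(0.0, 1.0 - m)       # soft-margin SVM loss
logistic = np.log1p(np.exp(-m))        # logistic regression loss

for mi, h, l in zip(m, hinge, logistic):
    print(f"margin={mi:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```

Points with $m \geq 1$ contribute exactly zero hinge loss, so they do not affect the SVM solution, whereas the logistic loss stays positive for every point.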

## The Kernel Trick: Handling Non-Linear Data

What if the data is not linearly separable? This is where the **kernel trick** comes in. It allows SVMs to classify non-linear data by mapping the inputs into a higher-dimensional space in which a linear separator can be found. The mapping is implicit: the algorithm only needs inner products between pairs of points, which the kernel function supplies directly, so the coordinates in the higher-dimensional space are never computed. This keeps the method computationally efficient.

Commonly used kernels include:
*   **Linear:** For linearly separable data. $K(x_i, x_j) = x_i^T x_j$
*   **Polynomial:** Creates a polynomial decision boundary. $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$
*   **Radial Basis Function (RBF):** A popular choice for its flexibility in handling complex relationships. $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
*   **Sigmoid:** Can be useful in certain neural network-like scenarios. $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

```{note}
**RBF is the default kernel in scikit-learn's `SVC`.** It handles complex, non-linear patterns (e.g., in metabolomics, protein folding).
```
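The idea that a kernel computes inner products in a higher-dimensional space without building that space can be verified directly for a small case. The sketch below is illustrative: the feature map `phi` is a hand-written helper (not part of scikit-learn) for the degree-2 polynomial kernel with $\gamma = 1$ and $r = 0$ in two dimensions, and the five random points are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # five arbitrary 2-D points

def phi(X):
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Inner products computed explicitly in the 3-D feature space
K_explicit = phi(X) @ phi(X).T

# The same values from the kernel function, without ever forming phi(X)
K_poly = polynomial_kernel(X, degree=2, gamma=1.0, coef0=0.0)

# RBF kernel written out by hand and compared with scikit-learn
gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_rbf_manual = np.exp(-gamma * sq_dists)

print("polynomial kernel matches explicit map:", np.allclose(K_explicit, K_poly))
print("manual RBF matches rbf_kernel:", np.allclose(K_rbf_manual, rbf_kernel(X, gamma=gamma)))
```

For the RBF kernel no finite explicit map exists, which is why only the kernel values themselves are compared there.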


In [None]:
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=100, factor=.1, noise=.1, random_state=42)

# Fit the SVM model with an RBF kernel
clf = svm.SVC(kernel='rbf', C=1, gamma='auto')
clf.fit(X, y)

# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# Plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

# Plot decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
           linestyles=['--', '-', '--'])
# Plot support vectors
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
           linewidth=1, facecolors='none', edgecolors='k')
plt.title('Non-Linearly Separable Data with RBF Kernel SVM')
plt.show()
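A fitted kernel SVM only needs its support vectors to make predictions. As a hedged illustration, the sketch below rebuilds the decision function of a binary `SVC` from its `dual_coef_`, `support_vectors_`, and `intercept_` attributes. A fixed numeric `gamma` of 0.5 is an assumption made here so the kernel can be recomputed by hand.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=42)

gamma = 0.5  # fixed numeric value (assumption) so the kernel can be recomputed
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

# Kernel values between the support vectors and the points to score
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)

# dual_coef_ stores (signed label * Lagrange multiplier) for each support vector
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

print("support vectors used:", clf.support_vectors_.shape[0], "of", X.shape[0])
print("matches decision_function:", np.allclose(manual, clf.decision_function(X)))
```

Training points that are not support vectors have a multiplier of zero and therefore drop out of the sum entirely.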

:::{exercise} Experimenting with Kernels

1.  In the code above, change the `kernel` parameter to `'linear'` and `'poly'`. How does the decision boundary change? Which kernel performs best for this dataset?
2.  For the RBF kernel, experiment with different values for the `gamma` parameter. What is the effect of a very small `gamma` versus a very large `gamma` on the decision boundary? What might this imply about the model's complexity and potential for overfitting?
:::
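The code cells in this notebook pass both `gamma='auto'` and `gamma='scale'`. According to the scikit-learn documentation, `'scale'` resolves to `1 / (n_features * X.var())` and `'auto'` to `1 / n_features`. The sketch below checks this on the moons data by comparing each named setting with its numeric equivalent; the `results` dictionary is just a helper for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.2, random_state=42)
n_features = X.shape[1]

gamma_scale = 1.0 / (n_features * X.var())   # documented meaning of 'scale'
gamma_auto = 1.0 / n_features                # documented meaning of 'auto'

results = {}
for name, value in [('scale', gamma_scale), ('auto', gamma_auto)]:
    by_name = SVC(kernel='rbf', gamma=name).fit(X, y)
    by_value = SVC(kernel='rbf', gamma=value).fit(X, y)
    # Identical gamma should give an identical decision function
    results[name] = np.allclose(by_name.decision_function(X),
                                by_value.decision_function(X))
    print(f"gamma='{name}' -> {value:.3f}, same model: {results[name]}")
```

Because `'scale'` depends on the variance of the inputs, standardizing the features changes the effective `gamma`, which is worth keeping in mind when comparing runs.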

## A more complex boundary

In [None]:
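from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Generate non-linear data
X_moon, y_moon = make_moons(n_samples=100, noise=0.2, random_state=42)

# Standardize the features (the scaler must be created before it is used)
scaler = StandardScaler()
X_moon_scaled = scaler.fit_transform(X_moon)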

# Fit SVM with RBF kernel
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_rbf.fit(X_moon_scaled, y_moon)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(X_moon_scaled[y_moon == 0, 0], X_moon_scaled[y_moon == 0, 1], c='red', label='Class 0', alpha=0.7)
plt.scatter(X_moon_scaled[y_moon == 1, 0], X_moon_scaled[y_moon == 1, 1], c='blue', label='Class 1', alpha=0.7)

# Decision boundary (grid)
xx, yy = np.meshgrid(np.linspace(X_moon_scaled[:, 0].min(), X_moon_scaled[:, 0].max(), 100),
                     np.linspace(X_moon_scaled[:, 1].min(), X_moon_scaled[:, 1].max(), 100))
Z = svm_rbf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contour(xx, yy, Z, levels=[0], colors='k', linestyles='-', alpha=0.7)
plt.contourf(xx, yy, Z, levels=50, alpha=0.3, cmap='RdBu')

plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title('SVM with RBF Kernel: Non-Linear Separation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


:::{exercise} Compare SVM vs Logistic Regression
Fit a logistic regression model on the same moons dataset. Compare accuracy and decision boundary.
:::

In [None]:
# YOUR CODE HERE




### Interactive Role of Hyperparameters

In [None]:
# Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import make_classification
import ipywidgets as widgets
from IPython.display import display


In [None]:
# Generate synthetic 2D data
X, y = make_classification(n_samples=150, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, class_sep=1.5, random_state=42)

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create interactive controls
@widgets.interact(
    C=widgets.FloatSlider(value=1.0, min=0.1, max=10.0, step=0.1, description='C'),
    gamma=widgets.FloatSlider(value=0.1, min=0.01, max=1.0, step=0.01, description='Gamma'),
    kernel=widgets.Dropdown(options=['rbf', 'linear', 'poly'], value='rbf', description='Kernel')
)
def plot_svm(C=1.0, gamma=0.1, kernel='rbf'):
    # Fit SVM
    svm = SVC(kernel=kernel, C=C, gamma=gamma, random_state=42)
    svm.fit(X_scaled, y)
    
    # Plot
    plt.figure(figsize=(10, 7))
    plt.scatter(X_scaled[y == 0, 0], X_scaled[y == 0, 1], c='red', label='Class 0', alpha=0.7)
    plt.scatter(X_scaled[y == 1, 0], X_scaled[y == 1, 1], c='blue', label='Class 1', alpha=0.7)
    
    # Decision boundary (grid)
    xx, yy = np.meshgrid(np.linspace(X_scaled[:, 0].min(), X_scaled[:, 0].max(), 100),
                         np.linspace(X_scaled[:, 1].min(), X_scaled[:, 1].max(), 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contour(xx, yy, Z, levels=[0], colors='k', linestyles='-', alpha=0.8)
    plt.contourf(xx, yy, Z, levels=50, alpha=0.3, cmap='RdBu')
    
    # Support vectors
    sv = svm.support_vectors_
    plt.scatter(sv[:, 0], sv[:, 1], s=100, facecolors='none', edgecolors='green', linewidth=2, label='Support Vectors')
    
    plt.xlabel('Feature 1 (scaled)')
    plt.ylabel('Feature 2 (scaled)')
    plt.title(f'SVM Decision Boundary (C={C}, gamma={gamma:.2f}, kernel={kernel})')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

## Applications in Basic Sciences

SVMs have found numerous applications in various scientific domains:

*   **Bioinformatics:** SVMs are widely used for tasks like protein classification, gene expression analysis, and cancer classification. Their ability to handle high-dimensional data makes them suitable for analyzing genomic and proteomic datasets.
*   **Image Classification:** In fields like medical imaging and satellite imagery analysis, SVMs can be used to classify images, for instance, to identify tumors in medical scans or to classify different types of land cover from satellite data. 
*   **Chemistry:** SVMs can be used in cheminformatics to predict the properties of molecules, such as their bioactivity or toxicity, based on their chemical structure.

### Biochemistry: Enzyme Substrate Classification
- **Features:** 3D molecular descriptors (e.g., hydrophobicity, size, charge).
- **Task:** Classify whether a molecule is a substrate of an enzyme (yes/no).
- **Why SVM?** High-dimensional features, few samples, non-linear relationships.

### Spectroscopy: Material Phase Classification
- **Features:** Raman or IR intensity at 1000+ wavenumbers.
- **Task:** Classify solid vs. liquid phase.
- **Why SVM?** The RBF kernel captures subtle spectral shifts.
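To make the spectroscopy scenario concrete, the sketch below simulates it with purely synthetic data. Everything about the data is an assumption made for illustration: the helper `simulate_spectra`, the wavenumber range, the peak positions and width, and the noise level do not come from any real measurement. It only shows how a scaled RBF SVM can be cross-validated when features far outnumber samples.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
wavenumbers = np.linspace(400, 1800, 1000)     # hypothetical axis, 1000 channels

def simulate_spectra(n, peak_center):
    """Gaussian peak plus noise; a synthetic stand-in for real spectra."""
    centers = peak_center + rng.normal(0, 3, size=(n, 1))
    peak = np.exp(-0.5 * ((wavenumbers - centers) / 15.0) ** 2)
    return peak + rng.normal(0, 0.05, size=(n, wavenumbers.size))

# Two hypothetical phases whose main peak differs by a small shift
X = np.vstack([simulate_spectra(30, 1000.0), simulate_spectra(30, 1020.0)])
y = np.array([0] * 30 + [1] * 30)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
scores = cross_val_score(model, X, y, cv=5)

print(f"{X.shape[0]} samples x {X.shape[1]} features")
print(f"5-fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

With real spectra, baseline correction and normalization would usually come before this step, and the accuracy obtained here says nothing about real data.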

:::{exercise}
**Dataset:** `load_breast_cancer` from scikit-learn (a classic biomedical dataset).

**Goal:** Classify tumors as malignant or benign using the 30 morphological features.
:::

In [None]:
# YOUR CODE HERE



## Final Exercises: Applying SVM to a Biological Dataset

For this final set of exercises, we will use the **Breast Cancer Wisconsin (Diagnostic) Dataset**, which is available through scikit-learn. The task is to predict whether a tumor is malignant or benign based on several features of the cell nuclei.

**Dataset Information:** The dataset contains 30 numeric predictive attributes and a class label (malignant or benign). In scikit-learn's encoding, 0 is malignant and 1 is benign.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# --- Your Code Here --- #
# 1. Train a linear SVM classifier


# 2. Evaluate the linear SVM model


# 3. Train an SVM classifier with an RBF kernel


# 4. Evaluate the RBF SVM model


# 5. Compare the performance of the two models

### Instructions:

1.  **Train a Linear SVM:** In the provided code cell, create and train an `svm.SVC` model with a linear kernel on the training data.
2.  **Evaluate the Linear SVM:** Use the trained linear model to make predictions on the test set. Then, print the confusion matrix and the classification report to evaluate its performance.
3.  **Train an RBF SVM:** Now, create and train another `svm.SVC` model, but this time use an RBF kernel.
4.  **Evaluate the RBF SVM:**  Similarly, evaluate the performance of the RBF kernel SVM on the test set by printing its confusion matrix and classification report.
5.  **Compare Models:** Which model performed better on this dataset? Why do you think that is the case? Consider the nature of the data and the strengths of each kernel.

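## EXTRA: scikit-learn Pipeline
A scikit-learn `Pipeline` chains preprocessing, feature selection, and the classifier into a single estimator. Combined with `GridSearchCV`, it lets you search for the best hyperparameters of every step using cross-validation. See <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html>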

In [None]:
from sklearn.metrics import accuracy_score
# Build full pipeline
def create_svm_pipeline():
    return Pipeline([
        ('scaler', StandardScaler()),
        ('feature_selection', SelectKBest(f_classif, k=10)),  # Select top 10 features
        ('svm', SVC(kernel='rbf', random_state=42))
    ])

# Use on breast cancer dataset
data = datasets.load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Create pipeline
pipe = create_svm_pipeline()

# Hyperparameter tuning
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
    'svm__kernel': ['rbf', 'poly'],
    'feature_selection__k': [5, 10, 15]
}

# Grid search
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=0)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_.round(3))

# Final prediction
y_pred = grid.predict(X_test)
print("\nTest Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
# In this dataset, label 0 is malignant and label 1 is benign
print(classification_report(y_test, y_pred, target_names=data.target_names))


### Comparing SVM and Logistic Regression

In [None]:
# Logistic Regression pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compare_models(X_train, X_test, y_train, y_test):
    models = {
        'SVM (RBF)': Pipeline([
            ('scaler', StandardScaler()),
            ('selector', SelectKBest(f_classif, k=10)),
            ('clf', SVC(kernel='rbf', C=10, gamma=0.01, probability=True))
        ]),
        'Logistic Regression': Pipeline([
            ('scaler', StandardScaler()),
            ('selector', SelectKBest(f_classif, k=10)),
            ('clf', LogisticRegression(max_iter=1000))
        ])
    }
    
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        results[name] = {'test_score': score, 'cv_mean': cv_scores.mean(), 'cv_std': cv_scores.std()}
        print(f"{name}: Test Accuracy = {score:.3f}, CV Accuracy = {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
    
    return results

# Compare on breast cancer
compare_models(X_train, X_test, y_train, y_test)


:::{exercise} Build Your Own Pipeline
Use `make_classification` to generate a dataset with 500 samples, 100 features, and non-linear structure (for example, several clusters per class). Then build and tune a pipeline like the one above.
:::

In [None]:
# YOUR CODE HERE



## EXTRA: Support Vector Machines (SVM) with Model Explainability (SHAP)


In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score
import shap
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


In [None]:
# "Scientific" data
# Simulate metabolomics data (n=100 samples, p=50 features)
np.random.seed(42)
n_samples = 100
n_features = 50
X_metab = np.random.normal(0, 1, (n_samples, n_features))

# Introduce signal: 10 features differ between classes
signal_indices = np.random.choice(n_features, 10, replace=False)
X_metab[:50, signal_indices] += 1.0  # Healthy (class 0)
X_metab[50:, signal_indices] += -0.8  # Diseased (class 1)

y_metab = np.hstack([np.zeros(50), np.ones(50)])

# Split
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_metab, y_metab, test_size=0.3, random_state=42, stratify=y_metab)

print("Metabolomics Dataset: Shape =", X_train_m.shape)


### Build SVM Pipeline with Hyperparameter Tuning

In [None]:
# Define pipeline
def create_svm_pipeline():
    return Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest(f_classif, k=10)),
        ('svm', SVC(kernel='rbf', probability=True, random_state=42))
    ])

# Tune
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.001, 0.01],
    'selector__k': [5, 10, 15]
}

# Grid search
grid = GridSearchCV(create_svm_pipeline(), param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=0)
grid.fit(X_train_m, y_train_m)

print("Best Parameters:", grid.best_params_)
print("CV Accuracy:", grid.best_score_.round(3))


### Explainability with SHAP
**Why SHAP for SVM?** An SVM is not inherently interpretable. SHAP (SHapley Additive exPlanations) provides global and local explanations by attributing each prediction to the input features, using Shapley values from cooperative game theory.
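The game-theoretic idea can be made concrete on a tiny example. The sketch below computes exact Shapley values for a hand-written toy function of three features by averaging each feature's marginal contribution over all orderings. The function `model`, the input `x`, and the all-zero `baseline` are hypothetical choices for illustration; this is not how the `shap` library is implemented, since `KernelExplainer` approximates these values by sampling.

```python
from itertools import permutations

def model(x):
    """Toy scoring function with an interaction between features 0 and 1."""
    return 2.0 * x[0] + x[0] * x[1] - 1.0 * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with every feature "absent"
        prev = f(current)
        for i in order:
            current[i] = x[i]             # switch feature i to its actual value
            new = f(current)
            phi[i] += new - prev          # marginal contribution of feature i
            prev = new
    return [p / len(orderings) for p in phi]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print("Shapley values:", phi)
print("sum of values :", sum(phi))
print("f(x) - f(base):", model(x) - model(baseline))
```

The attributions always add up to the difference between the prediction and the baseline prediction, and the interaction term is split equally between the two features involved. Exact enumeration grows factorially with the number of features, which is why sampling-based approximations are used in practice.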

#### Create SHAP Explainer

In [None]:
# Use the best model
best_model = grid.best_estimator_

# Background data for SHAP (the training split has 70 samples, so this keeps all of them)
X_train_shap = X_train_m[:100]

# KernelExplainer is model-agnostic, so it works with an SVM pipeline
# (TreeExplainer is faster but applies only to tree-based models)
explainer = shap.KernelExplainer(
    lambda x: best_model.predict_proba(x)[:, 1],  # Prob of class 1
    X_train_shap
)

# Compute SHAP values
shap_values = explainer.shap_values(X_test_m[:10], nsamples=100)

# Show shape
print(f"SHAP values shape: {shap_values.shape}")


### Visualize SHAP Results
#### Summary Plot (Global Feature Importance)


In [None]:
# Use first 10 test samples for summary
shap_values = explainer.shap_values(X_test_m[:10], nsamples=100)

plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X_test_m[:10], feature_names=[f'Feature_{i}' for i in range(50)], 
                 plot_type="bar", color='blue', show=False)
plt.title("SVM: Feature Importance (SHAP Bar Plot)", fontsize=14)
plt.xlabel("Mean |SHAP| Value")  # the bar plot is horizontal, so the values are on the x-axis
plt.tight_layout()
plt.show()


#### Force Plot (Local Explanation)

In [None]:
# Show prediction for first test sample
idx = 0
pred_prob = best_model.predict_proba(X_test_m[idx:idx+1])[0, 1]
print(f"Predicted probability (class 1): {pred_prob:.3f}")

# Force plot
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[idx], X_test_m[idx:idx+1], 
               feature_names=[f'Feature_{i}' for i in range(50)], show=False)

#### Dependence Plot (Feature Effect)

In [None]:
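# Pick the feature with the largest mean |SHAP| value instead of hard-coding an index
feature_names = [f'Feature_{i}' for i in range(50)]
top_idx = int(np.abs(shap_values).mean(axis=0).argmax())
top_name = feature_names[top_idx]

# Plot how the top feature affects the prediction
shap.dependence_plot(
    ind=top_idx,
    shap_values=shap_values,
    features=X_test_m[:10],
    feature_names=feature_names,
    show=False
)
plt.title(f"Feature Effect: {top_name} on Prediction", fontsize=14)
plt.xlabel(f"{top_name} Value")
plt.ylabel("SHAP Value")
plt.tight_layout()
plt.show()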


### Exercise: Explain a Prediction from Real Scientific Data
Use the `load_breast_cancer` dataset and explain the prediction for a malignant tumor using SHAP.

In [None]:
# Load real data
data = load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Re-train best model
grid = GridSearchCV(create_svm_pipeline(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

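# In this dataset, label 0 is malignant and label 1 is benign
malignant_idx = np.where(y_test == 0)[0][0]  # First malignant test sample
X_mal = X_test[malignant_idx:malignant_idx + 1]

# SHAP explainer for the predicted probability of the malignant class (column 0)
explainer = shap.KernelExplainer(
    lambda x: grid.best_estimator_.predict_proba(x)[:, 0],
    X_train[:100]  # Use 100 background samples for speed
)

shap_values = explainer.shap_values(X_mal, nsamples=100)

# shap.plots.waterfall expects a shap.Explanation object, not a raw array
explanation = shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_mal[0],
    feature_names=list(data.feature_names)
)

# Plot
shap.plots.waterfall(explanation, max_display=10, show=False)
plt.title("SHAP Waterfall Plot: Malignant Tumor Prediction", fontsize=14)
plt.show()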


In [None]:
help(shap.plots.waterfall)