
# Module 6 — SVM Classification and Kernel Comparison (Health Dataset)
**Student:** _[Your Name]_  
**Course:** _[Your Course / Term]_  
**Date:** 2025-11-04

---

## Overview
This notebook implements a complete **SVM classification** workflow on a binary **health dataset** (sklearn's *Breast Cancer Wisconsin*). It compares **four kernels** (linear, RBF, polynomial, sigmoid) under **k-fold cross-validation** with **GridSearchCV**, using **ROC-AUC** as the primary selection metric (tie-breakers: **F1**, then **Accuracy**).

We include:
- **Stratified** train–test split (fixed random state) and **feature scaling**.
- **Model selection** with compact grids to avoid overengineering.
- **Two ensemble baselines** (RandomForest, GradientBoosting).
- **Confusion matrices** and a **bar chart** comparing Accuracy/F1/ROC-AUC.
- A **PCA(2D) kernel decision-surface visualization** to contrast margin geometry.
- An automatic **Kernel Selection Note (Recommendation)** (≤ 6 lines).
- A one-page PDF report exported to `artifacts/Report.pdf`.


In [None]:

# Imports and configuration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.decomposition import PCA
from sklearn.utils.validation import check_is_fitted

from IPython.display import Markdown, display

import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)



## Data Loading and Preprocessing
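We use the **Breast Cancer Wisconsin** dataset (binary target; in sklearn's encoding, 0 = malignant and 1 = benign). Features are standardized inside each pipeline, so the scaler is fit on training data only, and a **stratified** split preserves class prevalence. Scaling matters for SVMs because kernel values and margin geometry depend on feature magnitudes.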
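
To make the scaling point concrete, here is a minimal pure-Python sketch (not the notebook's pipeline). The toy rows and the helper names `fit_standardizer`, `standardize`, and `euclidean` are assumptions made up for illustration. It learns the mean and population standard deviation from the training rows only, reuses them for a test row, and shows how an unscaled large-magnitude feature dominates a distance.

```python
import math
import statistics

# Hypothetical toy data: column 0 is in the thousands, column 1 is below one.
train = [[1000.0, 0.10], [1200.0, 0.90], [1100.0, 0.50], [900.0, 0.30]]
test_point = [1050.0, 0.80]

def fit_standardizer(rows):
    """Per-column mean and population std, learned from training rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return means, stds

def standardize(row, means, stds):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

means, stds = fit_standardizer(train)
train_z = [standardize(r, means, stds) for r in train]
test_z = standardize(test_point, means, stds)  # reuses training statistics

# The raw distance is driven almost entirely by column 0;
# after scaling, both columns contribute on a comparable footing.
raw_d = euclidean(train[0], train[1])
scaled_d = euclidean(train_z[0], train_z[1])
print(f"raw distance: {raw_d:.2f}, scaled distance: {scaled_d:.2f}")
```

Because the test row is transformed with the training mean and standard deviation, no information from the test set leaks into preprocessing.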


In [None]:

# Load dataset
data = load_breast_cancer(as_frame=True)
X = data.data
y = pd.Series(data.target, name='target')

# Basic prevalence insight (sklearn encodes 0 = malignant, 1 = benign)
class_counts = y.value_counts().rename({0: 'malignant (0)', 1: 'benign (1)'})
prevalence = (class_counts / len(y)).round(3)
print('Class counts:\n', class_counts.to_string())
print('\nPrevalence:\n', prevalence.to_string())

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=RANDOM_STATE
)

# We'll use StandardScaler inside pipelines for model training.
print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")



## Helpers
We define:  
1) **evaluate_model**: compute ROC-AUC, F1, Accuracy on the held-out test set.  
2) **plot_decision_surface_2d**: fit on PCA2D of the training set and visualize decision regions.
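
ROC-AUC depends only on how the samples are ranked, which is why raw decision scores can stand in for probabilities. The sketch below illustrates this with a pairwise-comparison AUC written in plain Python. The function name `roc_auc_from_scores` and the toy labels and scores are assumptions for illustration, not part of the notebook.

```python
import math

def roc_auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and signed decision scores (illustrative values only)
y_true = [0, 0, 1, 1, 0, 1]
decision_scores = [-2.1, -0.3, 0.4, 1.7, 0.6, -0.1]

# A strictly increasing transform (logistic squashing) preserves the ranking
squashed = [1.0 / (1.0 + math.exp(-s)) for s in decision_scores]

auc_raw = roc_auc_from_scores(y_true, decision_scores)
auc_squashed = roc_auc_from_scores(y_true, squashed)
print(auc_raw, auc_squashed)
```

Both calls return the same value, since the squashing changes the scale of the scores but not their order.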


In [None]:

def evaluate_model(model, X_test, y_test):
    """Return ROC-AUC, F1, Accuracy for the given fitted model and test set."""
    y_pred = model.predict(X_test)
    # ROC-AUC only needs a ranking of the samples, so raw decision_function
    # scores work directly (SVC with probability=False exposes no predict_proba).
    if hasattr(model, 'decision_function'):
        scores = model.decision_function(X_test)
        roc = roc_auc_score(y_test, scores)
    elif hasattr(model, 'predict_proba'):
        # Use the probability of the positive class (label 1)
        roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    else:
        # Fallback: hard labels give a single-threshold ROC curve (coarser AUC)
        roc = roc_auc_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    return roc, f1, acc


def plot_decision_surface_2d(clf, X_pca_train, y_train, X_pca_test, y_test, title=''):
    """
    Fit 'clf' on 2D PCA training data, then draw decision regions
    and overlay train/test points. Axes labeled as PCA1/PCA2.
    """
    # Fit
    clf.fit(X_pca_train, y_train)

    # Meshgrid
    x_min, x_max = X_pca_train[:, 0].min() - 1.0, X_pca_train[:, 0].max() + 1.0
    y_min, y_max = X_pca_train[:, 1].min() - 1.0, X_pca_train[:, 1].max() + 1.0
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    grid = np.c_[xx.ravel(), yy.ravel()]

    # Decision values
    if hasattr(clf, 'decision_function'):
        Z = clf.decision_function(grid)
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, levels=20, alpha=0.2)
        # Zero contour as boundary
        plt.contour(xx, yy, Z, levels=[0], linewidths=1)
    else:
        Z = clf.predict(grid).reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha=0.2)

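    # Scatter points, colored by class label (marker shape distinguishes train/test)
    plt.scatter(X_pca_train[:, 0], X_pca_train[:, 1], c=y_train, cmap='coolwarm', s=15, marker='o', label='Train')
    plt.scatter(X_pca_test[:, 0],  X_pca_test[:, 1],  c=y_test,  cmap='coolwarm', s=15, marker='^', label='Test')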

    plt.xlabel('PCA1')
    plt.ylabel('PCA2')
    plt.title(title)
    plt.legend(loc='best')



## Modeling Setup (SVM Kernels + Compact Grids)
**Primary metric:** ROC-AUC. Tie-breakers: F1, then Accuracy.  
Note that **epsilon (ε)** is an SVR (regression) hyperparameter and does not apply here; for **classification (SVC)** we tune `C`, plus `gamma` for the RBF, polynomial, and sigmoid kernels and `degree` for the polynomial kernel.
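
To show where `gamma` and `degree` enter, the sketch below writes out the four kernel functions in plain Python, following the kernel definitions given in the scikit-learn SVM documentation (with `coef0` at its default of 0.0). The helper names and the two toy points are assumptions for illustration. `C` does not appear in any kernel: it is the penalty on margin violations in the optimization problem.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    # K(a, b) = <a, b>
    return dot(a, b)

def rbf_kernel(a, b, gamma):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def poly_kernel(a, b, gamma, degree, coef0=0.0):
    # K(a, b) = (gamma * <a, b> + coef0) ** degree
    return (gamma * dot(a, b) + coef0) ** degree

def sigmoid_kernel(a, b, gamma, coef0=0.0):
    # K(a, b) = tanh(gamma * <a, b> + coef0)
    return math.tanh(gamma * dot(a, b) + coef0)

# Hypothetical standardized points (illustrative values only)
u = [1.0, -0.5]
v = [0.5, 2.0]

print("linear :", linear_kernel(u, v))
print("rbf    :", rbf_kernel(u, v, gamma=0.1))
print("poly   :", poly_kernel(u, v, gamma=0.1, degree=2))
print("sigmoid:", sigmoid_kernel(u, v, gamma=0.1))

# A larger gamma makes the RBF similarity decay faster with distance
assert rbf_kernel(u, v, gamma=1.0) < rbf_kernel(u, v, gamma=0.1)
```

In the RBF case a larger `gamma` shrinks each training point's region of influence, which is why `gamma` and `C` are usually tuned together.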


In [None]:

# Common CV and scoring
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = {'roc': 'roc_auc', 'f1': 'f1', 'acc': 'accuracy'}

# Pipelines per kernel
pipe_linear = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='linear', random_state=RANDOM_STATE))])
pipe_rbf    = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf',    random_state=RANDOM_STATE))])
pipe_poly   = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='poly',   random_state=RANDOM_STATE))])
pipe_sig    = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='sigmoid',random_state=RANDOM_STATE))])

# Compact grids
param_linear = {'svc__C': [0.1, 1, 10]}
param_rbf    = {'svc__C': [0.1, 1, 10],
                'svc__gamma': ['scale', 0.1, 0.01]}
param_poly   = {'svc__C': [0.1, 1, 10],
                'svc__gamma': ['scale', 0.1, 0.01],
                'svc__degree': [2, 3]}
param_sig    = {'svc__C': [0.1, 1, 10],
                'svc__gamma': ['scale', 0.1, 0.01]}

grids = [
    ('SVM', 'linear', GridSearchCV(pipe_linear, param_linear, cv=cv, scoring=scoring, refit='roc', n_jobs=-1)),
    ('SVM', 'rbf',    GridSearchCV(pipe_rbf,    param_rbf,    cv=cv, scoring=scoring, refit='roc', n_jobs=-1)),
    ('SVM', 'poly',   GridSearchCV(pipe_poly,   param_poly,   cv=cv, scoring=scoring, refit='roc', n_jobs=-1)),
    ('SVM', 'sigmoid',GridSearchCV(pipe_sig,    param_sig,    cv=cv, scoring=scoring, refit='roc', n_jobs=-1)),
]



## Model Fitting (SVMs) and Cross-Validated Selection
We run `GridSearchCV` for each kernel and retain the best estimator (by ROC-AUC). We then evaluate on the held-out test set.
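
As a rough cost check, the sketch below enumerates the candidate settings in the compact grids above and counts model fits under 5-fold cross-validation. It uses only `itertools`; the helper `expand_grid` and the dictionary `kernel_grids` are illustrative names, not part of scikit-learn.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every parameter combination as a list of dicts."""
    keys = list(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]

# Same values as the compact grids defined in this notebook
kernel_grids = {
    'linear':  {'svc__C': [0.1, 1, 10]},
    'rbf':     {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1, 0.01]},
    'poly':    {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1, 0.01],
                'svc__degree': [2, 3]},
    'sigmoid': {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1, 0.01]},
}
n_splits = 5

candidate_counts = {k: len(expand_grid(g)) for k, g in kernel_grids.items()}
cv_fits = {k: n * n_splits for k, n in candidate_counts.items()}

for kernel in kernel_grids:
    print(f"{kernel:8s} candidates={candidate_counts[kernel]:2d} "
          f"cv_fits={cv_fits[kernel]:3d}")
print("total CV fits:", sum(cv_fits.values()))
```

With refitting enabled, each search also trains the winning configuration once more on the full training set, so the total is the CV fit count plus one fit per kernel.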


In [None]:

results = []
best_estimators = {}

for model_name, kernel, grid in grids:
    grid.fit(X_train, y_train)
    best_estimators[kernel] = grid.best_estimator_
    roc, f1, acc = evaluate_model(grid.best_estimator_, X_test, y_test)
    results.append([model_name, kernel, roc, f1, acc])
    print(f"{kernel.upper()} best params: {grid.best_params_}")


# Two ensemble baselines with small grids
# (tree models are insensitive to feature scaling; the scaler is kept only for pipeline symmetry)
rf = Pipeline([('scaler', StandardScaler()), ('rf', RandomForestClassifier(random_state=RANDOM_STATE))])
gb = Pipeline([('scaler', StandardScaler()), ('gb', GradientBoostingClassifier(random_state=RANDOM_STATE))])

rf_grid = {'rf__n_estimators': [200, 400], 'rf__max_depth': [None, 5, 10]}
gb_grid = {'gb__n_estimators': [100, 200], 'gb__learning_rate': [0.05, 0.1]}

rf_cv = GridSearchCV(rf, rf_grid, cv=cv, scoring=scoring, refit='roc', n_jobs=-1)
gb_cv = GridSearchCV(gb, gb_grid, cv=cv, scoring=scoring, refit='roc', n_jobs=-1)

rf_cv.fit(X_train, y_train)
gb_cv.fit(X_train, y_train)

rf_roc, rf_f1, rf_acc = evaluate_model(rf_cv.best_estimator_, X_test, y_test)
gb_roc, gb_f1, gb_acc = evaluate_model(gb_cv.best_estimator_, X_test, y_test)

results.append(['RandomForest', None, rf_roc, rf_f1, rf_acc])
results.append(['GradientBoosting', None, gb_roc, gb_f1, gb_acc])

print("\nRF best params:", rf_cv.best_params_)
print("GB best params:", gb_cv.best_params_)

metrics_df = pd.DataFrame(results, columns=['model','kernel','ROC_AUC','F1','Accuracy'])
metrics_df = metrics_df.sort_values(by=['ROC_AUC','F1','Accuracy'], ascending=False).reset_index(drop=True)
metrics_df_rounded = metrics_df.copy()
for c in ['ROC_AUC','F1','Accuracy']:
    metrics_df_rounded[c] = metrics_df_rounded[c].round(3)
metrics_df_rounded



## Confusion Matrices (Best SVM vs Ensembles)
We visualize confusion matrices for the top SVM and the two ensemble baselines on the test set.
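
The plots use `normalize='true'`, which divides each row (true class) by its total so that the cells read as per-class rates. The plain-Python sketch below shows that computation on made-up labels. The helper names and the toy predictions are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Rows index the true class, columns index the predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its sum (rows with no samples stay at zero)."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out

# Hypothetical labels and predictions (illustrative values only)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
rates = normalize_rows(counts)
print("counts:", counts)
print("row-normalized:", rates)
```

After row normalization the diagonal holds the recall of each class, which makes classes of different sizes directly comparable.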


In [None]:

# Identify top SVM by metric hierarchy (ROC->F1->ACC) among SVM rows
svm_only = metrics_df[metrics_df['model'] == 'SVM'].sort_values(by=['ROC_AUC','F1','Accuracy'], ascending=False)
top_kernel = svm_only.iloc[0]['kernel']
top_svm = best_estimators[top_kernel]

fig = plt.figure(figsize=(12, 3.5))

ax1 = plt.subplot(1, 3, 1)
ConfusionMatrixDisplay.from_estimator(top_svm, X_test, y_test, normalize='true', ax=ax1)
ax1.set_title(f"Top SVM ({top_kernel})")


ax2 = plt.subplot(1, 3, 2)
ConfusionMatrixDisplay.from_estimator(rf_cv.best_estimator_, X_test, y_test, normalize='true', ax=ax2)
ax2.set_title("RandomForest")


ax3 = plt.subplot(1, 3, 3)
ConfusionMatrixDisplay.from_estimator(gb_cv.best_estimator_, X_test, y_test, normalize='true', ax=ax3)
ax3.set_title("GradientBoosting")

plt.tight_layout()
plt.show()



## Metric Comparison (Bar Chart)
A compact bar chart contrasting **Accuracy**, **F1**, and **ROC-AUC** across all tuned models.


In [None]:

fig = plt.figure(figsize=(8, 4))
x = np.arange(len(metrics_df_rounded))
width = 0.22

plt.bar(x - width, metrics_df_rounded['Accuracy'], width, label='Accuracy')
plt.bar(x,          metrics_df_rounded['F1'],       width, label='F1')
plt.bar(x + width,  metrics_df_rounded['ROC_AUC'],  width, label='ROC-AUC')

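# Label SVM rows as "SVM/<kernel>"; baselines have no kernel (None/NaN) and keep the bare model name
tick_labels = [f"{m}/{k}" if pd.notna(k) else m
               for m, k in zip(metrics_df_rounded['model'], metrics_df_rounded['kernel'])]
plt.xticks(x, tick_labels, rotation=45, ha='right')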
plt.ylabel('Score')
plt.title('Model Metrics (Test Set)')
plt.legend()
plt.tight_layout()
plt.show()



## Kernel Difference — Visualization (PCA to 2D)
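We project **scaled** features to **2D via PCA** (fit on training only) and train each SVM **on the 2D projection**, reusing the hyperparameters tuned on the full feature set. Plots below show **decision regions**, decision-function contours, and the zero-level decision boundary.  
**Note:** This 2D view is for interpretability only; it does **not** represent the full feature space used for evaluation, and hyperparameters tuned on all 30 features are not necessarily well suited to the 2D projection.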
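
The phrase "fit on training only" can be illustrated with a small NumPy sketch of PCA via SVD. This is an assumption-laden toy (random data and the illustrative names `toy_train`, `toy_test`), not the notebook's `sklearn.decomposition.PCA` call: the mean and principal directions are learned from the training rows, and the test rows are projected with those same quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical, already-standardized data: 40 training and 10 test rows, 5 features
toy_train = rng.normal(size=(40, 5))
toy_test = rng.normal(size=(10, 5))

# "Fit": learn the mean and the top-2 principal directions from training rows only
train_mean = toy_train.mean(axis=0)
_, _, vt = np.linalg.svd(toy_train - train_mean, full_matrices=False)
components = vt[:2]  # shape (2, 5); rows are orthonormal directions

# "Transform": both splits reuse the training mean and components
Z_train = (toy_train - train_mean) @ components.T
Z_test = (toy_test - train_mean) @ components.T

print(Z_train.shape, Z_test.shape)
```

The projected test points are centered with the training mean, so their own mean is generally not zero; that is expected and mirrors how unseen data would be handled.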


In [None]:

# Scale outside of pipeline for visualization-only PCA fit
scaler_vis = StandardScaler().fit(X_train)
X_train_scaled = scaler_vis.transform(X_train)
X_test_scaled  = scaler_vis.transform(X_test)

pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_pca_train = pca.fit_transform(X_train_scaled)
X_pca_test  = pca.transform(X_test_scaled)

# Build per-kernel SVM using best params but trained on PCA(2D)
svm_2d_specs = []
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    # Extract key params from the tuned best estimator
    svc = best_estimators[kernel].named_steps['svc']
    params = dict(kernel=svc.kernel, C=svc.C, gamma=svc.gamma, degree=svc.degree)
    # Create a fresh SVC with these params (no scaling here, data already PCA-projected)
    clf_2d = SVC(**params, random_state=RANDOM_STATE)
    svm_2d_specs.append((kernel, clf_2d, params))

# Plot 2x2 grid
fig = plt.figure(figsize=(10, 8))
for i, (kernel, clf, params) in enumerate(svm_2d_specs, start=1):
    ax = plt.subplot(2, 2, i)
    plt.sca(ax)
    title = f"{kernel.capitalize()} — C={params['C']}, gamma={params.get('gamma','-')}, deg={params.get('degree','-')}"
    plot_decision_surface_2d(clf, X_pca_train, y_train, X_pca_test, y_test, title=title)

plt.tight_layout()
plt.show()

display(Markdown("**Caption:** In this 2D projection, the linear kernel yields a straight-line boundary; RBF offers flexible local decision boundaries; polynomial encodes global curvature (degree-dependent); sigmoid resembles a squashed linear separator and can underperform without careful scaling and hyperparameter choices."))



## Kernel Selection Note (Recommendation)
The short recommendation below is auto-generated from the **tuned** models and **test** metrics, prioritizing **ROC-AUC**, then **F1**, then **Accuracy**.
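
The metric hierarchy amounts to a lexicographic sort: compare ROC-AUC first, fall back to F1 on a tie, then to Accuracy. A minimal sketch with tuple keys follows. The metric values are made up for illustration and are not results from this notebook, and `hierarchy_key` is an illustrative name.

```python
# Hypothetical test metrics (illustrative values, not results from this notebook)
rows = [
    {'kernel': 'linear', 'roc_auc': 0.991, 'f1': 0.970, 'accuracy': 0.965},
    {'kernel': 'rbf',    'roc_auc': 0.991, 'f1': 0.975, 'accuracy': 0.960},
    {'kernel': 'poly',   'roc_auc': 0.985, 'f1': 0.980, 'accuracy': 0.970},
]

def hierarchy_key(row):
    """Tuples compare element by element: ROC-AUC, then F1, then Accuracy."""
    return (row['roc_auc'], row['f1'], row['accuracy'])

ranked = sorted(rows, key=hierarchy_key, reverse=True)
best = ranked[0]['kernel']
print([r['kernel'] for r in ranked], "->", best)
```

Here `linear` and `rbf` tie on ROC-AUC, so F1 decides in favor of `rbf`; `poly` ranks last despite the highest F1 because the primary metric is compared first. Ranking on rounded values can create ties that do not exist in the unrounded metrics.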


In [None]:

# Auto-generate recommendation Markdown
# Rank on unrounded metrics (rounding can create artificial ties); round only for display
svm_sorted = (metrics_df[metrics_df['model'] == 'SVM']
              .sort_values(by=['ROC_AUC', 'F1', 'Accuracy'], ascending=False)
              .reset_index(drop=True))
best_row = svm_sorted.iloc[0]

rec_lines = [
    f"**Recommended kernel:** **{best_row['kernel'].upper()}**.",
    f"It achieved the top ROC-AUC ({best_row['ROC_AUC']:.3f}) among the SVM kernels, with F1 {best_row['F1']:.3f} and Accuracy {best_row['Accuracy']:.3f} on the held-out test set.",
    "Hyperparameters were tuned by 5-fold cross-validated ROC-AUC; ties are broken by F1, then Accuracy.",
    "Differences between kernels may be small on this dataset, so compare the metrics table before ruling out simpler alternatives.",
    "The PCA(2D) decision surfaces are a qualitative aid only and are not evidence of generalization in the full feature space.",
]
display(Markdown('### Kernel Selection Note (Recommendation)\n' + '\n'.join(rec_lines)))



## Export: One-Page PDF Report
We generate a concise PDF (`artifacts/Report.pdf`) summarizing preprocessing, CV setup, best parameters, metrics, the visualization insight, and the **recommended kernel**.
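
One detail matters when wrapping a multi-line summary to a fixed width: `textwrap.fill` treats newlines as ordinary whitespace, so calling it on the whole string merges headings and bullets into one paragraph. The sketch below, using a made-up `report` string, contrasts that with wrapping each line separately.

```python
import textwrap

# Hypothetical multi-line summary (illustrative text only)
report = "Title line\n\nSection:\n- " + "a long bullet " * 10 + "\n- short bullet"

# Whole-string wrapping: line breaks are replaced by spaces before wrapping
collapsed = textwrap.fill(report, width=40)

# Per-line wrapping: each original line is wrapped on its own,
# so blank lines and bullet boundaries survive
per_line = "\n".join(textwrap.fill(line, width=40) for line in report.splitlines())

print(collapsed)
print("-" * 40)
print(per_line)
```

Per-line wrapping keeps the report's section layout intact while still limiting every output line to the chosen width.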


In [None]:

import os
os.makedirs('artifacts', exist_ok=True)

# Compose a compact textual summary for the PDF
best_params_text = []
for k in ['linear','rbf','poly','sigmoid']:
    # Derive best params from fitted best_estimator_
    be = best_estimators[k]
    svc = be.named_steps['svc']
    best_params_text.append(f"{k}: C={svc.C}, gamma={getattr(svc, 'gamma', 'n/a')}, degree={getattr(svc, 'degree', 'n/a')}")

summary_text = f"""Module 6 — SVM Classification (Health Dataset)

Dataset & Preprocessing:
- Breast Cancer Wisconsin (binary). Standardized features. Stratified 70/30 split (random_state={RANDOM_STATE}).

Cross-Validation & Grids:
- 5-fold StratifiedKFold with GridSearchCV; primary metric ROC-AUC (refit='roc').
- Compact grids over C, gamma, degree (poly). (Note: epsilon pertains to SVR, not SVC.)

Best SVM Hyperparameters:
{chr(10).join('- ' + line for line in best_params_text)}

Test Metrics (Top Rows):
{metrics_df_rounded.head(6).to_string(index=False)}

Visualization Insight:
- PCA(2D) decision-surface plots show linear vs. non-linear margin geometry; RBF flexibly captures local structure, polynomial adds global curvature, sigmoid resembles squashed linear.

Recommendation:
- See “Kernel Selection Note (Recommendation)” cell in the notebook (ROC-AUC primary, F1/Accuracy tie-breakers).
"""

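# Render to a single-page PDF using matplotlib
import textwrap

fig = plt.figure(figsize=(8.27, 11.69))  # A4 portrait in inches
plt.axis('off')
# Wrap each line separately: textwrap.fill on the whole string would collapse the line breaks
wrapped = '\n'.join(textwrap.fill(line, width=95) for line in summary_text.splitlines())
# Small monospace font keeps the metrics table aligned and within the page width
plt.text(0.05, 0.98, wrapped, va='top', ha='left', family='monospace', fontsize=8)
plt.tight_layout()
plt.savefig('artifacts/Report.pdf')
plt.close(fig)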

print("Saved Report to artifacts/Report.pdf")



---

### Notes
- **Metric hierarchy**: ROC-AUC (primary) → F1 → Accuracy.  
- **PCA plots**: 2D projection for interpretability — not the full feature space used in evaluation.  
- **Grids kept small** to ensure fast runtime and avoid overengineering.
