# Support Vector Machines (SVM)

_Welcome back! Today we’ll master Support Vector Machines — powerful, margin‑based classifiers that shine in high‑dimensional spaces._

- **Linear SVM** — maximum‑margin hyperplanes
- **Soft margins & hinge loss** — robust to overlap and noise
- **Dual form & kernels** — non‑linear decision boundaries via the kernel trick
- **Practical tuning** — scaling, `C`, `gamma`, and when to use `LinearSVC` vs `SVC`.

---

## What you’ll learn

- The margin idea and why “wider is better.”
- Hard-margin vs. soft-margin SVMs.
- The primal (hinge-loss) view, the dual form, and kernels.
- How to tune `C`, `gamma`, and choose a kernel.
- Minimal scikit-learn code for linear and kernel SVMs.

---

## 1) The SVM idea — separate with the **widest margin**

Think of a line (or plane) that splits the classes. Among all lines that correctly split them, prefer the one that leaves the **largest gap** to the nearest points of either class. That gap is the **margin**. A larger margin usually means a simpler, more robust boundary that generalizes better.

Given labeled points $(x_i, y_i)$ with $y_i \in \{-1,+1\}$, SVM finds a hyperplane

$$
f(x) = w^\top x + b
$$

that separates the classes while **maximizing the margin**:

- **Geometric margin** of a point: $\displaystyle \gamma_i = \frac{y_i (w^\top x_i + b)}{\lVert w\rVert}$
- **Margin width** between class boundaries: $\displaystyle \frac{2}{\lVert w\rVert}$  
  → Max margin $\Longleftrightarrow$ minimize $\lVert w\rVert$ (subject to correct classification).
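To make the two margin formulas concrete, here is a minimal NumPy sketch. The hyperplane `w`, `b` and the three points are hypothetical values picked so that the closest points satisfy $y_i(w^\top x_i + b) = 1$, which is the scaling the width formula $2/\lVert w\rVert$ assumes.

```python
import numpy as np

# Hypothetical hyperplane and points, chosen only for illustration.
w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0
X = np.array([[2.0, 0.0],         # on the +1 margin line
              [0.0, 1.0],         # on the -1 margin line
              [3.0, 4.0]])        # far on the positive side
y = np.array([1, -1, 1])

functional = y * (X @ w + b)                 # y_i (w^T x_i + b)
geometric = functional / np.linalg.norm(w)   # distance to the hyperplane
width = 2.0 / np.linalg.norm(w)              # gap between the two margin lines

print("functional margins:", functional)
print("geometric margins: ", geometric)
print("margin width:      ", width)
```

With this scaling the width is exactly twice the smallest geometric margin, which is why shrinking $\lVert w\rVert$ widens the gap.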

There are two types of SVMs, one per task:

- **Support Vector Classification** (SVC): for classification tasks. In scikit-learn, `SVC` is the kernel SVM classifier, while `LinearSVC` is the fast linear-only solver.
- **Support Vector Regression** (SVR): for regression tasks.

---

## 2) Hard‑margin SVM (separable case)

Here, every point must be on the correct side with **room to spare**: its functional margin $y_i(w^\top x_i + b)$ must be at least 1, which fixes the scale of $w$ and $b$. Under that scaling the margin width is $2/\lVert w\rVert$, so minimizing $\tfrac12\lVert w\rVert^2$ is equivalent to **maximizing the margin**. This version only works when data are perfectly separable; a single mislabeled or noisy point can make the problem infeasible.

**Optimization:**

$$
\begin{aligned}
\min_{w,b}\quad & \tfrac12 \lVert w\rVert^2 \\
\text{s.t.}\quad & y_i (w^\top x_i + b) \ge 1,\quad i=1,\dots,n
\end{aligned}
$$

- Constraints enforce that every point sits **outside** the margin band.
- Works only when data are perfectly separable.
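scikit-learn has no dedicated hard-margin solver, but a common workaround is to make violations prohibitively expensive with a very large `C`. The sketch below does that on a tiny, hypothetical separable set and checks that every point ends up on or outside the margin band, up to solver tolerance. Treat it as an approximation of the hard-margin problem, not an exact solve.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable toy set (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Assumption: C=1e6 is "large enough" to behave like a hard margin here.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

functional = y * (X @ w + b)             # should all be >= 1 (approximately)
width = 2.0 / np.linalg.norm(w)

print("min y*f(x):  ", functional.min())
print("margin width:", width)
print("support vectors:\n", clf.support_vectors_)
```

For this layout the closest approach between the two classes is the distance from $(3,3)$ to the segment joining $(0,1)$ and $(1,0)$, so the reported width can be checked by hand.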

---

## 3) Soft‑margin SVM (realistic case)

Real data usually overlap.

We introduce **slack** variables $\xi_i \ge 0$ that measure how much each point breaks the margin rule:

- $\xi_i=0$ means on or outside the margin, on the correct side
- $0<\xi_i\le 1$ means inside the margin but not on the wrong side ($\xi_i=1$ is exactly on the decision boundary)
- $\xi_i>1$ means misclassified.

$$
\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac12 \lVert w\rVert^2 + C \sum_{i=1}^n \xi_i \\
\text{s.t.}\quad & y_i (w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0
\end{aligned}
$$

- The constant **C>0** balances two desires: keep the margin wide (small $\lVert w\rVert$), yet don’t allow too many/too large violations (small $\sum\xi_i$):
  - large $C$ → penalize violations heavily (lower bias, higher variance, risking overfit)
  - small $C$ → wider margin, more violations allowed (higher bias, lower variance, smoother, possibly underfit)
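The trade-off controlled by `C` can be watched directly. The sketch below fits a linear `SVC` on overlapping blobs (a hypothetical toy set) for three values of `C`, then recovers each slack as $\xi_i = \max(0, 1 - y_i f(x_i))$ from the fitted decision function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian blobs; labels mapped from {0, 1} to {-1, +1}.
X, y01 = make_blobs(n_samples=200, centers=[[-1.5, -1.5], [1.5, 1.5]],
                    cluster_std=1.5, random_state=0)
y = 2 * y01 - 1

widths = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))  # slack per point
    widths[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6} width={widths[C]:.2f}  sum(xi)={xi.sum():.1f}  "
          f"violations={(xi > 0).sum()}  misclassified={(xi > 1).sum()}")
```

Smaller `C` should give a wider band with more points inside it; larger `C` narrows the band to cut down the total slack.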

**Hinge‑loss view (equivalent):**

$$
\min_{w,b}\quad \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{n}\sum_{i=1}^n \max\!\big(0, 1 - y_i(w^\top x_i + b)\big),
$$

with $\lambda$ inversely related to $C$: dividing the soft-margin objective by $nC$ gives $\lambda = 1/(nC)$, though libraries differ in how they scale the two terms.
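The link between the two objectives is a rescaling, which a few lines of NumPy can confirm. This sketch assumes each slack takes its smallest feasible value, $\xi_i = \max(0, 1 - y_i(w^\top x_i + b))$, and evaluates both objectives at an arbitrary random $(w, b)$. The function names are illustrative, not part of any library.

```python
import numpy as np

def hinge(w, b, X, y):
    """Per-sample hinge loss max(0, 1 - y_i (w^T x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i set to the hinge loss."""
    return 0.5 * (w @ w) + C * hinge(w, b, X, y).sum()

def hinge_objective(w, b, X, y, lam):
    """(lam/2)||w||^2 + mean hinge loss."""
    return 0.5 * lam * (w @ w) + hinge(w, b, X, y).mean()

rng = np.random.default_rng(0)
n, d, C = 50, 3, 2.0
X = rng.normal(size=(n, d))
y = rng.choice([-1, 1], size=n)
w, b = rng.normal(size=d), 0.3

lhs = soft_margin_objective(w, b, X, y, C) / (n * C)
rhs = hinge_objective(w, b, X, y, lam=1.0 / (n * C))
print(lhs, rhs, np.isclose(lhs, rhs))
```

Because the two objectives differ only by the positive factor $nC$, they share the same minimizer.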

---

## 4) Support vectors & the decision function

Only points that lie **on the margin, inside it, or on the wrong side of it** influence the final classifier; these are the **support vectors**. Points on the correct side and strictly outside the margin band have zero hinge loss and do not change $w,b$: removing them and retraining gives the same solution. This is why SVM solutions are often sparse: the model depends on a subset of the training data.

Prediction:

$$
\hat y = \mathrm{sign}(w^\top x + b).
$$
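One way to check the support-vector story on a fitted model is to compare $y_i f(x_i)$ for support vectors and for everything else. The sketch below uses a hypothetical blob data set and the `support_` attribute of `SVC`. The split at 1 holds only up to the solver's tolerance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=200, centers=[[-2.0, -2.0], [2.0, 2.0]],
                    cluster_std=1.5, random_state=1)
y = 2 * y01 - 1
clf = SVC(kernel="linear", C=1.0).fit(X, y)

scores = clf.decision_function(X)        # f(x) = w^T x + b
margins = y * scores

is_sv = np.zeros(len(X), dtype=bool)
is_sv[clf.support_] = True               # indices of the support vectors

print("support vectors:", is_sv.sum(), "of", len(X))
print("max y*f(x) among SVs:    ", margins[is_sv].max())    # about 1 or below
print("min y*f(x) among non-SVs:", margins[~is_sv].min())   # about 1 or above

# Prediction is just the sign of the decision function.
pred = np.where(scores > 0, 1, -1)
print("matches clf.predict:", np.array_equal(pred, clf.predict(X)))
```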

---

## 5) Dual problem & the kernel trick (non‑linear SVM)

The dual re-expresses the problem in terms of **similarities between pairs of points** via a kernel $K(x_i,x_j)$. Replacing dot-products with kernels lets the classifier act as if data were mapped into a higher-dimensional space **without computing that mapping explicitly** (the “kernel trick”). The prediction becomes a weighted sum over support vectors: only those with $\alpha_i>0$ matter.

The Lagrange dual of the soft‑margin problem (for kernel $K$) is:

$$
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^n \alpha_i - \frac12 \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j\, K(x_i,x_j) \\
\text{s.t.}\quad & 0 \le \alpha_i \le C,\quad \sum_{i=1}^n \alpha_i y_i = 0
\end{aligned}
$$

The decision function becomes:

$$
f(x) = \sum_{i=1}^n \alpha_i y_i\, K(x_i, x) + b.
$$
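The dual decision function can be rebuilt by hand from a fitted `SVC`. In scikit-learn, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors only, so the sum runs over `support_vectors_`. The data set and the values of `C` and `gamma` below are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
C, gamma = 1.0, 0.5
clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)

# K[i, j] = K(sv_i, x_j), shape (n_support_vectors, n_samples)
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)

# f(x) = sum_i (alpha_i y_i) K(x_i, x) + b
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]

print("matches decision_function:", np.allclose(f_manual, clf.decision_function(X)))
print("sum of alpha_i * y_i:", clf.dual_coef_.sum())           # ~0 (equality constraint)
print("max |alpha_i * y_i|: ", np.abs(clf.dual_coef_).max())   # <= C (box constraint)
```

The last two lines mirror the dual constraints: the coefficients sum to zero and none exceeds `C` in magnitude.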

**Common kernels $K(x,z)$**

- **Linear:** $x^\top z$ (useful for very high‑dimensional sparse features; scalable with `LinearSVC`).
- **RBF (Gaussian):** $\exp(-\gamma \lVert x - z\rVert^2)$ with $\gamma > 0$
- **Polynomial:** $(\gamma\, x^\top z + r)^d$ (degree $d$)
- **Sigmoid:** $\tanh(\gamma\, x^\top z + r)$ (less common)
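Each kernel in the list is a one-liner in NumPy. The sketch below writes them out on random matrices and compares against the helpers in `sklearn.metrics.pairwise`; the values of `gamma`, `r`, and `d` are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rng.normal(size=(4, 3))
gamma, r, d = 0.5, 1.0, 3

G = X @ Z.T                                             # all dot products x^T z
sq_dists = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2 * G

K_lin = G
K_rbf = np.exp(-gamma * sq_dists)
K_poly = (gamma * G + r) ** d
K_sig = np.tanh(gamma * G + r)

print(np.allclose(K_lin, linear_kernel(X, Z)))
print(np.allclose(K_rbf, rbf_kernel(X, Z, gamma=gamma)))
print(np.allclose(K_poly, polynomial_kernel(X, Z, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_sig, sigmoid_kernel(X, Z, gamma=gamma, coef0=r)))
```

In scikit-learn's naming, $r$ is `coef0` and $d$ is `degree`.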

**Hyperparameters**

- `C`: regularization strength (as above)
- `gamma`: kernel coefficient for the RBF, polynomial, and sigmoid kernels; with RBF, large `gamma` → tighter, more wiggly boundaries; small `gamma` → smoother

---

## 6) Best practices

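- **Scale features** (standardize): the margin and the RBF kernel depend on distances and dot products, so a feature with a large numeric range would otherwise dominate.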
- Start with **linear SVM** for many features/large $n$ (`LinearSVC` or `SGDClassifier(loss="hinge")`).
- Use **RBF SVC** for moderate $n$ when nonlinearity helps.
- For **imbalanced** data, set `class_weight="balanced"` or provide weights.
- Enable probability estimates (Platt scaling) via `probability=True` in `SVC` (costs extra fitting).
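To see why scaling is first on the list, the sketch below appends one hypothetical noise feature with a huge numeric range to an otherwise easy problem, then cross-validates an RBF `SVC` with and without `StandardScaler`. The factor of 1000 is an arbitrary choice to exaggerate the effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, class_sep=1.5, random_state=0)

# Hypothetical nuisance feature: pure noise on a much larger scale.
rng = np.random.default_rng(0)
X_big = np.column_stack([X, 1000.0 * rng.normal(size=len(X))])

raw = cross_val_score(SVC(kernel="rbf"), X_big, y, cv=5).mean()
scaled = cross_val_score(make_pipeline(StandardScaler(), SVC(kernel="rbf")),
                         X_big, y, cv=5).mean()

print(f"unscaled CV accuracy: {raw:.3f}")
print(f"scaled CV accuracy:   {scaled:.3f}")
```

Putting the scaler inside the pipeline also keeps the scaling statistics from leaking across cross-validation folds.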

---

## 7) Minimal code — linear and RBF SVM (scikit‑learn)

The flow below is: make data → train/test split → **pipeline** with `StandardScaler` → fit a **LinearSVC** (fast for large/high‑dimensional sets). Then try an **RBF SVC** and use a tiny **grid search** to pick `C` and `gamma`. `LinearSVC(dual="auto")` chooses an efficient solver depending on the feature/sample ratio.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score, classification_report

# Data: synthetic binary problem with 50 features, 10 of them informative
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           class_sep=1.5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

# 1) Linear SVM (good for large & high-dim data)
# dual="auto" needs scikit-learn >= 1.3; drop the argument on older versions.
lin_clf = make_pipeline(StandardScaler(),
                        LinearSVC(class_weight="balanced", dual="auto"))
lin_clf.fit(Xtr, ytr)
print("LinearSVC acc:", accuracy_score(yte, lin_clf.predict(Xte)))

# 2) RBF-kernel SVM with small grid search (for moderate-sized data)
# make_pipeline names the SVC step "svc", hence the "svc__" prefix in the grid.
rbf_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
grid = GridSearchCV(rbf_pipe, param_grid, cv=3, n_jobs=-1)
grid.fit(Xtr, ytr)
rbf_pred = grid.predict(Xte)  # uses the best pipeline, refit on all of Xtr
print("Best RBF params:", grid.best_params_)
print("RBF acc:", accuracy_score(yte, rbf_pred))
print(classification_report(yte, rbf_pred))
```

---

## 8) (Optional) Hinge‑loss SGD — from scratch (toy)

This toy optimizer does **stochastic subgradient descent** on the hinge loss plus $\ell_2$ penalty. If a sample is correctly classified with margin $\ge 1$, we only apply weight decay. If it violates the margin, we also step in the direction that reduces the hinge loss. `SGDClassifier(loss="hinge")` in scikit-learn is built on the same idea (with its own step-size schedule), whereas `LinearSVC` delegates to the liblinear solvers.
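Written out, one step on a single sample $i$ of the hinge-loss objective from section 3 looks as follows. Here $\eta$ is the step size (`lr` in the code) and $\lambda$ is `lam`; the indicator $\mathbb{1}[\cdot]$ is 1 when the margin is violated and 0 otherwise.

$$
\begin{aligned}
w &\leftarrow (1 - \eta\lambda)\,w + \eta\, y_i x_i\,\mathbb{1}\big[y_i(w^\top x_i + b) < 1\big] \\
b &\leftarrow b + \eta\, y_i\,\mathbb{1}\big[y_i(w^\top x_i + b) < 1\big]
\end{aligned}
$$

At the kink $y_i(w^\top x_i + b) = 1$ the hinge loss has a whole range of valid subgradients; this sketch picks zero there, and the bias $b$ is left unregularized.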

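```python
import numpy as np

def sgd_linear_svm(X, y, lr=0.1, lam=1e-3, epochs=10, seed=0):
    """Stochastic subgradient descent on lam/2 * ||w||^2 + hinge loss.

    X: array of shape (n, d); y: labels in {-1, +1}. Returns (w, b).
    """
    rng = np.random.default_rng(seed)  # seeded so the shuffles are reproducible
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):  # visit samples in a fresh random order
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # margin violated: weight decay plus a step that reduces the hinge loss
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b = b + lr * y[i]
            else:
                # margin satisfied: the hinge subgradient is zero, only weight decay
                w = (1 - lr * lam) * w
    return w, b
```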
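For comparison, the library counterpart of this toy loop is `SGDClassifier` with `loss="hinge"`. The sketch below runs it on a well-separated, hypothetical blob data set. Its `alpha` plays the role of `lam`, but scikit-learn uses its own learning-rate schedule and stopping rule, so the learned weights will not match the from-scratch version exactly.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=500, centers=[[-3.0, -3.0], [3.0, 3.0]],
                  cluster_std=1.0, random_state=0)

sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                  max_iter=1000, tol=1e-3, random_state=0),
)
sgd.fit(X, y)

acc = sgd.score(X, y)
print("training accuracy:", acc)
print("w:", sgd[-1].coef_[0], " b:", sgd[-1].intercept_[0])
```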

---

## 9) How to choose `C`, `gamma`, and the kernel

1. **Start simple:** linear vs RBF; pick by validation.
2. **Grid search / log‑scale**: Use a small **log‑grid** `C ∈ {0.01, 0.1, 1, 10, 100}`, `gamma ∈ {"scale", 0.001, 0.01, 0.1, 1}`.
3. **Watch for overfit:** very high training acc + lower validation acc → reduce `C` or `gamma`.
4. **Speed:** If training is slow with kernel SVMs, reduce the grid size first and/or switch to `LinearSVC`.
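The `"scale"` entry in the grid is not a number but a rule: scikit-learn sets `gamma = 1 / (n_features * X.var())` from the training data. The sketch below checks this on a synthetic data set by fitting once with `gamma="scale"` and once with the value computed by hand. On standardized features the variance is 1, so the rule reduces to `1 / n_features`, which is a handy center for a log-grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
Xs = StandardScaler().fit_transform(X)

gamma_scale = 1.0 / (Xs.shape[1] * Xs.var())   # what gamma="scale" resolves to
print("gamma from the rule:", gamma_scale, " 1/n_features:", 1.0 / Xs.shape[1])

auto = SVC(kernel="rbf", gamma="scale").fit(Xs, y)
manual = SVC(kernel="rbf", gamma=gamma_scale).fit(Xs, y)

same = np.allclose(auto.decision_function(Xs), manual.decision_function(Xs))
print("same decision function:", same)
```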

---

## 10) Multiclass strategies

SVMs are inherently binary. For $K$ classes:

- **One‑vs‑One (OvO):** train $K(K-1)/2$ binary classifiers (default in `SVC`). It compares every pair of classes and tends to work well when classes are balanced.
- **One‑vs‑Rest (OvR):** train $K$ classifiers vs the rest (default in `LinearSVC`). It is simpler and pairs well with linear models on high‑dimensional data.
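The two strategies leave a visible fingerprint on the fitted models. With $K=4$ classes on a hypothetical blob data set, an `SVC` asked for OvO scores returns $K(K-1)/2 = 6$ columns, while `LinearSVC` learns one weight vector per class.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC, SVC

X, y = make_blobs(n_samples=400, centers=4, random_state=0)

# decision_function_shape="ovo" exposes the raw pairwise scores.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = LinearSVC().fit(X, y)

ovo_shape = ovo.decision_function(X).shape
ovr_shape = ovr.coef_.shape
print("SVC pairwise scores:", ovo_shape)   # (400, 6): one column per class pair
print("LinearSVC weights:  ", ovr_shape)   # (4, 2): one row per class
```

By default `SVC` reports scores in an OvR-shaped array (`decision_function_shape="ovr"`), even though the underlying training is still pairwise.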

---

## 11) FAQs & Gotchas

Probability outputs from `SVC` come from an extra calibration step (Platt scaling, fitted with internal cross-validation), so they can be unreliable on small datasets and may occasionally disagree with `predict`. For speed problems, try fewer features, subsampling, or a linear model. Always re‑check that features are standardized.

- **“My SVM is slow.”** → Too many samples with kernel SVC; try `LinearSVC` or sub-sample + RBF.
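- **“Predicted probabilities look odd.”** → `SVC(probability=True)` fits Platt scaling on top of the SVM scores; for more control (or for `LinearSVC`, which has no `predict_proba`), wrap the model in `CalibratedClassifierCV`.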
- **“Decision boundary is jagged.”** → Likely large `gamma` (RBF) or very large `C`; reduce them and re‑scale features.
- **“Imbalanced classes.”** → Use `class_weight="balanced"` and evaluate with F1/ROC‑AUC, not just accuracy.
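As a concrete version of the calibration tip, the sketch below wraps a scaled `LinearSVC` in `CalibratedClassifierCV` with sigmoid (Platt) calibration. The data set and the choice of `cv=5` are arbitrary.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0, stratify=y)

# LinearSVC only exposes decision_function; the wrapper adds predict_proba.
base = make_pipeline(StandardScaler(), LinearSVC())
cal = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(Xtr, ytr)

proba = cal.predict_proba(Xte)
print("shape:", proba.shape)
print("first rows:\n", np.round(proba[:3], 3))
print("test accuracy:", cal.score(Xte, yte))
```

Swapping `method="sigmoid"` for `method="isotonic"` is another option when there is enough data to fit a non-parametric calibration curve.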

---

## 12) Quick cheat sheet

Linear SVMs shine with many features (e.g., text). Kernel SVMs capture curvature at the cost of speed and tuning more hyperparameters. Both require scaling; both are sensitive to `C`, and kernels add `gamma` (and `degree` for polynomial).

| Aspect              | Linear SVM                      | Kernel SVM (RBF/Poly)                 |
| ------------------- | ------------------------------- | ------------------------------------- |
| Nonlinearity        | No                              | Yes (via kernels)                     |
| Scale with #samples | Great (use `LinearSVC`)         | Moderate/Slow for large $n$           |
| Hyperparams         | `C`                             | `C`, `gamma`, (degree for poly)       |
| Feature scaling     | **Required**                    | **Required**                          |
| Typical use         | Text, high‑dim sparse, big data | Moderate data with nonlinear boundary |

---

## 13) Practice

Try plotting the margin band (the two lines where $y(w^\top x + b)=1$) on a 2‑D toy set to see which points become support vectors. Then vary `C` and watch how the number of support vectors and the margin width change.

1. Standardize features, then compare `LinearSVC` vs `SVC(RBF)` on your dataset.
2. Grid‑search `C` and `gamma`; report validation curves and best test score.
3. Inspect support vectors: how many are there, and which points become SVs?
4. Try class imbalance: set `class_weight="balanced"` and compare metrics.

---

## Summary

- SVMs maximize margin → robust decision boundaries.
- Soft margins + hinge loss handle overlap and noise.
- Dual form enables **kernels** for powerful nonlinear separation.
- In practice: **scale**, pick kernel by validation, tune `C`/`gamma`, and use `LinearSVC` for large datasets.

**Next:** Principal component analysis + Dimensionality reduction