# DATASCI 503, Homework 8: Support Vector Machines

Support Vector Machines (SVMs) are powerful classifiers that find the optimal separating hyperplane between classes by maximizing the margin, that is, the distance between the decision boundary and the nearest training points. This assignment covers the **maximal margin classifier** for linearly separable data, **soft-margin SVMs** that allow some misclassification via slack variables, and **kernel methods** that enable non-linear decision boundaries by implicitly mapping data to higher-dimensional feature spaces.

---

**Problem 1:** Non-linear Decision Boundaries (ISLP 9.2)

We have seen that in $p=2$ dimensions, a linear decision boundary takes the form $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$. We now investigate a non-linear decision boundary.

**(a)** Sketch the curve $(1 + X_1)^2 + (2 - X_2)^2 = 4$.

**(b)** On your sketch, indicate the set of points for which $(1 + X_1)^2 + (2 - X_2)^2 > 4$, as well as the set of points for which $(1 + X_1)^2 + (2 - X_2)^2 \leq 4$.

> BEGIN SOLUTION

The curve $(1 + X_1)^2 + (2 - X_2)^2 = 4$ is a circle centered at $(-1, 2)$ with radius 2.

- Points **inside** the circle (including the boundary) satisfy $(1 + X_1)^2 + (2 - X_2)^2 \leq 4$
- Points **outside** the circle satisfy $(1 + X_1)^2 + (2 - X_2)^2 > 4$

> END SOLUTION

**(c)** Suppose that a classifier assigns an observation to the blue class if $(1 + X_1)^2 + (2 - X_2)^2 > 4$ and to the red class otherwise. To what class is the observation (0, 0) classified? (-1, 1)? (2, 2)? (3, 8)?

> BEGIN SOLUTION

Anything within the circle or on the boundary is classified as red, and everything else is blue. We evaluate each point:

- $(0, 0)$: $(1 + 0)^2 + (2 - 0)^2 = 1 + 4 = 5 > 4$ $\Rightarrow$ **Blue**
- $(-1, 1)$: $(1 - 1)^2 + (2 - 1)^2 = 0 + 1 = 1 \leq 4$ $\Rightarrow$ **Red**
- $(2, 2)$: $(1 + 2)^2 + (2 - 2)^2 = 9 + 0 = 9 > 4$ $\Rightarrow$ **Blue**
- $(3, 8)$: $(1 + 3)^2 + (2 - 8)^2 = 16 + 36 = 52 > 4$ $\Rightarrow$ **Blue**
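
The arithmetic above can be double-checked with a few lines of Python. This is only a sketch: `classify` is a helper name introduced here for illustration, and it encodes the rule from part (c) directly.

```python
def classify(x1, x2):
    """Return 'Blue' if the point lies strictly outside the circle, else 'Red'."""
    value = (1 + x1) ** 2 + (2 - x2) ** 2
    return "Blue" if value > 4 else "Red"


points = [(0, 0), (-1, 1), (2, 2), (3, 8)]
labels = {point: classify(*point) for point in points}

for point, label in labels.items():
    print(point, label)
```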

> END SOLUTION

**(d)** Argue that while the decision boundary in (c) is not linear in terms of $X_1$ and $X_2$, it is linear in terms of $X_1$, $X_1^2$, $X_2$, and $X_2^2$.

> BEGIN SOLUTION

We expand the equation:

$$(1 + X_1)^2 + (2 - X_2)^2 = 4$$
$$1 + 2X_1 + X_1^2 + 4 - 4X_2 + X_2^2 = 4$$
$$1 + 2X_1 + X_1^2 - 4X_2 + X_2^2 = 0$$

This can be written as:
$$\beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_2^2 = 0$$

where $\beta_0 = 1$, $\beta_1 = 2$, $\beta_2 = 1$, $\beta_3 = -4$, $\beta_4 = 1$.

Because of the squared terms $X_1^2$ and $X_2^2$, the boundary is non-linear in $X_1$ and $X_2$, but the expression is a **linear combination** of the features $X_1$, $X_1^2$, $X_2$, and $X_2^2$. This is the idea behind enlarging the feature space: a boundary that is non-linear in the original inputs can be linear in the expanded features, and kernel methods perform such mappings implicitly.
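
As a quick sanity check on the expansion, the sketch below compares the original circle expression with the linear combination of expanded features on a small integer grid. The function names and the grid are illustrative choices, not part of the assignment.

```python
def circle_form(x1, x2):
    """Left-hand side of the boundary equation minus 4."""
    return (1 + x1) ** 2 + (2 - x2) ** 2 - 4


def linear_form(x1, x2):
    """Same quantity written as a linear combination of expanded features."""
    features = [1, x1, x1 ** 2, x2, x2 ** 2]
    betas = [1, 2, 1, -4, 1]  # beta_0 .. beta_4 from the solution
    return sum(b * f for b, f in zip(betas, features))


grid = [(a, b) for a in range(-3, 4) for b in range(-3, 4)]
max_gap = max(abs(circle_form(a, b) - linear_form(a, b)) for a, b in grid)
print(max_gap)  # 0: the two forms agree at every grid point
```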

> END SOLUTION

---

**Problem 2:** Maximal Margin Classifier (ISLP 9.3)

**(a)** We are given $n = 7$ observations in $p = 2$ dimensions. For each observation, there is an associated class label.

| Obs. | $X_1$ | $X_2$ | $Y$ |
|------|-------|-------|-----|
| 1 | 3 | 4 | Red |
| 2 | 2 | 2 | Red |
| 3 | 4 | 4 | Red |
| 4 | 1 | 4 | Red |
| 5 | 2 | 1 | Blue |
| 6 | 4 | 3 | Blue |
| 7 | 4 | 1 | Blue |

Sketch these points on a graph.

> BEGIN SOLUTION

Plot the seven observations on a 2D graph with $X_1$ on the horizontal axis and $X_2$ on the vertical axis. Red points are at (3,4), (2,2), (4,4), (1,4). Blue points are at (2,1), (4,3), (4,1).

> END SOLUTION

**(b)** Sketch the optimal separating hyperplane, and provide the equation for this hyperplane in the form $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$.

> BEGIN SOLUTION

The optimal separating hyperplane maximizes the minimum distance from any training point to the hyperplane while still separating the data.

Observing the geometry:
- The Blue boundary points (2,1) and (4,3) lie on the line $X_2 = X_1 - 1$
- The Red boundary points (2,2) and (4,4) lie on the line $X_2 = X_1$

The optimal hyperplane is halfway between these two parallel lines:
$$\frac{1}{2} + X_2 - X_1 = 0$$

or equivalently: $X_2 = X_1 - \frac{1}{2}$

> END SOLUTION

**(c)** Describe the classification rule for the maximal margin classifier. It should be something along the lines of "Classify to Red if $\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0$, and classify to Blue otherwise." Provide the values for $\beta_0$, $\beta_1$, and $\beta_2$.

> BEGIN SOLUTION

The classification rule is:
- Classify to **Red** if $\frac{1}{2} + X_2 - X_1 > 0$ (above the line)
- Classify to **Blue** if $\frac{1}{2} + X_2 - X_1 \leq 0$ (on or below the line)

Thus: $\beta_0 = \frac{1}{2}$, $\beta_1 = -1$, $\beta_2 = 1$
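
A small check that this rule reproduces every training label. The `predict` helper is a name chosen here for illustration.

```python
observations = [
    (3, 4, "Red"), (2, 2, "Red"), (4, 4, "Red"), (1, 4, "Red"),
    (2, 1, "Blue"), (4, 3, "Blue"), (4, 1, "Blue"),
]
beta0, beta1, beta2 = 0.5, -1.0, 1.0


def predict(x1, x2):
    """Classify to Red when the linear score is positive, Blue otherwise."""
    score = beta0 + beta1 * x1 + beta2 * x2
    return "Red" if score > 0 else "Blue"


predictions = [predict(x1, x2) for x1, x2, _ in observations]
all_correct = all(
    pred == label for pred, (_, _, label) in zip(predictions, observations)
)
print(all_correct)  # True
```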

> END SOLUTION

**(d)** On your sketch, indicate the margin for the maximal margin hyperplane.

> BEGIN SOLUTION

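The margin is the perpendicular distance from the hyperplane to the nearest training observations. Its boundaries are the lines $X_2 = X_1$ (through the Red support vectors) and $X_2 = X_1 - 1$ (through the Blue support vectors). Each boundary lies at distance $\frac{1/2}{\sqrt{2}} = \frac{\sqrt{2}}{4} \approx 0.354$ from the hyperplane, so the margin is $\frac{\sqrt{2}}{4}$ and the full band between the two lines has width $\frac{\sqrt{2}}{2}$.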

> END SOLUTION

**(e)** Indicate the support vectors for the maximal margin classifier.

> BEGIN SOLUTION

The support vectors are the points that lie exactly on the margin boundaries:
- Red support vectors: (2, 2) and (4, 4)
- Blue support vectors: (2, 1) and (4, 3)
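
The support vectors can also be recovered numerically as the observations closest to the hyperplane $\frac{1}{2} - X_1 + X_2 = 0$. The sketch below assumes that hyperplane and uses the observation numbers from the table. The variable names are illustrative.

```python
import math

observations = {1: (3, 4), 2: (2, 2), 3: (4, 4), 4: (1, 4), 5: (2, 1), 6: (4, 3), 7: (4, 1)}
beta0, beta1, beta2 = 0.5, -1.0, 1.0
norm = math.hypot(beta1, beta2)

# Perpendicular distance from each observation to the hyperplane
distances = {
    obs: abs(beta0 + beta1 * x1 + beta2 * x2) / norm
    for obs, (x1, x2) in observations.items()
}
margin = min(distances.values())
support = sorted(obs for obs, d in distances.items() if math.isclose(d, margin))

print(round(margin, 4), support)  # 0.3536 [2, 3, 5, 6]
print(round(distances[7], 4))     # observation 7 is much farther away
```

Observation 7 sits at roughly five times the margin distance, which is consistent with it not being a support vector.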

> END SOLUTION

**(f)** Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.

> BEGIN SOLUTION

Observation 7 is at (4, 1), which is not a support vector. It lies well within the Blue region, far from the decision boundary. A slight movement of this point would not change which points are closest to the hyperplane (the support vectors remain the same), so the maximal margin hyperplane is unchanged.

> END SOLUTION

**(g)** Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.

> BEGIN SOLUTION

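Any hyperplane that separates the classes without maximizing the margin works. For example, $X_2 = X_1 - 0.3$ (or $0.3 + X_2 - X_1 = 0$) still separates the classes, but its distance to the nearest points, the Red observations $(2, 2)$ and $(4, 4)$, is only $0.3/\sqrt{2} \approx 0.21$, smaller than the optimal margin $\sqrt{2}/4 \approx 0.35$.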
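
To compare the two hyperplanes directly, the sketch below computes the smallest point-to-line distance for each. `smallest_distance` is a hypothetical helper written for this comparison.

```python
import math

points = [(3, 4), (2, 2), (4, 4), (1, 4), (2, 1), (4, 3), (4, 1)]


def smallest_distance(beta0, beta1, beta2):
    """Distance from the hyperplane to its nearest training point."""
    norm = math.hypot(beta1, beta2)
    return min(abs(beta0 + beta1 * x1 + beta2 * x2) / norm for x1, x2 in points)


optimal = smallest_distance(0.5, -1.0, 1.0)
alternative = smallest_distance(0.3, -1.0, 1.0)
print(round(optimal, 3), round(alternative, 3))  # 0.354 0.212
```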

> END SOLUTION

**(h)** Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.

> BEGIN SOLUTION

Adding a point that violates the separability, such as a Blue point at (3, 3) or a Red point at (3, 2), would make the classes no longer linearly separable.
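
One way to justify the first example, sketched under the assumption that a separating rule has the form $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ with $f > 0$ on Red points and $f < 0$ on Blue points: the new Blue point $(3, 3)$ is the midpoint of the Red points $(2, 2)$ and $(4, 4)$, and $f$ is affine, so

$$f(3, 3) = \tfrac{1}{2} f(2, 2) + \tfrac{1}{2} f(4, 4) > 0,$$

which contradicts the requirement $f(3, 3) < 0$. The same argument applies to a Red point at $(3, 2)$, the midpoint of the Blue points $(2, 1)$ and $(4, 3)$.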

> END SOLUTION

---

**Problem 3:** SVM Margins and Slack Variables

For this question, use graph paper, a drawing application, or Python.

**(a)** Suppose

$$f(X_1, X_2) = \frac{1}{\sqrt{2}}X_1 - \frac{1}{\sqrt{2}}X_2 - 1$$

Draw the hyperplane defined by $f(X_1, X_2) = 0$. Indicate the region of possible inputs $X_1, X_2$ for which $f(X_1, X_2) > 0$ with a "+" sign. Indicate the region of possible inputs $X_1, X_2$ for which $f(X_1, X_2) < 0$ with a "-" sign.

> BEGIN SOLUTION

The hyperplane $f(X_1, X_2) = 0$ is defined by:
$$\frac{1}{\sqrt{2}}X_1 - \frac{1}{\sqrt{2}}X_2 - 1 = 0$$
$$X_1 - X_2 = \sqrt{2}$$

This is a line with slope 1 passing through $(\sqrt{2}, 0)$ and $(0, -\sqrt{2})$.

- Region with "+": $f(X_1, X_2) > 0$ is **below-right** of the line (where $X_1 - X_2 > \sqrt{2}$)
- Region with "-": $f(X_1, X_2) < 0$ is **above-left** of the line (where $X_1 - X_2 < \sqrt{2}$)

> END SOLUTION

**(b)** Suppose an SVM classifier is fit to data, resulting in the following decision rule:

$$
\hat{y}(X_1, X_2) = 
\begin{cases}
    + & \text{if } f(X_1, X_2) > 0 \\
    - & \text{if } f(X_1, X_2) \leq 0
\end{cases}
$$

Suppose further that this SVM is associated with a margin of $m = \sqrt{2}$. Draw the margin lines.

> BEGIN SOLUTION

With margin $m = \sqrt{2}$, the margin lines are at distance $\sqrt{2}$ from the decision boundary. Since $\|\beta\| = \sqrt{(1/\sqrt{2})^2 + (1/\sqrt{2})^2} = 1$, the value $f(X_1, X_2)$ is the signed distance to the boundary, so the margin lines are $f(X_1, X_2) = \pm\sqrt{2}$:
- Upper margin: $f(X_1, X_2) = -\sqrt{2}$, i.e., $X_1 - X_2 = \sqrt{2} - 2$
- Lower margin: $f(X_1, X_2) = \sqrt{2}$, i.e., $X_1 - X_2 = \sqrt{2} + 2$

> END SOLUTION

**(c)** What class label (+ or -, as defined above) does this SVM predict for the following points: (1, 4); (1, 1); (2, -5); (2, -1); (4, 2)?

> BEGIN SOLUTION

Evaluate $f(X_1, X_2) = \frac{1}{\sqrt{2}}X_1 - \frac{1}{\sqrt{2}}X_2 - 1$ for each point:

- $(1, 4)$: $f = \frac{1}{\sqrt{2}}(1) - \frac{1}{\sqrt{2}}(4) - 1 = \frac{-3}{\sqrt{2}} - 1 < 0$ $\Rightarrow$ **-**
- $(1, 1)$: $f = \frac{1}{\sqrt{2}}(1) - \frac{1}{\sqrt{2}}(1) - 1 = -1 < 0$ $\Rightarrow$ **-**
- $(2, -5)$: $f = \frac{1}{\sqrt{2}}(2) - \frac{1}{\sqrt{2}}(-5) - 1 = \frac{7}{\sqrt{2}} - 1 > 0$ $\Rightarrow$ **+**
- $(2, -1)$: $f = \frac{1}{\sqrt{2}}(2) - \frac{1}{\sqrt{2}}(-1) - 1 = \frac{3}{\sqrt{2}} - 1 > 0$ $\Rightarrow$ **+**
- $(4, 2)$: $f = \frac{1}{\sqrt{2}}(4) - \frac{1}{\sqrt{2}}(2) - 1 = \frac{2}{\sqrt{2}} - 1 = \sqrt{2} - 1 > 0$ $\Rightarrow$ **+**
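
The same predictions can be reproduced in a few lines. This is a sketch in which `f` and `predict` simply transcribe the decision function and decision rule stated above.

```python
import math


def f(x1, x2):
    """Decision function from part (a)."""
    return (x1 - x2) / math.sqrt(2) - 1


def predict(x1, x2):
    """Decision rule from part (b): '+' when f > 0, '-' otherwise."""
    return "+" if f(x1, x2) > 0 else "-"


points = [(1, 4), (1, 1), (2, -5), (2, -1), (4, 2)]
predictions = [predict(*point) for point in points]
print(predictions)  # ['-', '-', '+', '+', '+']
```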

> END SOLUTION

**(d)** Suppose that these five points were part of the training data, and their true labels, given in the same order, are -, -, +, +, -. Calculate the corresponding slack values ($\xi_i$) for each of the five points.

> BEGIN SOLUTION

True labels: (1,4) is -, (1,1) is -, (2,-5) is +, (2,-1) is +, (4,2) is -.

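Slack $\xi_i$ is zero for points on the correct side of their margin line and positive for points that are inside the margin or misclassified.

Code the labels as $y_i = +1$ for "+" and $y_i = -1$ for "-". Following the ISLP formulation with $\|\beta\| = 1$ and margin $M = \sqrt{2}$ from part (b), the constraint is $y_i f(x_i) \geq M(1 - \xi_i)$, so $\xi_i = \max\left(0,\ 1 - \frac{y_i f(x_i)}{M}\right)$. A value $0 < \xi_i \leq 1$ means the point violates the margin but is not on the wrong side of the hyperplane; $\xi_i > 1$ means it is misclassified.

- $(1, 4)$, true "-": $y f = \frac{3}{\sqrt{2}} + 1 \approx 3.12 \geq \sqrt{2}$. Correctly classified, outside the margin. $\xi_1 = 0$
- $(1, 1)$, true "-": $y f = 1 < \sqrt{2}$. Correctly classified but inside the margin. $\xi_2 = 1 - \frac{1}{\sqrt{2}} \approx 0.29$
- $(2, -5)$, true "+": $y f = \frac{7}{\sqrt{2}} - 1 \approx 3.95 \geq \sqrt{2}$. Correctly classified, outside the margin. $\xi_3 = 0$
- $(2, -1)$, true "+": $y f = \frac{3}{\sqrt{2}} - 1 \approx 1.12 < \sqrt{2}$. Correctly classified but inside the margin. $\xi_4 = 1 - \frac{1}{\sqrt{2}}\left(\frac{3}{\sqrt{2}} - 1\right) = \frac{\sqrt{2} - 1}{2} \approx 0.21$
- $(4, 2)$, true "-": $y f = 1 - \sqrt{2} \approx -0.41 < 0$, so the point is misclassified. $\xi_5 = 1 - \frac{1 - \sqrt{2}}{\sqrt{2}} = 2 - \frac{1}{\sqrt{2}} \approx 1.29$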
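
Slack values depend on where the margin lines sit, so it can help to make the margin an explicit parameter. The sketch below assumes the ISLP-style constraint $y_i f(x_i) \geq M(1 - \xi_i)$ with $\|\beta\| = 1$. The `slack` helper is hypothetical, and setting `margin=1` recovers the unit-margin form $\max(0, 1 - y_i f(x_i))$.

```python
import math


def f(x1, x2):
    """Decision function from part (a); its coefficient vector has norm 1."""
    return (x1 - x2) / math.sqrt(2) - 1


def slack(y, score, margin):
    """Smallest slack satisfying y * score >= margin * (1 - slack)."""
    return max(0.0, 1 - y * score / margin)


# (point, true label) pairs, with "+" coded as +1 and "-" as -1
training = [((1, 4), -1), ((1, 1), -1), ((2, -5), 1), ((2, -1), 1), ((4, 2), -1)]

scaled_slacks = [slack(y, f(*x), math.sqrt(2)) for x, y in training]
unit_slacks = [slack(y, f(*x), 1.0) for x, y in training]

print([round(s, 3) for s in scaled_slacks])  # [0.0, 0.293, 0.0, 0.207, 1.293]
print([round(s, 3) for s in unit_slacks])    # [0.0, 0.0, 0.0, 0.0, 1.414]
```

Under either convention only the last point has slack greater than 1, matching the fact that it is the only misclassified point.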

> END SOLUTION

---

**Problem 4:** SVM on Crabs Data

This question uses the `crabs.csv` data, with the five body measurements (`FL`, `RW`, `CL`, `CW`, `BD`) as predictors and species (`sp`) as the response. Omit all other variables including sex. We will assess error using misclassification rate.

**Hint:** Use [`sklearn.svm.SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) with appropriate kernels. Use [`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) or manual K-fold cross-validation to estimate error. Remember to scale your features using [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

**(a)** Fit a linear SVM to the data for a range of values of the C hyperparameter, to predict species from the five numerical measurements. Plot the cross-validated estimate of the error as a function of C.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [None]:
# Load data and create train/test sets
crabs_df = pd.read_csv("data/crabs.csv", index_col=0)
features = crabs_df[["FL", "RW", "CL", "CW", "BD"]]
target = crabs_df["sp"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)

In [None]:
# BEGIN SOLUTION
# Fit linear SVM for a range of C values and compute cross-validated error
# linspace gives exactly 200 evenly spaced values; arange with a float step can be off by one
c_values = np.linspace(0.01, 2.0, 200)
cv_errors_linear = []
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for c_val in c_values:
    pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="linear", C=c_val))])
    # cross_val_score returns accuracy, convert to error
    cv_accuracy = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring="accuracy")
    cv_errors_linear.append(1 - cv_accuracy.mean())
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Plot CV error vs C for linear kernel
plt.figure(figsize=(10, 5))
plt.plot(c_values, cv_errors_linear, color="red", linewidth=1.5)
plt.title("Linear SVM: Cross-Validated Error vs C")
plt.xlabel("C (regularization parameter)")
plt.ylabel("Cross-Validated Error")
plt.grid(alpha=0.3)
plt.show()

best_c_linear = c_values[np.argmin(cv_errors_linear)]
print(f"Best C for linear SVM: {best_c_linear:.2f}")
print(f"Minimum CV error: {min(cv_errors_linear):.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert len(cv_errors_linear) == len(c_values), "Should have error for each C value"
assert all(0 <= err <= 1 for err in cv_errors_linear), "Errors should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(cv_errors_linear) >= 50, "Should test a reasonable range of C values"
assert min(cv_errors_linear) < 0.15, "Linear SVM should achieve reasonable accuracy on crabs data"
# END HIDDEN TESTS

**(b)** Now, fit nonlinear SVMs with polynomial kernels. Consider at least three values of degree. For each one, plot the cross-validated estimate of the error as a function of C.
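
For intuition about the `degree` parameter: scikit-learn documents its polynomial kernel as $(\gamma \langle x, x' \rangle + r)^d$, where $d$ is `degree` and $r$ is `coef0`. The sketch below is a plain-Python illustration only. `poly_kernel` is a hypothetical helper, and `gamma=1.0` is an assumption made for readability (`SVC` defaults to `gamma="scale"`).

```python
def poly_kernel(x, z, degree, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * <x, z> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree


x, z = (1.0, 2.0), (3.0, 1.0)  # illustrative points with <x, z> = 5
values = {degree: poly_kernel(x, z, degree) for degree in [1, 2, 3, 4]}
print(values)  # {1: 5.0, 2: 25.0, 3: 125.0, 4: 625.0}
```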

In [None]:
# BEGIN SOLUTION
# Fit polynomial SVMs for different degrees and C values
degrees = [1, 2, 3, 4]
cv_errors_poly = {degree: [] for degree in degrees}

for degree in degrees:
    for c_val in c_values:
        pipeline = Pipeline(
            [
                ("scaler", StandardScaler()),
                ("svm", SVC(kernel="poly", degree=degree, C=c_val)),
            ]
        )
        cv_accuracy = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring="accuracy")
        cv_errors_poly[degree].append(1 - cv_accuracy.mean())
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Plot CV error vs C for each polynomial degree
fig, axes = plt.subplots(1, len(degrees), figsize=(20, 4))

for idx, degree in enumerate(degrees):
    axes[idx].plot(c_values, cv_errors_poly[degree], color="red", linewidth=1.5)
    axes[idx].set_title(f"Polynomial SVM (degree={degree})")
    axes[idx].set_xlabel("C")
    axes[idx].set_ylabel("CV Error")
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

for degree in degrees:
    best_c = c_values[np.argmin(cv_errors_poly[degree])]
    min_error = min(cv_errors_poly[degree])
    print(f"Degree {degree}: Best C = {best_c:.2f}, Min CV error = {min_error:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert len(cv_errors_poly) >= 3, "Should test at least 3 polynomial degrees"
assert all(
    len(errors) == len(c_values) for errors in cv_errors_poly.values()
), "Should have error for each C value"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 1 in cv_errors_poly or 2 in cv_errors_poly, "Should include low degree polynomials"
assert all(
    0 <= err <= 1 for errors in cv_errors_poly.values() for err in errors
), "All errors should be valid"
# END HIDDEN TESTS

**(c)** Finally, fit nonlinear SVMs with radial (RBF) kernels. Consider at least three values of gamma. For each one, plot the cross-validated estimate of the error as a function of C.
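
For intuition about `gamma`: the RBF kernel is $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, so a larger `gamma` makes similarity fall off faster with distance. The sketch below is a plain-Python illustration. `rbf_similarity` is a hypothetical helper, not a scikit-learn function, and the two points are arbitrary.

```python
import math


def rbf_similarity(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = (0.0, 0.0), (1.0, 1.0)  # squared distance 2
similarities = {
    gamma: rbf_similarity(x, z, gamma) for gamma in [0.01, 0.1, 1, 5, 10]
}

for gamma, value in similarities.items():
    print(f"gamma={gamma}: K={value:.4f}")
```

With small `gamma` the two points still look similar to the kernel. With large `gamma` the kernel value is close to zero, which tends to produce very local, wiggly decision boundaries.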

In [None]:
# BEGIN SOLUTION
# Fit RBF SVMs for different gamma values and C values
gamma_values = [0.01, 0.1, 1, 5, 10]
cv_errors_rbf = {gamma: [] for gamma in gamma_values}

for gamma in gamma_values:
    for c_val in c_values:
        pipeline = Pipeline(
            [
                ("scaler", StandardScaler()),
                ("svm", SVC(kernel="rbf", gamma=gamma, C=c_val)),
            ]
        )
        cv_accuracy = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring="accuracy")
        cv_errors_rbf[gamma].append(1 - cv_accuracy.mean())
# END SOLUTION

In [None]:
# BEGIN SOLUTION
# Plot CV error vs C for each gamma value
fig, axes = plt.subplots(1, len(gamma_values), figsize=(20, 4))

for idx, gamma in enumerate(gamma_values):
    axes[idx].plot(c_values, cv_errors_rbf[gamma], color="red", linewidth=1.5)
    axes[idx].set_title(f"RBF SVM (gamma={gamma})")
    axes[idx].set_xlabel("C")
    axes[idx].set_ylabel("CV Error")
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

for gamma in gamma_values:
    best_c = c_values[np.argmin(cv_errors_rbf[gamma])]
    min_error = min(cv_errors_rbf[gamma])
    print(f"Gamma {gamma}: Best C = {best_c:.2f}, Min CV error = {min_error:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert len(cv_errors_rbf) >= 3, "Should test at least 3 gamma values"
assert all(
    len(errors) == len(c_values) for errors in cv_errors_rbf.values()
), "Should have error for each C value"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.1 in cv_errors_rbf or 1 in cv_errors_rbf, "Should include common gamma values"
assert all(
    0 <= err <= 1 for errors in cv_errors_rbf.values() for err in errors
), "All errors should be valid"
# END HIDDEN TESTS