# DATASCI 503, Group Work 8: Support Vector Machines

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. During lab, feel free to flag down your GSI to ask questions at any point!

## Overview of SVMs

In this section, we experiment with SVMs and tune their hyperparameters using cross-validation.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

We will again use our old friend: the iris dataset.

This is one of the earliest datasets used in the literature on classification methods, and it is widely used in statistics and machine learning. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.
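
One rough way to probe the separability claim is to fit a linear SVM with a large `C` (which approximates a hard margin) and look at the training accuracy. The sketch below is only illustrative: the names `X_demo`, `y_demo`, and the choice `C=1e3` are assumptions made here. A training accuracy of 1.0 shows that a separating hyperplane exists, while a value below 1.0 merely suggests that none does.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# Setosa (label 0) versus the other two species; a large C approximates a hard margin.
setosa_vs_rest = SVC(kernel="linear", C=1e3).fit(X_demo, y_demo == 0)
setosa_train_acc = setosa_vs_rest.score(X_demo, y_demo == 0)

# Versicolor (label 1) versus virginica (label 2) only.
mask = y_demo != 0
versi_vs_virg = SVC(kernel="linear", C=1e3).fit(X_demo[mask], y_demo[mask])
versi_virg_train_acc = versi_vs_virg.score(X_demo[mask], y_demo[mask])

print(f"setosa vs. rest training accuracy: {setosa_train_acc:.3f}")
print(f"versicolor vs. virginica training accuracy: {versi_virg_train_acc:.3f}")
```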

In [None]:
# Load the iris features and class labels
iris = load_iris()
X = iris.data
Y = iris.target

Instead of fitting `SVC` (support vector classification) with fixed hyperparameters, we define a parameter grid and let cross-validation pick the best combination within that grid.
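
To get a feel for the size of the search, the short sketch below looks at a log-spaced grid of 25 values between $10^{-3}$ and $10^{3}$, used for both parameters, together with 5-fold cross-validation. The variable names are chosen here for illustration only.

```python
import numpy as np

grid_values = np.logspace(-3, 3, 25)  # 25 values from 10**-3 to 10**3

# Consecutive values differ by a constant factor of 10**0.25.
ratios = grid_values[1:] / grid_values[:-1]

n_combinations = len(grid_values) ** 2  # every (C, gamma) pair
n_fits = n_combinations * 5  # one fit per pair and per fold

print(grid_values[[0, 12, 24]])  # smallest, middle, and largest value
print(n_combinations, n_fits)
```

The grid is spaced evenly on a log scale because reasonable values of `C` and `gamma` span several orders of magnitude.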

In [None]:
# Defining the parameter space to do the search

param_grid = {
    "C": np.logspace(-3, 3, 25),  # from .001 to 1000
    "gamma": np.logspace(-3, 3, 25),
}

In [None]:
svc = SVC(kernel="rbf")
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring="accuracy")  # 1-misclassification rate
grid_search.fit(X, Y)

The following code prints the parameters that gave the best cross-validated performance.

In [None]:
print("Best parameters: {}".format(grid_search.best_params_))

Note that several different hyperparameter choices can tie for the highest cross-validated accuracy. The hyperparameters printed above are just one such choice.
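
To see every combination that ties for first place, one option is to filter the results on `rank_test_score`. The sketch below uses a deliberately small grid and its own variable names (`small_grid`, `search`, `cv_table`), all of which are assumptions made to keep the example fast and self-contained. Ties are exact only up to floating-point equality of the mean scores.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# A small illustrative grid (an assumption, not the grid used in this lab).
small_grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), small_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

cv_table = pd.DataFrame(search.cv_results_)

# rank_test_score equals 1 for every combination that shares the best mean score.
tied_for_best = cv_table.loc[
    cv_table["rank_test_score"] == 1, ["param_C", "param_gamma", "mean_test_score"]
]
print(tied_for_best)
print("Reported by best_params_:", search.best_params_)
```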

For a more comprehensive look, we can inspect `grid_search.cv_results_`. This is a dictionary of all the evaluation metrics from the grid search; we usually convert it to a pandas DataFrame to view it properly.

A good explanation of the different columns and their meanings can be found [here](https://stackoverflow.com/questions/54608088/what-is-gridsearch-cv-results-could-any-explain-all-the-things-in-that-i-e-me).
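
As a sketch of how these columns can be used, the example below reshapes `mean_test_score` into a table with one row per `gamma` and one column per `C`. The small grid and the names `search`, `cv_table`, and `accuracy_grid` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# A small illustrative grid (an assumption, not the grid used in this lab).
small_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), small_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

cv_table = pd.DataFrame(search.cv_results_)

# The parameter columns may be stored with object dtype, so cast them first.
cv_table["param_C"] = cv_table["param_C"].astype(float)
cv_table["param_gamma"] = cv_table["param_gamma"].astype(float)

# Rows: gamma, columns: C, entries: mean cross-validated accuracy.
accuracy_grid = cv_table.pivot(
    index="param_gamma", columns="param_C", values="mean_test_score"
)
print(accuracy_grid.round(3))
```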

In [None]:
results = pd.DataFrame(grid_search.cv_results_)
results.head()

For a variety of choices of gamma, let's look at how cross-validated performance varies with $C$.

In [None]:
# choose a few values of gamma
gamchoices = [param_grid["gamma"][5], param_grid["gamma"][8], param_grid["gamma"][10]]

for gam in gamchoices:
    subresults = results.loc[results["param_gamma"] == gam]
    # format gamma to 3 significant digits so the legend stays readable
    plt.plot(subresults.param_C, subresults.mean_test_score, label=f"$\\gamma={gam:.3g}$")

plt.ylabel("Accuracy")

plt.xlabel("C")
plt.legend(bbox_to_anchor=[1, 1])
plt.gca().set_xscale("log")

Note that $\gamma \approx 0.018$ and $\gamma=0.1$ both achieved the maximum cross-validated accuracy (98.67%).

This lab is adapted from [this GitHub notebook](https://github.com/jpcolino/IPython_notebooks/blob/master/Cross-Validation%20in%20SVM.ipynb).

## Problems

---

**Problem 1:** Generate Synthetic Data

Use `make_blobs` to generate an approximately linearly separable dataset with 100 samples in total, 2 classes, and 2 features. The cluster centers should lie within $(-4,4) \times (-4, 4)\subset \mathbb{R}^2$. Use a cluster standard deviation of 1.5 and random state 503.

Store the features in `X_blobs` and labels in `y_blobs`.

In [None]:
# generate a synthetic dataset for linear SVMs using make_blobs
from sklearn.datasets import make_blobs

# BEGIN SOLUTION
X_blobs, y_blobs = make_blobs(
    n_samples=100, centers=2, center_box=(-4, 4), random_state=503, cluster_std=1.5
)
# END SOLUTION

In [None]:
# Test assertions
assert X_blobs.shape == (100, 2), f"Expected shape (100, 2), got {X_blobs.shape}"
assert y_blobs.shape == (100,), f"Expected shape (100,), got {y_blobs.shape}"
assert set(y_blobs) == {0, 1}, f"Expected 2 classes (0 and 1), got {set(y_blobs)}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert X_blobs.min() > -10 and X_blobs.max() < 10, "Centers should be within reasonable bounds"
assert (y_blobs == 0).sum() == 50, "Expected 50 samples in class 0"
assert (y_blobs == 1).sum() == 50, "Expected 50 samples in class 1"
# END HIDDEN TESTS

---

**Problem 2:** Visualize the Data

Visualize your dataset with a scatter plot of the `X_blobs` features, colored according to the `y_blobs` labels. Class 0 should be orange and class 1 skyblue; store the list of colors in `colors_blobs`. Give the markers black edges and a transparency of 0.75. Add axis labels.

In [None]:
# visualize X_blobs, y_blobs, colors should be orange and skyblue
# BEGIN SOLUTION
colors_blobs = ["orange" if label == 0 else "skyblue" for label in y_blobs]
# scatter points
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=colors_blobs, alpha=0.75, edgecolors="black")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert "colors_blobs" in dir(), "colors_blobs variable should be defined"
assert len(colors_blobs) == len(y_blobs), "colors_blobs should have same length as y_blobs"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert colors_blobs[0] in ["orange", "skyblue"], "colors should be orange or skyblue"
# Verify color mapping is correct
for i in range(len(y_blobs)):
    expected_color = "orange" if y_blobs[i] == 0 else "skyblue"
    assert colors_blobs[i] == expected_color, f"Color at index {i} should be {expected_color}"
# END HIDDEN TESTS

---

**Problem 3:** Train a Linear SVM

Split the blobs data into training and test sets (80-20 split, random state 503). Train a linear SVM and report the test accuracy. Store your model in `svm_blobs` and the test accuracy in `accuracy_blobs`.

In [None]:
# BEGIN SOLUTION
X_blobs_train, X_blobs_test, y_blobs_train, y_blobs_test = train_test_split(
    X_blobs, y_blobs, test_size=0.2, random_state=503
)

svm_blobs = SVC(kernel="linear")
svm_blobs.fit(X_blobs_train, y_blobs_train)

accuracy_blobs = svm_blobs.score(X_blobs_test, y_blobs_test)
print(f"Test accuracy: {accuracy_blobs}")
# END SOLUTION

In [None]:
# Test assertions
assert X_blobs_train.shape[0] == 80, f"Expected 80 training samples, got {X_blobs_train.shape[0]}"
assert X_blobs_test.shape[0] == 20, f"Expected 20 test samples, got {X_blobs_test.shape[0]}"
assert hasattr(svm_blobs, "coef_"), "SVM should be fitted and have coef_ attribute"
assert 0 <= accuracy_blobs <= 1, "Accuracy should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert svm_blobs.kernel == "linear", "Should use linear kernel"
assert accuracy_blobs >= 0.9, "Accuracy should be at least 0.9 for this dataset"
# END HIDDEN TESTS

---

**Problem 4:** Plot Decision Boundary and Margins

Extract the weight vector and bias from `svm_blobs` (store them in `w_blobs` and `b_blobs`). Then plot the decision boundary
$$
w_1 x_1 + w_2 x_2 + b = 0
$$
as a black dashed line. Also plot the margins
$$
w_1 x_1 + w_2 x_2 + b = \pm 1
$$
as black dash-dot lines (`-.`).
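
To draw a line of the form $w_1 x_1 + w_2 x_2 + b = c$, we can solve for $x_2$ as a function of $x_1$. The sketch below does this for made-up values of the weights and bias (`w_demo` and `b_demo` are hypothetical, as is the helper `level_line`) and also computes the distance between the two margin lines, which is $2 / \lVert w \rVert$.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration.
w_demo = np.array([3.0, 4.0])
b_demo = -2.0


def level_line(x1, w, b, level):
    """Return x2 such that w[0]*x1 + w[1]*x2 + b == level (requires w[1] != 0)."""
    return (level - b - w[0] * x1) / w[1]


x1_demo = np.linspace(-1.0, 1.0, 5)
boundary = level_line(x1_demo, w_demo, b_demo, 0.0)
margin_plus = level_line(x1_demo, w_demo, b_demo, 1.0)
margin_minus = level_line(x1_demo, w_demo, b_demo, -1.0)

# The distance between the two margin lines is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w_demo)
print(margin_width)
```

Passing the level explicitly avoids sign mistakes: which margin line lies above the other in the plot depends on the sign of $w_2$.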

In [None]:
# BEGIN SOLUTION
w_blobs = svm_blobs.coef_[0]
b_blobs = svm_blobs.intercept_[0]

support_vectors_blobs = svm_blobs.support_vectors_

x_min, x_max = X_blobs[:, 0].min() - 1, X_blobs[:, 0].max() + 1
y_min, y_max = X_blobs[:, 1].min() - 1, X_blobs[:, 1].max() + 1

xx = np.linspace(x_min, x_max, 100)
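# Solve w1*x1 + w2*x2 + b = c for x2, with c = 0 (boundary) and c = -1, +1 (margins)
yy_boundary = -(w_blobs[0] * xx + b_blobs) / w_blobs[1]
yy_upper = -(w_blobs[0] * xx + b_blobs + 1) / w_blobs[1]  # level c = -1
yy_lower = -(w_blobs[0] * xx + b_blobs - 1) / w_blobs[1]  # level c = +1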

plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=colors_blobs, alpha=0.75, edgecolors="black")
plt.scatter(
    support_vectors_blobs[:, 0],
    support_vectors_blobs[:, 1],
    s=100,
    facecolors="none",
    edgecolors="black",
)
plt.plot(xx, yy_boundary, "k--")
plt.plot(xx, yy_upper, "k-.")
plt.plot(xx, yy_lower, "k-.")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert w_blobs.shape == (2,), f"Weight vector should have 2 components, got {w_blobs.shape}"
assert isinstance(b_blobs, float), "Bias should be a scalar"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(support_vectors_blobs) >= 2, "Should have at least 2 support vectors"
# END HIDDEN TESTS

---

**Problem 5:** Count Observations Within Margins

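Compute how many observations in `X_blobs` lie within the margins of `svm_blobs` (on or between the two margin lines), that is, satisfy $|w^\top x + b| \le 1$. Store the count in `obs_within_margins`.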

In [None]:
# Compute how many observations are within the margins
# BEGIN SOLUTION
obs_within_margins = (np.abs(svm_blobs.decision_function(X_blobs)) <= 1).sum()
print(f"Number of observations within the margins: {obs_within_margins}")
# END SOLUTION

In [None]:
# Test assertions
assert isinstance(obs_within_margins, int | np.integer), "Should be an integer"
assert obs_within_margins >= 0, "Count should be non-negative"
assert obs_within_margins <= len(X_blobs), "Count should not exceed total observations"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    obs_within_margins == 11
), f"Expected 11 observations within margins, got {obs_within_margins}"
# END HIDDEN TESTS

---

**Problem 6:** Generate Non-Linearly Separable Data

Use `make_circles` to generate a dataset that is not linearly separable. Your sample should have 100 data points, a noise level of 0.15, a scale factor of 0.25, and random state 503.

Store the features in `X_circles` and labels in `y_circles`.
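
To see what the scale factor controls, it can help to generate the circles without noise and measure the radii. In the sketch below, the names `X_clean` and `y_clean` are chosen for illustration; with `noise` left at its default, the points lie exactly on two circles, and `factor` is the ratio of the inner radius to the outer radius.

```python
import numpy as np
from sklearn.datasets import make_circles

# With noise left at its default (None), points lie exactly on the two circles.
X_clean, y_clean = make_circles(n_samples=100, factor=0.25, random_state=503)

radii = np.linalg.norm(X_clean, axis=1)
outer_radius = radii[y_clean == 0].mean()  # label 0 is the outer circle
inner_radius = radii[y_clean == 1].mean()  # label 1 is the inner circle
print(outer_radius, inner_radius)
```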

In [None]:
from sklearn.datasets import make_circles

# BEGIN SOLUTION
X_circles, y_circles = make_circles(n_samples=100, noise=0.15, factor=0.25, random_state=503)
# END SOLUTION

In [None]:
# Test assertions
assert X_circles.shape == (100, 2), f"Expected shape (100, 2), got {X_circles.shape}"
assert y_circles.shape == (100,), f"Expected shape (100,), got {y_circles.shape}"
assert set(y_circles) == {0, 1}, f"Expected 2 classes (0 and 1), got {set(y_circles)}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Data should roughly form concentric circles
inner_points = X_circles[y_circles == 1]
outer_points = X_circles[y_circles == 0]
inner_dist = np.sqrt((inner_points**2).sum(axis=1)).mean()
outer_dist = np.sqrt((outer_points**2).sum(axis=1)).mean()
assert inner_dist < outer_dist, "Inner circle should have smaller radius"
# END HIDDEN TESTS

---

**Problem 7:** Linear SVM on Circles Data

Split the circles data with an 80-20 train-test split. Train a linear SVM (store it as `svm_circles`). Plot the resulting linear boundary and margins. Scatter the points with orange for class 0 and skyblue for class 1. Then add a box to the bottom left of the plot with the test accuracy (store the accuracy in `test_accuracy_circles`).

In [None]:
# split data and train linear SVM
# BEGIN SOLUTION
X_circles_train, X_circles_test, y_circles_train, y_circles_test = train_test_split(
    X_circles, y_circles, test_size=0.2, random_state=503
)

svm_circles = SVC(kernel="linear")
svm_circles.fit(X_circles_train, y_circles_train)

test_accuracy_circles = svm_circles.score(X_circles_test, y_circles_test)

w_circles = svm_circles.coef_[0]
b_circles = svm_circles.intercept_[0]
support_vectors_circles = svm_circles.support_vectors_

x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5

xx = np.linspace(x_min, x_max, 100)
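# Solve w1*x1 + w2*x2 + b = c for x2, with c = 0 (boundary) and c = -1, +1 (margins)
yy_boundary = -(w_circles[0] * xx + b_circles) / w_circles[1]
yy_upper = -(w_circles[0] * xx + b_circles + 1) / w_circles[1]  # level c = -1
yy_lower = -(w_circles[0] * xx + b_circles - 1) / w_circles[1]  # level c = +1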

# visualize X_circles, y_circles, colors should be orange and skyblue
colors_circles = ["orange" if label == 0 else "skyblue" for label in y_circles]
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=colors_circles, alpha=0.75, edgecolors="black")
plt.scatter(
    support_vectors_circles[:, 0],
    support_vectors_circles[:, 1],
    s=100,
    facecolors="none",
    edgecolors="black",
)
plt.plot(xx, yy_boundary, "k--")
plt.plot(xx, yy_upper, "k-.")
plt.plot(xx, yy_lower, "k-.")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.text(0.05, 0.05, f"Test accuracy: {test_accuracy_circles:.2f}", transform=plt.gca().transAxes)
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert (
    X_circles_train.shape[0] == 80
), f"Expected 80 training samples, got {X_circles_train.shape[0]}"
assert hasattr(svm_circles, "coef_"), "SVM should be fitted"
assert 0 <= test_accuracy_circles <= 1, "Accuracy should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Linear SVM should perform poorly on circles data
assert test_accuracy_circles < 0.8, "Linear SVM should perform poorly on non-linear data"
# END HIDDEN TESTS

---

**Problem 8:** Kernel SVM and Quadratic Features

Fit a kernel SVM with a polynomial kernel of degree 2. Then fill in the quadratic features function. It should transform each row $(x_1, x_2)$ of our data matrix $X$ into
$$
\phi(x_1, x_2) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2)
$$
Then use this function to fit a linear SVM on this feature space. Print both methods' accuracies. They should match!
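
The reason the two approaches should agree is that the inner product of the mapped points equals a polynomial kernel: $\phi(x)^\top \phi(z) = (1 + x^\top z)^2$. The sketch below checks this identity numerically on two random points; the helper `phi` and the variables `x_demo`, `z_demo` are names chosen here for illustration. In scikit-learn, the polynomial kernel is `(gamma * <x, z> + coef0) ** degree`, so this identity corresponds to `gamma=1` and `coef0=1` (the default is `coef0=0`).

```python
import numpy as np


def phi(x):
    """Feature map from the problem statement for a single point x = (x1, x2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])


rng = np.random.default_rng(503)
x_demo, z_demo = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x_demo) @ phi(z_demo)  # inner product in the 6-dimensional feature space
rhs = (1.0 + x_demo @ z_demo) ** 2  # degree-2 polynomial kernel with gamma=1, coef0=1
print(np.isclose(lhs, rhs))
```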


In [None]:
def quadratic_features(features: np.ndarray) -> np.ndarray:
    """
    Generate quadratic features.

    Transforms each row (x1, x2) into:
    (1, sqrt(2)*x1, sqrt(2)*x2, x1^2, sqrt(2)*x1*x2, x2^2)
    """
    # BEGIN SOLUTION
    num_samples = features.shape[0]
    features_new = np.zeros((num_samples, 6))
    features_new[:, 0] = 1
    features_new[:, 1] = np.sqrt(2) * features[:, 0]
    features_new[:, 2] = np.sqrt(2) * features[:, 1]
    features_new[:, 3] = features[:, 0] ** 2
    features_new[:, 4] = np.sqrt(2) * features[:, 0] * features[:, 1]
    features_new[:, 5] = features[:, 1] ** 2
    return features_new
    # END SOLUTION

In [None]:
# Test assertions
test_input = np.array([[1.0, 2.0], [3.0, 4.0]])
test_output = quadratic_features(test_input)
assert test_output.shape == (2, 6), f"Expected shape (2, 6), got {test_output.shape}"
assert test_output[0, 0] == 1.0, "First column should be 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
expected_row0 = np.array([1, np.sqrt(2), 2 * np.sqrt(2), 1, 2 * np.sqrt(2), 4])
assert np.allclose(test_output[0], expected_row0), "Quadratic features computation incorrect"
# END HIDDEN TESTS

In [None]:
# Fit kernel SVM and linear SVM on quadratic features
# BEGIN SOLUTION
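# gamma=1 and coef0=1 give K(x, z) = (1 + x.z)^2, the kernel induced by quadratic_features
poly_svm = SVC(kernel="poly", degree=2, gamma=1, coef0=1)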
poly_svm.fit(X_circles_train, y_circles_train)

X_circles_train_quad = quadratic_features(X_circles_train)
X_circles_test_quad = quadratic_features(X_circles_test)

lin_feature_svm = SVC(kernel="linear")
lin_feature_svm.fit(X_circles_train_quad, y_circles_train)

poly_accuracy = poly_svm.score(X_circles_test, y_circles_test)
lin_feature_accuracy = lin_feature_svm.score(X_circles_test_quad, y_circles_test)

print(f"Polynomial SVM accuracy: {poly_accuracy}")
print(f"Linear SVM with quadratic features accuracy: {lin_feature_accuracy}")
# END SOLUTION

In [None]:
# Test assertions
assert hasattr(poly_svm, "support_vectors_"), "poly_svm should be fitted"
assert hasattr(lin_feature_svm, "coef_"), "lin_feature_svm should be fitted"
assert 0 <= poly_accuracy <= 1, "poly_accuracy should be between 0 and 1"
assert 0 <= lin_feature_accuracy <= 1, "lin_feature_accuracy should be between 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert np.isclose(poly_accuracy, lin_feature_accuracy), "Both methods should give same accuracy"
assert poly_accuracy >= 0.9, "Accuracy should be high for kernel SVM"
# END HIDDEN TESTS

---

**Problem 9:** Compare Decision Boundaries

Plot the decision boundary for both methods. Here is an outline:


1.   Create a meshgrid.
2.   Evaluate $w^\top\phi(x) + b$ on the meshgrid using `svc.decision_function` for both methods. Note that you have to pass the grid through `quadratic_features` for the second method.
3.   Scatter the data points.
4.   Use `plt.contour` to draw the decision boundary for the two methods.
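
The grid mechanics in steps 1 and 2 can be sketched with a toy function in place of a fitted model. Here `toy_decision_function` is a hypothetical stand-in for `decision_function`, and the grid sizes are arbitrary choices.

```python
import numpy as np


def toy_decision_function(points):
    """Hypothetical stand-in for `decision_function`: positive outside the unit circle."""
    return (points**2).sum(axis=1) - 1.0


# Step 1: a grid with 50 rows (y-axis) and 40 columns (x-axis).
gx, gy = np.meshgrid(np.linspace(-2, 2, 40), np.linspace(-2, 2, 50))

# Step 2: flatten to an (n_points, 2) array, evaluate, and restore the grid shape.
flat_points = np.c_[gx.ravel(), gy.ravel()]
values = toy_decision_function(flat_points).reshape(gx.shape)

print(flat_points.shape, values.shape)
```

With these arrays, `plt.contour(gx, gy, values, levels=[0])` would draw the curve where the function changes sign, which plays the role of the decision boundary.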



In [None]:
# Plot decision boundaries for both methods
# BEGIN SOLUTION
# 1. Create a mesh (grid) of points covering the region of X_circles
x_min, x_max = X_circles[:, 0].min() - 0.5, X_circles[:, 0].max() + 0.5
y_min, y_max = X_circles[:, 1].min() - 0.5, X_circles[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

# 2. Evaluate the decision function for poly_svm on this grid
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z_poly = poly_svm.decision_function(grid_points)

# 3. Evaluate the decision function for lin_feature_svm on transformed grid
Z_feature = lin_feature_svm.decision_function(quadratic_features(grid_points))

# Reshape results to match the shape of xx (for contour plotting)
Z_poly = Z_poly.reshape(xx.shape)
Z_feature = Z_feature.reshape(xx.shape)

# 4. Scatter the original data points
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=colors_circles, alpha=0.75, edgecolors="black")

# 5. Plot the zero-level contour (decision boundary) for each model
CS1 = plt.contour(xx, yy, Z_poly, levels=[0], colors="red", linestyles="--")
CS2 = plt.contour(xx, yy, Z_feature, levels=[0], colors="green", linestyles="-.")

plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Comparison of Decision Boundaries: Kernel vs. Manual Feature Mapping")
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert Z_poly.shape == xx.shape, "Z_poly should match meshgrid shape"
assert Z_feature.shape == xx.shape, "Z_feature should match meshgrid shape"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Decision boundaries should be similar (both are quadratic)
# Boundaries should have similar sign patterns (both classify similarly)
assert (np.sign(Z_poly) == np.sign(Z_feature)).mean() > 0.9, "Boundaries should be similar"
# END HIDDEN TESTS