# Lab 9: Support Vector Machines

In this lab we apply Support Vector Machines to predict which brand of orange juice a customer will buy. We'll use the [`OJ: Orange Juice Data`](https://www.rdocumentation.org/packages/ISLR2/versions/1.3-1/topics/OJ) data set. You can get the data file from [our course data folder](https://drive.google.com/drive/folders/1YBxtx7KILtcw6mFEkKmznaRASUHPLjI2?usp=sharing).

The data set has the following columns:

1. `Purchase` (the **target**): A categorical variable with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
1. `WeekofPurchase`: Week of purchase
1. `StoreID`: Store ID
1. `PriceCH`: Price charged for CH
1. `PriceMM`: Price charged for MM
1. `DiscCH`: Discount offered for CH
1. `DiscMM`: Discount offered for MM
1. `SpecialCH`: Indicator of special on CH
1. `SpecialMM`: Indicator of special on MM
1. `LoyalCH`: Customer brand loyalty for CH
1. `SalePriceMM`: Sale price for MM
1. `SalePriceCH`: Sale price for CH
1. `PriceDiff`: Sale price of MM less sale price of CH
1. `Store7`: A categorical variable with levels No and Yes indicating whether the sale is at Store 7
1. `PctDiscMM`: Percentage discount for MM
1. `PctDiscCH`: Percentage discount for CH
1. `ListPriceDiff`: List price of MM less list price of CH
1. `STORE`: Which of 5 possible stores the sale occurred at. It codes some store IDs using different numbers, e.g. 7 → 0.

The broad outline of the lab is as follows:

1. Explore, clean, and split the dataset
1. Train and examine a Support Vector Classifier
1. Evaluate various SVM kernels and search for the best hyper-parameter values using cross-validation
1. Test the best model on the test data



## Load, explore, and prepare dataset

First make sure that you have the `OJ.csv` file from the linked drive. You can either place it in a folder of your own Google Drive, mount the drive, and read it from there, or read it straight from its download URL as the cell below does.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# drive_url = 'https://drive.google.com/file/d/1wECUewbW2-OKfsaBo4Atydv-2H7J2kHK/view?usp=drive_link'
file_url = 'https://docs.google.com/uc?export=download&id=1wECUewbW2-OKfsaBo4Atydv-2H7J2kHK'

rawdata = pd.read_csv(file_url)
rawdata.head(10)

Unnamed: 0,Purchase,WeekofPurchase,StoreID,PriceCH,PriceMM,DiscCH,DiscMM,SpecialCH,SpecialMM,LoyalCH,SalePriceMM,SalePriceCH,PriceDiff,Store7,PctDiscMM,PctDiscCH,ListPriceDiff,STORE
0,CH,237,1,1.75,1.99,0.0,0.0,0,0,0.5,1.99,1.75,0.24,No,0.0,0.0,0.24,1
1,CH,239,1,1.75,1.99,0.0,0.3,0,1,0.6,1.69,1.75,-0.06,No,0.150754,0.0,0.24,1
2,CH,245,1,1.86,2.09,0.17,0.0,0,0,0.68,2.09,1.69,0.4,No,0.0,0.091398,0.23,1
3,MM,227,1,1.69,1.69,0.0,0.0,0,0,0.4,1.69,1.69,0.0,No,0.0,0.0,0.0,1
4,CH,228,7,1.69,1.69,0.0,0.0,0,0,0.956535,1.69,1.69,0.0,Yes,0.0,0.0,0.0,0
5,CH,230,7,1.69,1.99,0.0,0.0,0,1,0.965228,1.99,1.69,0.3,Yes,0.0,0.0,0.3,0
6,CH,232,7,1.69,1.99,0.0,0.4,1,1,0.972182,1.59,1.69,-0.1,Yes,0.201005,0.0,0.3,0
7,CH,234,7,1.75,1.99,0.0,0.4,1,0,0.977746,1.59,1.75,-0.16,Yes,0.201005,0.0,0.24,0
8,CH,235,7,1.75,1.99,0.0,0.4,0,0,0.982197,1.59,1.75,-0.16,Yes,0.201005,0.0,0.24,0
9,CH,238,7,1.75,1.99,0.0,0.4,0,0,0.985757,1.59,1.75,-0.16,Yes,0.201005,0.0,0.24,0


In [None]:
rawdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1070 entries, 0 to 1069
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Purchase        1070 non-null   object 
 1   WeekofPurchase  1070 non-null   int64  
 2   StoreID         1070 non-null   int64  
 3   PriceCH         1070 non-null   float64
 4   PriceMM         1070 non-null   float64
 5   DiscCH          1070 non-null   float64
 6   DiscMM          1070 non-null   float64
 7   SpecialCH       1070 non-null   int64  
 8   SpecialMM       1070 non-null   int64  
 9   LoyalCH         1070 non-null   float64
 10  SalePriceMM     1070 non-null   float64
 11  SalePriceCH     1070 non-null   float64
 12  PriceDiff       1070 non-null   float64
 13  Store7          1070 non-null   object 
 14  PctDiscMM       1070 non-null   float64
 15  PctDiscCH       1070 non-null   float64
 16  ListPriceDiff   1070 non-null   float64
 17  STORE           1070 non-null   int64  

There don't seem to be any missing values: all columns have 1070 non-null values, in 1070 records. There are a few other things that require our attention though:

1. `StoreID`, `SpecialCH`, and `SpecialMM` should be categorical variables. `Purchase` could be left as `object`, but storing it as a category is more efficient.
1. `Store7` and `STORE` contain the same information as `StoreID`, or a coarser version of it. We can drop these two derived columns.
1. `PriceDiff` and `ListPriceDiff` are derived from other price columns too; they can be dropped or kept as engineered features.
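
These claims about derived columns can be checked before dropping anything. The sketch below rebuilds the first five rows of the relevant columns by hand from the `head()` output above, so it runs without the download; the name `sample` is introduced here only for illustration. On the full data you would run the same comparisons on `rawdata`.

```python
import numpy as np
import pandas as pd

# First five rows of the columns involved, copied from the head() output above.
sample = pd.DataFrame({
    'StoreID':       [1, 1, 1, 1, 7],
    'Store7':        ['No', 'No', 'No', 'No', 'Yes'],
    'PriceCH':       [1.75, 1.75, 1.86, 1.69, 1.69],
    'PriceMM':       [1.99, 1.99, 2.09, 1.69, 1.69],
    'SalePriceCH':   [1.75, 1.75, 1.69, 1.69, 1.69],
    'SalePriceMM':   [1.99, 1.69, 2.09, 1.69, 1.69],
    'PriceDiff':     [0.24, -0.06, 0.40, 0.00, 0.00],
    'ListPriceDiff': [0.24, 0.24, 0.23, 0.00, 0.00],
})

# Store7 should be "Yes" exactly when StoreID is 7.
store7_ok = ((sample['StoreID'] == 7) == (sample['Store7'] == 'Yes')).all()

# PriceDiff should equal sale price of MM minus sale price of CH.
price_diff_ok = np.allclose(sample['PriceDiff'],
                            sample['SalePriceMM'] - sample['SalePriceCH'])

# ListPriceDiff should equal list price of MM minus list price of CH.
list_diff_ok = np.allclose(sample['ListPriceDiff'],
                           sample['PriceMM'] - sample['PriceCH'])

print(store7_ok, price_diff_ok, list_diff_ok)
```

Five rows cannot prove the relationship for the whole file, so treat this as a template for the check rather than as evidence.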

Let's make these changes first.

In [None]:
data = rawdata.astype({'Purchase': 'category', 'StoreID':'category', 'SpecialCH':'category', 'SpecialMM':'category', })
data.drop(['Store7', 'STORE'], axis=1, inplace=True)
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1070 entries, 0 to 1069
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Purchase        1070 non-null   category
 1   WeekofPurchase  1070 non-null   int64   
 2   StoreID         1070 non-null   category
 3   PriceCH         1070 non-null   float64 
 4   PriceMM         1070 non-null   float64 
 5   DiscCH          1070 non-null   float64 
 6   DiscMM          1070 non-null   float64 
 7   SpecialCH       1070 non-null   category
 8   SpecialMM       1070 non-null   category
 9   LoyalCH         1070 non-null   float64 
 10  SalePriceMM     1070 non-null   float64 
 11  SalePriceCH     1070 non-null   float64 
 12  PriceDiff       1070 non-null   float64 
 13  PctDiscMM       1070 non-null   float64 
 14  PctDiscCH       1070 non-null   float64 
 15  ListPriceDiff   1070 non-null   float64 
dtypes: category(4), float64(11), int64(1)
memory usage: 105.2 KB

Unnamed: 0,Purchase,WeekofPurchase,StoreID,PriceCH,PriceMM,DiscCH,DiscMM,SpecialCH,SpecialMM,LoyalCH,SalePriceMM,SalePriceCH,PriceDiff,PctDiscMM,PctDiscCH,ListPriceDiff
0,CH,237,1,1.75,1.99,0.0,0.0,0,0,0.5,1.99,1.75,0.24,0.0,0.0,0.24
1,CH,239,1,1.75,1.99,0.0,0.3,0,1,0.6,1.69,1.75,-0.06,0.150754,0.0,0.24
2,CH,245,1,1.86,2.09,0.17,0.0,0,0,0.68,2.09,1.69,0.4,0.0,0.091398,0.23
3,MM,227,1,1.69,1.69,0.0,0.0,0,0,0.4,1.69,1.69,0.0,0.0,0.0,0.0
4,CH,228,7,1.69,1.69,0.0,0.0,0,0,0.956535,1.69,1.69,0.0,0.0,0.0,0.0


Let's examine the outcome distribution for any significant class imbalance.

In [None]:
data['Purchase'].value_counts(normalize=True)


Unnamed: 0_level_0,proportion
Purchase,Unnamed: 1_level_1
CH,0.61028
MM,0.38972


There doesn't seem to be any severe imbalance: the classes are split roughly 61/39. We can use `accuracy` to measure performance.
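
When using accuracy, it helps to know what a trivial classifier would score. The sketch below is an illustration, not part of the original lab: it rebuilds the label vector from the proportions printed above (0.61028 of 1070 records is 653 `CH`, leaving 417 `MM`) and scores a majority-class baseline with scikit-learn's `DummyClassifier`. The names `y_all` and `X_dummy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the proportions above (assumption: 653 CH and 417 MM).
y_all = np.array(['CH'] * 653 + ['MM'] * 417)
X_dummy = np.zeros((len(y_all), 1))  # the baseline ignores the features

# Always predict the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_all)
baseline_acc = baseline.score(X_dummy, y_all)
print(f'Majority-class baseline accuracy: {baseline_acc:.3f}')
```

Any model worth keeping should beat this baseline by a clear margin.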

In a typical ML project we should examine histograms and scatter plots to understand the data a little better. We'll skip that in this lab to stay focused on SVMs, but you should do it in your own projects.


Next, separate the `X` from `y`, then split all into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split
X = data.drop('Purchase', axis=1) # separate X ...
y = data['Purchase'].copy()       # from y
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = .2, random_state=0) #split all
train_X.shape, test_X.shape, train_y.shape, test_y.shape # check sizes

((856, 15), (214, 15), (856,), (214,))
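
The split above is purely random. If you wanted the training and test sets to keep the same class ratio as the full data, `train_test_split` accepts a `stratify` argument. The sketch below is a possible variant, not what the lab uses; it works on a label vector rebuilt from the proportions shown earlier (assumed to be 653 `CH` and 417 `MM`), and `X_idx` is just a placeholder feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class proportions shown earlier (an assumption).
y_all = np.array(['CH'] * 653 + ['MM'] * 417)
X_idx = np.arange(len(y_all)).reshape(-1, 1)  # placeholder features

# stratify=y_all keeps the CH/MM ratio nearly identical in both parts.
_, _, y_tr, y_te = train_test_split(
    X_idx, y_all, test_size=0.2, random_state=0, stratify=y_all)

train_share = (y_tr == 'CH').mean()
test_share = (y_te == 'CH').mean()
print(f'CH share: train {train_share:.3f}, test {test_share:.3f}')
```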

## Train and examine an SVM

We should create the standard preprocessing pipeline that we have seen in the previous labs: potential imputation followed by standardization for numeric variables and OneHotEncoding for categorical variables.

SVMs are sensitive to the scales of the variables: like k-nearest neighbor classifiers, they are distance-based. Features with larger magnitude and variance dominate the distance calculation, so SVMs don't work well when variables are on very different scales. Hence, standardization is important.
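
To make the scale problem concrete, the sketch below takes `WeekofPurchase` and `LoyalCH` for the first five rows shown earlier and measures how much of the squared Euclidean distance between two customers comes from the week column, before and after standardization. The helper `week_share` is a hypothetical name introduced for this illustration.

```python
import numpy as np

# WeekofPurchase and LoyalCH for the first five rows of the head() output.
X_raw = np.array([
    [237, 0.5],
    [239, 0.6],
    [245, 0.68],
    [227, 0.4],
    [228, 0.956535],
])

def week_share(X, i, j):
    """Fraction of the squared distance between rows i and j due to column 0."""
    sq_diff = (X[i] - X[j]) ** 2
    return sq_diff[0] / sq_diff.sum()

# Standardize each column to mean 0 and standard deviation 1,
# which is the transformation StandardScaler applies.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_share = week_share(X_raw, 0, 2)
std_share = week_share(X_std, 0, 2)
print(f'week share of squared distance: raw {raw_share:.3f}, standardized {std_share:.3f}')
```

On the raw values the week column accounts for almost the entire distance, so loyalty barely matters; after standardization both features contribute on a comparable scale.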

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector

from sklearn import set_config
set_config(display='diagram') # shows the pipeline graphically when printed

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('cat_encoder', OneHotEncoder())
    ])

prep_pipeline = ColumnTransformer([
    ('num', num_pipeline, make_column_selector(dtype_include=np.number)),
    ('cat', cat_pipeline, make_column_selector(dtype_include='category'))
])

prep_pipeline

In the following block we examine the attributes of the trained SVM (e.g., the support vectors). These attributes are returned as NumPy arrays, not pandas data frames.

In [None]:
# @title Check support vectors {run: "auto"}

from sklearn.svm import SVC

c = 10 # @param {type:"slider", min:1, max:10}

lin_svc = SVC(C=c, kernel='poly', degree=1)

svm_pipeline = Pipeline([
    ("preprocessing", prep_pipeline),
    ("svm", lin_svc),
])

svm_pipeline.fit(train_X, train_y)

# Print the number of support vectors for each class

print('Number of support vectors for each class:',
      dict(zip(lin_svc.classes_, lin_svc.n_support_)))
# See how the number of support vectors changes as you increase C.

# You can also print the records that are the support vectors.
# print(lin_svc.support_vectors_)

Number of support vectors for each class: {'CH': np.int32(177), 'MM': np.int32(174)}
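
Because `support_vectors_` is a NumPy array of transformed features, it can be hard to read. The fitted `SVC` also exposes `support_`, the row positions of the support vectors in the training data, which can be used to look up the original rows. The sketch below shows the idea on a small synthetic frame; `toy_X`, `toy_y`, and `toy_svc` are hypothetical names, not objects from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Small synthetic two-feature problem standing in for the lab data.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(40, 2)), columns=['f1', 'f2'])
toy_y = np.where(toy_X['f1'] + toy_X['f2'] > 0, 'CH', 'MM')

toy_svc = SVC(kernel='linear', C=1).fit(toy_X, toy_y)

# support_ holds the positions of the support vectors in the training data,
# so iloc recovers them as labelled DataFrame rows.
sv_rows = toy_X.iloc[toy_svc.support_]
print(sv_rows.head())
```

In the lab's pipeline the analogous lookup would be `train_X.iloc[lin_svc.support_]`, which returns the untransformed records, since the preprocessing step keeps the row order.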


What happens to the number of support vectors when you increase/decrease $C$? Why?
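
One way to investigate this question empirically is to refit the classifier over a range of `C` values and tabulate the support-vector counts. The sketch below does so on synthetic data from `make_classification` (a stand-in, so it runs without the lab data); the dataset settings and the name `n_sv_by_C` are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, partially overlapping two-class data (stand-in for the OJ data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    class_sep=0.8, random_state=0)

# Fit one linear SVM per C value and record the total number of support vectors.
n_sv_by_C = {}
for C in [0.01, 0.1, 1, 10, 100]:
    svc = SVC(kernel='linear', C=C).fit(X_syn, y_syn)
    n_sv_by_C[C] = int(svc.n_support_.sum())

print(n_sv_by_C)
```

Compare the trend you see here with what the slider shows on the orange juice data, and relate it to how `C` trades margin width against margin violations.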

Now let's check the accuracy of the predictions on test data.

In [None]:
from sklearn.metrics import accuracy_score

print(f'Test accuracy was {accuracy_score(test_y, svm_pipeline.predict(test_X)):.2f}' )

Test accuracy was 0.82


## Tune SVMs

Various kernels in SVMs require tuning to perform well. We can use one of the search strategies we have learnt to do that. Let's start with the grid search.

In [None]:
from sklearn.model_selection import GridSearchCV

svm_pipeline = Pipeline([
    ("preprocessing", prep_pipeline),
    ("svm", SVC()),
])

param_grid = [
  {'svm__kernel': ['linear'], 'svm__C': [1, 10, 100, 1000]},
  {'svm__kernel': ['rbf'], 'svm__C': [1, 10, 100, 1000], 'svm__gamma': [0.001, 0.0001]},
  {'svm__kernel': ['poly'], 'svm__C': [1, 10, 100, 1000], 'svm__gamma': [0.001, 0.0001], 'svm__degree': [2, 3, 4]},
]
# Notice the list-of-dictionaries syntax: it lets us explore a different set of parameters for each kernel.
# The grid search goes through the dictionaries in order and, for each one, evaluates all hyper-parameter combinations.
# Random search allows something similar: instead of lists of hyper-parameter values, it can take distributions.
# With random search, a dictionary is first sampled uniformly from the list at each iteration, then the
# hyper-parameters within it are sampled from their specified distributions. BayesSearchCV, on the other hand,
# draws n_iter samples for **each** dictionary.

grid_search = GridSearchCV(svm_pipeline, param_grid, cv=3, scoring='accuracy')
grid_search.fit(train_X, train_y)

grid_cv_res = pd.DataFrame(grid_search.cv_results_)
grid_cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
grid_cv_res.filter(regex = '(^param_|mean_test_score)', axis=1)

Unnamed: 0,param_svm__C,param_svm__kernel,param_svm__gamma,param_svm__degree,mean_test_score
0,1,linear,,,0.836474
10,1000,rbf,0.001,,0.836474
8,100,rbf,0.001,,0.836474
1,10,linear,,,0.834139
11,1000,rbf,0.0001,,0.834139
2,100,linear,,,0.831804
3,1000,linear,,,0.830634
6,10,rbf,0.001,,0.83063
9,100,rbf,0.0001,,0.829461
30,1000,poly,0.001,2.0,0.789752


As the conventional wisdom would suggest, the linear SVM seems to be working well in this setting along with RBF. There is a separate classifier called `LinearSVC` that contains an optimized implementation of the linear SVM. Polynomial kernels don't work as well as the others in this setting.

(Why do we have NaNs in the above results? Because some hyperparameters aren't applicable/present for some kernels, e.g., `gamma` isn't present for linear kernel.)
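
To see exactly which candidates the grid contains, and why some columns are empty, the list of dictionaries can be expanded with scikit-learn's `ParameterGrid`. This is a small sketch that re-declares the same `param_grid` so it runs on its own.

```python
from collections import Counter
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'svm__kernel': ['linear'], 'svm__C': [1, 10, 100, 1000]},
    {'svm__kernel': ['rbf'], 'svm__C': [1, 10, 100, 1000],
     'svm__gamma': [0.001, 0.0001]},
    {'svm__kernel': ['poly'], 'svm__C': [1, 10, 100, 1000],
     'svm__gamma': [0.001, 0.0001], 'svm__degree': [2, 3, 4]},
]

# Expand every dictionary into its individual hyper-parameter combinations.
candidates = list(ParameterGrid(param_grid))
per_kernel = Counter(c['svm__kernel'] for c in candidates)

print(len(candidates), dict(per_kernel))
print(candidates[0])   # a linear candidate: no gamma or degree key at all
print(candidates[30])  # the poly candidate listed with index 30 above
```

The linear candidates simply have no `svm__gamma` or `svm__degree` entry, which is what shows up as a missing value in `cv_results_`.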

In [None]:
# We'll work with the best model obtained from grid search.
model = grid_search.best_estimator_

# What is the accuracy when we apply it to the test data we set aside near the beginning?
pred_y = model.predict(test_X)
print('Test accuracy of the best model is %.2f.' % accuracy_score(test_y, pred_y))

Test accuracy of the best model is 0.82.


***
**Exercise**

SVMs have many hyper-parameters that can take values in a wide range, and their performance depends quite a bit on choosing the right values. To make matters even more interesting, different kernels take different hyper-parameters. Thus, SVM tuning is a prime candidate for randomized search.

1. Apply `RandomizedSearchCV` to select parameters for our classification exercise. Can you get a better classifier than what grid search found?
1. Then use `HalvingRandomSearchCV` to further increase the exploration. Use `loguniform` distribution for `C` and `gamma`, and `randint` for `degree`.
1. We also learnt about intelligently searching for hyper-parameter values using Bayesian search. Use `BayesSearchCV` from `scikit-optimize` to see if we can get an even better model. See the [scikit-optimize page](https://scikit-optimize.github.io/stable/modules/space.html) for how to specify a log-uniform distribution.

***
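
Before running the searches, it may help to see why a log-uniform distribution suits `C` and `gamma`. The sketch below draws samples from `scipy.stats.loguniform` over the range 1e-3 to 1e3 and counts how many fall in each decade; the sample size and seed are arbitrary choices for the illustration.

```python
import numpy as np
from scipy.stats import loguniform

# Draw candidate values for C between 1e-3 and 1e3.
samples = loguniform(1e-3, 1e3).rvs(size=10_000, random_state=0)

# Count samples per decade: [1e-3, 1e-2), [1e-2, 1e-1), ..., [1e2, 1e3).
decades = np.floor(np.log10(samples)).astype(int)
counts = {d: int((decades == d).sum()) for d in range(-3, 3)}
share_below_one = float((samples < 1).mean())

print(counts)
print(f'share of samples below 1: {share_below_one:.2f}')
```

Each decade receives roughly the same number of draws, so small and large values are explored equally. A plain uniform distribution on the same range would place almost all of its samples above 100.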


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform, randint

# Base SVM pipeline
svm_pipeline = Pipeline([
    ("preprocessing", prep_pipeline),
    ("svm", SVC())
])

# Randomized search parameter distributions
param_distributions = {
    "svm__kernel": ["linear", "rbf", "poly"],
    "svm__C": loguniform(1e-3, 1e3),
    "svm__gamma": loguniform(1e-4, 1e0),
    "svm__degree": randint(2, 6)
}

rnd_search = RandomizedSearchCV(
    svm_pipeline,
    param_distributions,
    n_iter=40,
    cv=3,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1
)

rnd_search.fit(train_X, train_y)

print("Best parameters (RandomizedSearchCV):", rnd_search.best_params_)
print("Best CV score:", rnd_search.best_score_)


Best parameters (RandomizedSearchCV): {'svm__C': np.float64(14.747200953924711), 'svm__degree': 5, 'svm__gamma': np.float64(0.5233480488540082), 'svm__kernel': 'linear'}
Best CV score: 0.8364781417412996


In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

halving_search = HalvingRandomSearchCV(
    svm_pipeline,
    param_distributions,
    factor=3,
    random_state=42,
    scoring="accuracy",
    n_jobs=-1
)

halving_search.fit(train_X, train_y)

print("Best parameters (HalvingRandomSearchCV):", halving_search.best_params_)
print("Best CV score:", halving_search.best_score_)


Best parameters (HalvingRandomSearchCV): {'svm__C': np.float64(0.22200545479787925), 'svm__degree': 3, 'svm__gamma': np.float64(0.04467752817973906), 'svm__kernel': 'rbf'}
Best CV score: 0.8172031844929041


In [None]:
%pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl.metadata (9.7 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-25.7.0-py3-none-any.whl.metadata (12 kB)
Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
Downloading pyaml-25.7.0-py3-none-any.whl (26 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-25.7.0 scikit-optimize-0.10.2


In [None]:
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

bayes_search = BayesSearchCV(
    estimator=svm_pipeline,
    search_spaces={
        "svm__kernel": Categorical(["linear", "rbf", "poly"]),
        "svm__C": Real(1e-3, 1e3, prior="log-uniform"),
        "svm__gamma": Real(1e-4, 1e0, prior="log-uniform"),
        "svm__degree": Integer(2, 6)
    },
    n_iter=40,
    cv=3,
    random_state=42,
    scoring="accuracy",
    n_jobs=-1
)

bayes_search.fit(train_X, train_y)

print("Best parameters (BayesSearchCV):", bayes_search.best_params_)
print("Best CV score:", bayes_search.best_score_)

Best parameters (BayesSearchCV): OrderedDict({'svm__C': 25.361101250718455, 'svm__degree': 6, 'svm__gamma': 0.00045126194876216193, 'svm__kernel': 'linear'})
Best CV score: 0.8364781417412996


In [None]:
print("Grid Search best score:", grid_search.best_score_)
print("Random Search best score:", rnd_search.best_score_)
print("Halving Random Search best score:", halving_search.best_score_)
print("Bayes Search best score:", bayes_search.best_score_)

Grid Search best score: 0.8364740522635259
Random Search best score: 0.8364781417412996
Halving Random Search best score: 0.8172031844929041
Bayes Search best score: 0.8364781417412996


In [None]:
print("Grid:", grid_search.best_params_)
print("Random:", rnd_search.best_params_)
print("Halving:", halving_search.best_params_)
print("Bayes:", bayes_search.best_params_)

Grid: {'svm__C': 1, 'svm__kernel': 'linear'}
Random: {'svm__C': np.float64(14.747200953924711), 'svm__degree': 5, 'svm__gamma': np.float64(0.5233480488540082), 'svm__kernel': 'linear'}
Halving: {'svm__C': np.float64(0.22200545479787925), 'svm__degree': 3, 'svm__gamma': np.float64(0.04467752817973906), 'svm__kernel': 'rbf'}
Bayes: OrderedDict({'svm__C': 25.361101250718455, 'svm__degree': 6, 'svm__gamma': 0.00045126194876216193, 'svm__kernel': 'linear'})


Three of the four search methods (Grid Search, Randomized Search, and Bayesian Optimization) reached essentially the same best cross-validated accuracy, about 0.8365. Halving Random Search was the exception at 0.8172. Its score is not strictly comparable, though: the halving cell does not set `cv`, so it uses the default 5-fold split rather than 3-fold, and its final-round candidates may be scored on a subset of the training data.

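Although every search explored several kernel types, Grid Search, Randomized Search, and BayesSearchCV all selected a linear kernel, with `C` ranging from 1 to about 25. The `gamma` and `degree` values reported alongside a linear kernel are ignored by `SVC`, so they carry no meaning there. In the grid search, two RBF configurations tied with the best linear model (0.836474), so nonlinear kernels matched the linear one but did not beat it on this dataset.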

Halving Random Search selected an RBF kernel with a low `C` (about 0.22) and a moderate `gamma` (about 0.045), and its best score was noticeably lower. One likely reason is its successive-halving design: candidates are eliminated based on scores computed on small subsets of the data, so a configuration that would have done well on the full training set can be discarded early.
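
The elimination mechanism can be sketched in a few lines. The function below is a simplified model of successive halving, not scikit-learn's exact scheduling logic: each round keeps roughly the top `1/factor` of the candidates and gives the survivors `factor` times more samples. The starting values (27 candidates, 20 samples) are hypothetical; 856 is the training-set size from the split above.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return (n_candidates, n_samples) per round for a simplified successive halving."""
    schedule = []
    n, r = n_candidates, min_resources
    while n > 1 and r <= max_resources:
        schedule.append((n, r))
        n = math.ceil(n / factor)  # keep roughly the top 1/factor candidates
        r *= factor                # give the survivors factor times more samples
    return schedule

schedule = halving_schedule(n_candidates=27, min_resources=20,
                            max_resources=856, factor=3)
for round_no, (n, r) in enumerate(schedule):
    print(f'round {round_no}: {n} candidates scored on {r} samples')
```

In this toy schedule 24 of the 27 candidates are dropped after being scored on only 20 or 60 samples, and the winner is picked among 3 candidates scored on 180 of the 856 training samples. Scores on such small subsets are noisy, which is how a good configuration can be lost early.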

Overall, a linear SVM was the best or tied-for-best model in three of the four searches, which suggests that the classes are close to linearly separable in the standardized, one-hot-encoded feature space. More complex kernels did not improve cross-validated accuracy here. On the held-out test set, the grid-search model reached an accuracy of 0.82.