# SLU12 - Support Vector Machines (SVM): Exercise notebook

In this notebook we will be covering the following:


* Hyperplanes
* Maximal Margin Classifier
* Support Vector Classifier
* Support Vector Machine
* Multi-Class extension
* Support Vector Regression

New tools in this unit:

* [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

In [None]:
import pandas as pd
import numpy as np
from hashlib import sha256
import json

import sklearn
# These will be needed to prepare the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Seed for reproducibility
np.random.seed(42)

Lisbon has been a hot tourism destination in recent years. The housing market is out of control, and many visitors are paying generous amounts of money to spend a few nights in this wonderful city. Having made a fortune using your data science skills, you wonder how you could get in on the action too. You decide that buying some houses and listing them on Airbnb would be a good idea. Before you start your shopping spree, however, you wonder how you could predict which rooms will bring in the big bucks. To investigate this, you turn to the Airbnb Lisbon rooms dataset. You first load the dataset.

In [None]:
airbnb_df = pd.read_csv("data/airbnb.csv")
print(airbnb_df.shape)
airbnb_df.sample(5)

You realize that the columns room_id and host_id are probably not very useful for your purpose, so you decide to drop them. You also use the pandas get_dummies function to convert the categorical variables into dummy (one-hot) columns (you will learn more about this function in SLU15).

In [None]:
airbnb_df = airbnb_df.drop(["room_id", "host_id"], axis=1)
airbnb_df = pd.get_dummies(airbnb_df)
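
As a rough illustration of what get_dummies does, here is a minimal sketch on a made-up frame. The column names and values below are hypothetical and are not taken from the Airbnb dataset; the point is only that each text category becomes its own indicator column while numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data, not the real Airbnb columns
toy_rooms = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "reviews": [3, 12, 0],
})

# Text columns are expanded into one indicator column per category
toy_encoded = pd.get_dummies(toy_rooms)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```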

You decide that you are interested in houses that bring in no less than 64€ per night, so you would like to know which rooms are not interesting (low price) and which rooms are (high price). You proceed to bin the price column into these two categories. Note that price_target is True for the low-price rooms (price below 64€) and False otherwise.

In [None]:
airbnb_df["price_target"] = airbnb_df.price < 64
airbnb_df.price_target.value_counts()
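
The comparison above yields a boolean column. A minimal sketch with made-up prices (the values are hypothetical) shows how the threshold behaves, including a price exactly at the 64€ boundary:

```python
import pandas as pd

# Hypothetical nightly prices in euros
toy_prices = pd.Series([30.0, 64.0, 120.0, 45.0])

# True marks rooms strictly below the threshold; 64.0 itself is not below it
toy_target = toy_prices < 64

print(toy_target.tolist())
print(toy_target.value_counts())
```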

Since the target variable is binary, you are faced with a binary classification problem. You remember that really cool class you had about Support Vector Machines, and so decide to give them a shot.

In order to properly train and evaluate your models, you split your dataset into a train set and a test set.

In [None]:
# Create the features matrix X and target vector y
X = airbnb_df.drop([col for col in airbnb_df.columns if "price" in col], axis=1)
y = airbnb_df["price_target"]
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVMs are not scale invariant, so you scale your data beforehand
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print("X_train of shape ", X_train.shape)
print("y_train of shape ", y_train.shape)
print("X_test of shape  ", X_test.shape)
print("y_test of shape  ", y_test.shape)
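
The scaler in the cell above is fitted on the training data only and then reused on the test data. A small sketch on synthetic arrays (the data and the toy_ names are assumptions made for illustration) shows what that means in practice: the training features end up with zero mean and unit variance, while the test features are shifted and scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a train/test split
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train mean and std

# Train columns are standardized exactly; test columns only approximately
print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_test_scaled.mean(axis=0), toy_test_scaled.std(axis=0))
```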

## Exercise 1: Support Vector Classifier


1.1) Use a support vector classifier to predict the price_target of an Airbnb room.

In [None]:
# Create an SVC estimator using sklearn with a linear kernel 
# train it on the data 
# assign your trained estimator to linear_svc
# linear_svc = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
svc_argument_hash = '7f2fe580edb35154041fa3d4b41dd6d3adaef0c85d2ff6309f1d4b520eeecda3'

assert isinstance(linear_svc, sklearn.svm.SVC)
assert svc_argument_hash == sha256(linear_svc.kernel.encode()).hexdigest()
np.testing.assert_almost_equal(linear_svc.score(X_test, y_test), 0.7718171514922554)
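
The check cells in this notebook compare SHA-256 digests rather than raw values. The sketch below uses a made-up string and a made-up array to show how such a check behaves: equal bytes give equal digests, and an array with the same numbers but a different dtype gives a different digest, so the type of your answer matters as well as its values.

```python
from hashlib import sha256

import numpy as np

# Strings must be encoded to bytes before hashing
toy_digest = sha256("example".encode()).hexdigest()
print(len(toy_digest))  # a SHA-256 hex digest has 64 characters

# NumPy arrays can be hashed directly through their underlying bytes
toy_ints = np.array([1, 2, 3])
same_bytes = sha256(toy_ints).hexdigest() == sha256(toy_ints.copy()).hexdigest()

# Same numbers, different dtype: the bytes differ, so the digest differs
toy_floats = toy_ints.astype(float)
different_dtype = sha256(toy_ints).hexdigest() == sha256(toy_floats).hexdigest()

print(same_bytes, different_dtype)
```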

1.2) Obtain the support vectors for the above classifier

In [None]:
# Obtain the support vectors for the classifier defined in 1.1
# assign the result to a variable s_vectors
# s_vectors = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
s_vectors_hash = '8eafa96d827117328be11fe4b283186ff7c11a56c266b9187100897018b13eaa'
assert sha256(s_vectors).hexdigest() == s_vectors_hash

1.3) Obtain the number of support vectors for each class

In [None]:
# Obtain the number of support vectors for each class of the target variable
# assign the result to n_s_vectors, which should be an array whose first element
# is the number of support vectors of the first class (False) and the second
# element the number of support vectors of the second class (True)
# n_s_vectors = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
n_s_vectors_hash = '909b8b8a7a47496625b6e225ae349f822edfd304db21701674578796e8f5a64f'
assert sha256(n_s_vectors).hexdigest() == n_s_vectors_hash

1.4) Create a new SVC estimator that allows for, at most, 10 training observations to be on the wrong side of the decision hyperplane

In [None]:
# Create a new estimator that allows for, at most, 10 training observations to
# be on the wrong side of the decision hyperplane and train it on the data
# assign the result to linear_svc_10
# linear_svc_10 = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
svc_parameters_hash = '30b1cb69434644c295d3a380c3bfff162404ca811c1620729eb6e2c53d0e98d8'

assert isinstance(linear_svc_10, sklearn.svm.SVC)
assert svc_parameters_hash == sha256(json.dumps(linear_svc_10.get_params()).encode()).hexdigest()
np.testing.assert_almost_equal(linear_svc_10.score(X_test, y_test), 0.7718171514922554)

## Exercise 2: Support Vector Machines
Having tried the Support Vector Classifier, you turn to Support Vector Machines to see if they can improve the performance of your classifier. You wonder which kernel you should use, and decide to start with the radial kernel.
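
As a reminder of what the radial kernel computes, it scores the similarity of two points x and z as exp(-gamma * ||x - z||^2). The sketch below evaluates that formula by hand on made-up points and compares it with scikit-learn's rbf_kernel helper; the points and the gamma value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Arbitrary toy points and an arbitrary gamma
toy_a = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_b = np.array([[1.0, 0.0]])
toy_gamma = 0.5

# Squared Euclidean distance between every row of toy_a and every row of toy_b
sq_dist = ((toy_a[:, None, :] - toy_b[None, :, :]) ** 2).sum(axis=-1)
manual_kernel = np.exp(-toy_gamma * sq_dist)

library_kernel = rbf_kernel(toy_a, toy_b, gamma=toy_gamma)
print(manual_kernel)
print(np.allclose(manual_kernel, library_kernel))
```

Points that are close together get a value near 1, and the value decays towards 0 as the distance grows; a larger gamma makes that decay faster.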

2.1) Create an SVM with a radial kernel and train it

In [None]:
# Use an SVM with a radial kernel to create predictions
# Begin by creating the estimator
# then train it on the data
# assign your estimator to the variable radial_svm
# radial_svm = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
radial_parameters_hash = '889e3c8bf0e720c1735b640e5e14d9f846fc5da1e2f2c3723da5c9312d8aa63a'

assert isinstance(radial_svm, sklearn.svm.SVC)
assert radial_parameters_hash == sha256(json.dumps(radial_svm.get_params()).encode()).hexdigest()
np.testing.assert_almost_equal(radial_svm.score(X_test, y_test), 0.7778617302606725)

2.2) Create predictions using the radial SVM

In [None]:
# Use your SVM with radial kernel to create predictions on the test set
# assign your predictions to the variable radial_preds
# radial_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
r_preds_hash = 'bc49863cef11a471becd822019573f0106bbc04706fa25755dae2f19d5497727'

assert r_preds_hash == sha256(radial_preds).hexdigest()

2.3) Create an SVM with polynomial kernel of degree 2. Fit the model to the data and create new predictions

In [None]:
# Use an SVM with a polynomial kernel to create predictions
# Begin by creating the estimator
# then train it on the data
# assign your estimator to the variable poly_svm
# and its predictions to the variable poly_preds
# poly_svm = ...
# poly_preds = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
poly_parameters_hash = 'b40f0814699df168646e5df316a68e51f2429dc35ac794ec2a9672b1f7f49d7c'
poly_preds_hash = 'fb35d41fedf36b6d195822853995aa7d81b8cb33dc3bab10deca745118caff6a'

assert isinstance(poly_svm, sklearn.svm.SVC)
assert poly_parameters_hash == sha256(json.dumps(poly_svm.get_params()).encode()).hexdigest()
np.testing.assert_almost_equal(poly_svm.score(X_test, y_test), 0.7517944843218738)
assert poly_preds_hash == sha256(poly_preds).hexdigest()

## Exercise 3: Support Vector Regression
You wonder if you could use a Support Vector Regressor to predict the price of the rooms directly, as opposed to predicting the price category.

3.1) Use an SVR estimator to predict the price of a room.

In [None]:
# Change the target variable to the price itself (float)
y_train, y_test = train_test_split(airbnb_df.price, test_size=0.2, random_state=42)
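
Only the target is split again here, so it is worth seeing why the new y_train still lines up with the existing X_train. With the same number of rows, the same test_size and the same random_state, train_test_split shuffles the rows identically. The sketch below checks this on made-up arrays; all toy_ names and values are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features; row i holds the values (2*i, 2*i + 1)
toy_X = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10)          # first target: simply the row index
toy_prices = np.arange(10) * 10.0   # second target: row index times 10

# First split: features and first target together
toy_X_tr, toy_X_te, lab_tr, lab_te = train_test_split(
    toy_X, toy_labels, test_size=0.2, random_state=42
)

# Second split: only the new target, with the same settings
price_tr, price_te = train_test_split(toy_prices, test_size=0.2, random_state=42)

# The same rows land in train and test, in the same order
print(np.array_equal(price_tr, lab_tr * 10.0))
print(np.array_equal(price_te, lab_te * 10.0))
```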

In [None]:
# Use an SVR with a radial kernel to create predictions
# Begin by creating the estimator
# then train it on the data
# assign your estimator to the variable svr
# svr = ...

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
svr_parameters_hash = '9c671ede03a4d932d5300206ea65d70775e62c14187cd02fe2e66057ddcc215e'

assert isinstance(svr, sklearn.svm.SVR)
assert svr_parameters_hash == sha256(json.dumps(svr.get_params()).encode()).hexdigest()
np.testing.assert_almost_equal(svr.score(X_test, y_test), 0.08202727591603864)
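
The score method of scikit-learn regressors such as SVR returns the coefficient of determination R^2 rather than an accuracy, which helps when reading the value checked above. A minimal sketch on made-up numbers computes R^2 by hand and compares it with r2_score; the targets and predictions below are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and predictions
toy_true = np.array([40.0, 60.0, 80.0, 100.0])
toy_pred = np.array([50.0, 60.0, 70.0, 90.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((toy_true - toy_pred) ** 2).sum()
ss_tot = ((toy_true - toy_true.mean()) ** 2).sum()
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2)
print(np.isclose(manual_r2, r2_score(toy_true, toy_pred)))
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean price, and negative values mean it does worse than that.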