# SLU12- Support Vector Machines: Learning notebook

New tools in this unit:

* [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) (scikit-learn).
* [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) (scikit-learn).  

In this notebook we will be covering the following:
*  Maximal Margin Classifier
* Support Vector Classifier
* Support Vector Machine
* Multi-Class extension
* Support Vector Regression

People often loosely refer to the maximal margin classifier, the support vector classifier, and the support vector machine as “support vector machines”. To avoid confusion, we will carefully distinguish between these three notions.



### Maximal Margin Classifier (MMC)

The **Maximal Margin Classifier** is a binary classification method that makes use of the **optimal separating hyperplane**, which is the separating hyperplane that is farthest from the training observations. The minimal distance from the observations to the hyperplane is know as the **margin**. The observations that lie on the margin are known as the **support vectors**.  

The maximal margin classifier is a **very natural way to perform classification, if a separating hyperplane exists**. However, in many cases no separating hyperplane exists, and so there is no maximal margin classifier. In this case, the optimization problem has no solution

### Support Vector Classifier (SVC)

The **Support Vector Classifier**, also known as the soft margin classifier, is **an extension of the maximal margin classifier that allows some observations to be on the incorrect side** of the margin, or even on the incorrect side of the hyperplane. This overcomes the MMC's limitation of not being able to handle cases where no separating hyperplane exists.  

The **C penalty** is the sum of all the slack variables, and it thus determines the number and severity of the violations to the margin (and the hyperplane) that the model will tolerate.  If C=0 then we are back to the MMC. In practice, C is often chosen via cross-validation, though a value of C=1 is usually a good start.

### Support Vector Machine (SVM)

So far we have only considered models with a linear decision boundary. **Support Vector Machines are the extension of the SVC to the non-linear case** using kernels. The kernel approach is simply an efficient computational approach for acomodating non-linear decision boundaries.

## Implementation

In [1]:
import warnings
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

warnings.simplefilter("ignore")

# load the iris dataset and train-test split it
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

# SVMs are not scale invariant, so we should scale our data beforehand
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)


### Support Vector Classifier

In [2]:
# import the SVC class
from sklearn.svm import SVC

# Create an estimator
linear_svc = SVC(kernel="linear", C=1) # don't worry about the kernel argument for now
linear_svc

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [3]:
# fit the estimator
linear_svc.fit(X_train, y_train)
# make predictions
predictions = linear_svc.predict(X_test)
predictions

array([1, 1, 2, 0, 2, 1, 0, 1, 0, 1, 1, 1, 2, 2, 0, 0, 2, 2, 0, 0, 1, 2,
       0, 1, 1, 2, 1, 1, 1, 2])

In [4]:
#score the estimator
linear_svc.score(X_test, y_test)

0.9333333333333333

In [5]:
# get the support vectors
linear_svc.support_vectors_

array([[-1.00078746,  0.93301499, -1.17791196, -0.74612659],
       [-0.52327456,  0.7143396 , -1.23427138, -1.00716213],
       [-1.59767859, -1.69108967, -1.34699023, -1.1376799 ],
       [-0.40389633, -1.69108967,  0.17471421,  0.16749781],
       [ 0.78988593, -0.59771273,  0.51287076,  0.42853335],
       [ 0.55112948,  0.49566421,  0.56923018,  0.55905112],
       [-0.88140924, -1.2537389 , -0.38888002, -0.09353774],
       [ 1.26739883,  0.05831344,  0.68194903,  0.42853335],
       [ 1.02864238, -0.16036195,  0.73830845,  0.68956889],
       [-0.76203101, -0.81638812,  0.11835479,  0.29801558],
       [ 0.1929948 ,  0.7143396 ,  0.45651133,  0.55905112],
       [ 0.1929948 , -0.37903734,  0.45651133,  0.42853335],
       [ 0.1929948 , -0.81638812,  0.79466787,  0.55905112],
       [ 0.43175125, -1.90976506,  0.45651133,  0.42853335],
       [ 0.43175125, -0.59771273,  0.6255896 ,  0.82008666],
       [ 0.07361657, -0.16036195,  0.79466787,  0.82008666],
       [ 0.31237303, -0.

### SVM with polynomial kernel of degree d

In [6]:
# Pass kernel='poly' and degree=d to create an
# SVM with polynomial kernel of degree d
polynomial_svm = SVC(kernel="poly", degree=3)
polynomial_svm

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='poly', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [7]:
# fit the model
polynomial_svm.fit(X_train, y_train)
# score the model
polynomial_svm.score(X_test, y_test)

0.8666666666666667

### SVM with radial kernel

In [8]:
# Pass kernel='rbf' (default) to create an 
# SVM with radial kernel
radial_svm = SVC(kernel="rbf")
radial_svm

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [9]:
# fit the model
radial_svm.fit(X_train, y_train)
# score the model
radial_svm.score(X_test, y_test)

0.9

### Multi-Class Classification:  

- One-Vs-One
- One-Vs-Rest  

Luckily, all sklearn SVM estimators already implement multi-class classification, so we don't need to do it ourselves.

In [10]:
# You can specify which multi-class method you want your estimator to use
# through the "decision_function_shape" argument.
SVC(decision_function_shape="ovo")

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

## Implementation examples

In [11]:
airbnb_df = pd.read_csv("data/airbnb.csv")
print(airbnb_df.shape)
airbnb_df.sample(5)

(13232, 9)


Unnamed: 0,room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
9658,15718380,101536082,Private room,Misericórdia,0,0.0,2,1.0,80.0
166,231855,1212503,Entire home/apt,Santa Maria Maior,59,5.0,6,3.0,162.0
4587,7093818,7956592,Private room,Santa Maria Maior,9,3.5,2,1.0,25.0
2064,3017791,15376058,Entire home/apt,Benfica,0,0.0,4,2.0,1384.0
12168,18448032,127887985,Entire home/apt,Parque das Nações,1,0.0,7,2.0,103.0


**Get dummies from dataframe or series (pd.get_dummies)**

In [12]:
airbnb_df = airbnb_df.drop(["room_id", "host_id"], axis=1)
airbnb_df = pd.get_dummies(airbnb_df)

In [13]:
airbnb_df["price_target"] = airbnb_df.price < 64

**Split train and test sets**

In [14]:
# Create the features matrix X and target vector y
X = airbnb_df.drop([col for col in airbnb_df.columns if "price" in col], axis=1)
y = airbnb_df["price_target"]
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Scaling of the features**

In [15]:
# SVMs are not scale invariant, so you scale your data beforehand
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)