### Support Vector Machines pg. 167

1. The fundamental idea behind an SVM is that it tries to fit the "widest street possible" between different classes as a decision boundary.
    - Soft margin classification balances keeping the street as wide as possible against limiting margin violations (instances that end up inside the street or on the wrong side).
    

2. Support vectors are the instances that determine the decision boundary (DB) and keep the "street" as wide as possible. They lie on the edges of the street. Any instance off the street does not affect the decision boundary: it can be moved, added, or removed without changing anything, as long as it stays off the street. In soft margin classification, instances inside the street are also support vectors and therefore do affect the decision boundary.

3. It is important to scale the inputs because an SVM tries to fit the widest possible "street" between the classes. If the features are on very different scales, the SVM tends to neglect the small-scale features, so the street ends up nearly parallel to one of the axes instead of separating the classes well.

4. An SVM can output the distance between an instance and the decision boundary, which can be used as a confidence score. However, these scores cannot be directly converted into predicted class probabilities (unless Scikit-Learn's SVC classifier is used with the hyperparameter <i>probability=True</i>, in which case the probabilities are calibrated after training).

5. Both the primal and dual forms of the linear SVM problem yield the same solution, but the primal form is faster to solve when the number of training instances is greater than the number of features.

6. A higher gamma (or a higher C) makes the model more flexible and can lead to overfitting. If an SVM classifier is underfitting, gamma and/or C should be increased; if it is overfitting, they should be decreased.
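
Points 4 and 6 can be checked with a short experiment. The following is a minimal sketch, not part of the original exercise: the variable names (`X_demo`, `svc_plain`, `train_acc`, ...) are illustrative, and the `make_moons` dataset and the two gamma values are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import load_iris, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Point 4: decision scores vs. class probabilities
iris_demo = load_iris()
mask = iris_demo.target < 2                       # keep setosa (0) and versicolor (1)
X_demo = StandardScaler().fit_transform(iris_demo.data[mask])
y_demo = iris_demo.target[mask]

svc_plain = SVC(kernel="linear", C=5).fit(X_demo, y_demo)
scores = svc_plain.decision_function(X_demo[:5])  # signed and unbounded, not probabilities

# probability=True adds a calibration step during training (slower fit)
svc_proba = SVC(kernel="linear", C=5, probability=True, random_state=42).fit(X_demo, y_demo)
proba = svc_proba.predict_proba(X_demo[:5])       # one row per instance, rows sum to 1

# Point 6: gamma controls how flexible an RBF-kernel SVM is
X_moons, y_moons = make_moons(n_samples=200, noise=0.3, random_state=42)
train_acc = {}
for gamma in (0.01, 100):                         # illustrative low and high values
    rbf = SVC(kernel="rbf", gamma=gamma, C=1).fit(X_moons, y_moons)
    train_acc[gamma] = rbf.score(X_moons, y_moons)

print(scores, proba, train_acc, sep="\n")
```

The sign of each score gives the predicted class, while `predict_proba` is only available on the model trained with `probability=True`. For the gamma comparison, training accuracy alone is reported; a large gap between training and validation accuracy at high gamma would be the sign of overfitting.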

In [2]:
import numpy as np

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

iris = load_iris() # setosa and versicolor classes are linearly separable
X = iris["data"]
y = iris["target"]

setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

C = 5
alpha = 1 / (C * len(X))

lin_svc = LinearSVC(loss="hinge", C=C) #default loss is squared hinge (change to hinge)
sgd_clf = SGDClassifier(random_state=42, learning_rate="constant", eta0=0.001, alpha=alpha, max_iter=100000, tol=None) #default loss is hinge; tol=None disables early stopping (np.infty was removed in NumPy 2.0)
svc_clf = SVC(kernel="linear", C=C) #default kernel is rbf (change to linear b/c the dataset is linearly separable)
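
The choice `alpha = 1 / (C * len(X))` can be motivated by comparing the two objectives. This is a sketch under assumptions: it uses the objective forms described in the Scikit-Learn documentation (hinge loss, L2 penalty, $m$ training instances with targets $t^{(i)} \in \{-1, +1\}$) and ignores differences in how the intercept is treated.

$$
J_{\text{SVC}}(\mathbf{w}, b) = \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{m} \max\left(0,\ 1 - t^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)
$$

$$
J_{\text{SGD}}(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\ 1 - t^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right) + \frac{\alpha}{2}\lVert \mathbf{w} \rVert^2
$$

Dividing $J_{\text{SVC}}$ by $Cm$ does not change its minimizer and gives

$$
\frac{J_{\text{SVC}}(\mathbf{w}, b)}{Cm} = \frac{1}{m} \sum_{i=1}^{m} \max\left(0,\ 1 - t^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right) + \frac{1}{2Cm}\lVert \mathbf{w} \rVert^2,
$$

which matches $J_{\text{SGD}}$ when $\alpha = \frac{1}{Cm}$. With $C = 5$ and $m = 100$ (50 setosa plus 50 versicolor instances), this gives $\alpha = 0.002$.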

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=42)
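
The cell above fits the scaler on all instances before splitting, so the test set influences the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training split only. The following is a minimal sketch of that alternative, not the approach used in this notebook; the names `X_raw`, `y_bin`, and `pipe` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris_raw = load_iris()
mask = iris_raw.target < 2                      # keep setosa (0) and versicolor (1)
X_raw, y_bin = iris_raw.data[mask], iris_raw.target[mask]

# Split the unscaled data first
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_bin, test_size=0.33, random_state=42)

# The pipeline fits StandardScaler on the training split only,
# then applies the same transformation when scoring the test split
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=5))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

On a dataset this easy the difference is unlikely to matter, but the pattern avoids leaking test-set statistics into preprocessing.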

In [4]:
from sklearn.model_selection import cross_val_score

lin_svc.fit(X_scaled, y)
sgd_clf.fit(X_scaled, y)
svc_clf.fit(X_scaled, y)

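# cross_val_score clones each estimator and refits it on folds of the test set
lin_svc_score = cross_val_score(lin_svc, X_test, y_test)
sgd_clf_score = cross_val_score(sgd_clf, X_test, y_test)
svc_clf_score = cross_val_score(svc_clf, X_test, y_test)  # was sgd_clf by mistake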

In [5]:
lin_svc_score

array([1., 1., 1., 1., 1.])

In [6]:
sgd_clf_score

array([1., 1., 1., 1., 1.])

In [7]:
svc_clf_score

array([1., 1., 1., 1., 1.])

The scores are 100% because the dataset is linearly separable. Accuracy alone therefore cannot show how similar the three models are; for that, their decision boundaries must be compared, for example through the learned intercepts and coefficients.

In [8]:
print("LinearSVC:                   ", lin_svc.intercept_, lin_svc.coef_)
print("SVC:                         ", svc_clf.intercept_, svc_clf.coef_)
print("SGDClassifier(alpha={:.5f}):".format(sgd_clf.alpha), sgd_clf.intercept_, sgd_clf.coef_)

LinearSVC:                    [0.20728759] [[ 0.20698778 -0.31798522  0.67690973  0.80207236]]
SVC:                          [0.24282611] [[ 0.26732082 -0.33928505  0.69992451  0.74910678]]
SGDClassifier(alpha=0.00200): [0.244] [[ 0.26765984 -0.33920468  0.70326094  0.75104712]]
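
One way to quantify how close these boundaries are, without plotting, is the cosine similarity between the weight vectors. This is a minimal sketch: the coefficients are copied from the printed output above, and `cosine_similarity` is a small helper defined here, not a library function. It compares only the orientation of the boundaries and ignores the intercepts.

```python
import numpy as np

# Coefficients copied from the printed output above
w_lin = np.array([0.20698778, -0.31798522, 0.67690973, 0.80207236])  # LinearSVC
w_svc = np.array([0.26732082, -0.33928505, 0.69992451, 0.74910678])  # SVC
w_sgd = np.array([0.26765984, -0.33920468, 0.70326094, 0.75104712])  # SGDClassifier

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_lin_svc = cosine_similarity(w_lin, w_svc)
cos_svc_sgd = cosine_similarity(w_svc, w_sgd)
print("LinearSVC vs SVC:    ", cos_lin_svc)
print("SVC vs SGDClassifier:", cos_svc_sgd)
```

Values close to 1 indicate that the hyperplanes point in nearly the same direction.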
