# Chapter 5: Support Vector Machines
A powerful and versatile ML model that can perform linear and nonlinear classification, regression, and outlier detection. Well suited to complex but small- to medium-sized datasets.

## Linear SVM
Fundamental idea: **Fit the widest possible street** between the classes. The decision boundary is the line that stays as far as possible from the closest training instances of each class. This is called **large margin classification**.

**Def'n** *Linearly Separable*: Can be separated with a straight line.

Once the line has been fitted, adding more training instances off the street does not affect the decision boundary: it is fully determined (or "supported") by the instances on the edge of the street.

The instances on the edges of the street are called the **support vectors**.
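The claim that only the support vectors matter can be checked on a tiny example. The sketch below uses a made-up, linearly separable toy dataset (not from these notes) and `SVC(kernel="linear")` with a large `C` as a stand-in for hard-margin classification, then adds one instance far from the street and refits.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[0.0, 0.0], [-1.0, -1.0], [-2.0, 0.0],
              [2.0, 2.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates hard-margin classification.
clf = SVC(kernel="linear", C=1000).fit(X, y)

# Add one instance far from the street, on the correct side, and refit.
X_more = np.vstack([X, [[10.0, 10.0]]])
y_more = np.append(y, 1)
clf_more = SVC(kernel="linear", C=1000).fit(X_more, y_more)

print("support vectors:\n", clf.support_vectors_)
print("w, b before:", clf.coef_, clf.intercept_)
print("w, b after: ", clf_more.coef_, clf_more.intercept_)
```

The weights and intercept should match up to solver tolerance, because the extra instance does not sit on the edge of the street.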

Note: SVMs are very sensitive to feature scale, so scale the features (e.g. with `StandardScaler`) before training.

### Soft Margin Classification

If we strictly require every instance to be off the street and on the correct side for its class, this is called hard-margin classification.

Issues with hard-margin:
 - Only works if the data is linearly separable; if an outlier instance is mixed in with instances of another class, hard-margin classification is impossible
 - Sensitive to outliers: since all instances must be off the street, a single outlier close to the other class can make the street very narrow even when the data is still linearly separable.
 
**Def'n** Soft-margin classification: Finding a good balance between keeping the street as wide as possible and limiting margin violations (instances that end up on the street or on the wrong side).

Control this balance with the *C* hyperparameter, which penalizes margin violations. A lower *C* gives a smaller penalty: the street is wider and more instances end up on it, which usually generalizes better. A higher *C* gives a larger penalty: the street is narrower with fewer violations, which fits the training set more closely and may overfit.

Reducing *C* regularizes the SVM.
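To make the effect of *C* concrete, here is a minimal sketch on the iris petal features. The helper `street_stats` and the three values of `C` are illustrative choices, not part of the original notes; the street width is computed as `2 / ||w||` in the scaled feature space, and "on the street" means a decision score strictly between -1 and 1.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # Iris-Virginica vs. the rest

def street_stats(C):
    """Fit a linear SVM and return (street width, number of instances on the street)."""
    model = make_pipeline(
        StandardScaler(),
        LinearSVC(C=C, loss="hinge", dual=True, max_iter=100_000, random_state=42),
    )
    model.fit(X, y)
    w = model[-1].coef_[0]
    width = 2 / np.linalg.norm(w)              # measured in scaled feature units
    scores = model.decision_function(X)
    on_street = int(np.sum(np.abs(scores) < 1))
    return width, on_street

for C in (0.01, 1, 100):
    width, on_street = street_stats(C)
    print(f"C={C}: street width={width:.2f}, instances on street={on_street}")
```

Smaller values of `C` should report a wider street with more instances on it.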

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()                                     # Load iris dataset
X = iris["data"][:, (2, 3)]                                     # petal length, petal width columns
y = (iris["target"] == 2).astype(np.float64)                    # Iris-Virginica
svm_clf = Pipeline([
    ("scaler", StandardScaler()),                  # Standardize data
    ("linear_svc", LinearSVC(C=1, loss="hinge")),  # Create model
])
svm_clf.fit(X, y)        # Fit

svm_clf.predict([[5.5, 1.7]])

array([1.])

Unlike Logistic Regression classifiers, SVM classifiers do not output a probability for each class.
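What a linear SVM does expose is a signed score from `decision_function`: positive on one side of the decision boundary, negative on the other. The sketch below refits the same iris model as above; the second test point `[1.5, 0.2]` is a made-up small-petal example added for contrast.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # Iris-Virginica vs. the rest

svm_clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)
svm_clf.fit(X, y)

X_new = [[5.5, 1.7], [1.5, 0.2]]
scores = svm_clf.decision_function(X_new)      # signed scores, not probabilities
print("scores:", scores)
print("predictions:", svm_clf.predict(X_new))
print("LinearSVC has predict_proba:", hasattr(svm_clf[-1], "predict_proba"))
```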

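Use `LinearSVC`, as it is much faster than `SVC(kernel='linear')`, especially on large training sets. Another option is `SGDClassifier(loss='hinge', alpha=1/(m*C))`, where `m` is the number of training instances. It converges more slowly than `LinearSVC`, so it is mainly useful when the dataset is too large to fit in memory.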
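A minimal sketch of the `SGDClassifier` alternative on the same iris features. The data here easily fits in memory, so this only illustrates how `alpha` is derived from `m` and `C`; `random_state=42` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # Iris-Virginica vs. the rest

m = len(X)   # number of training instances
C = 1        # same C as in the LinearSVC example

sgd_clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42),
)
sgd_clf.fit(X, y)

train_accuracy = sgd_clf.score(X, y)
print(f"alpha={1 / (m * C):.5f}, training accuracy={train_accuracy:.3f}")
print(sgd_clf.predict([[5.5, 1.7]]))
```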

Note: set `dual=False` in `LinearSVC` unless there are more features than instances (more on this later). This applies to the default squared hinge loss; `loss="hinge"` is not supported with `dual=False`.

## Nonlinear SVM Classification
Many datasets are not even close to being linearly separable.

One way to handle this is to add polynomial features with `PolynomialFeatures(degree=  )`, as seen in Chapter 4. With SVMs, we can instead use what is called "**The Kernel Trick**".

A low polynomial degree cannot capture the complexity of some datasets, while a high degree creates a huge number of features, which may overfit and makes the SVM too slow. "The Kernel Trick" gets the same result as adding many polynomial features, without actually having to add them.
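A small worked example shows why this is possible. For 2D inputs, the degree-2 mapping `phi` below is a standard textbook feature map (it is not from these notes): the dot product of two mapped vectors equals the squared dot product of the original vectors, so the kernel can be evaluated without ever building the mapped features. The vectors `a` and `b` are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(x):
    """Explicit degree-2 feature map for a 2D vector (illustrative)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([2.0, 3.0])
b = np.array([4.0, 1.0])

explicit = phi(a) @ phi(b)    # dot product in the transformed space
kernel = (a @ b) ** 2         # same value, computed in the original space

# scikit-learn's polynomial kernel: (gamma * <a, b> + coef0) ** degree
sk_kernel = polynomial_kernel(
    a.reshape(1, -1), b.reshape(1, -1), degree=2, gamma=1, coef0=0
)[0, 0]

print(explicit, kernel, sk_kernel)
```

All three values should agree up to floating-point rounding.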

In [3]:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))  # 3rd degree polynomial kernel
])
poly_kernel_svm_clf.fit(X, y)

The code above uses a 3rd-degree polynomial kernel. The **coef0** hyperparameter controls how much the model is influenced by high-degree terms versus low-degree terms.

Recall: Use GridSearchCV to find good hyperparameter values.
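A minimal sketch of such a search for the polynomial-kernel pipeline. The grid values and `cv=5` are arbitrary illustrative choices, not tuned recommendations; hyperparameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # Iris-Virginica vs. the rest

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "svm_clf__degree": [2, 3],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 1, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```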

### Adding Similarity Features
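Another method for nonlinear problems: add features computed with a *similarity function*, which measures how much each instance resembles a particular landmark.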

Gaussian Radial Basis Function explanation...

### Gaussian RBF Kernel
As with `PolynomialFeatures()`, actually computing the similarity features is computationally expensive on large datasets. Recall that using every instance as a landmark converts an 'm x n' dataset into an 'm x m' dataset (assuming the original features are dropped). For example, a dataset that was only '1250 x 12' becomes a '1250 x 1250' dataset, roughly 100 times as many values.
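The size blow-up is easy to reproduce. The sketch below builds a synthetic stand-in for a 1250 x 12 dataset (random values, purely for shape) and uses every instance as a landmark with the Gaussian RBF similarity `exp(-gamma * ||x - landmark||^2)`, computed via `sklearn.metrics.pairwise.rbf_kernel`. The value `gamma = 0.3` is arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(42)
X = rng.normal(size=(1250, 12))   # synthetic stand-in for a 1250 x 12 dataset

gamma = 0.3                       # arbitrary value for illustration

# Every instance is a landmark: entry (i, j) is exp(-gamma * ||x_i - x_j||^2).
X_sim = rbf_kernel(X, X, gamma=gamma)

# Same value computed by hand for one pair of instances.
manual = np.exp(-gamma * np.sum((X[0] - X[1]) ** 2))

print(X.shape, "->", X_sim.shape)   # (1250, 12) -> (1250, 1250)
print(X_sim[0, 1], manual)
```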

We use "The Kernel Trick" again to get the same result as adding the *similarity features*, without actually computing them.

In [4]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))  # Radial Basis Function; similarity function
])
rbf_kernel_svm_clf.fit(X, y)

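Note the $\gamma$ (**gamma**) hyperparameter. A larger gamma makes each instance's range of influence smaller, so the decision boundary becomes more irregular. A smaller gamma makes each instance's range of influence larger, so the decision boundary becomes smoother.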

$\gamma$ and *C* are both regularization hyperparameters.

If the model is overfitting the training set, reduce $\gamma$. If it is underfitting, increase it. The same applies to *C*.
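The over/underfitting behaviour can be observed by sweeping $\gamma$ on a small nonlinear dataset. This sketch uses `make_moons` (a synthetic dataset chosen for illustration, not the iris data used above) with a fixed `C=1`. The three gamma values are arbitrary; the gap between training and test accuracy is what to look at.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class dataset with a curved boundary (illustrative choice).
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for gamma in (0.01, 1, 100):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1))
    model.fit(X_train, y_train)
    results[gamma] = (model.score(X_train, y_train), model.score(X_test, y_test))
    train_acc, test_acc = results[gamma]
    print(f"gamma={gamma}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

A very small gamma tends to underfit (low accuracy on both sets), while a very large gamma tends to fit the training set much more closely than the test set.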