# SVM: Support Vector Machines

A relatively young algorithm.

Initial defn: SVM **finds (outputs) a separating line (hyperplane) between data of two classes.**

Q: What makes a good separating line?

A: One that maximises the **margin**: the distance between the line and the nearest point of either class.
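
To make the margin concrete, here is a minimal sketch in plain Python (no SVM library involved). The toy points, the two candidate lines and the helper name `margin` are all made up for illustration; a line is written as $w \cdot x + b = 0$.

```python
import math

def margin(w, b, points):
    """Distance from the line w.x + b = 0 to the nearest point."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)

# Toy data: class 0 on the left, class 1 on the right (illustrative values).
class_0 = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
class_1 = [(3.0, 1.0), (4.0, 0.0), (4.0, 2.0)]
points = class_0 + class_1

# Two vertical lines that both separate the classes correctly.
line_near = ((1.0, 0.0), -1.2)  # x = 1.2, hugs class 0
line_mid = ((1.0, 0.0), -2.0)   # x = 2.0, halfway between the classes

print("margin of x = 1.2:", margin(*line_near, points))
print("margin of x = 2.0:", margin(*line_mid, points))
```

Both lines classify every point correctly, but the middle line leaves more room on each side, which is the one a maximum-margin method would prefer.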

Underlying concept is to **maximise robustness**.

Question diagram.

**SVM prioritises correct classification over maximising margin.**

BUT **outliers**: the SVM may ignore individual outliers to do the best it can in constructing a decision surface. **SVM is somewhat robust to outliers.** It mediates between finding the maximum-margin separator and ignoring outliers. There is a tradeoff: parameters determine how willing it is to ignore outliers.

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/svm.html):

The advantages of support vector machines are:
* Effective in high dimensional spaces.
* Still effective in cases where number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
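
The "subset of training points" remark can be checked on a fitted classifier. A small sketch, assuming scikit-learn is installed; the toy data and the choice of a linear kernel with a large `C` are mine, picked so the result is easy to reason about:

```python
from sklearn import svm

# Two well-separated clusters (illustrative values).
X = [[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear", C=1000)
clf.fit(X, y)

# Only the points nearest the boundary are kept as support vectors.
print("support vectors:\n", clf.support_vectors_)
print("support vectors per class:", clf.n_support_)
print("training points:", len(X))
```

Points far from the boundary, such as `[0, 0]`, should not appear in `support_vectors_`, so dropping them would leave the decision function unchanged.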

The disadvantages of support vector machines include:
* If the number of features is much greater than the number of samples, the method is likely to give poor performances.
* SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).


In [17]:
from sklearn import metrics, svm

# Two training points, one per class.
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
# Estimator repr echoed by fit(); the exact parameter list varies by scikit-learn version.
"""
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
"""
# Made-up test points and true labels, just to exercise predict() and accuracy_score().
X_test = [[2., 2.], [3., 5.]]
predictions = clf.predict(X_test)
y_test_true = [0, 1]
print("Accuracy score: ", metrics.accuracy_score(y_test_true, predictions))

Accuracy score:  0.5


## Nonlinear SVMs

(Insert diagram)

An SVM is built to produce a linear separation.

* Previously: inputs $x$, $y$ -> SVM -> label.
* Now: inputs $x$, $y$, $x^2+y^2$ -> SVM -> label.

SVM makes nonlinear decision surfaces by **making new features**. In the case above, $z = x^2 + y^2$ is a new feature.

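Are they linearly separable now? They can be: a flat plane at constant $z$ in $(x, y, z)$ corresponds to a circle in the original $(x, y)$ plane, so classes that a circle separates become linearly separable once $z$ is added.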
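
A quick check of that idea in plain Python. The data (a small inner cluster surrounded by a ring) and the helper name `add_feature` are made up for illustration:

```python
import math

# Illustrative data: an inner cluster (label 0) surrounded by a ring (label 1).
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in range(8)]
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in range(8)]

def add_feature(point):
    """Map (x, y) to (x, y, z) with the new feature z = x^2 + y^2."""
    x, y = point
    return (x, y, x**2 + y**2)

inner_z = [add_feature(p)[2] for p in inner]
outer_z = [add_feature(p)[2] for p in outer]

# No straight line in (x, y) separates the ring from what it encloses,
# but in (x, y, z) the flat plane z = 1 has all inner points below it
# and all ring points above it.
print("largest z in inner cluster:", max(inner_z))
print("smallest z on the ring:", min(outer_z))
```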

### Finding New Features using the Kernel Trick

Gist:
* Change the input space X, Y into a much larger input space $X_i$ using the kernel trick.
* Separate using SVM.
* Take the solution back to the original space to get a non-linear separation.

**One of the most central tricks in machine learning.**
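
Why it counts as a "trick": a kernel gives the dot product in the larger space without ever building that space. A minimal sketch in plain Python, using the degree-2 kernel $K(p, q) = (p \cdot q)^2$ as an illustrative choice (not necessarily what any library uses by default); `phi`, `dot` and `poly2_kernel` are hypothetical helper names:

```python
import math

def phi(p):
    """Explicit feature map for the degree-2 kernel below (2D -> 3D)."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(p, q):
    """K(p, q) = (p . q)^2, computed entirely in the original 2D space."""
    return dot(p, q) ** 2

p, q = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(p), phi(q))   # map to 3D first, then take the dot product
implicit = poly2_kernel(p, q)    # never leaves 2D
print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, so an algorithm that only needs dot products can work in the larger space at the cost of the smaller one.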


(Groups kernel?)

As specified in the documentation, "different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels."

**SVC**: Support Vector Classifier (a type of SVM)

`kernel` is an SVC parameter. Kernels available for SVC:
* "linear" (Linear kernels generally draw a straight line)
* "poly"
* "rbf" (default)
* "sigmoid"
* "precomputed"
* a callable

In [None]:
# e.g.:
from sklearn import svm
clf = svm.SVC(kernel="linear")
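
To see the kernel choice matter, here is a sketch comparing `"linear"` and `"rbf"` on ring-shaped toy data, assuming scikit-learn is installed. The data is made up for illustration and the scores are training accuracy only, so this shows what each kernel *can* fit, not how well it generalises:

```python
import math
from sklearn import svm

# Illustrative data: an inner cluster (label 0) surrounded by a ring (label 1).
inner = [[0.5 * math.cos(t), 0.5 * math.sin(t)] for t in range(8)]
outer = [[2.0 * math.cos(t), 2.0 * math.sin(t)] for t in range(8)]
X = inner + outer
y = [0] * len(inner) + [1] * len(outer)

scores = {}
for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)  # accuracy on the training points

print(scores)
```

A straight line cannot put the ring on one side and its centre on the other, so the linear kernel has to misclassify some points here, while the RBF kernel is free to draw a closed boundary.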

### Parameters in ML
Parameters are arguments passed when you **create** your classifier, i.e. before fitting. They can have a huge impact on the resulting decision boundary.

Sample parameters for an SVM:
* kernel
* C
* gamma
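
A short sketch of passing these at creation time, assuming scikit-learn is installed; the specific values are arbitrary placeholders, not recommendations:

```python
from sklearn import svm

# Parameters are fixed when the classifier object is created, before fit().
clf = svm.SVC(kernel="rbf", C=10.0, gamma=0.1)

# get_params() returns the current settings as a dict.
params = clf.get_params()
print(params["kernel"], params["C"], params["gamma"])

# set_params() changes them on an existing object; refit afterwards.
clf.set_params(C=100.0)
print(clf.get_params()["C"])
```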

### SVM C Parameter

**C** controls the tradeoff between 
1. a smooth decision boundary and 
2. classifying training points correctly.

**Tradeoff**: a complicated boundary might not generalise as well; a straighter decision boundary might generalise better.

Large C means larger error penalty -> More complicated boundary.
-> **C for complicated**
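
A sketch of the effect of C with a linear kernel, assuming scikit-learn is installed. The data is made up: two clusters plus one class-1 outlier sitting close to class 0. For a linear SVM the margin width is $2 / \lVert w \rVert$, which the hypothetical helper `fit_linear` reads off `coef_`:

```python
import math
from sklearn import svm

X = [[0, 0], [1, 0], [0, 1], [1, 1],
     [4, 4], [5, 4], [4, 5], [5, 5],
     [1.5, 1.5]]  # last point: class-1 outlier near class 0
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

def fit_linear(C):
    """Fit a linear SVC and return it with its margin width 2 / ||w||."""
    clf = svm.SVC(kernel="linear", C=C)
    clf.fit(X, y)
    w = clf.coef_[0]
    return clf, 2.0 / math.hypot(w[0], w[1])

soft, soft_width = fit_linear(0.01)    # small C: errors are cheap
hard, hard_width = fit_linear(1000.0)  # large C: errors are expensive

print("C=0.01  margin width:", soft_width, "training accuracy:", soft.score(X, y))
print("C=1000  margin width:", hard_width, "training accuracy:", hard.score(X, y))
```

With large C the boundary squeezes between the outlier and class 0 to get every training point right, at the cost of a narrow margin; with small C the margin stays wide and the outlier is allowed to fall inside it.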

### Overfitting
The C, gamma and kernel parameters all affect overfitting in an SVM.
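
One way to watch overfitting happen is to compare training and test accuracy as gamma grows. A sketch assuming scikit-learn is installed; `make_moons`, the noise level, the split and the two gamma values are arbitrary choices of mine, not part of these notes:

```python
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Noisy two-class toy data, split half and half into train and test.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
for gamma in (1.0, 1000.0):
    clf = svm.SVC(kernel="rbf", C=1.0, gamma=gamma)
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each gamma.
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for gamma, (train_acc, test_acc) in results.items():
    print("gamma =", gamma, "train:", train_acc, "test:", test_acc)
```

A very large gamma makes each training point influence only its immediate neighbourhood, so the model can memorise the training set; a large gap between training and test accuracy is the usual sign of that.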

## SVM Pros and Cons
* Works well in complicated domains where there is a clear margin of separation.
* Doesn't work well with large datasets: training time grows between roughly O(n^2) and O(n^3), where n is the number of training samples.
* Doesn't work well with a lot of noise (e.g. overlapping classes, or many features?) -> Naive Bayes may do better.
