# Support Vector Machine (SVM)

SVM is a generalization of the maximal margin classifier. The maximal margin classifier seeks the hyperplane with the largest possible margin that perfectly separates the observations into their correct classes (i.e. the two sides of the hyperplane). Such a hyperplane may overfit the training data, or may not exist at all. SVM (also called a soft margin classifier) instead allows a few training observations to be misclassified in order to achieve:

- Greater robustness to individual observations
- Better classification of *most* of the training observations
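
One common way to write this trade-off as an optimization problem is sketched below. It is a standard textbook formulation rather than something taken from this notebook, and the symbols $w$ (normal vector of the hyperplane), $b$ (offset), $\xi_i$ (slack of observation $i$) and $n$ (number of training observations) are notation introduced here. Labels are assumed to be coded as $y_i \in \{-1, +1\}$.

$$
\min_{w,\, b,\, \xi} \quad \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
$$

An observation with $\xi_i = 0$ is on the correct side of the margin, $0 < \xi_i \le 1$ means it is inside the margin but still correctly classified, and $\xi_i > 1$ means it is misclassified. In this form a larger $C$ penalizes violations more heavily (a narrower, less tolerant margin), while a smaller $C$ tolerates more violations.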

## Key Concepts

1. **Hyperplane**: A hyperplane is a flat affine subspace of dimension $p - 1$ in a $p$-dimensional space
2. **Support Vectors**: The data points closest to, and equidistant from, the maximal margin hyperplane. They "support" the hyperplane in the sense that if these points were moved slightly, the maximal margin hyperplane would move as well
3. **Kernel**: When the observations are not linearly separable in the original $p$-dimensional space, a kernel can be applied to enlarge the feature space so that a suitable decision boundary can be found in a higher-dimensional space
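
The kernel idea can be illustrated with a small sketch. The data set here (`make_circles`, two concentric rings) and the variable names are assumptions chosen for illustration and are not part of the original notebook. The sketch compares a linear kernel with an RBF kernel on data that no straight line can separate.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Illustrative toy data: two concentric rings, which are not
# linearly separable in the original 2-D feature space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same classifier, two different kernels.
linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf").fit(X, y)

linear_acc = linear_clf.score(X, y)
rbf_acc = rbf_clf.score(X, y)
print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"RBF kernel training accuracy:    {rbf_acc:.2f}")

# The support vectors are stored as rows in the original feature space.
print("support vectors (RBF model):", rbf_clf.support_vectors_.shape)
```

The RBF kernel implicitly works in an enlarged feature space, so it should be able to fit a boundary between the rings that the linear kernel cannot.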

## Assumptions

None


## Advantages

1. **Stability**: The support vector classifier's decision rule (i.e. the hyperplane) is affected only by a potentially small subset of the training observations (i.e. the support vectors). Therefore, **SVM is quite robust to the behavior of observations that are far away from the hyperplane (i.e. extreme values)**

2. SVM is effective in high-dimensional spaces

3. SVM uses only a subset of training data (i.e. support vectors) in the decision function, which makes it memory efficient 
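
The claim that only the support vectors matter can be checked with a rough sketch. The toy data (`make_blobs`) and all variable names are assumptions made for illustration. The idea is to fit once on all points, refit on the support vectors alone, and compare the two models.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative toy data: two well-separated clusters in 2-D.
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

full_clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Indices of the training points that ended up as support vectors.
sv_idx = full_clf.support_
print(f"{len(sv_idx)} of {len(X)} training points are support vectors")

# Refit using only the support vectors and compare the two models.
sv_clf = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])
agreement = np.mean(full_clf.predict(X) == sv_clf.predict(X))
coef_gap = np.max(np.abs(full_clf.coef_ - sv_clf.coef_))
print(f"fraction of identical predictions: {agreement:.2f}")
print(f"largest coefficient difference:    {coef_gap:.4f}")
```

Any remaining difference between the two sets of coefficients should come only from the solver's numerical tolerance, since the discarded points carry no weight in the fitted decision function.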

## Limitations

1. A single hyperplane can only separate two classes (i.e. binary classification). Workarounds such as *one-versus-one* and *one-versus-all* extend SVM to multiclass classification by combining several binary classifiers, but they may not be as effective.

2. SVM does not perform well when the data set is noisy (i.e. the target classes overlap in the feature space)

3. SVM does not perform well when the number of features exceeds the number of training samples

4. SVM does not directly provide probability estimates, because its decision rule is based only on which side of the hyperplane an observation falls

5. **Choosing an appropriate kernel function is difficult**: With a high-dimensional kernel, SVM can generate too many support vectors, which slows down training drastically
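
Two of the limitations above, multiclass handling and the lack of probabilities, can be explored with a short sketch. The four-class toy data and the variable names are assumptions made for illustration. Wrapping the classifier in `CalibratedClassifierCV` is one possible workaround for probabilities, not the only one.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative toy data with four classes.
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# SVC trains one-versus-one classifiers internally; decision_function_shape
# only controls how the scores are reported.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

ovo_scores = ovo.decision_function(X)  # one column per pair of classes
ovr_scores = ovr.decision_function(X)  # one column per class
print("one-versus-one score shape: ", ovo_scores.shape)
print("one-versus-rest score shape:", ovr_scores.shape)

# Probabilities are not a native output of the hyperplane rule, so they are
# estimated here by fitting a sigmoid on the SVM scores with cross-validation.
calibrated = CalibratedClassifierCV(SVC(kernel="linear"), method="sigmoid", cv=5)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)
print("calibrated probability shape:", proba.shape)
```

With four classes, one-versus-one needs a score for each of the six class pairs, which shows how the number of binary problems grows with the number of classes.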


In [4]:
import numpy as np
from sklearn import svm, datasets

In [2]:
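# Load the iris data set and keep only the first two features
# (sepal length and sepal width) so the problem stays two-dimensional
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target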

In [6]:
C = 1.0  # SVM regularization parameter
# Fit a linear-kernel SVM, passing the C defined above
svc = svm.SVC(kernel="linear", C=C).fit(X, y)
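
A possible next step is to inspect the fitted model. The sketch below repeats the fit so that it runs on its own. The query point `new_flower` is a hypothetical measurement invented for illustration.

```python
from sklearn import datasets, svm

# Same setup as above: first two iris features, linear kernel.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
C = 1.0
svc = svm.SVC(kernel="linear", C=C).fit(X, y)

# Accuracy on the training data (an optimistic estimate of performance).
train_acc = svc.score(X, y)
print(f"training accuracy: {train_acc:.2f}")

# Number of support vectors per class, and in total.
print("support vectors per class:", svc.n_support_)
print("total support vectors:", svc.n_support_.sum(), "of", len(X))

# Predict the class of a hypothetical flower (sepal length, sepal width in cm).
new_flower = [[5.0, 3.5]]
pred = svc.predict(new_flower)
print("predicted class:", iris.target_names[pred[0]])
```

Because accuracy is measured on the training data itself, a held-out test set or cross-validation would give a more honest estimate.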