# Support Vector Machines

Given training data $\mathbf x_1, \ldots, \mathbf x_m \in \mathcal X \subset \mathbb R^d$ and corresponding labels $y_1, \ldots, y_m \in \mathcal Y \subset \mathbb R$, we want to find a model $h\colon \mathcal X \to \mathcal Y$ that performs well not only on the training data but also on new data. More precisely, if the data is generated by some underlying distribution $\mathcal D$, we want to minimize the expected loss $L(h) = \mathbb E_{(\mathbf x, y) \sim \mathcal D}\left[\ell(h(\mathbf x), y) \right]$. Here, $\ell$ is a non-negative loss function that measures how far the output of the model is from the true target.
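
Since $\mathcal D$ is unknown, $L(h)$ cannot be computed directly; a common stand-in is the average loss over a finite sample. The sketch below illustrates this with the 0-1 loss, which is an assumption made here for concreteness, since the text leaves $\ell$ unspecified. The names `zero_one_loss`, `empirical_risk`, and `toy_model` are hypothetical.

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss (assumed choice of loss): 1.0 on a mistake, 0.0 otherwise."""
    return 0.0 if y_pred == y_true else 1.0


def empirical_risk(h, xs, ys, loss=zero_one_loss):
    """Average loss of the model h over the sample (xs, ys)."""
    return sum(loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)


def toy_model(x):
    """Hypothetical model: predict 1 when the first coordinate is positive."""
    return 1 if x[0] > 0 else 0


xs = [(1.0, 2.0), (-1.0, 0.5), (0.5, -1.0), (-2.0, -2.0)]
ys = [1, 0, 0, 0]

# The model errs only on the third point, so the average loss is 1/4.
risk = empirical_risk(toy_model, xs, ys)
print(risk)  # 0.25
```

A low value on the training sample alone does not guarantee a low $L(h)$, which is why the quantity of interest is the expectation over $\mathcal D$.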

If $\mathcal Y = \{0, 1\}$, then we have what is called a binary classification task. One solution to this problem is to find a point $\mathbf x_0 \in \mathbb R^d$ and a weight vector $\mathbf w \in \mathbb R^d$ that together define a hyperplane $\{\mathbf x : \mathbf w \cdot (\mathbf x - \mathbf x_0) = 0\}$ that slices the input space $\mathbb R^d$ in two. One side, say the side that $\mathbf w$ points to, is designated as the positive class, $y = 1$, and the other as the negative class, $y = 0$. The resulting hyperplane is called a linear decision boundary since it separates the input space into $|\mathcal Y| = 2$ parts, which correspond to the model's possible outputs.
Note that the choice of such a boundary is not unique: when the two classes can be separated at all, there are generally many ways to do so. Here we will consider one particular algorithm for generating such models, called **Support Vector Machines** (SVMs).
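
As a minimal sketch of the construction above, the classifier defined by $\mathbf w$ and $\mathbf x_0$ can be written as $h(\mathbf x) = 1$ if $\mathbf w \cdot (\mathbf x - \mathbf x_0) \ge 0$ and $0$ otherwise. The function names are hypothetical, and assigning points that lie exactly on the boundary to the positive class is an assumption made here.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_classifier(w, x0):
    """Build h with h(x) = 1 if w . (x - x0) >= 0, else 0.

    Points exactly on the boundary are labeled 1 (an assumed convention).
    """
    def h(x):
        diff = [xi - x0i for xi, x0i in zip(x, x0)]
        return 1 if dot(w, diff) >= 0 else 0
    return h


# Boundary through the origin with normal vector (1, 1).
h = linear_classifier(w=(1.0, 1.0), x0=(0.0, 0.0))

preds = [h((2.0, 1.0)), h((-1.0, -3.0)), h((1.0, -0.5))]
print(preds)  # [1, 0, 1]
```

Rescaling `w` by any positive constant, or moving `x0` along the boundary, yields the same classifier, so the pair $(\mathbf w, \mathbf x_0)$ describing a given boundary is itself not unique.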

An SVM is a supervised learning model that can be used for binary classification as a linear classifier but, as we will show later, can also create non-linear decision boundaries through the use of kernels. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of either class, since in general the larger the margin, the lower the generalization error of the classifier. In fact, from the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from a theoretical upper bound on the generalization error. This bound has two important features in the case of SVMs:

* the bound is minimized by maximizing the margin, $\gamma$, i.e., the distance between the hyperplane separating the two classes and the training points closest to it, and
* the bound does not depend on the dimensionality of the space.
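
To make the margin concrete: the distance from a point $\mathbf x$ to the hyperplane through $\mathbf x_0$ with normal vector $\mathbf w$ is $|\mathbf w \cdot (\mathbf x - \mathbf x_0)| / \lVert \mathbf w \rVert$, so $\gamma$ is the smallest such distance over the training points. The sketch below computes it for a fixed hyperplane; the function name `margin` and the toy data are hypothetical, and the code only measures the margin of a given hyperplane rather than searching for the best one.

```python
import math


def margin(w, x0, xs):
    """Smallest distance from any point in xs to the hyperplane
    {x : w . (x - x0) = 0}."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * (xi - x0i) for wi, xi, x0i in zip(w, x, x0))) / norm_w
        for x in xs
    )


w = (3.0, 4.0)   # normal vector, with norm 5
x0 = (0.0, 0.0)  # a point on the hyperplane
xs = [(3.0, 4.0), (-1.0, -2.0), (5.0, 0.0)]

# Distances are 25/5, 11/5 and 15/5, so the margin is 11/5.
gamma = margin(w, x0, xs)
print(gamma)
```

Among all hyperplanes that separate the two classes, the one with the largest value of this quantity is the one the margin-based argument favors.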
