# Support Vector Machine

### Distance between a point and a hyperplane
In $N$-dimensional space, the minimal distance from a point $\{y_{i}\}$ to the $(N-1)$-dimensional hyperplane $w_{i}x_{i}+b=0$ (summation over repeated indices is implied) is found by minimizing the following Lagrange function:
\begin{align}
L=	[(x_{i}-y_{i})(x_{i}-y_{i})+\mu(w_{i}x_{i}+b)],
\end{align}
whose stationary point satisfies
\begin{align}
\frac{\partial L}{\partial x_{j}}=	[2(x_{j}-y_{j})+\mu w_{j}]=0.
\end{align}
Substituting $x_{i}=y_{i}-\frac{1}{2}\mu w_{i}$ back into the hyperplane equation, we have
\begin{align}
0=&	w_{i}(y_{i}-\frac{1}{2}\mu w_{i})+b,\\
\mu=&	\frac{2(b+w_{i}y_{i})}{|w|^{2}}.
\end{align}
Since $x_{i}-y_{i}=-\frac{1}{2}\mu w_{i}$, the minimal distance between $y$ and the hyperplane reads:
\begin{align}
d=\frac{1}{2}|\mu||w|=	\frac{|b+w_{i}y_{i}|}{|w|}.
\end{align}
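
The derivation above can be checked numerically. The following is a minimal NumPy sketch; the function name `point_to_hyperplane_distance` and the sample values of `w`, `b`, and `y` are made up for illustration. It computes the multiplier $\mu$, the closest point $x=y-\frac{1}{2}\mu w$, and the distance $d$.

```python
import numpy as np

def point_to_hyperplane_distance(w, b, y):
    """Return (distance, closest point) for the hyperplane w.x + b = 0 and the point y."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    # Multiplier from the derivation: mu = 2 (b + w.y) / |w|^2
    mu = 2.0 * (b + w @ y) / (w @ w)
    # Closest point on the hyperplane: x = y - mu/2 * w
    x_closest = y - 0.5 * mu * w
    # d = |mu| |w| / 2 = |b + w.y| / |w|
    distance = 0.5 * abs(mu) * np.linalg.norm(w)
    return distance, x_closest

# Illustrative values (assumptions, not taken from the text).
w = np.array([3.0, 4.0])
b = -5.0
y = np.array([3.0, 4.0])

distance, x_closest = point_to_hyperplane_distance(w, b, y)
print(distance)            # distance from y to the hyperplane
print(x_closest)           # closest point on the hyperplane
print(w @ x_closest + b)   # should be numerically zero
```

The last line confirms that the closest point satisfies the hyperplane equation, and `np.linalg.norm(x_closest - y)` should agree with the returned distance.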

### Objective function

Suppose we have a sample with predictors $\{X^{i}\}$ and outcomes $\{Y^{i}\}$, where the outcome takes one of two values, $\pm1$. Given a feature vector $x$, the outcome is predicted according to
\begin{align}
y=	\mathrm{sign}(b+w^{T}x).
\end{align}
If the outcome is correctly predicted, the distance can also be written as:
\begin{align}
d=	\frac{y(b+w^{T}x)}{|w|},
\end{align}
with the normal vector $w$ of the hyperplane pointing toward the $Y=+1$ class.

The goal of SVM is to maximize the smallest distance among all data points $d_{min}$ by varying $\{w,b\}$. $d_{min}$ is given by:

\begin{align}
d_{min}=	\min_{i}\{d^{i}\},
\end{align}

with $d^{i}=\frac{y^{i}(b+w^{T}x^{i})}{|w|}$. We denote the observation corresponding to $d_{min}$ as $x_{min}$ and its outcome as $y_{min}$. An alternative way to formulate this maximization process is then:

\begin{align}
\arg\max_{\{w,b\}}&	\frac{y_{min}(b+w^{T}x_{min})}{|w|},\\
s.t.&	\frac{y^{i}(b+w^{T}x^{i})}{|w|}-\frac{y_{min}(b+w^{T}x_{min})}{|w|}\geqslant0.
\end{align}

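If the data point that attains the minimal distance remains the same during the variation, we can use the scale invariance of $\{w,b\}$ (rescaling both by the same positive factor leaves the hyperplane unchanged) to set $y_{min}(b+w^{T}x_{min})=1$. The distance to be maximized then becomes $1/|w|$, and maximizing it is equivalent to minimizing $|w|^{2}$, so the problem can be re-expressed as: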

\begin{align}
\arg\min_{\{w,b\}}&	|w|^{2},\\
s.t.&	y^{i}(b+w^{T}x^{i})-1\geqslant0\\
&	y_{min}(b+w^{T}x_{min})=1.
\end{align}

The problem involves inequality constraints and can be solved through the Karush–Kuhn–Tucker conditions, which are introduced below.
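
The rescaling argument can be illustrated with a small NumPy sketch. The toy data, the candidate hyperplane `(w, b)`, and the helper `signed_distances` are all assumptions made up for this example; the hyperplane is merely a separating one, not the optimal one. The sketch computes $d_{min}$, rescales $\{w,b\}$ so that the closest point has $y(b+w^{T}x)=1$, and checks that $d_{min}=1/|w|$ afterwards.

```python
import numpy as np

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])

def signed_distances(w, b, X, Y):
    """d^i = y^i (b + w.x^i) / |w| for every observation."""
    return Y * (X @ w + b) / np.linalg.norm(w)

# A candidate separating hyperplane (an assumption, not the SVM solution).
w = np.array([2.0, 2.0])
b = -1.0

d = signed_distances(w, b, X, Y)
d_min = d.min()

# Rescale (w, b) so that the closest point satisfies y (b + w.x) = 1.
# The scale is positive only if every point is correctly classified.
scale = (Y * (X @ w + b)).min()
w_scaled, b_scaled = w / scale, b / scale

print(d_min)                            # smallest distance before rescaling
print(1.0 / np.linalg.norm(w_scaled))   # same value, now written as 1/|w|
```

Because the distances are unchanged by the rescaling, maximizing $d_{min}$ over hyperplanes is the same as minimizing $|w|^{2}$ over the rescaled parameters.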

### Lagrange multiplier
<img src="LagrangeMultipliers2D.svg.png" width="50%" height="50%">
The Lagrange multiplier method solves the following problem:
\begin{align}
maximize:&	f(x)\\
subject\ to:&	g(x)=0.
\end{align}
In the two-dimensional example shown in the picture above, we need to find the maximum of $f(x,y)$ on the red line defined by the constraint $g(x,y)=0$. A necessary condition is that the derivative of $f(x,y)$ along the tangent direction of the red line is zero. This happens in two cases: (1) $\nabla f=0$, regardless of $g$; (2) $\nabla f$ is parallel to $\nabla g$. The two cases can be expressed by a single equation:
\begin{align}
\nabla f=	\lambda\nabla g,
\end{align}
where $\lambda$ is called the Lagrange multiplier and equals zero in the first case. Of course, the above equation has to be combined with the feasibility condition
\begin{align}
g(x)=	0.
\end{align}
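Then the two equations can be combined into the stationarity condition of the Lagrangian $\mathcal{L}(x,\lambda)=f(x)+\lambda g(x)$ (with the sign of $\lambda$ redefined, $\lambda\to-\lambda$, which is harmless because the multiplier of an equality constraint can take either sign):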
\begin{align}
\nabla_{x,\lambda}\mathcal{L}=	0,
\end{align}
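where $\mathcal{L}(x,\lambda)$ is a function on the space extended by the extra variable $\lambda$: the derivative with respect to $x$ reproduces the gradient condition, and the derivative with respect to $\lambda$ reproduces $g(x)=0$.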
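
As a concrete check of the parallel-gradient condition, here is a minimal NumPy sketch for a made-up example: maximize $f(x,y)=x+y$ on the unit circle $g(x,y)=x^{2}+y^{2}-1=0$. The example problem and all function names are assumptions for illustration. The maximum is found by brute-force scanning along the constraint, and the multiplier is reported in the convention $\nabla f=\lambda\nabla g$.

```python
import numpy as np

def f(x, y):
    return x + y

def grad_f(x, y):
    return np.array([1.0, 1.0])

def grad_g(x, y):
    # g(x, y) = x^2 + y^2 - 1
    return np.array([2.0 * x, 2.0 * y])

# Parametrize the constraint g = 0 (unit circle) and scan for the maximum of f.
theta = np.linspace(0.0, 2.0 * np.pi, 100001)
xs, ys = np.cos(theta), np.sin(theta)
k = np.argmax(f(xs, ys))
x_star, y_star = xs[k], ys[k]

gf, gg = grad_f(x_star, y_star), grad_g(x_star, y_star)
# Parallel vectors in 2D have a vanishing cross product.
cross = gf[0] * gg[1] - gf[1] * gg[0]
# Multiplier in the convention grad f = lambda * grad g.
lam = (gf @ gg) / (gg @ gg)

print(x_star, y_star)  # close to (1/sqrt(2), 1/sqrt(2))
print(cross)           # close to zero: the gradients are parallel
print(lam)             # close to 1/sqrt(2)
```

The scan only serves to locate the constrained maximum independently; the point of the sketch is that the two gradients turn out to be parallel there.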

### Karush–Kuhn–Tucker conditions
<img src="Inequality_constraint_diagram.svg.png" width="50%" height="50%">

The KKT conditions are a generalization of the Lagrange condition to include inequality constraints:
\begin{align}
maximize:&	f(x),\\
subject\ to:&	g(x)\leqslant0\\
	&h(x)=0.
\end{align}
The idea is based on a simple observation. If the maximum lies on the boundary $g(x)=0$, the problem reduces to a Lagrange problem with $g(x)=0$ as an additional equality constraint. Otherwise the maximum lies in the interior $g(x)<0$, where the inequality constraint can be discarded and the problem reduces to the Lagrange case with the constraint $h$ only. Following the Lagrange case, we define the Lagrangian as
\begin{align}
\mathcal{L}=	f(x)+\mu g(x)+\lambda h(x),
\end{align}
and the two cases are captured by the complementary slackness condition:
\begin{align}
\mu g(x)=	0.
\end{align}
Of course, the primal feasibility condition:
\begin{align}
g(x)\leqslant	0,
\end{align}
and the stationarity condition:
\begin{align}
\nabla_{x,\lambda}\mathcal{L}=	0,
\end{align}
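should be satisfied. In addition, for this maximization problem and sign convention, the multiplier of the inequality constraint must satisfy the dual feasibility condition $\mu\leqslant0$; otherwise $f$ could still be increased by moving from the boundary into the interior $g(x)<0$.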
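
The case analysis behind the KKT conditions can be sketched on a one-dimensional toy problem, made up for illustration: maximize $f(x)=-(x-2)^{2}$ subject to $g(x)=x-c\leqslant0$, with no equality constraint. The function `solve_kkt_1d` is a hypothetical helper. It enumerates the inactive case ($\mu=0$) and the active case ($g(x)=0$), and accepts a boundary candidate only if $\mu\leqslant0$, which is the standard dual feasibility requirement for the sign convention $\mathcal{L}=f+\mu g$ in a maximization problem.

```python
def solve_kkt_1d(c):
    """Maximize f(x) = -(x - 2)^2 subject to g(x) = x - c <= 0.

    Returns (x, mu) for the KKT point with the largest objective value.
    """
    candidates = []

    # Case 1: constraint inactive, mu = 0, so stationarity reduces to f'(x) = 0.
    x = 2.0
    if x - c <= 0:  # primal feasibility
        candidates.append((x, 0.0))

    # Case 2: constraint active, g(x) = 0, so x = c.
    x = float(c)
    # Stationarity of L = f + mu g: f'(x) + mu g'(x) = 0 with g'(x) = 1,
    # hence mu = -f'(x) = 2 (x - 2).
    mu = 2.0 * (x - 2.0)
    if mu <= 0:  # dual feasibility for this sign convention
        candidates.append((x, mu))

    # Pick the candidate with the largest objective value.
    return max(candidates, key=lambda p: -(p[0] - 2.0) ** 2)

print(solve_kkt_1d(1.0))  # constraint active: maximum sits on the boundary x = 1
print(solve_kkt_1d(3.0))  # constraint inactive: unconstrained maximum x = 2
```

In both cases the product $\mu g(x)$ vanishes, which is the complementary slackness condition.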

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [2]:
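iris = datasets.load_iris()
x = iris['data'][:, (2, 3)]  # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # 1.0 for Iris virginica, 0.0 otherwise

In [3]:
# Standardize the features, then fit a linear SVM with hinge loss.
svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
])

In [4]:
svm_clf.fit(x, y)
svm_clf.predict([[5.5, 1.7]])  # petal length 5.5 cm, petal width 1.7 cm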

array([1.])
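
To connect the fitted model back to the decision rule $y=\mathrm{sign}(b+w^{T}x)$, the following sketch reads $w$ and $b$ from a fitted pipeline and reproduces its predictions by hand. It refits the same kind of model so that the block is self-contained; the `random_state` and `max_iter` values and the variable names are assumptions added here. Note that the labels in this example are 0/1 rather than $\pm1$, so a positive score maps to class 1.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]                   # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # 1.0 for Iris virginica

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge', random_state=42, max_iter=10000)),
])
clf.fit(X, y)

# Pull out the fitted steps and the hyperplane parameters.
scaler = clf.named_steps['scaler']
svc = clf.named_steps['linear_svc']
w = svc.coef_[0]        # normal vector of the hyperplane (in scaled feature space)
b = svc.intercept_[0]   # offset

# Manual decision function: b + w^T x on the standardized features.
X_scaled = scaler.transform(X)
scores = X_scaled @ w + b
predictions = (scores > 0).astype(np.float64)

print(w, b)
print(np.allclose(scores, clf.decision_function(X)))
```

The hyperplane lives in the standardized feature space, which is why the inputs are passed through the scaler before computing $b+w^{T}x$.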