In a previous post, we talked about how a decision tree (link here) can classify the different targets in the Iris dataset.

Today we are going to explore another supervised machine learning algorithm, the Support Vector Machine (SVM), which is also useful for classification. We will use the same Iris data.

![svm](wiki_svm1.png)

As we can see above, the general idea of an SVM is to put a boundary (or multiple boundaries for multi-class classification) between distinct classes while maximizing the margin between them.

When we try to draw a line between the classes, the data is either linearly separable, as in the right figure above, or not linearly separable, as in the left figure. In the second case we have to transform the data samples so that they become linearly separable.

This transformation can be done through the kernel trick, which will be covered later in the post.

Let's start with linearly separable data.

![example1](example1.png)

In this example we have two classes: $1$ (black circles) and $-1$ (hollow circles). The decision boundary in the middle ($wx - b = 0$) is the line that we have to figure out.

In the figure above there are two other lines, $wx - b = 1$ and $wx - b = -1$. These lines pass through the points closest to the decision boundary, and those points are called 'support vectors'. The boundary is determined by these vectors alone.

Points that are not support vectors don't affect the location of the boundary. Consider the following situation.

![example1](example2.png)

Since the decision boundary lies exactly in the middle between the support vectors of the two classes, it doesn't matter whether the other points move, as long as they stay outside the margin.
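We can check this claim numerically with a small sketch. The toy points below are made up for illustration (they are not the points in the figure), and a linear `SVC` with a very large `C` is used as an approximation of a hard-margin SVM. Note that sklearn writes the boundary as $w \cdot x + \text{intercept} = 0$.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])


def fit_linear(X, y):
    # A very large C approximates a hard-margin SVM.
    model = SVC(kernel="linear", C=1e6)
    model.fit(X, y)
    return model


before = fit_linear(X, y)

# Move the point (3, 4), which lies outside the margin,
# even further away from the boundary.
X_moved = X.copy()
X_moved[4] = [6.0, 7.0]
after = fit_linear(X_moved, y)

print("support vector indices:", before.support_)
print("w, b before:", before.coef_[0], before.intercept_[0])
print("w, b after: ", after.coef_[0], after.intercept_[0])
```

Up to the solver's tolerance, `w` and `b` should come out the same in both fits, because the moved point was never one of the points that pin the boundary down.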

As shown in figure 2, the distance between any support vector and the boundary line is $\frac{1}{\lVert w \rVert}$, which is easy to verify.

Let $wx + b = 0$ be the boundary line (the same line as in the figure, with the sign of $b$ flipped) and let $(x_0, y_0)$ be a support vector. The distance between a point and the line $Ax + By + C = 0$ is
$$d = \frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}$$
where $x_0$ and $y_0$ are the coordinates of the support vector and $A$ and $B$ are the components of $w$.

$Ax_0 + By_0$ is $wx$ and $C$ is the bias $b$, so the numerator is $|wx + b|$ and the denominator $\sqrt{A^2 + B^2}$ is $\lVert w \rVert$. A support vector of class $1$ lies on the line $wx + b = 1$ and a support vector of class $-1$ lies on $wx + b = -1$, so the numerator is $1$ in both cases because of the absolute value. Hence the distance from the boundary to a support vector is $\frac{1}{\lVert w \rVert}$, and the distance between the support vectors of the two classes is $\frac{2}{\lVert w \rVert}$, just as in figure 2.
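A quick numeric check of this result, as a minimal sketch. The values of `w`, `b` and the point are hypothetical, chosen only so that the point sits exactly on the line $wx + b = 1$.

```python
import math


def distance_to_line(w, b, point):
    """Distance from `point` to the line w . x + b = 0 (2-D case)."""
    numerator = abs(w[0] * point[0] + w[1] * point[1] + b)
    return numerator / math.hypot(w[0], w[1])


# Hypothetical values: the point satisfies 3*1 + 4*2 - 10 = 1,
# so it plays the role of a class 1 support vector.
w = (3.0, 4.0)
b = -10.0
support_vector = (1.0, 2.0)

d = distance_to_line(w, b, support_vector)
norm_w = math.hypot(*w)

print(d, 1 / norm_w)  # both are 0.2
```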

An SVM maximizes the margin between the classes. Since the margin is $\frac{2}{\lVert w \rVert}$, maximizing it means making $\lVert w \rVert$ as small as possible while every training point still stays on the correct side of its margin line. Finding $w$ is a quadratic programming problem, which will be covered in a different post on optimization. The problem has the following form, where $(x_i, y_i)$ are the training points and their labels:
$$\min_{w, b} \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i(wx_i + b) \geq 1 \text{ for all } i$$

Finding this $w$ (together with $b$) completes the construction of an SVM model for linearly separable data.
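To see what a solution looks like, here is a minimal sketch that lets sklearn solve the problem on a made-up toy dataset and then checks the constraint and the margin width. The data and the choice of `C=1e6` (to approximate the hard-margin problem above) are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem.
model = SVC(kernel="linear", C=1e6).fit(X, y)
w = model.coef_[0]
b = model.intercept_[0]

# Every training point should satisfy y_i (w . x_i + b) >= 1,
# up to the solver's tolerance.
constraint_values = y * (X @ w + b)
margin_width = 2 / np.linalg.norm(w)

print("y_i (w . x_i + b):", np.round(constraint_values, 3))
print("margin width 2/||w||:", margin_width)
```

The points whose constraint value is (approximately) $1$ are the support vectors; the others have values strictly larger than $1$.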

Now let's consider a situation where our data cannot be split nicely.

![example3](example3.png)

In the left figure, the blue and green circles cannot be divided with a straight line, so we have to somehow transform the data to look like the right figure. This ability is one of the main advantages of SVMs. The idea is to map the data into a higher dimension where it can be split with a straight line. In the example above, the points on the left are one-dimensional while the points on the right are two-dimensional. The transformation is $T(x): x \rightarrow (x, x^2)$, that is, each point keeps its original coordinate and gains $x^2$ as a second one. Note that we are not actually changing the data itself; we are just finding a suitable function $T(x)$ to use.

Once the data is transformed, the rest of the procedure is to find $w$ again for a line or a plane that separates the data, the same as above.
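The effect of such a transformation can be sketched in a few lines of plain Python. The one-dimensional points and labels below are hypothetical (inner points in one class, outer points in the other), and the lift appends $x^2$ as a second coordinate. After the lift, a horizontal line in the second coordinate separates the classes.

```python
# Hypothetical 1-D data: inner points are class -1, outer points are class 1.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]


def separable_by_threshold(values, labels):
    """True if a single threshold puts one class entirely below the other."""
    pos = [v for v, label in zip(values, labels) if label == 1]
    neg = [v for v, label in zip(values, labels) if label == -1]
    return max(neg) < min(pos) or max(pos) < min(neg)


def lift(x):
    """Map a 1-D point to 2-D by appending its square: x -> (x, x^2)."""
    return (x, x * x)


lifted = [lift(x) for x in xs]
second_coordinate = [point[1] for point in lifted]

print(separable_by_threshold(xs, ys))                 # False
print(separable_by_threshold(second_coordinate, ys))  # True
```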

Implementing an SVM is very simple with sklearn.

In [8]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.datasets import load_iris

import matplotlib.pyplot as plt

In [9]:
iris = load_iris()

X = iris.data[:, :2]
y = iris.target

![iris1](iris1.png)

As we can see, there are three classes and they are not linearly separable. Let's see how an SVM model can classify them.

In [19]:
X_train, X_val, y_train, y_val = train_test_split(X, y)

model = SVC()

model.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
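The model above is fit with default parameters, and `GridSearchCV` is imported but not yet used. One plausible next step, sketched here under assumptions, is to score the model on the validation split and then search over `C` and `gamma`. The grid values and the fixed `random_state` are illustrative choices, not settings from this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# random_state is fixed here only to make the sketch repeatable.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Baseline: default SVC, scored on the held-out validation split.
model = SVC()
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print("validation accuracy:", val_accuracy)

# Hypothetical search grid; the values are illustrative, not tuned.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 1, 10], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("validation accuracy (best model):", search.score(X_val, y_val))
```

`GridSearchCV` refits the best parameter combination on the whole training split by default, so `search.score` evaluates that refit model.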