In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Common imports
import os

import math
from sklearn import svm
from sklearn import linear_model

from matplotlib.colors import ListedColormap

import svm_helper
%aimport svm_helper
svmh = svm_helper.SVM_Helper()
svm_ch = svm_helper.Charts_Helper(save_dir="/tmp", visible=False)

import class_helper
%aimport class_helper

clh = class_helper.Classification_Helper()




In [10]:
# Create files containing charts
create = False

if create:
    file_map = svm_ch.create_charts()
    print(file_map)

Create Boundary chart
Create Sens chart
Create margin chart
Done
{'boundary': '/tmp/svm_sep_boundary.jpg', 'sensitivity': '/tmp/svm_sens.jpg', 'margin': '/tmp/svm_margin.jpg'}


# Support Vector Classifier (SVC)
We introduce another model for the Classification task: the Support Vector Classifier (SVC).

We motivate this classifier with some examples.

Consider the following linearly separable dataset and two separating lines:

<table>
    <tr>
        <th><center>Linear separating boundary</center></th>
    </tr>
    <tr>
        <td><img src="images/svm_sep_boundary.jpg"></td>
    </tr>
</table>


Is one separating line "better" than the other?

Intuitively, the first line feels more fragile
- The distance between the line and the closest example is smaller in the first plot than in the second
- If the red point located exactly on the line moved a small amount, we would misclassify it
- It is less likely to generalize well out of sample


Something else to consider:
- How sensitive is our classifier to additional examples *that are far from the original boundary*?

Consider the two rows in the plot below
- The second plot in each row adds a cluster of examples in the bottom right corner

<table>
    <tr>
        <th><center>Sensitivity to far-from-boundary examples</center></th>
    </tr>
    <tr>
        <td><img src="images/svm_sens.jpg"></td>
    </tr>
</table>


As you can see
- Logistic Regression is sensitive
- SVC is not
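
The comparison can be sketched numerically. The tiny dataset below is made up for illustration (it is not the data behind the plot), and both models use scikit-learn's default regularization. We fit each model with and without a far-away cluster of correctly-labeled Positive examples and measure how much the fitted parameters move.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic data (an assumption, not the notebook's dataset):
# Negative examples on the left, Positive examples on the right.
X = np.array([[-1.0, 0.0], [-2.0, 1.0], [-2.0, -1.0],
              [1.0, 0.0], [2.0, 1.0], [2.0, -1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Extra Positive examples in the bottom right, away from the boundary
X_far = np.array([[3.0, -6.0], [3.5, -6.0], [4.0, -6.0], [3.0, -6.5],
                  [3.5, -6.5], [4.0, -6.5], [3.0, -7.0], [4.0, -7.0]])
X_aug = np.vstack([X, X_far])
y_aug = np.concatenate([y, np.ones(len(X_far), dtype=int)])


def fitted_params(model, X, y):
    """Fit the model; return coefficients and intercept as one vector."""
    model.fit(X, y)
    return np.concatenate([model.coef_.ravel(), model.intercept_])


def shift(make_model):
    """Distance between the parameters fitted without and with the far cluster."""
    before = fitted_params(make_model(), X, y)
    after = fitted_params(make_model(), X_aug, y_aug)
    return np.linalg.norm(after - before)


lr_shift = shift(lambda: LogisticRegression())
svc_shift = shift(lambda: SVC(kernel="linear"))

print(f"Logistic Regression parameter shift: {lr_shift:.4f}")
print(f"SVC parameter shift:                 {svc_shift:.4f}")
```

On this toy data the SVC parameters should be unchanged up to solver tolerance, while the Logistic Regression parameters move noticeably. The size of the Logistic Regression shift depends on the data and on the regularization strength.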

Why should points far away from the original separating boundary affect the fit?

The answer, of course, is the Loss function.

We will see that the SVC uses a much different loss that has several advantages.

The SVC introduces two additional separating lines that are parallel to the separating
boundary
- at a distance $\margin$ (measured by length of a line orthogonal to the boundary) on either side of the boundary
- $\margin$ is called the *margin*
- these two lines define a "buffer" of width twice the margin
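
A minimal sketch of the geometry, assuming the boundary is written as $w \cdot x + b = 0$ with an explicit intercept $b$. The values of `w`, `b` and `margin` are hypothetical, chosen for easy arithmetic. The orthogonal distance of a point from the boundary is its score divided by the length of `w`; a point is inside the buffer when that distance is smaller than the margin.

```python
import numpy as np


def signed_distance(X, w, b):
    """Signed orthogonal distance of each row of X from the line w.x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)


# Hypothetical boundary and margin
w = np.array([3.0, 4.0])
b = -5.0
margin = 2.0

X = np.array([[3.0, 4.0],    # well outside the buffer, positive side
              [1.0, 0.5],    # exactly on the separating boundary
              [0.0, 0.0],    # inside the buffer, negative side
              [3.0, 1.5]])   # exactly on one of the two margin lines

dist = signed_distance(X, w, b)
in_buffer = np.abs(dist) < margin   # strictly between the two margin lines

for point, d, inside in zip(X, dist, in_buffer):
    print(point, f"distance={d:+.1f}", "in buffer" if inside else "outside buffer")
```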

We draw two plots with different size margins


<table>
    <tr>
        <th><center>Margin</center></th>
    </tr>
    <tr>
        <td><img src="images/svm_margin.jpg"></td>
    </tr>
</table>


The concept of margin helps to address the two issues raised above.

- Given a choice of two separating boundaries (each with perfect separation)
    - the one with the larger margin is "better" 
- An example that is correctly classified and lies *outside* the buffer should not affect the Loss Function


From the above examples
- It may not be possible to have both a large margin and a boundary that separates classes perfectly
    - the left plot, for example

Requiring perfect separation and no examples in the buffer is called *Hard Margin* classification.
- allowing either condition to be false is called *Soft Margin* classification

A *Soft Margin* classifier allows violations but imposes a *penalty* (by increasing the Loss).

An SVC classifier is a Soft Margin classifier.

The main difference between an SVC and a Logistic Regression classifier is in their loss functions.

The loss function for an SVC contrasts with Cross Entropy (the loss of Logistic Regression) in that
- there are examples with **zero** loss
    - those that are correctly classified and outside the buffer
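
As a preview, here is a sketch of the contrast under two assumptions that are not stated above: labels are encoded as $+1$ (Positive) and $-1$ (Negative), and the Hinge Loss is used in its usual form $\max(0, 1 - y \cdot s)$, which places the edges of the buffer at scores $\pm 1$. With this encoding, Cross Entropy can be written as $\log(1 + e^{-y \cdot s})$.

```python
import numpy as np


def hinge_loss(y, s):
    """Hinge loss for labels y in {-1, +1} and scores s."""
    return np.maximum(0.0, 1.0 - y * s)


def cross_entropy_loss(y, s):
    """Cross Entropy for labels y in {-1, +1}: log(1 + exp(-y*s)), computed stably."""
    return np.logaddexp(0.0, -y * s)


# Illustrative labels and scores (made up for this sketch)
y = np.array([1, 1, 1, 1, -1])
s = np.array([-2.0, 0.5, 1.0, 3.0, -4.0])

hinge = hinge_loss(y, s)
xent = cross_entropy_loss(y, s)

for yi, si, h, x in zip(y, s, hinge, xent):
    print(f"y={yi:+d} s={si:+.1f}  hinge={h:.3f}  cross entropy={x:.3f}")
```

The first example is misclassified and the second is correctly classified but inside the buffer: both have positive Hinge Loss. The remaining examples are correctly classified and on or outside the buffer edge, so their Hinge Loss is exactly zero, while their Cross Entropy is small but never zero.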


We will dig into the loss function for the SVC shortly.

# Support Vector Machines (SVM)

As we have seen before, many datasets, in raw form, are not linearly separable.

Transformations of the raw data must be applied to induce Linear Separability.

A *Support Vector Machine* (SVM) applies transformation functions to the raw data and then
 fits an SVC to the transformed data.
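
A minimal sketch of "transform, then SVC", using an explicit polynomial transformation from scikit-learn rather than a kernel. The concentric-circle data is synthetic and the choice of a degree-2 transformation is an assumption made for this illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Synthetic data: class 0 on a circle of radius 1, class 1 on a circle of radius 3.
# No straight line separates the two classes in the raw features.
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([1.0 * circle, 3.0 * circle])
y = np.array([0] * 16 + [1] * 16)

# SVC on the raw features
raw_svc = SVC(kernel="linear").fit(X, y)

# Same SVC, preceded by a degree-2 polynomial transformation of the features
transformed_svc = make_pipeline(
    PolynomialFeatures(degree=2),
    SVC(kernel="linear"),
).fit(X, y)

raw_acc = raw_svc.score(X, y)
transformed_acc = transformed_svc.score(X, y)

print(f"Accuracy on raw features:         {raw_acc:.2f}")
print(f"Accuracy on transformed features: {transformed_acc:.2f}")
```

The squared features make the radius visible to a linear boundary, so the transformed data becomes linearly separable even though the raw data is not.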
 
There is a special subclass of transformation functions called **kernel functions** that
- use a clever mathematical trick
- to achieve the effect of applying an expensive transformation without actually creating the transformed data!
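
To make the trick concrete, here is a small sketch with one particular kernel, the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on two features. This kernel and the names `phi` and `poly_kernel` are chosen for illustration only. The kernel returns the same number as a dot product of explicitly transformed examples, without ever building the transformed features.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 transformation of a 2-feature example."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def poly_kernel(x, z):
    """Degree-2 polynomial kernel: works directly on the raw features."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # transform first, then dot product
via_kernel = poly_kernel(x, z)      # no transformed data is created

print(explicit, via_kernel)
```

Here the explicit transformation has only three features, so the saving is small. The benefit grows with the degree of the polynomial and the number of raw features.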

The SVM helps to automate the transformations.

# Key concepts

Just like Logistic Regression, the SVC will:
- Use the features $\x^\ip$ to compute a "score" $\hat{s}^\ip$
- Compare the predicted score to a threshold
- Predict Positive if the score exceeds the threshold; Negative otherwise

The score is linear in the features
$$\begin{array}{ll}
s(\x) = \Theta^T \x & \text{score} \\
s(\x) = 0 & \text{equation of separating boundary} \\
\end{array}
$$
$$
\hat{y}^\ip = 
\left\{
    {
    \begin{array}{lll}
     \text{Negative} & \textrm{if } \hat{s}^\ip   < 0  &  \\
      \text{Positive}& \textrm{if } \hat{s}^\ip \ge 0  
    \end{array}
    }
\right.
$$
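
The scoring and prediction steps can be sketched in a few lines of NumPy. We assume here that each example carries a leading constant feature equal to 1, so that the first element of $\Theta$ acts as the intercept. The values of `Theta` and `X` are made up.

```python
import numpy as np

# Hypothetical parameters: intercept first, then one weight per feature
Theta = np.array([-1.0, 2.0, 0.5])

# Each row is one example, with a leading constant feature of 1
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.5, 0.0],
              [1.0, 0.0, 4.0]])

# Score: linear in the features
scores = X @ Theta

# Predict Positive when the score is at least 0, Negative otherwise
predictions = np.where(scores >= 0, "Positive", "Negative")

for s, p in zip(scores, predictions):
    print(f"score={s:+.1f} -> {p}")
```

The third example has a score of exactly 0: it lies on the separating boundary and, following the rule above, is predicted Positive.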

The SVC and SVM incorporate several "tricks"
- A clever Loss Function (Hinge Loss)
- Large Margin Classification
- Kernel Transformations

The combined effect can be overwhelming and makes the SVC/SVM seem complex
- and it *does* involve a bit of math

We will try to reduce the complexity by tackling each trick separately.


In [5]:
print("Done")

Done
