# Week 7: Support Vector Machines

Last of supervised learning algos in this course.  


## Optimization Objective

Recall logistic reg:

$$h_\theta(x) = \frac{1}{1 + exp(-\theta^Tx)}$$

If $y= 1$, we want $h_\theta(x) \approx 1$ means $\theta^Tx >> 0$.    

If $y= 0$, we want $h_\theta(x) \approx 0$ means $\theta^Tx << 0$. 

<img src = "figures/fig1.png" width = 300>
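A minimal Python sketch of the hypothesis above, to make the two limits concrete. The name `sigmoid` and the sample values of $z = \theta^Tx$ are illustrative choices, not from the course.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), where z stands for theta^T x."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values of z: very negative, zero, very positive.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

For $z \gg 0$ the output approaches 1 and for $z \ll 0$ it approaches 0, which is what the two cases above ask for.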

### Cost Function for Logistic Regression

One example cost: $-(y \log h_\theta(x) + (1-y) \log (1 - h_\theta(x)))$
$$ = -\left(y \log \frac{1}{1 + exp(-\theta^Tx)} + (1-y) \log \left(1 - \frac{1}{1 + exp(-\theta^Tx)}\right)\right)$$  

If $y=1$ then want  $\theta^Tx >> 0$ so that $-\log \frac{1}{1 + exp(-\theta^Tx)} \approx 0$

In SVM we are going to take the above cost function (term 1), and modify a bit with an elbow approximation.

<img src = "figures/fig2.png" width = 500>
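The elbow approximation can be sketched in Python as follows. The notes only say that the cost is flat at zero beyond an elbow; placing the elbows at $z = 1$ and $z = -1$ matches the conditions used later in these notes, while the slope of 1 on the linear part is an assumption (the usual hinge form). All function names are illustrative.

```python
import math

def logistic_cost_1(z):
    """Logistic regression cost for a y = 1 example: -log(1 / (1 + exp(-z)))."""
    return math.log(1.0 + math.exp(-z))

def cost_1(z):
    """Elbow approximation for y = 1: zero once z >= 1, linear below that.
    The slope of 1 is an assumption, not fixed by the notes."""
    return max(0.0, 1.0 - z)

def cost_0(z):
    """Elbow approximation for y = 0: zero once z <= -1, linear above that."""
    return max(0.0, 1.0 + z)

# Compare the smooth logistic cost with its piecewise-linear stand-in.
for z in (-2.0, 0.0, 1.0, 3.0):
    print(z, logistic_cost_1(z), cost_1(z))
```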

### Cost Function in Support Vector Machine

**Logistic Regression:**
$$\min _{\theta} \frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\left(-\log h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right)\left(-\log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right)\right]+\frac{\lambda}{2 m} \sum_{j=1}^{n} \theta_{j}^{2}$$

**Support Vector Machine:**

$$\min _{\theta}\left[ C \sum_{i=1}^{m} \left[y^{(i)}cost_1(\theta^T x^{(i)})+\left(1-y^{(i)}\right)cost_0(\theta^T x^{(i)})\right]+\frac{1}{2} \sum_{j=1}^{n} \theta_{j}^{2}\right]$$
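As a rough sketch, the SVM objective can be evaluated directly in Python. The hinge-shaped `cost_1` and `cost_0` (slope 1) are an assumption, and the tiny data set and `theta` are made up for illustration. Each row of `X` starts with the intercept feature $x_0 = 1$, and $\theta_0$ is left out of the regularization term.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def svm_objective(theta, X, y, C):
    """C * (sum of per-example costs) + 0.5 * sum_{j>=1} theta_j^2."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        cost_1 = max(0.0, 1.0 - z)   # cost charged when y = 1
        cost_0 = max(0.0, 1.0 + z)   # cost charged when y = 0
        data_term += y_i * cost_1 + (1 - y_i) * cost_0
    reg_term = 0.5 * sum(t * t for t in theta[1:])  # skip theta_0
    return C * data_term + reg_term

# Hypothetical data: one positive and one negative example.
X = [[1.0, 2.0, 2.0], [1.0, -2.0, -2.0]]
y = [1, 0]
theta = [0.0, 0.5, 0.5]
print(svm_objective(theta, X, y, C=1.0))  # 0.25: both costs are 0, only regularization remains
```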

### SVM Hypothesis

$$
h_\theta(x) = \left\{\begin{array}{ll}
1 & \text{ if } \theta^T x \geq 0 \\
0 & \text{ otherwise }
\end{array}\right.
$$
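In code, the hypothesis is a plain threshold on $\theta^Tx$. A small sketch, where the function name `predict` and the sample numbers are illustrative:

```python
def predict(theta, x):
    """SVM hypothesis: output 1 when theta^T x >= 0, otherwise 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# x includes the intercept feature x_0 = 1 as its first entry.
print(predict([-1.0, 1.0], [1.0, 2.0]))
print(predict([-1.0, 1.0], [1.0, 0.5]))
```

Unlike logistic regression, the output is a hard 0/1 label rather than a probability.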


## Large Margin Intuition

Think of the cost function:

$$\min _{\theta}\left[ C \sum_{i=1}^{m} \left[y^{(i)}cost_1(\theta^T x^{(i)})+\left(1-y^{(i)}\right)cost_0(\theta^T x^{(i)})\right]+\frac{1}{2} \sum_{j=1}^{n} \theta_{j}^{2}\right]$$

What makes this func small?  
For $y = 1$ want $\theta^T x \geq 1 \rightarrow cost_1 = 0$  
For $y = 0$ want $\theta^T x \leq -1 \rightarrow cost_0 = 0$  

It's not enough to just barely be greater than (less than) 0! 

### Intuition about the decision boundary

Suppose $C$ is large, so the first term is more impactful in the objective function. We have a lot of motivation to minimize this as much as possible.

Whenever $y^{(i)} = 1:$  need $\theta^T x \geq 1$

Whenever $y^{(i)} = 0:$  need $\theta^T x \leq -1$   

We are left so that the first term is 0:

$\min_\theta \ C \cdot 0 + \frac{1}{2} \sum_j \theta_j^2$  

such that the conditions above hold.
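Written out as a single constrained problem, this reads as follows. It is only a restatement of the lines above, under the assumption that $C$ is large enough for every training example to reach zero cost:

$$
\begin{aligned}
\min_\theta \quad & \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \\
\text{s.t.} \quad & \theta^T x^{(i)} \geq 1 \quad \text{if } y^{(i)} = 1, \\
& \theta^T x^{(i)} \leq -1 \quad \text{if } y^{(i)} = 0.
\end{aligned}
$$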

This gives interesting decision boundaries. The support vector machines will give the black line. The connection between this and the previous knowledge is unclear atm.

<img src = "figures/fig3.png" width=300>

So you might get either of two decision boundaries, and $C$ decides which one. The black line is the large-margin boundary. 

With a very large $C$, the large-margin classifier is very sensitive to outliers: a single outlier can swing the boundary to the magenta line!  

If $C$ is not too large, you keep the black boundary, even in the presence of outliers.

<img src = "figures/fig4.png" width=400>

### Mathematics Behind Large Margin Classification

For illustration, ignore $\theta_0$ and define $n = 2$, only 2 features.

Consider $\min_\theta \frac{1}{2} \sum_{j=1}^n \theta_j^2$  observe:

$$\sum_{j=1}^n \theta_j^2 = ||\theta||^2$$  

Remember we don't need the other terms in the objective function, because the constraints force them to be nearly 0! (We also take $C$ large enough that what's written above is what the optimizer would mimic.)

where we ignore $\theta_0$ in the calculation. 

Now consider: 

$$\theta^T x \geq 1$$  
and  
$$ \theta^T x \leq -1$$

and think of: $u^T := \theta^T, \ v := x$

So if you plot on the $x_1^{(i)}, x_2^{(i)}$ axis, we can see this thing is actually an inner product.  

This is like saying 
$$\theta^T x^{(i)} \text{ is the signed length of the projection of } x^{(i)} \text{ onto } \theta, \text{ times } ||\theta||$$

or, writing $p^{(i)}$ for that signed length,

$$\theta^T x^{(i)} = p^{(i)} ||\theta|| $$

<img src="figures/fig5.png" width = 300>
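A quick numeric check of the identity $\theta^T x = p\,||\theta||$, as a sketch. The vectors are illustrative values, not from the notes, and `signed_projection_length` is a hypothetical helper name.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length ||u||."""
    return math.sqrt(dot(u, u))

def signed_projection_length(x, theta):
    """Signed length p of the projection of x onto theta."""
    return dot(theta, x) / norm(theta)

theta = [3.0, 4.0]   # illustrative parameter vector, ||theta|| = 5
x = [2.0, 1.0]       # illustrative example
p = signed_projection_length(x, theta)
print(p, p * norm(theta), dot(theta, x))  # 2.0 10.0 10.0
```

The sign of $p$ flips when $x$ points away from $\theta$, which is what lets the same quantity express both the $\geq 1$ and $\leq -1$ conditions.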

We can rewrite the previous as:

$$p^{(i)} ||\theta||  \geq 1 \quad \text{if } y^{(i)} = 1$$  
and  
$$ p^{(i)} ||\theta|| \leq -1 \quad \text{if } y^{(i)} = 0$$

Suppose we are trying to classify and want to determine the decision boundary. Look at the green line: it is a small-margin decision boundary. The parameter vector $\theta$ is orthogonal to the decision boundary.   

This implies $p^{(1)}$ and $p^{(2)}$ are small, so for the above constraints to be respected, $||\theta||$ must be large to compensate. 

<img src = "figures/fig7.png" width=300>

**In contrast** look at this other decision boundary.

<img src = "figures/fig8.png" width=300>

It has much larger $p^{(1)}, p^{(2)}$, i.e. it has chosen $\theta$ so that the examples are much better aligned with $\theta$. This means the magnitudes of the $p$'s are much larger. To maintain the previous constraints $p^{(i)} ||\theta|| \geq 1$, $||\theta||$ can be smaller, which is what the objective function is trying to achieve.



<img src="figures/fig9.png" width=300>

We see that these $p^{(i)}$s are larger, and they are the ones that determine the margin's size. They tell you how far from the boundary you are!
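To put numbers on the trade-off, here is a small sketch that computes the smallest $||\theta||$ compatible with $|p^{(i)}| \cdot ||\theta|| \geq 1$ for a set of projection lengths. The projection values are hypothetical, and `min_norm_needed` is an illustrative helper name.

```python
def min_norm_needed(projections):
    """Smallest ||theta|| with |p_i| * ||theta|| >= 1 for every projection p_i."""
    return 1.0 / min(abs(p) for p in projections)

# Hypothetical signed projection lengths (positive for y = 1, negative for y = 0).
small_margin = [0.2, -0.25]   # examples sit close to the boundary
large_margin = [2.0, -2.5]    # examples sit far from the boundary

print(min_norm_needed(small_margin))
print(min_norm_needed(large_margin))
```

The boundary with the larger projections gets away with a much smaller $||\theta||$, so minimizing $\frac{1}{2}||\theta||^2$ prefers it.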

## Kernels I

Kernels are the main technique for nonlinear decision boundaries.  

Want complex high-order polynomial features. Train on them as if they're independent features.   

Is there a better way to do this than the approach below?  

We know that more complexity means more expensive compute time.

<img src="figures/fig10.png" width=400>

### Kernels To Solve The Problem

Given $x$, compute new features depending on proximity to landmarks $l^{(i)}$s.  

Compute $f_i = similarity(x, l^{(i)}) = exp\left(-\frac{||x - l^{(i)}||^2}{2 \sigma^2} \right)$.  

**Similarity function** is called **Kernel** function.  
$$ K(x, l^{(i)}) $$

**What happens with these kernels?**  

- If $x \approx l^{(i)}:$ 
    - $f_i \approx exp\left(-\frac{||x - l^{(i)}||^2}{2 \sigma^2} \right) \approx exp\left(-\frac{0}{2 \sigma^2} \right) \approx 1$

- If $x$ far from $l^{(i)}:$ 
    - $f_i \approx exp\left(-\frac{||x - l^{(i)}||^2}{2 \sigma^2} \right) \approx exp\left(-\frac{\infty}{2 \sigma^2} \right) \approx 0$
    
    
- Each of the landmarks defines a new feature: $l^{(i)} \rightarrow f_i$
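The similarity function and its two limiting cases can be sketched in a few lines of Python. The name `gaussian_kernel`, the landmark, and the test points are illustrative.

```python
import math

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)) between x and a landmark."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

landmark = [3.0, 5.0]  # illustrative landmark
print(gaussian_kernel([3.0, 5.0], landmark))    # x on the landmark: exactly 1.0
print(gaussian_kernel([30.0, 50.0], landmark))  # x far away: effectively 0
```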

#### Here Is An Example:

Suppose $l^{(1)} = [3, 5]$, $f_1 =  exp\left(-\frac{||x - l^{(1)}||^2}{2 \sigma^2} \right)$, and $\sigma^2 = 1$.

Given training example $x$, compute $f_1, f_2, f_3$ and predict "1" when 

$$ \theta_0 + \theta_1f_1 + \theta_2 f_2 + \theta_3 f_3 \geq 0,$$

with $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, and $\theta_3 = 0$. 

This gives:  

$\theta_0 + \theta_1 * 1 + \theta_2 *0 + \theta_3 * 0 = -0.5 + 1 = 0.5 \geq 0,$ which predicts $y =1.$  

This is for the magenta x, which sits close to $l^{(1)}$, so $f_1 \approx 1$ and $f_2, f_3 \approx 0$.  
What about the cyan x?

$f_1 , f_2, f_3 \approx 0$, so we get  

$\theta_0 + \theta_1 * 0 + \theta_2 *0 + \theta_3 * 0 = -0.5  < 0,$ which predicts $y = 0.$  

Based on these $\theta$ values, any point not close enough to $l^{(1)}$ or $l^{(2)}$ results in $y=0$, creating the decision boundary shown here!


<img src="figures/fig11.png" width=300>
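The worked example can be reproduced with a short script. Only $l^{(1)} = [3, 5]$, $\sigma^2 = 1$, and the $\theta$ values come from the notes. The positions of $l^{(2)}$, $l^{(3)}$ and of the query points are hypothetical stand-ins for the points in the figure.

```python
import math

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

def predict(x, landmarks, theta, sigma_sq=1.0):
    """Predict 1 when theta_0 + theta_1 f_1 + ... >= 0, else 0."""
    features = [1.0] + [gaussian_kernel(x, l, sigma_sq) for l in landmarks]
    score = sum(t * f for t, f in zip(theta, features))
    return 1 if score >= 0 else 0

# l1 is from the notes; l2 and l3 are hypothetical positions.
landmarks = [[3.0, 5.0], [6.0, 2.0], [10.0, 10.0]]
theta = [-0.5, 1.0, 1.0, 0.0]

near_l1 = [3.1, 5.1]      # hypothetical point close to l1
far_away = [20.0, -5.0]   # hypothetical point far from every landmark
near_l3 = [10.0, 10.0]    # sits on l3, but theta_3 = 0

print(predict(near_l1, landmarks, theta))
print(predict(far_away, landmarks, theta))
print(predict(near_l3, landmarks, theta))
```

Because $\theta_3 = 0$, being close to $l^{(3)}$ does not help: only proximity to $l^{(1)}$ or $l^{(2)}$ pushes the score above zero.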

## Kernels II

Where do we get the $l^{(i)}$s?

Predict $y=1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \ldots \geq 0$.  

In practice, suppose we have positive and negative examples.  

For each train example we have, we will put landmarks as exact locations of training examples.  

That is define $l^{(i)} := x^{(i)}$.  

We will end up with $l^{(i)}$ for $i=1\ldots m$.

Now compute features: $f_i = similarity(x, l^{(i)}).$   

Then group the feature vector $f = [f_0, f_1, \ldots, f_m]$, with $f_0 = 1$ as the intercept term.   

We can now represent each training example $x^{(i)}$ by its feature vector $f^{(i)}$ instead of by $x^{(i)}$ itself.
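A sketch of this feature construction, using every training example as a landmark. The three training inputs are hypothetical, and `kernel_features` is an illustrative helper name.

```python
import math

def gaussian_kernel(x, landmark, sigma_sq=1.0):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

def kernel_features(x, landmarks, sigma_sq=1.0):
    """Return f = [f_0, f_1, ..., f_m] with f_0 = 1 (intercept)."""
    return [1.0] + [gaussian_kernel(x, l, sigma_sq) for l in landmarks]

X_train = [[1.0, 1.0], [2.0, 3.0], [5.0, 4.0]]  # hypothetical training inputs, m = 3
landmarks = X_train                              # l^(i) := x^(i)

# One feature vector of length m + 1 per training example.
F = [kernel_features(x, landmarks) for x in X_train]
for row in F:
    print([round(v, 4) for v in row])
```

Each example has similarity exactly 1 with its own landmark, so `F[i][i + 1]` is 1 for every `i`.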

### SVM With Kernels

Hypothesis: given $x$ compute features $f \in R^{m+1}$. 

Predict $y = 1$ if $\theta^T f \geq 0$.   

For Training:  

$$\min_\theta C \sum_{i=1}^m \left[ y^{(i)} cost_1(\theta^T f^{(i)}) +  (1-y^{(i)}) cost_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^m \theta_j^2 $$  

The effective number of features is with respect to $f$, which has $m+1$ entries (the +1 is for the intercept). Remember, for each $x$ there is an $f$, and there are $m$ $x$'s!  

The last term can be thought of as a deviation from the original $\theta^T\theta,$ because we no longer have $n$ original features but $m$ kernel features.  

$$\theta^T \theta \rightarrow \theta^T M \theta$$

### SVM Parameters

$C = (1/ \lambda)$ 
- Large $C$: low bias, hi var
- Small $C$: hi bias, low var  

$\sigma^2$
- Large $\sigma^2:$ features $f_i$ vary smoothly.
    - higher bias, lower variance
- Small $\sigma^2:$ features $f_i$ vary less smoothly.
    - lower bias, higher variance

<img src = "figures/fig12.png" width=200>
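The effect of $\sigma^2$ on smoothness can be seen by evaluating the Gaussian feature at a few distances from a landmark. This is a sketch, and the particular $\sigma^2$ values and distances are illustrative.

```python
import math

def gaussian_feature(distance, sigma_sq):
    """Feature value exp(-d^2 / (2 * sigma^2)) at distance d from a landmark."""
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))

# Larger sigma^2: the feature decays slowly with distance (smoother).
# Smaller sigma^2: the feature drops off quickly (sharper).
for sigma_sq in (0.5, 1.0, 3.0):
    values = [round(gaussian_feature(d, sigma_sq), 3) for d in (0.0, 1.0, 2.0)]
    print(sigma_sq, values)
```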

## Using An SVM

Not recommended to write your own SVM software package to solve for parameters $\theta$. Use libs like (liblinear, libsvm, ...). Many others have written long-standing/tested software libraries to solve many of the math guts of your problem. You should not be writing your own software, but there are other parts of the problem that you will need to be hands-on with.   

Need to specify:
1. Choice of C
2. Choice of kernel (similarity function)  

Note, could specify no kernel (a "linear kernel"), which predicts  
$y = 1$ if $\theta^Tx\geq0$.

Or specify the params of the chosen kernel, e.g.:  
Gaussian kernel, choose $\sigma^2$. 

### Gaussian Kernel As Similarity Function

$$f = exp\left(- \frac{|| x1 - x2||^2}{2\sigma^2}\right).$$

**Notes:**  
1. $x1 = x^{(i)}$ (an example) and $x2 = l^{(j)}$ (a landmark)
2. Do perform feature scaling before using the Gaussian Kernel.
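Why scaling matters: $||x1 - x2||^2$ sums squared differences over all features, so a feature with a large numeric range dominates the distance. A minimal sketch of one common choice, standardizing each column to zero mean and unit standard deviation, follows. The notes do not prescribe a specific scaling method, and the data here is hypothetical.

```python
import math

def standardize(X):
    """Scale each column of X to zero mean and unit standard deviation."""
    n_rows, n_cols = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((row[j] - means[j]) ** 2 for row in X) / n_rows
        stds.append(math.sqrt(var) or 1.0)  # constant column: divide by 1 instead of 0
    return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in X]

# Hypothetical features on very different scales.
X = [[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]]
Z = standardize(X)
for row in Z:
    print([round(v, 4) for v in row])
```

After scaling, both columns contribute comparably to the distance inside the kernel.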

### Other Kernel Choices

Not all similarity functions make valid kernels.  
Kernels must satisfy the condition of **Mercer's Theorem** so that SVM package optimizations run correctly and do not diverge!  

https://en.wikipedia.org/wiki/Mercer%27s_theorem  

- Polynomial kernel: $k(x, l) = (x^T l)^2, \ (x^T l)^3, \ (x^T l + 1)^3, \ldots, (x^T l + const)^{degree}$ 
    - generally perform worse than Gaussian
    
- String kernel (input data is strings), chi-sq Kernel, histogram intersection kernel, ...
    - ex: may want similarity between two strings, $l, x$, then use a string kernel: `string_sim(l, x)`.
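The polynomial kernel family above has two knobs, the constant and the degree. A small sketch, where the function name and the example vectors are illustrative:

```python
def polynomial_kernel(x, l, const=0.0, degree=2):
    """Polynomial kernel (x^T l + const)^degree."""
    inner = sum(a * b for a, b in zip(x, l))
    return (inner + const) ** degree

x = [1.0, 2.0]
l = [3.0, 4.0]   # x^T l = 11
print(polynomial_kernel(x, l))                       # (11)^2
print(polynomial_kernel(x, l, const=1.0, degree=3))  # (11 + 1)^3
```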

## Multi-class Classification

Find an appropriate decision boundary with SVMs.  

For $k$ classes use $k$ SVMs! 

Train $k$ SVMs, each one distinguishing $y=i$ from the rest.

Yields vectors $\theta^{(i)}, i = 1\ldots k$, one for each class.  

Pick class $i$ with largest ${\theta^{(i)}}^Tx$, i.e.

$$prediction = \arg \max_{{i}} {\theta^{(i)}}^Tx$$
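The one-vs-all decision rule is an argmax over the $k$ scores. A sketch, assuming the $k$ parameter vectors have already been trained (the `thetas` below are made-up values, and classes are numbered $1 \ldots k$):

```python
def predict_one_vs_all(thetas, x):
    """Return the class i (1-based) whose theta^(i) gives the largest theta^T x."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1

# Hypothetical parameter vectors for k = 3 classes (first entry multiplies x_0 = 1).
thetas = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, -1.0, -1.0],
]

print(predict_one_vs_all(thetas, [1.0, 2.0, 0.5]))
print(predict_one_vs_all(thetas, [1.0, -1.0, -1.0]))
```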

### Logistic Reg Vs SVMs

$n =$ number of features, $x \in R^{n+1}$  
$m =$ number of training examples.  

If $n>>m$ 
- $n$ large relative to $m$, 
- $n=10k$, $m =10...1000$  
- use logistic regression or SVM without a kernel ("linear kernel"). This means there is not enough data to fit a more complicated non-linear kernel/model.  

If $n$ small, $m$ intermediate 
- $n=1-1000$, $m =10- 10000$ 
- use SVM with Gaussian kernel 

If $n$ small, $m$ large 
- $n=1-1000$, $m =50k+$ 
- create/add more feats, then use logistic reg or no-kernel SVM

Neural networks are likely to work well for most of these settings, but may be slower to train, especially when $n$ is small and $m$ is intermediate.