# Support Vector Machines

The SVM hypothesis and cost function can be derived from logistic regression. The logistic regression hypothesis is defined as:
$$
h_{\theta}(x) = \frac{1}{1+e^{-\theta^{T}x}}
$$

where:
* if $y=1$, we want $h_{\theta}(x) \approx 1$, i.e. $\theta^T x \gg 0$
* if $y=0$, we want $h_{\theta}(x) \approx 0$, i.e. $\theta^T x \ll 0$

Cost for a single example can be written as:

$$
-\bigg(y \log\big(h_{\theta}(x)\big) + (1 - y) \log\big(1 - h_{\theta}(x)\big)\bigg) = -y \log\frac{1}{1+e^{-\theta^{T}x}} - (1-y)\log\bigg(1 - \frac{1}{1+e^{-\theta^{T}x}}\bigg)
$$

Summing over all $m$ training examples and adding the regularization term gives the full optimization objective for regularized logistic regression:
$$
\underset{\theta}{\min} \frac{1}{m}\bigg[\sum_{i=1}^{m}y^{(i)}\big( -\log h_{\theta}(x^{(i)})\big) + (1 - y^{(i)})\big( -\log(1 - h_{\theta}(x^{(i)}))\big) \bigg] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}
$$
where $\lambda$ is the regularization parameter.
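
As a minimal sketch, the objective above can be evaluated with NumPy as follows. The function name `logreg_cost` and the toy data are illustrative assumptions, not part of the exercise; `X` is assumed to carry a leading column of ones, so that `theta[0]` is the intercept, which the sum over $j=1,\dots,n$ leaves unpenalized.

```python
import numpy as np

def logreg_cost(theta, X, y, lam):
    """Regularized logistic regression cost (illustrative helper)."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # hypothesis h_theta(x) for every row
    data_term = np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return data_term + reg_term

# toy data (made up): intercept column plus one feature
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

cost_at_zero = logreg_cost(np.zeros(2), X_toy, y_toy, lam=1.0)
print(cost_at_zero)   # log(2) ~ 0.6931, because h = 0.5 for every example
```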

For SVMs, the regularization parameter is $C$, which plays the role of $\frac{1}{\lambda}$. Instead of weighting the regularization term, it multiplies the cost (data-fit) term on the left of the objective.

Instead of the logarithmic cost, the SVM uses piecewise-linear costs $cost_{1}$ and $cost_{0}$ (plotted by the code cells that follow the equation), and the SVM optimization objective is:
$$
\underset{\theta}{\min} C\bigg[\sum_{i=1}^{m}y^{(i)}\big(cost_{1}(\theta^T x^{(i)})\big) + (1 - y^{(i)})\big(cost_{0}(\theta^T x^{(i)})\big) \bigg] + \frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}
$$
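
A minimal sketch of the two piecewise-linear costs, assuming the standard hinge form $cost_{1}(z) = \max(0, 1 - z)$ and $cost_{0}(z) = \max(0, 1 + z)$. The notebook's own plot uses a manually tuned slope instead, so its numbers differ. The names `cost1`, `cost0` and `svm_objective` and the toy data are illustrative assumptions; `X` is assumed to include a leading column of ones.

```python
import numpy as np

def cost1(z):
    # cost for y = 1: zero once z >= 1, linear below that
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # cost for y = 0: zero once z <= -1, linear above that
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    z = X @ theta
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return C * data_term + 0.5 * np.sum(theta[1:] ** 2)   # theta_0 is not penalized

# toy data (made up): both points lie outside the margin, so only the penalty remains
X_toy = np.array([[1.0, -2.0], [1.0, 2.0]])
y_toy = np.array([0, 1])
obj = svm_objective(np.array([0.0, 1.0]), X_toy, y_toy, C=1.0)
print(obj)   # 0.5
```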


In [None]:
import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import style
style.use('ggplot')
%matplotlib inline
from sklearn.svm import SVC
from scipy.io import loadmat
from sympy import Symbol, diff

In [None]:
# domain
x1 = -3
x2 = 3
x = np.linspace(x1, x2, 100)

# logreg cost
c_lr = lambda x: - np.log(1/(1 + np.exp(-x)))
c_lr_vals = c_lr(x)

# 1st derivative of logreg cost
h = 0.01
diff1_approx = lambda x :(c_lr(x+h)-c_lr(x-h))/(2*h)
diff1_act = lambda x: -1/(1+np.exp(x))

# svm cost (piecewise linear: zero for x > 1)
k = -0.645 # manually configured slope
c_svm = np.where(x <= 1, k * (x - 1), 0.0)
    
# visualize   
plt.figure(figsize=(12,8))
plt.plot(x, c_lr(x), label='logistic regression cost')
plt.plot(x, diff1_act(x), label='theoretical derivative of logreg cost')
plt.plot(x, diff1_approx(x), marker='x', markevery=5, linestyle='None', label='numerical derivative of logreg cost')
plt.plot(x, c_svm, label='svm cost')

plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='best')
plt.show()

## 1 Support vector machines - examples

### 1.1 Example dataset 1

In this example, 2D data is separated by a linear decision boundary. Because of a single positive outlier at about (0.1, 4.1), the regularization parameter C drastically affects the position of the boundary.

Informally, the C parameter is a positive value that controls the penalty for misclassified training examples; a large C parameter tells the SVM to try to classify **all** the examples correctly. It plays a role similar to $\frac{1}{\lambda}$, where $\lambda$ is the regularization parameter for regularized logistic regression.

In [None]:
dataset1 = loadmat('data/ex6data1.mat')
print(dataset1['__header__'])
X = dataset1['X']
y = dataset1['y']

In [None]:
def plotData(X, y):
    pos = np.where(y==1)[0]
    neg = np.where(y==0)[0]
    plt.plot(X[pos, 0], X[pos, 1], 
             marker='+',
             color='black',
             markersize=7,
             linestyle='None',
             label='pos examps')

    plt.plot(X[neg, 0], X[neg, 1], 
             marker='o',
             color='red',
             markersize=7,
             linestyle='None',
             label='neg examps')
    plt.xlabel('x1')
    plt.ylabel('x2')
    
    
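def trainSVM(X, y, C, kernel, tol, max_iter):
    # Builds an unfitted SVC; the caller is expected to call .fit() on the result.
    # X and y are not used here. SVC parameters must be passed by keyword.
    return SVC(C=C, kernel=kernel, tol=tol, max_iter=max_iter)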

def visualizeBoundary(X, y, model):
    # make classification predictions over a grid of values
    x1plot = np.linspace(np.min(X[:, 0]), np.max(X[:, 0]), num=100)
    x2plot = np.linspace(np.min(X[:, 1]), np.max(X[:, 1]), num=100)
    preds = np.zeros((len(x1plot), len(x2plot)))
    for i, x1 in enumerate(x1plot):
        for j, x2 in enumerate(x2plot):
            # predict returns a length-1 array; take its single element
            preds[i, j] = model.predict(np.array([[x1, x2]]))[0]
    
    #X1, X2 = np.meshgrid(x1plot, x2plot)
    contr = plt.contour(x1plot, x2plot, preds.T)

In [None]:
plotData(X, y)
plt.legend(loc='lower left')
plt.show()

#### C = 1

In [None]:
# train a model
C = 1
kernel = 'linear'
tol = 1e-3
max_iter=1000
model = trainSVM(X, y, C, kernel, tol, max_iter)
model.fit(X, y.ravel())

In [None]:
# plot the SVM boundary
plotData(X, y)
visualizeBoundary(X, y, model)
plt.axis([np.min(X[:, 0])-0.2, np.max(X[:, 0])+0.2,
         np.min(X[:, 1])-0.2, np.max(X[:, 1])+0.2])
plt.legend(loc='lower left')
plt.show()

#### Changing the C parameter

In [None]:
Cs = [0.1, 1, 10, 100]
kernel = 'linear'
tol = 1e-2
max_iter = 1000

# one figure with a 2x2 grid, one panel per value of C
plt.figure(figsize=(12, 9))
for i, C in enumerate(Cs):
    model = trainSVM(X, y, C, kernel, tol, max_iter)
    model.fit(X, y.ravel())

    plt.subplot(2, 2, i+1)
    plotData(X, y)
    plt.axis([np.min(X[:, 0])-0.2, np.max(X[:, 0])+0.2,
             np.min(X[:, 1])-0.2, np.max(X[:, 1])+0.2])
    visualizeBoundary(X, y, model)
    plt.title(f'C = {C}')
    plt.legend(loc='lower left')
plt.show()

### 1.2 SVM for non-linearly separable data

#### 1.2.1 Gaussian kernel

To find non-linear decision boundaries with the SVM, the input features have to be mapped to a higher-dimensional space in which the data can be separated linearly by a hyperplane.

The most commonly used kernel for this is the Gaussian kernel, also called the radial basis function (RBF). The Gaussian kernel is a similarity function that measures the 'distance' between a pair of examples, $(x^{(i)}, x^{(j)})$. It is parametrized by a bandwidth parameter, $\sigma$, which determines how fast the similarity decreases to 0 as the examples move further apart.
$$
K_{gaussian}(x^{(i)}, x^{(j)}) = \exp\bigg(-\frac{\|x^{(i)} - x^{(j)}\|^2}{2\sigma^{2}}\bigg) = \exp\bigg(-\frac{\sum_{k=1}^{n}(x_{k}^{(i)} - x_{k}^{(j)})^2}{2\sigma^2}\bigg)
$$
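
As a rough sketch, the same formula can be evaluated for all pairs of examples at once with NumPy broadcasting. The helper name `gaussian_gram_matrix` and the sample points are illustrative assumptions.

```python
import numpy as np

def gaussian_gram_matrix(A, B, sigma):
    """Pairwise Gaussian kernel values between the rows of A and the rows of B."""
    # squared Euclidean distances via broadcasting, shape (len(A), len(B))
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# two sample points whose squared distance is 1 + 4 + 4 = 9
A = np.array([[1.0, 2.0, 1.0], [0.0, 4.0, -1.0]])
K = gaussian_gram_matrix(A, A, sigma=2.0)
print(K)
```

The diagonal entries are 1 (each point is maximally similar to itself), and the off-diagonal entries equal $\exp(-9/8) \approx 0.3247$.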

In [None]:
def gaussianKernel(x1, x2, sigma):
    x1 = x1.flatten()
    x2 = x2.flatten()
    
    return np.exp(-np.sum((x1-x2)**2)/(2*sigma**2))

In [None]:
x1 = np.array([1, 2, 1])
x2 = np.array([0, 4, -1])
sigma = 2

sim = gaussianKernel(x1, x2, sigma)

In [None]:
print(f'Gaussian kernel between x1={x1} and x2={x2} with sigma={sigma} is {sim}')
print(f'Expected value ~ 0.324652')

#### 1.2.2 Example dataset 2

This dataset is not linearly separable: no straight line separates the positive examples from the negative ones.

In [None]:
dataset2 = loadmat('data/ex6data2.mat')
print(dataset2['__header__'])
X = dataset2['X']
y = dataset2['y']

In [None]:
plt.figure(figsize=(12, 8))
plotData(X, y)
plt.legend(loc='best')
plt.show()

In [None]:
# 'rbf' is sklearn's built-in Gaussian kernel, exp(-gamma * ||x - x'||^2);
# instead of sigma it takes gamma, so gamma = 1/(2*sigma^2) matches the formula above
C = 1
sigma = 0.1
model = SVC(C=C,
            kernel='rbf',
            gamma=1/(2*sigma**2),
           )
model.fit(X, y.ravel())

plt.figure(figsize=(12, 8))
plotData(X, y)
plt.axis([np.min(X[:, 0])-0.05, np.max(X[:, 0])+0.05,
         np.min(X[:, 1])-0.05, np.max(X[:, 1])+0.05])
visualizeBoundary(X, y, model)
plt.legend(loc='lower left')
plt.show()

#### 1.2.3 Example dataset 3

In the provided dataset, *ex6data3.mat*, the given variables are $X$, $y$, $Xval$, $yval$.
The task is to use the cross-validation set to determine the best C and $\sigma$ parameters.

For both C and $\sigma$, the suggested values are 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30. All possible pairs of values for C and $\sigma$ should be examined, which gives $8^2 = 64$ models for the suggested list.
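
A small sketch of how the search grid can be enumerated. The variable names are illustrative, and the conversion to `gamma` assumes a kernel parametrized as $\exp(-\gamma \|x - x'\|^2)$, so that $\gamma = \frac{1}{2\sigma^2}$ reproduces the Gaussian kernel defined earlier.

```python
from itertools import product

values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]

# every (C, sigma) pair from the suggested list
grid = list(product(values, values))
print(len(grid))   # 64

# gamma corresponding to each sigma for a kernel of the form exp(-gamma * d^2)
gammas = {sigma: 1 / (2 * sigma ** 2) for sigma in values}
print(gammas[1])   # 0.5
```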

In [None]:
dataset3 = loadmat('data/ex6data3.mat')
print(dataset3['__header__'])
X = dataset3['X']
y = dataset3['y']
Xval = dataset3['Xval']
yval = dataset3['yval']

In [None]:
plt.figure(figsize=(12, 8))
plotData(X, y)
plt.legend(loc='best')
plt.show()

In [None]:
def cvParams(X, y, Xval, yval):
    params = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
    
    opt_accuracy = 0
    opt_C, opt_sigma = 0, 0
    
    for i, C in enumerate(params):
        for j, sigma in enumerate(params):
            # sklearn's rbf kernel is exp(-gamma * ||x - x'||^2), so gamma = 1/(2*sigma^2)
            model = SVC(C=C, kernel='rbf', gamma=1/(2*sigma**2))
            model.fit(X, y.ravel())
            predictions = model.predict(Xval)
            accuracy = np.mean(predictions.ravel() == yval.ravel())
            if accuracy > opt_accuracy:
                opt_accuracy = accuracy
                opt_C = C
                opt_sigma = sigma
    return opt_C, opt_sigma, opt_accuracy

In [None]:
C, sigma, acc = cvParams(X, y, Xval, yval)
model = SVC(C=C, kernel='rbf', gamma=1/(2*sigma**2))
model.fit(X, y.ravel())

plt.figure(figsize=(12, 8))
plotData(X, y)
plt.axis([np.min(X[:, 0])-0.05, np.max(X[:, 0])+0.05,
         np.min(X[:, 1])-0.05, np.max(X[:, 1])+0.05])
visualizeBoundary(X, y, model)
plt.legend(loc='lower left')
plt.title(f'accuracy = {acc*100}%')
plt.show()

## 2 Spam classification

TBA