# Exercise 6 - SVM and Kernel Trick (30 Points)

In this exercise you will implement a Support Vector Machine and improve its performance by using the famous kernel trick.

If you run into a persistent problem, do not hesitate to contact the course instructors at
- paul.kahlmeyer@uni-jena.de

### Submission

- Deadline of submission:
        x.y.z
- Submission via the [Moodle page](https://moodle.uni-jena.de/course/view.php?id=18310)

# The Dataset

In this exercise we use a toy [circle dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html#sklearn.datasets.make_circles) from scikit-learn.

### Task 1 (1 Point)
The dataset is stored as NumPy arrays in `X.npy` (features) and `Y.npy` (labels).

Load the dataset and display it using matplotlib.

In [None]:
# TODO: load and visualize data

Each feature vector in this dataset has one of two possible labels.
The standard approach for binary classification tasks is logistic regression.

### Task 2 (3 Points)
Use scikit-learn to fit logistic regression on the dataset. What are the training and test accuracies?

In [None]:
# TODO: apply logistic regression on data

### Task 3 (1 Point)

Visualize the predictions made with logistic regression to get an understanding of the results.

In [None]:
# TODO: visualize predictions

# SVM

Support Vector Machines (SVMs) can use the kernel trick to find a linear classifier in a higher-dimensional space (see lecture notes).

The standard soft margin support vector machine is defined by the loss function
\begin{align}
L = \cfrac{1}{2}||\theta||^2+C\sum_{i=1}^m\max\{1-y^{(i)}(\theta^Tx^{(i)}+b), 0\}
\end{align}

We can use the fact that $\theta$ is a linear combination of our data points
\begin{align}
\theta = \sum_{i=1}^m w_i x^{(i)}
\end{align}

to rewrite the objective function as
\begin{align}
L = \cfrac{1}{2}w^TXX^Tw+C\sum_{i=1}^m\max\{1-y^{(i)}(w^TXx^{(i)}+b), 0\}\,.
\end{align}

Note that the labels $y^{(i)}$ have to be binary in $\{-1,1\}$.
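The rewriting step can be checked numerically. The following is a minimal sketch, assuming NumPy and small random data; the sizes and the names `theta`, `x_new`, `reg_primal` and `reg_dual` are chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3                      # hypothetical sizes: 5 data points, 3 dimensions
X = rng.normal(size=(m, d))      # rows are the data points x^(i)
w = rng.normal(size=m)           # one coefficient per data point
x_new = rng.normal(size=d)       # an arbitrary point to evaluate

theta = X.T @ w                  # theta = sum_i w_i x^(i)

# Regularizer: ||theta||^2 should equal w^T X X^T w
reg_primal = theta @ theta
reg_dual = w @ (X @ X.T) @ w

# Decision term: theta^T x should equal w^T X x
dec_primal = theta @ x_new
dec_dual = w @ (X @ x_new)

print(np.isclose(reg_primal, reg_dual), np.isclose(dec_primal, dec_dual))
```

Both comparisons should agree up to floating point error, since the two forms are algebraically identical.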

This loss function relies on the dot product as a measure of similarity between two vectors.
For two matrices $X_1, X_2\in\mathbb{R}^{m\times d}$, we can get the matrix of pairwise similarity with
\begin{align}
Sim(X_1, X_2) = X_1X_2^T
\end{align}
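As a small worked example of this formula, the sketch below computes the similarity matrix for two tiny sets of vectors. The helper name `pairwise_dot` is hypothetical, and the two matrices deliberately have different numbers of rows to show that only the dimension $d$ has to match.

```python
import numpy as np

def pairwise_dot(X1, X2):
    """Hypothetical helper: dot products between all rows of X1 and all rows of X2."""
    return X1 @ X2.T

X1 = np.array([[1.0, 0.0],
               [0.0, 2.0],
               [1.0, 1.0]])      # 3 vectors in R^2
X2 = np.array([[1.0, 1.0],
               [2.0, 0.0]])      # 2 vectors in R^2

S = pairwise_dot(X1, X2)         # shape (3, 2); S[i, j] = X1[i] . X2[j]
print(S)
```

Entry `S[i, j]` holds the dot product of row `i` of `X1` with row `j` of `X2`.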

### Task 4 (2 Points)

Create a training and a test set for the SVM and implement the pairwise similarity function.

In [None]:
# TODO: Create Train- and Testset

def sim(X1, X2):
    # TODO: Implement similarity function between two sets of vectors
    pass

Let's have a closer look at the second part of our cost function $L$:

The innermost component 

\begin{align}
Dec(x^{(i)}):= w^TXx^{(i)}+b
\end{align}

produces a *decision* of the SVM that can be used for a *prediction*

\begin{align}
\hat{y}^{(i)}=Pred(x^{(i)}) := \begin{cases}
1\text{, if }Dec(x^{(i)})\geq 0\\
-1\text{, else}
\end{cases}
\end{align}

for the label $y^{(i)}$ of the feature $x^{(i)}$.

Next, we multiply this decision with the true label. 
\begin{align}
Marg(x^{(i)}):=y^{(i)}(w^TXx^{(i)}+b)
\end{align}

This term is called *margin*.

Recall that our true labels are in $\{-1,1\}$. Therefore
\begin{align}
\max\{1-Marg(x^{(i)}), 0\} = \begin{cases}
0\text{, if }sign(Dec(x^{(i)}))=sign(y^{(i)})\text{ and }|Dec(x^{(i)})|\geq 1\\
\in(0,1)\text{, if }sign(Dec(x^{(i)}))=sign(y^{(i)})\text{ and }|Dec(x^{(i)})|< 1\\
\in[1,\infty)\text{, if }sign(Dec(x^{(i)}))\neq sign(y^{(i)})
\end{cases}
\end{align}

gives us a measure of the error we made with a decision. The term

\begin{align}
\xi_i := 1-Marg(x^{(i)})
\end{align}
is called *slack*.

Note that we use a corpus of features $X$ (our training data) to calculate the decision (and prediction) for other features.
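To make these quantities concrete, here is a minimal numerical sketch. All values (`X_train`, `w`, `b`, `X_new`, `y_new`) are hypothetical toy numbers, picked only so that each of the three cases of the hinge term occurs once.

```python
import numpy as np

# Hypothetical toy values, not taken from the exercise dataset
X_train = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
w = np.array([2.0, -1.0])
b = 0.0

X_new = np.array([[1.0, 0.0],     # decision  2.0: correct side, outside the margin
                  [0.25, 0.0],    # decision  0.5: correct side, inside the margin
                  [0.0, 1.0]])    # decision -1.0: wrong side
y_new = np.array([1, 1, 1])

dec = (X_new @ X_train.T) @ w + b     # Dec for all rows of X_new at once
pred = np.where(dec >= 0, 1, -1)      # Pred
marg = y_new * dec                    # Marg
slack = 1 - marg                      # xi
hinge = np.maximum(slack, 0)          # the max{1 - Marg, 0} term of the loss

print(dec, pred, marg, slack, hinge)
```

The three rows yield hinge values of 0, 0.5 and 2, matching the three cases listed above.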

### Task 5 (6 Points)

Implement the cost function and the functions for decision, prediction, accuracy, margin and slack. 

Define these functions so that everything is calculated for *multiple* new features $x^{(i)}$ at once. To do so, use the similarity function `sim` from above.

In [None]:
def decision(X, X_train, w, b):
    # TODO: Implement decision function
    pass

def predict(X, X_train, w, b):
    # TODO: Implement prediction function
    pass

def accuracy(X, Y, X_train, w, b):
    # TODO: Implement accuracy function
    pass

def margin(X, Y, X_train, w, b):
    # TODO: Implement margin function
    pass

def slack(X, Y, X_train, w, b):
    # TODO: Implement slack
    pass

def cost(X, Y, X_train, w, b, C):
    # TODO: Implement cost function
    pass

Our goal is to minimize the cost function. We will do so by using gradient descent on the two optimizable parameters $w$ and $b$.

\begin{align}
\frac{\partial L}{\partial w}&=XX^Tw-C\sum_{i=1, \xi_i\geq 0}^m y^{(i)}Xx^{(i)}\\
\frac{\partial L}{\partial b}&=-C\sum_{i=1, \xi_i\geq 0}^m y^{(i)}
\end{align}

To recall gradient descent, have a look [here](https://en.wikipedia.org/wiki/Gradient_descent) or in exercise 2.
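One way to gain confidence in the gradient formulas above is to compare them with central finite differences. The following is a sketch under stated assumptions: the function names `loss` and `gradients`, the toy data and the step size `h` are all hypothetical, and the values are chosen so that no slack lies near zero, where the hinge term is not differentiable.

```python
import numpy as np

def loss(w, b, X, y, C):
    """Soft-margin loss in terms of w and b, evaluated on the training data X."""
    K = X @ X.T                               # pairwise dot products
    slack = 1.0 - y * (K @ w + b)
    return 0.5 * w @ K @ w + C * np.maximum(slack, 0.0).sum()

def gradients(w, b, X, y, C):
    """Analytic gradients; only points with non-negative slack contribute."""
    K = X @ X.T
    active = (1.0 - y * (K @ w + b)) >= 0
    grad_w = K @ w - C * K[:, active] @ y[active]
    grad_b = -C * y[active].sum()
    return grad_w, grad_b

# Hypothetical toy problem
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.5, -0.5, 0.5])
b, C, h = 0.05, 1.0, 1e-6

grad_w, grad_b = gradients(w, b, X, y, C)

# Central differences along each coordinate of w, and along b
num_grad_w = np.array([
    (loss(w + h * e, b, X, y, C) - loss(w - h * e, b, X, y, C)) / (2 * h)
    for e in np.eye(len(w))
])
num_grad_b = (loss(w, b + h, X, y, C) - loss(w, b - h, X, y, C)) / (2 * h)

print(np.allclose(grad_w, num_grad_w, atol=1e-4),
      np.isclose(grad_b, num_grad_b, atol=1e-4))
```

If the analytic and numeric gradients disagree for a point whose slack is far from zero, the analytic formula or its implementation likely contains an error.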

### Task 6 (4 Points)
Implement functions that calculate the gradients with respect to $w$ and $b$. Again, use `sim` for computing the dot products.

In [None]:
def grad_b(X, Y, w, b, C):
    # TODO: Implement gradient wrt. b
    pass

def grad_w(X, Y, w, b, C):
    # TODO: Implement gradient wrt. w
    pass

### Task 7 (4 Points)

Implement a function `fit` that uses gradient descent with a specified learning rate `lr` to get values for $w$ and $b$.
In each iteration (epoch), `fit` should
- update $w$ and $b$ with gradient descent
- calculate the current loss
- calculate the current accuracy on training and test data
- check if the current loss changed by more than a threshold `eps`; if not, stop the process

The output of `fit` should be the final $w$, $b$ and the loss and accuracy statistics.
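The loop structure described above, including the stopping criterion, can be sketched on a one-dimensional toy objective. This is a generic illustration, not the SVM `fit` itself; the function name `gradient_descent` and the objective $f(t)=(t-3)^2$ are assumptions made for the example.

```python
def gradient_descent(loss_fn, grad_fn, theta0, lr=0.1, epochs=500, eps=1e-6):
    """Generic gradient descent; stops once the loss changes by at most eps."""
    theta = theta0
    losses = [loss_fn(theta)]
    for _ in range(epochs):
        theta = theta - lr * grad_fn(theta)       # descent step
        losses.append(loss_fn(theta))             # record the current loss
        if abs(losses[-2] - losses[-1]) <= eps:   # change too small: stop early
            break
    return theta, losses

# Toy objective f(t) = (t - 3)^2 with gradient 2 * (t - 3)
theta, losses = gradient_descent(lambda t: (t - 3.0) ** 2,
                                 lambda t: 2.0 * (t - 3.0),
                                 theta0=0.0)
print(theta, len(losses))
```

The same pattern carries over to the SVM, where the single parameter is replaced by the pair $w$ and $b$ and the accuracies are recorded alongside the loss.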

In [None]:
def fit(X_train, X_test, Y_train, Y_test, C=1.0, lr=1e-5, epochs=500, eps=1e-3):
    # TODO: Implement gradient descent
    pass

### Task 8 (1 Point)
Now use your `fit` function to get `w` and `b`. 

In [None]:
# TODO: Use fit function

### Task 9 (1 Point)
Plot the accuracies and losses as well as the predictions on the dataset.

In [None]:
# TODO: plot accuracies, losses and predictions

# The Polynomial kernel

You can probably see why the SVM's performance is limited: it is a linear classifier, and the data cannot be separated by a linear decision boundary.

This is where kernels come in:

Kernels are functions that map our features into a higher-dimensional space, where they can be separated by a linear decision boundary.
See [here](https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d) or [here](https://www.youtube.com/watch?v=efR1C6CvhmE) for an explanation with images.

An example would be the *polynomial kernel* of degree 2 that maps a vector onto the vector of its pairwise products:

\begin{align}
\begin{bmatrix}
x_1&x_2
\end{bmatrix}
&\rightarrow
\begin{bmatrix}
x_1^2&x_1x_2&x_2x_1&x_2^2
\end{bmatrix}\\
\begin{bmatrix}
x_1&x_2&x_3
\end{bmatrix}
&\rightarrow
\begin{bmatrix}
x_1^2&x_1x_2&x_1x_3&x_2x_1&x_2^2&x_2x_3&x_3x_1&x_3x_2&x_3^2
\end{bmatrix}
\end{align}
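For degree 2 specifically, this mapping coincides with the flattened outer product of the vector with itself. A minimal sketch, assuming NumPy; the name `phi2` is hypothetical and the approach does not generalize directly to higher degrees.

```python
import numpy as np

def phi2(x):
    """Hypothetical degree-2 map: all ordered pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

v2 = phi2(np.array([2.0, 3.0]))          # [x1^2, x1*x2, x2*x1, x2^2]
v3 = phi2(np.array([1.0, 2.0, 3.0]))     # 9 entries for a 3-dimensional input
print(v2, v3)
```

The output dimension grows as $d^2$, which already hints at why computing the mapped vectors explicitly becomes expensive.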

The special property about kernels is the fact that the dot product between two vectors that were produced by the kernel can be calculated directly from the dot products of the original vectors. In other words, if $x_1, x_2$ are features and $\varphi$ is a kernel, then there exists some (simple) function $f$ with

\begin{align}
\varphi(x_1)^T\varphi(x_2) = f(x_1^Tx_2)
\end{align}

This means that we can calculate the similarity in a higher-dimensional space without actually having to map into that space. This trick is the famous **kernel trick**. In our example of the polynomial kernel of degree 2 we have that

\begin{align}
\varphi_2(x_1)^T\varphi_2(x_2) = (x_1^Tx_2)^2
\end{align}

More generally, for any polynomial degree $p$,

\begin{align}
\varphi_p(x_1)^T\varphi_p(x_2) = (x_1^Tx_2)^p
\end{align}
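The identity can be illustrated numerically. The sketch below builds the explicit map with a repeated Kronecker product, which is one possible construction and an assumption of this example; the name `phi_kron` and the random test vectors are hypothetical.

```python
from functools import reduce
import numpy as np

def phi_kron(x, p):
    """Hypothetical explicit map: repeated Kronecker product yields all p-wise products."""
    return reduce(np.kron, [x] * p)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=4)

for p in (1, 2, 3):
    explicit = phi_kron(x1, p) @ phi_kron(x2, p)   # dot product in dimension 4^p
    trick = (x1 @ x2) ** p                         # one dot product in dimension 4
    print(p, phi_kron(x1, p).size, np.isclose(explicit, trick))
```

The explicit vectors have $4^p$ entries, while the right-hand side needs only a single dot product of the original vectors followed by a power.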

### Task 10 (1 Point)

Implement $\varphi_p$ for the polynomial kernel of an arbitrary degree $p$. The function should map an input vector onto the vector of all $p$-wise products.

Hints:
- Use the package [`itertools`](https://docs.python.org/3/library/itertools.html) for index shuffling (see [here](https://stackoverflow.com/questions/104420/how-to-generate-all-permutations-of-a-list))

In [None]:
def phi(x, poly):
    # TODO: map x onto vector of poly
    pass

### Task 11 (1 Point)
Now perform a small experiment:

1. Pick a polynomial degree $p$ and a feature dimension $d$
2. Draw two random vectors $x_1, x_2\in\mathbb{R}^d$
3. Calculate the dot product $\varphi_p(x_1)^T\varphi_p(x_2)$
4. Calculate $(x_1^Tx_2)^p$
5. Compare the results

Hints:
- Compare with [`np.isclose`](https://numpy.org/doc/stable/reference/generated/numpy.isclose.html)

In [None]:
# TODO: Perform experiment

### Task 12 (3 Points)

Now we want to incorporate this polynomial kernel into our SVM. 
Use your functions from above to fill out the following class. Replace the `sim` function from before with the `kernel` function. This function should calculate the similarity matrix of $\varphi_p(x^{(i)}), \varphi_p(x^{(j)})$ via the kernel trick.

Note: We can immediately see the advantage of defining an algorithm in a class: we do not need to pass all parameters through the functions, because the methods have access to the parameters via `self`.
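The point about storing parameters on the instance can be illustrated with a small standalone class. This is only a sketch of the pattern, not the `SVMPoly` class of the task; the class name `PolyKernel` and its attribute `degree` are hypothetical.

```python
import numpy as np

class PolyKernel:
    """Hypothetical helper class illustrating parameters stored on the instance."""

    def __init__(self, degree):
        self.degree = degree                 # stored once, available to every method

    def __call__(self, X1, X2):
        # Kernel trick: elementwise power of the plain dot-product matrix
        return (X1 @ X2.T) ** self.degree

kernel = PolyKernel(degree=2)
X = np.array([[1.0, 2.0],
              [0.0, 1.0]])
K = kernel(X, X)                             # similarity matrix in the mapped space
print(K)
```

Because `degree` lives on the instance, callers pass only the data, and every method sees the same hyperparameters.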

In [None]:
class SVMPoly():
    # TODO: Use functions from before to implement 
    def __init__(self, poly, C):
        # TODO: Store algorithm arguments 
        pass
        
    
    def kernel(self, X1, X2):
        # TODO: Implement polynomial kernel
        pass
    
    def decision(self, X):
        # TODO: Implement decision function
        pass

    def predict(self, X):
        # TODO: Implement prediction function
        pass

    def accuracy(self, X, Y):
        # TODO: Implement accuracy function
        pass

    def margin(self):
        # TODO: Implement margin function
        pass

    def slack(self):
        # TODO: Implement slack
        pass

    def cost(self):
        # TODO: Implement cost function
        pass
    
    def grad_b(self):
        # TODO: Implement gradient wrt. b
        pass

    def grad_w(self):
        # TODO: Implement gradient wrt. w
        pass
    
    
    def fit(self, X_train, X_test, Y_train, Y_test, lr=1e-5, epochs=500, eps=1e-3, verbose=False):
        # TODO: Implement fit function
        pass
    

### Task 13 (2 Points)
Now use the `SVMPoly` class to fit an SVM on our train data.
Again plot the accuracies and losses as well as the predictions on the dataset.

In [None]:
# TODO: fit on train data

# TODO: plot accuracies, losses and predictions

### Task 14 (1 Point)

Compare your results with the [scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [None]:
# TODO: Use sklearn to fit svm on traindata

# TODO: Plot predictions, calculate test accuracy