# Support Vector Machine

In this question, we consider an application of the SVM as a classifier. The main idea of the SVM classifier is to separate the two classes with a hyperplane. To do so, we will transform the labels from $y_i \in \{0, 1\}$ to $y_i \in \{-1, 1\}$.

## Let's start with $\ell_2$ SVM

The ridge-regularized ($\ell_2$) SVM classification problem can be formulated as the following optimization problem:

$$\underset{w, b}{\text{min }} \frac{\lambda}{2}\left\|w\right\|_2^2 + \frac{1}{n}\sum_{i=1}^{n}{\left(1 - y_i\left(w^\top x_i +b\right)\right)_+}$$

where $n$ is the number of observations, $y_i$ denotes the $i^{th}$ label, $x_i$ denotes the $i^{th}$ vector of features in the dataset, $w$ is the vector of weights (coefficients), $b$ is the bias term, $(\cdot)_+ = \max(0, \cdot)$ is the hinge, and $\lambda$ is a model parameter that sets the strength of the ridge regularization on the weight vector $w$ (a larger $\lambda$ shrinks $w$ more). This is a convex problem that can be rewritten as a quadratic optimization problem.
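To make the objective concrete, here is a minimal NumPy sketch that evaluates it on a tiny hand-made dataset. The function name `svm_objective` and the three data points are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """(lam / 2) * ||w||^2 + mean hinge loss, with labels y in {-1, 1}."""
    margins = y * (X @ w + b)               # y_i * (w^T x_i + b) for every row
    hinge = np.maximum(0.0, 1.0 - margins)  # (1 - margin)_+
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# Hypothetical toy data: three points in two dimensions
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

print(svm_objective(w, b, X, y, lam=0.1))
```

Only the second point has a margin below 1, so it is the only one contributing to the hinge term.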

For this SVM we use a linear kernel, which is why the loss function is written in terms of the hyperplane $w^\top x + b$. Later we will build on this idea and introduce more interesting kernels that add non-linearities.
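With a linear kernel, classifying a point amounts to checking on which side of the hyperplane it falls. The sketch below assumes a hypothetical helper `predict` and made-up values for `w` and `b`; ties at a score of exactly zero are assigned to the positive class by convention.

```python
import numpy as np

def predict(w, b, X):
    """Return +1 or -1 for each row of X, depending on the side of the hyperplane."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical weights and points, for illustration only
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0]])

print(predict(w, b, X))  # [ 1 -1]
```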

<!-- Using `cvxpy`, implement this SVM (estimate the $w$ and $b$ parameters) on the training set and tune the parameter $C$ from $0$ to $100$ by checking classification accuracy on the validation set. Plot the training accuracy versus $C$ curve and validation accuracy versus $C$ curve. Briefly comment on the results. -->

### Gradient Descent - Reprise

Let's compute the derivatives of the loss function so we can use gradient descent to solve for the weights of the SVM. The loss function has two terms: the regularization term and the sum of the hinge errors with respect to the hyperplane.

$$
\frac{\partial}{\partial w} \frac{\lambda}{2}\left\|w\right\|_2^2 = \lambda w
$$

$$
\frac{\partial}{\partial w} {\left(1 - y_i\left(w^\top x_i+b\right)\right)_+}  = \left\{
        \begin{array}{ll}
            0 & \quad \text{if} \quad y_i\left(w^\top x_i+b\right) \geq 1 \\
            -y_ix_i & \quad \text{otherwise}
        \end{array}
    \right.
$$

The gradient therefore has two parts: the regularizer and the hinge term. When a sample $x_i$ satisfies the margin, that is $y_i\left(w^\top x_i+b\right) \geq 1$, only the regularizer contributes to the update. When it violates the margin (it is misclassified or lies inside the margin), we update the weights with both the regularizer and the gradient of the hinge term.
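The two cases can be sketched in NumPy as a boolean mask over the samples. This is an illustration under assumptions: the name `hinge_subgradient` and the toy data are hypothetical, and the bias gradient ($-y_i$ for margin violators, $0$ otherwise) is added here for completeness even though only the derivative with respect to $w$ is written out above.

```python
import numpy as np

def hinge_subgradient(w, b, X, y, lam):
    """Subgradient of (lam / 2) * ||w||^2 + mean hinge loss with respect to w and b."""
    n = len(y)
    margins = y * (X @ w + b)
    active = margins < 1  # True where the sample violates the margin
    # Margin violators contribute -y_i * x_i; all other samples contribute 0
    grad_w = lam * w - (active * y) @ X / n
    grad_b = -(active * y).sum() / n
    return grad_w, grad_b

# Hypothetical toy data: only the second point violates the margin
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

grad_w, grad_b = hinge_subgradient(w, b, X, y, lam=0.1)
print(grad_w, grad_b)
```

A gradient descent step would then be `w - lr * grad_w` and `b - lr * grad_b` for some learning rate `lr`.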

## Imports and Spark Session

In [1]:
# imports
import re
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [6]:
# start Spark Session
from pyspark.sql import SparkSession, SQLContext
app_name = "svm_toy"
master = "local[*]"
spark = SparkSession.builder\
     .appName(app_name)\
     .master(master)\
     .config('spark.executor.memory',       '4G')\
     .config('spark.driver.memory',        '40G')\
     .config('spark.driver.maxResultSize', '10G')\
     .getOrCreate()
# sc is the SparkContext (used for broadcasts), sq the SQLContext (used to build DataFrames)
sc = spark.sparkContext
sq = SQLContext(sc)

## Data preparation

In [7]:
# Let's read the toy dataset. Also, let's replace 0 with -1 in the label column, to match the {-1, 1} convention of the SVM
toy_data = pd.read_pickle('../data/ToyData.pkl')
toy_data['ctr'] = toy_data['ctr'].replace(0,-1)
toy_data.head()

Unnamed: 0,ctr,i01,i02,i03,i04,i05,i06,i07,i08,i09,...,s17,s18,s19,s20,s21,s22,s23,s24,s25,s26
78,-1,0.0,1,15.0,16.0,1624.0,24.0,5.0,20.0,391.0,...,27c07bd6,c61e82d7,21ddcdc9,5840adea,ff3ce4c0,c9d4222a,423fab69,d691765a,445bbe3b,d1d45fc5
79,-1,0.0,1,675.0,20.0,60.0,659.0,7.0,33.0,2256.0,...,e5ba7672,c21c3e4c,21ddcdc9,b1252a9d,cad88c3b,ad3062eb,bcdee96c,60a57787,9b3e8820,17723a96
97,-1,0.0,1,8.0,4.0,1213.0,11.0,596.0,16.0,16.0,...,27c07bd6,e88ffc9d,2efde463,a458ea53,a23da47b,ad3062eb,c7dc6720,65c55747,cb079c2d,f868e7eb
133,-1,0.0,-1,58.0,20.0,2728.0,152.0,7.0,15.0,298.0,...,27c07bd6,c21c3e4c,21ddcdc9,b1252a9d,29d21ab1,ad3062eb,bcdee96c,69e4f188,e8b83407,bb574173
217,1,6.0,15,1.0,4.0,89.0,4.0,72.0,24.0,125.0,...,3486227d,07070d63,21ddcdc9,5840adea,e5195a68,c9d4222a,32c7478e,9be5c7a4,2bf691b1,2fede552


In [18]:
# Transform our toy data set into an RDD of records of the form (features_array, y)
# We map each Row to an array so we can use regular RDD commands. A helper function parses the DataFrame rows
def parse(line):
    """
    Map records from Row --> (features_array, y)
    """
    # The float conversion (dtype = 'float') will be added later, so every field is still a string here
    fields = np.array(line)
    # The label is the first field, the features are all the remaining ones
    features,y = fields[1:], fields[0]
    return(features, y)

toy_dataRDD = sq.createDataFrame(toy_data).rdd.map(parse).cache()

# Take the first record to check it's working
toy_dataRDD.take(1)

[(array(['0.0', '1', '15.0', '16.0', '1624.0', '24.0', '5.0', '20.0',
         '391.0', '0.0', '3.0', '0.0', '16.0', '05db9164', '333137d9',
         '0f8b497f', '8d0c7214', '25c83c98', '7e0ccccf', '7c59aadb',
         '0b153874', 'a73ee510', '41c624fe', 'ff78732c', 'a0c32c81',
         '9b656adc', '07d13a8f', '6cfa4ac6', '6b98792b', '27c07bd6',
         'c61e82d7', '21ddcdc9', '5840adea', 'ff3ce4c0', 'c9d4222a',
         '423fab69', 'd691765a', '445bbe3b', 'd1d45fc5'], dtype='<U32'), '-1')]
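The record above still holds every field as a string, and the hashed categorical fields cannot be cast to floats. One possible interim step, sketched below, is to keep only the leading numeric fields. The helper `parse_numeric` is hypothetical, and the default of 13 numeric fields is an assumption read off the printed record, where 13 numeric values precede the hashed strings. Dropping the categorical fields is a simplification, not the notebook's stated plan.

```python
import numpy as np

N_NUMERIC = 13  # assumption: number of numeric fields that follow the label

def parse_numeric(line, n_numeric=N_NUMERIC):
    """Map a record (label first) to (numeric_features, y), both as floats."""
    fields = list(line)
    y = float(fields[0])
    # Keep only the numeric fields; the hashed categorical fields are dropped
    features = np.array(fields[1:1 + n_numeric], dtype=float)
    return (features, y)

# Hypothetical short record: label, three numeric fields, two hashed fields
record = ('-1', '0.0', '1', '15.0', '05db9164', '333137d9')
features, y = parse_numeric(record, n_numeric=3)
print(features, y)
```

Because the function only indexes into the record, it could be passed to `rdd.map` in the same way as `parse`.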

In [None]:
# part c - gradient descent with regularization
def SVM_GDupdate(dataRDD, W, lr = 0.1, regPar = 0.1, reg = 'l2', kernel = 'linear'):
    """
    Perform one gradient descent update. Work in progress: only the linear
    kernel with L2 regularization is implemented so far.
    Args:
        dataRDD  - RDD of tuples (features_array, y), with numeric features
        W        - (array) model coefficients with bias at index 0
        lr       - (float) learning rate, defaults to 0.1
        regPar   - (float) regularization parameter, defaults to 0.1
        reg      - (str) type of regularization, defaults to 'l2' ('l1' is not implemented yet)
        kernel   - (str) type of kernel, defaults to 'linear' (currently unused)
    Returns:
        model   - (array) updated coefficients, bias still at index 0
    """
    # First step, we broadcast the current weights through the SparkContext
    w = sc.broadcast(W)
    
    # Second, let's augment our data with a leading 1.0 for the bias term
    # (features and label are cast to float, so they must already be numeric)
    augmentedData = dataRDD.map(lambda x: (np.append([1.0], np.asarray(x[0], dtype=float)), float(x[1])))
    
    # Helper functions
    def l2_grad(line):
        """
        Helper function with the hinge-loss (sub)gradient of one observation
        Args:
            line  - Observation point tuple (feature_array, y)
        Output:
            grad  - Gradient contribution of this observation
        """
        # From the tuple of observations, get the X and y
        X, y = line
        
        # Only points that violate the margin contribute to the gradient
        if (y*np.dot(X,w.value)) < 1:
            grad = -1*y*X
        else:
            grad = np.zeros_like(X)
            
        # Return (not yield) so that map produces one array per observation
        return grad
                
    # Let's add the penalization to the gradient, depending on the type of regularization
    if reg == 'l2':
        # Calculate the batch gradient on the augmented data
        grad = augmentedData.map(l2_grad).mean()
        # We only regularize the feature weights, not the bias
        w_nobias = np.append([0.0], W[1:])
        # We update the gradient including regularization
        grad += regPar*w_nobias
    else:
        raise NotImplementedError("Only reg = 'l2' is implemented so far")
       
    # Update the weights
    new_model = W-lr*grad
    
    return new_model
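One way to sanity-check `SVM_GDupdate` without Spark is to compare it against a local NumPy version of the same step. The sketch below follows the same conventions (bias at index 0, bias excluded from the regularization). The function `svm_gd_update_local` and the four-point dataset are hypothetical and only meant for illustration.

```python
import numpy as np

def svm_gd_update_local(X_aug, y, W, lr=0.1, regPar=0.1):
    """One gradient descent step on the L2-regularized hinge loss.

    X_aug has a leading column of ones, so W[0] is the bias.
    """
    margins = y * (X_aug @ W)
    active = margins < 1                     # samples violating the margin
    grad = -(active * y) @ X_aug / len(y)    # mean of -y_i * x_i over all samples
    grad += regPar * np.append([0.0], W[1:]) # regularize everything but the bias
    return W - lr * grad

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
X_aug = np.hstack([np.ones((len(y), 1)), X])

W = np.zeros(X_aug.shape[1])
for _ in range(20):
    W = svm_gd_update_local(X_aug, y, W)

predictions = np.where(X_aug @ W >= 0, 1.0, -1.0)
print(W, (predictions == y).mean())
```

Running the Spark version on the same points, wrapped in an RDD of `(features, y)` tuples, should give matching weights after each step if both implementations agree.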