# Week 7 - Support Vector Machines

### Aims

By the end of this notebook you will understand:

>* separable vs non-separable data
>* the use of different kernels and parameter tuning for SVMs
>* model assessment for SVMs
>* the binary classification case for the Default data


1. [Setup](#setup)

2. [Separable and Non-separable Data Cases](#RBH)

3. [Model assessment](#assess)

4. [Default Data for Binary Example](#default)


- In this workshop we will explore the basics of support vector machine (SVM) models. 

- We will focus on the most straightforward case, a support vector machine classifier, which sklearn provides as the `SVC` model. For details, see https://scikit-learn.org/stable/modules/svm.html

The main function we will use is [sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

**NOTE:** for simplicity, we do not partition the data for the toy examples below. 
For the real data set (Default), we will split the data into training and test sets as usual. 

As usual, during workshops, you will complete the worksheets together in teams of 2-3, using **pair programming**. When completing worksheets:

>- You will have tasks tagged by (CORE) and (EXTRA). 
>- Your primary aim is to complete the (CORE) components during the workshop session; afterwards you can try to complete the (EXTRA) tasks as self-study. 
>- Look for the 🏁 as a cue to switch roles between driver and navigator.
>- In some exercises, you will see hints at the bottom of the question. Some of them include fill-in-the-blanks or commented lines of code that you need to change before running the code!

Instructions for submitting your workshops can be found at the end of the worksheet. As a reminder, you must submit a PDF of your notebook on Learn by 16:00 on the Friday of the week the workshop was given.

# 1. General Setup <a id='setup'></a>

## 1.1 Packages

Now let's load the packages you will need for this workshop.


In [None]:
# Display plots inline
%matplotlib inline  

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn modules list that might be useful, maybe you do not need to use all of them
import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC           # SVM
from sklearn.preprocessing import StandardScaler # scaling features
from sklearn.preprocessing import LabelEncoder   # binary encoding
from sklearn.pipeline import Pipeline            # combining classifier steps
from sklearn.preprocessing import PolynomialFeatures # make PolynomialFeatures
from sklearn.datasets import make_classification, make_moons  # make example data
import warnings # prevent warnings
import joblib # saving models

from time import time
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold, StratifiedKFold
from scipy.stats.distributions import uniform, loguniform
import itertools
#  from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
# from imblearn.metrics import classification_report_imbalanced

import re
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
# Plotting defaults
plt.rcParams['figure.figsize'] = (8,8)
plt.rcParams['figure.dpi'] = 80
plt.rcParams['lines.markersize'] = 7.5

##  1.2 Helper Functions

Below are the helper functions we will be using in this workshop. You can create your own if you think it is necessary, or directly use helper functions already available in the `sklearn` library.  

- `plot_margin()`: visualization of margins in figures.

You can also modify the following function to suit your needs. These practices will be important when you work on your project as well. 

In [None]:
# About visualization of margins in figures
def plot_margin(model, data, x='x', y='y', cat='z', show_support_vectors = True, nx=50, ny=50):
    # Plot the data
    p = sns.scatterplot(x=x, y=y, hue=cat, data=data, legend=False)
    
    # Find the extent of x and y
    xlim = p.get_xlim()
    ylim = p.get_ylim()
    
    # Create a grid of points
    xx = np.linspace(xlim[0]-1, xlim[1]+1, nx)
    yy = np.linspace(ylim[0]-1, ylim[1]+1, ny)
    YY, XX = np.meshgrid(yy, xx)
    
    # Calculate the label for each point in the grid
    xy = np.c_[XX.ravel(), YY.ravel()]
    Z = model.decision_function(xy).reshape(XX.shape)
    
    # plot contours of decision boundary and margins
    p.contour(XX, YY, Z, colors='k', 
              levels=[-1, 0, 1], alpha=0.5,
              linestyles=['--', '-', '--'])

    # highlight support vectors
    if (show_support_vectors):
        p.scatter(model.support_vectors_[:, 0], 
                  model.support_vectors_[:, 1], s=100,
                  linewidth=1, facecolors='none', edgecolors='k')

    # Show confusion table in the title
    p.set_title(
        "TN: {0}, FP: {1}, FN: {2}, TP: {3}".format(
            *confusion_matrix(
                data[cat],
                model.predict(data.drop(cat, axis=1))
            ).flatten()
        )
    )
    plt.legend(loc='lower left')
    plt.show()

 **__REMARK__**

- The sklearn implementation of `SVC` gives the parameter `C` a slightly different meaning: it acts as an inverse regularization strength. The documentation describes it as "Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive."
- The documentation also explains: "The C parameter trades off correct classification of training examples against maximization of the decision function's margin. For larger values of C, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower C will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy. In other words C behaves as a regularization parameter in the SVM."
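
To make the remark about `C` concrete, here is a minimal sketch on synthetic data (the blob centres, `X_demo` and `y_demo` are assumptions for illustration, not the workshop data). For a linear kernel the distance between the two margin lines is $2/\|w\|$, so we can watch it shrink as `C` grows.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (illustrative assumption, not ex1.csv)
X_demo, y_demo = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=0
)

margin_width = {}
n_support = {}
for C in [0.01, 1, 100]:
    model = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    w = model.coef_[0]                           # normal vector of the separating line
    margin_width[C] = 2 / np.linalg.norm(w)      # distance between the two margin lines
    n_support[C] = int(model.n_support_.sum())   # total number of support vectors
    print(f"C={C:>6}: margin width = {margin_width[C]:.3f}, "
          f"support vectors = {n_support[C]}")
```

A small `C` tolerates more points inside the margin, so the margin is wider and typically more points become support vectors.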

### **Difference between SVC and LinearSVC**

The linear models `LinearSVC()` and `SVC(kernel='linear')` yield slightly different decision boundaries. This can be a consequence of the following differences:

>- `LinearSVC` minimizes the squared hinge loss while SVC minimizes the regular hinge loss.
>- `LinearSVC` uses the One-vs-All (also known as One-vs-Rest) multiclass reduction while `SVC` uses the One-vs-One multiclass reduction.
>- In terms of graphical display, note that unlike `SVC` (based on LIBSVM), `LinearSVC` (based on LIBLINEAR) does not provide the support vectors.
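
The differences listed above can be checked directly. The following is a small sketch on synthetic data (the blob centres and variable names are assumptions for illustration); it fits both models and inspects the attributes mentioned in the list.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Synthetic two-class data (illustrative assumption)
X_demo, y_demo = make_blobs(
    n_samples=100, centers=[[0, 0], [3, 3]], cluster_std=1.0, random_state=1
)

svc_lin = SVC(kernel="linear", C=1).fit(X_demo, y_demo)
lin_svc = LinearSVC(C=1, max_iter=10000).fit(X_demo, y_demo)

# The fitted lines are similar but usually not identical
print("SVC coef:      ", svc_lin.coef_[0], "intercept:", svc_lin.intercept_[0])
print("LinearSVC coef:", lin_svc.coef_[0], "intercept:", lin_svc.intercept_[0])

# LinearSVC uses the squared hinge loss by default
print("LinearSVC loss:", lin_svc.loss)

# Only SVC exposes the support vectors
print("SVC has support_vectors_:      ", hasattr(svc_lin, "support_vectors_"))
print("LinearSVC has support_vectors_:", hasattr(lin_svc, "support_vectors_"))
```

Because `LinearSVC` has no `support_vectors_` attribute, a helper such as `plot_margin()` would need `show_support_vectors=False` to work with it.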

For further details, compare their documentation:

- `SVC` with `linear` kernel selection: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- `LinearSVC` function: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

**For simplicity, focus on `SVC` in general.**

# 2 Separable Data <a id='RBH'></a>

We will begin by examining several toy data problems to explore the basics of these models. To begin we will read in data for the first example from ex1.csv


In [None]:
ex1 = pd.read_csv("ex1.csv")
ex1.head()

We can see that the data is composed of two classes in two dimensions, and it is clear that these two classes are perfectly linearly separable.

In [None]:
sns.scatterplot(x='x', y='y', hue='z', data=ex1, legend=False)
plt.show()

---

### 🚩 Exercise 1 (CORE)

Like the other models we've already seen, we fit the SVM by constructing our feature matrix and outcome vector and then calling the `fit` method of our model object:

1. Separate the features and outcome in the toy dataset ex1.csv
2. Fit an SVC model to this data set using the `SVC()` function (note that you need to change the default values of `kernel` and `C`)
3. Visualize the decision boundary and the margins using the `plot_margin` function we defined above.


---

**!!! Add your text solution here !!!**


### 🚩 Exercise 2  (CORE)

Based on the results of the previous exercise:

- How many support vectors are there for this model?
- How does the boundary line and the margins change as you change the value of C?
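
As a hint for inspecting support vectors, a fitted `SVC` stores them as attributes. The sketch below uses synthetic data (an assumption for illustration), so the numbers will differ from those for ex1.csv.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (illustrative assumption, not ex1.csv)
X_demo, y_demo = make_blobs(
    n_samples=60, centers=[[0, 0], [4, 4]], cluster_std=1.0, random_state=2
)
model = SVC(kernel="linear", C=1).fit(X_demo, y_demo)

print(model.n_support_)        # number of support vectors in each class
print(model.support_)          # row indices of the support vectors in X_demo
print(model.support_vectors_)  # coordinates of the support vectors
```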

---

**!!! Add your text solution here !!!**

# 3 Non-Separable Data

We will now complicate our previous example somewhat by adding two additional points from the blue A class to our data. This is available in the ex2.csv file.

In [None]:
ex2 = pd.read_csv("ex2.csv")
print(ex2.head())

# To visualize
sns.scatterplot(x='x', y='y', hue='z', data=ex2, legend=False)
plt.show()

---

### 🚩 Exercise 3  (CORE)

- Fit an SVC model to these data using the same code we used with example 1.
- How does the "fit" of this model differ from the "fit" for example 1? Hint: make your comparison for equivalent values of C.
- How do the boundary line and margins change as you change the value of C?




---

**!!! Add your comments about the answer here !!!**


# 4  Non-linear Case

Next we will look at a new data set that would also seem to fall into the non-separable category. The data set we are using now is ex3.csv.



In [None]:
# For the new data set 
ex3 = pd.read_csv("ex3.csv")

# To visualize
sns.scatterplot(x='x', y='y', hue='z', data=ex3, legend=False)
plt.show()

---

### 🚩 Exercise 4  (CORE)

For this data we will consider a simple polynomial kernel with degree 2 
(start with C = 1) and visualize the margins using `plot_margin()` again.

More details on the various kernels that can be used with the SVC model are available at
https://scikit-learn.org/stable/modules/svm.html#svm-kernels

**The kernel function can be any of the following:**

- linear : $\langle x, x'\rangle$
- polynomial : $(\gamma \langle x, x'\rangle + r)^d$ where $d$ is specified by `degree` and $r$ by `coef0`
- rbf : $\exp(-\gamma \|x-x'\|^2)$ where $\gamma$ is specified by `gamma` and must be greater than 0
- sigmoid : $\tanh(\gamma \langle x,x'\rangle + r)$ where $r$ is specified by `coef0`
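
The kernel formulas above can be evaluated by hand and checked against sklearn's pairwise kernel functions. This is a minimal sketch; the two points and the values of `gamma`, `r` and `d` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

# Two arbitrary points and kernel parameters (illustrative assumptions)
x1 = np.array([[1.0, 2.0]])
x2 = np.array([[0.5, -1.0]])
gamma, r, d = 0.5, 1.0, 2

inner = (x1 @ x2.T).item()          # <x, x'>
sq_dist = np.sum((x1 - x2) ** 2)    # ||x - x'||^2

# Kernels computed directly from the formulas
k_linear = inner
k_poly = (gamma * inner + r) ** d
k_rbf = np.exp(-gamma * sq_dist)
k_sigmoid = np.tanh(gamma * inner + r)

# The same kernels computed by sklearn
sk_poly = polynomial_kernel(x1, x2, degree=d, gamma=gamma, coef0=r)[0, 0]
sk_rbf = rbf_kernel(x1, x2, gamma=gamma)[0, 0]
sk_sigmoid = sigmoid_kernel(x1, x2, gamma=gamma, coef0=r)[0, 0]

print(k_linear, k_poly, k_rbf, k_sigmoid)
print(sk_poly, sk_rbf, sk_sigmoid)
```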

---

**!!! Add your text solution here !!!**

## 4.1 Other Kernels

Next we will consider an even more complicated separation task where one class is split into two separate clusters by the second class. The data are available as `ex4.csv`.

In [None]:
# For the new data set 
ex4 = pd.read_csv("ex4.csv")

# To visualize
plt.figure(figsize=(8, 8))
sns.scatterplot(x='x', y='y', hue='z', data=ex4, legend=False)
plt.show()

---
 
### 🚩 Exercise 5  (CORE)

Set up a function for experimenting with different penalties and kernel functions for this dataset (ex4.csv). For this purpose, consider 

1. **C** in $[1,5,10,50,100]$
2. **degree** in $[2,3,4]$
3. **kernel** in $['poly', 'rbf', 'linear']$

as arguments to the `SVC()` function. Note that the degree value is only used by the polynomial kernel and is ignored by the linear and rbf kernels. 

What combination of parameters appears to produce the best fit? Is it easy to tell this by visual inspection alone?
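
The note that `degree` is ignored by non-polynomial kernels can be verified empirically. The sketch below uses `make_moons` data as a stand-in (an assumption, not ex4.csv) and a hypothetical helper `decision_values()`; it compares decision values for two different degrees.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic non-linear data (illustrative assumption, not ex4.csv)
X_demo, y_demo = make_moons(n_samples=200, noise=0.2, random_state=0)

def decision_values(kernel, degree):
    """Fit an SVC and return its decision values on the training data."""
    model = SVC(kernel=kernel, degree=degree, C=1).fit(X_demo, y_demo)
    return model.decision_function(X_demo)

# Changing degree leaves the rbf model unchanged but alters the poly model
rbf_same = np.allclose(decision_values("rbf", 2), decision_values("rbf", 4))
poly_same = np.allclose(decision_values("poly", 2), decision_values("poly", 4))

print("rbf identical for degree 2 and 4: ", rbf_same)
print("poly identical for degree 2 and 4:", poly_same)
```

This means that looping over `degree` for the `rbf` and `linear` kernels only repeats the same fit.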

---

**!!! Add your text solution here !!!**

# 5  Model Assessment <a id='assess'></a>

So far we have only inspected the various models by eye to get a sense of how well they fit our data. Since we are undertaking a classification task here we would like to be able to leverage the metrics and scoring tools we have already learned around logistic regression and related tools. The issue is that while we could generate a simple confusion matrix for our models' predictions this is somewhat limiting.

**WARNING:** 

- By default, SVM predictions are not probabilistic: labels are assigned based on which side of the separator a point falls. 
- As such, `SVC` models do not provide `predict_proba` unless they are fitted with `probability=True`; threshold-based tools such as a ROC curve must instead be built from those probability estimates or from the scores returned by `decision_function`.
- As a reminder, these are some of the metrics that we discussed before:

$$
\text{FPR} = \frac{\text{FP}}{\text{FP}+ \text{TN}}
$$

$$
\text{Recall} = \frac{\text{TP}}{\text{TP}+ \text{FN}}
$$

$$
\text{Precision} = \frac{\text{TP}}{\text{TP}+ \text{FP}}
$$

$$
F1 = 2\left(\frac{Precision \times Recall}{Precision + Recall}\right)
$$
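
As a quick check of these formulas, the sketch below computes each quantity from a confusion matrix and compares it with the corresponding sklearn function. The label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true and predicted labels (illustrative assumption)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels the flattened confusion matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"FPR={fpr:.3f}, Recall={recall:.3f}, Precision={precision:.3f}, F1={f1:.3f}")
```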

---

### 🚩 Exercise 6  (CORE)

Based on the best model that you visualized above,

1. Report the accuracy of the model
2. Obtain the confusion matrix and interpret the results in terms of the quantities defined above 


**!!! Add your comments about the model performance here !!!**


---

### 🚩 Exercise 7  (CORE)

- Construct a full cross-validated grid search over the parameter values: 

`C = np.linspace(0.1, 10, 100)`, `degree = [2, 3, 4]`, and `kernel = ['poly', 'rbf', 'linear']`.

- Which SVM model performs best? Use `plot_margin` to show the resulting separator and support vectors.

**Note**: `degree` is the degree of the polynomial kernel function (`'poly'`) only. It must be non-negative and is ignored by all other kernels.
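
Because `degree` is ignored by the `rbf` and `linear` kernels, a single dictionary grid refits identical models for each degree. One possible alternative, sketched here as an option rather than the required solution, is to pass a list of dictionaries as `param_grid`. The names `full_grid` and `split_grid` are hypothetical; `ParameterGrid` is used only to count the candidate settings.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.linspace(0.1, 10, 100)

# Single grid: every kernel is combined with every degree
full_grid = {"kernel": ["poly", "rbf", "linear"], "C": C_values, "degree": [2, 3, 4]}

# Split grid: degree is only varied for the polynomial kernel
split_grid = [
    {"kernel": ["poly"], "C": C_values, "degree": [2, 3, 4]},
    {"kernel": ["rbf", "linear"], "C": C_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print("candidates in the single grid:", n_full)
print("candidates in the split grid: ", n_split)
```

With 5-fold cross-validation each candidate is fitted five times, so the smaller grid saves a noticeable amount of computation.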

In [None]:
cv = GridSearchCV(
    SVC(),
    param_grid = { 
        'kernel':________________, 
        'C': _______________,
        'degree': [2,3,4]
    },
    cv = KFold(5, shuffle = True, random_state = 42)
)

# Fit the model on ex4 data set

# Get the best model parameters and the accuracy of the model


---

**!!! Add your text solution here !!!**

# 6 Default Data Case <a id='default'></a>

The dataset consists of 10000 individuals and whether they have defaulted on their credit card. The main aim is to build a classifier for `default` and assess its predictive accuracy. The columns included in the data set are as follows:

* `default` - Whether the individual has defaulted

* `student` - Whether the individual is a student

* `balance` - The balance in the individual's account

* `income` - Income of an individual

We read the data into python using pandas.


In [None]:
df_default = pd.read_csv("Default.csv", index_col=0)

# For now, let's just drop the student variable.
df_default = df_default.drop("student", axis=1)
df_default.head()

---

### 🚩 Exercise 8 (CORE)

1. Convert your response variable into the numerical format

2. Split the data into training and test sets (**Is there anything you should try to account for when splitting the data?**). Use $10\%$ of the whole sample as the test set

3. Use the following code to obtain the `RandomizedSearchCV` results and sort your model results by the value of "mean_test_recall". Comment on the results in terms of accuracy and recall. 

**Note** that you may encounter some warnings if you prefer `LinearSVC` to `SVC` below; try to investigate the possible reasons for them. 
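
The first two steps can be sketched on a made-up data frame as follows. The `toy` data frame and its values are assumptions for illustration (not Default.csv), and using `stratify` is one option to consider when the classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Made-up imbalanced data with a Yes/No response (illustrative assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "default": np.where(rng.random(1000) < 0.05, "Yes", "No"),
    "balance": rng.normal(800, 400, 1000),
    "income": rng.normal(33000, 13000, 1000),
})

# LabelEncoder sorts the classes, so "No" becomes 0 and "Yes" becomes 1
encoder = LabelEncoder()
y_toy = encoder.fit_transform(toy["default"])
X_toy = toy.drop(columns="default")

# stratify keeps the class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=42
)

print(encoder.classes_)
print("positive rate (all, train, test):", y_toy.mean(), y_tr.mean(), y_te.mean())
```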

In [None]:
from sklearn.preprocessing import LabelEncoder

# Convert your response into numerical format

In [None]:
# Candidate C values: powers of two with exponents -5, -3, -1, 1, 3, 5
C_list = [2.0**pwr for pwr in range(-5, 7, 2)]

C_list


In [None]:
linear_svm = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel='linear', C=1)) 
    ])


# specify parameters and distributions to sample from
lin_param_dist = {'svm_clf__C':loguniform(C_list[0], C_list[-1])}

lin_rs = RandomizedSearchCV(linear_svm, lin_param_dist, n_iter=10, 
                            scoring = ["accuracy", "f1","recall"], 
                            cv = StratifiedKFold(n_splits = 5),
                            refit = "recall", 
                            random_state = 42,
                            return_train_score = True)

lin_rs.fit(X_train, y_train)
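
With several scoring metrics, the search results are stored in `cv_results_` with one `mean_test_<metric>` column per metric. The sketch below shows one way to tabulate and sort them; it uses synthetic data from `make_classification` and the hypothetical names `search` and `summary`, so the numbers will not match the Default data.

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (illustrative assumption, not Default.csv)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.8], random_state=0
)

search = RandomizedSearchCV(
    Pipeline([("scaler", StandardScaler()), ("svm_clf", SVC(kernel="linear"))]),
    {"svm_clf__C": loguniform(2**-5, 2**5)},
    n_iter=5,
    scoring=["accuracy", "recall"],
    refit="recall",
    cv=StratifiedKFold(n_splits=5),
    random_state=42,
)
search.fit(X_toy, y_toy)

# One row per sampled C value, sorted by mean cross-validated recall
results = pd.DataFrame(search.cv_results_)
cols = ["param_svm_clf__C", "mean_test_accuracy", "mean_test_recall", "rank_test_recall"]
summary = results[cols].sort_values("mean_test_recall", ascending=False)
print(summary)
```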

---

**!!! Add your text solution here !!!**

---

### 🚩 Exercise 9 (CORE)

Using the following code snippet, try different values of kernel, degree and C. What seems to produce the best model? 

For simplicity, this is again written in terms of the `SVC()` function and `GridSearchCV`.  

(Hint: Recommended kernels are rbf, poly, and linear).

**Remember that degree is only a valid argument for polynomial kernels. It is ignored by the `linear` and `rbf` kernels by definition.**

In [None]:
# Define the pipeline steps
pipeline_steps = [
    ('scaler', StandardScaler()), # First, scale the features
    ('svc', SVC())                # Then, train the SVC model
]


# Define the parameter grid, note the 'svc__' prefix for SVC parameters
param_grid = { 
    'svc__kernel': ('poly', 'linear', 'rbf'), 
    'svc__C': np.linspace(0.1, 10, 10),
    'svc__degree': [2, 3, 4]
}

# Set up GridSearchCV around the pipeline
cv = GridSearchCV(
    Pipeline(pipeline_steps),   # scaler + SVC pipeline
    param_grid = param_grid,    # grid defined above
    cv = KFold(5, shuffle=True, random_state=42)
)

# Fit the model on the dataset
cv.fit(X_train, y_train)


---

**!!! Add your text solution here !!!**

---

### 🚩 Exercise 10 (EXTRA)

Comment out the line of code that includes the `StandardScaler` in the pipeline below. 

- What happens to the model's predictive performance? 

- Try adjusting C and/or the kernel manually to see if you can improve the performance
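
To see why scaling can matter for an SVM, here is a deliberately extreme sketch on made-up data: one informative feature on a small scale and one pure-noise feature on a very large scale. The data and variable names are assumptions for illustration, and the effect on the Default data may be less dramatic.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up data: small-scale informative feature, large-scale noise feature
rng = np.random.default_rng(0)
n = 400
y_toy = np.repeat([0, 1], n // 2)
informative = 3 * y_toy + rng.normal(size=n)
noise = 10_000 * rng.normal(size=n)
X_toy = np.column_stack([informative, noise])

unscaled = SVC(kernel="rbf", C=1)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))

# Mean 5-fold cross-validated accuracy with and without scaling
acc_unscaled = cross_val_score(unscaled, X_toy, y_toy, cv=5).mean()
acc_scaled = cross_val_score(scaled, X_toy, y_toy, cv=5).mean()

print(f"accuracy without scaling: {acc_unscaled:.3f}")
print(f"accuracy with scaling:    {acc_scaled:.3f}")
```

Without scaling, distances between points are dominated by the large-scale noise feature, so the rbf kernel can barely see the informative one.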

In [None]:
C = ? # Use the parameter from the output of Exercise 9
kernel = ? # Use the parameter from the output of Exercise 9
# degree = ? # Use the parameter from the output of Exercise 9 if you get polynomial kernel

m_svc = make_pipeline(
       # StandardScaler(),
        SVC(C=C, kernel=kernel, random_state = 42)
    )

# fitted model
m_svc.fit(X_train,y_train)

---

**!!! Add your text solution here !!!**

---

### 🚩 Exercise 11 (CORE)

Remember that the main problem with this data set is class imbalance, as we observed last week. For that reason, upsampling (over-sampling) or downsampling (under-sampling) can be considered here again!

- Consider adding an oversampling step using `RandomOverSampler` within the pipeline and repeating the grid search from Exercise 9. Focus on the `rbf` kernel to reduce the computational time!

- Does the best fitted model change?

- Compare the impact of `RandomOverSampler` in terms of the confusion matrix results
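
To illustrate the idea behind random oversampling without requiring `imblearn`, the sketch below duplicates minority-class rows with `sklearn.utils.resample`. This is a stand-in for `RandomOverSampler`, not the exercise solution, and the Gaussian data are made up. Note that only the training set is resampled.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils import resample

# Made-up imbalanced data: 950 majority and 50 minority points
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(loc=0.0, size=(950, 2)),
                   rng.normal(loc=1.5, size=(50, 2))])
y_toy = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Oversample the minority class in the TRAINING set only
X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.r_[np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]

# Compare test-set recall before and after oversampling
recall_plain = recall_score(y_te, SVC(kernel="rbf").fit(X_tr, y_tr).predict(X_te))
recall_over = recall_score(y_te, SVC(kernel="rbf").fit(X_bal, y_bal).predict(X_te))

print(f"recall without oversampling: {recall_plain:.3f}")
print(f"recall with oversampling:    {recall_over:.3f}")
```

Inside a cross-validated search, placing the sampler in an `imblearn` pipeline ensures that resampling is applied only to the training folds.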

In [None]:
# Install imbalanced-learn (imported as imblearn) if necessary 
!pip install imbalanced-learn

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

---

**!!! Add your text solution here !!!**

---

### 🚩 Exercise 12 (EXTRA)

Remember that we previously considered logistic regression on the same data set.

- Fit both a logistic regression and an SVC model on the oversampled data set. Consider setting the `probability` argument of the SVC model to `True` to enable probability estimates. 

You can use the best set of parameters for SVC from the previous exercise, if you completed it. Otherwise, consider a `poly` kernel with some `C` value to obtain a reasonable SVC model. 

- Evaluate and compare the performance of the SVC and logistic regression models on the test set

- Create a ROC curve for each model on the same plot and decide which model seems better
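
As a starting point, the sketch below computes ROC curve coordinates and AUC for both model types on synthetic data. The data, the `rbf` kernel and the names `models` and `roc` are assumptions for illustration; oversampling and parameter tuning are left out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data (illustrative assumption, not Default.csv)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=4, n_informative=2, n_redundant=0,
    class_sep=1.5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
    "svc": make_pipeline(StandardScaler(),
                         SVC(kernel="rbf", probability=True, random_state=0)),
}

roc = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]   # estimated probability of class 1
    fpr, tpr, _ = roc_curve(y_te, scores)
    roc[name] = {"fpr": fpr, "tpr": tpr, "auc": roc_auc_score(y_te, scores)}
    print(f"{name}: AUC = {roc[name]['auc']:.3f}")
```

Each `fpr`/`tpr` pair can then be passed to `plt.plot(fpr, tpr, label=name)` to draw both curves on the same axes.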

---

**!!! Add your text solution here !!!**

# Completing the Worksheet

At this point you have hopefully been able to complete all the CORE exercises and attempted the EXTRA ones. Now 
is a good time to check the reproducibility of this document by restarting the notebook's
kernel and rerunning all cells in order.

Before generating the PDF, please go to Edit -> Edit Notebook Metadata and change 'Student 1' and 'Student 2' in the **name** attribute to include your name. If you are unable to edit the Notebook Metadata, please add a Markdown cell at the top of the notebook with your name(s).

Once that is done and you are happy with everything, you can then run the following cell 
to generate your PDF. Once generated, please submit this PDF on Learn by 16:00 on the Friday of the week the workshop was given. 

In [None]:
!jupyter nbconvert --to pdf mlp_week07.ipynb 
