![BTS](https://github.com/vfp1/bts-mbds-data-science-foundations-2019/raw/master/sessions/img/Logo-BTS.jpg)

# Session 5: Support Vector Machines EXERCISES

### Filipa Peleja <filipa.peleja@bts.tech>
### Victor F. Pajuelo Madrigal <victor.pajuelo@bts.tech>

## Classical Data Analysis (16-02-2021)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vfp1/bts-cda-2020/blob/master/Session_5/Session_5_Classical_Data_Analysis_SVM_EXERCISES.ipynb)





## Exercise one [NO CODE]

1.   What is a support vector?
> Support vectors are the data points that determine the hyperplane and its margins. For a linearly separable dataset, the support vectors are the "innermost" points of the two classes, that is, the points of each class that lie closest to the other class. The hyperplane is placed between these points so that the margin is as wide as possible.
2.   Why is it important to scale inputs when using an SVM?
> Unscaled inputs can slow down training, but the bigger issue is that they affect model performance. Depending on what is being measured, unscaled inputs can have very different ranges of values (e.g. *[Height, Weight, LDL Cholesterol Level]* to predict *[High risk of heart disease (y/n)]*). The SVM maximizes a margin that is measured as a distance in feature space, so features with large numeric ranges dominate that distance, and the weights applied to the unscaled inputs end up on very different scales as well. Because the SVM cost function typically includes an L2 penalty (which penalizes large weights), some weights tend to be driven towards zero while others are inflated to reduce the overall loss. Scaling the inputs puts the weights on a comparable scale, which helps prevent disproportionate values being assigned to them.

3. Should you use `dual=True` or `dual=False` when training a model with millions of samples but hundreds of features?
> `dual=False`, because there are far more samples than features. Solving the primal problem is the preferable choice when n_samples > n_features.
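The answers to questions 2 and 3 can be illustrated together. The following is a minimal sketch, assuming the breast cancer dataset used later in this notebook as a stand-in (it has hundreds of samples, not millions) and 5-fold cross-validation as the evaluation scheme; neither choice comes from the questions themselves.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Before scaling, the feature spreads differ by orders of magnitude.
feature_std = X.std(axis=0)
print("smallest / largest feature std:", feature_std.min(), feature_std.max())

# The scaler sits inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only.
# dual=False solves the primal problem (more samples than features).
clf = make_pipeline(StandardScaler(), LinearSVC(dual=False, C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"5-fold accuracy with scaling: {mean_accuracy:.3f}")
```

Swapping `StandardScaler()` out of the pipeline is an easy way to compare against the unscaled case.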

## Exercise two [NECESSARY]

Train an SVM classifier on the datasets shown in class (not the regression one). Take special care with the hyperparameters for multiclass classification, C, and the other hyperparameters that we discussed. You may want to tune the hyperparameters using smaller validation sets to speed up the process. What accuracy can you reach?

In this exercise you need to:

- Visualize and present the dataset
- Apply the SVC to perform binary classification
- Comment on the usage of the different kernels and the effect of the hyper-parameters (is a nonlinear kernel always needed?)
- Visualize the results with a confusion matrix
- Compute the accuracy score and comment on the results
- Perform any other experiments that you can think of, always reason about the results!

In [311]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, precision_score
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

# Load breast cancer data set
data = load_breast_cancer()

# Scale inputs
scaler = StandardScaler()
inputs_scaled = scaler.fit_transform(data.data)

x = pd.DataFrame( data = inputs_scaled, columns = data.feature_names)
y = data.target

# Train/test split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=0.3, random_state = 0)

In [312]:
print('Normalized Input Dataset:')
x

Normalized Input Dataset:


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.886690 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.942210 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.511870 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955000 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | -0.009560 | -0.562450 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 2.110995 | 0.721473 | 2.060786 | 2.343856 | 1.041842 | 0.219060 | 1.947285 | 2.320965 | -0.312589 | -0.931027 | ... | 1.901185 | 0.117700 | 1.752563 | 2.015301 | 0.378365 | -0.273318 | 0.664512 | 1.629151 | -1.360158 | -0.709091 |
| 565 | 1.704854 | 2.085134 | 1.615931 | 1.723842 | 0.102458 | -0.017833 | 0.693043 | 1.263669 | -0.217664 | -1.058611 | ... | 1.536720 | 2.047399 | 1.421940 | 1.494959 | -0.691230 | -0.394820 | 0.236573 | 0.733827 | -0.531855 | -0.973978 |
| 566 | 0.702284 | 2.045574 | 0.672676 | 0.577953 | -0.840484 | -0.038680 | 0.046588 | 0.105777 | -0.809117 | -0.895587 | ... | 0.561361 | 1.374854 | 0.579001 | 0.427906 | -0.809587 | 0.350735 | 0.326767 | 0.414069 | -1.104549 | -0.318409 |
| 567 | 1.838341 | 2.336457 | 1.982524 | 1.735218 | 1.525767 | 3.272144 | 3.296944 | 2.658866 | 2.137194 | 1.043695 | ... | 1.961239 | 2.237926 | 2.303601 | 1.653171 | 1.430427 | 3.904848 | 3.197605 | 2.289985 | 1.919083 | 2.219635 |


In [317]:
# Model setup (kernel, hyper-parameters)
def model_setup(kernel, C=1.0, degree=3):
    """Fit an SVC on the training split and return its metrics as percentages."""
    # degree is only used by the 'poly' kernel; other kernels ignore it
    model = SVC(kernel=kernel, C=C, degree=degree)
    model.fit(Xtrain, Ytrain)
    Ypred = model.predict(Xtest)  # predict once and reuse for every metric
    conf_matrix = pd.DataFrame(confusion_matrix(Ytest, Ypred))
    train_score = round(100 * model.score(Xtrain, Ytrain), 2)
    test_score = round(100 * model.score(Xtest, Ytest), 2)
    recall = round(100 * recall_score(Ytest, Ypred), 2)
    precision = round(100 * precision_score(Ytest, Ypred), 2)
    return conf_matrix, train_score, test_score, recall, precision

def result_printout(kernel, C_vals, degree=3):
    """Print one confusion matrix per C value and return a table of scores."""
    print(kernel.title(), 'kernel model\n________________________\n')
    scores = pd.DataFrame({'Scores by C-value' : ['Train','Test'] }).set_index('Scores by C-value')
    for C in C_vals:
        conf_matrix, train_score, test_score, recall, precision = model_setup(kernel, C, degree)
        # one column per C value, indexed by metric name
        score = pd.DataFrame({C : [train_score, test_score, recall, precision] , 'Scores by C-value' : ['Train','Test','Recall','Precision']}).set_index('Scores by C-value')
        scores = pd.concat([scores,score], axis = 1)
        print('\nConfusion Matrix: C =',C,'\n',conf_matrix)
    print('\n________________________\n\nAccuracy Scores:')
    return scores

In [318]:
# kernels = ['linear','rbf','poly','sigmoid']
C_vals = [0.10, 1, 10, 100]
result_printout('linear', C_vals)

Linear kernel model
________________________


Confusion Matrix: C = 0.1 
     0    1
0  60    3
1   2  106

Confusion Matrix: C = 1 
     0    1
0  61    2
1   5  103

Confusion Matrix: C = 10 
     0    1
0  59    4
1   7  101

Confusion Matrix: C = 100 
     0    1
0  59    4
1   6  102

________________________

Accuracy Scores:


| Scores by C-value | 0.1 | 1.0 | 10.0 | 100.0 |
|---|---|---|---|---|
| Train | 98.74 | 98.74 | 99.25 | 100.0 |
| Test | 97.08 | 95.91 | 93.57 | 94.15 |
| Recall | 98.15 | 95.37 | 93.52 | 94.44 |
| Precision | 97.25 | 98.1 | 96.19 | 96.23 |


### Linear kernel performance
Training accuracy did not decrease as C increased (as expected): it was 98.74% for both C=0.10 and C=1 and reached 100% for C=100. However, the highest test accuracy (of the C-values used) was found for C=0.10 (97.08%). Even though C=0.10 tied for the lowest training score, it appears to be the best option for generalizing to the test data, while the larger C values show signs of overfitting.

In [319]:
# kernels = ['linear','rbf','poly','sigmoid']
C_vals = [0.10, 1, 10, 100]
result_printout('rbf', C_vals)

Rbf kernel model
________________________


Confusion Matrix: C = 0.1 
     0    1
0  55    8
1   3  105

Confusion Matrix: C = 1 
     0    1
0  60    3
1   1  107

Confusion Matrix: C = 10 
     0    1
0  62    1
1   1  107

Confusion Matrix: C = 100 
     0    1
0  63    0
1   5  103

________________________

Accuracy Scores:


| Scores by C-value | 0.1 | 1.0 | 10.0 | 100.0 |
|---|---|---|---|---|
| Train | 95.23 | 98.24 | 98.74 | 100.0 |
| Test | 93.57 | 97.66 | 98.83 | 97.08 |
| Recall | 97.22 | 99.07 | 99.07 | 95.37 |
| Precision | 92.92 | 97.27 | 99.07 | 100.0 |


### RBF Kernel Performance
For C=10, the RBF kernel reached a higher test score (98.83%) than the linear kernel did for any of the tested C values (best: 97.08%). Precision and recall were also higher (both 99.07%).

In [328]:
# kernels = ['linear','rbf','poly','sigmoid']
C_vals = [0.10, 1, 10, 100]
print('NOTE: DF = 3')
result_printout('poly', C_vals, 3)

NOTE: DF = 3
Poly kernel model
________________________


Confusion Matrix: C = 0.1 
     0    1
0  33   30
1   1  107

Confusion Matrix: C = 1 
     0    1
0  46   17
1   2  106

Confusion Matrix: C = 10 
     0    1
0  59    4
1   1  107

Confusion Matrix: C = 100 
     0    1
0  60    3
1   2  106

________________________

Accuracy Scores:


| Scores by C-value | 0.1 | 1.0 | 10.0 | 100.0 |
|---|---|---|---|---|
| Train | 83.92 | 89.7 | 96.73 | 99.25 |
| Test | 81.87 | 88.89 | 97.08 | 97.08 |
| Recall | 99.07 | 98.15 | 99.07 | 98.15 |
| Precision | 78.1 | 86.18 | 96.4 | 97.25 |


In [324]:
# kernels = ['linear','rbf','poly','sigmoid']
C_vals = [0.10, 1, 10, 100]
print('NOTE: DF = 5')
result_printout('poly', C_vals, 5)

NOTE: DF = 5
Poly kernel model
________________________


Confusion Matrix: C = 0.1 
     0    1
0  25   38
1   2  106

Confusion Matrix: C = 1 
     0    1
0  32   31
1   1  107

Confusion Matrix: C = 10 
     0    1
0  39   24
1   1  107

Confusion Matrix: C = 100 
     0    1
0  54    9
1   2  106

________________________

Accuracy Scores:


| Scores by C-value | 0.1 | 1.0 | 10.0 | 100.0 |
|---|---|---|---|---|
| Train | 78.89 | 86.68 | 90.2 | 94.72 |
| Test | 76.61 | 81.29 | 85.38 | 93.57 |
| Recall | 98.15 | 99.07 | 99.07 | 98.15 |
| Precision | 73.61 | 77.54 | 81.68 | 92.17 |


In [325]:
# kernels = ['linear','rbf','poly','sigmoid']
C_vals = [0.10, 1, 10, 100]
print('NOTE: DF = 10')
result_printout('poly', C_vals, 10)

NOTE: DF = 10
Poly kernel model
________________________


Confusion Matrix: C = 0.1 
     0    1
0  18   45
1   2  106

Confusion Matrix: C = 1 
     0    1
0  26   37
1   3  105

Confusion Matrix: C = 10 
     0    1
0  30   33
1   8  100

Confusion Matrix: C = 100 
     0   1
0  29  34
1  12  96

________________________

Accuracy Scores:


| Scores by C-value | 0.1 | 1.0 | 10.0 | 100.0 |
|---|---|---|---|---|
| Train | 77.64 | 81.41 | 85.68 | 89.2 |
| Test | 72.51 | 76.61 | 76.02 | 73.1 |
| Recall | 98.15 | 97.22 | 92.59 | 88.89 |
| Precision | 70.2 | 73.94 | 75.19 | 73.85 |


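### Poly Kernel Performance (degree = 3, 5, 10)
Here "degree" is the polynomial `degree` parameter of the kernel, labelled DF in the printouts above. For degree = 3, test accuracy was highest (97.08%) for both C = 10 and C = 100. Interestingly, performance for degree = 5 was highest for C = 100. For both degree = 5 and degree = 10, precision was generally quite low: the confusion matrices show many class-0 samples predicted as class 1.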

In [329]:
# kernels = ['linear','rbf','poly','sigmoid']
C_vals = [0.10, 1, 10, 100]
result_printout('sigmoid', C_vals)

Sigmoid kernel model
________________________


Confusion Matrix: C = 0.1 
     0    1
0  55    8
1   2  106

Confusion Matrix: C = 1 
     0    1
0  58    5
1   5  103

Confusion Matrix: C = 10 
     0   1
0  58   5
1  13  95

Confusion Matrix: C = 100 
     0   1
0  55   8
1  13  95

________________________

Accuracy Scores:


| Scores by C-value | 0.1 | 1.0 | 10.0 | 100.0 |
|---|---|---|---|---|
| Train | 94.97 | 95.23 | 93.97 | 93.97 |
| Test | 94.15 | 94.15 | 89.47 | 87.72 |
| Recall | 98.15 | 95.37 | 87.96 | 87.96 |
| Precision | 92.98 | 95.37 | 95.0 | 92.23 |


### Sigmoid Kernel Performance
Across all metrics used, the Sigmoid Kernel performed best for lower C values. 


### Conclusion

Overall, the top-performing model (of the models tested) was the RBF kernel with C = 10, which reached 98.83% test accuracy with 99.07% precision and recall.
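One possible follow-up experiment is to choose the kernel and C by cross-validation on the training split, so that the test split is only used for the final score. The sketch below is an assumption about how that could look, not part of the original analysis: the `Pipeline` step names (`scale`, `svc`), the reduced kernel list, and `cv=5` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scaling is part of the pipeline, so it is refit on every CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 4))

# The test split is touched only once, after model selection.
test_accuracy = search.score(X_test, y_test)
print("held-out test accuracy:", round(test_accuracy, 4))
```

Because the scaler is fit inside each fold here, the numbers need not match the tables above, where the scaler was fit on the full dataset before splitting.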

## Exercise three [OPTIONAL]

*Try to solve this as an optional assignment; we will review the code in the following class*

Support vector machines (SVMs) are a particularly powerful and flexible class of supervised algorithms. In this exercise, we will develop the intuition behind support vector machines and their use in classification problems.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

To begin with, let us generate the data for a linear classification problem. To do so, use the `make_blobs` function from `sklearn.datasets`. We want to generate 50 samples in two classes (`centers=2`), with `random_state=0` and `cluster_std=0.6`. Finally, plot the points with `plt.scatter`.

A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification. For two-dimensional data like that shown here, this is a task we could do by hand. Think about a line that separates the two classes: how many are there? Which do you think would be the most appropriate given the data points? Draw some lines over this plot with slopes [1, 0.5, -0.2] and biases [0.65, 1.6, 2.9]. Use the `np.linspace` function to generate the x and the line equation

$$y = mx + b$$

for the y. Finally plot a "new point" in the coordinates (0.6, 2.1) with a red X.
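As a quick numeric companion to the plot, the line equation can be evaluated directly to see on which side of each candidate line the new point falls. This is a minimal sketch; the x range passed to `np.linspace` is an assumption chosen to cover the blobs.

```python
import numpy as np

slopes = [1, 0.5, -0.2]
biases = [0.65, 1.6, 2.9]
new_point = (0.6, 2.1)

# x values for drawing the lines (range is an assumption)
xfit = np.linspace(-1, 3.5, 50)

above = []
for m, b in zip(slopes, biases):
    yfit = m * xfit + b                    # y = m x + b along the whole range
    line_at_point = m * new_point[0] + b   # height of the line at x = 0.6
    above.append(bool(new_point[1] > line_at_point))
    print(f"y = {m}x + {b}: runs from {yfit[0]:.2f} to {yfit[-1]:.2f}, "
          f"height at x=0.6 is {line_at_point:.2f}, point above line: {above[-1]}")
```

The point lies above the first two lines and below the third, so the three separators do not agree on its label.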

## SVM margins and support vectors
These are three very different separators which, nevertheless, perfectly discriminate between these samples. Depending on which you choose, a new data point (e.g., the one marked by the "X" in this plot) will be assigned a different label!

Support vector machines offer one way to improve on this. The intuition is this: rather than simply drawing a line between the classes, we can draw around each line a margin of some width, up to the nearest point (no matter the class). To visualize this, let us repeat the same plot but adding some code to fill the margins. Use the method `plt.fill_between` with `color='#AAAAAA'`. The margins for each of the lines above are [0.33, 0.55, 0.2].
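The plotting steps described so far could be sketched as follows. The argument `centers=2`, the x range, the colormap, and the `alpha` value are assumptions; the slopes, biases, margin widths, and fill color come from the text. The band drawn by `fill_between` extends the given width above and below each line.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

# centers=2 is an assumption: the text asks for two classes
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.6)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap="autumn")

xfit = np.linspace(-1, 3.5, 50)
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    ax.plot(xfit, yfit, "-k")
    # shade a band of half-width d around the line
    ax.fill_between(xfit, yfit - d, yfit + d,
                    edgecolor="none", color="#AAAAAA", alpha=0.4)

ax.set_xlim(-1, 3.5)
```

The line with the widest band that still touches no point is the one a maximum-margin classifier would prefer.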

Now fit an SVM to this data. Use Scikit-Learn's support vector classifier to train an SVM model. For the time being, we will use a linear kernel and set the C parameter to a very large number like 1E10. 



Now retrieve the support vectors from the learned model. Would you be able to identify them on the plot? Why are these the support vectors? Why are they important?
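A minimal sketch of the fit and of retrieving the support vectors, assuming the same `make_blobs` call as above with `centers=2`:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.6)

# A very large C approximates a hard margin: violations are heavily penalized.
model = SVC(kernel="linear", C=1e10)
model.fit(X, y)

support_vectors = model.support_vectors_   # coordinates of the support vectors
print("support vectors per class:", model.n_support_)
print(support_vectors)

train_accuracy = model.score(X, y)
print("training accuracy:", train_accuracy)
```

`model.support_` holds the row indices of the same points in `X`, which is convenient for highlighting them on a scatter plot.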

This function will plot the decision boundaries of a model and the support vectors. 

In [None]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    # use the axes passed in, or fall back to the current axes
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none',
                   edgecolors='blue')
        
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

Generate a scatter plot of the data set and use the previous function to draw the support vectors and the decision boundaries 

A key to this classifier's success is that, for the fit, only the position of the support vectors matters; any points further from the margin which are on the correct side do not modify the fit! Technically, this is because these points do not contribute to the loss function used to fit the model, so their position and number do not matter so long as they do not cross the margin.

To see an example of this, simulate the points with the same random seed, but now simulate 120. Then train the SVM again and plot the decision boundaries. Which are the support vectors this time? Is this an expected result? Why?
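A complementary way to check the same property, different from the 120-point experiment described above, is to refit the model using only its support vectors and compare the two hyperplanes. This is a sketch under the assumption that the data come from the earlier `make_blobs` call with `centers=2`.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.6)
full = SVC(kernel="linear", C=1e10).fit(X, y)

# Keep only the support vectors (and their labels), then fit again.
sv_idx = full.support_
reduced = SVC(kernel="linear", C=1e10).fit(X[sv_idx], y[sv_idx])

print("full fit:    w =", full.coef_[0], " b =", full.intercept_[0])
print("SV-only fit: w =", reduced.coef_[0], " b =", reduced.intercept_[0])

# The model trained on a handful of points should still label every sample.
reduced_accuracy = reduced.score(X, y)
print("SV-only model accuracy on all 50 points:", reduced_accuracy)
```

The two sets of coefficients are expected to agree up to the solver's numerical tolerance, since the discarded points carried no weight in the solution.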

## SVM softening the margins
Now add a new point inside the margin, say (-0.5, 2), to class 0. What do you expect will happen?

How about adding an even more extreme outlier, such as (0, 0), to class 0?

What happened to the margin? Could we use this model if we had a red point further right? Why?

Now, is there a way that we could try to make the SVM more robust to these possible outliers? Which one? Try to implement an SVM model with the same data modifying the C parameter. What do you observe?
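One way to explore the effect of C is to compare the margin width, $2 / \lVert w \rVert$, for a large and a small value. The sketch below assumes the earlier blob data with `centers=2` plus the two outliers from the text; the specific C values (10 and 0.1) are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.6)

# Append the two outliers from the text, both labelled as class 0.
X_out = np.vstack([X, [[-0.5, 2.0], [0.0, 0.0]]])
y_out = np.append(y, [0, 0])

widths = {}
for C in (10.0, 0.1):
    model = SVC(kernel="linear", C=C).fit(X_out, y_out)
    # For a linear SVM the margin width is 2 / ||w||.
    widths[C] = 2.0 / np.linalg.norm(model.coef_)
    print(f"C={C}: margin width {widths[C]:.3f}, "
          f"support vectors {len(model.support_vectors_)}")
```

A smaller C tolerates more margin violations, so it is expected to produce a wider margin and to be influenced less by the individual outliers.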

## SVM Kernels
We have now seen that SVMs are very useful for finding the optimal separating hyperplane when your data is linearly separable. Even when you have some noise in the data set, you can tune the C value to adjust for it. But what happens if your classes are not linearly separable? Is there a way we could overcome this drawback? Let us generate a dataset that is not linearly separable. Use the `make_circles` function from `sklearn.datasets` (the old `samples_generator` module has been removed from scikit-learn). Generate 100 examples with `factor=.1, noise=.1, random_state=0`. Then generate a scatter plot to visualize the samples, using a different color for each class.

Now try to classify this data with a linear SVM, what do you expect will happen? Does this model capture the pattern of the classes? Plot the classification results with the `plot_svc_decision_function`

What could we do in order to make this dataset linearly separable? We can project it into a higher dimensional space. Note the similarity between this, and the polynomial regression. In order to have a better understanding of how kernels work, first implement a third axis with python with the following expression:

$$x_3 = e^{-(x_1^2 + x_2^2)}$$

This expression is a radial basis function centered at the origin, which is the idea behind the RBF kernel. You should name the new axis `x_3`.
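The projection can be sketched numerically as follows. Comparing a linear SVM on the original two features with one on the lifted three features is an illustrative check, not part of the exercise statement, and the value `C=1e6` is an assumption.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=0)

# Third coordinate: close to 1 near the origin (inner circle),
# much smaller for points far from it (outer circle).
x_3 = np.exp(-(X ** 2).sum(axis=1))
X_lifted = np.column_stack([X, x_3])

acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)
acc_3d = SVC(kernel="linear", C=1e6).fit(X_lifted, y).score(X_lifted, y)

print(f"linear SVM on (x_1, x_2):      {acc_2d:.2f}")
print(f"linear SVM on (x_1, x_2, x_3): {acc_3d:.2f}")
```

In the lifted space a flat plane at some height of `x_3` separates the two rings, which no straight line can do in the original plane.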

Once you have implemented this 3rd axis, you may visualize it with the following code:

In [None]:
from mpl_toolkits import mplot3d
from ipywidgets import interact, fixed

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], x_3, c=y, s=50, cmap='autumn')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x_1')
    ax.set_ylabel('x_2')
    ax.set_zlabel('x_3')
    
# the keyword must match the plot_3D argument name (azim)
interact(plot_3D, elev=[30, 0, 90], azim=(-180, 180),
         X=fixed(X), y=fixed(y));

Now use the SVM classifier but set the `kernel='rbf'` with high value for C and plot the results
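A minimal sketch of this step, assuming `C=1e6` as the "high value" and leaving `gamma` at scikit-learn's default:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=0)

# The RBF kernel performs the lifting implicitly, for every point at once.
clf = SVC(kernel="rbf", C=1e6)
clf.fit(X, y)

rbf_accuracy = clf.score(X, y)
print(f"training accuracy: {rbf_accuracy:.2f}, "
      f"support vectors: {len(clf.support_vectors_)}")
```

Passing `clf` to `plot_svc_decision_function` after a scatter plot of `X` would draw the resulting circular boundary.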

## Evaluating the classification results
To assess the quality of our model, we need to compute some quantitative scores; we will review some of the most relevant ones. First, use the function `make_blobs` to generate a multiclass dataset. Generate 200 samples with 4 different classes, setting `random_state=0` and `cluster_std=0.6`.

Split the data set in training and test and apply the SVC with `C=1, kernel='rbf', gamma='auto', class_weight='balanced', decision_function_shape='ovr'` 

Predict on the test set and plot the confusion matrix, use the `confusion_matrix` function from sklearn and `heatmap` from seaborn. What would be the confusion matrix result of a perfect classification? What information can we extract from it?

Finally compute the accuracy score, what does this value mean?
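The evaluation steps above could be sketched as follows. The `test_size` and the split seed are assumptions, since the text does not fix them; the variable names `Xtest`, `ytest`, and `svc` are chosen to match the plotting code further down.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=4, random_state=0, cluster_std=0.6)

# test_size and random_state of the split are assumptions
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=0
)

svc = SVC(C=1, kernel="rbf", gamma="auto",
          class_weight="balanced", decision_function_shape="ovr")
svc.fit(Xtrain, ytrain)
ypred = svc.predict(Xtest)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(ytest, ypred)
acc = accuracy_score(ytest, ypred)

print(cm)
print("accuracy:", acc)
# Accuracy is the share of samples on the diagonal of the confusion matrix.
print("trace / total:", np.trace(cm) / cm.sum())
```

A perfect classifier would give a purely diagonal matrix; the off-diagonal cells show which pairs of classes get confused. If seaborn is installed, `sns.heatmap(cm, annot=True)` draws the matrix as a heatmap.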

You can use the following code to visualize the decision function computed by the model.

In [None]:
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    clf: a fitted classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = plt.contourf(xx, yy, Z, **params)
    return out

X0, X1 = Xtest[:, 0], Xtest[:, 1]
xx, yy = make_meshgrid(X0, X1)

plot_contours(svc, xx, yy,
               cmap='jet', alpha=0.6)

plt.scatter(X0, X1, c=ytest, cmap='jet', s=20, edgecolors='k')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

Now repeat the computations with a linear kernel. What changes do you observe? Now change the `cluster_std` when you generate the data. What would you expect when you increase it? And when you decrease it?