# Q6 - SVM Programming

***In this question, you will use an SVM to solve a classification problem on the provided data file "hw3q6.csv". Each row in the data file is one sample. The first 12 columns are features and the last column, "Class", is the label, with 1 and 0 indicating positive and negative samples, respectively.***

## Import Libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

import matplotlib.pyplot as plt
import numpy as np
import sys
import os

%matplotlib inline

## (a) Load data. Report the number of positive and negative samples in the dataset.

In [None]:
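def data_and_headers(filename):
    # Read the CSV: the first row holds the column names, the rest are samples.
    # Blank lines (e.g. a trailing newline at the end of the file) are skipped.
    with open(filename) as fp:
        data = [line.strip().split(',') for line in fp if line.strip()]
    headers = np.asarray(data[0])
    class_field = len(headers) - 1  # the label column ("Class") is the last one
    # Features: every column except the last, parsed as floats.
    data_x = np.asarray([[float(v) for v in row[:class_field]] for row in data[1:]])
    # Labels: kept as an (n_samples, 1) integer column, as the later cells expect.
    data_y = np.asarray([[int(row[class_field])] for row in data[1:]])
    return headers, data_x, data_y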

In [None]:
headers, features_x, labels_y = data_and_headers(os.path.join('Data', 'hw3q6.csv'))

In [None]:
print('Data')
print('Number of features - ' + str(features_x.shape[1]))
print('Total Number of observations - ' + str(features_x.shape[0]))
print('Number of Positive Samples - ' + str(labels_y[labels_y==1].shape[0]))
print('Number of Negative Samples - ' + str(labels_y[labels_y==0].shape[0]))
print()
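
For comparison, the same load-and-count step can be sketched with `np.loadtxt` and `np.bincount`. This is a minimal sketch, not the notebook's method: the helper name `load_xy` and the three-row inline CSV are hypothetical stand-ins for "hw3q6.csv", and it assumes the label is the last column and takes only the values 0 and 1.

```python
import io
import numpy as np

# Hypothetical three-row stand-in for hw3q6.csv (two features instead of twelve).
csv_text = "f1,f2,Class\n0.5,1.2,1\n0.1,0.7,0\n0.9,0.3,1\n"

def load_xy(source):
    """Return (features, labels) from a CSV whose last column is the label."""
    # skiprows=1 drops the header row; ndmin=2 keeps a 2-D table even for one row.
    table = np.loadtxt(source, delimiter=',', skiprows=1, ndmin=2)
    features = table[:, :-1]
    labels = table[:, -1].astype(int)
    return features, labels

features, labels = load_xy(io.StringIO(csv_text))
# bincount returns [count of 0s, count of 1s] for labels in {0, 1}.
n_neg, n_pos = np.bincount(labels, minlength=2)
print('Positive:', n_pos, 'Negative:', n_neg)
```

Passing a file path instead of the `io.StringIO` object should work the same way, since `np.loadtxt` accepts either.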

## (b) Use stratified random sampling to divide the dataset into training data (75%) and testing data (25%). Report the number of positive and negative samples in both training and testing data.

In [None]:
train_x, test_x, train_y, test_y = train_test_split(features_x,
                                                    labels_y,
                                                    test_size=0.25,
                                                    random_state=10,
                                                    shuffle=True,
                                                    stratify=labels_y)

In [None]:
print('Training Data')
print('Number of features - ' + str(train_x.shape[1]))
print('Total Number of observations - ' + str(train_x.shape[0]))
print('Number of Positive Samples - ' + str(train_y[train_y==1].shape[0]))
print('Number of Negative Samples - ' + str(train_y[train_y==0].shape[0]))
print()
print('Testing Data')
print('Number of features - ' + str(test_x.shape[1]))
print('Total Number of observations - ' + str(test_x.shape[0]))
print('Number of Positive Samples - ' + str(test_y[test_y==1].shape[0]))
print('Number of Negative Samples - ' + str(test_y[test_y==0].shape[0]))
print()
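
One way to confirm that `stratify` did its job is to compare the positive-class fraction before and after the split. The sketch below does this on synthetic labels (an assumed 60 positive and 140 negative samples, not the homework data); `positive_fraction` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in: 60 positives and 140 negatives, 12 random features.
rng = np.random.default_rng(0)
y = np.array([1] * 60 + [0] * 140)
X = rng.normal(size=(200, 12))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=10, stratify=y)

def positive_fraction(labels):
    """Share of samples labelled 1."""
    return float(np.mean(labels == 1))

# With stratification the three fractions should agree up to rounding.
for name, part in [('all', y), ('train', y_tr), ('test', y_te)]:
    print(name, len(part), round(positive_fraction(part), 3))
```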

## (c) Use an SVM with a linear kernel as the classifier (third-party packages are allowed) and set the regularization parameter C to each of [0.1, 0.5, 1, 5, 10, 50, 100] in turn. For each value of C, train an SVM classifier on the training data and record the number of support vectors (SVs). Generate a plot with C on the horizontal axis and the number of SVs on the vertical axis. Give a brief analysis of the plot.

In [None]:
C = [0.1, 0.5, 1, 5, 10, 50, 100]
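# SVC takes keyword-only arguments in current scikit-learn releases, so C and kernel are named explicitly.
models = [SVC(C=c, kernel='linear', random_state=10).fit(train_x, train_y.flatten()) for c in C]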

In [None]:
fig, ax = plt.subplots(figsize=(16,9))
ax.set_title('Regularization vs Support Vectors')
ax.set_ylabel('Number of Support Vectors')
ax.set_xlabel('Regularization Parameter C (log scale)')
plty = [x.support_.shape[0] for x in models]
# C spans three orders of magnitude, so a log-scaled axis keeps the points apart.
ax.set_xscale('log')
ax.plot(C, plty, marker='o')
for c, n_sv in zip(C, plty):
    ax.annotate(str(n_sv), (c, n_sv))
plt.show()
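
Part (c) also asks for a brief analysis, which has to be read from the counts the cells above produce. As an aid, the sketch below shows how the per-class support-vector counts can be inspected through `n_support_`. It runs on synthetic data from `make_classification` (an assumption, not the homework data), and the three C values are only a subset chosen for illustration. In a soft-margin SVM a larger C penalizes margin violations more heavily, so the margin typically narrows and fewer training points end up as support vectors; whether that trend holds on "hw3q6.csv" should be checked against the plot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in with the same number of features as the homework data.
X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=5, random_state=0)

sv_counts = {}
for c in [0.1, 1, 10]:
    model = SVC(C=c, kernel='linear').fit(X, y)
    # n_support_ holds one count per class; support_ holds the sample indices.
    sv_counts[c] = model.n_support_.copy()
    print('C =', c, 'per-class SVs =', model.n_support_,
          'total =', len(model.support_))
```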

## (d) Compare 4 different kernel functions: linear, polynomial, radial basis function (Gaussian kernel), and sigmoid. Make a table recording the accuracy, precision, recall, and F-measure of the classification results for the 4 kernel functions. Tune the parameters via grid search and report your best results with the optimal parameters. Based on the results, which kernel function would you choose?
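
A possible starting point for part (d) is sketched below, using `GridSearchCV` with one parameter grid per kernel. Everything specific here is an assumption made for illustration: the synthetic data from `make_classification`, the grid values, the 5-fold cross-validation, and the choice of F1 as the selection score. To run it on the homework data, `X_tr`, `X_te`, `y_tr`, and `y_te` would be replaced by `train_x`, `test_x`, `train_y.flatten()`, and `test_y.flatten()`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the homework data (assumption).
X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=10, stratify=y)

# Illustrative grids, one per kernel; the values are assumptions, not tuned choices.
param_grids = {
    'linear':  {'C': [0.1, 1, 10]},
    'poly':    {'C': [0.1, 1, 10], 'degree': [2, 3]},
    'rbf':     {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]},
    'sigmoid': {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]},
}

results = {}
for kernel, grid in param_grids.items():
    # Cross-validated search on the training split only.
    search = GridSearchCV(SVC(kernel=kernel, random_state=10),
                          grid, scoring='f1', cv=5)
    search.fit(X_tr, y_tr)
    # predict() uses the best estimator, refit on the whole training split.
    pred = search.predict(X_te)
    results[kernel] = {
        'accuracy': accuracy_score(y_te, pred),
        'precision': precision_score(y_te, pred, zero_division=0),
        'recall': recall_score(y_te, pred, zero_division=0),
        'f1': f1_score(y_te, pred, zero_division=0),
        'params': search.best_params_,
    }

# Plain-text table: one row per kernel.
print(f"{'kernel':<8} {'acc':>6} {'prec':>6} {'rec':>6} {'f1':>6}  best params")
for kernel, r in results.items():
    print(f"{kernel:<8} {r['accuracy']:>6.3f} {r['precision']:>6.3f} "
          f"{r['recall']:>6.3f} {r['f1']:>6.3f}  {r['params']}")
```

The test split is used only once per kernel, after the search has picked its parameters, so the reported metrics are not influenced by the tuning itself.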