# Homework 3 - SVM
### Mihovil Mandic, Winter 2019

### Libraries and modules

In [1]:
import numpy as np
import time
import pandas as pd
from sklearn import svm
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn import preprocessing

### Loading the datasets

In [2]:
train_data = pd.read_csv("CS74_HW4_training_set.csv", sep=",")
test_data = pd.read_csv("CS74_HW4_test_set.csv", sep=",")
new_test_data = test_data.copy()

X_train = train_data.astype(float).iloc[:, 0:6]
y_train = train_data.iloc[:, 6]

# Uncomment the next line if you'd like to normalize the data
#X_train = preprocessing.scale(X_train)

### Looking at our data

#### Training data summary

In [3]:
print(train_data.describe(), '\n')

print("Training - Label counts")
print(train_data['Label'].value_counts(normalize=True))

          Feature_1     Feature_2     Feature_3    Feature_4    Feature_5  \
count   5600.000000  5.600000e+03  5.600000e+03  5600.000000  5600.000000   
mean   10903.161545  7.444728e+04 -4.931251e+03    58.316607   209.785714   
std     5147.732641  1.248326e+05  5.778578e+04    44.796549   127.243862   
min     2017.360000  4.359050e+01 -3.396800e+06     0.000000    33.000000   
25%     7521.820000  1.055090e+04 -5.411000e+03    27.000000   123.000000   
50%     9722.810000  3.017500e+04 -3.330285e+03    45.000000   176.000000   
75%    13320.000000  7.766975e+04 -2.260000e+03    79.000000   262.000000   
max    45110.200000  1.566640e+06  1.410000e+06   450.000000   971.000000   

          Feature_6        Label  
count   5600.000000  5600.000000  
mean    2890.982143     0.436071  
std     3871.558555     0.495941  
min        5.000000     0.000000  
25%      621.000000     0.000000  
50%     1523.000000     0.000000  
75%     3450.000000     1.000000  
max    37040.000000     1.000000  

#### Testing data summary

In [4]:
print(test_data.describe(), '\n')

          Feature_1     Feature_2     Feature_3    Feature_4    Feature_5  \
count   2400.000000  2.400000e+03  2.400000e+03  2400.000000  2400.000000   
mean   10808.028067  7.178885e+04 -5.009052e+03    56.897917   206.568750   
std     5157.804265  1.227602e+05  5.125721e+04    44.635581   126.444024   
min     1954.230000  4.741920e+01 -2.508800e+06     0.000000    33.000000   
25%     7318.457500  1.014838e+04 -5.220000e+03    25.000000   119.000000   
50%     9655.080000  2.875425e+04 -3.201320e+03    44.000000   174.000000   
75%    13203.100000  7.349465e+04 -2.120750e+03    78.000000   254.000000   
max    37483.900000  1.332080e+06  7.530000e+04   350.000000   954.000000   

          Feature_6  
count   2400.000000  
mean    2788.262083  
std     3789.484141  
min        5.000000  
25%      610.250000  
50%     1519.000000  
75%     3295.000000  
max    35169.000000   



The feature means are the same or very similar between the training and test sets, so the classifier should perform comparably on both. The classes are slightly imbalanced, with a 56:44 ratio of label 0 to label 1.

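The training data isn't normalized and the features span very different ranges, so scaling made sense. Note that the commented-out `preprocessing.scale` call standardizes each feature to zero mean and unit variance rather than mapping it to (-1, 1). I tried scaling and it had a minor impact on runtime (mostly helpful when grid searching with polynomial kernels). However, it required modifying my ranges for C and gamma, so I decided not to normalize in the end.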
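As a minimal sketch of how scaling could be combined with cross-validation, the example below wraps a `StandardScaler` and an `SVC` in a pipeline so the scaler is refit on the training part of each fold. This is not what the notebook does. The synthetic data, the names `X_demo`, `y_demo`, and `scaled_svc`, and the choice of `C=1.0` and `gamma="scale"` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(10_000, 5_000, size=300),  # large-scale feature
    rng.normal(50, 40, size=300),         # small-scale feature
])
# The label depends only on the small-scale feature.
y_demo = (X_demo[:, 1] > 50).astype(int)

# Inside cross_val_score the whole pipeline is cloned and refit per fold,
# so the scaler never sees the held-out rows of that fold.
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scaled_scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=5, scoring="accuracy")
print("Mean accuracy with scaling:", scaled_scores.mean())
```

With a pipeline like this, the same fitted scaler would also be applied to the test set at prediction time, which avoids training on scaled data and predicting on unscaled data.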

### SVM parameter tuning, 10-fold cross-validation

Here I created a function `svm_param_tuning`.

**Input**: X, y, and the number of folds for cross-validation.

**Output**: Dict of the best parameter values found in `param_grid` (the best accuracy is printed as well).

I started grid searching with 
`C = [1e-3, 1e-2, 1e-1, 1, 1e2, 1e3, 1e4]`.
After that step I noticed that I was getting the best results with C in the range (1, 10).
I therefore switched to `C = [1e-3, 1e-2, 1e-1, 1, 3, 5, 5.5, 6, 7, 10]` and narrowed it down to the range (2, 5).
It made sense to keep refining, so I finished the grid search with `C = [1, 2, 2.4, 2.5, 3, 5, 10]`.

Meanwhile, the gamma values were always negative powers of 10: `gammas = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-1]`.

The best two values for C were `C = 2.4` and `C = 3`, but after a 10-fold CV, `C = 3` ended up with a higher score.
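A head-to-head comparison like the one between `C = 2.4` and `C = 3` can be sketched with `cross_val_score`. The example below runs on synthetic data from `make_classification`, not on the homework CSV, and uses `gamma="scale"` as a placeholder. The data, the variable names, and that gamma setting are assumptions, so the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 6 features, like the homework set.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

candidate_Cs = [2.4, 3]
mean_scores = {}
for C in candidate_Cs:
    # 10-fold CV accuracy for an RBF SVM with this value of C.
    scores = cross_val_score(
        SVC(C=C, gamma="scale", kernel="rbf"),
        X_demo, y_demo, cv=10, scoring="accuracy",
    )
    mean_scores[C] = scores.mean()
    print(f"C={C}: mean accuracy {mean_scores[C]:.4f}")

# Pick the candidate with the higher mean CV accuracy.
best_C = max(mean_scores, key=mean_scores.get)
print("Higher-scoring C:", best_C)
```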

I imported the sklearn class **GridSearchCV** as it seemed like the cleanest solution and it allowed a custom cv fold parameter. It does all the cross-validation + gridsearching across multiple parameters for me.

Out of the kernels tested, `rbf` or `Radial Basis Function` always produced the most consistent and best results in the 0.72 - 0.77 range (scoring = `accuracy`). 

`poly` and `sigmoid` took a hefty amount of time to run, and they never matched the accuracy of `rbf`, even when I increased `coef0` and `degree`.
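One way to search several kernels without wasting fits is to pass `GridSearchCV` a list of dicts, so that `degree` and `coef0` are only varied for the kernels that use them. The sketch below illustrates this on a small synthetic dataset. The grids, the data, and the names `X_demo`, `y_demo`, and `search` are assumptions chosen to keep the run short, not the values used in this homework.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# One dict per kernel: each kernel is only combined with its own parameters.
param_grid = [
    {"kernel": ["rbf"], "C": [1, 3], "gamma": [1e-3, 1e-1]},
    {"kernel": ["poly"], "C": [1, 3], "degree": [2, 3], "coef0": [0, 1]},
    {"kernel": ["sigmoid"], "C": [1, 3], "coef0": [0, 1]},
]

search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Candidates evaluated:", len(search.cv_results_["params"]))
```

With a single dict, every `degree` value would also be paired with `rbf` and `sigmoid`, which ignore it, so the same model would be fitted repeatedly.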


In [5]:
def svm_param_tuning(X, y, k_folds):
    #Cs = [1e-3, 1e-2, 1e-1, 1, 3, 5, 5.5, 6, 7, 10]
    Cs = [1, 2, 2.4, 2.5, 3, 5, 10]
    
    gammas = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-1]
    
    #kernels = ['rbf', 'poly', 'sigmoid']
    kernels = ['rbf']
    
    #degrees = [0, 1, 2, 3, 4, 5, 6, 7]
    #param_grid = {'C': Cs, 'gamma' : gammas, 'kernel' : kernels, 'degree' : degrees}
    param_grid = {'C': Cs, 'gamma' : gammas, 'kernel' : kernels}
    
    grid_search = GridSearchCV(svm.SVC(cache_size=20000), param_grid, n_jobs=-1, cv=k_folds, scoring='accuracy')
    grid_search.fit(X, y)
    print('Best accuracy: ', grid_search.best_score_)
    return grid_search.best_params_

print(svm_param_tuning(X_train, y_train, 10))

Best accuracy:  0.7675
{'C': 3, 'gamma': 1e-06, 'kernel': 'rbf'}


Thus, after a 10-fold CV, it seems that the best parameters I found are `C = 3`, `gamma = 1e-06`, and `rbf` as the kernel function.

I set `cache_size` on the SVC class because my machine has plenty of RAM; the value is in MB, so feel free to decrease it. I also set `n_jobs=-1` on `GridSearchCV` so that the search runs in parallel on all available CPU cores.


### Training

In [6]:
def train(X, y):
    clf = svm.SVC(C=3, gamma=1e-06, kernel='rbf', cache_size=20000)
    clf.fit(X, y)
    return clf

clf = train(X_train, y_train)

### Final 10-fold CV accuracy check

In [7]:
accuracy = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')

print(accuracy)
print('Mean accuracy:', np.mean(accuracy))

[0.75935829 0.76292335 0.75357143 0.78392857 0.76785714 0.79642857
 0.77142857 0.72857143 0.7745975  0.7763864 ]
Mean accuracy: 0.7675051254035521


With no feature selection or training data preprocessing the mean accuracy is roughly **76.75%** after a 10-fold cross-validation.
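To put that number in context, it helps to compare it with a majority-class baseline: with a 56:44 class split, always predicting label 0 would already score about 56%. The sketch below shows how such a baseline could be computed with `DummyClassifier`. It uses toy labels with the same 56:44 split rather than the homework CSV, and the names `X_demo`, `y_demo`, and `baseline` are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Toy labels with a 56:44 split (assumption); features are ignored by this baseline.
y_demo = np.array([0] * 56 + [1] * 44)
X_demo = np.zeros((len(y_demo), 1))

# Always predicts the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=10, scoring="accuracy")
print("Majority-class baseline accuracy:", baseline_scores.mean())
```

Measured against a baseline of roughly 56%, the SVM's 76.75% is an improvement of about 20 percentage points.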

### Prediction - 'hw3_test_data.csv'

In [8]:
def predict(clf, test_data):
    return clf.predict(test_data)
    
new_test_data['Label'] = predict(clf, test_data)

### Looking at the predicted labels

In [9]:
print("Training data - Label counts")
print(train_data['Label'].value_counts(normalize=True), '\n')

print("Testing data - Label counts")
print(new_test_data['Label'].value_counts(normalize=True))

Training data - Label counts
0    0.563929
1    0.436071
Name: Label, dtype: float64 

Testing data - Label counts
0    0.603333
1    0.396667
Name: Label, dtype: float64


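The predicted class proportions (60:40) are a bit more skewed toward label 0 than in the training data (56:44). Since we don't know the ground truth for the test set, its accuracy can't be measured directly. Given the similar feature distributions and the cross-validation results, though, I believe it's reasonable to expect the SVM to classify the test set about as well as it did in cross-validation.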

### Saving into a new CSV

In [11]:
new_test_data.to_csv("hw3_test_data_labeled.csv", index=False)