# MI-ADM: home assignment 2

  * **Deadline**: 15.05.2019. Late submission costs 2 points; the hard deadline is the first day of the exam period.
  * **What to submit**: Just this notebook with your code and texts, not the dataset!
  * **How to submit**: See the instructions at https://courses.fit.cvut.cz/MI-ADM/tutorials/index.html.
  
Generally speaking, the goal of this assignment is to apply **support vector machines on a classification problem**.

What to do:
  * Use the data from the Spambase dataset http://archive.ics.uci.edu/ml/datasets/spambase.
  * Train a Support Vector Machine classification model directly (without any kernel approach) and evaluate its accuracy.
  * Train a Support Vector Machine classification model based on a kernel function (RBF, polynomial, etc.) of your choice and evaluate its accuracy.
  * Compare the results with a random forest model.

## Solution

In [1]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from matplotlib.ticker import FormatStrFormatter
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [2]:
data = pd.read_csv('spambase.data', sep = ',', header=None)

In [3]:
# data.nunique()
cols = list(data.columns)
cols[-1] = 'label'
data.columns = cols

In [4]:
# data.describe()

In [5]:
def drange(start, stop, step):
    r = start
    while r < stop:
        yield r
        r += step

For evaluation, I use 10-fold cross-validation.
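
As a rough illustration of what ten-fold cross-validation does, the sketch below splits sample indices into ten folds and uses each fold once as the test set. It is a simplified, unstratified version written only for illustration (the helper name `k_fold_indices` is hypothetical); scikit-learn's `cross_validate` handles the splitting internally and, for classifiers with an integer `cv`, uses stratified folds.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for plain, unshuffled k-fold CV."""
    # The first n_samples % n_folds folds get one extra sample.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Spambase has 4601 rows, so with 10 folds each test fold holds 460 or 461 of them.
folds = list(k_fold_indices(4601, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Every sample lands in exactly one test fold, so each model is always scored on data it was not trained on.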

The test function can also vary a single hyper-parameter and collect the metric values for each setting, so that the dependency of a metric on that parameter can be plotted. If no parameter is specified, it evaluates a single value of random_state.

In [6]:
# Number of cores used to evaluate cross-validation (-1 = all available).
n_jobs = -1
# Number of folds in cross-validation.
cv = 10
scoring = {'accuracy': 'accuracy',
           'recall': 'recall',
           'precision': 'precision'}

# Evaluate a model with cross-validation while iterating over one hyper-parameter
# (useful for plotting the dependency of a metric on that parameter).
# With the default arguments the loop runs exactly once (random_state = 42),
# so the function can also be used to evaluate a single parameter set.
def test_model_and_hyper_parameter(model, data, s_params, label, check_results=(),
                                  test_param='random_state', min_value_p=42, max_value_p=43, step_p=1):
    check = {name: [] for name in check_results}
    param_val = []
    X, y = data.drop(columns=[label]), data[label]
    for value in drange(min_value_p, max_value_p, step_p):
        # Work on a copy so the caller's dict is not modified.
        params = dict(s_params, **{test_param: value})
        mod = model(**params)
        scores = cross_validate(mod, X, y,
                                scoring=scoring, cv=cv, n_jobs=n_jobs, return_train_score=True)
        param_val.append(value)
        for name in check_results:
            # NOTE: the square root is applied to each fold's value before averaging,
            # so every reported number is mean(sqrt(score)), not the plain mean score.
            check[name].append(np.sqrt(scores[name]).mean())
    return param_val, check
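
Because the helper applies `np.sqrt` to the fold scores before averaging them, the printed numbers are not plain mean accuracies. The sketch below uses made-up fold scores (an assumption, not values from this notebook) to show the difference, together with the lower bound on the plain mean that follows from Jensen's inequality.

```python
import math

# Hypothetical per-fold accuracies (illustrative only, not from the notebook).
fold_scores = [0.90, 0.92, 0.94, 0.91, 0.93]

plain_mean = sum(fold_scores) / len(fold_scores)
mean_of_sqrt = sum(math.sqrt(s) for s in fold_scores) / len(fold_scores)

# For scores in [0, 1], sqrt(s) >= s, so the averaged square roots look better.
print(f"plain mean:   {plain_mean:.4f}")
print(f"mean of sqrt: {mean_of_sqrt:.4f}")

# Jensen's inequality: mean(sqrt(s)) <= sqrt(mean(s)), hence
# mean(s) >= mean(sqrt(s)) ** 2, which gives a lower bound on the plain mean.
lower_bound = mean_of_sqrt ** 2
print(f"lower bound:  {lower_bound:.4f}")
```

Applied to the value of about 0.964 reported for the RBF kernel later in this notebook, the bound implies a plain mean test accuracy of at least roughly 0.93.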

### Data normalization

Since all attributes are numeric, I decided to scale the data and compare the results on the scaled datasets. I tried both MinMax normalization (MinMaxScaler) and standardization (StandardScaler).
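
For reference, the two transformations can be written out by hand. The sketch below applies them to a tiny made-up matrix (an assumption for illustration) with plain NumPy; it mirrors what `MinMaxScaler` and `StandardScaler` compute per column with their default settings.

```python
import numpy as np

# Tiny made-up feature matrix: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max scaling: (x - min) / (max - min), per column -> values in [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, per column -> zero mean, unit variance.
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard)
```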

In [7]:
minMaxScaler = MinMaxScaler()
dataMM = minMaxScaler.fit_transform(data.drop(columns=['label']))



In [8]:
standartScaler = StandardScaler()
dataSS = standartScaler.fit_transform(data.drop(columns=['label']))
# data.describe()



In [9]:
dataMM = pd.DataFrame(dataMM)
dataMM['label'] = data['label']

In [10]:
dataSS = pd.DataFrame(dataSS)
dataSS['label'] = data['label']
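
One caveat: the scalers above are fit on the whole dataset before cross-validation, so the scaling statistics also depend on the rows that later serve as test folds. The effect is usually small, but a possible refinement (not what this notebook does) is to put the scaler and the model into a scikit-learn pipeline, so the scaler is re-fit on the training part of every fold. The sketch below uses synthetic stand-in data, which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the notebook uses Spambase instead).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is re-fit on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores = cross_validate(pipeline, X, y, cv=10, scoring='accuracy')

print(scores['test_score'].mean())
```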

### Support Vector Machine directly

First, I try the support vector machine (SVC) without any kernel approach. The linear kernel is set explicitly, because rbf is the default. The models are tested on the non-normalized dataset as well as on the two normalized datasets.
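
Without a kernel, the trained model reduces to a linear decision function: a sample is classified by the sign of `w . x + b`. The sketch below shows this with hypothetical weights and bias (made-up numbers, not taken from a trained model).

```python
import numpy as np

# Hypothetical weights and bias of an already trained linear SVM.
w = np.array([0.8, -0.5])
b = -0.1

def predict(X):
    """Return 1 where w . x + b is positive, else 0."""
    return (X @ w + b > 0).astype(int)

X_new = np.array([[1.0, 0.2],   # 0.8 - 0.1 - 0.1 = 0.6    -> class 1
                  [0.1, 1.0]])  # 0.08 - 0.5 - 0.1 = -0.52 -> class 0
print(predict(X_new))
```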



When max_iter is set to -1, training stops only when the termination condition is met, and the test ran for a very long time even on an 8/16 processor (cross-validation allows the folds to be evaluated in parallel).

When I limited the number of iterations to obtain a model faster, the model trained on non-standardized data lost accuracy. With the termination condition alone, this difference did not appear: training on non-standardized data takes much longer, but without a limit on the number of iterations the model achieves results very similar to the models trained on standardized data. As a result, standard scaling appears to be a suitable method for normalizing data for SVC.

In [11]:
%%time
max_iter = -1
param, check = test_model_and_hyper_parameter(SVC, data, {'kernel': 'linear', 'max_iter': max_iter}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with non normalized data: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataMM, {'kernel': 'linear', 'max_iter': max_iter}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with minMax normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataSS, {'kernel': 'linear', 'max_iter': max_iter}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with standard normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])

Support vector machine with non normalized data: 
	 Train accuracy:  [0.9670340128388046]
	 Test accuracy:  [0.9557163990334981]

Support vector machine with data normalized with minMax normalization: 
	 Train accuracy:  [0.9513977475497883]
	 Test accuracy:  [0.9474823932914556]

Support vector machine with data normalized with standard normalization: 
	 Train accuracy:  [0.9663092294678988]
	 Test accuracy:  [0.9590562249191124]
CPU times: user 228 ms, sys: 96 ms, total: 324 ms
Wall time: 9min 4s


### Support Vector Machine with some kernel function (polynomial, rbf, sigmoid)

#### RBF kernel

In [12]:
%%time
max_iter = -1
param, check = test_model_and_hyper_parameter(SVC, data, {'kernel': 'rbf', 'max_iter': max_iter}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with rbf kernel and with non normalized data: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataMM, {'kernel': 'rbf', 'max_iter': max_iter}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with rbf kernel and with data normalized with minMax normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataSS, {'kernel': 'rbf', 'max_iter': max_iter}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with rbf kernel and with data normalized with standard normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])

Support vector machine with rbf kernel and with non normalized data: 
	 Train accuracy:  [0.9742872318023472]
	 Test accuracy:  [0.902737850318023]

Support vector machine with rbf kernel and with data normalized with minMax normalization: 
	 Train accuracy:  [0.9003657854301788]
	 Test accuracy:  [0.8963832641186963]

Support vector machine with rbf kernel and with data normalized with standard normalization: 
	 Train accuracy:  [0.9736427091928871]
	 Test accuracy:  [0.9641463592914246]
CPU times: user 152 ms, sys: 8 ms, total: 160 ms
Wall time: 9.96 s


The first kernel I tried is the RBF kernel. I experimented with gamma, but could not achieve a better result than with the default value. The results show that this method works well on data standardized with StandardScaler. With this model and standardized data, I also achieved the highest accuracy among all models.
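
One plausible reason for the sensitivity to scaling is the form of the RBF kernel, `exp(-gamma * ||x - y||^2)`: a feature with a large numeric range dominates the squared distance, and the kernel value collapses towards zero. The sketch below illustrates this with two made-up samples and an arbitrary `gamma` (both are assumptions chosen for illustration).

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Two made-up samples; the second feature has a much larger numeric range.
a = np.array([0.2, 150.0])
b = np.array([0.9, 160.0])
gamma = 0.5  # arbitrary value chosen for illustration

# Unscaled: the large-range feature dominates the distance and the kernel is ~0.
k_raw = rbf_kernel(a, b, gamma)

# After dividing the second feature by 100 (a crude stand-in for proper scaling),
# both features contribute comparably.
scale = np.array([1.0, 100.0])
k_scaled = rbf_kernel(a / scale, b / scale, gamma)

print(k_raw, k_scaled)
```

When almost all kernel values between distinct samples are close to zero, the model can fit the training set well while generalizing poorly, which would be consistent with the gap between train and test accuracy on the non-normalized data.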

#### Polynomial kernel

In [13]:
%%time
max_iter = -1
param, check = test_model_and_hyper_parameter(SVC, data, {'kernel': 'poly', 'max_iter': max_iter, 'degree': 1}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with non normalized data: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataMM, {'kernel': 'poly', 'max_iter': max_iter, 'degree': 1}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with minMax normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataSS, {'kernel': 'poly', 'max_iter': max_iter, 'degree': 1}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with standard normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])

Support vector machine with non normalized data: 
	 Train accuracy:  [0.959512221941959]
	 Test accuracy:  [0.9515544646503166]

Support vector machine with data normalized with minMax normalization: 
	 Train accuracy:  [0.8348372289857828]
	 Test accuracy:  [0.8345743568349497]

Support vector machine with data normalized with standard normalization: 
	 Train accuracy:  [0.9626530971111837]
	 Test accuracy:  [0.9584073874414466]
CPU times: user 168 ms, sys: 4 ms, total: 172 ms
Wall time: 16.7 s


In [14]:
%%time
max_iter = 30000000
param, check = test_model_and_hyper_parameter(SVC, data, {'kernel': 'poly', 'max_iter': max_iter, 'degree': 2}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with non normalized data: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataMM, {'kernel': 'poly', 'max_iter': max_iter, 'degree': 2}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with minMax normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataSS, {'kernel': 'poly', 'max_iter': max_iter, 'degree': 2}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with standard normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])

Support vector machine with non normalized data: 
	 Train accuracy:  [0.7174013292427945]
	 Test accuracy:  [0.7109888481696478]

Support vector machine with data normalized with minMax normalization: 
	 Train accuracy:  [0.7784312621214857]
	 Test accuracy:  [0.7784314013688217]

Support vector machine with data normalized with standard normalization: 
	 Train accuracy:  [0.9301974208900216]
	 Test accuracy:  [0.9159871628905918]
CPU times: user 156 ms, sys: 12 ms, total: 168 ms
Wall time: 1min 10s


As the second kernel, I tried the polynomial kernel with degrees one and two. With a higher degree, the accuracy of the model only deteriorated. For degree one, i.e., a linear function, the results correspond to the first test with the linear kernel; only for MinMax normalization are they significantly worse.

For the polynomial kernel of degree 2, I had to limit the number of iterations to make the training computationally feasible. With this limit, the quadratic polynomial kernel failed to reach the previous results.
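
The correspondence between the degree-one polynomial kernel and the linear kernel can be checked directly. scikit-learn parameterizes the polynomial kernel as `(gamma * <x, y> + coef0) ** degree`, so with degree 1 and `coef0 = 0` it is the linear kernel multiplied by `gamma`. The sketch below verifies this on made-up vectors; the choice `gamma = 1 / n_features` is an assumption here, since the default for `gamma` depends on the scikit-learn version.

```python
import numpy as np

def poly_kernel(x, y, degree, gamma, coef0=0.0):
    """Polynomial kernel in scikit-learn's form: (gamma * <x, y> + coef0) ** degree."""
    return (gamma * np.dot(x, y) + coef0) ** degree

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
gamma = 1.0 / len(x)  # 1 / n_features (assumed setting)

linear = np.dot(x, y)  # linear kernel
poly1 = poly_kernel(x, y, degree=1, gamma=gamma)

# With degree 1 and coef0 = 0 the polynomial kernel is the linear kernel times gamma.
print(linear, poly1, gamma * linear)
```

Multiplying the kernel by a small `gamma` acts like training the linear SVM with a smaller effective `C`, i.e., stronger regularization. This may explain why the degree-one results are not identical to the linear-kernel results, most visibly on the MinMax-normalized data.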

#### Sigmoid kernel

In [15]:
%%time
max_iter = -1
param, check = test_model_and_hyper_parameter(SVC, data, {'kernel': 'sigmoid'}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with non normalized data: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataMM, {'kernel': 'sigmoid'}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with minMax normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()
param, check = test_model_and_hyper_parameter(SVC, dataSS, {'kernel': 'sigmoid'}, 'label',  
                                          ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall', 'test_precision', 'train_precision' ])
print('Support vector machine with data normalized with standard normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])

Support vector machine with non normalized data: 
	 Train accuracy:  [0.5917488853020746]
	 Test accuracy:  [0.589977662821878]

Support vector machine with data normalized with minMax normalization: 
	 Train accuracy:  [0.8348230246655438]
	 Test accuracy:  [0.834445141005372]

Support vector machine with data normalized with standard normalization: 
	 Train accuracy:  [0.937830719353542]
	 Test accuracy:  [0.9325673063684551]
CPU times: user 148 ms, sys: 12 ms, total: 160 ms
Wall time: 10.3 s


Finally, I tried the sigmoid kernel, but it was not successful. However, I found that this kernel function is very sensitive to data standardization.
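
A likely cause of this sensitivity is saturation. The sigmoid kernel is `tanh(gamma * <x, y> + coef0)`, and for large inner products `tanh` returns 1 regardless of the inputs, so different pairs of samples become indistinguishable. The sketch below compares made-up raw vectors with made-up vectors of the kind standardization produces (all numbers and the value of `gamma` are assumptions for illustration).

```python
import numpy as np

def sigmoid_kernel(x, y, gamma, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <x, y> + coef0)."""
    return np.tanh(gamma * np.dot(x, y) + coef0)

gamma = 0.5  # arbitrary value chosen for illustration

# Made-up raw samples with one large-range feature.
a, b, c = np.array([3.0, 120.0]), np.array([1.0, 300.0]), np.array([0.5, 15.0])
raw = [sigmoid_kernel(a, b, gamma), sigmoid_kernel(a, c, gamma)]

# Made-up stand-ins for standardized samples (small values, mixed signs).
a_s, b_s, c_s = np.array([1.0, -0.3]), np.array([-0.5, 1.2]), np.array([0.2, -0.9])
scaled = [sigmoid_kernel(a_s, b_s, gamma), sigmoid_kernel(a_s, c_s, gamma)]

print(raw)     # both values saturate at 1.0
print(scaled)  # values differ in sign and magnitude
```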

### Compare model to random forest model

In [16]:
%%time
param, check = test_model_and_hyper_parameter(RandomForestClassifier, data, {'max_depth': 4, 'random_state':1, 
                                        'min_samples_split':2, 'min_samples_leaf':1, 'n_estimators':90}, 'label', 
                                        ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall','test_precision', 'train_precision' ] )
print('Random forest with non normalized data: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()

param, check = test_model_and_hyper_parameter(RandomForestClassifier, dataMM, {'max_depth': 4, 'random_state':1, 
                                        'min_samples_split':2, 'min_samples_leaf':1, 'n_estimators':90}, 'label', 
                                       ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall','test_precision', 'train_precision' ] )
print('Random forest with data normalized with minMax normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()

param, check = test_model_and_hyper_parameter(RandomForestClassifier, dataSS, {'max_depth': 4, 'random_state':1, 
                                        'min_samples_split':2, 'min_samples_leaf':1, 'n_estimators':90}, 'label', 
                                        ['score_time','fit_time', 'test_accuracy', 'train_accuracy', 
                                          'test_recall', 'train_recall','test_precision', 'train_precision' ] )
print('Random forest with data normalized with standard normalization: ')
print("\t Train accuracy: ", check['train_accuracy'])
print("\t Test accuracy: ", check['test_accuracy'])
print()

Random forest with non normalized data: 
	 Train accuracy:  [0.9622649182465489]
	 Test accuracy:  [0.9575533092443974]

Random forest with data normalized with minMax normalization: 
	 Train accuracy:  [0.9622774350558112]
	 Test accuracy:  [0.9575533092443974]

Random forest with data normalized with standard normalization: 
	 Train accuracy:  [0.9622649182465489]
	 Test accuracy:  [0.9575533092443974]

CPU times: user 148 ms, sys: 4 ms, total: 152 ms
Wall time: 1.26 s


For the random forest, I did not expect any sensitivity to attribute standardization, but for completeness I also evaluated it on the normalized datasets. The results confirm that, compared to the support vector machine, the random forest is insensitive to standardization. The SVC is sensitive, as I assumed. The more complex the kernel function, the more critical it is to normalize the data to get good results from SVC.
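
The insensitivity of tree-based models follows from how a split works: a threshold on a single feature only depends on the ordering of the values, and standardization preserves that ordering. The sketch below demonstrates this with a hypothetical one-feature decision stump on made-up data. It picks the threshold with the fewest misclassifications, a simplification of the impurity criteria that real random forests use.

```python
import numpy as np

def best_split_partition(feature, labels):
    """Return the boolean mask (feature <= threshold) of the best single-threshold split."""
    best_errors, best_mask = None, None
    values = np.unique(feature)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for threshold in (values[:-1] + values[1:]) / 2:
        mask = feature <= threshold
        # Errors for both possible label assignments of the two sides; keep the better one.
        errors = min(np.sum(labels[mask] == 1) + np.sum(labels[~mask] == 0),
                     np.sum(labels[mask] == 0) + np.sum(labels[~mask] == 1))
        if best_errors is None or errors < best_errors:
            best_errors, best_mask = errors, mask
    return best_mask

feature = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
labels = np.array([0, 0, 0, 1, 1, 1])

raw = best_split_partition(feature, labels)
standardized = best_split_partition((feature - feature.mean()) / feature.std(), labels)

# The threshold value changes, but the samples are partitioned identically.
print(np.array_equal(raw, standardized))
```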

With the support vector machine, I achieved the best result with the RBF kernel: 0.964 test accuracy.
The random forest reached 0.9575.