# Classification Using SVM

In this tutorial we use the scikit-learn toolbox for support vector machine (SVM) classification.

In [107]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV
from sklearn import svm
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt

First we load the data and split it into training and test sets using the train_test_split function of sklearn. Then we fit the scaler on the training data and apply the same scaling transformation to the test data.

In [112]:
data_pd = pd.read_csv('Phenotypic_V1_0b_preprocessed1_extended.csv')
x = data_pd.iloc[:, 108:206].to_numpy()
y = data_pd.loc[:, ['SEX']].to_numpy()
labels = np.unique(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=150)

# scale the features: we fit the scaler to the training data and apply the same transform to the test data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


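Note that the cross-validation (CV) method we use here is called HoldOut: we split the data into training and test sets, train the model on the training data, and evaluate the classification performance on the test data. Another cross-validation method is Kfold, in which the data is randomly partitioned into K parts; in each run one of the parts is used as the test set and the other K-1 parts are used as the training set. The advantage of Kfold over HoldOut is that every data point is used for testing exactly once and for training in the other K-1 runs, so the performance estimate depends less on one particular split. For parameter selection we use cross-validation on the training data (the code below uses StratifiedShuffleSplit, which repeats a stratified random split 5 times rather than forming a strict Kfold partition).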
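To make the Kfold idea concrete, here is a minimal sketch on toy data. The array `x_toy` and the choice of K=5 are illustrative assumptions and are not part of the tutorial's dataset; the sketch only shows that each sample lands in the test set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples identified by their index (hypothetical, not the tutorial's data)
x_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(len(x_toy), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(x_toy)):
    # In each run, one part is the test set and the other K-1 parts form the training set
    test_counts[test_idx] += 1
    print("fold %d: train=%s, test=%s" % (fold, train_idx, test_idx))

# Count how many times each sample was used for testing across the K runs
print(test_counts)
```

In HoldOut, by contrast, only the samples in the single test split are ever used for evaluation.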

We want to train a kernel SVM classifier. To do so, we need to find the best parameters (here C and gamma of the RBF kernel) based on the training data. We do a grid search over the parameter space: for each point of the grid, we calculate the classifier's performance in CV, and the parameter set with the best performance is selected as the final parameter set.

As we saw in the data exploration part, the classes are highly unbalanced. Therefore, if the classifier treats the two classes in the same way, it tends to classify all the data points as class 1. However, sklearn has the option class_weight, with which one can specify the weights of the classes. You can try different values for cw in the following code, or set it as a parameter in the grid search and select the best class-weight parameters.
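As a rough sketch of the second option, the class weights can be added to the grid like any other parameter. Everything here is an assumption made for illustration: the synthetic data from `make_classification`, the small grids, the `'balanced'` option, and the `balanced_accuracy` scoring (the tutorial's own search uses the default accuracy score).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic, unbalanced two-class data (roughly 85% / 15%), not the tutorial's dataset
x_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     weights=[0.85, 0.15], random_state=0)
y_demo = y_demo + 1  # relabel the classes as 1 and 2, as in the tutorial

# Candidate class weights: w for class 1 and 1 - w for class 2, plus sklearn's 'balanced' mode
weights = np.linspace(0.1, 0.9, 5)
cw_grid = [{1: w, 2: 1 - w} for w in weights] + ['balanced']

param_grid_demo = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1], 'class_weight': cw_grid}
grid_demo = GridSearchCV(SVC(kernel='rbf'), param_grid=param_grid_demo,
                         cv=StratifiedKFold(n_splits=5),
                         scoring='balanced_accuracy')  # assumption: score that penalizes ignoring class 2
grid_demo.fit(x_demo, y_demo)
print(grid_demo.best_params_, grid_demo.best_score_)
```

Scoring by plain accuracy on unbalanced data can favor weights that ignore the minority class, which is why this sketch swaps in a balanced score.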

In [113]:
# select parameters
C_range = np.logspace(-3, 3, 13)
gamma_range = np.logspace(-3, 3, 13)
# alternative: search over class weights as well
# weights = np.linspace(0.03, 0.97, 55)
# cw = [{1: x, 2: 1 - x} for x in weights]
cw = [{1: 0.2, 2: 0.8}]
param_grid = dict(gamma=gamma_range, C=C_range, class_weight=cw)

cv = StratifiedShuffleSplit(n_splits=5, train_size=0.8, test_size=0.2)
grid = GridSearchCV(SVC(kernel='rbf'), param_grid=param_grid, cv=cv, n_jobs=2)
grid.fit(x_train, y_train[:, 0])
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))



The best parameters are {'C': 0.001, 'class_weight': {1: 0.2, 2: 0.8}, 'gamma': 0.001} with a score of 0.85


Now that we have the best parameters, we train the classifier on the whole training set and then apply it to the test data.

In [115]:
# refit an RBF-kernel SVM on the whole training set with the selected parameters
clf = SVC(kernel='rbf', **grid.best_params_)
clf.fit(x_train, y_train[:, 0])

# evaluate on the held-out test set
y_test_pred_1 = clf.predict(x_test)
acc_1 = np.mean(y_test_pred_1 == y_test[:, 0])
print('accuracy=%.4f' % acc_1)

# rows are true classes, columns are predicted classes
conf_mat = confusion_matrix(y_test[:, 0], y_test_pred_1)
print('confusion matrix=\n%s' % conf_mat)

accuracy=0.8502
confusion matrix=
[[176   0]
 [ 31   0]]


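We see that the classifier does not perform well: it assigns every test sample to class 1, so all 31 samples of class 2 are misclassified, and the accuracy of 0.8502 is simply the fraction of class 1 samples in the test set (176 of 207).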
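The confusion matrix printed above can be turned into per-class numbers that make the problem visible. The following is a small sketch that re-enters the reported matrix by hand; the variable names are illustrative, and balanced accuracy is used here as one possible summary, not as the tutorial's own metric.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predicted classes
conf_mat_reported = np.array([[176, 0],
                              [31, 0]])

# Overall accuracy: correctly classified samples over all samples
accuracy = np.trace(conf_mat_reported) / conf_mat_reported.sum()

# Recall per class: correctly classified samples of a class over all samples of that class
recall_per_class = np.diag(conf_mat_reported) / conf_mat_reported.sum(axis=1)

# Balanced accuracy: mean of the per-class recalls
balanced_accuracy = recall_per_class.mean()

print('accuracy=%.4f' % accuracy)
print('recall per class=%s' % recall_per_class)
print('balanced accuracy=%.4f' % balanced_accuracy)
```

A balanced accuracy of 0.5 for two classes is what a classifier that always predicts the same class achieves, regardless of how unbalanced the data is.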