### Support Vector Machines Example 5.3
We now examine the **Khan** data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors.
For each tissue sample, gene expression measurements are available. The data set consists of training data, **xtrain** and **ytrain**, and testing data, **xtest** and **ytest**.

**NOTE**
To get files from LFS, open a terminal and type:
`git lfs pull -I notebooks*`

We first load the data and examine its dimensions:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from SVM_def import SVM_def

# SVM_def is a class containing definitions used throughout this Chapter
svm_def = SVM_def()
X_train = pd.read_csv('./data/Khan_xtrain.csv')
y_train = pd.read_csv('./data/Khan_ytrain.csv')
X_test = pd.read_csv('./data/Khan_xtest.csv')
y_test = pd.read_csv('./data/Khan_ytest.csv')

print("Shapes training set: ", X_train.shape, y_train.shape)
print("Shapes test set: ", X_test.shape, y_test.shape)

Shapes training set:  (63, 2309) (63, 2)
Shapes test set:  (20, 2309) (20, 2)


In [2]:
X_train.head()

Unnamed: 0.1,Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V2299,V2300,V2301,V2302,V2303,V2304,V2305,V2306,V2307,V2308
0,V1,0.773344,-2.438405,-0.482562,-2.721135,-1.217058,0.827809,1.342604,0.057042,0.133569,...,-0.238511,-0.027474,-1.660205,0.588231,-0.463624,-3.952845,-5.496768,-1.414282,-0.6476,-1.763172
1,V2,-0.078178,-2.415754,0.412772,-2.825146,-0.626236,0.054488,1.429498,-0.120249,0.456792,...,-0.657394,-0.246284,-0.836325,-0.571284,0.034788,-2.47813,-3.661264,-1.093923,-1.20932,-0.824395
2,V3,-0.084469,-1.649739,-0.241307,-2.875286,-0.889405,-0.027474,1.1593,0.015676,0.191942,...,-0.696352,0.024985,-1.059872,-0.403767,-0.678653,-2.939352,-2.73645,-1.965399,-0.805868,-1.139434
3,V4,0.965614,-2.380547,0.625297,-1.741256,-0.845366,0.949687,1.093801,0.819736,-0.28462,...,0.259746,0.357115,-1.893128,0.255107,0.163309,-1.021929,-2.077843,-1.127629,0.331531,-2.179483
4,V5,0.075664,-1.728785,0.852626,0.272695,-1.84137,0.327936,1.251219,0.77145,0.030917,...,-0.200404,0.061753,-2.273998,-0.039365,0.368801,-2.566551,-1.675044,-1.08205,-0.965218,-1.836966


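The first column, `Unnamed: 0`, contains only row labels (V1, V2, ...) rather than expression measurements, so we drop it from both the training and the test set.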

In [3]:
X_train = X_train.drop(['Unnamed: 0'], axis=1)
print(X_train.shape)
X_test = X_test.drop(['Unnamed: 0'], axis=1)
print(X_test.shape)

(63, 2308)
(20, 2308)
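As an alternative to dropping the label column after loading, the first column can be read directly as the row index. The sketch below illustrates this on a tiny in-memory CSV; `csv_text` is a made-up stand-in for the layout of the Khan files, not the actual data.

```python
import io
import pandas as pd

# Hypothetical miniature CSV mimicking the layout of Khan_xtrain.csv:
# an unnamed first column of row labels followed by the gene columns.
csv_text = ",V1,V2,V3\nV1,0.77,-2.44,-0.48\nV2,-0.08,-2.42,0.41\n"

# Default read: the label column is kept as a column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses the first column as the row index instead,
# so only the measurement columns remain as features.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.shape)
```

With this option no explicit `drop` call is needed, provided the first column of the file really is the label column.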


This data set consists of expression measurements for $2308$ genes. The training and test sets consist of 63 and 20 observations, respectively.

In [4]:
y_train = y_train.drop(['Unnamed: 0'], axis=1)
y_test = y_test.drop(['Unnamed: 0'], axis=1)
print("Results training set: ", y_train.value_counts())
print("Results test set: ", y_test.value_counts())

Results training set:  x
2    23
4    20
3    12
1     8
dtype: int64
Results test set:  x
3    6
2    6
4    5
1    3
dtype: int64


In [5]:
# Convert the single-column label DataFrames to 1-D NumPy arrays
y_test = y_test.to_numpy().ravel()
y_train = y_train.to_numpy().ravel()

We will use a support vector approach to predict cancer subtype using gene expression measurements. In this data set, the number of features is very large relative to the number of observations. This suggests that we should use a linear kernel, because the additional flexibility of a polynomial or radial kernel is unnecessary.
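To see why a linear kernel is already flexible enough when features far outnumber observations, consider the following minimal sketch on synthetic data. It assumes nothing about the Khan data: the matrix `X_demo` is pure Gaussian noise and the labels `y_demo` are assigned at random, yet a linear support vector classifier still separates the two classes perfectly on the training set.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)

# Synthetic setting with many more features than observations (p >> n).
n_obs, n_feat = 20, 200
X_demo = rng.standard_normal((n_obs, n_feat))

# Random, balanced labels that carry no relation to the features.
y_demo = rng.permutation(np.repeat([0, 1], n_obs // 2))

# Same kernel and cost value as used in this example.
clf_demo = svm.SVC(kernel='linear', C=10)
clf_demo.fit(X_demo, y_demo)

train_acc = clf_demo.score(X_demo, y_demo)
print("Training accuracy on pure noise:", train_acc)
```

Because even meaningless labels can be fit without error in this regime, a perfect training score says little about predictive quality, and a more flexible kernel would only add capacity that is not needed.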

In [6]:
# First fit a linear kernel on all 4 classes
c = 10
clf = svm.SVC(kernel='linear', C=c, probability=True)
clf.fit(X_train, y_train)

ypred_train = clf.predict(X_train)   
tab_scores_train = svm_def.table_scores(ypred_train, y_train)
print("Training scores:\n", tab_scores_train)

Training scores:
         Pred 1  Pred 2  Pred 3  Pred 4
True 1     8.0     0.0     0.0     0.0
True 2     0.0    23.0     0.0     0.0
True 3     0.0     0.0    12.0     0.0
True 4     0.0     0.0     0.0    20.0


We see that there are no training errors. In fact, this is not surprising, because the large number of variables relative to the number of observations
implies that it is easy to find hyperplanes that fully separate the classes. We are most interested not in the support vector classifier’s performance on the
training observations, but rather its performance on the test observations.

In [7]:
ypred_test = clf.predict(X_test)
tab_scores_test = svm_def.table_scores(ypred_test, y_test)
print("Testing scores:\n", tab_scores_test)

Testing scores:
         Pred 1  Pred 2  Pred 3  Pred 4
True 1     3.0     0.0     0.0     0.0
True 2     0.0     6.0     0.0     0.0
True 3     0.0     2.0     4.0     0.0
True 4     0.0     0.0     0.0     5.0


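We see that using **C=10** yields two test set errors on this data: two observations of class 3 are misclassified as class 2, while all other test observations are classified correctly.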
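The test table can be cross-checked with scikit-learn's built-in metrics. The sketch below does not use the Khan data or the `SVM_def` helper; instead, the arrays `y_true` and `y_hat` are reconstructed from the counts in the printed test table (the order of the samples is arbitrary), and the confusion matrix and overall accuracy are computed from them.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# True labels reconstructed from the row totals of the test table:
# 3 samples of class 1, 6 of class 2, 6 of class 3, 5 of class 4.
y_true = np.repeat([1, 2, 3, 4], [3, 6, 6, 5])

# Predictions match the truth except for two class-3 samples
# that were predicted as class 2.
y_hat = y_true.copy()
idx3 = np.flatnonzero(y_true == 3)[:2]
y_hat[idx3] = 2

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[1, 2, 3, 4])
acc = accuracy_score(y_true, y_hat)
print(cm)
print(f"Test accuracy: {acc:.2f}")
```

Two errors out of 20 test observations correspond to a test accuracy of 0.90.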