# Classifying Cancer Subtype with Support Vector Machines 
# Our Problem 
- Doctors diagnose which type of cancer a patient has using genetic tests. Can we use data to help doctors classify cancer type? 
- This is a multiclass classification problem rather than a binary one. 
- **Class boundaries may not be linear.**
- We will use Support Vector Machines (SVMs).

![alt text](svm.png)

# How do SVMs work? 

### A simple example: maximal margin hyperplane 

- A maximal margin hyperplane is the separating hyperplane (a line in two dimensions) whose minimum distance to the training observations is as large as possible.
- This allows us to separate observations into different classes. 
![alt text](svm2.png)
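
To make the idea of a margin concrete, here is a minimal sketch in base R. It does not search for the maximal margin hyperplane; it only computes the margin of one given hyperplane. The coefficients `beta0` and `beta` and the four toy points are hypothetical values chosen for illustration.

```r
# Hypothetical hyperplane: beta0 + beta[1] * x1 + beta[2] * x2 = 0
beta0 <- -3
beta  <- c(1, 1)

# Four toy observations (rows) with class labels -1 / +1
X <- rbind(c(1, 1), c(0, 1), c(3, 2), c(2, 3))
y <- c(-1, -1, 1, 1)

# Signed value of the hyperplane function at each observation
f <- as.vector(beta0 + X %*% beta)

# The hyperplane separates the classes if y * f > 0 for every observation
separates <- all(y * f > 0)

# Perpendicular distance of each observation to the hyperplane
distances <- abs(f) / sqrt(sum(beta^2))

# The margin is the smallest of these distances
margin <- min(distances)
margin
```

The maximal margin hyperplane is the choice of `beta0` and `beta` that makes `margin` as large as possible while still separating the classes.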

### Extending our simple example: using kernels 
- A kernel calculates the similarity of two observations. 
- A linear kernel will give us a decision boundary like the one above. 
- A **polynomial kernel** will give us a more flexible decision boundary. Hence we can apply it to data whose class boundaries are non-linear. 
- Kernels are not restricted to polynomials. Below is a **radial kernel**. 

![alt text](svm.png)
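
The three kernels mentioned above can be sketched as plain R functions. This is an illustration using the textbook forms of each kernel, not the internal implementation of any package; the function names, the default `degree` and `gamma` values, and the vectors `a` and `b` are assumptions made for the example.

```r
# Linear kernel: the inner product of two observations
linear_kernel <- function(x, z) sum(x * z)

# Polynomial kernel of a given degree (textbook form)
polynomial_kernel <- function(x, z, degree = 2) (1 + sum(x * z))^degree

# Radial kernel: similarity decays with squared Euclidean distance
radial_kernel <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

# Two hypothetical observations
a <- c(1, 2)
b <- c(2, 0)

linear_kernel(a, b)
polynomial_kernel(a, b)
radial_kernel(a, b)
radial_kernel(a, a)  # an observation is maximally similar to itself
```

With the radial kernel, observations that are far apart have similarity close to zero, so only nearby training observations influence the predicted class of a new point.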

# Our Dataset 

- Khan dataset: tissue samples corresponding to four types of cell tumours  

- Each sample contains gene expression measurements. 

# Code 

In [6]:
library(ISLR)
library(MASS)
library(caret)
names(Khan)

Loading required package: lattice
Loading required package: ggplot2


## Features 

- We have gene expression measurements for 2308 genes. The training set has 63 observations, while the test set has 20 observations.
- A dataset with many features like this is an example of a high-dimensional dataset. 

In [10]:
dim(Khan$xtrain)

In [11]:
dim(Khan$xtest)

## Are our datasets balanced?

In [12]:
table(Khan$ytrain)


 1  2  3  4 
 8 23 12 20 

In [20]:
table(Khan$ytest)
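
Raw counts are easier to compare as proportions. The sketch below re-enters the training counts printed above by hand (so that it runs without the dataset) and converts them to class shares; with the data loaded, `prop.table(table(Khan$ytrain))` should give the same shares.

```r
# Training-set class counts, copied from the table() output above
train_counts <- c("1" = 8, "2" = 23, "3" = 12, "4" = 20)

# Share of each class in the training set
train_prop <- train_counts / sum(train_counts)
round(train_prop, 2)

# Smallest and largest classes
names(which.min(train_counts))
names(which.max(train_counts))
```

The classes are clearly unequal in size: class 2 has almost three times as many training samples as class 1.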

We will use a support vector approach to predict cancer subtype using gene expression measurements. In this data set, there are a very large number of features relative to the number of observations. This suggests that we should use a linear kernel, because the additional flexibility that will result from using a polynomial or radial kernel is unnecessary.

This time, we will use the e1071 package directly rather than going through caret, to show how to run the analysis outside of caret. We will place the x variables and y variables together into one data frame, then apply the `svm()` function to the data. 

In [42]:
data_train <- data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain)) 
head(data_train)

Unnamed: 0,x.1,x.2,x.3,x.4,x.5,x.6,x.7,x.8,x.9,x.10,⋯,x.2300,x.2301,x.2302,x.2303,x.2304,x.2305,x.2306,x.2307,x.2308,y
V1,0.7733437,-2.438405,-0.4825622,-2.721135,-1.217058,0.8278092,1.342604,0.05704174,0.1335689,0.5654274,⋯,-0.02747398,-1.660205,0.588231,-0.463624,-3.952845,-5.496768,-1.414282,-0.6476004,-1.763172,2
V2,-0.07817778,-2.415754,0.4127717,-2.825146,-0.6262365,0.05448819,1.429498,-0.1202486,0.4567917,0.1590529,⋯,-0.2462842,-0.836325,-0.5712836,0.03478783,-2.47813,-3.661264,-1.093923,-1.20932,-0.8243955,2
V3,-0.08446916,-1.649739,-0.2413075,-2.875286,-0.8894054,-0.02747398,1.1593,0.01567648,0.1919418,0.4965847,⋯,0.02498525,-1.059872,-0.4037666,-0.6786527,-2.939352,-2.73645,-1.965399,-0.805868,-1.139434,2
V4,0.965614,-2.380547,0.6252965,-1.741256,-0.8453664,0.9496868,1.093801,0.8197358,-0.2846201,0.9947322,⋯,0.3571148,-1.893128,0.2551072,0.1633086,-1.021929,-2.077843,-1.127629,0.3315315,-2.179483,2
V5,0.0756639,-1.728785,0.8526265,0.2726953,-1.84137,0.3279359,1.251219,0.7714499,0.0309171,0.2783133,⋯,0.0617534,-2.273998,-0.03936472,0.3688011,-2.566551,-1.675044,-1.08205,-0.9652184,-1.836966,2
V6,0.4588163,-2.875286,0.1358412,0.4053984,-2.082647,0.1378471,1.73353,0.3964244,0.04583342,0.3520643,⋯,-1.102018,-1.545994,-0.65778,0.3900807,-1.660205,-1.651302,-1.130722,-1.129175,0.04114194,2
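
The column names `x.1`, `x.2`, … in the output above come from how `data.frame()` expands a matrix argument that has no column names. A small sketch with made-up numbers (the objects `x_small`, `y_small`, and `df_small` are hypothetical) shows the same pattern:

```r
# A toy 2 x 3 feature matrix without column names
x_small <- matrix(c(0.5, -1.2, 0.3, 2.1, -0.7, 1.4), nrow = 2)
y_small <- c(1, 2)

# Combine features and response, as done for the Khan data
df_small <- data.frame(x = x_small, y = as.factor(y_small))

names(df_small)       # one column per matrix column, plus y
is.factor(df_small$y)
```

Storing the response as a factor matters: `svm()` performs classification when the response is a factor and regression when it is numeric.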


The `cost` argument of `svm()` sets the penalty for violating the margin. When the cost argument is small, the margins will be wide and many support vectors will be on the margin or will violate the margin. When the cost argument is large, the margins will be narrow and there will be few support vectors on the margin or violating the margin.
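
The effect of `cost` can be explored by fitting the same model at several values and counting support vectors. The sketch below uses simulated two-class data rather than the Khan data, so the data, the seed, and the three cost values are all assumptions chosen for illustration.

```r
library(e1071)

set.seed(1)

# Simulated, overlapping two-class data (hypothetical, not the Khan data)
n <- 40
x <- matrix(rnorm(n * 2), ncol = 2)
y <- rep(c(-1, 1), each = n / 2)
x[y == 1, ] <- x[y == 1, ] + 1
sim <- data.frame(x = x, y = as.factor(y))

# Fit a linear SVM at each cost and record the total number of support vectors
costs <- c(0.01, 1, 100)
n_sv <- sapply(costs, function(cst) {
  fit <- svm(y ~ ., data = sim, kernel = "linear", cost = cst, scale = FALSE)
  fit$tot.nSV
})
names(n_sv) <- costs
n_sv
```

In practice the cost is usually chosen by cross-validation rather than fixed in advance; e1071 provides a `tune()` function for that purpose.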

In [51]:
library(e1071)
out <- svm(y ~ ., data = data_train, kernel = "linear", cost = 10)
summary(out)


Call:
svm(formula = y ~ ., data = data_train, kernel = "linear", cost = 10)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 
      gamma:  0.0004332756 

Number of Support Vectors:  58

 ( 20 20 11 7 )


Number of Classes:  4 

Levels: 
 1 2 3 4




In [52]:
# manually compute a confusion matrix 
table(out$fitted , data_train$y)

   
     1  2  3  4
  1  8  0  0  0
  2  0 23  0  0
  3  0  0 12  0
  4  0  0  0 20

We see that there are no training errors. In fact, this is not surprising, because the large number of variables relative to the number of observations implies that it is easy to find hyperplanes that fully separate the classes. We are most interested not in the support vector classifier’s performance on the training observations, but rather its performance on the test observations.
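
The claim that separation is easy when features outnumber observations can be checked on pure noise. The sketch below is an illustration in base R, not part of the Khan analysis: it draws random features and random labels (the sizes `n` and `p` and the seed are arbitrary assumptions) and constructs a hyperplane through the origin that classifies every observation correctly.

```r
set.seed(42)

# More features than observations, as in the Khan data (sizes are arbitrary)
n <- 20
p <- 50
X <- matrix(rnorm(n * p), nrow = n)

# Labels assigned at random, unrelated to the features
y <- sample(c(-1, 1), n, replace = TRUE)

# Minimum-norm solution of X %*% beta = y, which exists because p > n
beta <- t(X) %*% solve(X %*% t(X), y)

# Every observation falls on the correct side of the hyperplane
fitted_sign <- sign(as.vector(X %*% beta))
all(fitted_sign == y)
```

Because even meaningless labels can be separated perfectly in this setting, zero training error says little about how well the classifier will generalise.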

In [59]:
data_test <- data.frame(x = Khan$xtest, y = as.factor(Khan$ytest))
head(data_test)

Unnamed: 0,x.1,x.2,x.3,x.4,x.5,x.6,x.7,x.8,x.9,x.10,⋯,x.2300,x.2301,x.2302,x.2303,x.2304,x.2305,x.2306,x.2307,x.2308,y
V1,0.139501,-1.1689275,0.5649728,-3.366796,-1.323132,-0.69254736,2.327395,0.92370319,0.1121673,0.5097651,⋯,-0.9426347,-1.210662,-0.5887872,-0.07042246,-2.7838519,-2.8404394,-1.1609133,-0.34305385,-0.05551271,3
V2,1.1642752,-2.0181583,1.1035335,-2.165435,-1.440117,-0.437420279,2.661587,1.2240107,0.21050401,1.0455631,⋯,-1.5329399,-2.385967,-0.389641,0.42278099,-2.8167496,-2.4224954,-1.7226066,-1.70374859,-1.69990982,2
V4,0.8410929,0.2547197,-0.2087477,-2.148149,-1.512765,-1.263722809,2.946642,0.08782771,0.48291986,1.0630197,⋯,-1.8540605,-1.541312,-1.7737231,-1.87993516,-2.2652893,-2.4057259,-0.1763792,-0.12874288,-0.99641678,4
V6,0.6850646,-1.9275792,-0.2330676,-1.640413,-1.008954,0.774450632,1.617168,-0.56792522,0.03662118,-0.1017006,⋯,-0.2639655,-1.966113,-1.0861898,0.885914,-0.2485896,0.3858745,-0.5081625,-0.62698498,-0.69936648,2
V7,-1.9561625,-2.2349264,0.2815634,-2.695628,-1.214697,-1.05987246,2.49807,0.78019606,1.04158328,0.7275003,⋯,-0.6931472,-1.846427,-0.9934418,-3.29413831,-3.3326046,-2.2827825,-0.6566224,-2.0121568,-1.66865727,1
V8,-0.2586412,-1.6847004,0.1758003,-2.323809,-1.692276,-0.008637193,2.302135,0.45577792,-0.34249031,0.7165219,⋯,-1.2238354,-1.140372,-0.9524362,0.294012,-1.205307,-1.4575756,-0.655081,-0.06049338,-0.98056262,3


In [60]:
# run the predict function 
pred_test <- predict(out, newdata = data_test)
# output confusion matrix 
table(pred_test, data_test$y)


         
pred_test 1 2 3 4
        1 3 0 0 0
        2 0 6 2 0
        3 0 0 4 0
        4 0 0 0 5

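Only two errors on the 20 test observations! Both are class 3 samples that were predicted as class 2, which gives a test accuracy of 90%.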
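
The confusion matrix above can be summarised as an overall accuracy and a per-class recall. The sketch below re-enters the printed matrix by hand so that it runs on its own; the object names are hypothetical. With the fitted model available, `mean(pred_test == data_test$y)` should give the same accuracy.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = actual class; filled column by column)
conf_mat <- matrix(
  c(3, 0, 0, 0,
    0, 6, 0, 0,
    0, 2, 4, 0,
    0, 0, 0, 5),
  nrow = 4,
  dimnames = list(predicted = as.character(1:4), actual = as.character(1:4))
)

# Overall accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy

# Per-class recall: share of each actual class that was predicted correctly
recall <- diag(conf_mat) / colSums(conf_mat)
round(recall, 2)
```

Per-class recall shows where the errors fall: classes 1, 2, and 4 are classified perfectly, while only four of the six class 3 samples are recovered.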