### Supervised machine learning for classification of several types of cancer via deep learning

#### Cancer Classification Based on Microarray Gene Expression Data Using Deep Learning 
 
Gene expression analysis (see, for example, [here](http://storm.cis.fordham.edu/~cschweikert/cisc4020/MicroArray.pdf)) has focused on defining more detailed biological characteristics to improve patient risk stratification. Gene signatures serve as predictors and support more accurate tests for classifying and predicting cancer. Although microarray data contain only a small number of samples, the large number of gene expression levels per sample makes classification difficult. This is where deep learning algorithms can contribute significantly.


- The data consist of 174 samples with 12,533 genes (features). Each sample belongs to one of 11 cancer classes, encoded as integer labels:

    - 0: Ovary
    - 1: Bladder/Ureter
    - 2: Breast
    - 3: Colorectal
    - 4: Gastroesophagus
    - 5: Kidney
    - 6: Liver
    - 7: Prostate
    - 8: Pancreas
    - 9: Lung Adeno
    - 10: Lung Squamous
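
The label encoding listed above can be kept in a small lookup table so that integer predictions are easy to read back as class names. This is only a convenience sketch: the names `CLASS_NAMES` and `class_name` are hypothetical and do not appear in the notebook.

```python
# Index-to-name lookup for the 11 tumor classes listed above.
CLASS_NAMES = {
    0: "Ovary", 1: "Bladder/Ureter", 2: "Breast", 3: "Colorectal",
    4: "Gastroesophagus", 5: "Kidney", 6: "Liver", 7: "Prostate",
    8: "Pancreas", 9: "Lung Adeno", 10: "Lung Squamous",
}


def class_name(label):
    """Return the class name for an integer label (hypothetical helper)."""
    return CLASS_NAMES[int(label)]


print(class_name(7))  # Prostate
```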
 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
import warnings; warnings.simplefilter('ignore')

Using TensorFlow backend.


In [2]:
data = pd.read_csv('data11tumors.csv', delimiter=',',header=None) 
print(data.shape)
array = data.values
X = array[:,1:12534]
Y = array[:,0]
data

(174, 12534)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12524,12525,12526,12527,12528,12529,12530,12531,12532,12533
0,7,153,228,32,30,-36,48,984,-19,537,...,100,30,83,28,57,106,128,5,74,-188
1,7,154,99,43,55,66,63,5051,-26,1001,...,27,43,90,30,82,485,896,-2,91,-78
2,7,84,85,19,28,-104,28,2387,-80,1131,...,-32,51,110,10,86,62,76,-47,92,-103
3,7,234,169,40,36,81,6,2657,-6,1214,...,43,52,82,12,135,60,69,22,89,-180
4,7,104,58,42,13,107,5,3562,18,1464,...,159,42,73,1,82,60,50,30,74,-198
5,7,207,408,56,6,41,33,2617,40,1243,...,71,71,131,11,146,204,183,-2,146,-152
6,7,188,245,45,18,-15,44,1885,25,1313,...,22,48,90,14,113,76,131,4,98,-98
7,7,221,132,84,19,81,74,2589,-21,742,...,16,36,182,-35,98,213,198,-5,184,-198
8,7,198,328,13,41,16,38,2536,-35,1006,...,26,19,90,42,122,27,103,8,75,105
9,7,303,261,53,71,-40,66,1184,-28,711,...,49,44,58,7,-25,106,154,12,131,-224


In [3]:
print(X[173])
print(Y[173])

[  83  420   14 ...   12   80 -180]
0
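
The table stores the class label in the first column and the expression values in the remaining columns. The sketch below reproduces that split on a tiny made-up array (the values in `toy` are illustrative, not the real data) and counts the samples per class, which is a useful check when there are only 174 samples spread over 11 classes.

```python
import numpy as np

# Tiny synthetic stand-in for the 174 x 12534 table: column 0 is the label.
toy = np.array([
    [7, 153, 228, 32],
    [7, 154,  99, 43],
    [0,  83, 420, 14],
])

labels = toy[:, 0]      # class label of each sample
features = toy[:, 1:]   # every column after the label column

# Count how many samples fall into each class.
classes, counts = np.unique(labels, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(features.shape, class_counts)
```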


In [4]:
# encode class values as integers
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
#dummy_y=encoded_Y
feature_vectors = X
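
To make the one-hot step concrete, here is a minimal NumPy sketch of what the encoding produces. It is not the Keras implementation; `one_hot` is a hypothetical helper and it assumes the labels are already integers in the range 0 to `num_classes - 1`.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 (hypothetical helper)."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set the column given by each label to 1 in the matching row.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


toy_labels = [7, 0, 10]
toy_one_hot = one_hot(toy_labels, num_classes=11)
print(toy_one_hot.shape)  # (3, 11)
```

Taking `np.argmax` along each row recovers the original integer labels.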


In [5]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)

from sklearn.model_selection import train_test_split

#select 20% of the data for validation, rest for training and tuning model
X_train, X_test, y_train, y_test = train_test_split(scaled_features, dummy_y, test_size=0.2, random_state=42)

# create model
inputDimension=12533 #feature columns
nOutputClass=11
model = Sequential()
model.add(Dense(30, input_dim=inputDimension, kernel_initializer='normal', activation='relu'))
model.add(Dense(30, kernel_initializer='normal', activation='relu'))
model.add(Dense(30, kernel_initializer='normal', activation='relu'))
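# softmax is the usual output activation for single-label multi-class models;
# the run recorded below used sigmoid
model.add(Dense(nOutputClass, kernel_initializer='normal', activation='sigmoid'))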

# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
#model.fit(X_train, y_train, epochs=150, batch_size=10)
model.fit(X_train, y_train, epochs=25, batch_size=20)

# evaluate on the held-out 20% of the samples
scores = model.evaluate(X_test, y_test)
print("\n%s: on test data %.2f%%" % (model.metrics_names[1], scores[1]*100))



Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

acc: on test data 88.57%
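
The cell above fits the scaler on all samples before the train/test split, so the test rows contribute to the scaling statistics. A common alternative is to compute the mean and standard deviation from the training rows only and reuse them for the test rows. The sketch below shows that idea on synthetic data with plain NumPy; the array `toy_X` and the 16/4 split are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(loc=100.0, scale=50.0, size=(20, 5))  # 20 samples, 5 features

# Split first, then derive the scaling statistics from the training rows only.
train, test = toy_X[:16], toy_X[16:]
mean = train.mean(axis=0)
std = train.std(axis=0)
std[std == 0] = 1.0  # guard against constant features

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics
```

With scikit-learn the same effect is usually obtained by calling `fit` on the training split and `transform` on both splits.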


In [6]:
print("Number of patients reserved for test:", X_test.shape[0])
probabilities = model.predict(X_test)
# predicted class = index of the largest output for each sample
predictions = numpy.argmax(probabilities, axis=1)
print(predictions)
# recover the true class indices from the one-hot test labels
true_classes = numpy.argmax(y_test, axis=1)
# compare class indices per sample; an element-wise comparison of one-hot
# matrices would also count the matching zeros and overstate the accuracy
accuracy = numpy.mean(predictions == true_classes)
print("Prediction Accuracy: %.2f%%" % (accuracy*100))


Number of patients reserved for test: 35
[ 7  2 10  3  2  2  4  1  3  6  7  9  4  4  3  7  7  5  2  7  2  0  3  2
  5  4  4  2  8  8  3  2 10  6  2]
Prediction Accuracy: 88.57%
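
Comparing one-hot matrices element by element counts every matching zero, so it can report a much higher figure than the per-sample accuracy. The sketch below illustrates the gap on five made-up labels (the arrays are assumptions, not the notebook's predictions).

```python
import numpy as np

num_classes = 11
true_labels = np.array([7, 2, 10, 3, 2])
pred_labels = np.array([7, 2, 10, 3, 4])  # one of the five predictions is wrong

# Build one-hot rows by indexing into an identity matrix.
eye = np.eye(num_classes)
true_one_hot = eye[true_labels]
pred_one_hot = eye[pred_labels]

elementwise = np.mean(pred_one_hot == true_one_hot)  # 53 of 55 entries agree
per_sample = np.mean(pred_labels == true_labels)     # 4 of 5 samples agree

print("element-wise: %.2f%%, per-sample: %.2f%%"
      % (elementwise * 100, per_sample * 100))
```

Each misclassified sample changes only two of its 11 one-hot entries, which is why the element-wise figure stays close to 100% even when the per-sample accuracy is noticeably lower.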
