# **Methylation Biomarkers for Predicting Cancer**

## **Deep Learning Approaches to Cancer Classification**

**Author:** Meg Hutch

**Date:** January 26, 2020

**Objective:** Use neural networks to classify colon, esophagus, liver, and stomach cancer patients and healthy subjects.

In [32]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

**Import Training, Testing, and Principal Component Data**

In [33]:
# Training set
mcTrain = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/mcTrain.csv')
# Testing set
mcTest = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/mcTest.csv')
# All Principal Components
principal_Df_ALL = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/principalDF_ALL.csv')
# Principal Components that make up 90% of the variance of the training set
genesTrain_transformed_90 = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/genesTrain_transformed_90.csv')
# Principal Components projected onto the test set
genesTest_transformed_90 = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/genesTest_transformed_90.csv')

**Pre-Process Data**

* Standardize all data. UPDATE: Not going to worry about this yet!
* Make sure that data is formatted correctly
* Structure the neural network architecture for multi-class classification (check loss function?)
* Determine how to do LOOCV
* The idea is to get high AUCs using LOOCV and then, once that is optimized, evaluate on the testing set (is it cheating at all to test and then go back to change things?)
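The leave-one-out plan in the list above comes down to index bookkeeping: each of the n training samples is held out exactly once while the model trains on the other n-1. A minimal sketch, assuming a toy n of 5 and a hypothetical helper named `loocv_splits` (neither is part of this notebook):

```python
import numpy as np

def loocv_splits(n_samples):
    """Yield (train_idx, held_out_idx) pairs, one per sample (hypothetical helper)."""
    all_idx = np.arange(n_samples)
    for i in range(n_samples):
        # Train on every row except i; row i is the single validation sample
        yield np.delete(all_idx, i), i

splits = list(loocv_splits(5))  # toy size, chosen only for illustration
for train_idx, held_out in splits:
    print(held_out, train_idx)
```

Every fold leaves out a different row, so the number of folds equals the number of training samples.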

In [34]:
# remove genetic data from the mcTrain dataset
mcTrain = mcTrain[['seq_num','diagnosis', 'dilute_library_concentration', 'age', 'gender']]

# do the same for the testing set
mcTest = mcTest[['seq_num','diagnosis', 'dilute_library_concentration', 'age', 'gender']]

In [35]:
# rename the first column name of the PC dataframes
genesTrain_transformed_90.rename(columns={'Unnamed: 0':'seq_num'}, inplace=True)
genesTest_transformed_90.rename(columns={'Unnamed: 0':'seq_num'}, inplace=True)

In [36]:
# merge PCs with clinical/phenotypic data
mcTrain = pd.merge(mcTrain, genesTrain_transformed_90, how="left", on="seq_num") 
mcTest = pd.merge(mcTest, genesTest_transformed_90, how="left", on="seq_num") 
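A left merge silently fills missing principal components with NaN when a `seq_num` has no match, so it may be worth checking for unmatched rows. The sketch below uses made-up toy frames and a hypothetical `PC1` column; only `seq_num` and `diagnosis` are names taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for the clinical table and the PC table (values are made up)
clinical = pd.DataFrame({"seq_num": ["s1", "s2", "s3"],
                         "diagnosis": ["HEA", "CRC", "HCC"]})
pcs = pd.DataFrame({"seq_num": ["s1", "s2"],
                    "PC1": [0.1, -0.4]})

# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(clinical, pcs, how="left", on="seq_num", indicator=True)

# Rows tagged 'left_only' had no matching principal components
unmatched = merged.loc[merged["_merge"] == "left_only", "seq_num"].tolist()
print(unmatched)
```

An empty list would mean every clinical row found its principal components.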

**Create a new numeric index and drop seq_num**

For future code we want the index to be numeric.

In [37]:
# Create new ids
mcTrain['id'] = mcTrain.index + 1
mcTest['id'] = mcTest.index + 239

# Drop seq_num
mcTrain = mcTrain.drop(columns=["seq_num"])
mcTest = mcTest.drop(columns=["seq_num"])

**Remove Labels (Diagnosis) from the datasets**

In [38]:
mcTrain_x = mcTrain.drop(columns=["diagnosis"])
mcTest_x = mcTest.drop(columns=["diagnosis"])

**Create Labeled Datasets**

In [39]:
mcTrain_y = mcTrain[['id','diagnosis']]
mcTest_y = mcTest[['id','diagnosis']]

In [40]:
# Examine the unique target variables
mcTrain_y.diagnosis.unique()

array(['CRC', 'ESCA', 'HCC', 'STAD', 'HEA'], dtype=object)

In [41]:
# Replace each outcome target with a numerical value
label_map = {'HEA': 0, 'CRC': 1, 'ESCA': 2, 'HCC': 3, 'STAD': 4}
mcTrain_y = mcTrain_y.replace({'diagnosis': label_map})
mcTest_y = mcTest_y.replace({'diagnosis': label_map})

**Convert id to index**

In [42]:
mcTrain_x = mcTrain_x.set_index('id')
mcTrain_y = mcTrain_y.set_index('id')

mcTest_x = mcTest_x.set_index('id')
mcTest_y = mcTest_y.set_index('id')

In [43]:
# Import PyTorch packages
import torch
from torch import nn
from torchvision import datasets, transforms
from torch import optim
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import TensorDataset
import torch.nn.functional as F
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

**Pre-Process Data**

**Replace Categorical Outputs to Numeric Value**

In [70]:
# Define the model with one hidden layer: 55 inputs, 30 hidden units, 5 output classes
model = nn.Sequential(nn.Linear(55, 30),
                      nn.ReLU(),
                      nn.Linear(30, 5))
                      
                      
# Set optimizer and learning rate
#optimizer = optim.SGD(model.parameters(), lr=0.003)

# Could also use the Adam optimizer: similar to stochastic gradient descent, but it uses
# momentum, which can speed up fitting, and it adapts the learning rate for each
# individual parameter in the model
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Don't add a softmax or sigmoid layer before this loss. The PyTorch manual indicates:
# "This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class."
criterion = nn.CrossEntropyLoss()

#Jan 31.2020: For loop attempt
df_train_x = []
df_train_y = []

for index in range(len(mcTrain_x)):  # one fold per training sample, including the last
    # X - features
    mcTrain_xy = mcTrain_x.drop(mcTrain_x.index[index])
    mcTrain_xy = np.array(mcTrain_xy, dtype = "float32")
    
    # y - target/outputs
    mcTrain_yz = mcTrain_y.drop(mcTrain_y.index[index]) 
    mcTrain_yz = np.array(mcTrain_yz, dtype = "float32")
    
    # Reformat into tensors
    xb = torch.from_numpy(mcTrain_xy)
    yb = torch.from_numpy(mcTrain_yz)
    
    # Squeeze the (n, 1) label tensor to shape (n,) by removing the size-1 dimension
    yb = yb.squeeze(1)
    
    # append all dfs
    #df_train_x.append(xb)
    #df_train_y.append(yb)
    
    # Rather than storing every fold in a list of dataframes (df_train_x),
    # train on each fold directly inside this loop
    # Define the batchsize
    batch_size = 32

    # Combine the feature and label tensors into a dataset
    train_ds = TensorDataset(xb, yb)
    
    # Training loader
    trainloader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    
    # Set epochs
    epochs = 100
    for e in range(epochs):
        running_loss = 0
        for xb, yb in trainloader:
        
            # Clear the gradients, do this because gradients are accumulated
            optimizer.zero_grad()
        
            # Training pass
            output = model(xb)
            loss = criterion(output, yb.long()) # Loss calculated from the output compared to the labels  
            loss.backward()
            optimizer.step()
        
            running_loss += loss.item()  # add this batch's scalar loss to the epoch total
        # Report the mean batch loss once per epoch
        print("Epoch {}/{}, Training loss: {:.3f}".format(e+1, epochs, running_loss/len(trainloader)))

    
#df_train_x[1].shape # this works!
#df_train_y[3].shape #237 

SyntaxError: invalid syntax (<ipython-input-70-2ea04dcee107>, line 67)
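The comment on `criterion` in the cell above says `nn.CrossEntropyLoss` already includes the log-softmax step. A small check of that claim, using random logits in place of real model outputs (the tensor values here are made up):

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)            # 4 samples, 5 classes (raw model outputs)
targets = torch.tensor([0, 1, 4, 2])  # integer class labels, dtype int64

# CrossEntropyLoss applied directly to the raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

# The same value built from its two parts: log-softmax, then negative log-likelihood
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, nll))
```

This is why the model's last layer is a plain `nn.Linear` with no activation after it.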

**Updates:** My for loop looks to be running. I will have to carefully go through the code, but this looks good. One thing that might be helpful is to add an indicator, every N epochs, that marks each fold of the LOOCV method.

Next I will also have to make sure that the testing set is formatted appropriately.
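One way to format the testing set is to align its columns with the training features and convert to tensors with the dtypes the loss expects. This is a sketch on made-up toy frames that stand in for `mcTrain_x`, `mcTest_x`, and `mcTest_y`; the `PC1` column name is an assumption.

```python
import pandas as pd
import torch

# Toy frames standing in for mcTrain_x, mcTest_x and mcTest_y (values are made up)
train_x = pd.DataFrame({"age": [61, 54], "PC1": [0.2, -1.1]}, index=[1, 2])
test_x = pd.DataFrame({"PC1": [0.7], "age": [47]}, index=[239])
test_y = pd.DataFrame({"diagnosis": [3]}, index=[239])

# Reorder test columns to match training so each feature feeds the same input unit
test_x = test_x[train_x.columns]

# Features as float32, labels as int64 (the dtype CrossEntropyLoss expects for class indices)
x_test = torch.from_numpy(test_x.to_numpy(dtype="float32"))
y_test = torch.from_numpy(test_y["diagnosis"].to_numpy(dtype="int64"))

print(x_test.shape, y_test.shape)
```

Storing the labels as int64 from the start would also remove the need for a `.long()` cast at loss time.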

**Next Steps:**
* **Will need to figure out how to match each iteration's held-out sample (one of the 238 training samples) to its target index. IT IS NOT OUR FINAL TEST SET: WE WILL STILL BE MATCHING ONTO TRAIN_X, SINCE THIS IS LOOCV!!!**
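A possible shape for that step is sketched below on random toy data: each fold trains on all rows but one, predicts the held-out training row, and stores the prediction under that row's id. Two things differ from the cell above and are assumptions of this sketch, not the notebook's method: the model is rebuilt inside the loop so weights do not carry over between folds, and training uses a few full-batch epochs to keep the example fast. The helper `make_model` and all sizes are hypothetical.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
n_samples, n_features, n_classes = 12, 6, 3   # toy sizes, not the notebook's
X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_classes, (n_samples,))
ids = list(range(1, n_samples + 1))           # stands in for the 'id' index

def make_model():
    # A fresh network per fold, so no fold starts from weights that saw its held-out row
    return nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, n_classes))

criterion = nn.CrossEntropyLoss()
held_out_pred = {}                            # id -> predicted class

for fold in range(n_samples):
    keep = torch.arange(n_samples) != fold    # boolean mask dropping the held-out row
    model = make_model()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(5):                    # a few full-batch epochs, only to keep the sketch fast
        optimizer.zero_grad()
        loss = criterion(model(X[keep]), y[keep])
        loss.backward()
        optimizer.step()

    # Predict the single held-out training row and store it under its id
    model.eval()
    with torch.no_grad():
        logits = model(X[fold].unsqueeze(0))
    held_out_pred[ids[fold]] = logits.argmax(dim=1).item()
    print(f"fold {fold + 1}/{n_samples}: final training loss {loss.item():.3f}")
```

Printing the fold number alongside the loss also gives the per-fold indicator mentioned in the updates.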
