# **Methylation Biomarkers for Predicting Cancer**

## **Dimensionality Reduction: Principal Component Analysis**

**Author:** Meg Hutch

**Date:** January 26, 2020

**Objective:** Use neural networks to classify colon, esophagus, liver, and stomach cancer patients and healthy subjects.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

**Import Training, Testing, and Principal Component Data**

In [None]:
# Training set
mcTrain = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/mcTrain.csv')
# Testing set
mcTest = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/mcTest.csv')
# All Principal Components
principal_Df_ALL = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/principalDF_ALL.csv')
# Principal Components that make up 90% of the variance of the training set
genesTrain_transformed_90 = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/genesTrain_transformed_90.csv')
# Principal Components projected onto the test set
genesTest_transformed_90 = pd.read_csv('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/02_Processed_Data/genesTest_transformed_90.csv')
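
The transformed files above are loaded from disk, and the code that produced them is not part of this notebook. As a rough sketch of the general idea (keep the components that explain 90% of the training variance, then project the test set onto those same components), assuming scikit-learn's `PCA` and purely synthetic data; every `toy_*` name is hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the gene-level features (rows = samples, columns = genes)
toy_train = rng.normal(size=(40, 200))
toy_test = rng.normal(size=(10, 200))

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.90, svd_solver="full")
toy_train_pcs = pca.fit_transform(toy_train)  # fit on the training data only
toy_test_pcs = pca.transform(toy_test)        # project the test data onto the same axes

explained = pca.explained_variance_ratio_.sum()
print(toy_train_pcs.shape, toy_test_pcs.shape, round(explained, 3))
```

Fitting on the training set alone and only calling `transform` on the test set keeps test information out of the components.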

**Pre-Process Data**

* Standardize all data. UPDATE: Not going to worry about this yet!
* Make sure that the data is formatted correctly
* Structure the neural network architecture for multi-class classification (check the loss function?)
* Determine how to do leave-one-out cross-validation (LOOCV)
* The idea is that I will try to get high AUCs using LOOCV and then, once I optimize that, I'll test on the testing set. (Is it cheating at all to test and then go back and change the model?)
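
A minimal sketch of the leave-one-out idea from the list above, assuming scikit-learn's `LeaveOneOut` splitter. To keep it short, a `LogisticRegression` stands in for the neural network, and the data are synthetic; all `toy_*` names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
# Synthetic stand-in: 30 samples, 5 features, 3 well-separated classes
toy_y = np.repeat(np.arange(3), 10)
toy_X = rng.normal(size=(30, 5)) + 3.0 * toy_y[:, None]

held_out_pred = np.empty_like(toy_y)
for train_idx, val_idx in LeaveOneOut().split(toy_X):
    # Fit a fresh model on every sample except one
    clf = LogisticRegression(max_iter=1000)
    clf.fit(toy_X[train_idx], toy_y[train_idx])
    # Predict the single held-out sample
    held_out_pred[val_idx] = clf.predict(toy_X[val_idx])

loocv_accuracy = (held_out_pred == toy_y).mean()
print(f"LOOCV accuracy: {loocv_accuracy:.2f}")
```

On the question in the last bullet: if the model is changed after looking at test-set results, the test estimate tends to become optimistic. Tuning with LOOCV on the training set and using the test set for one final evaluation avoids that.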

In [None]:
# remove genetic data from the mcTrain dataset
mcTrain = mcTrain[['seq_num','diagnosis', 'dilute_library_concentration', 'age', 'gender']]

# do the same for the testing set
mcTest = mcTest[['seq_num','diagnosis', 'dilute_library_concentration', 'age', 'gender']]

In [None]:
# rename the first column name of the PC dataframes
genesTrain_transformed_90.rename(columns={'Unnamed: 0':'seq_num'}, inplace=True)
genesTest_transformed_90.rename(columns={'Unnamed: 0':'seq_num'}, inplace=True)

In [None]:
# merge PCs with clinical/phenotypic data
mcTrain = pd.merge(mcTrain, genesTrain_transformed_90, how="left", on="seq_num") 
mcTest = pd.merge(mcTest, genesTest_transformed_90, how="left", on="seq_num") 
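
A left merge silently leaves `NaN` principal components for any `seq_num` that is missing from the PC table, and it duplicates rows if a key repeats. One way to check for both, sketched on hypothetical miniature tables (the `toy_*` frames and the `PC1` column are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the clinical and PC tables
toy_clinical = pd.DataFrame({"seq_num": ["s1", "s2", "s3"],
                             "diagnosis": ["HEA", "CRC", "HCC"]})
toy_pcs = pd.DataFrame({"seq_num": ["s1", "s2"],
                        "PC1": [0.1, -0.4]})

# validate raises MergeError if seq_num is duplicated on either side;
# indicator adds a _merge column recording where each row came from
toy_merged = pd.merge(toy_clinical, toy_pcs, how="left", on="seq_num",
                      validate="one_to_one", indicator=True)

# Rows found only in the clinical table have NaN principal components
unmatched = toy_merged.loc[toy_merged["_merge"] == "left_only", "seq_num"].tolist()
print(unmatched)
```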

**Remove Labels (Diagnosis) from the datasets**

In [None]:
mcTrain_x = mcTrain.drop(columns=["diagnosis"])
mcTest_x = mcTest.drop(columns=["diagnosis"])

**Create Labeled Datasets**

In [None]:
mcTrain_y = mcTrain[['seq_num', 'diagnosis']]
mcTest_y = mcTest[['seq_num', 'diagnosis']]

**Convert seq_num id to index**

In [None]:
mcTrain_x = mcTrain_x.set_index('seq_num')
mcTrain_y = mcTrain_y.set_index('seq_num')

mcTest_x = mcTest_x.set_index('seq_num')
mcTest_y = mcTest_y.set_index('seq_num')

# **Neural Network Cancer Classification**

In [None]:
# Import PyTorch packages
import torch
from torch import nn
from torchvision import datasets, transforms
from torch import optim
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import TensorDataset
import torch.nn.functional as F
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

**Pre-Process Data**

**Replace Categorical Outputs with Numeric Values**

In [None]:
# Examine the unique target variables
mcTrain_y.diagnosis.unique()

In [None]:
# Map each diagnosis to an integer class index.
# nn.CrossEntropyLoss expects indices in [0, n_classes - 1], so numbering starts at 0.
label_map = {'HEA': 0, 'CRC': 1, 'ESCA': 2, 'HCC': 3, 'STAD': 4}

mcTrain_y = mcTrain_y.replace(label_map)
mcTest_y = mcTest_y.replace(label_map)
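
The numeric codes matter because `nn.CrossEntropyLoss`, which this notebook uses as its criterion further down, takes raw logits of shape `(batch, C)` and, in its class-index form, `int64` targets of shape `(batch,)` with values in `[0, C-1]`. A small sketch on made-up tensors (all `toy_*` names are hypothetical) that also checks the loss against its two-step equivalent:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes = 5
toy_logits = torch.randn(4, n_classes)    # raw model outputs, no softmax applied
toy_targets = torch.tensor([0, 2, 4, 1])  # int64 class indices, shape (batch,)

ce = nn.CrossEntropyLoss()(toy_logits, toy_targets)
# The same value computed in two explicit steps
manual = F.nll_loss(F.log_softmax(toy_logits, dim=1), toy_targets)
print(torch.allclose(ce, manual))
```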

**Format the Training Set**

In [None]:
# Convert the features into a float32 array
xb = np.array(mcTrain_x, dtype = "float32")
# Convert the labels into a 1-D int64 array: nn.CrossEntropyLoss expects
# integer class indices of shape (n_samples,)
yb = np.array(mcTrain_y, dtype = "int64").ravel()

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

# Combine the tensors into a dataset
train_dataset = TensorDataset(xb, yb)

# Define the batch size
batch_size = 32

# Training loader
trainloader = DataLoader(train_dataset, batch_size, shuffle=True)
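
A quick way to see what the loader yields, sketched on random tensors (the sizes and `toy_*` names are assumptions, not the real data). The number of batches is what `len()` of a loader returns, and the last batch is smaller when the sample count is not a multiple of the batch size:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_x = torch.randn(70, 50)         # 70 samples, 50 features
toy_y = torch.randint(0, 5, (70,))  # int64 class indices
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=32, shuffle=True)

batch_sizes = [xb_toy.shape[0] for xb_toy, _ in toy_loader]
print(len(toy_loader), batch_sizes)
```

With 70 samples and a batch size of 32, this gives 3 batches of sizes 32, 32, and 6.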

**Format the Testing Set**

In [None]:
# Convert the features into a float32 array
xb = np.array(mcTest_x, dtype = "float32")
# Convert the labels into a 1-D int64 array of class indices
yb = np.array(mcTest_y, dtype = "int64").ravel()

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

# Combine the tensors into a dataset
test_dataset = TensorDataset(xb, yb)

# Define the batch size
batch_size = 32

# Testing loader (no shuffling needed for evaluation)
testloader = DataLoader(test_dataset, batch_size, shuffle=False)

**Create Neural Network Model**

# **TO DOs:**

* LOOCV!!!

In [None]:
# Define the model: inputs -> one hidden layer of 30 units -> 5 output classes
n_features = mcTrain_x.shape[1]  # number of input columns (expected to be 50)
model = nn.Sequential(nn.Linear(n_features, 30),
                      nn.ReLU(),
                      nn.Linear(30, 5))

# Set optimizer and learning rate
#optimizer = optim.SGD(model.parameters(), lr=0.003)

# Could also use the Adam optimizer: similar to stochastic gradient descent, but it uses momentum,
# which can speed up fitting, and it adapts the learning rate for each individual parameter
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Don't add softmax or sigmoid before this loss. The PyTorch manual indicates:
# "This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class."
criterion = nn.CrossEntropyLoss()

# Set epochs
epochs = 200
for e in range(epochs):
    running_loss = 0
    for xb, yb in trainloader:

        # Clear the gradients, because PyTorch accumulates them across backward passes
        optimizer.zero_grad()

        # Training pass
        output = model(xb)  # calling the model runs its forward pass
        loss = criterion(output, yb)  # loss of the outputs compared to the labels
        loss.backward()
        optimizer.step()

        # loss.item() returns the batch loss as a Python float; add it to the epoch total
        running_loss += loss.item()

    # Mean loss per batch for this epoch
    print(f"Training loss: {running_loss/len(trainloader)}")
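
The metric functions imported earlier are not used yet. A minimal sketch of how a held-out evaluation could look, assuming a model with the same layer sizes and random stand-in data (the `toy_*` names are hypothetical, and `roc_auc_score` is an extra import not used elsewhere in this notebook). The model here is untrained, so the scores are only expected to be near chance:

```python
import torch
from torch import nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(50, 30), nn.ReLU(), nn.Linear(30, 5))
toy_x = torch.randn(40, 50)
toy_y = torch.arange(5).repeat(8)  # every class appears 8 times

toy_model.eval()                   # switch off training-only behavior
with torch.no_grad():              # no gradients needed for evaluation
    logits = toy_model(toy_x)
    # Softmax turns logits into class probabilities; each row sums to 1
    probs = F.softmax(logits.double(), dim=1).numpy()
preds = probs.argmax(axis=1)

acc = accuracy_score(toy_y.numpy(), preds)
cm = confusion_matrix(toy_y.numpy(), preds, labels=list(range(5)))
# One-vs-rest AUC averaged over classes; requires every class in y_true
auc_ovr = roc_auc_score(toy_y.numpy(), probs, multi_class="ovr")
print(acc, cm.shape, round(auc_ovr, 3))
```

Softmax is applied only here, at evaluation time, to obtain probabilities for the AUC; during training the loss works on the raw logits.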