# **Methylation Biomarkers for Predicting Cancer**

## **Data Pre-Processing**

**Author:** Meg Hutch

**Date:** February 25, 2020

**Objective:** Pre-process data for use in Neural Networks, Random Forest, and Logistic Regression.

**Note:** In this version, I only test the ability of methylation levels to classify cancer types; phenotypic data are not included for now. The data are split 70% for training and 30% for testing. The 70% training data will undergo leave-one-out cross-validation to tune hyperparameters before final performance is measured on the 30% test set.
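The leave-one-out scheme mentioned above can be sketched as follows. This is a minimal illustration on toy data, not the project's tuning code: the `loo_splits` helper and the toy array sizes are hypothetical.

```python
import numpy as np

def loo_splits(n_samples):
    """Yield (train_idx, val_idx) pairs, holding out one sample per fold."""
    all_idx = np.arange(n_samples)
    for i in range(n_samples):
        yield np.delete(all_idx, i), np.array([i])

# Toy stand-in for the 70% training partition (hypothetical sizes).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 4))

folds = list(loo_splits(len(X_train)))
for train_idx, val_idx in folds[:2]:
    # Each fold trains on n - 1 samples and validates on the remaining one.
    print(len(train_idx), val_idx)
```

Each candidate hyperparameter setting would be scored across all folds, and only the chosen setting would then be evaluated once on the held-out 30%.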

**Note:** This is the new version of the script, in which gene counts are normalized with DESeq2 in the initial R pre-processing script. This more than doubled the number of principal components needed to explain 90% of the variance (157). Regardless, we will begin running the deep learning classifier on the revised data.
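As an illustration of how a figure such as "components needed for 90% of the variance" can be computed, here is a small NumPy sketch. The function name `n_components_for_variance` and the toy matrix are hypothetical; the actual principal component analysis for this project is done elsewhere.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio is at least `threshold`."""
    X_centered = X - X.mean(axis=0)
    # The squared singular values of the centered matrix are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_components, len(cumulative))

# Toy matrix (hypothetical): the first axis carries about 90% of the variance.
X_toy = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
k_low = n_components_for_variance(X_toy, threshold=0.85)
k_high = n_components_for_variance(X_toy, threshold=0.95)
print(k_low, k_high)
```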




**Notes/ToDos: The final testing sets will need to be normalized/modified in the same way. For now, I have focused only on the various training sets. I also have sections for creating a few more datasets (combining stomach/colon cancer, and removing GBM and breast cancer); I will work on these after the initial analyses.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

In [2]:
# Set the working directory to the GitHub project folder
import os
#os.chdir('/projects/p31049/Multi_Cancer_DL/')
os.chdir('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/')
os.getcwd()

'C:\\Users\\User\\Box Sync\\Projects\\Multi_Cancer_DL'

**Import Training, Testing, and Principal component data**

In [3]:
# Training set
mcTrain = pd.read_csv('02_Processed_Data/mcTrain_70_30.csv')
# Testing set
mcTest = pd.read_csv('02_Processed_Data/mcTest_70_30.csv')

In [4]:
mcTrain.head()

| Unnamed: 0 | seq_num | diagnosis | gender | age | frag_mean | OR4F5 | AL627309.1 | OR4F29 | OR4F16 | AL669831.1 | ... | SYCE3 | CPT1B | CHKB-CPT1B | CHKB | MAPK8IP2 | ARSA | SHANK3 | ACR | RABL2B | dilute_library_concentration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SEQF2032 | HEA | 1 | 58 | 178.23204 | 0.0 | 24.924181 | 0 | 0 | 6.062639 | ... | 103.738482 | 0.0 | 20.208795 | 109.127494 | 145.503325 | 117.211012 | 920.847433 | 127.989036 | 84.203313 | 6.579 |
| 1 | SEQF2036 | HCC | 1 | 49 | 186.899353 | 0.0 | 28.1983 | 0 | 0 | 6.189871 | ... | 90.097008 | 0.0 | 26.822774 | 105.227804 | 120.358599 | 99.037933 | 846.636774 | 90.784772 | 80.468321 | 6.666 |
| 2 | SEQF2037 | HCC | 0 | 47 | 179.389458 | 0.761211 | 26.642369 | 0 | 0 | 5.328474 | ... | 93.628898 | 0.0 | 14.463 | 105.808267 | 114.942793 | 149.958478 | 950.751978 | 118.748846 | 76.882266 | 6.536 |
| 3 | SEQF2038 | HCC | 1 | 50 | 178.434177 | 0.714086 | 10.711287 | 0 | 0 | 13.567631 | ... | 112.825559 | 0.714086 | 14.995802 | 122.108675 | 117.110074 | 105.684701 | 893.321358 | 89.974813 | 102.114272 | 7.026 |
| 4 | SEQF2040 | HEA | 0 | 71 | 179.532989 | 0.0 | 15.776344 | 0 | 0 | 4.302639 | ... | 103.263339 | 0.0 | 20.078983 | 103.263339 | 124.776535 | 113.302831 | 955.902997 | 116.171257 | 96.092274 | 7.26 |


# **Pre-Process Data**

**Shuffle the training and test sets**

Currently, the samples are ordered by disease state. We shuffle them so that the network is not fed one class at a time.

In [5]:
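# pandas draws from NumPy's random generator, so random.seed() does not make
# DataFrame.sample reproducible; pass random_state instead.
seed = 222020
mcTrain = mcTrain.sample(frac=1, random_state=seed).reset_index(drop=True) # frac=1 returns all rows in random order
mcTest = mcTest.sample(frac=1, random_state=seed).reset_index(drop=True)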

**Create a new numeric index and drop seq_num and demographic data for these experiments**

Downstream code expects the index to be numeric.

In [6]:
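# Create new ids; the test ids start at 243 so they do not overlap with the training ids
mcTrain['id'] = mcTrain.index + 1
mcTest['id'] = mcTest.index + 243

# Drop seq_num (gender, age, and the other non-gene columns are kept at this step)
mcTrain = mcTrain.drop(columns=["seq_num"])
mcTest = mcTest.drop(columns=["seq_num"])

# Set id as the index
mcTrain = mcTrain.set_index('id')
mcTest = mcTest.set_index('id')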

In [7]:
# Examine the unique target variables
mcTrain.diagnosis.unique()

array(['CRC', 'STAD', 'HCC', 'GBM', 'HEA', 'BRCA', 'ESCA'], dtype=object)

In [8]:
# Replace each diagnosis label with a numerical value.
# Mapping only the diagnosis column avoids touching any other column by accident.
diagnosis_codes = {'HEA': 0, 'CRC': 1, 'ESCA': 2, 'HCC': 3, 'STAD': 4, 'GBM': 5, 'BRCA': 6}

mcTrain['diagnosis'] = mcTrain['diagnosis'].map(diagnosis_codes)
mcTest['diagnosis'] = mcTest['diagnosis'].map(diagnosis_codes)
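A sketch of the same encoding on a toy series, with a check that no label was left unmapped and an inverse lookup for turning predicted class codes back into diagnosis names. The names `DIAGNOSIS_CODES` and `CODE_TO_DIAGNOSIS` and the toy labels are hypothetical; only the label-to-code assignment is taken from the cell above.

```python
import pandas as pd

# Same label-to-code assignment as in the notebook cell.
DIAGNOSIS_CODES = {"HEA": 0, "CRC": 1, "ESCA": 2, "HCC": 3,
                   "STAD": 4, "GBM": 5, "BRCA": 6}
CODE_TO_DIAGNOSIS = {code: name for name, code in DIAGNOSIS_CODES.items()}

toy_labels = pd.Series(["HCC", "HEA", "BRCA", "CRC"])  # hypothetical sample
encoded = toy_labels.map(DIAGNOSIS_CODES)

# map() yields NaN for any label missing from the dictionary, so this
# check catches typos or unexpected classes early.
if encoded.isna().any():
    raise ValueError("unmapped diagnosis label found")

# Decode class codes (for example, model predictions) back to names.
decoded = encoded.map(CODE_TO_DIAGNOSIS)
print(encoded.tolist(), decoded.tolist())
```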

**Save the Training and Testing sets**

In [9]:
mcTrain.to_csv('02_Processed_Data/Final_Datasets/mcTrain_Full_70_30.csv')
mcTest.to_csv('02_Processed_Data/Final_Datasets/mcTest_Full_70_30.csv')

**Remove Labels (Diagnosis) from the datasets**

In [10]:
mcTrain_x = mcTrain.drop(columns=["diagnosis"])
mcTest_x = mcTest.drop(columns=["diagnosis"])

**Create Labeled Datasets**

In [11]:
mcTrain_y = mcTrain[['diagnosis']]
mcTest_y = mcTest[['diagnosis']]

# **Save the Main Training and Testing Datasets**

In [12]:
mcTrain_x.to_csv('02_Processed_Data/Final_Datasets/mcTrain_x_Full_70_30.csv')
mcTrain_y.to_csv('02_Processed_Data/Final_Datasets/mcTrain_y_Full_70_30.csv')

mcTest_x.to_csv('02_Processed_Data/Final_Datasets/mcTest_x_Full_70_30.csv')
mcTest_y.to_csv('02_Processed_Data/Final_Datasets/mcTest_y_Full_70_30.csv')
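When these files are read back, the `id` index written by `to_csv` comes in as an ordinary column unless `index_col` is given. A column with an empty header is read as `Unnamed: 0`, which is the likely origin of that column in the `mcTrain.head()` output earlier. The round trip below is a minimal sketch on a toy frame, using an in-memory buffer in place of a file.

```python
import io
import pandas as pd

toy_y = pd.DataFrame({"diagnosis": [0, 3]},
                     index=pd.Index([1, 2], name="id"))

buffer = io.StringIO()
toy_y.to_csv(buffer)          # the index is written as the first column, "id"
buffer.seek(0)

# index_col restores "id" as the index instead of a regular column.
reloaded = pd.read_csv(buffer, index_col="id")
print(reloaded)
```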

# **Downsampling the Majority Class**

In [13]:
# Subset healthy patients
class0 = mcTrain_y[mcTrain_y.diagnosis == 0]

# Select the first 30 healthy subjects; these are the subjects we will remove
class0 = class0.head(30)
class0 = class0.index.tolist()

# Remove these patients from the label dataframe
mcTrain_y_ds = mcTrain_y[~mcTrain_y.index.isin(class0)]
#print(mcTrain_y.head(20))

# Print the cases we wanted to remove
#print(class0)

# Observe class distributions after downsampling
class0_new = mcTrain_y_ds[mcTrain_y_ds.diagnosis == 0]
#print('# Healthy Subjects', class0_new.shape) # should be 31
#print('# Full Training Set', mcTrain_y.shape)
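The cell above drops the first 30 healthy rows, which is effectively a random choice because the data were shuffled earlier. An alternative that makes the random draw explicit is sketched below on toy data. The helper `drop_from_class`, its `seed` parameter, and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def drop_from_class(labels, target_class, n_drop, seed=0):
    """Return `labels` without `n_drop` randomly chosen rows of `target_class`."""
    candidates = labels[labels["diagnosis"] == target_class]
    to_drop = candidates.sample(n=n_drop, random_state=seed).index
    return labels.drop(index=to_drop)

# Toy labels and features sharing the same index (hypothetical).
toy_y = pd.DataFrame({"diagnosis": [0] * 8 + [1] * 3 + [2] * 2},
                     index=range(1, 14))
toy_x = pd.DataFrame({"gene_a": range(13)}, index=range(1, 14))

toy_y_ds = drop_from_class(toy_y, target_class=0, n_drop=5)
# Keep the feature rows aligned with the remaining labels.
toy_x_ds = toy_x.loc[toy_y_ds.index]

print(toy_y_ds["diagnosis"].value_counts().sort_index())
```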

**Remove excess healthy patients from the input training set and original dataset**

In [14]:
mcTrain_x_ds = mcTrain_x[~mcTrain_x.index.isin(class0)]
print(mcTrain_x_ds.shape)

(212, 19104)


# **Save the Downsampled Training Data**

In [15]:
mcTrain_x_ds.to_csv('02_Processed_Data/Final_Datasets/mcTrain_x_ds_70_30.csv')
mcTrain_y_ds.to_csv('02_Processed_Data/Final_Datasets/mcTrain_y_ds_70_30.csv')

# **Equalize All Classes**

In [16]:
# Subset each class and keep the index labels of its first 20 patients;
# these are the patients we keep (note: GBM and breast cancer only have 19 and 18)
class0, class1, class2, class3, class4, class5, class6 = [
    mcTrain_y[mcTrain_y.diagnosis == label].head(20).index.tolist()
    for label in range(7)
]

# Subset the main mcTrain_y dataframe to the selected patients in each class and bind
# the results (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
mcTrain_y_es_20 = pd.concat(
    [mcTrain_y[mcTrain_y.index.isin(ids)]
     for ids in (class0, class1, class2, class3, class4, class5, class6)]
)

In [17]:
mcTrain_y_es_20.shape

(137, 1)
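The 137 rows are consistent with five classes capped at 20 patients plus the 19 GBM and 18 breast cancer patients (5 × 20 + 19 + 18 = 137). The same per-class cap can be written with `groupby(...).head(...)`; the sketch below uses a toy frame and a cap of 2 so the effect is easy to see. The variable names are hypothetical.

```python
import pandas as pd

# Toy labels (hypothetical): class 0 has 5 rows, class 1 has 4, class 2 has 1.
toy_y = pd.DataFrame({"diagnosis": [0, 1, 0, 2, 1, 0, 0, 1, 1, 0]},
                     index=range(1, 11))
cap = 2  # the notebook uses 20

# groupby().head() keeps at most `cap` rows per class, in original row order.
capped = toy_y.groupby("diagnosis").head(cap)

# A stable sort groups the rows by class, matching the class-by-class
# binding order used in the notebook.
capped = capped.sort_values("diagnosis", kind="mergesort")
print(capped)
```

A class with fewer rows than the cap (class 2 here) is kept whole, just as GBM and breast cancer are in the real data.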

**Modify the feature training x set**

In [18]:
# Select the same patients from the feature set and bind the per-class subsets
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
mcTrain_x_es_20 = pd.concat(
    [mcTrain_x[mcTrain_x.index.isin(ids)]
     for ids in (class0, class1, class2, class3, class4, class5, class6)]
)

In [19]:
mcTrain_x_es_20.shape

(137, 19104)

# **Save Equal Downsampled Datasets** 

In [20]:
mcTrain_x_es_20.to_csv('02_Processed_Data/Final_Datasets/mcTrain_x_es_70_30.csv')
mcTrain_y_es_20.to_csv('02_Processed_Data/Final_Datasets/mcTrain_y_es_70_30.csv')

# **Only keep GI Cancers**

In [21]:
mcTrain_hea = mcTrain[mcTrain.diagnosis == 0]
mcTrain_crc = mcTrain[mcTrain.diagnosis == 1]
mcTrain_esca = mcTrain[mcTrain.diagnosis == 2]
mcTrain_hcc = mcTrain[mcTrain.diagnosis == 3]
mcTrain_stad = mcTrain[mcTrain.diagnosis == 4]

# Bind dataframes (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
mcTrain_gi = pd.concat([mcTrain_hea, mcTrain_crc, mcTrain_esca, mcTrain_hcc, mcTrain_stad])
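The same subset can be selected in one step with `isin`. One difference to be aware of: `isin` keeps the original (shuffled) row order, whereas binding per-class frames groups the rows by class. The sketch below uses a toy frame, and `GI_CODES` is a hypothetical name.

```python
import pandas as pd

# Codes kept in the GI dataset: HEA, CRC, ESCA, HCC, STAD.
GI_CODES = [0, 1, 2, 3, 4]

toy = pd.DataFrame({"diagnosis": [5, 0, 3, 6, 1, 4, 2],
                    "gene_a": range(7)})

# Rows with GBM (5) or BRCA (6) are filtered out; the rest keep their order.
toy_gi = toy[toy["diagnosis"].isin(GI_CODES)]
print(toy_gi)
```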

**Remove Labels (Diagnosis) from the datasets**

In [22]:
mcTrain_gi_x = mcTrain_gi.drop(columns=["diagnosis"])

**Create Labeled Datasets**

In [23]:
mcTrain_gi_y = mcTrain_gi[['diagnosis']]

**Save the GI training sets**

In [25]:
mcTrain_gi.to_csv('02_Processed_Data/GI_Datasets/mcTrain_gi_Full_70_30.csv')
mcTrain_gi_x.to_csv('02_Processed_Data/GI_Datasets/mcTrain_x_gi_Full_70_30.csv')
mcTrain_gi_y.to_csv('02_Processed_Data/GI_Datasets/mcTrain_y_gi_Full_70_30.csv')

# **Combine Stomach + Colon Cancer Patients**
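This section is not implemented yet. One possible approach, shown only as a sketch on toy labels, is to fold the stomach cancer code (STAD, 4) into the colorectal code (CRC, 1). Which code the merged class should carry, and whether the remaining codes should be renumbered afterwards, are open choices; the one made here is an assumption.

```python
import pandas as pd

CRC, STAD = 1, 4  # codes from the encoding step

toy_y = pd.DataFrame({"diagnosis": [0, 1, 4, 3, 4, 2]}, index=range(1, 7))

# Hypothetical choice: relabel every STAD patient as CRC, leaving
# the original frame untouched.
combined = toy_y.copy()
combined["diagnosis"] = combined["diagnosis"].replace({STAD: CRC})
print(combined["diagnosis"].tolist())
```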

# **Remove Breast Cancer + GBM Patients**