# **Methylation Biomarkers for Predicting Cancer**

## **Data Pre-Processing**

**Author:** Meg Hutch

**Date:** February 25, 2020

**Objective:** Pre-process data for use in neural networks, random forest, and logistic regression.

**Note:** In this version, I will only test the ability of methylation levels to classify cancer types. I will not include phenotypic data for now. Additionally, this version has our data split 70% for training and 30% for testing. The 70% training data will undergo leave-one-out cross-validation to tune hyperparameters before final performance is tested on the 30% test set.

**Note:** This is the new version of the script, in which gene counts are normalized using DESeq2 in the initial pre-processing script in R. With this normalization, more than twice as many principal components (157) are needed to explain 90% of the variance. Regardless, we will begin running the deep learning classifier on the revised data.




**Notes/ToDos:** The final testing sets will need to be normalized/modified in the same way. For now, I have focused on the various training sets. I also have sections reserved for a few more datasets (combining stomach/colon cancer, and removing GBM + BC); I will work on these after the initial analyses.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

In [2]:
# set working directory for GitHub
import os
#os.chdir('/projects/p31049/Multi_Cancer_DL/')
os.chdir('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/')
os.getcwd()

'C:\\Users\\User\\Box Sync\\Projects\\Multi_Cancer_DL'

**Import Training, Testing, and Principal component data**

In [3]:
# Training set
mcTrain = pd.read_csv('02_Processed_Data/mcTrain_70_30.csv')
# Testing set
mcTest = pd.read_csv('02_Processed_Data/mcTest_70_30.csv')
# Principal Components that make up 90% of the variance of the training set
genesTrain_transformed_90 = pd.read_csv('02_Processed_Data/genesTrain_transformed_157pc_70_30.csv')
# Principal Components projected onto the test set
genesTest_transformed_90 = pd.read_csv('02_Processed_Data/genesTest_transformed_157pc_70_30.csv')
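
The principal-component files loaded above were produced by the earlier R pre-processing script. As a rough illustration of the same idea in Python, the sketch below assumes scikit-learn's `PCA` (not the tool used in the original pipeline) and random toy matrices in place of the gene counts. The components are fit on the training matrix only, and the test matrix is then projected onto them.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: rows are samples, columns are genes
rng = np.random.default_rng(0)
toy_train = rng.normal(size=(50, 20))
toy_test = rng.normal(size=(15, 20))

# A float n_components keeps the fewest components explaining >= 90% of variance
pca = PCA(n_components=0.90, svd_solver="full")
toy_train_pcs = pca.fit_transform(toy_train)  # fit on the training data only
toy_test_pcs = pca.transform(toy_test)        # project test data onto training PCs

print(toy_train_pcs.shape, toy_test_pcs.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))
```

Both outputs have the same number of columns, because the test set reuses the components learned from the training set.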

In [4]:
mcTrain.head()
genesTrain_transformed_90.head()

Unnamed: 0.1,Unnamed: 0,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,...,pc148,pc149,pc150,pc151,pc152,pc153,pc154,pc155,pc156,pc157
0,SEQF2032,-28.945059,-32.403797,-0.519967,-27.074691,-14.202266,19.993461,18.922518,-6.985856,0.358684,...,-5.164339,-0.56049,5.182056,6.927316,0.949892,-4.678334,-3.688339,0.433768,-6.032803,4.715499
1,SEQF2036,-44.217894,24.452308,33.042421,-16.3712,-11.621861,-14.680939,1.362154,-4.139965,-0.503822,...,3.135182,1.609314,0.540895,1.674428,0.780811,6.550943,2.203284,-2.186134,2.337185,-1.654099
2,SEQF2037,0.402742,-2.331201,-0.163246,-27.016531,-3.761961,22.911067,9.722398,2.13608,2.637944,...,-2.626749,0.34314,-3.307787,-1.680439,-2.682968,2.546865,0.303318,-2.718366,-0.65262,1.215243
3,SEQF2038,22.945211,89.824662,-62.017133,-29.163198,3.770062,10.497954,2.137644,1.693253,5.346057,...,3.709008,-0.659335,4.609737,-4.582692,-5.888171,-3.690396,-1.85582,0.342069,0.081236,0.658757
4,SEQF2040,-49.427962,-13.271621,-0.336084,-20.161418,-17.332071,-0.177102,13.728438,-8.602698,0.822969,...,-1.294989,-6.169188,4.229857,0.29418,3.464734,-1.982001,0.224488,1.915049,8.692484,4.023793


# **Pre-Process Data**

In [5]:
# remove genetic data from the mcTrain dataset
mcTrain = mcTrain[['seq_num','diagnosis', 'dilute_library_concentration', 'age', 'gender', 'frag_mean']]

# do the same for the testing set
mcTest = mcTest[['seq_num','diagnosis', 'dilute_library_concentration', 'age', 'gender', 'frag_mean']]

In [6]:
# rename the first column name of the PC dataframes
genesTrain_transformed_90.rename(columns={'Unnamed: 0':'seq_num'}, inplace=True)
genesTest_transformed_90.rename(columns={'Unnamed: 0':'seq_num'}, inplace=True)

In [7]:
# merge PCs with clinical/phenotypic data
mcTrain = pd.merge(mcTrain, genesTrain_transformed_90, how="left", on="seq_num") 
mcTest = pd.merge(mcTest, genesTest_transformed_90, how="left", on="seq_num") 
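
A left merge silently leaves `NaN` in the PC columns for any `seq_num` that has no match in the PC table. The sketch below shows one way to check for this on hypothetical toy frames (the names `clinical` and `pcs` and their values are illustrative only), using the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames standing in for the clinical and PC tables
clinical = pd.DataFrame({"seq_num": ["S1", "S2", "S3"],
                         "diagnosis": ["HEA", "CRC", "GBM"]})
pcs = pd.DataFrame({"seq_num": ["S1", "S2"],
                    "pc1": [0.1, -0.4]})

# indicator=True adds a _merge column; validate raises if seq_num is duplicated
merged = pd.merge(clinical, pcs, how="left", on="seq_num",
                  indicator=True, validate="one_to_one")

# Rows found only in the clinical table have no PC values (NaN)
unmatched = merged.loc[merged["_merge"] == "left_only", "seq_num"].tolist()
print(unmatched)
```

An empty list here would mean every sample in the clinical table found its principal components.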

**Shuffle the training and test sets**

Currently, the samples are ordered by disease state. We don't want to feed them to the network in that order!

In [8]:
# pandas' sample() ignores Python's random.seed(), so pass the seed via random_state
seed = 222020
mcTrain = mcTrain.sample(frac=1, random_state=seed).reset_index(drop=True) # frac = 1 returns all rows in random order
mcTest = mcTest.sample(frac=1, random_state=seed).reset_index(drop=True)
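
`DataFrame.sample` draws from NumPy's random state rather than Python's `random` module, so a reproducible shuffle depends on the `random_state` argument. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy labels, initially sorted by class
toy = pd.DataFrame({"diagnosis": ["HEA"] * 3 + ["CRC"] * 3})

# The same random_state yields the same row order on every run
shuffled_a = toy.sample(frac=1, random_state=222020).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=222020).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```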

**Create a new numeric index and drop seq_num and demographic data for these experiments**

For future code we want the index to be numeric

In [9]:
# Create new ids
mcTrain['id'] = mcTrain.index + 1
mcTest['id'] = mcTest.index + 243

# Drop seq_num and the demographic columns
mcTrain = mcTrain.drop(columns=["seq_num", "dilute_library_concentration", "age", "gender", "frag_mean"])
mcTest = mcTest.drop(columns=["seq_num", "dilute_library_concentration", "age", "gender", "frag_mean"])

**Remove Labels (Diagnosis) from the datasets**

In [10]:
mcTrain_x = mcTrain.drop(columns=["diagnosis"])
mcTest_x = mcTest.drop(columns=["diagnosis"])

**Create Labeled Datasets**

In [11]:
mcTrain_y = mcTrain[['id','diagnosis']]
mcTest_y = mcTest[['id','diagnosis']]

In [12]:
# Examine the unique target variables
mcTrain_y.diagnosis.unique()

array(['ESCA', 'GBM', 'HEA', 'STAD', 'BRCA', 'HCC', 'CRC'], dtype=object)

In [13]:
# Replace each outcome target with numerical value
mcTrain_y = mcTrain_y.replace('HEA', 0)
mcTrain_y = mcTrain_y.replace('CRC', 1)
mcTrain_y = mcTrain_y.replace('ESCA', 2)
mcTrain_y = mcTrain_y.replace('HCC', 3)
mcTrain_y = mcTrain_y.replace('STAD', 4)
mcTrain_y = mcTrain_y.replace('GBM', 5)
mcTrain_y = mcTrain_y.replace('BRCA', 6)

mcTest_y = mcTest_y.replace('HEA', 0)
mcTest_y = mcTest_y.replace('CRC', 1)
mcTest_y = mcTest_y.replace('ESCA', 2)
mcTest_y = mcTest_y.replace('HCC', 3)
mcTest_y = mcTest_y.replace('STAD', 4)
mcTest_y = mcTest_y.replace('GBM', 5)
mcTest_y = mcTest_y.replace('BRCA', 6)
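
The same encoding can be written once as a dictionary and applied with `Series.map`, which may be easier to keep consistent between the training and test sets. This is a sketch on a hypothetical toy frame; `label_codes` is an illustrative name, and the codes match those used above.

```python
import pandas as pd

# Same label-to-code assignment as the cell above
label_codes = {"HEA": 0, "CRC": 1, "ESCA": 2, "HCC": 3,
               "STAD": 4, "GBM": 5, "BRCA": 6}

toy_y = pd.DataFrame({"id": [1, 2, 3], "diagnosis": ["GBM", "HEA", "STAD"]})

# map() returns NaN for any label missing from the dictionary,
# so an unexpected diagnosis is easy to detect
toy_y["diagnosis"] = toy_y["diagnosis"].map(label_codes)
assert toy_y["diagnosis"].notna().all()

print(toy_y["diagnosis"].tolist())
```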

**Convert id to index**

In [14]:
mcTrain_x = mcTrain_x.set_index('id')
mcTrain_y = mcTrain_y.set_index('id')

mcTest_x = mcTest_x.set_index('id')
mcTest_y = mcTest_y.set_index('id')

**Normalize Data**

From my reading, normalization (min-max scaling) is generally a better choice than standardization when the data are not normally distributed.

Normalization rescales our values into the range [0, 1]. Both the training and test sets need to be normalized.

In [15]:
from sklearn.preprocessing import MinMaxScaler

# The scaler returns a NumPy array, so we need to convert the result back to a dataframe
# To do that, store the column names and the index of each set first
cols = list(mcTrain_x.columns.values)
index_train = list(mcTrain_x.index)
index_test = list(mcTest_x.index)

# Normalize data: fit the scaler on the training set only, then apply the
# same min/max to the test set so no test information leaks into the scaling
# (np.float was removed from NumPy; the built-in float is equivalent)
scaler = MinMaxScaler()
mcTrain_x = scaler.fit_transform(mcTrain_x.astype(float))
mcTest_x = scaler.transform(mcTest_x.astype(float))

# Convert back to dataframe
mcTrain_x = pd.DataFrame(mcTrain_x, columns = cols, index = index_train)
mcTest_x = pd.DataFrame(mcTest_x, columns = cols, index = index_test)
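
Min-max normalization maps each value x in a column to (x - min) / (max - min). The sketch below works through this on hypothetical toy data, assuming the minimum and maximum are learned from the training set and reused for the test set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with a single feature column
toy_train = pd.DataFrame({"pc1": [0.0, 5.0, 10.0]}, index=[1, 2, 3])
toy_test = pd.DataFrame({"pc1": [2.5, 12.0]}, index=[4, 5])

scaler = MinMaxScaler()
# Learn the column min (0) and max (10) from the training data only
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)
# Reuse the training min and max for the test data
test_scaled = pd.DataFrame(scaler.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

print(train_scaled["pc1"].tolist())
print(test_scaled["pc1"].tolist())
```

The training values span exactly [0, 1]. The test value 12.0 lies above the training maximum, so it maps to roughly 1.2; test values outside [0, 1] are expected under this scheme.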

# **Save the Main Training and Testing Datasets**

In [16]:
mcTrain_x.to_csv('02_Processed_Data/mcTrain_x_Full_70_30.csv')
mcTrain_y.to_csv('02_Processed_Data/mcTrain_y_Full_70_30.csv')

# **Downsampling the Majority Class**

In [17]:
# Subset healthy patients
class0 = mcTrain_y[mcTrain_y.diagnosis == 0]

# Select only 30 healthy subjects - we will remove these subjects
class0 = class0.head(30)
class0 = class0.index.tolist()

# remove these patients from the main dataframe 
mcTrain_y_ds = mcTrain_y[~mcTrain_y.index.isin(class0)]
#print(mcTrain_y.head(20))

# Print the cases we wanted to remove
#print(class0)

# Observe class distributions in the downsampled set
class0_new = mcTrain_y_ds[mcTrain_y_ds.diagnosis == 0]
#print('# Healthy Subjects', class0_new.shape) # should be 31
#print('# Full Training Set', mcTrain_y.shape)

**Remove excess healthy patients from the input training set and original dataset**

In [18]:
mcTrain_x_ds = mcTrain_x[~mcTrain_x.index.isin(class0)]
#print('# Full Training Set', mcTrain_x)
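
A quick way to confirm the effect of the downsampling is to compare class counts before and after, and to check that the feature and label sets still share the same ids. The sketch below uses hypothetical toy frames (`toy_x`, `toy_y`) and removes two subjects instead of 30.

```python
import pandas as pd

# Hypothetical toy labels and features sharing an id index
toy_y = pd.DataFrame({"diagnosis": [0, 0, 0, 0, 1, 1, 2]},
                     index=pd.Index(range(1, 8), name="id"))
toy_x = pd.DataFrame({"pc1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]},
                     index=toy_y.index)

# ids of the first two healthy (class 0) subjects, to be removed
drop_ids = toy_y[toy_y.diagnosis == 0].head(2).index

# Remove the same ids from both the labels and the features
toy_y_ds = toy_y[~toy_y.index.isin(drop_ids)]
toy_x_ds = toy_x[~toy_x.index.isin(drop_ids)]

counts_before = toy_y.diagnosis.value_counts().sort_index().to_dict()
counts_after = toy_y_ds.diagnosis.value_counts().sort_index().to_dict()
print(counts_before, counts_after)
```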

# **Save the Downsampled Training Data**

In [19]:
mcTrain_x_ds.to_csv('02_Processed_Data/mcTrain_x_ds_70_30.csv')
mcTrain_y_ds.to_csv('02_Processed_Data/mcTrain_y_ds_70_30.csv')

# **Equalize All Classes**

In [20]:
# Subset the patients in each class
class0 = mcTrain_y[mcTrain_y.diagnosis == 0]
class1 = mcTrain_y[mcTrain_y.diagnosis == 1]
class2 = mcTrain_y[mcTrain_y.diagnosis == 2]
class3 = mcTrain_y[mcTrain_y.diagnosis == 3]
class4 = mcTrain_y[mcTrain_y.diagnosis == 4]
class5 = mcTrain_y[mcTrain_y.diagnosis == 5]
class6 = mcTrain_y[mcTrain_y.diagnosis == 6]


# Select the first 20 patients in each class - we will keep these patients
class0 = class0.head(20)
class0 = class0.index.tolist()

class1 = class1.head(20)
class1 = class1.index.tolist()

class2 = class2.head(20)
class2 = class2.index.tolist()

class3 = class3.head(20)
class3 = class3.index.tolist()

class4 = class4.head(20)
class4 = class4.index.tolist()

class5 = class5.head(20)
class5 = class5.index.tolist()

class6 = class6.head(20)
class6 = class6.index.tolist()

# Subset the main mcTrain database with the 20 patients in each class (note: gbm and breast cancer will have 19 and 18)
mcTrain_y_ds_20 = mcTrain_y[mcTrain_y.index.isin(class0)]
mcTrain_y_ds_20_1 = mcTrain_y[mcTrain_y.index.isin(class1)]
mcTrain_y_ds_20_2 = mcTrain_y[mcTrain_y.index.isin(class2)]
mcTrain_y_ds_20_3 = mcTrain_y[mcTrain_y.index.isin(class3)]
mcTrain_y_ds_20_4 = mcTrain_y[mcTrain_y.index.isin(class4)]
mcTrain_y_ds_20_5 = mcTrain_y[mcTrain_y.index.isin(class5)]
mcTrain_y_ds_20_6 = mcTrain_y[mcTrain_y.index.isin(class6)]

# bind all dataframes (DataFrame.append was removed in pandas 2.0, so use pd.concat)
mcTrain_y_ds_20 = pd.concat([mcTrain_y_ds_20, mcTrain_y_ds_20_1, mcTrain_y_ds_20_2,
                             mcTrain_y_ds_20_3, mcTrain_y_ds_20_4, mcTrain_y_ds_20_5,
                             mcTrain_y_ds_20_6])

In [21]:
mcTrain_y_ds_20.shape

(137, 1)

**Modify the feature training x set**

In [22]:
mcTrain_x_ds_20 = mcTrain_x[mcTrain_x.index.isin(class0)]
mcTrain_x_ds_20_1 = mcTrain_x[mcTrain_x.index.isin(class1)]
mcTrain_x_ds_20_2 = mcTrain_x[mcTrain_x.index.isin(class2)]
mcTrain_x_ds_20_3 = mcTrain_x[mcTrain_x.index.isin(class3)]
mcTrain_x_ds_20_4 = mcTrain_x[mcTrain_x.index.isin(class4)]
mcTrain_x_ds_20_5 = mcTrain_x[mcTrain_x.index.isin(class5)]
mcTrain_x_ds_20_6 = mcTrain_x[mcTrain_x.index.isin(class6)]

# bind all dataframes (DataFrame.append was removed in pandas 2.0, so use pd.concat)
mcTrain_x_ds_20 = pd.concat([mcTrain_x_ds_20, mcTrain_x_ds_20_1, mcTrain_x_ds_20_2,
                             mcTrain_x_ds_20_3, mcTrain_x_ds_20_4, mcTrain_x_ds_20_5,
                             mcTrain_x_ds_20_6])

In [23]:
mcTrain_x_ds_20.shape

(137, 157)
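
The per-class selection above can also be expressed with `groupby(...).head(n)`, which keeps at most `n` rows per class and keeps every row of a class that has fewer than `n`. This is a sketch on hypothetical toy frames, not a drop-in replacement: unlike the concatenation above, it preserves the original row order instead of grouping rows by class.

```python
import pandas as pd

# Hypothetical toy labels and features sharing an id index
toy_y = pd.DataFrame({"diagnosis": [0, 0, 0, 1, 1, 2]},
                     index=pd.Index([1, 2, 3, 4, 5, 6], name="id"))
toy_x = pd.DataFrame({"pc1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=toy_y.index)

n_per_class = 2  # the notebook uses 20

# Keep the first n rows of each class, in their original order
toy_y_eq = toy_y.groupby("diagnosis").head(n_per_class)
# Select the matching feature rows by id so x and y stay aligned
toy_x_eq = toy_x.loc[toy_y_eq.index]

counts = toy_y_eq.diagnosis.value_counts().sort_index().to_dict()
print(counts)
```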

# **Save Equal Downsampled Datasets** 

In [24]:
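# Save the equalized (up to 20 per class) sets; the '_ds_20' file names are an
# assumed naming choice so that the downsampled files saved earlier are not overwritten
mcTrain_x_ds_20.to_csv('02_Processed_Data/mcTrain_x_ds_20_70_30.csv')
mcTrain_y_ds_20.to_csv('02_Processed_Data/mcTrain_y_ds_20_70_30.csv')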

# **Combine Stomach + Colon Cancer Patients**

# **Remove Breast Cancer + Colon Cancer**