# **Methylation Biomarkers for Predicting Cancer**

## **Deep Learning Approaches for Cancer Classification - Clinical Data + RF Features**

**Author:** Meg Hutch

**Date:** March 1, 2020

**Objective:** Use neural networks to classify cancer type. 

**Note:** In this version, I will only test the ability of methylation levels to classify cancer types. I will not include phenotypic data for now. Additionally, in this version the data are split 70% for training and 30% for testing. The 70% training set will undergo leave-one-out cross-validation (LOOCV) to tune hyperparameters before final performance is tested on the 30% test set. 

Note: This is the new version of the script, in which gene counts are normalized using DESeq2 in the initial pre-processing script in R. This more than doubled the number of principal components needed to explain 90% of the variance (157). Regardless, we will begin running the deep learning classifier on the revised data. 

Update: This script will include the clinical data.

In [16]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

In [17]:
# set working directory for GitHub
import os
os.chdir('/projects/p31049/Multi_Cancer_DL/')
#os.chdir('C:\\Users\\User\\Box Sync/Projects/Multi_Cancer_DL/')
os.getcwd()

'/projects/p31049/Multi_Cancer_DL'

**Import Training, Testing, and Random Forest feature data**

**Full Dataset**

In [18]:
# Training set
mcTrain_x = pd.read_csv('02_Processed_Data/Final_Datasets/mcTrain_x_Full_70_30.csv')
mcTrain_y = pd.read_csv('02_Processed_Data/Final_Datasets/mcTrain_y_Full_70_30.csv')
# Testing set
mcTest_x = pd.read_csv('02_Processed_Data/Final_Datasets/mcTest_x_Full_70_30.csv')
mcTest_y = pd.read_csv('02_Processed_Data/Final_Datasets/mcTest_y_Full_70_30.csv')

# Random Forest Features
rf_feats = pd.read_csv('02_Processed_Data/Final_Datasets/rf_100feats_FULL_70_30.csv')

In [19]:
#mcTrain_y.head()
#pca_Train.head()
#rf_feats.head()

# **Pre-Process Data**

In [20]:
# rename the first column name of the rf_feats dataframes
rf_feats.rename(columns={'Unnamed: 0':'Gene'}, inplace=True)

**Remove the Importance Column**

In [21]:
rf_feats = rf_feats.drop(columns=["0"])

**Convert id to index**

In [22]:
mcTrain_x = mcTrain_x.set_index('id')
mcTrain_y = mcTrain_y.set_index('id')

mcTest_x = mcTest_x.set_index('id')
mcTest_y = mcTest_y.set_index('id')

**Create separate DF with only Clinical Variables**

In [23]:
mcTrain_clinical_x = mcTrain_x[['dilute_library_concentration', 'age', 'gender', 'frag_mean']]
mcTest_clinical_x = mcTest_x[['dilute_library_concentration', 'age', 'gender', 'frag_mean']]

**Keep only the Genes that were in the rf_feats**

In [24]:
# Create a list of the gene names to keep
rf_genes = rf_feats['Gene'].tolist()
rf_genes

['LMO7DN',
 'EVC',
 'DPYSL2',
 'EPB41',
 'G6PC',
 'HSPB8',
 'PRELID2',
 'MED18',
 'ADGRG5',
 'PAQR9',
 'CCDC13',
 'ADAM2',
 'CECR2',
 'APOL3',
 'NUPR1',
 'PRKAB2',
 'PPP2R5C',
 'ZHX2',
 'TRIM64B',
 'AKR1B10',
 'METAP1',
 'POTEM',
 'BAG5',
 'OR9Q1',
 'C8orf4',
 'XRCC4',
 'FAM134C',
 'PAFAH1B1',
 'POMGNT2',
 'C19orf66',
 'TBC1D2B',
 'CCM2',
 'NOVA1',
 'STAU1',
 'TOP1',
 'CELA2A',
 'RNF214',
 'DIO2',
 'SAE1',
 'LHX4',
 'TRIM68',
 'MARC1',
 'ATP5S',
 'SLC52A1',
 'GGCX',
 'UBE3D',
 'CKAP2',
 'TP53I13',
 'TMCO5A',
 'PARD6A',
 'GATB',
 'CAP2',
 'ARID1B',
 'SPRED1',
 'NCEH1',
 'PCNP',
 'MYRFL',
 'DAAM2',
 'PRR13',
 'GFOD2',
 'GLE1',
 'NOP9',
 'RUFY2',
 'PREX2',
 'DSCR3',
 'LINGO2',
 'CEACAM8',
 'ATP6V0A2',
 'RBM4',
 'GRIA4',
 'TCP11L1',
 'RGS9',
 'COA7',
 'RP11-298I3.5',
 'HBQ1',
 'PCLO',
 'GDI2',
 'NPRL3',
 'IL31RA',
 'NPNT',
 'LYSMD3',
 'ZSCAN29',
 'NAV2',
 'APOE',
 'GABRP',
 'KLF15',
 'SEMA7A',
 'RAB25',
 'RAF1',
 'FAIM',
 'GAS2L3',
 'AL590714.1',
 'RPL29',
 'ORM1',
 'RAP1B',
 'JAM2',
 'TMP

In [25]:
mcTrain_x = mcTrain_x[rf_genes]
mcTest_x = mcTest_x[rf_genes]
mcTrain_x

id,LMO7DN,EVC,DPYSL2,EPB41,G6PC,HSPB8,PRELID2,MED18,ADGRG5,PAQR9,...,GAS2L3,AL590714.1,RPL29,ORM1,RAP1B,JAM2,TMPO,PDX1,HOXA1,FAT4
1,69.438420,274.690219,1748.214328,2400.731241,182.786428,164.405670,1325.456891,136.834533,418.672824,118.453775,...,266.520993,26.549984,10.211532,23.486524,1477.608722,738.293785,695.405349,21.444218,27.571137,1008.899390
2,73.950495,262.808681,1713.376078,2436.953227,220.713784,126.284691,1399.370900,137.661690,469.870067,79.638994,...,211.612185,32.993298,10.239299,28.442498,1457.393596,757.708146,696.272350,17.065499,18.203199,932.913934
3,80.060550,179.828312,1911.599596,2321.755952,230.328044,161.352801,1526.077255,115.779872,493.912009,71.438645,...,216.779336,23.402315,11.085307,36.951023,1416.455886,807.995705,615.850385,23.402315,20.938913,1037.092049
4,92.428914,231.072286,1754.182802,2466.082100,163.225530,153.392667,1492.628640,129.793795,469.027576,93.412201,...,243.855008,34.415021,9.832863,34.415021,1595.873704,812.194503,702.066435,14.749295,25.565444,872.174969
5,80.767629,212.265338,1816.937895,2435.711381,196.912814,156.195249,1632.040100,101.460162,624.113494,83.437633,...,220.942852,29.370047,10.680017,28.702546,1469.837342,790.321259,570.045908,18.022529,27.367544,729.578662
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
238,69.810532,152.520836,1575.289828,2644.453300,214.743267,128.238912,1358.270131,140.379874,504.608736,106.992228,...,179.838001,17.452633,12.140962,41.734557,1475.126891,729.216534,663.200053,12.899772,12.140962,887.807851
239,90.682730,193.456490,1774.358746,2449.945083,204.036142,139.046852,1413.139206,116.376170,510.846044,81.614457,...,266.002674,45.341365,9.068273,33.250334,1588.459150,890.202131,648.381518,12.091031,21.159304,926.475223
240,69.946359,137.394633,1858.574676,2430.635967,184.025539,152.383139,1349.798185,94.094506,540.418891,84.934864,...,195.683266,25.813537,15.821200,15.821200,1385.604059,869.333316,695.300114,12.490421,24.148148,926.789254
241,97.415283,200.919021,1775.393535,2520.620451,204.572095,161.952908,1478.276922,135.163705,500.471017,80.367609,...,226.490533,31.659967,10.959219,30.442276,1687.719780,871.866784,711.131567,9.741528,17.047675,1033.819692
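
Selecting columns with a list raises a `KeyError` if any listed gene is missing from the dataframe. A minimal sketch of a pre-check, using made-up toy data and hypothetical names (`toy_df`, `toy_genes`) rather than the study data:

```python
import pandas as pd

# Toy stand-ins (assumptions for illustration only, not the real data)
toy_df = pd.DataFrame({'EVC': [1.0, 2.0], 'TOP1': [3.0, 4.0], 'age': [50, 60]})
toy_genes = ['EVC', 'TOP1', 'FAT4']

# Split the requested genes into those present in the dataframe and those missing
present = [g for g in toy_genes if g in toy_df.columns]
missing = [g for g in toy_genes if g not in toy_df.columns]

# Subset only on the genes that exist, and report the rest
toy_subset = toy_df[present]
print('Missing genes:', missing)
```

Whether missing genes should be dropped silently or treated as an error depends on the experiment; here they are only reported.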


In [26]:
# merge the RF-selected gene features with the clinical/phenotypic data
mcTrain_x = pd.merge(mcTrain_clinical_x, mcTrain_x, how="left", on="id") 
mcTest_x = pd.merge(mcTest_clinical_x, mcTest_x, how="left", on="id") 
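
The merge above joins on `id`, which is the index name of both dataframes rather than a column. A small sketch of that behavior on toy data (all names and values here are made up for illustration), alongside an index-based `join` that should give the same columns:

```python
import pandas as pd

# Toy frames that share an index named 'id' (assumed values, not the study data)
toy_clinical = pd.DataFrame({'age': [61, 54], 'gender': [0, 1]},
                            index=pd.Index([1, 2], name='id'))
toy_genes = pd.DataFrame({'EVC': [274.7, 262.8], 'TOP1': [101.2, 99.5]},
                         index=pd.Index([1, 2], name='id'))

# 'on' may name an index level as well as a column
toy_merged = pd.merge(toy_clinical, toy_genes, how='left', on='id')

# Index-on-index join, an alternative way to express the same operation
toy_joined = toy_clinical.join(toy_genes, how='left')

print(toy_merged)
```

The clinical columns come first in the result because the clinical dataframe is the left-hand side of the merge.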

**Drop the library concentration (the DESeq2 normalization should perhaps already have controlled for this); simply keep the demographic data for these experiments.**

In [27]:
mcTrain_x = mcTrain_x.drop(columns=["dilute_library_concentration"])
mcTest_x = mcTest_x.drop(columns=["dilute_library_concentration"])

**Normalize Data**

From my reading, normalization (min-max scaling) appears to be preferable to standardization when the data are not normally distributed. 

Normalization rescales each feature into the range [0, 1]. Both the training and test sets need to be normalized.

In [28]:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler returns a NumPy array, so we store the column names and
# indices in order to convert the result back into a dataframe
cols = list(mcTrain_x.columns.values)
index_train = list(mcTrain_x.index)
index_test = list(mcTest_x.index)

# Normalize data: fit the scaler on the training set only, then apply the
# same scaling to the test set so both sets share one transformation
scaler = MinMaxScaler()
mcTrain_x = scaler.fit_transform(mcTrain_x.astype(float))
mcTest_x = scaler.transform(mcTest_x.astype(float))

# Convert back to dataframe
mcTrain_x = pd.DataFrame(mcTrain_x, columns = cols, index = index_train)
mcTest_x = pd.DataFrame(mcTest_x, columns = cols, index = index_test)
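
As a small illustration of min-max scaling, the sketch below uses made-up toy arrays (not the study data). It assumes the common convention of learning the per-column minimum and maximum from the training rows only and reusing them for the test rows, so scaled test values can fall outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the real feature matrices (values are made up)
toy_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
toy_test = np.array([[15.0, 2.0], [40.0, 0.0]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns per-column min and max
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training min and max

print(toy_train_scaled)
print(toy_test_scaled)
```

Each value is mapped to (x - min) / (max - min) using the training statistics, which keeps the test set from influencing the scaling.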

# **Construct & Run Neural Network**

In [29]:
# Import PyTorch packages
import torch
from torch import nn
#from torchvision import datasets, transforms
from torch import optim
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import TensorDataset
import torch.nn.functional as F
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [30]:
mcTrain_x.shape

(242, 103)
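
The 103 columns are the network inputs, and the classifier below maps them to 7 diagnosis classes. A minimal sketch of how raw network outputs (logits), `nn.CrossEntropyLoss`, and class probabilities relate, using random toy inputs and hypothetical names (`toy_model`, `toy_batch`, `toy_labels`); it is an illustration under those assumptions, not the tuned model:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Same layer sizes as the notebook's model: 103 inputs, 50 hidden units, 7 classes
toy_model = nn.Sequential(nn.Linear(103, 50), nn.ReLU(), nn.Linear(50, 7))
toy_batch = torch.rand(4, 103)            # 4 made-up samples
toy_labels = torch.tensor([0, 3, 6, 1])   # made-up integer class labels

# CrossEntropyLoss takes unnormalized logits and integer labels
logits = toy_model(toy_batch)
loss = nn.CrossEntropyLoss()(logits, toy_labels)

# For prediction, switch to eval mode and skip gradient tracking;
# softmax turns logits into probabilities without changing the top class
toy_model.eval()
with torch.no_grad():
    probs = torch.softmax(toy_model(toy_batch), dim=1)
    pred = probs.argmax(dim=1)

print(loss.item(), pred)
```

Because softmax preserves the ordering of the logits, taking the top class from the raw outputs gives the same prediction as taking it from the probabilities.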

In [33]:
# define list for results
results_ls = []

# Where we will store correct/incorrect classifications
incorrect_ls = []
correct_ls = []

# Leave-one-out cross-validation (LOOCV): the for loop iterates through the training set, holding out one sample (patient)
# at a time to create k training/validation splits (where k = total number of samples), each with one sample held out
for index in range(len(mcTrain_x)): # 242 observations (212 when we downsample healthy patients)
    # X - features; 
    mcTrain_xy_drop = mcTrain_x.drop(mcTrain_x.index[index]) # add 'drop'suffix so we can differentiate the df with index and the array that will be created in next line
    mcTrain_xy = np.array(mcTrain_xy_drop, dtype = "float32")
    
    # y - target/outputs
    mcTrain_yz_drop = mcTrain_y.drop(mcTrain_y.index[index]) 
    mcTrain_yz = np.array(mcTrain_yz_drop, dtype = "float32")
    
    # reformat into tensors
    xb = torch.from_numpy(mcTrain_xy)
    yb = torch.from_numpy(mcTrain_yz)
    
    # squeeze - function is used when we want to remove single-dimensional entries from the shape of an array.
    yb = yb.squeeze(1) 
    
    # subset the held-out sample, which serves as the validation set for this iteration
    mcTrain_test_x_drop = mcTrain_x.iloc[[index]] # double brackets keep the single row as a dataframe
    mcTrain_test_x = np.array(mcTrain_test_x_drop, dtype = "float32")
            
    # y - targets/outputs
    mcTrain_test_y_drop = mcTrain_y.iloc[[index]]
    mcTrain_test_y = np.array(mcTrain_test_y_drop, dtype = "float32")
        
    # Convert arrays into tensors
    test_xb = torch.from_numpy(mcTrain_test_x)
    test_yb = torch.from_numpy(mcTrain_test_y)
    
    # Define the batchsize
    batch_size = 32

    # Combine the arrays
    trainloader = TensorDataset(xb, yb)
    
    # Training Loader
    trainloader = DataLoader(trainloader, batch_size, shuffle=True)
    
    ## Build the Model and define hyperparameters
    
    # Set parameters for grid search
    #lrs = [1e-2, 1e-3, 1e-4]
    #epochs = [100, 150, 200, 250]
    
    # summarize experiment with changed parameters (keep in sync with the settings used below)
    summary = ('Hidden Layers: 50, LR: 0.10, Weight Decay: 0.01, Epochs: 100')
    
    # Define the model with hidden layers
    model = nn.Sequential(nn.Linear(103, 50),
                          nn.ReLU(),
                          nn.Linear(50, 7))
                      
    # Set Stochastic Gradient Descent optimizer and the learning rate
    #optimizer = optim.SGD(model.parameters(), lr=0.003)

    # Set Adam optimizer: similar to stochastic gradient descent, but it uses momentum, which can speed up fitting, and it adapts the learning rate for each individual parameter in the model
    optimizer = optim.Adam(model.parameters(), lr=0.10,  weight_decay=0.01) # the momentum terms can also be changed via the betas argument

    # loss function
    criterion = nn.CrossEntropyLoss() #don't use with softmax or sigmoid- PyTorch manual indicates "This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class."
    
    # Set epochs - number of times the entire dataset will pass through the network
    epochs = 100
    for e in range(epochs):
        # Define running loss as 0
        running_loss = 0
        
        # Run the model on each mini-batch (xb, yb) in the trainloader; this is repeated for the number of epochs specified
        for xb, yb in trainloader:
            # clear gradients - otherwise they accumulate across batches
            optimizer.zero_grad()
            # Training pass
            output = model(xb)
            # calculate the loss from the model output compared to the labels
            loss = criterion(output, yb.long()) 
            # backpropagate the loss
            loss.backward()
            # step function to update the weights
            optimizer.step()
        
            running_loss += loss.item() # loss.item() gets the scalar value held in the loss. 
            # += function: Adds the running_loss (0) with loss.item and assigns back to running_loss
        #else:
        #    print("Epoch {}/{}, Training loss: {:.5f}".format(e+1, epochs, running_loss/len(trainloader)))

    # Apply the model to the held-out sample
    # This will enable us to see the predicted score (logit) for each class
    ps = model(test_xb)
    #print('Network Scores', ps)
    
    # Obtain the top prediction
    top_p, top_class = ps.topk(1, dim=1)
    #print('top prediction', top_p)
    #print('true vals', test_yb[:10])
        
    # Drop the grad by using detach
    top_p = top_p.detach().numpy()
    top_class = top_class.detach().numpy()

    # convert to integers (np.int was removed from recent NumPy versions; the built-in int is equivalent)
    top_class = top_class.astype(int)
    test_yb = test_yb.numpy()
    test_yb = test_yb.astype(int)
    
    #print('top class', top_class[:10])
    #print('prediction:', top_class)
    #print('true:', test_yb)
                
    # compare top_class to test_yb
    if top_class == test_yb:                
        results = 1 # prediction and true value are equal
    else: 
        results = 0
    
    # Create if-else statements to identify which classes are being classified correctly/incorrectly
    if results == 0:
        incorrect = test_yb
    else: 
        incorrect = np.array([[999]], dtype=int)
        
    if results == 1:
        correct = test_yb
    else: 
        correct = np.array([[999]], dtype=int)
    #print('Results:', results)
    
    results_ls.append(results)
    incorrect_ls.append(incorrect)
    correct_ls.append(correct)
    #print(results_ls) 

Epoch 1/100, Training loss: 3.30442
Epoch 2/100, Training loss: 2.11192
Epoch 3/100, Training loss: 1.96899
Epoch 4/100, Training loss: 1.91402
Epoch 5/100, Training loss: 1.88137
Epoch 6/100, Training loss: 1.97218
Epoch 7/100, Training loss: 1.90056
Epoch 8/100, Training loss: 1.95280
Epoch 9/100, Training loss: 2.27232
Epoch 10/100, Training loss: 2.06005
Epoch 11/100, Training loss: 1.83292
Epoch 12/100, Training loss: 2.39840
Epoch 13/100, Training loss: 1.94495
Epoch 14/100, Training loss: 1.88024
Epoch 15/100, Training loss: 1.86363
Epoch 16/100, Training loss: 1.82814
Epoch 17/100, Training loss: 1.93240
Epoch 18/100, Training loss: 1.81629
Epoch 19/100, Training loss: 1.95443
Epoch 20/100, Training loss: 1.85725
Epoch 21/100, Training loss: 2.06775
Epoch 22/100, Training loss: 1.88020
Epoch 23/100, Training loss: 1.85579
Epoch 24/100, Training loss: 1.92707
Epoch 25/100, Training loss: 1.91390
Epoch 26/100, Training loss: 1.85030
Epoch 27/100, Training loss: 1.88713
Epoch 28/1

Epoch 28/100, Training loss: 1.38653
Epoch 29/100, Training loss: 1.44738
Epoch 30/100, Training loss: 1.33219
Epoch 31/100, Training loss: 1.37798
Epoch 32/100, Training loss: 1.44653
Epoch 33/100, Training loss: 1.71137
Epoch 34/100, Training loss: 1.62248
Epoch 35/100, Training loss: 1.56335
Epoch 36/100, Training loss: 1.64157
Epoch 37/100, Training loss: 1.48459
Epoch 38/100, Training loss: 1.46099
Epoch 39/100, Training loss: 1.41866
Epoch 40/100, Training loss: 1.52463
Epoch 41/100, Training loss: 1.56321
Epoch 42/100, Training loss: 1.62099
Epoch 43/100, Training loss: 1.65454
Epoch 44/100, Training loss: 1.71527
Epoch 45/100, Training loss: 1.52064
Epoch 46/100, Training loss: 1.52929
Epoch 47/100, Training loss: 1.66222
Epoch 48/100, Training loss: 1.52375
Epoch 49/100, Training loss: 1.58123
Epoch 50/100, Training loss: 1.44801
Epoch 51/100, Training loss: 1.54625
Epoch 52/100, Training loss: 1.52452
Epoch 53/100, Training loss: 1.54358
Epoch 54/100, Training loss: 1.49524
E

Epoch 60/100, Training loss: 1.16783
Epoch 61/100, Training loss: 1.13754
Epoch 62/100, Training loss: 1.37887
Epoch 63/100, Training loss: 1.24133
Epoch 64/100, Training loss: 1.68609
Epoch 65/100, Training loss: 1.91750
Epoch 66/100, Training loss: 1.55094
Epoch 67/100, Training loss: 1.47820
Epoch 68/100, Training loss: 1.35569
Epoch 69/100, Training loss: 1.45494
Epoch 70/100, Training loss: 1.49692
Epoch 71/100, Training loss: 1.45328
Epoch 72/100, Training loss: 1.56484
Epoch 73/100, Training loss: 1.32696
Epoch 74/100, Training loss: 1.38519
Epoch 75/100, Training loss: 1.46327
Epoch 76/100, Training loss: 1.64548
Epoch 77/100, Training loss: 1.42493
Epoch 78/100, Training loss: 1.37890
Epoch 79/100, Training loss: 1.37904
Epoch 80/100, Training loss: 1.40924
Epoch 81/100, Training loss: 1.34656
Epoch 82/100, Training loss: 1.29678
Epoch 83/100, Training loss: 1.52406
Epoch 84/100, Training loss: 1.57880
Epoch 85/100, Training loss: 1.94673
Epoch 86/100, Training loss: 1.44025
E

Epoch 86/100, Training loss: 1.73465
Epoch 87/100, Training loss: 1.87544
Epoch 88/100, Training loss: 1.63667
Epoch 89/100, Training loss: 1.58139
Epoch 90/100, Training loss: 1.53227
Epoch 91/100, Training loss: 1.56995
Epoch 92/100, Training loss: 1.63127
Epoch 93/100, Training loss: 1.71851
Epoch 94/100, Training loss: 1.73676
Epoch 95/100, Training loss: 1.65536
Epoch 96/100, Training loss: 1.64099
Epoch 97/100, Training loss: 1.67530
Epoch 98/100, Training loss: 1.65435
Epoch 99/100, Training loss: 1.60131
Epoch 100/100, Training loss: 1.57208
[0, 0, 0, 0, 1, 1, 1]
Epoch 1/100, Training loss: 2.85963
Epoch 2/100, Training loss: 2.00709
Epoch 3/100, Training loss: 1.87171
Epoch 4/100, Training loss: 1.86537
Epoch 5/100, Training loss: 1.99155
Epoch 6/100, Training loss: 1.90887
Epoch 7/100, Training loss: 1.92961
Epoch 8/100, Training loss: 1.99102
Epoch 9/100, Training loss: 1.85149
Epoch 10/100, Training loss: 1.87840
Epoch 11/100, Training loss: 1.88663
Epoch 12/100, Training l

Epoch 11/100, Training loss: 1.89897
Epoch 12/100, Training loss: 1.71951
Epoch 13/100, Training loss: 1.95685
Epoch 14/100, Training loss: 1.87539
Epoch 15/100, Training loss: 1.78843
Epoch 16/100, Training loss: 1.77666
Epoch 17/100, Training loss: 1.77951
Epoch 18/100, Training loss: 1.95980
Epoch 19/100, Training loss: 2.03632
Epoch 20/100, Training loss: 1.70868
Epoch 21/100, Training loss: 1.77574
Epoch 22/100, Training loss: 1.73280
Epoch 23/100, Training loss: 1.92348
Epoch 24/100, Training loss: 1.84242
Epoch 25/100, Training loss: 1.81331
Epoch 26/100, Training loss: 1.67181
Epoch 27/100, Training loss: 1.90114
Epoch 28/100, Training loss: 1.67077
Epoch 29/100, Training loss: 1.70139
Epoch 30/100, Training loss: 1.69984
Epoch 31/100, Training loss: 1.60200
Epoch 32/100, Training loss: 1.74346
Epoch 33/100, Training loss: 1.58347
Epoch 34/100, Training loss: 1.60745
Epoch 35/100, Training loss: 1.67072
Epoch 36/100, Training loss: 1.58101
Epoch 37/100, Training loss: 1.70690
E

Epoch 47/100, Training loss: 1.39761
Epoch 48/100, Training loss: 1.49277
Epoch 49/100, Training loss: 1.53071
Epoch 50/100, Training loss: 1.48001
Epoch 51/100, Training loss: 1.35393
Epoch 52/100, Training loss: 1.31243
Epoch 53/100, Training loss: 1.41842
Epoch 54/100, Training loss: 1.63653
Epoch 55/100, Training loss: 1.34851
Epoch 56/100, Training loss: 1.31456
Epoch 57/100, Training loss: 1.27257
Epoch 58/100, Training loss: 1.32621
Epoch 59/100, Training loss: 1.25516
Epoch 60/100, Training loss: 1.24797
Epoch 61/100, Training loss: 1.35916
Epoch 62/100, Training loss: 1.60191
Epoch 63/100, Training loss: 1.41693
Epoch 64/100, Training loss: 1.35120
Epoch 65/100, Training loss: 1.25353
Epoch 66/100, Training loss: 1.18410
Epoch 67/100, Training loss: 1.53829
Epoch 68/100, Training loss: 1.52712
Epoch 69/100, Training loss: 1.53021
Epoch 70/100, Training loss: 1.56981
Epoch 71/100, Training loss: 1.32849
Epoch 72/100, Training loss: 1.30470
Epoch 73/100, Training loss: 1.32003
E

KeyboardInterrupt: 
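
The leave-one-out loop above builds each split by dropping one row by position. A minimal sketch of the same splitting scheme using scikit-learn's `LeaveOneOut`, on made-up toy arrays with hypothetical names (`toy_x`, `toy_y`); this is a swapped-in helper for illustration, not the notebook's own loop:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy features and labels (assumed values, 5 samples with 2 features each)
toy_x = np.arange(10, dtype="float32").reshape(5, 2)
toy_y = np.array([0, 1, 0, 1, 2])

loo = LeaveOneOut()
split_sizes = []
for train_idx, test_idx in loo.split(toy_x):
    # All samples except one are used for fitting
    x_fit, y_fit = toy_x[train_idx], toy_y[train_idx]
    # The single remaining sample is held out for evaluation
    x_held, y_held = toy_x[test_idx], toy_y[test_idx]
    split_sizes.append((len(x_fit), len(x_held)))

print(split_sizes)
```

With n samples this yields n splits, so a fresh model is trained n times; that is why the loop over 242 patients at 100 epochs each takes a long time to finish.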

# **Determine LOOCV Mean Error**

In [None]:
percent_correct = sum(results_ls)
percent_correct = percent_correct/len(mcTrain_y)*100
percent_incorrect = 100 - percent_correct
percent_incorrect = round(percent_incorrect, 1)
#print('Percent Error', round(percent_incorrect, 1))
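
The LOOCV error is the share of held-out samples that were misclassified. A short sketch on a made-up results list (`toy_results` is hypothetical), dividing by the number of completed iterations; this matters if the loop is interrupted early, since the results list is then shorter than the training set:

```python
# 1 = held-out sample classified correctly, 0 = misclassified (toy values)
toy_results = [1, 0, 1, 1, 0, 1, 1, 1]

# Accuracy and error as percentages of the iterations that actually ran
toy_accuracy = 100 * sum(toy_results) / len(toy_results)
toy_error = round(100 - toy_accuracy, 1)

print('Percent Error', toy_error)
```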

# **Incorrect Predictions**

In [None]:
## Remove the correct elements from the list to facilitate transforming this list into a dataframe
# First, concatenate all incorrect list elements and format into a dataframe
incorrect_res = np.concatenate(incorrect_ls)
incorrect_res = pd.DataFrame(incorrect_res)
incorrect_res.columns = ['diagnosis']
incorrect_res = incorrect_res[incorrect_res.diagnosis != 999] # 999 are the results that were correct - we remove these

# Count number of incorrect predictions by diagnosis
incorrect_pred = incorrect_res.groupby(['diagnosis']).size()
incorrect_pred = pd.DataFrame(incorrect_pred)
incorrect_pred.columns = ['Count']

# Convert the index to the first column and change the numbers to categorical variables
incorrect_pred.reset_index(level=0, inplace=True)
incorrect_pred['diagnosis'] = incorrect_pred['diagnosis'].map({0: 'HEA', 1: 'CRC', 2: 'ESCA', 3: 'HCC', 4: 'STAD', 5:'GBM', 6:'BRCA'})

# Add a column with the number of cases in each class
mcTrain_y['diagnosis'] = mcTrain_y['diagnosis'].map({0: 'HEA', 1: 'CRC', 2: 'ESCA', 3: 'HCC', 4: 'STAD', 5:'GBM', 6:'BRCA'})
class_size = mcTrain_y.groupby(['diagnosis']).size()
class_size = pd.DataFrame(class_size)
class_size.columns = ['Sample_n']

# bind class_size to the pred df diagnoses
incorrect_pred = pd.merge(incorrect_pred, class_size, how="left", on="diagnosis") 

# Calculate the percent error for each class
incorrect_pred['Count_Perc_Incorrect'] = incorrect_pred['Count']/incorrect_pred['Sample_n']
incorrect_pred['Count_Perc_Incorrect'] = incorrect_pred['Count_Perc_Incorrect'].multiply(100)

# **Correct Predictions**

In [None]:
## Remove the incorrect elements from the list to facilitate transforming this list into a dataframe
# First, concatenate all correct list elements and format into a dataframe
correct_res = np.concatenate(correct_ls)
correct_res = pd.DataFrame(correct_res)
correct_res.columns = ['diagnosis']
correct_res = correct_res[correct_res.diagnosis != 999] # 999 are the results that were incorrect - we remove these

# Count number of correct predictions by diagnosis
correct_pred = correct_res.groupby(['diagnosis']).size()
correct_pred = pd.DataFrame(correct_pred)
correct_pred.columns = ['Count']

# Convert the index to the first column and change the numbers to categorical variables
correct_pred.reset_index(level=0, inplace=True)
correct_pred['diagnosis'] = correct_pred['diagnosis'].map({0: 'HEA', 1: 'CRC', 2: 'ESCA', 3: 'HCC', 4: 'STAD', 5:'GBM', 6:'BRCA'})

# Add a column with the number of cases in each class
class_size = mcTrain_y.groupby(['diagnosis']).size()
class_size = pd.DataFrame(class_size)
class_size.columns = ['Sample_n']

# bind class_size to the pred df diagnoses
correct_pred = pd.merge(correct_pred, class_size, how="left", on="diagnosis") 

# Calculate the percent correct for each class
correct_pred['Count_Perc_Correct'] = correct_pred['Count']/correct_pred['Sample_n']
correct_pred['Count_Perc_Correct'] = correct_pred['Count_Perc_Correct'].multiply(100)
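
The per-class percentages above can also be read off a confusion matrix, which the notebook already imports from scikit-learn. The sketch below assumes that predicted labels were stored alongside the true labels during LOOCV (the loop above stores only the true labels), and uses made-up toy labels with hypothetical names:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted class labels (assumed values, 3 classes)
toy_true = np.array([0, 0, 1, 1, 1, 2])
toy_pred = np.array([0, 1, 1, 1, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2])

# Diagonal = correct predictions; row sums = number of samples per true class
per_class_correct = cm.diagonal() / cm.sum(axis=1) * 100

print(cm)
print(per_class_correct)
```

Unlike separate correct and incorrect counts, the off-diagonal cells also show which class each misclassified sample was assigned to.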

# **Save Predictions**

In [None]:
# convert the error rate to a string so it can be appended to the summary and saved as a txt file
percent_error = '     Percent Error: '
percent_incorrect = str(percent_incorrect)

# append the error rate to the pre-specified summary of the experiment
summary = summary + percent_error + percent_incorrect

#Feb 21, 2020 Test
print(summary)
print(correct_pred)