# Multi-Omics Imputation

The plan is to do the following:
- Divide the data into train, validation, and test sets. Leave the test set for later.
- Remove the values for a certain omics type from the validation set
- Impute using the train set and the remaining values in the validation set
- Compare these imputed values against the true values
    - Distribution of correlation coefficients
    - Get the mean and stdev of the correlation coefficients
    - Choose best model
- Evaluate on test set
- Choose best method
- Try on independent set
- Finally, train GCN model and see difference between single omics, multi-omics, and imputed multi-omics
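
The comparison step in this plan can be sketched roughly as follows. This is only an illustration on synthetic data: `featurewise_correlations` is a hypothetical helper (it is not part of the `metrics` module used below), and the toy arrays stand in for the true and imputed validation values.

```python
import numpy as np

def featurewise_correlations(truth, imputed):
    """Pearson correlation between true and imputed values, one per feature (column)."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    t = truth - truth.mean(axis=0)
    p = imputed - imputed.mean(axis=0)
    denom = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    # Constant columns (e.g. from mean imputation) have zero variance, so their
    # correlation is undefined; report NaN for those.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom > 0, (t * p).sum(axis=0) / denom, np.nan)

rng = np.random.default_rng(0)
toy_truth = rng.random((50, 4))
toy_imputed = toy_truth + rng.normal(scale=0.1, size=toy_truth.shape)

corrs = featurewise_correlations(toy_truth, toy_imputed)
# Mean and standard deviation of the per-feature correlation coefficients
print(np.nanmean(corrs), np.nanstd(corrs))
```

One caveat with this measure: a method that imputes a constant per feature, such as mean imputation, has no defined correlation with the truth, so an error-based score is still needed alongside it.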

I'll first start with some basics: data import and processing.
Then I'll move to imputing one omics from two.
Then I'll move to imputing two omics from one.
I'll do all the steps above along the way.

## Importing requisite packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from metrics import *

## Importing data

In [2]:
mrna = pd.read_csv("../R/TCGA BRCA/mrna_top1000.csv", index_col=0)
meth = pd.read_csv("../R/TCGA BRCA/meth_top1000.csv", index_col=0)
mirna = pd.read_csv("../R/TCGA BRCA/mirna_anova.csv", index_col=0)

labels = pd.read_csv("../R/TCGA BRCA/PAM50_subtype.csv", index_col=0)

## Basic Data Processing

Combine the three omics tables into one data frame (joined on patient ID) and keep a list recording which omics type each column belongs to.

In [3]:
all_data = pd.merge(pd.merge(mrna, meth, left_index=True, right_index=True), mirna,  left_index=True, right_index=True)

datatypes = ["mrna"]*mrna.shape[1] + ["meth"]*meth.shape[1] + ["mirna"]*mirna.shape[1]
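
`pd.merge` defaults to an inner join, so only patients present in all three tables survive the merge. A minimal sketch with made-up toy tables (the sample IDs and column names are invented for illustration) shows the effect, along with a `pd.concat` call that should give the same rows:

```python
import pandas as pd

toy_mrna = pd.DataFrame({"geneA": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
toy_meth = pd.DataFrame({"cg01": [0.1, 0.2, 0.3]}, index=["s1", "s2", "s4"])
toy_mirna = pd.DataFrame({"mir-1": [5.0, 6.0]}, index=["s2", "s1"])

# Same nested merge as above: inner join on the index
merged = pd.merge(pd.merge(toy_mrna, toy_meth, left_index=True, right_index=True),
                  toy_mirna, left_index=True, right_index=True)

# Alternative: concatenate column-wise, keeping only the shared index labels
concatenated = pd.concat([toy_mrna, toy_meth, toy_mirna], axis=1, join="inner")

# One datatype label per column, in the same order as the merged columns
toy_datatypes = (["mrna"] * toy_mrna.shape[1] + ["meth"] * toy_meth.shape[1]
                 + ["mirna"] * toy_mirna.shape[1])
print(merged)
print(toy_datatypes)
```

Samples `s3` and `s4` are dropped because each is missing from at least one table, so it is worth checking the row count after merging the real data.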

In [27]:
all_data = (all_data-all_data.min())/(all_data.max() - all_data.min())
all_data.head()

patient_id,DBF4|10926,DACH1|1602,BBS4|585,L3MBTL4|91133,TK1|7083,KIAA1370|56204,GPD1L|23171,RERG|85004,RAPGEF3|10411,FBXO36|130888,...,hsa-mir-217,hsa-mir-424,hsa-mir-581,hsa-mir-483,hsa-mir-3614,hsa-mir-16-1,hsa-mir-550a-2,hsa-mir-24-1,hsa-mir-508,hsa-mir-642a
TCGA-D8-A1XU-01,0.27777,0.758159,0.378836,0.356571,0.43489,0.55401,0.49286,0.812734,0.599057,0.787462,...,0.004411,0.078464,0.3125,0.002911,0.001617,0.059937,0.046753,0.256461,0.010449,0.07069
TCGA-D8-A1XV-01,0.513189,0.830736,0.648579,0.674419,0.566643,0.889481,0.607707,0.79719,0.454187,0.844195,...,0.00317,0.012424,0.125,0.000333,0.002426,0.0829,0.038961,0.037773,0.00338,0.032759
TCGA-E9-A1N3-01,0.509665,1.0,0.5532,0.152295,0.63237,0.669888,0.497003,0.78917,0.553367,0.915498,...,0.003446,0.044661,0.6875,0.000832,0.002426,0.06351,0.049351,0.093439,0.053473,0.110345
TCGA-C8-A1HE-01,0.362294,0.879097,0.601582,0.283353,0.540255,0.786171,0.813419,0.758284,0.668789,0.88525,...,0.001792,0.057869,0.125,0.001248,0.004448,0.066019,0.007792,0.04175,0.001537,0.044828
TCGA-A1-A0SQ-01,0.429836,0.69313,0.505465,0.292312,0.608069,0.628752,0.616969,0.737191,0.509752,0.708378,...,0.001103,0.005261,0.0625,8.3e-05,0.003235,0.010478,0.005195,0.00994,0.000615,0.155172


In [4]:
labels.head()

patient_id,cancer_subtype
TCGA-D8-A1XU-01,LumA
TCGA-D8-A1XV-01,LumA
TCGA-E9-A1N3-01,LumA
TCGA-C8-A1HE-01,LumA
TCGA-A1-A0SQ-01,LumA


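Doing the train-validation-test split.
At this point all three sets still contain every value intact.

Here, we do a 60-20-20 split: 20% of the samples are held out as the test set first, and 25% of the remaining 80% then becomes the validation set.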

In [28]:
X_train, X_test, y_train, y_test = train_test_split(all_data, labels, test_size = 0.2, random_state = 42, stratify = labels)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 42, stratify = y_train)


print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

#all_data.head()
#labels.head()
#X_train.head()
#y_train.head()

(372, 2257)
(125, 2257)
(125, 2257)
(372, 1)
(125, 1)
(125, 1)
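
Note that the min-max scaling earlier is computed over all samples before the split, so the validation and test samples contribute to the scaling constants. Whether that matters for this comparison is a judgment call; a variant that derives the constants from the training split only could look like the sketch below. It runs on a small synthetic frame, and `toy_data` and `toy_labels` are placeholders rather than the TCGA tables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
toy_data = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
toy_labels = pd.Series(["LumA", "Basal"] * 50, name="cancer_subtype")

# 60-20-20 split: 20% test first, then 25% of the remaining 80% as validation
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_data, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.25, random_state=42, stratify=y_tr)

# Scaling constants come from the training split only
col_min, col_max = X_tr.min(), X_tr.max()
X_tr_scaled = (X_tr - col_min) / (col_max - col_min)
X_va_scaled = (X_va - col_min) / (col_max - col_min)
X_te_scaled = (X_te - col_min) / (col_max - col_min)

print(X_tr_scaled.shape, X_va_scaled.shape, X_te_scaled.shape)
```

With this variant the validation and test values can fall slightly outside [0, 1], which is expected when the constants are estimated on the training samples alone.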


Removing all miRNA feature values from the validation and test sets, after saving intact copies to serve as ground truth.

In [29]:
# Keep intact copies as ground truth for the evaluation later
X_test_truth = X_test.copy()
X_val_truth = X_val.copy()

# Work on explicit copies so the assignments below do not trigger
# pandas' SettingWithCopyWarning
X_test = X_test.copy()
X_val = X_val.copy()

mask = [x=="mirna" for x in datatypes]
X_test.loc[:,mask] = np.nan
X_val.loc[:,mask] = np.nan

X_test.loc[:,mask].head()


patient_id,hsa-mir-576,hsa-mir-200b,hsa-mir-3687,hsa-mir-126,hsa-mir-26a-2,hsa-mir-101-1,hsa-mir-218-2,hsa-mir-223,hsa-mir-335,hsa-mir-1468,...,hsa-mir-217,hsa-mir-424,hsa-mir-581,hsa-mir-483,hsa-mir-3614,hsa-mir-16-1,hsa-mir-550a-2,hsa-mir-24-1,hsa-mir-508,hsa-mir-642a
TCGA-A2-A3XX-01,,,,,,,,,,,...,,,,,,,,,,
TCGA-BH-A0DI-01,,,,,,,,,,,...,,,,,,,,,,
TCGA-A7-A6VX-01,,,,,,,,,,,...,,,,,,,,,,
TCGA-BH-A0AZ-01,,,,,,,,,,,...,,,,,,,,,,
TCGA-AR-A1AX-01,,,,,,,,,,,...,,,,,,,,,,


# Simple Imputation Methods

In [30]:
from sklearn.impute import SimpleImputer

#Combining the train and validation samples into one dataframe.
print(X_train.shape)
print(X_val.shape)
X = pd.concat([X_train, X_val])
print(X.shape)
X.iloc[370:375, mask]

(372, 2257)
(125, 2257)
(497, 2257)


patient_id,hsa-mir-576,hsa-mir-200b,hsa-mir-3687,hsa-mir-126,hsa-mir-26a-2,hsa-mir-101-1,hsa-mir-218-2,hsa-mir-223,hsa-mir-335,hsa-mir-1468,...,hsa-mir-217,hsa-mir-424,hsa-mir-581,hsa-mir-483,hsa-mir-3614,hsa-mir-16-1,hsa-mir-550a-2,hsa-mir-24-1,hsa-mir-508,hsa-mir-642a
TCGA-D8-A1JB-01,0.079457,0.04615,0.00885,0.152926,0.09912,0.10393,0.080421,0.056304,0.022416,0.021445,...,0.000965,0.10141,0.0,0.007236,0.002022,0.02218,0.033766,0.055666,0.006146,0.005172
TCGA-D8-A1JL-01,0.110465,0.138854,0.017699,0.097901,0.139139,0.027866,0.097684,0.076584,0.062983,0.020316,...,0.115024,0.202149,0.5625,0.00025,0.020218,0.083382,0.033766,0.236581,0.003688,0.008621
TCGA-EW-A2FS-01,,,,,,,,,,,...,,,,,,,,,,
TCGA-BH-A8FZ-01,,,,,,,,,,,...,,,,,,,,,,
TCGA-D8-A1XG-01,,,,,,,,,,,...,,,,,,,,,,


## Imputing with Mean and Median

### Imputing with Mean

In [31]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)

mean_imputed = imp.transform(X_val)
# SimpleImputer returns a numpy.ndarray
# I will convert it to a pandas data frame
#mean_imputed = pd.DataFrame(mean_imputed, columns = X_val.columns, index = X_val.index)


#print(mean_imputed.shape)
#mask = [x=="mirna" for x in datatypes]
#mean_imputed.loc[:,mask].head()

### Imputing with Median

In [32]:
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X_train)

median_imputed = imp.transform(X_val)
#median_imputed = pd.DataFrame(median_imputed, columns = X_val.columns, index = X_val.index)

#print(median_imputed.shape)
#mask = [x=="mirna" for x in datatypes]
#median_imputed.loc[:,mask].head()

In [33]:
mask = [x=="mirna" for x in datatypes]
truth = X_val_truth.loc[:,mask].to_numpy()
random = (np.random.rand(truth.shape[0],truth.shape[1]))# - np.mean(truth))/np.std(truth)

print(nrmse(truth, truth))
print(nrmse(truth, random))
print(nrmse(truth, mean_imputed[:,mask]))
print(nrmse(truth, median_imputed[:,mask]))

0.0
7.255257410410361
1.3514235544024777
1.382975035923613
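
The `nrmse` function comes from the local `metrics` module, whose source is not shown in this notebook. As a rough guide to what such a function computes, here is a hypothetical stand-in, `nrmse_sketch`, that divides the RMSE by the standard deviation of the true values. The normalization used by `metrics.nrmse` may well differ, so its numbers will not necessarily match the scores above.

```python
import numpy as np

def nrmse_sketch(truth, imputed):
    """RMSE divided by the standard deviation of the true values (an assumed normalization)."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    rmse = np.sqrt(np.mean((truth - imputed) ** 2))
    return rmse / np.std(truth)

rng = np.random.default_rng(0)
toy_truth = rng.random((20, 3))

perfect = nrmse_sketch(toy_truth, toy_truth)
# Imputing the global mean everywhere makes the RMSE equal to the standard deviation
global_mean = nrmse_sketch(toy_truth, np.full_like(toy_truth, toy_truth.mean()))
print(perfect, global_mean)
```

Under this particular normalization, a perfect imputation scores 0 and a constant global-mean imputation scores about 1, which gives a convenient reference scale.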


# Slightly More Complicated Methods

In [12]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

Here, every missing value is first replaced with the mean of its feature. Then, iteratively, the estimator predicts each feature from the other features (limited to a subset of `n_nearest_features` of them when that parameter is set), and the predictions replace the initial placeholders. Within each iteration, the features are imputed in a random order.
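
To make the procedure concrete, here is a small sketch on synthetic data in which the third column is close to the average of the first two. It uses scikit-learn's default estimator for `IterativeImputer` rather than the estimators tried below, and all the `toy_*` names are invented for this illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
a = rng.random(200)
b = rng.random(200)
c = 0.5 * a + 0.5 * b + rng.normal(scale=0.01, size=200)
toy = np.column_stack([a, b, c])

# Complete training rows; validation rows with the third column removed
toy_train = toy[:150]
toy_val_truth = toy[150:]
toy_val = toy_val_truth.copy()
toy_val[:, 2] = np.nan

toy_imp = IterativeImputer(initial_strategy="mean", imputation_order="random",
                           random_state=42)
toy_imp.fit(toy_train)
toy_val_imputed = toy_imp.transform(toy_val)

# Largest absolute error on the masked column
print(np.abs(toy_val_imputed[:, 2] - toy_val_truth[:, 2]).max())
```

This mirrors the setup used in this notebook: the imputer is fitted on complete training rows and then asked to fill in a block of features that is entirely missing in the rows it transforms.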

## Estimator: ElasticNet

#### L1 Ratio = 0.2

In [34]:
from sklearn.linear_model import ElasticNet

imp = IterativeImputer(estimator = ElasticNet(l1_ratio = 0.2), initial_strategy = "mean", 
                       imputation_order = "random", random_state = 42,
                      n_nearest_features = 50)
imp.fit(X_train)

elasticnet_l2_imputed = imp.transform(X_val)
#elasticnet_l2_imputed = pd.DataFrame(elasticnet_l2_imputed, columns = X_val.columns, index = X_val.index)
#print(elasticnet_l2_imputed.shape)
#mask = [x=="mirna" for x in datatypes]
#elasticnet_l2_imputed.loc[:,mask].head()

#### L1 Ratio = 0.75

In [35]:
from sklearn.linear_model import ElasticNet

imp = IterativeImputer(estimator = ElasticNet(l1_ratio = 0.75), initial_strategy = "mean", 
                       imputation_order = "random", random_state = 42,
                      n_nearest_features = 50)
imp.fit(X_train)

elasticnet_l1_imputed = imp.transform(X_val)
#elasticnet_l1_imputed = pd.DataFrame(elasticnet_l1_imputed, columns = X_val.columns, index = X_val.index)
#print(elasticnet_l1_imputed.shape)
#mask = [x=="mirna" for x in datatypes]
#elasticnet_l1_imputed.loc[:,mask].head()

In [36]:
print(nrmse(truth, mean_imputed[:,mask]))
print(nrmse(truth, elasticnet_l1_imputed[:,mask]))
print(nrmse(truth, elasticnet_l2_imputed[:,mask]))

1.3514235544024777
1.3514235544024777
1.3514235544024777
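
All three scores are identical, which suggests that both ElasticNet imputers end up predicting the feature mean. A plausible explanation, not verified on the real data, is that `ElasticNet`'s default `alpha=1.0` is a heavy penalty for features scaled to [0, 1] and shrinks every coefficient to zero, leaving only the intercept. The synthetic sketch below reproduces that behaviour; the smaller `alpha=0.001` is an arbitrary value chosen for contrast, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X_toy = rng.random((300, 5))   # features in [0, 1], like the scaled omics data
y_toy = X_toy[:, 0]            # target is an exact copy of the first feature

strong = ElasticNet(l1_ratio=0.2).fit(X_toy, y_toy)               # default alpha=1.0
weak = ElasticNet(alpha=0.001, l1_ratio=0.2).fit(X_toy, y_toy)    # much lighter penalty

print(strong.coef_)   # all zeros: every prediction is just the mean of y_toy
print(weak.coef_)     # the first coefficient is clearly non-zero
```

If the same thing is happening here, lowering `alpha` would be the first thing to try before concluding that ElasticNet cannot beat mean imputation.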


## Estimator: KNeighborsRegressor

#### K = 25

In [37]:
from sklearn.neighbors import KNeighborsRegressor

imp = IterativeImputer(estimator = KNeighborsRegressor(n_neighbors=25), 
                       initial_strategy = "mean", imputation_order = "random", random_state = 42,
                      n_nearest_features = 25)
imp.fit(X_train)

knn25_iter_imputed = imp.transform(X_val)

#### K = 50

In [38]:
from sklearn.neighbors import KNeighborsRegressor

imp = IterativeImputer(estimator = KNeighborsRegressor(n_neighbors=50), 
                       initial_strategy = "mean", imputation_order = "random", random_state = 42,
                      n_nearest_features = 50)
imp.fit(X_train)

knn50_iter_imputed = imp.transform(X_val)

#### K = 75

In [39]:
from sklearn.neighbors import KNeighborsRegressor

imp = IterativeImputer(estimator = KNeighborsRegressor(n_neighbors=75), 
                       initial_strategy = "mean", imputation_order = "random", random_state = 42,
                      n_nearest_features = 75)
imp.fit(X_train)

knn75_iter_imputed = imp.transform(X_val)

#### K = 100

In [40]:
from sklearn.neighbors import KNeighborsRegressor

imp = IterativeImputer(estimator = KNeighborsRegressor(n_neighbors=100), 
                       initial_strategy = "mean", imputation_order = "random", random_state = 42,
                      n_nearest_features = 100)
imp.fit(X_train)

knn100_iter_imputed = imp.transform(X_val)

In [41]:
print(nrmse(truth, mean_imputed[:,mask]))
print(nrmse(truth, knn25_iter_imputed[:,mask]))
print(nrmse(truth, knn50_iter_imputed[:,mask]))
print(nrmse(truth, knn75_iter_imputed[:,mask]))
print(nrmse(truth, knn100_iter_imputed[:,mask]))

1.3514235544024777
1.3373409744512563
1.3343264166172606
1.3348114622116865
1.3341838757891817


## Estimator: RandomForestRegressor

#### Max Depth = 10

In [11]:
from sklearn.ensemble import RandomForestRegressor

imp = IterativeImputer(estimator = RandomForestRegressor(max_depth=10), 
                       initial_strategy = "mean", imputation_order = "random", random_state = 42,
                      n_nearest_features = 200)
imp.fit(X_train)

rf10_imputed = imp.transform(X_val)

#### Max Depth = 20

In [12]:
from sklearn.ensemble import RandomForestRegressor

imp = IterativeImputer(estimator = RandomForestRegressor(max_depth=20), 
                       initial_strategy = "mean", imputation_order = "random", random_state = 42,
                      n_nearest_features = 200)
imp.fit(X_train)

rf20_imputed = imp.transform(X_val)

In [13]:
print(nrmse(truth, mean_imputed[:,mask]))
print(nrmse(truth, rf10_imputed[:,mask]))
print(nrmse(truth, rf20_imputed[:,mask]))

1.3514235544024777
1.3526947398648599
1.353721124349755


# Hot-Deck Imputation

This is not exactly hot-deck imputation, since the missing value is not replaced with a single value taken from an existing sample. Instead, I select the k closest samples and impute each missing value with the average of their values.

I am testing kNN with k = 1, multiples of 5 from 5 to 50, and then 75 and 100.
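
The averaging step can be seen on a tiny made-up example (the numbers are invented): the query row lies close to the first two training rows, so with two neighbours its missing value becomes the mean of their third-column values.

```python
import numpy as np
from sklearn.impute import KNNImputer

toy_train = np.array([
    [0.0, 0.0, 10.0],
    [0.0, 1.0, 20.0],
    [10.0, 10.0, 100.0],
    [10.0, 11.0, 200.0],
])
toy_query = np.array([[0.0, 0.4, np.nan]])

toy_knn = KNNImputer(n_neighbors=2)
toy_knn.fit(toy_train)
toy_filled = toy_knn.transform(toy_query)
print(toy_filled)   # third value is the mean of 10 and 20
```

Distances are computed only on the features that are present in the query row, which is what allows a whole omics block to be missing.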

In [42]:
from sklearn.impute import KNNImputer

knn = {}

for i in [1,5,10,15,20,25,30,35,40,45,50,75,100]:
    imputer = KNNImputer(n_neighbors=i)
    imputer.fit(X_train)

    knn[i] = imputer.transform(X_val)
    #knn[i] = pd.DataFrame(knn[i], columns = X_val.columns, index = X_val.index)

print(len(knn))
#print(knn[1].shape)
#mask = [x=="mirna" for x in datatypes]
#knn[1].loc[:,mask].head()

13


In [43]:
print(nrmse(truth, mean_imputed[:,mask]))
for each in knn.values():
    print(nrmse(truth, each[:,mask]))

1.3514235544024777
1.8065944840222157
1.4431974817135849
1.3911630424398238
1.3747116272019078
1.3675897649810775
1.3579528416185302
1.3511837927083556
1.3507181123020544
1.352475771203293
1.3481180608253138
1.3463990074696408
1.3468734174741535
1.3466326773661295


# Comparing Imputation Methods

To compare the imputation methods, we first need a common measure of their error. Here, I use the Normalized Root Mean Squared Error (NRMSE) between the imputed and the true values, where lower is better and identical arrays score 0.

< Insert NRMSE formula in latex >
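
Since the definition inside `metrics.nrmse` is not shown here, the following gives the general form only, with the normalizing constant left open. With $y_{ij}$ the true value and $\hat{y}_{ij}$ the imputed value for sample $i$ and feature $j$, over $n$ samples and $p$ masked features:

$$
\mathrm{NRMSE} = \frac{1}{s}\sqrt{\frac{1}{np}\sum_{i=1}^{n}\sum_{j=1}^{p}\left(y_{ij}-\hat{y}_{ij}\right)^{2}}
$$

Here $s$ is a normalizing constant; common choices are the standard deviation, the mean, or the range of the true values. Which one applies to the scores in this notebook depends on the `metrics` module.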