## Use pre-trained models to make predictions on normal tissue samples

For some cancer types, TCGA provides samples from normal tissue in addition to the tumor samples (see `01_explore_data/normal_tissue_samples.ipynb`).

In this analysis, we want to make predictions on those samples and compare them to our tumor sample predictions.

We expect the models to assign normal tissue samples a low predicted probability of mutation, since these samples almost certainly carry no somatic mutations in any of the genes of interest.
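The tumor/normal comparison described above can be sketched on synthetic data. This is a minimal illustration, not the analysis code used in this notebook: `compare_groups` is a hypothetical helper (it is not part of `mpmp`), and the probabilities are randomly generated rather than produced by the trained models.

```python
import numpy as np
import pandas as pd

def compare_groups(probs, sample_types, threshold=0.5):
    """Summarize predicted mutation probabilities per sample group.

    Hypothetical helper for illustration only; not part of mpmp.
    """
    df = pd.DataFrame({'prob': probs, 'group': sample_types})
    summary = df.groupby('group')['prob'].agg(['count', 'mean', 'median'])
    # fraction of samples in each group predicted "mutated" at the threshold
    summary['frac_positive'] = (
        df.assign(pos=df['prob'] > threshold).groupby('group')['pos'].mean()
    )
    return summary

# synthetic probabilities (assumption): tumors spread over [0, 1],
# normals skewed toward 0
rng = np.random.default_rng(42)
tumor_probs = rng.uniform(0.0, 1.0, size=200)
normal_probs = rng.beta(1.0, 8.0, size=50)

summary = compare_groups(
    np.concatenate([tumor_probs, normal_probs]),
    ['tumor'] * 200 + ['normal'] * 50,
)
print(summary)
```

If the expectation holds for the real models, the `normal` row should show a lower mean probability and a smaller fraction of positive calls than the `tumor` row.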

In [6]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mpmp.config as cfg
import mpmp.utilities.analysis_utilities as au
import mpmp.utilities.data_utilities as du
import mpmp.utilities.plot_utilities as plu

%load_ext autoreload
%autoreload 2



In [7]:
results_dir = Path(cfg.results_dirs['final'],
                   'pilot_genes',
                   'gene').resolve()

# each subdirectory of results_dir is named after a gene
genes = [g.name for g in results_dir.iterdir() if g.is_dir()]
print(genes)

['TP53', 'EGFR', 'IDH1', 'PIK3CA', 'SETD2', 'KRAS']


In [8]:
# load expression sample info, which includes tumor/normal labels
sample_info_df = du.load_sample_info('expression')
print(sample_info_df.sample_type.unique())
sample_info_df.head()

['Primary Solid Tumor' 'Recurrent Solid Tumor' 'Solid Tissue Normal'
 'Additional - New Primary' 'Metastatic'
 'Primary Blood Derived Cancer - Peripheral Blood' 'Additional Metastatic']


| sample_id | sample_type | cancer_type | id_for_stratification |
|---|---|---|---|
| TCGA-02-0047-01 | Primary Solid Tumor | GBM | GBMPrimary Solid Tumor |
| TCGA-02-0055-01 | Primary Solid Tumor | GBM | GBMPrimary Solid Tumor |
| TCGA-02-2483-01 | Primary Solid Tumor | GBM | GBMPrimary Solid Tumor |
| TCGA-02-2485-01 | Primary Solid Tumor | GBM | GBMPrimary Solid Tumor |
| TCGA-02-2486-01 | Primary Solid Tumor | GBM | GBMPrimary Solid Tumor |
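Since only some cancer types include normal samples, it can help to tabulate sample types per cancer type. The sketch below uses a toy stand-in for `sample_info_df`: the rows and sample IDs are made up, and only the column names match the table above.

```python
import pandas as pd

# toy stand-in for sample_info_df (made-up rows, same column names)
toy_info_df = pd.DataFrame(
    {
        'sample_type': ['Primary Solid Tumor', 'Solid Tissue Normal',
                        'Primary Solid Tumor', 'Primary Solid Tumor',
                        'Solid Tissue Normal'],
        'cancer_type': ['BRCA', 'BRCA', 'GBM', 'LUAD', 'LUAD'],
    },
    index=pd.Index(['S1', 'S2', 'S3', 'S4', 'S5'], name='sample_id'),
)

# rows are cancer types, columns are sample types, values are sample counts
counts_df = pd.crosstab(toy_info_df.cancer_type, toy_info_df.sample_type)

# keep only the cancer types with at least one normal sample
normal_counts = counts_df['Solid Tissue Normal']
print(normal_counts[normal_counts > 0])
```

Applied to the real `sample_info_df`, the same `pd.crosstab` call would show which cancer types have enough normal samples to be worth comparing.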


In [9]:
# load expression data
data_df = du.load_raw_data('expression', verbose=True)
print(data_df.shape)
data_df.iloc[:5, :5]

Loading expression data...


(11060, 15369)


| sample_id | 1 | 100 | 1000 | 10000 | 10001 |
|---|---|---|---|---|---|
| TCGA-02-0047-01 | 125.0 | 136.0 | 2300.0 | 1300.0 | 272.0 |
| TCGA-02-0055-01 | 392.0 | 222.0 | 1820.0 | 903.0 | 321.0 |
| TCGA-02-2483-01 | 272.0 | 256.0 | 2890.0 | 1320.0 | 458.0 |
| TCGA-02-2485-01 | 83.9 | 129.0 | 6970.0 | 10100.0 | 419.0 |
| TCGA-02-2486-01 | 108.0 | 205.0 | 2250.0 | 873.0 | 441.0 |


In [11]:
normal_ids = (
    sample_info_df[sample_info_df.sample_type.str.contains('Normal')]
      .index
      .intersection(data_df.index)
)
print(len(normal_ids))
print(normal_ids[:5])

737
Index(['TCGA-06-0675-11', 'TCGA-06-0678-11', 'TCGA-06-0680-11',
       'TCGA-06-0681-11', 'TCGA-06-AABW-11'],
      dtype='object', name='sample_id')
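The `-11` suffix on these IDs is the TCGA barcode sample type code for solid tissue normal. In the barcode scheme, codes 01 to 09 denote tumor samples and 10 to 19 denote normal samples, so the selection above can be cross-checked from the IDs alone. A minimal sketch follows; `sample_type_code` and `is_normal` are hypothetical helpers, not part of `mpmp`.

```python
def sample_type_code(sample_id):
    """Return the two-digit sample type code from a TCGA sample ID."""
    # fields: project, tissue source site, participant, sample (+ optional vial)
    return int(sample_id.split('-')[3][:2])

def is_normal(sample_id):
    # codes 10-19 denote normal samples in the TCGA barcode scheme
    return 10 <= sample_type_code(sample_id) <= 19

# one normal and one tumor ID taken from the outputs in this notebook
ids = ['TCGA-06-0675-11', 'TCGA-02-0047-01']
print([is_normal(s) for s in ids])
```

Running `all(is_normal(s) for s in normal_ids)` on the real index would confirm that the `sample_type` labels and the barcodes agree.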
