## Use pre-trained models to make predictions on normal tissue samples

For some cancer types, TCGA provides samples from normal tissue in addition to the tumor samples (see `01_explore_data/normal_tissue_samples.ipynb`).

In this analysis, we want to make predictions on those samples and compare them to our tumor sample predictions.

We expect the models to assign the normal tissue samples a low probability of mutation, since those samples almost certainly carry no somatic mutations in any of the genes of interest.
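As a rough sketch of the comparison we have in mind, the snippet below summarizes predicted probabilities by sample type. The column names (`sample_type`, `positive_prob`), the helper `summarize_by_sample_type`, and the probability values are all hypothetical placeholders invented for illustration; they are not results from the models in this notebook.

```python
import pandas as pd

# Placeholder predictions, invented for illustration only (not real results)
preds_df = pd.DataFrame({
    'sample_type': ['tumor', 'tumor', 'tumor', 'normal', 'normal'],
    'positive_prob': [0.9, 0.2, 0.7, 0.1, 0.3],
})

def summarize_by_sample_type(df, threshold=0.5):
    """Summarize predicted mutation probabilities for each sample type."""
    grouped = df.groupby('sample_type')['positive_prob']
    return pd.DataFrame({
        'n_samples': grouped.size(),
        'median_prob': grouped.median(),
        # fraction of samples whose probability exceeds the threshold
        'frac_predicted_mutated': grouped.apply(
            lambda p: (p > threshold).mean()
        ),
    })

summary_df = summarize_by_sample_type(preds_df)
print(summary_df)
```

If the expectation above holds, the `frac_predicted_mutated` value for normal samples should be close to zero and well below the value for tumor samples.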

In [18]:
from pathlib import Path
import pickle as pkl

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mpmp.config as cfg
import mpmp.utilities.analysis_utilities as au
import mpmp.utilities.data_utilities as du
import mpmp.utilities.plot_utilities as plu
import mpmp.utilities.tcga_utilities as tu

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
results_dir = Path(cfg.results_dirs['final'],
                   'pilot_genes',
                   'gene').resolve()

genes = [g.stem for g in results_dir.iterdir() if not g.is_file()]
print(genes)

['TP53', 'EGFR', 'IDH1', 'PIK3CA', 'SETD2', 'KRAS']


### Load pre-trained model

In [23]:
gene = 'TP53'
training_data = 'expression'

model_filename = '{}_{}_elasticnet_classify_s42_model.pkl'.format(gene, training_data)

with open(str(results_dir / gene / model_filename), 'rb') as f:
    model_fit = pkl.load(f)

print(model_fit)
print(model_fit.feature_names_in_.shape)

SGDClassifier(alpha=0.1, class_weight='balanced', l1_ratio=0.1, loss='log',
              penalty='elasticnet', random_state=42)
(15389,)
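Before predicting on new samples, it can help to check which of the model's input features are actually present in the data matrix. The sketch below does this with a hypothetical helper, `compare_features`, on toy inputs. The covariate names in the toy feature list are assumptions for illustration, and the cast to strings is there only because we have not confirmed whether the gene ID columns are stored as integers or strings.

```python
import pandas as pd

def compare_features(model_features, data_columns):
    """Return (features the model expects but the data lacks,
    data columns the model does not use)."""
    # cast both sides to str so that 100 and '100' compare as equal
    model_features = pd.Index(model_features).astype(str)
    data_columns = pd.Index(data_columns).astype(str)
    missing = model_features.difference(data_columns)
    unused = data_columns.difference(model_features)
    return missing, unused

# Toy inputs: gene IDs plus two hypothetical covariate names
toy_model_features = ['1', '100', '1000', 'log10_mut', 'GBM']
toy_data_columns = [1, 100, 1000, 10000]

missing, unused = compare_features(toy_model_features, toy_data_columns)
print('expected by model, absent from data:', list(missing))
print('present in data, unused by model:', list(unused))
```

In this notebook the equivalent call would presumably take `model_fit.feature_names_in_` and the columns of the expression matrix. Features reported as missing would then be candidates for the non-gene covariates (mutation burden and cancer type indicators) that have to be constructed separately for the normal samples.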


### Load expression data and sample info

In [3]:
# load expression sample info, this has tumor/normal labels
sample_info_df = du.load_sample_info(training_data)
print(sample_info_df.sample_type.unique())
sample_info_df.head()

['Primary Solid Tumor' 'Recurrent Solid Tumor' 'Solid Tissue Normal'
 'Additional - New Primary' 'Metastatic'
 'Primary Blood Derived Cancer - Peripheral Blood' 'Additional Metastatic']


sample_id,sample_type,cancer_type,id_for_stratification
TCGA-02-0047-01,Primary Solid Tumor,GBM,GBMPrimary Solid Tumor
TCGA-02-0055-01,Primary Solid Tumor,GBM,GBMPrimary Solid Tumor
TCGA-02-2483-01,Primary Solid Tumor,GBM,GBMPrimary Solid Tumor
TCGA-02-2485-01,Primary Solid Tumor,GBM,GBMPrimary Solid Tumor
TCGA-02-2486-01,Primary Solid Tumor,GBM,GBMPrimary Solid Tumor


In [4]:
# load expression data
data_df = du.load_raw_data('expression', verbose=True)
print(data_df.shape)
data_df.iloc[:5, :5]

Loading expression data...


(11060, 15369)


sample_id,1,100,1000,10000,10001
TCGA-02-0047-01,125.0,136.0,2300.0,1300.0,272.0
TCGA-02-0055-01,392.0,222.0,1820.0,903.0,321.0
TCGA-02-2483-01,272.0,256.0,2890.0,1320.0,458.0
TCGA-02-2485-01,83.9,129.0,6970.0,10100.0,419.0
TCGA-02-2486-01,108.0,205.0,2250.0,873.0,441.0


In [5]:
normal_ids = (
    sample_info_df[sample_info_df.sample_type.str.contains('Normal')]
      .index
      .intersection(data_df.index)
)
print(len(normal_ids))
print(normal_ids[:5])

737
Index(['TCGA-06-0675-11', 'TCGA-06-0678-11', 'TCGA-06-0680-11',
       'TCGA-06-0681-11', 'TCGA-06-AABW-11'],
      dtype='object', name='sample_id')
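The normal sample IDs above end in `-11`, while the tumor IDs shown earlier end in `-01`. In the TCGA barcode scheme the fourth hyphen-separated field begins with a two-digit sample type code, with codes 10 through 19 reserved for normal samples. The sketch below uses that convention as an independent cross-check on the `sample_type` labels; the helper names are hypothetical and not part of `mpmp`.

```python
def tcga_sample_type_code(sample_id):
    """Return the two-digit sample type code from a TCGA sample barcode."""
    # barcode fields are hyphen-separated; the fourth field starts
    # with the sample type code (e.g. '01' or '11')
    return int(sample_id.split('-')[3][:2])

def is_normal_barcode(sample_id):
    """True if the barcode's sample type code falls in the normal range."""
    return 10 <= tcga_sample_type_code(sample_id) <= 19

example_ids = ['TCGA-02-0047-01', 'TCGA-06-0675-11']
normal_flags = [is_normal_barcode(s) for s in example_ids]
print(dict(zip(example_ids, normal_flags)))
```

Applying `is_normal_barcode` to every entry of `normal_ids` and confirming that all results are `True` would be one way to validate the string match on `'Normal'`.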


In [6]:
normal_data_df = data_df.loc[normal_ids, :]
print(normal_data_df.shape)
normal_data_df.iloc[:5, :5]

(737, 15369)


sample_id,1,100,1000,10000,10001
TCGA-06-0675-11,88.5,43.6,2090.0,3250.0,312.0
TCGA-06-0678-11,58.5,101.0,2060.0,2490.0,317.0
TCGA-06-0680-11,80.8,53.7,2170.0,3090.0,291.0
TCGA-06-0681-11,141.0,148.0,1910.0,2210.0,311.0
TCGA-06-AABW-11,258.0,208.0,1560.0,786.0,251.0


### Preprocessing for normal samples

Preprocessing is less straightforward here than for the tumor samples: we have no mutation calls for the normal samples, so we can't compute a log10(mutation burden) covariate for them directly.

For now, we take the mean log10 mutation burden across the tumor samples and assign that value to every normal sample.

In [9]:
# load mutation data
pancancer_data = du.load_pancancer_data()
(sample_freeze_df,
 mutation_df,
 copy_loss_df,
 copy_gain_df,
 mut_burden_df) = pancancer_data

In [12]:
print(mut_burden_df.shape)
mut_burden_df.head()

(9074, 1)


SAMPLE_BARCODE,log10_mut
TCGA-02-0047-01,1.812913
TCGA-02-0055-01,1.70757
TCGA-02-2483-01,1.662758
TCGA-02-2485-01,1.748188
TCGA-02-2486-01,1.755875


In [13]:
mean_mutation_burden = mut_burden_df.sum() / mut_burden_df.shape[0]
print(mean_mutation_burden)

log10_mut    1.834867
dtype: float64


In [17]:
# construct covariate matrix for normal samples
y_normal_df = pd.DataFrame(
    {'log10_mut': mean_mutation_burden.values[0]},
    index=normal_ids
)
# add cancer type
# TODO: this needs to use the same dummies as the training data,
# how to do that?
# 1) if normal samples are not in the set of cancer types the model
#    was trained on, drop them
# 2) if normal samples are in the set of cancer types the model
#    was trained on, set the dummies *the same way*
y_normal_df = (y_normal_df
    .merge(sample_info_df, left_index=True, right_index=True)
    .drop(columns=['id_for_stratification'])
    .rename(columns={'cancer_type': 'DISEASE'})
)
print(y_normal_df.shape)
y_normal_df.head()

(737, 3)


sample_id,log10_mut,sample_type,DISEASE
TCGA-06-0675-11,1.834867,Solid Tissue Normal,GBM
TCGA-06-0678-11,1.834867,Solid Tissue Normal,GBM
TCGA-06-0680-11,1.834867,Solid Tissue Normal,GBM
TCGA-06-0681-11,1.834867,Solid Tissue Normal,GBM
TCGA-06-AABW-11,1.834867,Solid Tissue Normal,GBM
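One possible way to handle the TODO in the cell above is sketched here with plain pandas: drop samples whose cancer type was not seen in training, then one-hot encode the rest and reindex to the training columns so the indicator columns match in both name and order. This is a minimal sketch under assumptions, not the approach `mpmp` itself uses. The helper `align_cancer_type_dummies` is hypothetical, and we assume the training indicator columns are named by the bare cancer type label, which should be checked against `model_fit.feature_names_in_`.

```python
import pandas as pd

def align_cancer_type_dummies(covariate_df, training_types,
                              disease_col='DISEASE'):
    """One-hot encode disease_col using exactly the training cancer types."""
    # (1) drop samples whose cancer type the model never saw in training
    kept_df = covariate_df[covariate_df[disease_col].isin(training_types)]
    # (2) encode the remaining samples, then force the same columns and
    #     column order as training; types with no samples become all zeros
    dummies = (
        pd.get_dummies(kept_df[disease_col])
          .reindex(columns=training_types, fill_value=0)
          .astype(int)
    )
    return kept_df.drop(columns=[disease_col]).join(dummies)

# Toy covariates; 'XYZ' stands in for a cancer type absent from training
toy_cov_df = pd.DataFrame(
    {'log10_mut': [1.8, 1.8, 1.8], 'DISEASE': ['GBM', 'BRCA', 'XYZ']},
    index=['s1', 's2', 's3'],
)
toy_training_types = ['BRCA', 'GBM', 'LUAD']  # assumed training order

aligned_df = align_cancer_type_dummies(toy_cov_df, toy_training_types)
print(aligned_df)
```

Reindexing against a fixed column list, rather than calling `pd.get_dummies` on the normal samples alone, is what keeps the encoding stable when some training cancer types have no normal samples.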


In [8]:
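# add covariates
# NOTE: the second argument is the label/covariate DataFrame. Passing None
# raises the AttributeError shown below, apparently because the function
# reads its DISEASE column. y_normal_df (built above) is presumably the
# intended input once its cancer type encoding matches the training data.
normal_data_cov_df = tu.align_matrices(
    normal_data_df,
    None,
)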
print(normal_data_cov_df.shape)
normal_data_cov_df.iloc[:5, -5:]

AttributeError: 'NoneType' object has no attribute 'DISEASE'