# TCGA and GTEx Gene Differential Expression Heatmap by Subtype

Differential expression (gene TPM) was computed for every TCGA subtype against TCGA normal tissue of the same type \cite{the_cancer_genome_atlas_research_network_cancer_2013}. Those results were then compared with the differential expression results obtained when the same tumor samples are compared against every other GTEx/TCGA normal tissue. Hover over a square to see the PearsonR concordance value and which subtypes were compared.


While GTEx \cite{consortium_genotype-tissue_2015} contains thousands of normal tissue samples, they can't be compared directly to TCGA due to differences in sequencing depth and laboratory batch effects. Unfortunately, no standard RNA-seq benchmark samples exist that every consortium uses for calibration before processing; such samples would likely lead to fewer batch effects that are easier to correct. Currently available methods typically attempt naive distribution fitting, which tends to work less effectively as the number of samples and classes increases \cite{johnson_adjusting_2007, shaham_removal_2017}. Instead, we can evaluate GTEx as a prior by normalizing for sequencing depth and dispersion, then comparing differential expression results for protein-coding genes between TCGA normals and GTEx normals to see how concordant they are.
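The concordance idea can be illustrated on synthetic data before touching the real dataset. The sketch below is only an approximation of the approach: the `log2fc` helper, its pseudocount of 1, and the simulated expression profiles are all assumptions made for illustration, and they are not taken from `rnaseq_lib`.

```python
import numpy as np
from scipy.stats import pearsonr


def log2fc(a, b, pseudocount=1.0):
    """Log2 fold change of a over b. The pseudocount is an assumption that avoids log(0)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.log2(a + pseudocount) - np.log2(b + pseudocount)


def concordance(tumor, normal_ref, normal_other):
    """Pearson r between tumor/reference-normal and tumor/other-normal fold changes."""
    fc_ref = log2fc(tumor, normal_ref)
    fc_other = log2fc(tumor, normal_other)
    return pearsonr(fc_ref, fc_other)[0]


# Synthetic per-gene median expression (illustrative only, not real data)
rng = np.random.default_rng(0)
n_genes = 2000
normal_ref = rng.lognormal(mean=3.0, sigma=1.0, size=n_genes)
tumor = normal_ref * rng.lognormal(mean=0.0, sigma=0.8, size=n_genes)

# A normal tissue resembling the reference: same profile plus mild noise
similar = normal_ref * rng.lognormal(mean=0.0, sigma=0.2, size=n_genes)
# An unrelated normal tissue: independent expression profile
unrelated = rng.lognormal(mean=3.0, sigma=1.0, size=n_genes)

r_similar = concordance(tumor, normal_ref, similar)
r_unrelated = concordance(tumor, normal_ref, unrelated)
```

In this toy setup a normal tissue that resembles the reference yields a higher concordance than an unrelated one, which is the signal the heatmaps in this notebook are meant to expose.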

In [1]:
import pandas as pd
import rnaseq_lib as r
import holoviews as hv
hv.extension('bokeh', logo=False)

# Synapse ID: syn12009613
data_path = '/mnt/data/Objects/tcga_gtex_data.hd5'
exp = pd.read_hdf(data_path, key='exp')
met = pd.read_hdf(data_path, key='met')

# Add metadata and create holoview wrapper
df = r.data.add_metadata_to_exp(exp, met)
h = r.plot.Holoview(df)

In [2]:
# DE concordance for all tumor types against both TCGA and GTEx normals
de_plot = h.de_concordance().relabel('TCGA and GTEx')

# DE concordance against GTEx normals only (normalize=False)
de_gtex = h.de_concordance(gtex=True, tcga=False, normalize=False).relabel('GTEx')

# DE concordance against TCGA normals only
de_tcga = h.de_concordance(gtex=False, tcga=True).relabel('TCGA')

## PearsonR Gene Correlation Between TCGA Tumor/Normal and Tumor/Other-Normals

In [5]:
%%opts Layout [tabs=True]
%%opts HeatMap.HeatMap.TCGA_and_GTEx [height=600]
%%opts HeatMap.HeatMap.GTEx [width=1000]
%%opts HeatMap.HeatMap.TCGA [width=750]
%%opts HeatMap [height=500 invert_axes=True invert_xaxis=True]
hv.Layout([de_plot, de_gtex, de_tcga])

### Description of Individual HeatMap Squares
Each square in the heatmap above corresponds to a scatter plot of gene fold change concordance, as shown below using **TCGA Lung Squamous Cell Carcinoma** and **GTEx Lung** as an example.

In [21]:
from scipy.stats import pearsonr

t = 'Lung_Squamous_Cell_Carcinoma'
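# Per-gene median expression for TCGA tumor and TCGA normal samples of this type
rcc = df[df.type == t]
rcc_t = rcc[rcc.label == 'tcga-tumor'][h.genes].median()
rcc_n = rcc[rcc.label == 'tcga-normal'][h.genes].median()

# Per-gene median expression for GTEx Lung
gtex = df[df.type == 'Lung'][h.genes].median()

# Log2 fold change of tumor against each normal reference
rcc_l2fc = r.diff_exp.log2fc(rcc_t, rcc_n)
gtex_l2fc = r.diff_exp.log2fc(rcc_t, gtex)

# Pearson correlation between the two fold change vectors
pearson_r = round(pearsonr(rcc_l2fc, gtex_l2fc)[0], 2)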

In [29]:
%%opts Scatter [height=500 width=500] (alpha=0.1 color='red')
hv.Scatter((rcc_l2fc, gtex_l2fc), label='Lung Squamous Cell PearsonR: {}'.format(pearson_r),
          kdims='TCGA Tumor / TCGA Normal Gene Fold Change',
          vdims='TCGA Tumor / GTEx Normal Gene Fold Change')
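
The single-square calculation above extends naturally to a full grid, with one Pearson r per pair of tumor subtype and normal tissue. The following is a minimal sketch of that idea on synthetic data. It is not the implementation behind `h.de_concordance()`: the `concordance_matrix` function, the pseudocount in `log2fc`, and the tissue and subtype names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def log2fc(a, b, pseudocount=1.0):
    """Log2 fold change of a over b. The pseudocount is an assumption, not rnaseq_lib's."""
    return np.log2(a + pseudocount) - np.log2(b + pseudocount)


def concordance_matrix(tumors, matched_normals, all_normals):
    """
    Pearson r of fold changes for every (tumor subtype, normal tissue) pair.

    tumors:          DataFrame, rows = tumor subtypes, columns = genes
    matched_normals: DataFrame, same index as tumors (same-tissue normal)
    all_normals:     DataFrame, rows = candidate normal tissues, columns = genes
    """
    rows = {}
    for subtype in tumors.index:
        # Reference fold change: tumor against its matched normal
        fc_ref = log2fc(tumors.loc[subtype], matched_normals.loc[subtype])
        rows[subtype] = {
            # Series.corr defaults to Pearson correlation
            tissue: fc_ref.corr(log2fc(tumors.loc[subtype], all_normals.loc[tissue]))
            for tissue in all_normals.index
        }
    return pd.DataFrame.from_dict(rows, orient='index')


# Synthetic per-gene medians (illustrative only, not real data)
rng = np.random.default_rng(1)
genes = ['gene_{}'.format(i) for i in range(500)]
tissues = ['Lung', 'Liver', 'Kidney']
subtypes = ['Lung_Tumor', 'Liver_Tumor', 'Kidney_Tumor']

all_normals = pd.DataFrame(rng.lognormal(3.0, 1.0, size=(3, 500)),
                           index=tissues, columns=genes)

# Matched normals: noisy copies of the same tissues, mimicking a second cohort
matched_normals = all_normals * rng.lognormal(0.0, 0.2, size=(3, 500))
matched_normals.index = subtypes

# Tumors: each tissue profile with larger gene-level perturbations
tumors = all_normals * rng.lognormal(0.0, 0.8, size=(3, 500))
tumors.index = subtypes

matrix = concordance_matrix(tumors, matched_normals, all_normals)
```

Each row of `matrix` plays the role of one heatmap row. In this toy data the entry for a tumor and its tissue of origin (for example `matrix.loc['Lung_Tumor', 'Lung']`) is higher than the entries for unrelated tissues.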