# SCML Dimensionality Reduction
#### April 4th, 2018

Investigating t-SNE and TriMap on TCGA and GTEx gene expression data.

- [Preservation of Structure When Subsetting](#Preservation-of-Structure-When-Subsetting)
- [Outliers](#Outliers)
- [Cluster Size and Distance](#Cluster-Size-and-Distance)
- [Mixed Tissues](#Mixed-Tissues)

In [55]:
import pandas as pd
import rnaseq_lib as r
import numpy as np
import holoviews as hv
hv.extension('bokeh', logo=False)

# Synapse ID: syn12009613
exp = pd.read_hdf('/mnt/data/Objects/tcga_gtex_tpm.hd5')
tpm_path = '/mnt/data/Objects/tcga_gtex_tpm_truncatedsvd.hd5'
df = pd.read_hdf(tpm_path)
h = r.plot.Holoview(df)

Missing attributes: 
'5S_rRNA' is not in list


In [38]:
import pandas as pd
import rnaseq_lib as r
import holoviews as hv
hv.extension('bokeh', logo=False)

# Synapse ID: syn12009613
exp = pd.read_hdf('/mnt/data/Objects/tcga_gtex_tpm.hd5')
h_exp = r.plot.Holoview(exp)

df = pd.read_hdf('/mnt/data/Objects/tcga_gtex_tpm_truncatedsvd.hd5')
h = r.plot.Holoview(df)

Missing attributes: 
'5S_rRNA' is not in list


## Preservation of Structure When Subsetting

In [2]:
thyroid_tsne = h.tsne(genes=range(50), tissue_subset=['Thyroid'])
thyroid_trimap = h.trimap(genes=range(50), tissue_subset=['Thyroid'])

In [97]:
%%opts Scatter [color_index='label' width=500] (cmap='Set1' size=5 alpha=0.25) {+axiswise}
%%opts Scatter.Trimap [show_legend=False]
%%opts Scatter.TSNE [legend_position='top_right']
hv.Layout([thyroid_tsne.relabel('TSNE'),
           thyroid_trimap.relabel('Trimap')]).relabel('Thyroid')

Subsetting to only the GTEx thyroid samples (shown in **orange** above) and recomputing both embeddings on that subset:

In [14]:
to = r.plot.Holoview(df[(df.type == 'Thyroid') & (df.label == 'gtex')])
thyroid_only_tsne = to.tsne(genes=range(50))
thyroid_only_trimap = to.trimap(genes=range(50))

Missing attributes: 
'5S_rRNA' is not in list


In [102]:
%%opts Scatter [color_index='type' width=500] (cmap='Set1' size=5 alpha=0.25) {+axiswise}
%%opts Scatter.Trimap [show_legend=False]
%%opts Scatter.TSNE [legend_position='bottom_right']
hv.Layout([thyroid_only_tsne.relabel('TSNE'), thyroid_only_trimap.relabel('Trimap')]).relabel('GTEx Thyroid Samples')

## Outliers

In [90]:
pan = df[df.type == 'Pancreas']
pan = r.plot.Holoview(pan)
#pan_tsne = pan.tsne(genes=range(50))
#pan_trimap = pan.trimap(genes=range(50))
pancreas_tsne = h.tsne(genes=range(50), tissue_subset=['Pancreas'])
pancreas_trimap = h.trimap(genes=range(50), tissue_subset=['Pancreas'])

Missing attributes: 
'5S_rRNA' is not in list


In [103]:
%%opts Scatter [color_index='type' width=500] (cmap='Set1' size=5 alpha=0.25) {+axiswise}
%%opts Scatter.Trimap [show_legend=False]
%%opts Scatter.TSNE [legend_position='bottom_left']
hv.Layout([pancreas_tsne.relabel('TSNE'), pancreas_trimap.relabel('Trimap')])
#hv.Layout([pan_tsne.relabel('TSNE'), pan_trimap.relabel('Trimap')])

Select one outlier sample and one inlier sample, then compare each sample's expression to the median expression of the remaining pancreas samples.

In [76]:
outlier = 'GTEX-13NZA-1726-SM-5J1NA'
normal = 'GTEX-1128S-0826-SM-5GZZI'
pancreas = exp[exp.type == 'Pancreas']

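# Expression rows for the two selected samples
sample_exp = pancreas.loc[outlier]
normal_exp = pancreas.loc[normal]

# Remaining pancreas samples with the selected sample held out
pancreas_no_out = pancreas.drop(outlier, axis=0)
pancreas_no_normal = pancreas.drop(normal, axis=0)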

In [77]:
outlier_diffs = pancreas_no_out[h_exp.genes].apply(r.math.l2norm).median() - sample_exp[4:].apply(r.math.l2norm)
normal_diffs = pancreas_no_normal[h_exp.genes].apply(r.math.l2norm).median() - normal_exp[4:].apply(r.math.l2norm)

In [88]:
hist_outlier = hv.Histogram(np.histogram(outlier_diffs, bins=50), label='Outlier')
hist_norm = hv.Histogram(np.histogram(normal_diffs, bins=50), label='Inlier')

In [93]:
%%opts Histogram [width=500 height=300]
hist_outlier + hist_norm
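
The held-out comparison above can be sketched on toy data without `rnaseq_lib`. This is a minimal sketch, not the notebook's actual computation: the sample names, gene names, and values are made up, and `log2(x + 1)` is only an assumed stand-in for `r.math.l2norm`, whose definition is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy expression table (samples x genes); values are invented.
tpm = pd.DataFrame(
    {
        "gene_a": [10.0, 12.0, 9.0, 11.0, 300.0],
        "gene_b": [100.0, 90.0, 110.0, 105.0, 2.0],
        "gene_c": [50.0, 55.0, 45.0, 52.0, 48.0],
    },
    index=["s1", "s2", "s3", "s4", "s_out"],
)


def log2_plus_one(x):
    """Assumed normalization: log2(x + 1)."""
    return np.log2(x + 1)


def diff_to_median(expr, sample_id):
    """Per-gene median of the remaining samples minus the held-out sample,
    both on the log2(x + 1) scale."""
    held_out = log2_plus_one(expr.loc[sample_id])
    rest = log2_plus_one(expr.drop(sample_id, axis=0))
    return rest.median() - held_out


outlier_diffs = diff_to_median(tpm, "s_out")
inlier_diffs = diff_to_median(tpm, "s1")

# Summarize each sample by its mean absolute per-gene difference.
print(outlier_diffs.abs().mean(), inlier_diffs.abs().mean())
```

In this toy table the sample that deviates on two of three genes ends up with a much larger mean absolute difference than the inlier, which is the same contrast the two histograms are meant to show across all genes.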

## Cluster Size and Distance

Below are t-SNE plots for a mixture of Gaussians in the plane, where one is 10 times as dispersed as the other.

Figures taken from: https://distill.pub/2016/misread-tsne/

![cluster](img/tsne-cluster.png)

The next diagrams show three Gaussians of 50 points each, one pair being 5 times as far apart as another pair.

![distance](img/tsne-dist.png)
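
To experiment with these two effects, synthetic inputs like the ones described above can be regenerated. This is a minimal sketch using NumPy only. The cluster centers, the random seed, the gap sizes, and the 75 points per cluster in the first dataset are assumptions, not values taken from the distill.pub article.

```python
import numpy as np

rng = np.random.default_rng(42)


def gaussian_cluster(center, scale, n_points):
    """Draw n_points from an isotropic 2-D Gaussian around center."""
    return rng.normal(loc=center, scale=scale, size=(n_points, 2))


# Dataset 1: two clusters, one 10 times as dispersed as the other.
tight = gaussian_cluster(center=(0.0, 0.0), scale=1.0, n_points=75)
wide = gaussian_cluster(center=(50.0, 0.0), scale=10.0, n_points=75)
size_data = np.vstack([tight, wide])

# Dataset 2: three clusters of 50 points each placed on a line; the gap
# between the last two is 5 times the gap between the first two.
centers = [(0.0, 0.0), (10.0, 0.0), (60.0, 0.0)]
dist_data = np.vstack([gaussian_cluster(c, 1.0, 50) for c in centers])
dist_labels = np.repeat(np.arange(3), 50)

# Geometry of the inputs, to compare against whatever an embedding shows.
dispersion_ratio = wide.std(axis=0).mean() / tight.std(axis=0).mean()
centroids = np.array(
    [dist_data[dist_labels == k].mean(axis=0) for k in range(3)]
)
gap_near = np.linalg.norm(centroids[1] - centroids[0])
gap_far = np.linalg.norm(centroids[2] - centroids[1])
gap_ratio = gap_far / gap_near

print(dispersion_ratio, gap_ratio)
```

Passing `size_data` or `dist_data` to a t-SNE implementation (for example `sklearn.manifold.TSNE(n_components=2).fit_transform(...)`) and measuring the same two ratios in the embedded coordinates gives a direct check of how much of the input geometry the embedding retains at a given perplexity.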

## Mixed Tissues

In [None]:
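# Copy the subset so the column assignment below does not write to a view of df
ut = df[(df.tissue == 'Uterus') | (df.tissue == 'Thyroid')].copy()
# Truncate type names to 15 characters
ut['type'] = ut['type'].apply(lambda x: x[:15])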
ut = r.plot.Holoview(ut)
ut_tsne = ut.tsne(genes=range(50))
ut_trimap = ut.trimap(genes=range(50))

In [111]:
%%opts Scatter [color_index='type' width=500] (cmap='Set1' size=5 alpha=0.25) {+axiswise}
%%opts Scatter.Trimap [show_legend=False]
%%opts Scatter.TSNE [legend_position='left' width=750]
hv.Layout([ut_tsne.relabel('TSNE'), ut_trimap.relabel('Trimap')])