Adding Separation Analysis Module #163

gwaybio · 2019-02-09T23:50:59Z

Adding a GTEx and TARGET separation analysis. The figure describes searching for a feature that distinguishes sex in GTEx data and MYCN amplification in TARGET neuroblastoma samples.

gwaybio · 2019-02-13T18:15:42Z

updated figure:

gwaybio · 2019-02-13T18:16:25Z

In the most recent set of commits, I add two analyses.

I add an analysis focusing on detecting patient sex in TCGA data, and I apply the MYCN signature to an external dataset.

cgreene · 2019-02-13T18:18:15Z

Is the GTEx sex variable predictive of TCGA or vice versa? Just curious if you tried it. I'd expect the cancer samples to be vastly different than normal tissue.

gwaybio · 2019-02-13T18:24:10Z

> Is the GTEx sex variable predictive of TCGA or vice versa? Just curious if you tried it. I'd expect the cancer samples to be vastly different than normal tissue.

I did try it and they are not predictive...

GTEx predicting TCGA

TCGA predicting GTEx

I didn't follow this up with any fancy plots

ajlee21

In PR #162 I said that I was surprised that the p-value curves in the plot go up and down as you increase the number of z dimensions. After thinking about it some more, I think it might make sense. If my understanding is correct, when you run your VAE using z=2 and then run it using z=3, you are not guaranteed to get the same latent dimensions because of the stochasticity of the VAE Keras build. For example, if you picked up a neutrophil signal in z=2, you might not see it in z=3 because the features are not cumulative between these models.

ajlee21 · 2019-02-14T01:59:02Z

10.detect-separation/scripts/nbconverted/0.download-validation-data.py

+# In[5]:
+
+
+get_ipython().system(' md5sum "download/2019-01-22-CellLineSTAR-fpkm-2pass_matrix.txt"')


What are you doing with the md5sum? I assume this is a QC step

Yes - and also helpful for someone else who may be running the code to confirm that the data they are using are the same version as described here
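The same check can be done from Python instead of shelling out to `md5sum`. This is a minimal sketch, not the code in the PR: the helper name `md5_of_file` is hypothetical, and the demonstration runs on a throwaway file rather than the real expression matrix, whose expected digest is not given here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large matrices do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file (not the real download)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    demo_path = tmp.name

demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

Comparing the printed digest against one recorded alongside the analysis would flag a changed or truncated download before any downstream step runs.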

ajlee21 · 2019-02-14T02:00:40Z

10.detect-separation/scripts/nbconverted/0.download-validation-data.py

+get_ipython().system(' md5sum "download/2019-01-22-CellLineSTAR-fpkm-2pass_matrix.txt"')
+
+
+# ## Download Phenotype Data


Maybe this is not necessary, but I would just clarify that by phenotype you mean "MYCN status". Not sure if you want to include the specific criteria that are used to define "amplified".

ajlee21 · 2019-02-14T02:02:53Z

10.detect-separation/scripts/nbconverted/0.download-validation-data.py

+# In[12]:
+
+
+# Create a synonym to entrez mapping and add to dictionary


Is there a reason you are replacing the ids with entrez ones?

Entrez IDs are more stable than HUGO symbols, have better mappings to other databases, and all of my other data are in Entrez ID format.
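A rough sketch of what a synonym-to-Entrez mapping can look like, using plain dictionaries. The dictionaries, the helper `to_entrez`, and the handful of genes are illustrative assumptions; the notebook builds its mapping from a gene annotation table that is not shown in this thread.

```python
# Illustrative entries only; a real mapping would be loaded from an annotation table
symbol_to_entrez = {"MYCN": "4613", "TP53": "7157"}
synonym_to_entrez = {"NMYC": "4613", "P53": "7157"}

# Start from synonyms, then let official symbols take precedence on any clash
gene_map = dict(synonym_to_entrez)
gene_map.update(symbol_to_entrez)


def to_entrez(symbols, mapping):
    """Split symbols into (mapped Entrez IDs, symbols with no mapping)."""
    mapped = {s: mapping[s] for s in symbols if s in mapping}
    unmapped = [s for s in symbols if s not in mapping]
    return mapped, unmapped


mapped, unmapped = to_entrez(["MYCN", "P53", "NOT_A_GENE"], gene_map)
print(mapped, unmapped)
```

Returning the unmapped symbols separately makes it easy to see how many genes are dropped by the conversion.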

ajlee21 · 2019-02-14T02:03:38Z

10.detect-separation/scripts/nbconverted/0.download-validation-data.py

+    .sort_index(axis='columns')
+    .sort_index(axis='rows')
+)
+raw_scaled_df.columns = raw_scaled_df.columns.astype(str)


Why are you not keeping the original column names?

All of my other matrices have strings as column names, so any merging or subsetting by column name should assume string labels.
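To illustrate why the cast matters, here is a small sketch with two toy matrices whose columns are the same gene IDs stored as different types. The data frames and values are made up for illustration.

```python
import pandas as pd

# Toy matrices: one parsed the gene IDs as integers, the other as strings
int_cols_df = pd.DataFrame([[0.1, 0.2]], columns=[4613, 7157])
str_cols_df = pd.DataFrame([[0.3, 0.4]], columns=["4613", "7157"])

# Integer and string labels never compare equal, so no columns are shared
shared_before = set(int_cols_df.columns) & set(str_cols_df.columns)

# Casting the labels to str makes the two matrices line up
int_cols_df.columns = int_cols_df.columns.astype(str)
shared_after = set(int_cols_df.columns) & set(str_cols_df.columns)

print(len(shared_before), len(shared_after))
```

Without the cast, subsetting one matrix by the other's column names would silently select nothing or raise a `KeyError`.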

ajlee21 · 2019-02-14T02:05:21Z

10.detect-separation/scripts/nbconverted/1.separate.py

+# 
+# **Gregory Way, 2019**
+# 
+# Perform a t-test to isolate the specific k dimension and algorithm that best distinguishes the two groups.


This wording is a bit confusing

Ah, thanks for pointing out - will address in next commit!
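One way to read the description under discussion: compute a t-statistic for every latent feature across all algorithms and dimensionalities, then keep the feature with the largest absolute value. The sketch below does this with Welch's t-statistic from the standard library only; it is not the notebook's code, the scores are made up, and the keys `(algorithm, k, feature)` are an assumed layout. `scipy.stats.ttest_ind` would additionally return p-values.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Toy latent feature scores keyed by (algorithm, k, feature); values are made up
scores = {
    ("pca", 2, "pca_0"): {"male": [0.1, 0.2, 0.0, 0.3], "female": [0.2, 0.1, 0.3, 0.0]},
    ("vae", 3, "vae_1"): {"male": [2.0, 2.2, 1.9, 2.1], "female": [0.1, 0.0, 0.2, -0.1]},
}

# One statistic per feature, then pick the feature that separates the groups best
t_stats = {key: welch_t(g["male"], g["female"]) for key, g in scores.items()}
best_feature = max(t_stats, key=lambda key: abs(t_stats[key]))
print(best_feature)
```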

ajlee21 · 2019-02-14T02:08:55Z

10.detect-separation/scripts/nbconverted/1.separate.py

+
+
+# Extract male and female ids from the dataset
+example_matrix_df = gtex_z_matrix_dict['signal']['8']['451283']['test']


Is this for testing?

I am just extracting the IDs in this example matrix. I will update the comment to be more clear, thanks!

ajlee21 · 2019-02-14T02:12:43Z

10.detect-separation/scripts/nbconverted/1.separate.py

+
+# coding: utf-8
+
+# # Detect Separation Between Two Phenotypes in TCGA, GTEx, and TARGET Data


Looks like you could possibly make a function to do this since you're doing something similar for the two datasets. Though you may not have time right now and I'm not sure how specific the processing is for the two datasets.

Yeah, I could condense at least one of the pipeline steps into a function, although the processing steps for the phenotype data are distinct.

Will add in next commit!

ajlee21 · 2019-02-14T02:16:01Z

10.detect-separation/scripts/nbconverted/2.apply-mycn-signature.py

+
+vae_seed = '451283'
+vae_k = 200
+vae_feature = "vae_111"


Why is this hard-coded in? I'm not sure how this signature is selected?

It is selected from the 1.separate module. But it is a good idea to remove the hard code! I will draw it in from a file generated in 1.separate instead. Thanks!
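A sketch of the proposed change, assuming the separation step writes a tab-separated table of candidate features. The file name `top_features.tsv`, the column names, and the score values are all hypothetical; only the seed, `k`, and feature name come from the snippet under review.

```python
import csv
import os
import tempfile

# Hypothetical results table from the separation step (scores are made up)
rows = [
    {"algorithm": "vae", "seed": "451283", "z_dim": "200",
     "feature": "vae_111", "neg_log10_p": "30.5"},
    {"algorithm": "pca", "seed": "451283", "z_dim": "50",
     "feature": "pca_3", "neg_log10_p": "12.1"},
]

results_path = os.path.join(tempfile.mkdtemp(), "top_features.tsv")
with open(results_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Downstream script: read the table and take the top row instead of hard-coding it
with open(results_path, newline="") as handle:
    table = list(csv.DictReader(handle, delimiter="\t"))

top = max(table, key=lambda row: float(row["neg_log10_p"]))
vae_seed, vae_k, vae_feature = top["seed"], int(top["z_dim"]), top["feature"]
print(vae_seed, vae_k, vae_feature)
```

Reading the selection from a file keeps the two scripts in sync if the separation analysis is rerun and a different feature comes out on top.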

ajlee21 · 2019-02-14T02:20:31Z

10.detect-separation/scripts/nbconverted/3.visualize-separation.r

+
+mycn_validation_gg
+
+# Plot multi panel figure


I see there is a TARGET NBL box plot and an NBL box plot. Is the NBL plot from normal patients?

The TARGET boxplot is from patients; the other is from cell lines.

The cell line scores were derived from the signature learned from TARGET data
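As a rough illustration of applying a learned signature to an external dataset, the sketch below assumes the signature is a vector of per-gene weights for one latent feature and that a sample's score is the weighted sum of its scaled expression values over shared genes. That scoring rule, the helper `apply_signature`, and every number here are assumptions for illustration, not the notebook's implementation.

```python
# Per-gene weights for one latent feature (toy numbers, keyed by Entrez ID strings)
signature = {"4613": 0.8, "7157": -0.2, "1000": 0.1}

# Scaled expression for external samples; gene "1000" is absent from this dataset
cell_lines = {
    "line_a": {"4613": 1.0, "7157": 0.5},
    "line_b": {"4613": 0.1, "7157": 0.9},
}


def apply_signature(expression, weights):
    """Score each sample as the weighted sum over genes shared with the signature."""
    scores = {}
    for sample, profile in expression.items():
        shared = weights.keys() & profile.keys()
        scores[sample] = sum(weights[gene] * profile[gene] for gene in shared)
    return scores


signature_scores = apply_signature(cell_lines, signature)
print(signature_scores)
```

Restricting the sum to shared genes is one simple way to handle genes that were measured in the training data but not in the external dataset.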

ajlee21 · 2019-02-14T02:21:25Z

10.detect-separation/scripts/nbconverted/3.visualize-separation.r

+                                  hjust = 0.5),
+        legend.position = "none")
+
+tcga_sex_gg


Looks like you have a very long set of warnings here -- any idea what this is?

Yeah, it's because the readxl package warns that a column expected a specific data type.

gwaybio added 6 commits February 9, 2019 16:42

add separation analysis

598a3c3

rename separate script and nbconvert

e1223b4

move r script to notebook

f86bd23

add figure

277acd5

lowercase panel labels

5f0ad79

delete old figures

c0318dc

This was referenced Feb 13, 2019

Validate MYCN status in NBL Cell Lines #165

Closed

Find Sex Feature in TCGA #164

Closed

gwaybio added 9 commits February 13, 2019 11:06

Merge remote-tracking branch 'upstream/master' into separation-analysis

5cdd9df

add separation of TCGA sex

c179d8f

update visualization notebook

99a399e

download nbl cell line dataset

b27cf2a

add application of mycn signature to cell line dataset

123d330

update figure

6557686

update all results

93666a4

add excel functions to environment

3d119a8

rename notebook modules

3d1d22a

gwaybio requested a review from ajlee21 February 13, 2019 18:17

gwaybio changed the title ~~[WIP] Separation Analysis~~ Adding Separation Analysis Module Feb 13, 2019

relabel axis

c688bd9

ajlee21 approved these changes Feb 14, 2019


gwaybio added 3 commits February 14, 2019 16:12

remove hardcoding

a41bb90

add new function, reduce code, and bolster documentation

6209cb0

describe MYCN amplification

f0dbca4

gwaybio merged commit 13c0bfe into greenelab:master Feb 14, 2019

gwaybio deleted the separation-analysis branch February 14, 2019 21:14

gwaybio mentioned this pull request Feb 17, 2019

Generate Supplementary Figure Describing GTEx and TCGA Sex Features #168

Closed
