Adding Separation Analysis Module #163

Merged: 19 commits, Feb 14, 2019


gwaybio (Collaborator) commented Feb 9, 2019

Adding a GTEx and TARGET separation analysis. The figure describes searching for a feature that distinguishes sex in GTEx data and MYCN amplification in TARGET neuroblastoma samples.

target_mycn_gtex_sex_plot

gwaybio (Collaborator, Author) commented Feb 13, 2019

updated figure:

full_separation_plot

gwaybio (Collaborator, Author) commented Feb 13, 2019

The most recent set of commits adds two analyses: one detecting patient sex in TCGA data, and one applying the MYCN signature to an external dataset.

@gwaybio gwaybio requested a review from ajlee21 February 13, 2019 18:17
@gwaybio gwaybio changed the title from "[WIP] Separation Analysis" to "Adding Separation Analysis Module" Feb 13, 2019
cgreene (Member) commented Feb 13, 2019

Is the GTEx sex variable predictive of TCGA or vice versa? Just curious if you tried it. I'd expect the cancer samples to be vastly different than normal tissue.

gwaybio (Collaborator, Author) commented Feb 13, 2019

> Is the GTEx sex variable predictive of TCGA or vice versa? Just curious if you tried it. I'd expect the cancer samples to be vastly different than normal tissue.

I did try it and they are not predictive...

GTEx predicting TCGA

image

TCGA predicting GTEx

image

I didn't follow this up with any fancy plots

@ajlee21 ajlee21 left a comment

In PR #162 I said I was surprised that the p-value curves in the plot go up and down as you increase the number of z dimensions. After thinking about it some more, I think it might make sense. If my understanding is correct, when you run your VAE with z=2 and then with z=3, you are not guaranteed to get the same latent dimensions, due to the stochasticity of the Keras VAE build. That is, if you picked up a neutrophil signal with z=2, you might not see it with z=3, because the features are not cumulative between these models.

# In[5]:


get_ipython().system(' md5sum "download/2019-01-22-CellLineSTAR-fpkm-2pass_matrix.txt"')

What are you doing with the md5sum? I assume this is a QC step

Collaborator Author

Yes - and also helpful for someone else who may be running the code to confirm that the data they are using are the same version as described here
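
For readers who want the same check outside a notebook shell call, here is a minimal sketch using only the standard library. The helper names `md5_of_file` and `matches_expected` are hypothetical, and the demo runs on a temporary file rather than the real download, since the expected digest is not given in this thread.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True if the file's digest equals the recorded one (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demo on a throwaway file; a real check would point at the downloaded matrix
# and compare against the digest recorded in the notebook output.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_digest = md5_of_file(tmp.name)
demo_ok = matches_expected(tmp.name, demo_digest.upper())
os.remove(tmp.name)
print(demo_digest, demo_ok)
```

Recording the digest next to the download step lets anyone rerunning the analysis confirm they have the same file version before continuing.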

get_ipython().system(' md5sum "download/2019-01-22-CellLineSTAR-fpkm-2pass_matrix.txt"')


# ## Download Phenotype Data

Maybe this is not necessary, but I would clarify that by phenotype you mean "MYCN status". I'm not sure whether you want to include the specific criteria used to define "amplified".

# In[12]:


# Create a synonym to entrez mapping and add to dictionary

Is there a reason you are replacing the ids with entrez ones?

Collaborator Author

Entrez IDs are more stable than HUGO symbols, have better mappings to other databases, and all of my other data are in Entrez ID format.
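
As a rough illustration of the synonym-to-Entrez step under discussion, the sketch below folds gene synonyms into a symbol lookup so that either name resolves to the same Entrez ID. The tables, variable names, and the `map_to_entrez` helper are hypothetical and are not taken from the notebook.

```python
# Illustrative lookup tables; a real pipeline would load these from a gene
# annotation file. The names below are hypothetical, not from the notebook.
symbol_to_entrez = {"MYCN": "4613", "TP53": "7157"}
synonym_to_symbol = {"NMYC": "MYCN", "P53": "TP53"}

# Add each synonym to the dictionary so it resolves to the same Entrez ID
gene_to_entrez = dict(symbol_to_entrez)
for synonym, symbol in synonym_to_symbol.items():
    if symbol in symbol_to_entrez:
        gene_to_entrez.setdefault(synonym, symbol_to_entrez[symbol])


def map_to_entrez(gene_names, mapping):
    """Split names into (name -> Entrez ID) hits and a list of unmapped names."""
    mapped = {name: mapping[name] for name in gene_names if name in mapping}
    unmapped = [name for name in gene_names if name not in mapping]
    return mapped, unmapped


mapped, unmapped = map_to_entrez(["NMYC", "TP53", "NOT_A_GENE"], gene_to_entrez)
print(mapped)
print(unmapped)
```

Returning the unmapped names separately makes it easy to see how many genes are lost in the conversion.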

.sort_index(axis='columns')
.sort_index(axis='rows')
)
raw_scaled_df.columns = raw_scaled_df.columns.astype(str)

Why are you not keeping the original column names?

Collaborator Author

All of my other matrices have strings as column names, so any merge or subset by column name can assume strings.
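
A small sketch of why the cast matters, assuming pandas is available. The two toy frames and their values are made up. Integer and string labels never compare equal, so matching by column name silently finds nothing until the labels share a type.

```python
import pandas as pd

# Toy matrices keyed by Entrez ID; the values are made up for illustration.
int_cols_df = pd.DataFrame([[0.1, 0.2]], columns=[4613, 7157])
str_cols_df = pd.DataFrame([[0.3, 0.4]], columns=["4613", "7157"])

# Integer labels and string labels do not match, so no shared columns appear
shared_before = set(int_cols_df.columns) & set(str_cols_df.columns)

# Casting to str, as in the diff above, puts both frames on the same footing
int_cols_df.columns = int_cols_df.columns.astype(str)
shared_after = set(int_cols_df.columns) & set(str_cols_df.columns)

print(len(shared_before), len(shared_after))
```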

#
# **Gregory Way, 2019**
#
# Perform a t-test to isolate the specific k dimension and algorithm that best distinguishes the two groups.

This wording is a bit confusing

Collaborator Author

Ah, thanks for pointing that out - will address in next commit!
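
One way to read the notebook description: compute a t-statistic per feature for the two groups, then keep the algorithm, k, and feature with the largest absolute value. The sketch below assumes Welch's unequal-variance form, which may differ from what the notebook uses, and the feature activations are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Made-up feature activations keyed by (algorithm, k, feature); not real results.
# Each value holds the activations for group one and group two.
scores = {
    ("pca", 2, "pca_0"): ([0.1, 0.2, 0.0, 0.3], [0.2, 0.1, 0.3, 0.0]),
    ("vae", 3, "vae_1"): ([1.9, 2.1, 2.0, 2.2], [0.1, -0.1, 0.0, 0.2]),
}

t_stats = {key: welch_t(a, b) for key, (a, b) in scores.items()}

# The feature that best separates the groups has the largest |t|
best = max(t_stats, key=lambda key: abs(t_stats[key]))
print(best)
```

This only ranks features by the statistic. Obtaining p-values requires the t distribution, for example via `scipy.stats.ttest_ind`.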



# Extract male and female ids from the dataset
example_matrix_df = gtex_z_matrix_dict['signal']['8']['451283']['test']

Is this for testing?

Collaborator Author

I am just extracting the IDs in this example matrix. I will update the comment to be more clear, thanks!


# coding: utf-8

# # Detect Separation Between Two Phenotypes in TCGA, GTEx, and TARGET Data

Looks like you could possibly make a function to do this since you're doing something similar for the two datasets. Though you may not have time right now and I'm not sure how specific the processing is for the two datasets.

Collaborator Author (@gwaybio, Feb 14, 2019)

Yeah, I could condense at least one of the pipeline steps into a function, although the processing steps for the phenotype data are distinct.

Collaborator Author

will add in next commit!


vae_seed = '451283'
vae_k = 200
vae_feature = "vae_111"

Why is this hard-coded? I'm not sure how this signature is selected.

Collaborator Author

It is selected from the 1.separate module. But it is a good idea to remove the hard code! I will draw it in from a file generated in 1.separate instead. Thanks!
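
A sketch of what drawing the values from a file could look like, using only the standard library. The file name `top_feature.tsv`, its column layout, and both helper functions are assumptions for illustration. The actual output of the 1.separate module is not shown in this thread.

```python
import csv
import os
import tempfile

FIELDS = ["seed", "k", "feature"]  # hypothetical column layout


def write_top_feature(path, seed, k, feature):
    """Upstream step: record the selected feature as a one-row TSV."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerow({"seed": seed, "k": k, "feature": feature})


def read_top_feature(path):
    """Downstream step: load the selection instead of hard-coding it."""
    with open(path, newline="") as handle:
        row = next(csv.DictReader(handle, delimiter="\t"))
    return row["seed"], int(row["k"]), row["feature"]


# Round trip in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    results_file = os.path.join(tmp_dir, "top_feature.tsv")
    write_top_feature(results_file, "451283", 200, "vae_111")
    vae_seed, vae_k, vae_feature = read_top_feature(results_file)

print(vae_seed, vae_k, vae_feature)
```

Keeping the seed as a string matches how it is used as a dictionary key elsewhere in the diff, while k is parsed back to an integer.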


mycn_validation_gg

# Plot multi panel figure

I see there is a TARGET NBL box plot and an NBL box plot. Is the NBL plot from normal patients?

Collaborator Author

The TARGET boxplot is from patients; the other is from cell lines.

Collaborator Author

The cell line scores were derived from the signature learned from TARGET data
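
To make "derived from the signature" concrete, here is a minimal sketch that assumes the signature is a vector of per-gene weights and that a sample's score is the weighted sum over the genes present in both datasets. That scoring rule is an assumption, and the function name, weights, and expression values are all invented for illustration.

```python
def apply_signature(expression, weights):
    """Score each sample as the weighted sum of its gene values.

    Only genes present in both the sample and the signature contribute.
    """
    scores = {}
    for sample, genes in expression.items():
        shared = [gene for gene in weights if gene in genes]
        scores[sample] = sum(genes[gene] * weights[gene] for gene in shared)
    return scores


# Hypothetical signature weights, keyed by Entrez ID as strings
signature_weights = {"4613": 1.5, "7157": -0.5, "672": 0.2}

# Hypothetical cell line expression values; "672" is absent from this dataset
cell_line_expression = {
    "cell_line_a": {"4613": 2.0, "7157": 0.5},
    "cell_line_b": {"4613": 0.1, "7157": 0.4},
}

cell_line_scores = apply_signature(cell_line_expression, signature_weights)
print(cell_line_scores)
```

Restricting to shared genes matters when the external dataset does not measure every gene that the signature was trained on.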

hjust = 0.5),
legend.position = "none")

tcga_sex_gg

Looks like you have a very long set of warnings here -- any idea what this is?

Collaborator Author

Yeah, it's because the readxl package warns that a column expected a specific data type.

@gwaybio gwaybio merged commit 13c0bfe into greenelab:master Feb 14, 2019
@gwaybio gwaybio deleted the separation-analysis branch February 14, 2019 21:14