Adding Separation Analysis Module #163
The most recent set of commits adds two analyses: one that detects patient sex in TCGA data, and one that applies the MYCN signature to an external dataset.
Is the GTEx sex variable predictive of TCGA or vice versa? Just curious if you tried it. I'd expect the cancer samples to be vastly different from normal tissue.
In PR #162 I said I was surprised that the p-value curves in the plot go up and down as you increase the number of z dimensions. After thinking about it some more, I think it might make sense. If my understanding is correct, when you run your VAE with z=2 and then with z=3, you are not guaranteed to get the same latent dimensions, because of the stochasticity of the Keras VAE build. For example, if you picked up a neutrophil signal at z=2, you might not see it at z=3, because the features are not cumulative between these models.
# In[5]:


get_ipython().system(' md5sum "download/2019-01-22-CellLineSTAR-fpkm-2pass_matrix.txt"')
What are you doing with the md5sum? I assume this is a QC step.
Yes, and it also helps anyone else running the code confirm that the data they are using are the same version as described here.
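For readers who want the same check without shelling out to `md5sum`, here is a minimal Python sketch. The helper names `md5_of_file` and `verify_download` are hypothetical and not part of this PR, and the demo file is a small stand-in written to a temporary directory, not the real matrix.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large matrices do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Raise if the file on disk does not match the recorded checksum."""
    observed = md5_of_file(path)
    if observed != expected_md5:
        raise ValueError(f"{path}: expected md5 {expected_md5}, got {observed}")
    return observed


# Stand-in demo: record a checksum once, then confirm the file still matches it
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_matrix.txt"
    demo_path.write_bytes(b"gene\tsample_1\n4613\t1.0\n")
    recorded_md5 = md5_of_file(demo_path)
    verify_download(demo_path, recorded_md5)
```

Storing the recorded digest next to the notebook would let a later run fail loudly if the download changes upstream.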
get_ipython().system(' md5sum "download/2019-01-22-CellLineSTAR-fpkm-2pass_matrix.txt"')


# ## Download Phenotype Data
Maybe this is not necessary, but I would clarify that by phenotype you mean "MYCN status". I'm not sure whether you want to include the specific criteria used to define "amplified".
# In[12]:


# Create a synonym to entrez mapping and add to dictionary
Is there a reason you are replacing the IDs with Entrez ones?
Entrez IDs are more stable than HUGO symbols, have better mappings to other databases, and all of my other data are in Entrez ID format.
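A rough sketch of what a synonym-to-Entrez mapping could look like is below. The function names and the record layout are assumptions for illustration, not the notebook's actual code, and the two gene records are a tiny hand-written example, not the annotation file used in this PR.

```python
def build_symbol_to_entrez(gene_records):
    """Map official symbols and their synonyms to Entrez IDs (as strings).

    gene_records is a list of (entrez_id, symbol, synonyms) tuples.
    """
    mapping = {}
    # Add synonyms first; keep the first ID seen if a synonym is ambiguous
    for entrez_id, _symbol, synonyms in gene_records:
        for synonym in synonyms:
            mapping.setdefault(synonym, str(entrez_id))
    # Official symbols are added last so they always win over a clashing synonym
    for entrez_id, symbol, _synonyms in gene_records:
        mapping[symbol] = str(entrez_id)
    return mapping


def translate_symbols(symbols, mapping):
    """Return (entrez_ids, unmapped_symbols) for a list of gene symbols."""
    entrez_ids = [mapping[s] for s in symbols if s in mapping]
    unmapped = [s for s in symbols if s not in mapping]
    return entrez_ids, unmapped


# Illustrative records only
example_records = [
    (4613, "MYCN", ["NMYC", "bHLHe37"]),
    (7157, "TP53", ["P53", "LFS1"]),
]
symbol_to_entrez = build_symbol_to_entrez(example_records)
entrez_ids, unmapped = translate_symbols(["MYCN", "P53", "NOT_A_GENE"], symbol_to_entrez)
```

Returning the unmapped symbols separately makes it easy to report how many genes were dropped in the conversion.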
    .sort_index(axis='columns')
    .sort_index(axis='rows')
)
raw_scaled_df.columns = raw_scaled_df.columns.astype(str)
Why are you not keeping the original column names?
All of my other matrices have string column names, so merging or subsetting by column name should assume strings.
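To illustrate why the cast matters, here is a small pandas sketch. The data frame and gene list are made up for the example; only the `.columns.astype(str)` call comes from the diff above.

```python
import pandas as pd

# Toy expression matrix whose columns were parsed as integer Entrez IDs
expression_df = pd.DataFrame(
    [[0.1, 0.9], [0.4, 0.2]],
    index=["sample_a", "sample_b"],
    columns=[4613, 7157],
)

# Gene IDs from another source, stored as strings
signature_genes = ["4613", "7157"]

# Integer column labels never match string keys, so nothing is shared
shared_before = [gene for gene in signature_genes if gene in expression_df.columns]

# Cast the column labels to strings, as in the notebook
expression_df.columns = expression_df.columns.astype(str)
shared_after = [gene for gene in signature_genes if gene in expression_df.columns]
```

Before the cast the overlap is empty; after it, both genes are found.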
#
# **Gregory Way, 2019**
#
# Perform a t-test to isolate the specific k dimension and algorithm that best distinguishes the two groups.
This wording is a bit confusing
Ah, thanks for pointing that out. I will address it in the next commit!
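One possible reading of that sentence, sketched in plain Python: compute a two-sample statistic for every (algorithm, k, feature) combination and keep the one that separates the groups most strongly. This sketch uses Welch's t statistic and ranks by its absolute value; it does not compute p-values, and the function names, data layout, and toy scores are all assumptions, not the notebook's implementation.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def best_separating_feature(scores, labels):
    """Return the (algorithm, k, feature) key with the largest |t| and its t.

    scores maps (algorithm, k, feature) to one score per sample;
    labels gives the group of each sample and must contain exactly two groups.
    """
    groups = sorted(set(labels))
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    t_by_key = {}
    for key, values in scores.items():
        first = [v for v, label in zip(values, labels) if label == groups[0]]
        second = [v for v, label in zip(values, labels) if label == groups[1]]
        t_by_key[key] = welch_t(first, second)
    best_key = max(t_by_key, key=lambda key: abs(t_by_key[key]))
    return best_key, t_by_key[best_key]


# Toy scores, invented for illustration
toy_labels = ["female"] * 3 + ["male"] * 3
toy_scores = {
    ("vae", 8, "vae_3"): [0.10, 0.20, 0.15, 2.00, 2.10, 1.90],
    ("pca", 8, "pca_1"): [0.50, 1.50, 1.00, 0.60, 1.40, 1.10],
}
best_key, best_t = best_separating_feature(toy_scores, toy_labels)
```

In a real analysis the number of comparisons grows with every k and algorithm, so some multiple-testing adjustment would likely be needed on top of this ranking.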
# Extract male and female ids from the dataset
example_matrix_df = gtex_z_matrix_dict['signal']['8']['451283']['test']
Is this for testing?
I am just extracting the IDs in this example matrix. I will update the comment to be more clear, thanks!
# coding: utf-8

# # Detect Separation Between Two Phenotypes in TCGA, GTEx, and TARGET Data
Looks like you could possibly make a function to do this since you're doing something similar for the two datasets. Though you may not have time right now and I'm not sure how specific the processing is for the two datasets.
Yeah, I could condense at least one of the pipeline steps into a function, although the processing steps for the phenotype data are distinct.
Will add in the next commit!
vae_seed = '451283'
vae_k = 200
vae_feature = "vae_111"
Why is this hard-coded? I'm not sure how this signature is selected.
It is selected from the 1.separate module. But it is a good idea to remove the hard code! I will draw it in from a file generated in 1.separate instead. Thanks!
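A minimal sketch of how the hard-coded values could be replaced by a lookup is below. The results file, its column names (`seed`, `k`, `feature`, `p_value`), and the helper `load_top_feature` are all hypothetical; the actual output of 1.separate is not shown in this PR. The p-values in the toy table are invented.

```python
import csv
import tempfile
from pathlib import Path


def load_top_feature(results_path):
    """Pick the row with the smallest p-value from a tab-separated results file."""
    with open(results_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    if not rows:
        raise ValueError(f"{results_path} contains no results")
    top = min(rows, key=lambda row: float(row["p_value"]))
    # Keep the seed as a string and k as an integer, matching the hard-coded values
    return {
        "vae_seed": top["seed"],
        "vae_k": int(top["k"]),
        "vae_feature": top["feature"],
    }


# Toy results table standing in for a file written by the upstream module
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_separation_results.tsv"
    toy_path.write_text(
        "seed\tk\tfeature\tp_value\n"
        "451283\t200\tvae_111\t1e-12\n"
        "123456\t8\tvae_3\t1e-4\n"
    )
    selected = load_top_feature(toy_path)
```

The three assignments would then become lookups into `selected`, so a change in the upstream results propagates without editing the notebook.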
mycn_validation_gg

# Plot multi panel figure
I see there is a TARGET NBL box plot and an NBL box plot. Is the NBL plot from normal patients?
The TARGET boxplot is from patients; the other is from cell lines.
The cell line scores were derived from the signature learned from TARGET data
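As a rough illustration of scoring an external dataset with a learned signature, the sketch below assumes the signature is a vector of per-gene weights and that a sample's score is the weighted sum of its scaled expression values. That linear form, the function name `apply_signature`, and the toy numbers are assumptions; the PR does not show the exact transform.

```python
def apply_signature(expression, gene_weights):
    """Score each sample as the weighted sum over genes shared with the signature.

    expression maps sample -> {gene_id: scaled expression value};
    gene_weights maps gene_id -> signature weight.
    """
    scores = {}
    for sample, profile in expression.items():
        # Genes missing from the external dataset simply contribute nothing
        scores[sample] = sum(
            weight * profile[gene]
            for gene, weight in gene_weights.items()
            if gene in profile
        )
    return scores


# Toy weights and cell line profiles, invented for illustration
toy_weights = {"4613": 2.0, "7157": -1.0, "999999": 3.0}
toy_cell_lines = {
    "line_1": {"4613": 1.0, "7157": 0.5},
    "line_2": {"4613": 0.0, "7157": 1.0},
}
cell_line_scores = apply_signature(toy_cell_lines, toy_weights)
```

Skipping genes missing from the external dataset keeps the sketch short; in practice the fraction of signature genes recovered is worth reporting alongside the scores.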
hjust = 0.5),
legend.position = "none")

tcga_sex_gg
Looks like you have a very long set of warnings here -- any idea what this is?
Yeah, it's because of the readxl package warning that the column expected a specific data type.
Adding a GTEx and TARGET separation analysis. The figure describes searching for a feature that distinguishes sex in GTEx data and MYCN amplification in TARGET neuroblastoma samples.