-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add custom functions for working with PLIER models and initial exploratory analyses #3
Conversation
To facilitate code review
All PLIER models built in Docker container will have FDR calculations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of comments (mostly clarification) - overall great PR 👍
util/plier_util.R
Outdated
data(canonicalPathways) | ||
|
||
# combine the pathway data from PLIER | ||
all.paths <- combinePaths(bloodCellMarkersIRISDMAP, svmMarkers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this PLIER::combinePaths()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the rest of the functions below
util/plier_util.R
Outdated
|
||
# PLIER main function + return results | ||
plier.res <- PLIER(exprs.norm[cm.genes, ], all.paths[cm.genes, ], | ||
k = round((set.k + set.k*0.3), 0), trace = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is 0.3
? I feel like I may have asked this question before elsewhere...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Came up here greenelab/rheum-plier-data#1 (comment) originally, but also here greenelab/rheum-plier-data#20 (comment) -- will update the documentation
util/plier_util.R
Outdated
# training data, set missing genes to zero (the mean), and reorder to match | ||
# plier.model$Z | ||
# | ||
# This makes the input gene expression data suitable for projection (?) into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why (?)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, note to self re: wording during development
util/plier_util.R
Outdated
indx.vector[row.iter] <- | ||
which(rownames(z.mat) == rownames(exprs.cg)[row.iter]) | ||
} | ||
ord.rownorm <- exprs.cg[order(indx.vector), ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not sure if this will have the intended outcome. Lines 91-95 create a matched order, but then it is reordered in Line 96 with order()
.
It may also be worth checking out match
# exprs.new.b: a matrix that contains the values of each latent variable | ||
# for each sample from the new dataset (exprs.mat), | ||
# | ||
require(PLIER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get some spacing in this function? (To match aesthetic (that I like) of previous functions in this file)
```{r} | ||
png.file <- file.path(plot.dir, | ||
"recount2_recon_MASE_all_lvs.png") | ||
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very interesting distribution - Are these LVs or samples? (I think I'm a bit confused)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Samples, I'll add that info to the axis label
#### Spearman correlation (input, reconstructed) | ||
|
||
Spearman correlation between input and reconstructed values was used as an | ||
evaluation in [Cleary, et al.](https://doi.org/10.1016/j.cell.2017.10.023) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to a comment below (in the util file) - I think its worth adding a brief description of what the eval is doing
```{r} | ||
png.file <- file.path(plot.dir, | ||
"recount2_recon_MASE.png") | ||
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I am misunderstanding - pathway associated LVs have a worse reconstruction error?
```{r} | ||
png.file <- file.path(plot.dir, | ||
"recount2_recon_scatter.png") | ||
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth adding all points to the same graph with different colors and adjusting alpha
? I think it would emphasize differences. Also, what proportion are pathway-associated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some changes coming up
# save heatmap as pdf -- can not figure out another way around this, hm just | ||
# returns a list | ||
pdf(file.path(plot.dir, "E-MTAB-2452_recount_PLIER_cell_type_LVs_B.pdf")) | ||
gplots::heatmap.2(iso.b.matrix[indx.relevant.lv, ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Axis labels are very small
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool results.
``` | ||
|
||
**Scatterplot** | ||
```{r} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is interesting that for pathway-associated reconstruction, there is a large proportion of points with low correlation but have relatively high MASE. Maybe, fit a linear regression and add R^2 will help for interpretation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the pattern is linear, so I am not sure that's appropriate. It would be interesting to check out what samples have high error and high correlation at some point in the future -- probably would make sense to come back around at the same time I look into #4.
```{r} | ||
png.file <- file.path(plot.dir, | ||
"recount2_recon_MASE.png") | ||
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pathway associated LVs have a worse reconstruction error because part of the loading information is lost when removing the non-significant LVs ? Is it possible to only look at those genes that associated with the significant LVs ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so, this is as we would expect for the training dataset for this model -- when we drop out LVs learned by the model, we'll do worse at reconstruction. We expect these non-pathway associated LVs to capture some information (variance) that could be technical (e.g., kits used for library prep) or novel biology or tissue vs. blood -- any number of things that might not be captured by pathways and cell type gene sets supplied to PLIER.
But, this experimental set up gets more interesting when we apply the recount2 PLIER model to another dataset. Maybe if I just use (known) biological signal, that helps me with reconstruction as compared to including LVs that might capture something like library prep that could be specific to RNA-seq data or something else that might limit the generalizability of the model.
That's my thinking here. What do you think @gwaygenomics and @huqiwen0313?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to only look at those genes that associated with the significant LVs ?
Can you say a bit more about what you mean by that Qiwen? Do you mean filter the genes in the Z matrix to only those that have positive values once the non-significant LVs get dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is what I am thinking of. I mean only look at those genes that have high loadings in those significant LVs. The reconstruction error may look better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like - if you reconstruct a new dataset with the non-pathway LVs then sample correlation will be much lower than reconstruction on trained dataset (if the non-pathway LVs are capturing artifacts)?
(at least lower than reconstruction of new dataset with pathway LVs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like - if you reconstruct a new dataset with the non-pathway LVs then sample correlation will be much lower than reconstruction on trained dataset (if the non-pathway LVs are capturing artifacts)?
(at least lower than reconstruction of new dataset with pathway LVs)
Yep, that's the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is what I am thinking of. I mean only look at those genes that have high loadings in those significant LVs. The reconstruction error may look better.
@huqiwen0313 I think it might make sense to put a pin in this and revisit once we're looking at test datasets, I've filed #4
``` | ||
```{r} | ||
# save heatmap as pdf -- can not figure out another way around this, hm just | ||
# returns a list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add annotation for different clusters (colors) in x and y axis?
Thanks for the comments @gwaygenomics and @huqiwen0313 ! I think this is ready for another look. I also fixed the documentation and text around one of the U sparsity evals, it was not quite right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great @jaclyn-taroni - one minor optional comment
util/plier_util.R
Outdated
@@ -239,6 +239,8 @@ GetReconstructionCorrelation <- function(true.mat, recon.mat, | |||
for (col.iter in 1:ncol(true.mat)){ | |||
# for each gene (column), calculate the MASE between the true expression | |||
# values and the expression values after reconstruction | |||
# due to size of matrices, this is more efficient than calculating | |||
# correlation between all values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was fd4ae21 a significant speed upgrade? May consider updating this function to sapply
too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor speed up (less than half a second) and I tested on recount2, so I don't anticipate working with anything bigger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
This PR adds custom functions for working with PLIER models, e.g., a wrapper function for running
PLIER::PLIER
, applying PLIER models to new data, various functions for reconstruction, etc.I also include an R notebook,
PLIER_util_proof-of-concept_notebook.*
, as a proof-of-concept (as its name implies 😄 ). Here, I track both the.Rmd
and.nb.html
files generated.Unfortunately, R notebooks will not exactly display on Github currently. Specifically, if we take a look at the Rmarkdown file
PLIER_util_proof-of-concept_notebook.Rmd
, the plot at the very end doesn't come up.So to facilitate code review I've tried a couple things:
Rnotebook_scripts/PLIER_util_proof-of-concept_notebook.R
usingutil/purl_wrapper.R
. I can also do this in a way that excludes documentation.Rnotebook_pdf/PLIER_util_proof-of-concept_notebook.pdf
-- in order to this I needed to addpdflatex
I am interested in the preferences of reviewers.
Update
Based on comments from @huqiwen0313 and @gwaygenomics, it seems like
.Rmd
files will be sufficient for code review. So I will reorganize accordingly.Ready for review update
I apologize for the size of the PR! I was hoping to get comments on repo organization as well, hence the multiple notebooks. Since R notebooks generate
.Rmd
and.nb.html
files, it's not quite as bad as it looks. Here's what I've added and what I recommend taking a look at:util/plier_util.R
- This contains custom functions for working with and evaluating PLIER models. You'll see most of them "in action" in the 3 notebooks I include here.01-PLIER_util_proof-of-concept_notebook.Rmd
- Proof-of-concept/"sanity check" for some of the custom functions using the NARES dataset (due to its relatively small size)02-recount2_PLIER_exploration.Rmd
- Initial exploration of the recount2 PLIER model & some potential ways to evaluate models (science comments most welcome here!)03-isolated_cell_type_populations.Rmd
- Here, I apply the recount2 PLIER model to a microarray dataset of sorted leukocytes, another important control for this model.I've included a zip file of the
.nb.html
files here for your viewing convenience: 01-03.nb.html.zip