Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom functions for working with PLIER models and initial exploratory analyses #3

Merged
merged 25 commits into from
Apr 8, 2018

Conversation

jaclyn-taroni
Copy link
Collaborator

@jaclyn-taroni jaclyn-taroni commented Apr 4, 2018

This PR adds custom functions for working with PLIER models, e.g., a wrapper function for running PLIER::PLIER, applying PLIER models to new data, various functions for reconstruction, etc.

I also include an R notebook, PLIER_util_proof-of-concept_notebook.*, as a proof-of-concept (as its name implies 😄 ). Here, I track both the .Rmd and .nb.html files generated.

Unfortunately, R notebooks will not exactly display on Github currently. Specifically, if we take a look at the Rmarkdown file PLIER_util_proof-of-concept_notebook.Rmd, the plot at the very end doesn't come up.

So to facilitate code review I've tried a couple things:

I am interested in the preferences of reviewers.

Update

Based on comments from @huqiwen0313 and @gwaygenomics, it seems like .Rmd files will be sufficient for code review. So I will reorganize accordingly.

Ready for review update

I apologize for the size of the PR! I was hoping to get comments on repo organization as well, hence the multiple notebooks. Since R notebooks generate .Rmd and .nb.html files, it's not quite as bad as it looks. Here's what I've added and what I recommend taking a look at:

  • util/plier_util.R - This contains custom functions for working with and evaluating PLIER models. You'll see most of them "in action" in the 3 notebooks I include here.
  • 01-PLIER_util_proof-of-concept_notebook.Rmd - Proof-of-concept/"sanity check" for some of the custom functions using the NARES dataset (due to its relatively small size)
  • 02-recount2_PLIER_exploration.Rmd - Initial exploration of the recount2 PLIER model & some potential ways to evaluate models (science comments most welcome here!)
  • 03-isolated_cell_type_populations.Rmd - Here, I apply the recount2 PLIER model to a microarray dataset of sorted leukocytes, another important control for this model.

I've included a zip file of the .nb.html files here for your viewing convenience: 01-03.nb.html.zip

@jaclyn-taroni jaclyn-taroni changed the title Add PLIER util [WIP] Add custom functions for working with PLIER models and initial exploratory analyses Apr 5, 2018
Copy link

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of comments (mostly clarification) - overall great PR 👍

data(canonicalPathways)

# combine the pathway data from PLIER
all.paths <- combinePaths(bloodCellMarkersIRISDMAP, svmMarkers,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this PLIER::combinePaths()?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the rest of the functions below


# PLIER main function + return results
plier.res <- PLIER(exprs.norm[cm.genes, ], all.paths[cm.genes, ],
k = round((set.k + set.k*0.3), 0), trace = TRUE)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is 0.3? I feel like I may have asked this question before elsewhere...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came up here greenelab/rheum-plier-data#1 (comment) originally, but also here greenelab/rheum-plier-data#20 (comment) -- will update the documentation

# training data, set missing genes to zero (the mean), and reorder to match
# plier.model$Z
#
# This makes the input gene expression data suitable for projection (?) into
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why (?)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, note to self re: wording during development

indx.vector[row.iter] <-
which(rownames(z.mat) == rownames(exprs.cg)[row.iter])
}
ord.rownorm <- exprs.cg[order(indx.vector), ]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure if this will have the intended outcome. Lines 91-95 create a matched order, but then it is reordered in Line 96 with order().

It may also be worth checking out match

# exprs.new.b: a matrix that contains the values of each latent variable
# for each sample from the new dataset (exprs.mat),
#
require(PLIER)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get some spacing in this function? (To match aesthetic (that I like) of previous functions in this file)

```{r}
png.file <- file.path(plot.dir,
"recount2_recon_MASE_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very interesting distribution - Are these LVs or samples? (I think I'm a bit confused)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Samples, I'll add that info to the axis label

#### Spearman correlation (input, reconstructed)

Spearman correlation between input and reconstructed values was used as an
evaluation in [Cleary, et al.](https://doi.org/10.1016/j.cell.2017.10.023)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to a comment below (in the util file) - I think its worth adding a brief description of what the eval is doing

```{r}
png.file <- file.path(plot.dir,
"recount2_recon_MASE.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I am misunderstanding - pathway associated LVs have a worse reconstruction error?

```{r}
png.file <- file.path(plot.dir,
"recount2_recon_scatter.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding all points to the same graph with different colors and adjusting alpha? I think it would emphasize differences. Also, what proportion are pathway-associated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes coming up

# save heatmap as pdf -- can not figure out another way around this, hm just
# returns a list
pdf(file.path(plot.dir, "E-MTAB-2452_recount_PLIER_cell_type_LVs_B.pdf"))
gplots::heatmap.2(iso.b.matrix[indx.relevant.lv, ],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Axis labels are very small

Copy link

@huqiwen0313 huqiwen0313 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool results.

```

**Scatterplot**
```{r}
Copy link

@huqiwen0313 huqiwen0313 Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting that for pathway-associated reconstruction, there is a large proportion of points with low correlation but have relatively high MASE. Maybe, fit a linear regression and add R^2 will help for interpretation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the pattern is linear, so I am not sure that's appropriate. It would be interesting to check out what samples have high error and high correlation at some point in the future -- probably would make sense to come back around at the same time I look into #4.

```{r}
png.file <- file.path(plot.dir,
"recount2_recon_MASE.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pathway associated LVs have a worse reconstruction error because part of the loading information is lost when removing the non-significant LVs ? Is it possible to only look at those genes that associated with the significant LVs ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so, this is as we would expect for the training dataset for this model -- when we drop out LVs learned by the model, we'll do worse at reconstruction. We expect these non-pathway associated LVs to capture some information (variance) that could be technical (e.g., kits used for library prep) or novel biology or tissue vs. blood -- any number of things that might not be captured by pathways and cell type gene sets supplied to PLIER.

But, this experimental set up gets more interesting when we apply the recount2 PLIER model to another dataset. Maybe if I just use (known) biological signal, that helps me with reconstruction as compared to including LVs that might capture something like library prep that could be specific to RNA-seq data or something else that might limit the generalizability of the model.

That's my thinking here. What do you think @gwaygenomics and @huqiwen0313?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to only look at those genes that associated with the significant LVs ?

Can you say a bit more about what you mean by that Qiwen? Do you mean filter the genes in the Z matrix to only those that have positive values once the non-significant LVs get dropped?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is what I am thinking of. I mean only look at those genes that have high loadings in those significant LVs. The reconstruction error may look better.

Copy link

@gwaybio gwaybio Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like - if you reconstruct a new dataset with the non-pathway LVs then sample correlation will be much lower than reconstruction on trained dataset (if the non-pathway LVs are capturing artifacts)?

(at least lower than reconstruction of new dataset with pathway LVs)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like - if you reconstruct a new dataset with the non-pathway LVs then sample correlation will be much lower than reconstruction on trained dataset (if the non-pathway LVs are capturing artifacts)?
(at least lower than reconstruction of new dataset with pathway LVs)

Yep, that's the idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is what I am thinking of. I mean only look at those genes that have high loadings in those significant LVs. The reconstruction error may look better.

@huqiwen0313 I think it might make sense to put a pin in this and revisit once we're looking at test datasets, I've filed #4

```
```{r}
# save heatmap as pdf -- can not figure out another way around this, hm just
# returns a list

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add annotation for different clusters (colors) in x and y axis?

@jaclyn-taroni
Copy link
Collaborator Author

Thanks for the comments @gwaygenomics and @huqiwen0313 ! I think this is ready for another look. I also fixed the documentation and text around one of the U sparsity evals, it was not quite right.

Copy link

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great @jaclyn-taroni - one minor optional comment

@@ -239,6 +239,8 @@ GetReconstructionCorrelation <- function(true.mat, recon.mat,
for (col.iter in 1:ncol(true.mat)){
# for each gene (column), calculate the MASE between the true expression
# values and the expression values after reconstruction
# due to size of matrices, this is more efficient than calculating
# correlation between all values
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was fd4ae21 a significant speed upgrade? May consider updating this function to sapply too

Copy link
Collaborator Author

@jaclyn-taroni jaclyn-taroni Apr 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor speed up (less than half a second) and I tested on recount2, so I don't anticipate working with anything bigger

Copy link

@huqiwen0313 huqiwen0313 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@jaclyn-taroni jaclyn-taroni merged commit 23c8b35 into greenelab:master Apr 8, 2018
@jaclyn-taroni jaclyn-taroni deleted the initial-util branch April 8, 2018 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants