Add custom functions for working with PLIER models and initial exploratory analyses #3

jaclyn-taroni · 2018-04-04T17:56:14Z

This PR adds custom functions for working with PLIER models, e.g., a wrapper function for running PLIER::PLIER, applying PLIER models to new data, various functions for reconstruction, etc.

I also include an R notebook, PLIER_util_proof-of-concept_notebook.*, as a proof-of-concept (as its name implies 😄 ). Here, I track both the .Rmd and .nb.html files generated.

Unfortunately, R notebooks do not currently render fully on GitHub. Specifically, if we take a look at the R Markdown file PLIER_util_proof-of-concept_notebook.Rmd, the plot at the very end doesn't show up.

So to facilitate code review I've tried a couple things:

Extracting the Rscript Rnotebook_scripts/PLIER_util_proof-of-concept_notebook.R using util/purl_wrapper.R. I can also do this in a way that excludes documentation.
Knitting to PDF Rnotebook_pdf/PLIER_util_proof-of-concept_notebook.pdf -- in order to do this I needed to add pdflatex

I am interested in the preferences of reviewers.

Update

Based on comments from @huqiwen0313 and @gwaygenomics, it seems like .Rmd files will be sufficient for code review. So I will reorganize accordingly.

Ready for review update

I apologize for the size of the PR! I was hoping to get comments on repo organization as well, hence the multiple notebooks. Since R notebooks generate .Rmd and .nb.html files, it's not quite as bad as it looks. Here's what I've added and what I recommend taking a look at:

util/plier_util.R - This contains custom functions for working with and evaluating PLIER models. You'll see most of them "in action" in the 3 notebooks I include here.
01-PLIER_util_proof-of-concept_notebook.Rmd - Proof-of-concept/"sanity check" for some of the custom functions using the NARES dataset (due to its relatively small size)
02-recount2_PLIER_exploration.Rmd - Initial exploration of the recount2 PLIER model & some potential ways to evaluate models (science comments most welcome here!)
03-isolated_cell_type_populations.Rmd - Here, I apply the recount2 PLIER model to a microarray dataset of sorted leukocytes, another important control for this model.

I've included a zip file of the .nb.html files here for your viewing convenience: 01-03.nb.html.zip


gwaybio

a couple of comments (mostly clarification) - overall great PR 👍

gwaybio · 2018-04-05T18:09:23Z

util/plier_util.R

+  data(canonicalPathways)
+
+  # combine the pathway data from PLIER
+  all.paths <- combinePaths(bloodCellMarkersIRISDMAP, svmMarkers, 


is this PLIER::combinePaths()?

Same for the rest of the functions below

gwaybio · 2018-04-05T18:23:43Z

util/plier_util.R

+
+  # PLIER main function + return results
+  plier.res <- PLIER(exprs.norm[cm.genes, ], all.paths[cm.genes, ], 
+                     k = round((set.k + set.k*0.3), 0), trace = TRUE)


what is 0.3? I feel like I may have asked this question before elsewhere...

Came up here greenelab/rheum-plier-data#1 (comment) originally, but also here greenelab/rheum-plier-data#20 (comment) -- will update the documentation
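
For readers following along, here is a minimal sketch of what the `k` expression in the diff evaluates to. The value of `set.k` below is hypothetical; the thread does not show how it is computed or why 0.3 was chosen, so this only illustrates the arithmetic.

```r
# Hypothetical value; in the PR, set.k is determined before the PLIER call
set.k <- 50

# Same arithmetic as the k argument in the diff:
# set.k plus 30% of set.k, rounded to a whole number
k <- round((set.k + set.k * 0.3), 0)
print(k)
```

In other words, the model is asked for 30% more latent variables than `set.k`.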

gwaybio · 2018-04-05T18:24:16Z

util/plier_util.R

+  # training data, set missing genes to zero (the mean), and reorder to match
+  # plier.model$Z
+  # 
+  # This makes the input gene expression data suitable for projection (?) into


Oops, note to self re: wording during development

gwaybio · 2018-04-05T18:31:58Z

util/plier_util.R

+    indx.vector[row.iter] <- 
+      which(rownames(z.mat) == rownames(exprs.cg)[row.iter])
+  }
+  ord.rownorm <- exprs.cg[order(indx.vector), ]


I am not sure if this will have the intended outcome. Lines 91-95 create a matched order, but then it is reordered in Line 96 with order().

It may also be worth checking out match
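
A small sketch comparing the loop-plus-`order()` approach from the diff with `match()`. The matrices here are toy stand-ins (the names `z.mat` and `exprs.cg` are borrowed from the diff, the contents are made up), and it assumes both matrices contain exactly the same genes with unique row names. It says nothing about inputs where that assumption fails.

```r
# Toy loadings matrix standing in for plier.model$Z (genes x LVs)
z.mat <- matrix(0, nrow = 4, ncol = 2,
                dimnames = list(c("A", "B", "C", "D"), c("LV1", "LV2")))

# Toy expression matrix with the same genes in a different row order
exprs.cg <- matrix(1:8, nrow = 4, ncol = 2,
                   dimnames = list(c("C", "A", "D", "B"), c("s1", "s2")))

# Loop-based approach, as in the diff: for each row of exprs.cg, find its
# position in z.mat, then sort the rows by that position
indx.vector <- integer(nrow(exprs.cg))
for (row.iter in seq_len(nrow(exprs.cg))) {
  indx.vector[row.iter] <-
    which(rownames(z.mat) == rownames(exprs.cg)[row.iter])
}
ord.loop <- exprs.cg[order(indx.vector), ]

# match() approach: for each row of z.mat, find its position in exprs.cg
ord.match <- exprs.cg[match(rownames(z.mat), rownames(exprs.cg)), ]

print(rownames(ord.loop))
print(rownames(ord.match))
```

On this toy input both versions return the rows in the order of `z.mat`; the `match()` version does it in one line and without preallocating an index vector.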

gwaybio · 2018-04-05T18:36:40Z

util/plier_util.R

+  #   exprs.new.b: a matrix that contains the values of each latent variable
+  #                for each sample from the new dataset (exprs.mat), 
+  # 
+  require(PLIER)


can we get some spacing in this function? (To match aesthetic (that I like) of previous functions in this file)

gwaybio · 2018-04-05T19:11:40Z

02-recount2_PLIER_exploration.Rmd

+```{r}
+png.file <- file.path(plot.dir, 
+                      "recount2_recon_MASE_all_lvs.png")
+ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), 


This is a very interesting distribution - Are these LVs or samples? (I think I'm a bit confused)

Samples, I'll add that info to the axis label

gwaybio · 2018-04-05T19:19:06Z

02-recount2_PLIER_exploration.Rmd

+#### Spearman correlation (input, reconstructed)
+
+Spearman correlation between input and reconstructed values was used as an 
+evaluation in [Cleary, et al.](https://doi.org/10.1016/j.cell.2017.10.023)


This is related to a comment below (in the util file) - I think it's worth adding a brief description of what the eval is doing
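
As a rough illustration of what such an evaluation computes, here is a sketch of Spearman correlation between matching columns of an input matrix and its reconstruction. The matrices are simulated, the names `true.mat` and `recon.mat` follow the function signature shown later in this thread, and the orientation (what the columns represent) is an assumption, not something taken from the util file.

```r
set.seed(123)

# Toy "input" matrix: 20 rows x 5 columns
true.mat <- matrix(rnorm(100), nrow = 20, ncol = 5)

# Toy "reconstruction": the input plus a small amount of noise
recon.mat <- true.mat + matrix(rnorm(100, sd = 0.1), nrow = 20, ncol = 5)

# Spearman correlation between each column of the input and the
# corresponding column of the reconstruction
spearman.cor <- sapply(seq_len(ncol(true.mat)), function(col.iter) {
  cor(true.mat[, col.iter], recon.mat[, col.iter], method = "spearman")
})

print(round(spearman.cor, 3))
```

Values near 1 mean the reconstruction preserves the rank order of the input values in that column.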

gwaybio · 2018-04-05T19:26:03Z

02-recount2_PLIER_exploration.Rmd

+```{r}
+png.file <- file.path(plot.dir, 
+                      "recount2_recon_MASE.png")
+ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), 


maybe I am misunderstanding - pathway associated LVs have a worse reconstruction error?

gwaybio · 2018-04-05T19:27:45Z

02-recount2_PLIER_exploration.Rmd

+```{r}
+png.file <- file.path(plot.dir, 
+                      "recount2_recon_scatter.png")
+ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), 


Is it worth adding all points to the same graph with different colors and adjusting alpha? I think it would emphasize differences. Also, what proportion are pathway-associated?

Some changes coming up

gwaybio · 2018-04-05T19:31:24Z

03-isolated_cell_type_populations.Rmd

+# save heatmap as pdf -- can not figure out another way around this, hm just
+# returns a list
+pdf(file.path(plot.dir, "E-MTAB-2452_recount_PLIER_cell_type_LVs_B.pdf"))
+gplots::heatmap.2(iso.b.matrix[indx.relevant.lv, ], 


Axis labels are very small

huqiwen0313

Very cool results.

huqiwen0313 · 2018-04-06T05:11:51Z

02-recount2_PLIER_exploration.Rmd

+```
+
+**Scatterplot**
+```{r}


It is interesting that for pathway-associated reconstruction, there is a large proportion of points that have low correlation but relatively high MASE. Maybe fitting a linear regression and adding R^2 would help with interpretation.

I don't think the pattern is linear, so I am not sure that's appropriate. It would be interesting to check out what samples have high error and high correlation at some point in the future -- probably would make sense to come back around at the same time I look into #4.

huqiwen0313 · 2018-04-06T05:15:45Z

02-recount2_PLIER_exploration.Rmd

+```{r}
+png.file <- file.path(plot.dir, 
+                      "recount2_recon_MASE.png")
+ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(), 


Do pathway-associated LVs have a worse reconstruction error because part of the loading information is lost when removing the non-significant LVs? Is it possible to only look at those genes that are associated with the significant LVs?

Yes, so, this is as we would expect for the training dataset for this model -- when we drop out LVs learned by the model, we'll do worse at reconstruction. We expect these non-pathway associated LVs to capture some information (variance) that could be technical (e.g., kits used for library prep) or novel biology or tissue vs. blood -- any number of things that might not be captured by pathways and cell type gene sets supplied to PLIER.

But, this experimental set up gets more interesting when we apply the recount2 PLIER model to another dataset. Maybe if I just use (known) biological signal, that helps me with reconstruction as compared to including LVs that might capture something like library prep that could be specific to RNA-seq data or something else that might limit the generalizability of the model.

That's my thinking here. What do you think @gwaygenomics and @huqiwen0313?
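
A toy sketch of the point about the training data, assuming reconstruction is the product of the loadings (Z) and the latent variable values (B). Everything here is simulated, the choice of which LVs count as "pathway-associated" is hypothetical, and plain mean absolute error is used as a simple stand-in for MASE.

```r
set.seed(123)
n.genes <- 30
n.lvs <- 4
n.samples <- 10

# Toy loadings (Z, genes x LVs) and LV values (B, LVs x samples)
z.mat <- matrix(abs(rnorm(n.genes * n.lvs)), nrow = n.genes)
b.mat <- matrix(rnorm(n.lvs * n.samples), nrow = n.lvs)

# Toy expression data generated exactly as Z %*% B
exprs.mat <- z.mat %*% b.mat

# Hypothetical split: pretend only the first two LVs are pathway-associated
pathway.lvs <- 1:2

# Reconstruction with all LVs vs. only the "pathway-associated" LVs
recon.all <- z.mat %*% b.mat
recon.pathway <- z.mat[, pathway.lvs, drop = FALSE] %*%
  b.mat[pathway.lvs, , drop = FALSE]

# Mean absolute error (a stand-in for MASE, not the PR's metric)
mae.all <- mean(abs(exprs.mat - recon.all))
mae.pathway <- mean(abs(exprs.mat - recon.pathway))

print(c(all = mae.all, pathway.only = mae.pathway))
```

Because the dropped LVs carry part of the signal in the data the model was fit to, leaving them out can only hurt reconstruction of that same data; whether that also holds on a new dataset is the open question raised here.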

> Is it possible to only look at those genes that associated with the significant LVs ?

Can you say a bit more about what you mean by that Qiwen? Do you mean filter the genes in the Z matrix to only those that have positive values once the non-significant LVs get dropped?

Yes, this is what I am thinking of. I mean only look at those genes that have high loadings in those significant LVs. The reconstruction error may look better.

Something like - if you reconstruct a new dataset with the non-pathway LVs then sample correlation will be much lower than reconstruction on trained dataset (if the non-pathway LVs are capturing artifacts)?

(at least lower than reconstruction of new dataset with pathway LVs)

> Something like - if you reconstruct a new dataset with the non-pathway LVs then sample correlation will be much lower than reconstruction on trained dataset (if the non-pathway LVs are capturing artifacts)?
> (at least lower than reconstruction of new dataset with pathway LVs)

Yep, that's the idea.

> Yes, this is what I am thinking of. I mean only look at those genes that have high loadings in those significant LVs. The reconstruction error may look better.

@huqiwen0313 I think it might make sense to put a pin in this and revisit once we're looking at test datasets, I've filed #4
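
For when this gets revisited, one possible way to express the gene filter discussed above. This is only a sketch: the loadings are made up, the set of significant LVs and the cutoff for a "high" loading are both hypothetical, and none of these names come from util/plier_util.R.

```r
# Toy loadings matrix (genes x LVs); values are made up
z.mat <- matrix(c(0.9, 0.0, 0.1,
                  0.0, 0.8, 0.0,
                  0.0, 0.0, 0.7,
                  0.2, 0.0, 0.0),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("gene", 1:4), paste0("LV", 1:3)))

# Hypothetical: LV1 and LV2 are the significant (pathway-associated) LVs
sig.lvs <- c("LV1", "LV2")

# Hypothetical cutoff for what counts as a "high" loading
loading.cutoff <- 0.5

# Keep genes with a loading above the cutoff in at least one significant LV
keep.genes <- rownames(z.mat)[
  apply(z.mat[, sig.lvs, drop = FALSE] > loading.cutoff, 1, any)
]

print(keep.genes)
```

Reconstruction error could then be computed on `keep.genes` only; how to pick the cutoff is left open.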

huqiwen0313 · 2018-04-06T05:23:37Z

03-isolated_cell_type_populations.Rmd

+```
+```{r}
+# save heatmap as pdf -- can not figure out another way around this, hm just
+# returns a list


add annotation for different clusters (colors) in x and y axis?

jaclyn-taroni · 2018-04-06T19:43:29Z

Thanks for the comments @gwaygenomics and @huqiwen0313 ! I think this is ready for another look. I also fixed the documentation and text around one of the U sparsity evals, it was not quite right.

gwaybio

looks great @jaclyn-taroni - one minor optional comment

gwaybio · 2018-04-07T17:01:01Z

util/plier_util.R

@@ -239,6 +239,8 @@ GetReconstructionCorrelation <- function(true.mat, recon.mat,
  for (col.iter in 1:ncol(true.mat)){
    # for each gene (column), calculate the MASE between the true expression 
    # values and the expression values after reconstruction
+    # due to size of matrices, this is more efficient than calculating 
+    # correlation between all values


was fd4ae21 a significant speed upgrade? May consider updating this function to sapply too

Very minor speed up (less than half a second) and I tested on recount2, so I don't anticipate working with anything bigger
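
For context, a sketch of the loop-to-`sapply` substitution being discussed, on simulated matrices. The per-column metric (`ColumnError`, mean absolute error) is a hypothetical placeholder, not the function from the PR; the point is only that the two forms return the same values.

```r
set.seed(42)
true.mat <- matrix(rnorm(60), nrow = 12, ncol = 5)
recon.mat <- true.mat + matrix(rnorm(60, sd = 0.2), nrow = 12, ncol = 5)

# Hypothetical per-column metric: mean absolute error for one column
ColumnError <- function(col.iter) {
  mean(abs(true.mat[, col.iter] - recon.mat[, col.iter]))
}

# for loop version: preallocate, then fill one column at a time
loop.res <- numeric(ncol(true.mat))
for (col.iter in seq_len(ncol(true.mat))) {
  loop.res[col.iter] <- ColumnError(col.iter)
}

# sapply version: same computation without the explicit loop
sapply.res <- sapply(seq_len(ncol(true.mat)), ColumnError)

print(identical(loop.res, sapply.res))
```

Since `sapply` still calls the function once per column, a large speed-up is not expected from the substitution alone, which is consistent with the timing reported above.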

huqiwen0313

Looks good to me

jaclyn-taroni added 17 commits April 3, 2018 16:56

Merge branch 'master' into origin/initial-util

02fe738

Remove files now being ignored

f936f3a

Add PLIER custom function

6f6638e

Add R notebook for proof-of-concept of PLIER util

474f4bf

Add wrapper script for conversion of Rmd notebook files to .R

795b5cb

To facilitate code review

Add proof-of-concept NB script

7a873d7

knit to PDF

74ee06e

Docker image with ability to knit to PDF

f161af9

newline fix

8a21cce

Remove script extracted from notebook

89f8f5b

Remove converted PDF & associated Dockerfile

00e590c

Remove knitr::purl wrapper

86747d0

Update proof-of-concept notebook

df6ebc0

Update plot directory structure

378ac59

Add recount PLIER EDA notebook

d3228b2

Remove checks for FDR presence

300bb12

All PLIER models built in Docker container will have FDR calculations

Add sorted immune cell notebook

97e7f57

jaclyn-taroni changed the title ~~Add PLIER util [WIP]~~ Add custom functions for working with PLIER models and initial exploratory analyses Apr 5, 2018

jaclyn-taroni requested review from gwaybio and huqiwen0313 April 5, 2018 17:32

gwaybio reviewed Apr 5, 2018

huqiwen0313 reviewed Apr 6, 2018

jaclyn-taroni added 4 commits April 6, 2018 08:25

Update: style & doc in response to PR comments

530c542

Simplify reordering

1fa6d4b

Update: docs, rerun function updates

19a3635

Correct documentation

83a8466

jaclyn-taroni mentioned this pull request Apr 6, 2018

Consider filtering genes for reconstruction #4

Closed

jaclyn-taroni added 3 commits April 6, 2018 14:56

Update: recount EDA notebook in response to PR comments

e3fe726

Update heatmap

1134db9

Add comment to correlation function

f7c500a

Sub sapply for loop

fd4ae21

gwaybio approved these changes Apr 7, 2018

huqiwen0313 approved these changes Apr 8, 2018

jaclyn-taroni merged commit 23c8b35 into greenelab:master Apr 8, 2018

jaclyn-taroni deleted the initial-util branch April 8, 2018 20:06

jaclyn-taroni mentioned this pull request Apr 11, 2018

Isolated immune cell reconstruction evaluation #5

Merged
