Update recount PLIER #20

jaclyn-taroni · 2018-03-22T14:26:05Z

This PR updates the recount PLIER pipeline in recount2/. Specifically, we run the newest version of PLIER (a2d4a2a) in a Docker container (image jtaroni/multi-plier:recount).

To facilitate this I've made the following changes:

Updated the Docker image to include the recount Bioconductor package (& dependencies). I've added an additional Dockerfile and a list of installed R packages and their versions. I decided pulling from jtaroni/multi-plier:v1 in a new Dockerfile was a little easier to follow than using docker commit in this particular context. The docker subdirectory and docker/list_user_installed_R_packages.R have been updated to reflect this change.
Updated the recount PLIER pipeline in the following ways:
- Renamed scripts/files to be more consistent with the rest of the repository
- Updated the combination/merging of multiple experiments to be a bit more efficient
- Split the data prep (row-normalization, determining k) and the main PLIER function into two scripts. The rationale here is that each of these steps is computationally intensive; if both lived in a single script and, for example, the PLIER bit failed, both steps would need to be rerun. recount2/2-prep_recount_for_plier.R and recount2/3-run_recount_plier.R are modified from @huqiwen0313's original code (even though the diff doesn't look like it).
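
A minimal sketch of the handoff between the two scripts, assuming toy values and a temporary file in place of the real data. The list element names follow those read back in recount2/3-run_recount_plier.R; everything else here is illustrative:

```r
# prep step (sketch): bundle the inputs the PLIER step will need
# toy stand-ins; the real objects come from the recount data and PLIER priors
plier.data.list <- list(
  rpkm.cm = matrix(c(0.5, -1.2, 0.7, 1.1, -0.3, -0.8), nrow = 3,
                   dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
  all.paths.cm = matrix(c(1, 0, 1), nrow = 3,
                        dimnames = list(c("g1", "g2", "g3"), "pathwayA")),
  k = 2
)

# a temporary file is used here so the sketch runs anywhere
prep.file <- tempfile(fileext = ".RDS")
saveRDS(plier.data.list, file = prep.file)

# PLIER step (sketch): start from the saved file instead of redoing the prep
reloaded <- readRDS(file = prep.file)
```

If the second step fails, only that step needs to be rerun, because the prep output is already on disk.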

Pending approval, we'll add the ignored recount files to figshare. Once that's done I will update the README in a subsequent PR. I will also add updated, recount-specific instructions to the Docker section of the README in that PR.

gwaybio

A couple of comments

gwaybio · 2018-03-22T16:19:32Z

recount2/1-get_all_recount_dataset.R

@@ -1,7 +1,8 @@
 # Qiwen Hu - 2017


Want to add your name @jaclyn-taroni ? Or are the edits not substantial enough?

gwaybio · 2018-03-22T17:38:22Z

recount2/1-get_all_recount_dataset.R

+
+# combine experiments -- this is the most memory efficient way to go about this
+# that I've found -- will need to drop extraneous gene id columns
+rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id"))


does dplyr::bind_cols() work?

Did a bit of exploration of this topic here: https://github.com/greenelab/rheum-plier-data/pull/10/files#diff-336c115ba63149bca524a54de200d2b2R214

I believe this will be more efficient

gwaybio · 2018-03-22T17:40:34Z

recount2/1-get_all_recount_dataset.R

+# that I've found -- will need to drop extraneous gene id columns
+rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id"))
+id.cols <- grep("id", colnames(rpkm.df))
+rpkm.df <- rpkm.df[, -id.cols[2:length(id.cols)]]


not really sure what's happening here, does something like

rpkm.df %>% dplyr::select(starts_with("id"))

work?

Can do

rpkm.df %>% dplyr::select(-dplyr::ends_with("id"))

will update

gwaybio · 2018-03-22T17:42:43Z

recount2/1-get_all_recount_dataset.R

+rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id"))
+id.cols <- grep("id", colnames(rpkm.df))
+rpkm.df <- rpkm.df[, -id.cols[2:length(id.cols)]]
+rpkm.df <- rpkm.df[, c(id.cols[1], 1:(id.cols[1] - 1),


looks like the purpose of this is to reorder the dataframe by column? This may be tough to do other than how you have it, but a comment describing this would make it easier to understand
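
For readers following along, here is a base R sketch of the two steps discussed above (dropping the repeated gene id columns and moving the remaining one to the front). The toy tables are hypothetical, and this is one possible way to do it, not the code that was merged:

```r
# toy stand-ins for per-experiment RPKM tables (hypothetical data);
# each has sample columns followed by a gene "id" column
rpkm.list <- list(
  data.frame(s1 = c(1, 2), s2 = c(3, 4), id = c("g1", "g2")),
  data.frame(s3 = c(5, 6), id = c("g1", "g2"))
)

# column-bind all experiments; this assumes every table lists the
# genes in the same row order
rpkm.df <- do.call(base::cbind, rpkm.list)

# positions of every gene id column (one per experiment)
id.cols <- which(colnames(rpkm.df) == "id")

# keep the first id column, put it first, then keep every non-id column
keep.cols <- c(id.cols[1], setdiff(seq_len(ncol(rpkm.df)), id.cols))
rpkm.df <- rpkm.df[, keep.cols]
```

Selecting by position sidesteps the duplicated "id" column names that cbind leaves behind.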

gwaybio · 2018-03-22T17:43:55Z

recount2/2-prep_recount_for_plier.R

+genes <- unlist(lapply(strsplit(rpkm$ENSG, "[.]"), `[[`, 1))
+rpkm$ensembl_gene_id <- unlist(lapply(strsplit(rpkm$ENSG, "[.]"), `[[`, 1))
+gene.list <- biomaRt::getBM(filters = "ensembl_gene_id",
+		attributes = c("ensembl_gene_id", "hgnc_symbol"),


indentation is a bit off here
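
As an aside on what the strsplit lines in this hunk appear to do: they strip the version suffix from Ensembl gene IDs so the IDs can be matched against biomaRt. A small sketch with made-up example IDs, showing the same idiom next to a regex alternative:

```r
# example versioned Ensembl gene IDs (illustrative values)
ensg <- c("ENSG00000000003.14", "ENSG00000000005.5")

# idiom from the diff: split on the literal dot and keep the first piece
via.strsplit <- unlist(lapply(strsplit(ensg, "[.]"), `[[`, 1))

# alternative: remove everything from the first dot onward
via.sub <- sub("[.].*$", "", ensg)
```

Both give the unversioned IDs; the sub() form avoids building an intermediate list.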

gwaybio · 2018-03-22T17:56:42Z

recount2/2-prep_recount_for_plier.R

+rpkm <- rpkm[, -1*c(1, 2, ncol(rpkm))]
+
+# PLIER prior information (pathways)
+allPaths <- combinePaths(bloodCellMarkersIRISDMAP, svmMarkers,


what is combinePaths and commonRows?

Functions from PLIER for combining the different pathway matrices into a single, prior information matrix and finding the overlapping genes between the gene expression matrix and prior info, respectively
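
To make the "single prior information matrix" idea concrete, here is a rough base R analogue. The function combine_gene_sets and the toy matrices are hypothetical, and this is not PLIER's implementation of combinePaths; it only illustrates stacking gene-by-pathway membership matrices over the union of their genes:

```r
# hypothetical helper: pad each gene x pathway matrix to the union of genes
# (absent genes get 0), then bind the pathway columns together
combine_gene_sets <- function(...) {
  mats <- list(...)
  all.genes <- sort(unique(unlist(lapply(mats, rownames))))
  padded <- lapply(mats, function(m) {
    out <- matrix(0, nrow = length(all.genes), ncol = ncol(m),
                  dimnames = list(all.genes, colnames(m)))
    out[rownames(m), ] <- m
    out
  })
  do.call(cbind, padded)
}

# toy membership matrices (1 = gene is in the pathway)
set.a <- matrix(c(1, 0, 1, 1), nrow = 2,
                dimnames = list(c("g1", "g2"), c("pA", "pB")))
set.b <- matrix(c(1, 1), nrow = 2,
                dimnames = list(c("g2", "g3"), "pC"))

all.paths <- combine_gene_sets(set.a, set.b)
```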

gwaybio · 2018-03-22T17:56:55Z

recount2/2-prep_recount_for_plier.R

+cm.genes <- commonRows(allPaths, rpkm)
+
+# filter to common genes before row normalization to save on computation
+rpkm.cm <- rpkm[cm.genes, ]


dplyr::filter()?

This way will also reorder the matrix
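
A small sketch of why indexing by gene name both filters and reorders. Here intersect() stands in for PLIER's commonRows, and the matrix and gene names are toy values:

```r
# toy expression matrix whose rows are in a different order than the prior
expr <- matrix(1:8, nrow = 4,
               dimnames = list(c("g3", "g1", "g4", "g2"), c("s1", "s2")))

# row names of a hypothetical prior information matrix
prior.genes <- c("g1", "g2", "g3")

# genes present in both, in the order they appear in the prior
cm.genes <- intersect(prior.genes, rownames(expr))

# character indexing drops g4 and returns rows in cm.genes order
expr.cm <- expr[cm.genes, ]
```

After this step the expression rows line up with the prior matrix rows, which a plain logical filter would not guarantee.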

gwaybio · 2018-03-22T17:57:22Z

recount2/2-prep_recount_for_plier.R

+   canonicalPathways)
+
+# row-normalize (z-score)
+rpkm.cm <- rowNorm(rpkm.cm)


i guess these are in the PLIER library?
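
For readers unfamiliar with row-wise z-scoring, a base R sketch of the same idea follows. It is not PLIER's rowNorm, which may handle edge cases such as zero-variance rows differently; the matrix is a toy example:

```r
# toy expression matrix: genes in rows, samples in columns
expr <- matrix(c(1, 2, 3,
                 10, 20, 30),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

# scale() works on columns, so transpose, z-score, and transpose back
row.z <- t(scale(t(expr)))
```

Each gene ends up with mean 0 and standard deviation 1 across samples, so genes on very different scales become comparable.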

gwaybio · 2018-03-22T17:57:47Z

recount2/3-run_recount_plier.R

+
+# read in data
+plier.data.list <- readRDS(file = file.path("recount2",
+                                           	"recount_data_prep_PLIER.RDS"))


wonky indent

gwaybio · 2018-03-22T17:58:21Z

recount2/3-run_recount_plier.R

+# run PLIER
+plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm), 
+                     plier.data.list$all.paths.cm,
+                     k = round((plier.data.list$k + plier.data.list$k*0.3), 0), 


what is 0.3?

also needs spaces here between *

See #1 (comment) -- PLIER authors suggest using a larger k than the estimation (30-50% IIRC)
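
The arithmetic in the k argument, spelled out with descriptive names. The estimate of 40 is a made-up value and the variable names are illustrative, not those used in the repository:

```r
# hypothetical estimate of the number of latent variables from the prep step
estimated.k <- 40

# fraction by which to inflate the estimate (the 0.3 in the diff)
k.increase <- 0.3

# same expression as in the diff, with named pieces
k.to.use <- round(estimated.k + estimated.k * k.increase, 0)
```

Naming the 0.3 makes it clear that it is a tunable choice within the suggested range and not a fixed constant.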

jaclyn-taroni · 2018-03-22T19:32:36Z

Thanks for the comments @gwaygenomics! Think it is ready for another 👀

gwaybio

nice updates

gwaybio · 2018-03-22T19:39:06Z

recount2/3-run_recount_plier.R

+# run PLIER
+plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm), 
+                     plier.data.list$all.paths.cm,
+                     k = round((plier.data.list$k + plier.data.list$k * 0.3), 


may want to consider setting plier.data.list$k to a descriptive variable name 🤷‍♂️

This may also help with the small 0), indent on 17

huqiwen0313

Update looks nice !

I only have one small additional comment.

huqiwen0313 · 2018-03-22T23:45:36Z

recount2/3-run_recount_plier.R

+plier.data.list <- readRDS(file = file.path("recount2",
+                                            "recount_data_prep_PLIER.RDS"))
+# run PLIER
+plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm), 


This should be PLIER::PLIER ?
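
A short illustration of what the double colon buys, using stats::median in place of PLIER::PLIER so the sketch runs without PLIER installed:

```r
# a local definition that masks a function of the same name
median <- function(x) "masked"

# the unqualified call now finds the local version
unqualified <- median(c(1, 5, 9))

# the namespace-qualified call always reaches the package's function
qualified <- stats::median(c(1, 5, 9))
```

Writing PLIER::PLIER in the same way makes the origin of the function explicit and guards against masking.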

jaclyn-taroni added 4 commits March 22, 2018 09:55

Update: new version of Docker image that includes recount package

e2f088b

Update: change merge to be more memory efficient

f15f484

Rename to be more consistent with rest of repo, add dir.create step

Update: split recount data prep and main PLIER mode

868653b

Update: ignore new and renamed recount RDS, gz files

94336e2

jaclyn-taroni requested review from gwaybio and huqiwen0313 March 22, 2018 14:26

gwaybio suggested changes Mar 22, 2018

jaclyn-taroni added 5 commits March 22, 2018 14:14

Minor spacing fix

ae73797

Fix indentation

35a058f

Update: "tidy" combining experiments

0e77904

Update in response to PR comments

b032677

Fix indent

31c5981

gwaybio approved these changes Mar 22, 2018

Update: k arg variable name

8e204cb

huqiwen0313 approved these changes Mar 22, 2018

double colon for PLIER function

fed9f9a

jaclyn-taroni merged commit 978c379 into greenelab:master Mar 23, 2018

jaclyn-taroni deleted the recount branch March 23, 2018 11:13

This was referenced Mar 30, 2018

Update README information about recount data #21

Merged

Add custom functions for working with PLIER models and initial exploratory analyses greenelab/multi-plier#3

Merged
