Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update recount PLIER #20

Merged
merged 11 commits into from
Mar 23, 2018
Merged

Update recount PLIER #20

merged 11 commits into from
Mar 23, 2018

Conversation

jaclyn-taroni
Copy link
Collaborator

This PR is an update of the recount PLIER pipeline recount2/. Specifically, we run the newest version of PLIER (a2d4a2a) in a Docker container (image jtaroni/multi-plier:recount).

To facilitate this I've made the following changes:

  • Updated the Docker image to include the recount bioconductor package (& dependencies). I've added an additional Dockerfile and list of installed R packages and their versions. I decided pulling from jtaroni/multi-plier:v1 in a new Dockerfile was a little easier to follow than using docker commit in this particular context. The docker subdirectory and docker/list_user_installed_R_packages.R have been update to reflect this change.
  • Updated the recount PLIER pipeline in the following ways:
    • Renamed scripts/files to be more consistent with the rest of the repository
    • Updated the the combination/merging of multiple experiments to be a bit more efficient
    • Split the data prep (row-normalization, determining k) and the main PLIER function into two scripts. The rationale here is that each of these steps are computationally intensive and if, for example, the PLIER bit fails, both processes would need to be rerun. recount2/2-prep_recount_for_plier.R and recount2/3-run_recount_plier.R are modified from @huqiwen0313 's original code (even though the diff doesn't look like it).

Pending approval, we'll add the ignored recount files to figshare. Once that's done I will update the README in a subsequent PR. I will also add updated, recount-specific instructions to the Docker section of the README in that PR.

Copy link

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments

@@ -1,7 +1,8 @@
# Qiwen Hu - 2017
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to add your name @jaclyn-taroni ? Or are the edits not substantial enough?


# combine experiments -- this is the most memory efficient way to go about this
# that I've found -- will need to drop extraneous gene id columns
rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does dplyr::bind_cols() work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a bit of exploration of this topic here: https://github.com/greenelab/rheum-plier-data/pull/10/files#diff-336c115ba63149bca524a54de200d2b2R214

I believe this will be more efficient

# that I've found -- will need to drop extraneous gene id columns
rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id"))
id.cols <- grep("id", colnames(rpkm.df))
rpkm.df <- rpkm.df[, -id.cols[2:length(id.cols)]]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really sure what's happening here, does something like

rpkm.df >%> dplyr::select(starts_with("id"))

work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do

rpkm.df %>% dplyr::select(-dplyr::ends_with("id"))

will update

rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id"))
id.cols <- grep("id", colnames(rpkm.df))
rpkm.df <- rpkm.df[, -id.cols[2:length(id.cols)]]
rpkm.df <- rpkm.df[, c(id.cols[1], 1:(id.cols[1] - 1),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the purpose of this is to reorder the dataframe by column? This may be tough to do other than how you have it, but a comment describing this would make it easier to understand

genes <- unlist(lapply(strsplit(rpkm$ENSG, "[.]"), `[[`, 1))
rpkm$ensembl_gene_id <- unlist(lapply(strsplit(rpkm$ENSG, "[.]"), `[[`, 1))
gene.list <- biomaRt::getBM(filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "hgnc_symbol"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation is a bit off here

rpkm <- rpkm[, -1*c(1, 2, ncol(rpkm))]

# PLIER prior information (pathways)
allPaths <- combinePaths(bloodCellMarkersIRISDMAP, svmMarkers,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is combinePaths and commonRows?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions from PLIER for combining the different pathway matrices into a single, prior information matrix and finding the overlapping genes between the gene expression matrix and prior info, respectively

cm.genes <- commonRows(allPaths, rpkm)

# filter to common genes before row normalization to save on computation
rpkm.cm <- rpkm[cm.genes, ]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dplyr::filter()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way will also reorder the matrix

canonicalPathways)

# row-normalize (z-score)
rpkm.cm <- rowNorm(rpkm.cm)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess these are in the PLIER library?


# read in data
plier.data.list <- readRDS(file = file.path("recount2",
"recount_data_prep_PLIER.RDS"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonky indent

# run PLIER
plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm),
plier.data.list$all.paths.cm,
k = round((plier.data.list$k + plier.data.list$k*0.3), 0),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is 0.3?

also needs spaces here between *

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1 (comment) -- PLIER authors suggest using a larger k than the estimation (30-50% IIRC)

@jaclyn-taroni
Copy link
Collaborator Author

Thanks for the comments @gwaygenomics! Think it is ready for another 👀

Copy link

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice updates

# run PLIER
plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm),
plier.data.list$all.paths.cm,
k = round((plier.data.list$k + plier.data.list$k * 0.3),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may want to consider setting plier.data.list$k to a descriptive variable name 🤷‍♂️

This may also help with the small 0), indent on 17

Copy link
Contributor

@huqiwen0313 huqiwen0313 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update looks nice !

I only have one small additional comment.

plier.data.list <- readRDS(file = file.path("recount2",
"recount_data_prep_PLIER.RDS"))
# run PLIER
plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be PLIER::PLIER ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants