-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update recount PLIER #20
Conversation
Rename to be more consistent with rest of repo, add dir.create step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of comments
@@ -1,7 +1,8 @@ | |||
# Qiwen Hu - 2017 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to add your name @jaclyn-taroni ? Or are the edits not substantial enough?
|
||
# combine experiments -- this is the most memory efficient way to go about this | ||
# that I've found -- will need to drop extraneous gene id columns | ||
rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does dplyr::bind_cols()
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a bit of exploration of this topic here: https://github.com/greenelab/rheum-plier-data/pull/10/files#diff-336c115ba63149bca524a54de200d2b2R214
I believe this will be more efficient
recount2/1-get_all_recount_dataset.R
Outdated
# that I've found -- will need to drop extraneous gene id columns | ||
rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id")) | ||
id.cols <- grep("id", colnames(rpkm.df)) | ||
rpkm.df <- rpkm.df[, -id.cols[2:length(id.cols)]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really sure what's happening here, does something like
rpkm.df >%> dplyr::select(starts_with("id"))
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do
rpkm.df %>% dplyr::select(-dplyr::ends_with("id"))
will update
recount2/1-get_all_recount_dataset.R
Outdated
rpkm.df <- do.call(base::cbind, c(rpkm.list, by = "id")) | ||
id.cols <- grep("id", colnames(rpkm.df)) | ||
rpkm.df <- rpkm.df[, -id.cols[2:length(id.cols)]] | ||
rpkm.df <- rpkm.df[, c(id.cols[1], 1:(id.cols[1] - 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like the purpose of this is to reorder the dataframe by column? This may be tough to do other than how you have it, but a comment describing this would make it easier to understand
recount2/2-prep_recount_for_plier.R
Outdated
genes <- unlist(lapply(strsplit(rpkm$ENSG, "[.]"), `[[`, 1)) | ||
rpkm$ensembl_gene_id <- unlist(lapply(strsplit(rpkm$ENSG, "[.]"), `[[`, 1)) | ||
gene.list <- biomaRt::getBM(filters = "ensembl_gene_id", | ||
attributes = c("ensembl_gene_id", "hgnc_symbol"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation is a bit off here
recount2/2-prep_recount_for_plier.R
Outdated
rpkm <- rpkm[, -1*c(1, 2, ncol(rpkm))] | ||
|
||
# PLIER prior information (pathways) | ||
allPaths <- combinePaths(bloodCellMarkersIRISDMAP, svmMarkers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is combinePaths
and commonRows
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions from PLIER for combining the different pathway matrices into a single, prior information matrix and finding the overlapping genes between the gene expression matrix and prior info, respectively
recount2/2-prep_recount_for_plier.R
Outdated
cm.genes <- commonRows(allPaths, rpkm) | ||
|
||
# filter to common genes before row normalization to save on computation | ||
rpkm.cm <- rpkm[cm.genes, ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dplyr::filter()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way will also reorder the matrix
recount2/2-prep_recount_for_plier.R
Outdated
canonicalPathways) | ||
|
||
# row-normalize (z-score) | ||
rpkm.cm <- rowNorm(rpkm.cm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess these are in the PLIER
library?
recount2/3-run_recount_plier.R
Outdated
|
||
# read in data | ||
plier.data.list <- readRDS(file = file.path("recount2", | ||
"recount_data_prep_PLIER.RDS")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonky indent
recount2/3-run_recount_plier.R
Outdated
# run PLIER | ||
plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm), | ||
plier.data.list$all.paths.cm, | ||
k = round((plier.data.list$k + plier.data.list$k*0.3), 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is 0.3?
also needs spaces here between *
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1 (comment) -- PLIER authors suggest using a larger k than the estimation (30-50% IIRC)
Thanks for the comments @gwaygenomics! Think it is ready for another 👀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice updates
recount2/3-run_recount_plier.R
Outdated
# run PLIER | ||
plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm), | ||
plier.data.list$all.paths.cm, | ||
k = round((plier.data.list$k + plier.data.list$k * 0.3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may want to consider setting plier.data.list$k
to a descriptive variable name 🤷♂️
This may also help with the small 0),
indent on 17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update looks nice !
I only have one small additional comment.
recount2/3-run_recount_plier.R
Outdated
plier.data.list <- readRDS(file = file.path("recount2", | ||
"recount_data_prep_PLIER.RDS")) | ||
# run PLIER | ||
plierResult <- PLIER(as.matrix(plier.data.list$rpkm.cm), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be PLIER::PLIER ?
This PR is an update of the recount PLIER pipeline
recount2/
. Specifically, we run the newest version of PLIER (a2d4a2a
) in a Docker container (imagejtaroni/multi-plier:recount
).To facilitate this I've made the following changes:
Dockerfile
and list of installed R packages and their versions. I decided pulling fromjtaroni/multi-plier:v1
in a new Dockerfile was a little easier to follow than usingdocker commit
in this particular context. Thedocker
subdirectory anddocker/list_user_installed_R_packages.R
have been update to reflect this change.k
) and the main PLIER function into two scripts. The rationale here is that each of these steps are computationally intensive and if, for example, the PLIER bit fails, both processes would need to be rerun.recount2/2-prep_recount_for_plier.R
andrecount2/3-run_recount_plier.R
are modified from @huqiwen0313 's original code (even though the diff doesn't look like it).Pending approval, we'll add the ignored recount files to figshare. Once that's done I will update the README in a subsequent PR. I will also add updated, recount-specific instructions to the Docker section of the README in that PR.