Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote GAMBLR (outside GSC) is now a thing. Testers wanted! #99

Merged
merged 53 commits into from
Jun 24, 2022

Conversation

rdmorin
Copy link
Collaborator

@rdmorin rdmorin commented Jun 2, 2022

This was tested on my laptop using the test_remote.R script bundled with GAMBLR. The snakefile to allow the user to sync the required files is also provided. I haven't made a yml file for that environment yet.

Pull Request Checklists

Important: When opening a pull request, keep only the applicable checklist and delete all other sections.

Checklist for all PRs

Required

  • I tested the new code for my use case (please provide a reproducible example of how you tested the new functionality)

  • I ensured all dplyr functions that commonly conflict with other packages are fully qualified.

This can be checked and addressed by running check_functions.pl and responding to the prompts. Test your code after you do this.

  • I generated the documentation and checked for errors relating to the new function (e.g. devtools::document()) and added NAMESPACE and all other modified files in the root directory and under man.

Optional but preferred with PRs

  • I updated and/or successfully knitted a vignette that relies on the modified code (which ones?)

Checklist for New Functions

Required

  • I documented my function using ROxygen style.)

  • All parameters for the function are described in the documentation and the function has a decriptive title.

Example:

#' Use GISTIC2.0 scores output to reproduce maftools::chromoplot with more flexibility
#'
#' @param scores output file scores.gistic from the run of GISTIC2.0
#' @param genes_to_label optional. Provide a data frame of genes to label (if mutated). The first 3 columns must contain chromosome, start, and end coordinates. Another required column must contain gene names and be named `gene`. (truncated for example)
#' @param cutoff optional. Used to determine which regions to color as aberrant. Must be float in the range [0-1]. (truncated for example)
  • My function uses a library that isn't already a dependency of GAMBLR and I made the package aware of this dependency using the function documentation import statment.

Example:

#' @return nothing
#' @export
#' @import tidyverse ggrepel

Checklist for changes to existing code

  • I added/removed arguments to a function and updated documentation for all changed/new arguments

  • I tested the new code for compatability with existing functionality in the Master branch (please provide a reprex of how you tested the original functionality)

@rdmorin
Copy link
Collaborator Author

rdmorin commented Jun 8, 2022

This now has a patch for cBIoportal (portal.R) that fixes some issues caused by recent updates.

@rdmorin rdmorin requested a review from mattssca June 8, 2022 00:52
missing = required_cols[!required_cols %in% colnames(metadata_df)]
message("FAIL! MISSING metadata columns:")
print(paste(missing,sep=" "))
return()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding here else part that will just say something alongside "Success! All required columns are present in the metadata". This way we are letting user know that the completeness check was performed and all is well

R/viz.R Outdated
"pathology" = "pathology", "analysis_cohort" = "pathology", "group" = "pathology",
"FL_group" = "FL", "lymphgen" = "lymphgen", "lymphgen_with_cnv" = "lymphgen",
"bcl2_ba" = "pos_neg", "BCL2_status" = "pos_neg", "myc_ba" = "pos_neg",
"bcl6_ba" = "pos_neg", "manta_BCL2_sv" = "pos_neg", "manual_BCL2_sv" = "pos_neg", "manta_MYC_sv" = "pos_neg")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing. I can think of adding here is "EBV_status_inf"="EBV_status"

lymphoma_genes = left_join(lymphoma_genes,gene_detail)
lymphoma_genes$LymphGen=FALSE
lymphoma_genes[lymphoma_genes$entrezgene_id %in% lymphgen_anno$NCBI_Gene_ID,"LymphGen"] = TRUE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that here line
usethis::use_data(lymphoma_genes, overwrite = TRUE)
might be missing?

output:
sv_combined
run:
shell("cp {input[0]} {output[0]}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line is missing here, but would it make sense to also add the expression data syncing here?

Copy link
Contributor

@mattssca mattssca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as intended on Rstudio (remote, gphost) for me. Please see the comments in the review for a few questions and thoughts.

I will give this a go on my local machine as well (using snakefile to get all required files).

As for cBioPortal study setup, importing the genome study works fine (currently added to our instance), but for some strange reason (Docker being Docker...), it does not work for the capture study. I will troubleshoot this and it's very likely that this error is related to Docker and not the code in this PR. So should be fine to approve this PR.

edit: The capture study was successfully imported into our cBioPortal instance. This was achieved by increasing the memory allocated to Docker.

test_remote.R Outdated
# all these should proceed without error
coding_ssm = get_coding_ssm(these_samples_metadata = all_meta,seq_type = "genome")

cap_ssm = get_coding_ssm(these_samples_metadata = cap_meta,seq_type = "capture",seq_type_priority="genome")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives an unused argument error:
> cap_ssm = get_coding_ssm(these_samples_metadata = cap_meta,seq_type = "capture",seq_type_priority="genome") Error in get_coding_ssm(these_samples_metadata = cap_meta, seq_type = "capture", : unused argument (seq_type_priority = "genome")
Works with the last argument dropped.

test_remote.R Outdated

cbio_path = paste0(config::get("project_base"),"cbioportal-docker-compose/study/GAMBL_capture_2022/")

samples_included = GAMBLR::setup_study(seq_type = "capture",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should seq_type be the same parameter as for genome? (i.e seq_type_filter)

test_remote.R Outdated
description = "GAMBL capture data",
out_dir = cbio_path)

finalize_study(seq_type="capture",short_name="GAMBL_capture_2022",sample_ids=samples_included,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above related to seq_type parameter.

Copy link
Contributor

@lkhilton lkhilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into some errors when running test_remote.R outlined in my comments. I also committed a couple of changes I caught in the process of testing.

svar_all = get_combined_sv(min_vaf = 0.1,projection = "grch37")

dim(svar_all)
# [1] 450641 19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returned a very different value for me:

[1] 150162     19

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get 450641 on the gphost too. Must be a truncated file on your end.

test_remote.R Outdated

all_seg = get_sample_cn_segments(sample_list = all_meta$sample_id,multiple_samples = T)
#dim(all_seg)
#[1] 375247 7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returned the following:

[1] 376588      7

Copy link
Collaborator Author

@rdmorin rdmorin Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I get that number now too. Maybe that file was updated recently?

cn_seg = get_sample_cn_segments("14-24534_tumorB",from_flatfile = T)

#should automagically get the single seg file and MAF for the sample
cn_ssm = assign_cn_to_ssm("14-24534_tumorB",seg_file_source = "battenberg",ssh_session = session)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives me the following error:

trying to find output from: battenberg                                                                      
looking for flatfile: /projects/nhl_meta_analysis_scratch/gambl/results_local/
~/gambl/results_local/
Error in file(con, "rb") : cannot open the connection
In addition: Warning messages:
1: In file(con, "rb") :
  'raw = FALSE' but '/Users/laurahilton/gambl/results_local' is not a regular file
2: In file(con, "rb") :
  cannot open file '/Users/laurahilton/gambl/results_local': it is a directory

# advanced functions


coding_ssm_status = get_coding_ssm_status(gene_symbols = c("EZH2","CREBBP","KMT2D"),these_samples_metadata = cap_meta,maf_data = cap_ssm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives me a similar error message so it's not related to the ssh connection?

Joining, by = "sample_id"
Error in file(con, "rb") : cannot open the connection
In addition: Warning messages:
1: In file(con, "rb") :
  'raw = FALSE' but '/Users/laurahilton/gambl' is not a regular file
2: In file(con, "rb") :
  cannot open file '/Users/laurahilton/gambl': it is a directory

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is probably broken because it's not finding the file it needs under your gambl repo. I think we need to start submitting PR and reviewing gambl/GAMBLR updates in sync but I'm not sure the best way to coordinate this. The new files are under the directory you suggested (versioned_results).




# test if we can make a cBioportal instance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted Docker from my laptop today so I'm unhelpful here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. This part has been tested by Adam.

#'
#' @examples
get_ssh_session = function(host="gphost01.bcgsc.ca"){
session = ssh_connect(host=host)
if (!requireNamespace("ssh", quietly = TRUE)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added library(ssh) to the top of the test_remote.R script because without it in NAMESPACE this throws an error. Is it possible to also add a step to load the library if it's already installed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I suppose we could have it load automatically but I think it's frowned upon to put library() or require() into functions like that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function only needs to be run once per "session" I figured this warning would be fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message I got was a bit cryptic:

Error in ssh_exec_internal(ssh_session, command = paste("stat", aug_maf_path),  :                           
  could not find function "ssh_exec_internal"

What is the best practice in this context when we can't have this package listed in DESCRIPTION as an import? Should we just check if the package is loaded with if(!require("ssh")) and throw an error message if that returns TRUE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. It's not printing this as a message?
"The ssh package must be installed to use this functionality"

@mattssca mattssca mentioned this pull request Jun 15, 2022
9 tasks
@mattssca mattssca merged commit 49b509d into master Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants