Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial release review [DO NOT MERGE] #29

Draft
wants to merge 541 commits into
base: master
Choose a base branch
from
Draft

Initial release review [DO NOT MERGE] #29

wants to merge 541 commits into from

Conversation

scwatts
Copy link
Collaborator

@scwatts scwatts commented May 13, 2024

Background

nf-core/oncoanalyser is a cancer DNA/RNA analysis and reporting pipeline that implements the hmftools workflow, which is developed by the Hartwig Medical Foundation.

The hmftools workflow provides comprehensive analysis of cancer DNA/RNA data where each component is fine-tuned and optimised to operate in an integrated manner to improve genomic characterisation.

There are certain expectations around reference and input data, and how the hmftools workflow should be used. As such, these expectations have guided certain design decisions in oncoanalyser with some described below to provide further context for the community review.

Overarching design choices

Separation of run modes into workflows. The hmftools workflow is flexible and can be used to analyse different types of data (e.g. WGTS, targeted sequencing), which require specific configuration and reference data. There are also additional modes planned for hmftools that will eventually be implemented in oncoanalyser.

Hence, I have structured oncoanalyser so that the majority of the logic corresponding to different run modes is lifted up into workflows rather than at the subworkflow or module level. While this introduces some code duplication, much of the mode-specific logic is simplified by dividing it into different workflows providing a better maintenance and development experience. Retaining this layout will ease the addition of new run modes in the future.

Arbitrarily run individual tools or enter at any point. A key feature of oncoanalyser is the ability to begin analysis from any point in the pipeline or run just a specific set of tools. The primary use-case for this is to skip variant calling and run new/supplementary/modified downstream analyses. For example, a user can run oncoanalyser from PURPLE so long as they provide the required inputs.

Accordingly, outputs from each tool can be provided in the samplesheet and are recognised by oncoanalyser. When a user provides input for a given tool, it is retrieved in the respective subworkflow and inserted into the appropriate input channel. The feature to arbitrarily select tools to run is implemented by wrapping each tool subworkflow call in a conditional statement that checks whether the corresponding tool should be run.

Strong reference genome recommendations. Hartwig support only the reference genomes they distribute for analysis with the hmftools workflows. Hence, oncoanalyser is configured to provide users with a choice of the Hartwig GRCh37 or GRCh38 genome. While not recommended, custom reference genomes can be used (including those from iGenomes) and oncoanalyser will build all necessary indexes for the target analysis. There is additionally an option to write/publish prepared reference data to the output directory.

No egress fee reference data hosting. Given the use of Hartwig-distributed genomes and size of reference data required to run oncoanalyser, a cloud hosting provider is needed to deliver this data to users. I chose to host all default reference data (genomes, hmftools/panel resource files, and other databases) on Cloudflare R2, which is an object storage service that doesn't charge for egress. I've used Cloudflare R2 for this purpose since Jan 2023 with the nominal hosting costs being covered by our organisation (UMCCR).

Other notes

  • I will make PRs outside this review to address each comment/request/etc made

mkcmkc and others added 30 commits February 13, 2024 14:32
+ Put read_alignment and read_processing subworkflows into targeted workflow.
+ Uncomment the whole wgts worflow and integrate this with the read_alignment and read_processing subworkflows.
Otherwise generates a java.nio.channels.OverlappingFileLockException
error
The Debian slim container didn't have ps, which is required by Nextflow
Copy link
Member

@mirpedrol mirpedrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All local modules are good candidates to be added to nf-core/modules, this would make the modules easier to maintain, and the pipeline PR reviews easier.

conf/test_stub.config Outdated Show resolved Hide resolved
.nf-core.yml Outdated Show resolved Hide resolved
CITATIONS.md Outdated Show resolved Hide resolved
assets/samplesheet.csv Show resolved Hide resolved
docs/output.md Outdated Show resolved Hide resolved
modules/local/amber/meta.yml Show resolved Hide resolved
modules/local/custom/extract_tarball/main.nf Outdated Show resolved Hide resolved
modules/local/fastp/main.nf Outdated Show resolved Hide resolved
nextflow.config Outdated Show resolved Hide resolved
modules/local/star/align/main.nf Outdated Show resolved Hide resolved
modules/local/bwa-mem2/mem/main.nf Outdated Show resolved Hide resolved
modules/local/cobalt/main.nf Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants