
Workflow output definition #275

Closed
bentsherman wants to merge 17 commits into dev from workflow-inputs-outputs

Conversation

@bentsherman

This PR is a prototype for defining an output schema for a Nextflow pipeline. See nextflow-io/nextflow#4669 and nextflow-io/nextflow#4670 for original discussions.

The meta workflow that we are targeting is:

fetchngs -> rnaseq -> differentialabundance

In other words, we want to eliminate the manual curation of samplesheets between each pipeline. To do this, the output schema should "mirror" the params schema: it should describe the outputs as a collection of samplesheets.

Here is the tree of pipeline outputs for the fetchngs test profile:

/custom/user-settings.mkfg
/fastq/DRX024467_DRR026872.fastq.gz
/fastq/DRX026011_DRR028935_1.fastq.gz
/fastq/DRX026011_DRR028935_2.fastq.gz
/fastq/ERX1234253_ERR1160846.fastq.gz
/fastq/SRX10940790_SRR14593545_1.fastq.gz
/fastq/SRX10940790_SRR14593545_2.fastq.gz
/fastq/SRX11047067_SRR14709033.fastq.gz
/fastq/SRX17709227_SRR21711856.fastq.gz
/fastq/SRX17709228_SRR21711855.fastq.gz
/fastq/SRX6725035_SRR9984183.fastq.gz
/fastq/SRX9315476_SRR12848126_1.fastq.gz
/fastq/SRX9315476_SRR12848126_2.fastq.gz
/fastq/SRX9504942_SRR13055517_1.fastq.gz
/fastq/SRX9504942_SRR13055517_2.fastq.gz
/fastq/SRX9504942_SRR13055518_1.fastq.gz
/fastq/SRX9504942_SRR13055518_2.fastq.gz
/fastq/SRX9504942_SRR13055519_1.fastq.gz
/fastq/SRX9504942_SRR13055519_2.fastq.gz
/fastq/SRX9504942_SRR13055520_1.fastq.gz
/fastq/SRX9504942_SRR13055520_2.fastq.gz
/fastq/SRX9626017_SRR13191702_1.fastq.gz
/fastq/SRX9626017_SRR13191702_2.fastq.gz
/fastq/md5/DRX024467_DRR026872.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_1.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_2.fastq.gz.md5
/fastq/md5/ERX1234253_ERR1160846.fastq.gz.md5
/fastq/md5/SRX17709227_SRR21711856.fastq.gz.md5
/fastq/md5/SRX17709228_SRR21711855.fastq.gz.md5
/fastq/md5/SRX6725035_SRR9984183.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_2.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_1.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_2.fastq.gz.md5
/metadata/DRR026872.runinfo_ftp.tsv
/metadata/DRR028935.runinfo_ftp.tsv
/metadata/ERR1160846.runinfo_ftp.tsv
/metadata/GSE214215.runinfo_ftp.tsv
/metadata/GSM4907283.runinfo_ftp.tsv
/metadata/SRR12848126.runinfo_ftp.tsv
/metadata/SRR13191702.runinfo_ftp.tsv
/metadata/SRR14593545.runinfo_ftp.tsv
/metadata/SRR14709033.runinfo_ftp.tsv
/metadata/SRR9984183.runinfo_ftp.tsv
/samplesheet/id_mappings.csv
/samplesheet/multiqc_config.yml
/samplesheet/samplesheet.csv

From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.

The initial output schema basically describes this samplesheet in a similar manner to the input_schema.json file. This particular output schema should closely resemble the input_schema.json for nf-core/rnaseq.
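
For a sense of the shape, here is an illustrative fragment (a hedged sketch only; the property names follow the rnaseq samplesheet columns and are not necessarily the literal contents of this PR's schema file):

    {
        "samplesheet": {
            "path": "samplesheet/samplesheet.csv",
            "items": {
                "type": "object",
                "properties": {
                    "sample":  { "type": "string" },
                    "fastq_1": { "type": "string", "format": "file-path" },
                    "fastq_2": { "type": "string", "format": "file-path" }
                },
                "required": ["sample", "fastq_1"]
            }
        }
    }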

What I'd like to do from here is collect feedback on this approach -- what else is needed to complete the output schema for this pipeline? Then we can think about how to operationalize it in Nextflow -- should Nextflow automatically generate the samplesheet from the schema? How does the schema interact with the publish mechanism? How do we collect metadata that normally can't be published directly, only through files?

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the base branch from master to dev February 13, 2024 21:08
@evanfloden

Clarifying: is the schema added in this PR the output equivalent of the samplesheet schema schema_input.json here? You have it as input_schema.json above.

I had considered this schema to define one of the inputs/outputs of a pipeline, whereas the nextflow_schema.json in the base directory of the repo defines all the possible inputs. Is that correct?

@adamrtalbot
Contributor

From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.

This is unique to fetchngs - I would just ignore it and pretend it doesn't exist.

@adamrtalbot
Contributor

adamrtalbot commented Feb 14, 2024

It's not clear to me what this adds.

  • Pipeline developer writes code to generate samplesheet in pipeline 1
  • Pipeline developer writes code to read samplesheet in pipeline 2
    • Optional: use a samplesheet schema with nf-validation

Where does this file fit in?

@bentsherman
Author

@evanfloden this output schema is like the nextflow_schema.json with the schema_input.json embedded. So it lists all of the outputs, but each samplesheet output has its own schema embedded, instead of in a separate file, for simplicity.

@adamrtalbot At the very least, this output schema should be used to validate any samplesheets that are produced, and allow external tools like Seqera Platform to inspect a workflow's expected outputs e.g. for the purpose of chaining pipelines.

What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.

@adamrtalbot
Contributor

What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.

Not clear to me either. fetchngs uses an exec process to do this; I think that's quite an overhead for every pipeline developer.

Perhaps something like this could work:

my_channel
    .toSamplesheet(schema: 'output_schema.json', format: 'csv')

Although it's not clear how you go from channel contents to file contents.

@bentsherman
Author

I think that could work. As long as the channel emits maps (or records once we support record types properly), generating the samplesheet is trivial.
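
For instance, a minimal sketch using the existing collectFile operator, assuming the channel emits maps that all carry the same keys (column names borrowed from the rnaseq samplesheet for illustration):

    ch_records
        .map { rec -> [rec.sample, rec.fastq_1, rec.fastq_2].join(',') + '\n' }
        .collectFile(name: 'samplesheet.csv', seed: 'sample,fastq_1,fastq_2\n', sort: true)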

@pditommaso

This looks like it's going in the right direction. One thing I find awful in the current schema is the JSON Schema itself, which is totally unreadable. I wonder whether we should look into a different, more human-friendly system.

@pditommaso

Possible alternatives

@drpatelh
Member

drpatelh commented Feb 21, 2024

My biggest concern with this is how unwieldy that file is going to get when we go from defining an output schema for a very simple pipeline like fetchngs to one for rnaseq. This is why I was suggesting we try to incorporate the publishing logic and output file definition at the module/subworkflow/workflow level and then combine them somehow, rather than having one single massive file.

I also suspect there will still need to be some sort of "conversion" layer or plugin that can take this output schema file to generate custom csvs/jsons etc which can be used as input downstream for other pipelines. Ideally, this plugin can be invoked outside of the pipeline context.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@ewels
Member

ewels commented Feb 28, 2024

but each samplesheet output has its own schema embedded instead of in a separate file for simplicity.

I don't think that we should do this; it breaks how JSON Schema validation works. The beauty of using the standard is that very many platforms and libraries use the syntax in the same way. You have a parsed object in memory (be it params or the contents of a samplesheet, doesn't really matter) and you validate it against a schema.

If we start merging samplesheet schemas inside the output schema, we can no longer use it for validation this way. We would have to validate the output files with subsets of the schema, and validate the list of output files with another subset. If you have to break the schema down to use it, it becomes custom and a lot less useful imho. Separate files are undoubtedly more verbose, but they're also much more portable.

This is why the nextflow_schema.json for params refers to the path to a separate schema file for any given files, rather than embedding that logic within.
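
For reference, a trimmed excerpt of that pattern as it appears in a typical nf-core params schema (exact description text varies by pipeline):

    "input": {
        "type": "string",
        "format": "file-path",
        "schema": "assets/schema_input.json",
        "description": "Path to comma-separated file containing information about the samples."
    }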

@ewels
Member

ewels commented Feb 28, 2024

@pditommaso YAML is fine (and Ben's YAML conversion here hopefully is a lot easier to read), but my strong preference is to stick with as-close-to-as-possible JSON Schema syntax.

To clarify, that JSON Schema can be written in YAML (or toml, or really any format), as long as it's laid out with the structure and keywords of JSON schema. The benefit of using it is that there are about a bazillion different implementations so it just works everywhere.
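
For instance, an ordinary JSON Schema fragment written as YAML (illustrative keys):

    type: object
    properties:
      sample:
        type: string
      fastq_1:
        type: string
        format: file-path
    required: [sample, fastq_1]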

In contrast, the Yamale syntax you linked to seems to be a Python tool with its own schema syntax, so every part of our toolchain would need to build its own parser and validation library for that syntax.

The YAML Schema you linked to seems to still be valid JSON Schema, just in YAML format and with a couple of extra keys. That would still work with any JSON Schema implementation, so that'd be fine. But I'm not sure that we're doing anything complex enough to need those extra keywords to be honest.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>

github-actions bot commented Feb 28, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 9beb5ea

  • ✅ 155 tests passed
  • ❔ 5 tests were ignored
  • ❗ 4 tests had warnings

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml

❔ Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • actions_ci - actions_ci
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-03-01 18:00:10

@bentsherman
Author

I converted the JSON schema to YAML just to see what it looks like, and it is indeed much simpler. If the JSON schema can be used with YAML "schemas" just the same, that seems like the best approach to me, even for the nextflow_schema.json.

@bentsherman
Author

I also added a prototype for the workflow output DSL (see nextflow-io/nextflow#4784). It allows you to define an arbitrarily nested directory structure with path(), then publish process outputs with select() using a process selector and the standard publish options. It is, in my opinion, stupid simple.
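
As a rough sketch of what that looks like (directory and process names taken from the fetchngs outputs above; the exact syntax was still evolving in this prototype):

    path('results') {
        path('fastq') {
            select 'SRA_FASTQ_FTP', pattern: '*.fastq.gz'
        }
        path('metadata') {
            select 'SRA_RUNINFO_TO_FTP', pattern: '*.tsv'
        }
    }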

Another idea I considered was being able to select channels from the top-level workflow emits, but that is slightly more complicated to implement (and adds some boilerplate to the pipeline code) whereas I found I could get the job done with just the process outputs.

I thought about having some DSL method like index <source-channel> <filename> which could collect metadata records from a channel and write them to a file. It's actually pretty trivial to do with Groovy; fetchngs was just doing it in a roundabout way, so I simplified some things in the pipeline code.

@bentsherman
Author

At this point, the output DSL is concerned only with mapping process outputs to a directory structure. Where output schemas could come in is as an optional refinement to describe the structure of specific files:

select 'SRA_TO_SAMPLESHEET', pattern: 'samplesheet.csv', schema: 'schema_samplesheet.json'
select 'SRA_TO_SAMPLESHEET', pattern: 'id_mapping.csv', schema: 'schema_mapping.json'

So it's still up to the user to generate the output file, and they might even be able to use the same output schema to do it (like Adam's toSamplesheet() example). But the above definition can be used by external tools and users to understand the structure of workflow outputs without running the pipeline.

Given this example, I agree with @ewels that it makes more sense to keep the schema for each file separate. I'm imagining a nextflow command to generate some kind of global schema from this output definition (i.e. by the pipeline developer before a version release) for use by external tools.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Author

See nf-core/rnaseq#1227 for a similar prototype with rnaseq. It is not for the faint of heart.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@mahesh-panchal
Member

What would publishing using task variables and other workflow variables look like?

The output DSL will be scoped to the script or workflow block (not sure which one yet), so it will be able to use any variables in that scope. Task variables aren't supported, since the publishing is decoupled from the individual tasks.

Doesn't this take away a major feature, like publishing files to a folder based on sample name? I think there are quite a few cases where some field of meta, for example, is used in the publish path.

@adamrtalbot
Contributor

Doesn't this take away a major feature like publishing files to a folder based on sample name? I think there are quite a few examples where some field of meta for example is used in the publish path.

Good point; being able to publish files as results/sample1/bam/sample1.bam is a requirement. Presumably this would work?

path( "results" ) {
    select 'SAMTOOLS_SORT', pattern: '*.bam', saveAs: { "${meta.id}/bam/${it}" }
}

For what it's worth, this is another good argument for driving publishing from channels rather than processes, because then the vals would be in scope. You can see that in action here: https://github.com/nf-core/fetchngs/pull/302/files

@bentsherman
Author

Doesn't this take away a major feature like publishing files to a folder based on sample name?

I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.

It is a consequence of decoupling the publishing from the task execution. We might be able to recover it in #302 by allowing the path to reference channel items, e.g. given a channel of files with metadata, publish each file to a path based on the meta id, but I'm not sure what that syntax would look like.

@bentsherman
Author

@adamrtalbot good point, with channel selectors we could do something like this:

path( "results" ) {
    select NFCORE_RNASEQ.out.bam, saveAs: { meta, bam -> "${meta.id}/bam/${bam.name}" }
}

The only thing is, I was imagining the selected channel would just provide paths, but if it provides tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.

@bentsherman bentsherman changed the title Workflow output DSL Workflow output DSL (process selectors) Mar 19, 2024
@mahesh-panchal
Member

it's not obvious how the file elements are being pulled out of the tuple.

Isn't this how it is now? Only path types are published; val, env, etc. are ignored.

@adamrtalbot
Contributor

adamrtalbot commented Mar 19, 2024

I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.

It's unusual in nf-core, but quite common elsewhere.

The only thing is, I was imagining the selected channel would just provide paths, but if it provides tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.

My thought was, capture all the contents of the channel, publish only the file-like objects. Then we can dump all the contents to a log of some description, similar to how nf-test does it in snapshots (snippet below for anyone who hasn't seen one).

{
    "with_umi": {
        "content": [
            [
                [
                    {
                        "id": "test",
                        "single_end": true
                    },
                    "test.fastp.fastq.gz:md5,ba8c6c3a7ce718d9a2c5857e2edf53bc"
                ]
            ],
            [
                [
                    {
                        "id": "test",
                        "single_end": true
                    },
                    "test.fastp.json:md5,d39c5c6d9a2e35fb60d26ced46569af6"
                ]
            ],
            // etc.
        ],
        "meta": {
            "nf-test": "0.8.4",
            "nextflow": "23.10.1"
        },
        "timestamp": "2024-03-18T17:31:09.193212"
    }
}

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow output DSL (process selectors) Workflow output DSL (channel selectors) Mar 21, 2024
@bentsherman
Author

Since there is a lot of support for the channel selectors, but most of the discussion is here, I closed the other PR and migrated this one to the channel selectors.

I found that topics weren't really needed for fetchngs. We'll see what happens with rnaseq.

@bentsherman
Author

Isn't this how it is now? Only path types are published; val, env, etc. are ignored.

@mahesh-panchal yes, because of how process outputs are declared, Nextflow knows which tuple elements are paths and collects them accordingly.

Since a generic channel doesn't have such a declaration, for now I think we can follow @adamrtalbot's suggestion and traverse whatever data structure the channel spits out:

  • if it's a path or path list, publish it
  • if it's a tuple, traverse each element
    • if the element is a path or path list, publish it
  • otherwise raise an error
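
As a rough Groovy sketch of that traversal (a hypothetical helper for illustration, not the actual Nextflow implementation):

    import java.nio.file.Path

    def filesToPublish(value) {
        if (value instanceof Path)
            return [value]                            // bare path: publish it
        if (value instanceof Collection && value.every { it instanceof Path })
            return value as List                      // path list: publish all of them
        if (value instanceof Collection)              // tuple: traverse each element,
            return value.collectMany { el ->          // skipping metadata (maps, vals, ...)
                (el instanceof Path || el instanceof Collection) ? filesToPublish(el) : []
            }
        throw new IllegalArgumentException("cannot publish value of type ${value.getClass().name}")
    }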

Once we have support for record types and better type inference in the language, we'll be able to infer the incoming type and, if it's a record type, scan the type definition for path elements. Much more robust, but not a blocker for the quick-and-dirty solution.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow output DSL (channel selectors) Workflow output DSL Apr 3, 2024
main.nf
Comment on lines 97 to 99
'fastq' {
    from 'fastq'
}
Member

ok, so the first fastq is the path, and the second is the topic.
Can it be a bit more explicit?

Thinking something like this at least:

Suggested change:

    -'fastq' {
    -    from 'fastq'
    -}
    +'fastq/' {
    +    from 'fastq'
    +}

Author

you can specify a trailing slash if you want

Member

I'd like that, it makes the path more explicit. But I'd rather have both a path and a topic specified somewhere, so as not to be too confused; I think it's better to be a bit more explicit.

bentsherman (Author) commented Apr 4, 2024

That's fair. This case is simple enough to be confusing, but if you used a regular emit instead of a topic, or you had a more complex directory structure like in rnaseq, it would be clearer.

I did propose calling it fromTopic to help denote it as a topic, but Paolo was in favor of not having too many different keywords. We could revisit this.

@adamrtalbot
Copy link
Contributor

adamrtalbot commented Apr 4, 2024

I'm not a massive fan of topics; I think they are a little bit hidden and it's hard to identify where they originated. However, I see that there is value in them, I just wouldn't use them much myself. But the general principle here looks great!

@bentsherman
Author

@adamrtalbot totally fair. fetchngs is simple enough that you could do it without topics, and you are certainly free to scrap the topics in the final implementation. I used them here as an exercise to show how to access those intermediate channels without adding them to the emits. In practice, users will be able to use any mixture of emits / topics based on their preference.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Comment on lines +195 to +203
publish:
ch_fastq >> 'fastq/'
ASPERA_CLI.out.md5 >> 'fastq/md5/'
SRA_FASTQ_FTP.out.md5 >> 'fastq/md5/'
SRA_RUNINFO_TO_FTP.out.tsv >> 'metadata/'
ch_versions_yml >> 'pipeline_info/'
ch_samplesheet >> 'samplesheet/'
ch_mappings >> 'samplesheet/'
ch_sample_mappings_yml >> 'samplesheet/'
Author

By default, the "topic" name will be used as the publish path, which makes fetchngs really simple. No need to define any rules in the output DSL: just set the base directory and publish mode, and all of these channels will be published exactly as shown.

These names don't have to be paths, they can also be arbitrary names which you would then use in the output DSL to customize publish options for that name. I'll demonstrate this with rnaseq. You can think of the names as "topics" if you want, but at this point I'm not even using topics under the hood because they aren't necessary.
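
For example, reusing the directory-block syntax from the review thread above, the samplesheet name could be remapped to a custom directory (a sketch, not tested here):

    'metadata/samplesheets' {
        from 'samplesheet'
    }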

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow output DSL Workflow publish definition Apr 24, 2024
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow publish definition Workflow output definition May 17, 2024
@bentsherman
Author

Folding into #312.

@bentsherman bentsherman deleted the workflow-inputs-outputs branch May 21, 2024 14:59