
Workflow output definition #275

Closed
bentsherman wants to merge 17 commits into dev from workflow-inputs-outputs

Conversation

@bentsherman

This PR is a prototype for defining an output schema for a Nextflow pipeline. See nextflow-io/nextflow#4669 and nextflow-io/nextflow#4670 for original discussions.

The meta workflow that we are targeting is:

fetchngs -> rnaseq -> differentialabundance

In other words, we want to eliminate the manual curation of samplesheets between each pipeline. To do this, the output schema should "mirror" the params schema: it should describe the outputs as a collection of samplesheets.

Here is the tree of pipeline outputs for the fetchngs test profile:

/custom/user-settings.mkfg
/fastq/DRX024467_DRR026872.fastq.gz
/fastq/DRX026011_DRR028935_1.fastq.gz
/fastq/DRX026011_DRR028935_2.fastq.gz
/fastq/ERX1234253_ERR1160846.fastq.gz
/fastq/SRX10940790_SRR14593545_1.fastq.gz
/fastq/SRX10940790_SRR14593545_2.fastq.gz
/fastq/SRX11047067_SRR14709033.fastq.gz
/fastq/SRX17709227_SRR21711856.fastq.gz
/fastq/SRX17709228_SRR21711855.fastq.gz
/fastq/SRX6725035_SRR9984183.fastq.gz
/fastq/SRX9315476_SRR12848126_1.fastq.gz
/fastq/SRX9315476_SRR12848126_2.fastq.gz
/fastq/SRX9504942_SRR13055517_1.fastq.gz
/fastq/SRX9504942_SRR13055517_2.fastq.gz
/fastq/SRX9504942_SRR13055518_1.fastq.gz
/fastq/SRX9504942_SRR13055518_2.fastq.gz
/fastq/SRX9504942_SRR13055519_1.fastq.gz
/fastq/SRX9504942_SRR13055519_2.fastq.gz
/fastq/SRX9504942_SRR13055520_1.fastq.gz
/fastq/SRX9504942_SRR13055520_2.fastq.gz
/fastq/SRX9626017_SRR13191702_1.fastq.gz
/fastq/SRX9626017_SRR13191702_2.fastq.gz
/fastq/md5/DRX024467_DRR026872.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_1.fastq.gz.md5
/fastq/md5/DRX026011_DRR028935_2.fastq.gz.md5
/fastq/md5/ERX1234253_ERR1160846.fastq.gz.md5
/fastq/md5/SRX17709227_SRR21711856.fastq.gz.md5
/fastq/md5/SRX17709228_SRR21711855.fastq.gz.md5
/fastq/md5/SRX6725035_SRR9984183.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055517_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055518_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055519_2.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_1.fastq.gz.md5
/fastq/md5/SRX9504942_SRR13055520_2.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_1.fastq.gz.md5
/fastq/md5/SRX9626017_SRR13191702_2.fastq.gz.md5
/metadata/DRR026872.runinfo_ftp.tsv
/metadata/DRR028935.runinfo_ftp.tsv
/metadata/ERR1160846.runinfo_ftp.tsv
/metadata/GSE214215.runinfo_ftp.tsv
/metadata/GSM4907283.runinfo_ftp.tsv
/metadata/SRR12848126.runinfo_ftp.tsv
/metadata/SRR13191702.runinfo_ftp.tsv
/metadata/SRR14593545.runinfo_ftp.tsv
/metadata/SRR14709033.runinfo_ftp.tsv
/metadata/SRR9984183.runinfo_ftp.tsv
/samplesheet/id_mappings.csv
/samplesheet/multiqc_config.yml
/samplesheet/samplesheet.csv

From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.

The initial output schema basically describes this samplesheet in a similar manner to the input_schema.json file. This particular output schema should closely resemble the input_schema.json for nf-core/rnaseq.
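
For a sense of the shape, here is an illustrative fragment (a hedged sketch only; the property names follow the rnaseq samplesheet columns and are not necessarily the literal contents of this PR's schema file):

    {
        "samplesheet": {
            "path": "samplesheet/samplesheet.csv",
            "items": {
                "type": "object",
                "properties": {
                    "sample":  { "type": "string" },
                    "fastq_1": { "type": "string", "format": "file-path" },
                    "fastq_2": { "type": "string", "format": "file-path" }
                },
                "required": ["sample", "fastq_1"]
            }
        }
    }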

What I'd like to do from here is collect feedback on this approach -- what else is needed to complete the output schema for this pipeline? Then we can think about how to operationalize it in Nextflow -- should Nextflow automatically generate the samplesheet from the schema? How does the schema interact with the publish mechanism? How do we collect metadata that normally can't be published directly, only through files?

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the base branch from master to dev February 13, 2024 21:08
@evanfloden

Clarifying: is the schema added in this PR the output equivalent of the samplesheet schema schema_input.json here? You have it as input_schema.json above.

I had considered this schema to define one of the inputs/outputs of a pipeline, whereas the nextflow_schema.json in the base directory of the repo defines all the possible inputs. Is that correct?

@adamrtalbot
Contributor

From what I can tell, the samplesheet.csv contains all of the metadata, including the file paths, MD5 checksums, and id mappings. So the samplesheet and the fastq files comprise the essential outputs and everything else is duplication.

This is unique to fetchngs - I would just ignore it and pretend it doesn't exist.

@adamrtalbot
Contributor

adamrtalbot commented Feb 14, 2024

It's not clear to me what this adds.

  • Pipeline developer writes code to generate samplesheet in pipeline 1
  • Pipeline developer writes code to read samplesheet in pipeline 2
    • Optional: use a samplesheet schema with nf-validation

Where does this file fit in?

@bentsherman
Author

@evanfloden this output schema is like the nextflow_schema.json with the schema_input.json embedded. So it lists all of the outputs, but each samplesheet output has its own schema embedded, instead of in a separate file, for simplicity.

@adamrtalbot At the very least, this output schema should be used to validate any samplesheets that are produced, and allow external tools like Seqera Platform to inspect a workflow's expected outputs e.g. for the purpose of chaining pipelines.

What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.

@adamrtalbot
Contributor

What isn't clear to me yet is whether the output schema can be used to automate the generation of the samplesheet.

Not clear to me either. fetchngs uses an exec process to do this; I think that's quite an overhead for every pipeline developer.

Perhaps something like this could work:

my_channel
    .toSamplesheet(schema: 'output_schema.json', format: 'csv')

Although it's not clear how you go from channel contents to file contents.

@bentsherman
Author

I think that could work. As long as the channel emits maps (or records once we support record types properly), generating the samplesheet is trivial.
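
For instance, a minimal sketch using the existing collectFile operator, assuming the channel emits maps that all carry the same keys (column names borrowed from the rnaseq samplesheet for illustration):

    ch_records
        .map { rec -> [rec.sample, rec.fastq_1, rec.fastq_2].join(',') + '\n' }
        .collectFile(name: 'samplesheet.csv', seed: 'sample,fastq_1,fastq_2\n', sort: true)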

@pditommaso

This looks like it's going in the right direction. One thing I find awful in the current schema is the JSON Schema itself, which is totally unreadable. I wonder whether we should look into a different, more human-friendly system.

@pditommaso

Possible alternatives

@drpatelh
Member

drpatelh commented Feb 21, 2024

My biggest concern with this is how unwieldy that file is going to get when we go from defining an output schema for a very simple pipeline like fetchngs to one for rnaseq. This is why I was suggesting we try to incorporate the publishing logic and output file definition at the module/subworkflow/workflow level and then combine them somehow, rather than having one single massive file.

I also suspect there will still need to be some sort of "conversion" layer or plugin that can take this output schema file to generate custom csvs/jsons etc which can be used as input downstream for other pipelines. Ideally, this plugin can be invoked outside of the pipeline context.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@ewels
Member

ewels commented Feb 28, 2024

but each samplesheet output has its own schema embedded instead of in a separate file for simplicity.

I don't think that we should do this; it breaks how JSON Schema validation works. The beauty of using the standard is that very many platforms and libraries use the syntax in the same way. You have a parsed object in memory (be it params or the contents of a samplesheet, doesn't really matter) and you validate it against a schema.

If we start merging samplesheet schemas inside the output schema, we can no longer use it for validation this way. We would have to validate the output files with subsets of the schema, and validate the list of output files with another subset. If you have to break the schema down to use it, it becomes custom and a lot less useful imho. Separate files are undoubtedly more verbose, but they're also much more portable.

This is why the nextflow_schema.json for params refers to the path to a separate schema file for any given files, rather than embedding that logic within.
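
For reference, a trimmed excerpt of that pattern as it appears in a typical nf-core params schema (exact description text varies by pipeline):

    "input": {
        "type": "string",
        "format": "file-path",
        "schema": "assets/schema_input.json",
        "description": "Path to comma-separated file containing information about the samples."
    }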

@ewels
Member

ewels commented Feb 28, 2024

@pditommaso YAML is fine (and Ben's YAML conversion here hopefully is a lot easier to read), but my strong preference is to stick with as-close-to-as-possible JSON Schema syntax.

To clarify, that JSON Schema can be written in YAML (or toml, or really any format), as long as it's laid out with the structure and keywords of JSON schema. The benefit of using it is that there are about a bazillion different implementations so it just works everywhere.
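
For instance, an ordinary JSON Schema fragment written as YAML (illustrative keys):

    type: object
    properties:
      sample:
        type: string
      fastq_1:
        type: string
        format: file-path
    required: [sample, fastq_1]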

In contrast, the Yamale syntax you linked to seems to be a Python tool with its own schema syntax, so every part of our toolchain would need to build its own parser and validation library for that syntax.

The YAML Schema you linked to seems to still be valid JSON Schema, just in YAML format and with a couple of extra keys. That would still work with any JSON Schema implementation, so that'd be fine. But I'm not sure that we're doing anything complex enough to need those extra keywords to be honest.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>

github-actions bot commented Feb 28, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 9beb5ea

  • ✅ 155 tests passed
  • ❔ 5 tests were ignored
  • ❗ 4 tests had warnings

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml

❔ Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • actions_ci - actions_ci
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-03-01 18:00:10

@bentsherman
Author

I converted the JSON schema to YAML just to see what it looks like, and it is indeed much simpler. If the JSON schema can be used with YAML "schemas" just the same, that seems like the best approach to me, even for the nextflow_schema.json.

@bentsherman
Author

I also added a prototype for the workflow output DSL (see nextflow-io/nextflow#4784). It allows you to define an arbitrarily nested directory structure with path(), then publish process outputs with select() using a process selector and the standard publish options. It is, in my opinion, stupid simple.
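
As a rough sketch of what that looks like (directory and process names taken from the fetchngs outputs above; the exact syntax was still evolving in this prototype):

    path('results') {
        path('fastq') {
            select 'SRA_FASTQ_FTP', pattern: '*.fastq.gz'
        }
        path('metadata') {
            select 'SRA_RUNINFO_TO_FTP', pattern: '*.tsv'
        }
    }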

Another idea I considered was being able to select channels from the top-level workflow emits, but that is slightly more complicated to implement (and adds some boilerplate to the pipeline code) whereas I found I could get the job done with just the process outputs.

I thought about having some DSL method like index <source-channel> <filename> which could collect metadata records from a channel and write them to a file. It's actually pretty trivial to do with Groovy; fetchngs was just doing it in a roundabout way, so I simplified some things in the pipeline code.

@bentsherman
Author

At this point, the output DSL is concerned only with mapping process outputs to a directory structure. Where output schemas could come in is as an optional refinement to describe the structure of specific files:

select 'SRA_TO_SAMPLESHEET', pattern: 'samplesheet.csv', schema: 'schema_samplesheet.json'
select 'SRA_TO_SAMPLESHEET', pattern: 'id_mapping.csv', schema: 'schema_mapping.json'

So it's still up to the user to generate the output file, and they might even be able to use the same output schema to do it (like Adam's toSamplesheet() example). But the above definition can be used by external tools and users to understand the structure of workflow outputs without running the pipeline.

Given this example, I agree with @ewels that it makes more sense to keep the schema for each file separate. I'm imagining a nextflow command to generate some kind of global schema from this output definition (i.e. by the pipeline developer before a version release) for use by external tools.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Author

See nf-core/rnaseq#1227 for a similar prototype with rnaseq. It is not for the faint of heart.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@mahesh-panchal
Member

What would publishing using task variables and other workflow variables look like?

The output DSL will be scoped to the script or workflow block (not sure which one yet), so it will be able to use any variables in that scope. Task variables aren't supported, since the publishing is decoupled from the individual tasks.

Doesn't this take away a major feature, like publishing files to a folder based on sample name? I think there are quite a few cases where some field of meta, for example, is used in the publish path.

@adamrtalbot
Contributor

Doesn't this take away a major feature like publishing files to a folder based on sample name? I think there are quite a few examples where some field of meta for example is used in the publish path.

Good point; being able to publish files as results/sample1/bam/sample1.bam is a requirement. Presumably this would work?

path( "results" ) {
    select 'SAMTOOLS_SORT', pattern: '*.bam', saveAs: { "${meta.id}/bam/${it}" }
}

For what it's worth, this is another good argument for driving publishing from channels rather than processes, because then the vals would be in scope. You can see that in action here: https://github.com/nf-core/fetchngs/pull/302/files

@bentsherman
Author

Doesn't this take away a major feature like publishing files to a folder based on sample name?

I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.

It is a consequence of decoupling the publishing from the task execution. We might be able to recover it in #302 by allowing the path to reference channel items, e.g. given a channel of files with metadata, publish each file to a path based on the meta id, but I'm not sure what that syntax would look like.

@bentsherman
Author

@adamrtalbot good point, with channel selectors we could do something like this:

path( "results" ) {
    select NFCORE_RNASEQ.out.bam, saveAs: { meta, bam -> "${meta.id}/bam/${bam.name}" }
}

The only thing is, I was imagining the selected channel would just provide paths, but if it provides tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.

@bentsherman bentsherman changed the title Workflow output DSL Workflow output DSL (process selectors) Mar 19, 2024
@mahesh-panchal
Member

it's not obvious how the file elements are being pulled out of the tuple.

Isn't this how it is now? Only path types are published; val, env, etc. are ignored.

@adamrtalbot
Contributor

adamrtalbot commented Mar 19, 2024

I think I have seen this pattern before, though I couldn't find an example of it in rnaseq.

It's unusual in nf-core, but quite common elsewhere.

The only thing is, I was imagining the selected channel would just provide paths, but if it provides tuples/records with files and metadata, it's not obvious how the file elements are being pulled out of the tuple.

My thought was, capture all the contents of the channel, publish only the file-like objects. Then we can dump all the contents to a log of some description, similar to how nf-test does it in snapshots (snippet below for anyone who hasn't seen one).

{
    "with_umi": {
        "content": [
            [
                [
                    {
                        "id": "test",
                        "single_end": true
                    },
                    "test.fastp.fastq.gz:md5,ba8c6c3a7ce718d9a2c5857e2edf53bc"
                ]
            ],
            [
                [
                    {
                        "id": "test",
                        "single_end": true
                    },
                    "test.fastp.json:md5,d39c5c6d9a2e35fb60d26ced46569af6"
                ]
            ],
            // etc.
        ],
        "meta": {
            "nf-test": "0.8.4",
            "nextflow": "23.10.1"
        },
        "timestamp": "2024-03-18T17:31:09.193212"
    }
}

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow output DSL (process selectors) Workflow output DSL (channel selectors) Mar 21, 2024
@bentsherman
Author

Since there is a lot of support for the channel selectors, but most of the discussion is here, I closed the other PR and migrated this one to the channel selectors.

I found that topics weren't really needed for fetchngs. We'll see what happens with rnaseq.

@bentsherman
Author

Isn't this how it is now? Only path types are published; val, env, etc. are ignored.

@mahesh-panchal yes, because of how process outputs are declared, Nextflow knows which tuple elements are paths and collects them accordingly.

Since a generic channel doesn't have such a declaration, for now I think we can follow @adamrtalbot's suggestion and traverse whatever data structure the channel spits out:

  • if it's a path or path list, publish it
  • if it's a tuple, traverse each element
    • if the element is a path or path list, publish it
  • otherwise raise an error
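
As a rough Groovy sketch of that traversal (a hypothetical helper for illustration, not the actual Nextflow implementation):

    import java.nio.file.Path

    def filesToPublish(value) {
        if (value instanceof Path)
            return [value]                            // bare path: publish it
        if (value instanceof Collection && value.every { it instanceof Path })
            return value as List                      // path list: publish all of them
        if (value instanceof Collection)              // tuple: traverse each element,
            return value.collectMany { el ->          // skipping metadata (maps, vals, ...)
                (el instanceof Path || el instanceof Collection) ? filesToPublish(el) : []
            }
        throw new IllegalArgumentException("cannot publish value of type ${value.getClass().name}")
    }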

Once we have support for record types and better type inference in the language, we'll be able to infer the incoming type and, if it's a record type, scan the type definition for path elements. Much more robust, but not a blocker for the quick-and-dirty solution.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow output DSL (channel selectors) Workflow output DSL Apr 3, 2024
main.nf
Comment on lines 97 to 99
'fastq' {
    from 'fastq'
}
Member

ok, so the first fastq is the path, and the second is the topic.
Can it be a bit more explicit?

Thinking something like this at least:

Suggested change:

    -'fastq' {
    -    from 'fastq'
    -}
    +'fastq/' {
    +    from 'fastq'
    +}

Author

you can specify a trailing slash if you want

Member

I'd like that, it makes the path more explicit. But I'd rather have both a path and a topic specified somewhere, so as not to be too confused; I think it's better to be a bit more explicit.

bentsherman (Author) commented Apr 4, 2024

That's fair. This case is simple enough to be confusing, but if you used a regular emit instead of a topic, or you had a more complex directory structure like in rnaseq, it would be clearer.

I did propose calling it fromTopic to help denote it as a topic, but Paolo was in favor of not having too many different keywords. We could revisit this.

@adamrtalbot
Copy link
Contributor

adamrtalbot commented Apr 4, 2024

I'm not a massive fan of topics; I think they are a little bit hidden and it's hard to identify where they originated. However, I see that there is value in them, I just wouldn't use them much myself. But the general principle here looks great!

@bentsherman
Author

@adamrtalbot totally fair. fetchngs is simple enough that you could do it without topics, and you are certainly free to scrap the topics in the final implementation. I used them here as an exercise to show how to access those intermediate channels without adding them to the emits. In practice, users will be able to use any mixture of emits / topics based on their preference.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Comment on lines +195 to +203
publish:
ch_fastq >> 'fastq/'
ASPERA_CLI.out.md5 >> 'fastq/md5/'
SRA_FASTQ_FTP.out.md5 >> 'fastq/md5/'
SRA_RUNINFO_TO_FTP.out.tsv >> 'metadata/'
ch_versions_yml >> 'pipeline_info/'
ch_samplesheet >> 'samplesheet/'
ch_mappings >> 'samplesheet/'
ch_sample_mappings_yml >> 'samplesheet/'
Author

By default, the "topic" name will be used as the publish path, which makes fetchngs really simple. No need to define any rules in the output DSL: just set the base directory and publish mode, and all of these channels will be published exactly as shown.

These names don't have to be paths, they can also be arbitrary names which you would then use in the output DSL to customize publish options for that name. I'll demonstrate this with rnaseq. You can think of the names as "topics" if you want, but at this point I'm not even using topics under the hood because they aren't necessary.
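
For example, reusing the directory-block syntax from the review thread above, the samplesheet name could be remapped to a custom directory (a sketch, not tested here):

    'metadata/samplesheets' {
        from 'samplesheet'
    }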

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow output DSL Workflow publish definition Apr 24, 2024
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Workflow publish definition Workflow output definition May 17, 2024
@bentsherman
Author

Folding into #312.

@bentsherman bentsherman deleted the workflow-inputs-outputs branch May 21, 2024 14:59