Proposal: Beyond DSL2 #309
base: dev
Conversation
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
workflows/sra/main.nf
Outdated
```groovy
|> map { meta ->
    def sample = new Sample( meta, meta.fastq_aspera.tokenize(';').take(2).collect( name -> file(name) ) )
    ASPERA_CLI ( sample, 'era-fasp', aspera_cli_args )
} // fastq: Channel<Sample>, md5: Channel<Sample>
```
@mahesh-panchal to your point about dynamic args, I think we can do even better in DSL3:
```groovy
|> map { meta ->
    def sample = new Sample( /* ... */ )
    ASPERA_CLI ( sample, 'era-fasp', "${meta.key}" )
}
```
Because we call the process explicitly in a `map` operator (currently the call is implied), we can control how the process is invoked for each task within the operator closure, instead of passing multiple queue channels.
Yes, this specifically is much better. Treating processes like functions is soooo much better, since there's no implicit transformation going on with all the singleton/queue channel stuff. A value has to be formed and then `map`ped. And `tuple`s disappear too, except in channels (?).
Actually, I think what's worrying me about this syntax is the mixing of input types. An input could be a channel (e.g. `MULTIQC_MAPPINGS_CONFIG ( mappings )` lower down) or it could be an input set (e.g. this dynamically defined `Sample`). This is already confusing to newcomers, where we commonly see people trying to use channels inside `map`, `branch`, etc.
I guess one could explain the second option as passing dynamically defined singleton channels.
In this proposal, there are no "value" channels, only queue channels and regular values. So `MULTIQC_MAPPINGS_CONFIG ( mappings )` is no different, because `mappings` is just a value. It may be an async value, and Nextflow might represent it as a value channel under the hood, but to the user it should be indistinguishable from a regular value.
In other words, you cannot call a process with a channel, only with values.
```groovy
SRA (
    ids,
    params.ena_metadata_fields ?: '',
    params.sample_mapping_fields,
    params.nf_core_pipeline ?: '',
    params.nf_core_rnaseq_strandedness ?: 'auto',
    params.download_method,
    params.skip_fastq_download,
    params.dbgap_key,
    params.aspera_cli_args,
    params.sra_fastq_ftp_args,
    params.sratools_fasterqdump_args,
    params.sratools_pigz_args,
    params.outdir
)
```
My immediate thought is to make a map or `SraParams` object to handle all these params as a single object. A map is simpler, but an object would give you the typing I'd like.
Suggested change:
```groovy
mySraParams = SraParams(
    params.ena_metadata_fields ?: '',
    params.sample_mapping_fields,
    params.nf_core_pipeline ?: '',
    params.nf_core_rnaseq_strandedness ?: 'auto',
    params.download_method,
    params.skip_fastq_download,
    params.dbgap_key,
    params.aspera_cli_args,
    params.sra_fastq_ftp_args,
    params.sratools_fasterqdump_args,
    params.sratools_pigz_args,
    params.outdir
)
SRA (
    ids,
    mySraParams
)
```
Although I imagine someone will pass the whole dang params object in 🤔 .
You could make a record type 😄
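For illustration, using the `record` syntax shown elsewhere in this PR, such a type might look like the sketch below. The `SraParams` name and field types are hypothetical; the fields simply mirror the params passed in the call above.

```groovy
// Hypothetical record bundling the SRA params into one typed object;
// field names mirror the params in the SRA(...) call, types are guesses
record SraParams {
    String ena_metadata_fields
    String sample_mapping_fields
    String nf_core_pipeline
    String nf_core_rnaseq_strandedness
    String download_method
    boolean skip_fastq_download
    String dbgap_key
    String aspera_cli_args
    String sra_fastq_ftp_args
    String sratools_fasterqdump_args
    String sratools_pigz_args
    String outdir
}
```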
An option I'd considered was to have those params as part of the record that supplied meta and files to stage.
modules/local/aspera_cli/main.nf
Outdated
```groovy
Sample fastq = new Sample(meta, path("*fastq.gz"))
Sample md5 = new Sample(meta, path("*md5"))
```
Can we simplify this or is it important to be explicit? Feels like a lot of boilerplate to set some outputs? I think implicit looks a bit nicer and I can't really see the downside?
Suggested change:
```groovy
fastq = Sample(meta, path("*fastq.gz"))
md5 = Sample(meta, path("*md5"))
```
I've considered it. The proposed syntax is the most basic form that matches how variables are declared in general.
I guess you don't really need the output type on the left if it can always be inferred from the right-hand side.
Alternatively, since we always call a process with a single input -> single output with this proposed syntax, instead of channels, we could also just specify each record element as a separate input/output and bundle them into records in the workflow as needed. But that might be unwieldy for records with many elements.
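As a sketch of that alternative (the `FOO` process name is hypothetical): the process would take and return plain elements, and the record would only be assembled at the workflow level where it's useful.

```groovy
// Sketch: process works with plain elements; bundle into a record
// at the workflow level only where it makes sense
|> map { meta ->
    def files = FOO ( meta )     // hypothetical process returning List<Path>
    new Sample( meta, files )    // assemble the record in the workflow
}
```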
Did some refactoring today. Since process calls are so much more flexible now, I think we can simplify a few things here:
- the output type can be omitted, inferred from the right-hand side
- if there is only one output, the output name can be omitted because the process will just return that value directly
- if there are multiple outputs, the process will return an "implicit" record type, similar to the process `.out` but with single values instead of channels
- I removed the use of `Sample` from all processes since it doesn't add much value. Instead I only bundle some things into `Sample`s at the workflow level where it makes sense. Of course for larger record types it might be different.
Hopefully that makes the types less daunting as well. Basically the only place where they need to be explicitly declared is for function/process/workflow inputs.
```groovy
val pipeline
val strandedness
val mapping_fields
List<Map> sra_metadata
```
Does this mean a sample of 1 no longer has to be explicitly tested if you want to use it? https://github.com/nf-core/modules/blob/5f12fc2128f419a8750c5b0620e4b54d7aa33fec/modules/nf-core/ashlar/main.nf#L27-L29
yes exactly
The `List<Map>` syntax is a bit hard/complex
I agree, it's a little clunky, but it's inherited from Groovy directly. Would `[Map]` or `List(Map)` be nicer looking?
It's groovy syntax for generics: https://groovy-lang.org/objectorientation.html#generics
also it's a list of maps, so you can't get much simpler than `List<Map>`
types/types.nf
Outdated
```groovy
record Sample {
    Map<String,?> meta
    List<Path> files
}
```
```groovy
//
// MODULE: Get SRA run information for public database ids
//
|> map { id ->
    SRA_IDS_TO_RUNINFO ( id, ena_metadata_fields )
}                           // Channel<Path>
//
// MODULE: Parse SRA run information, create file containing FTP links and read into workflow as [ meta, [reads] ]
//
|> map(SRA_RUNINFO_TO_FTP)  // Channel<Path>
|> set { runinfo_ftp }      // Channel<Path>
|> flatMap { tsv ->
    splitCsv(tsv, header:true, sep:'\t')
}                           // Channel<Map>
|> map { meta ->
    meta + [single_end: meta.single_end.toBoolean()]
}                           // Channel<Map>
|> unique                   // Channel<Map>
|> set { sra_metadata }     // Channel<Map>
```
This actually might have the biggest impact. Piping has never been more popular, especially with biologists because of R. Being able to write the pipeline in this functional way might help express the mental model of channels. Sorry, too much language there, but I like this a lot.
Indeed. A lot of moving pieces have to come together to make this work. Does it make sense to you how I am calling the process like a function in the map operator?
It took a second to process, but this is soooo much better. This deals with what I wanted much better than how I initially thought about from the current syntax.
what is `|>` used for?
ooh, I see, sorry
I really like the pipe replacement. `|>` is easier to see and provides some visual directionality, helping readability.
Something extra: how about also shifting:
```groovy
workflow {
    workflow.onComplete {
    }
}
```
to
```groovy
workflow {
    onStart:
    ...
    take:
    ...
    main:
    ...
    onComplete:
    ...
}
```
There are some things I really like here, but I have reservations about other parts, like how channels are obscured behind their values, and process outputs.
```diff
@@ -86,6 +102,11 @@ workflow {
     )
 }
+
+publish {
+    directory params.outdir
```
Can we have the syntax use an `=` or `:` here for readability? Or is this a function, i.e. `directory(params.outdir)`?
it is a function call under the hood, so you could use parentheses here. it is the same syntax as process directives.
personally I would rather put these settings in the config file because they seem more like config, but Paolo prefers this form for now
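So, if I follow that, the parenthesized form would presumably also be accepted, the same way process directives accept either style (a sketch of the two equivalent spellings):

```groovy
// Both forms would be the same function call under the hood
publish {
    directory params.outdir     // directive-style
    directory(params.outdir)    // explicit call syntax
}
```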
```groovy
topic:
[ task.process, 'sratools', eval("fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+'") ] >> 'versions'
[ task.process, 'pigz', eval("pigz --version 2>&1 | sed 's/pigz //g'") ] >> 'versions'
```
I kind of like this, but I dislike the name `topic`. I don't feel like the word communicates what its function is.
It would also be nice if we could supply a regex to validate what should be returned by the `eval`, for some fast-fail behavior when there's extra stuff being emitted. Where would one define a global variable pattern? E.g.
```groovy
def SOFTWARE_VERSION = /\d+.../
def SHASUM = /\w{16}/
```
Or maybe this should be a class? Like how you can `filter { Number }`.
`topic` is a term from stream processing, used to collect related events from many different sources. In this case we are sending the tool version info to a custom "versions" topic, then the workflow reads from that topic to build the versions YAML file.
`eval` is just a function defined in the output / topic scope, so you could wrap it in a custom validation function:
```groovy
def validate( pattern, text ) {
    // ...
}
// ...
topic:
validate( /foo/, eval('...') ) >> 'versions'
```
```groovy
    sraCheckENAMetadataFields(ena_metadata_fields)
} else {
    input = file(input)
    if (!isSraId(input))
        error('Ids provided via --input not recognised please make sure they are either SRA / ENA / GEO / DDBJ ids!')
```
Can we have a set of functions for reporting errors, tips, warnings, etc to the user without reporting script line number? As in, there should be a distinction between error messages generated for the user, and error messages generated for the developer.
And ideally something that doesn't put something into a channel if the channel is empty.
I think there is an ongoing discussion for that here: nextflow-io/nextflow#4937
In any case, it can be done independently of these language improvements.
```groovy
//
// Prefetch sequencing reads in SRA format.
//
input = SRATOOLS_PREFETCH ( input, ncbi_settings, dbgap_key )
```
I don't like this:
```groovy
|> operator { var ->
    var = process1( var, ... )
    process2( var, ... )
}
```
This mixing is confusing. It should be:
```groovy
|> operator { var -> process1(var, ...) }
|> operator { var -> process2(var, ...) }
```
you can keep them separate if you want. I combined them here mainly to show that it's possible. they are just functions after all, so why not be able to compose them?
```groovy
.set { ch_mappings }
sra_metadata    // Channel<Map>
|> collect      // List<Map>
|> { sra_metadata ->
```
So we don't need `map` if we can just supply closures? Does `map` have a purpose then?
the pipe into a closure is a shorthand for this:
```groovy
index_files = SRA_TO_SAMPLESHEET (
    sra_metadata |> collect, // a.k.a. collect(sra_metadata)
    nf_core_pipeline,
    nf_core_rnaseq_strandedness,
    sample_mapping_fields
)
```
it's a convenient way to keep the pipeline going when you can't express the step as a curried function call. in this case, I want to supply some extra arguments to `SRA_TO_SAMPLESHEET`, so I can use the closure to customize that function call instead of breaking up the pipeline.
actually I'd like to be able to curry the process call just like an operator:
```groovy
sra_metadata
|> collect
|> SRA_TO_SAMPLESHEET (
    nf_core_pipeline,
    nf_core_rnaseq_strandedness,
    sample_mapping_fields
)
|> set { index_files }
```
in any case, it's important to understand that the result of `sra_metadata |> collect` is a value (a list of meta maps), not a value channel. you can't use operators like `map` on a value here, only on channels. there are no more value channels, only queue channels
OK. So my general confusion is around the point that it's either a stream or a value, vs it always being a stream.
Does this mean operators like `transpose` will be redefined? It'll be problematic to distinguish between a stream and a value, since what follows `|>` could be either a channel operator or a Collection function.
in order for `|>` to work with anything without causing ambiguities, everything needs to be typed. I've gone back and forth on whether to allow operators to accept list inputs and "cast" them to channels, but ultimately I think I would prefer to force the user to be explicit. it's also not that hard:
```groovy
1..10
|> Channel.of // it's just an extra line
|> map { /* ... */ }
```
I think this makes it perfectly clear what can go into an operator: only channels (i.e. queue channels). Anything else can be converted into a channel beforehand using `Channel.of()` or `Channel.fromList()`. So no, we won't need to change operators like `transpose`, and any operator that currently returns a value channel, like `collect`, will just return a regular value.
You can also clearly distinguish between the `List::transpose()` method and the `transpose` operator:
```groovy
[ 1, 2, 3 ].transpose()
[ 1, 2, 3 ] |> Channel.of |> transpose
```
Note that operators can no longer be called using the dot syntax, and you can't use `|>` to call an object method, only standalone functions.
> Note that operators can no longer be called using the dot syntax, and you can't use |> to call an object method, only standalone functions.

This helps with transparency and readability a lot.
Also, I commented the expected type of each line off to the right so that you can tell whether something is a channel or value. Ideally the IDE tooling will be able to show these type hints in the editor
My issue with understanding the comments was that my mental model was still incorrect at the time of reading, so they just caused confusion. They make much more sense now that my mental model is corrected.
workflows/sra/main.nf
Outdated
```groovy
|> map { meta ->
    new Tuple2<Map,String>( meta, meta.run_accession )
} // Channel<Tuple2<Map,String>>
|> FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
    dbgap_key ? file(dbgap_key, checkIfExists: true) : [],
    sratools_fasterqdump_args,
    sratools_pigz_args ) // Channel<Sample>
```
So how does this compose inputs? Anything that's piped in is taken as the first channel, otherwise we need to use `map`?
This would otherwise be:
```groovy
|> map { meta ->
    def tuple2 = new Tuple2<Map,String>( meta, meta.run_accession )
    FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
        tuple2,
        dbgap_key ? file(dbgap_key, checkIfExists: true) : [],
        sratools_fasterqdump_args,
        sratools_pigz_args )
}
```
I'm not sure I like the flexibility in the new syntax. It makes readability harder, in my opinion.
yes, the pipe source becomes the first argument in the call. this is already how it works for operators, so we're just extending it to functions / processes / workflows in general.
but you can't call a workflow in an operator, because a workflow itself contains dataflow logic. a process is more like a regular function, which is why it can be called anywhere within a workflow.
I suspect that this syntax is easier to understand for someone new to Nextflow, but possibly harder for someone used to DSL2. People have learned a lot of things in order to cope with the complexity of dataflow logic, which will need to be unlearned.
this example is actually the inverse of the other one, so it can also be written as:
```groovy
sra_metadata
// ...
|> { sra_metadata ->
    FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
        sra_metadata,
        dbgap_key ? file(dbgap_key, checkIfExists: true) : [],
        sratools_fasterqdump_args,
        sratools_pigz_args )
}
```
> but you can't call a workflow in an operator because a workflow itself contains dataflow logic. a process is more like a regular function, which is why it can be called anywhere within a workflow.

Is there a reason this has to be the case? At the moment workflows seem just to be channel zip-ties, and I think many would really like them to be more like functions
That's how I used to feel as well, but it doesn't really make sense in general. Calling a workflow in a `map` operator implies that you want the workflow to independently process each value from the input channel, like a process. But what if the workflow has dataflow logic like `groupTuple` and `reduce`? Then it needs to operate on the channel as a whole, not just the individual values.
Now there is a special case, which is a workflow that only calls processes and certain operators like `map` and `filter`, for example:
```groovy
workflow FOO {
    take: input
    main: input |> map(PROC1) |> map(PROC2) |> set { out }
    emit: out
}
```
This workflow could in theory be called within an operator, because it never needs the entire channel; each value is processed independently. But in that case, it could just be an operator closure!
```groovy
workflow {
    // closure equivalent to workflow FOO
    input |> map { val ->
        PROC2(PROC1(val))
        // can use pipes here btw
        // val |> PROC1 |> PROC2
    }
}
```
> But what if the workflow has dataflow logic like `groupTuple` and `reduce`? Then it needs to operate on the channel as a whole, not just the individual values.

I'm not convinced of this. If I knew that the subworkflow acted more like a function, then I would expect these operators to work only on the subset I passed as input, not on everything. I have to admit, though, that I don't have much experience with scatter/gather implementations, so beyond the naive implementation I haven't thought about it a lot.
Thinking about this more, I guess you could treat workflows like functions and execute each workflow invocation on an independent set of inputs. For example, you could have a channel of channels and map it with a workflow, so that each workflow invocation operates on one inner channel. This is actually something that has been requested before.
But I suspect it would add a lot of complexity without much benefit over what is already possible. I'll have to think on it, though; maybe it'll become clearer after the first round of development
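A sketch of what that channel-of-channels idea might look like, purely hypothetically (`FOO` is a workflow, and the inner channels are created explicitly, following the explicit-conversion rule discussed above):

```groovy
// Hypothetical: map a channel of channels with a workflow, so each
// invocation of FOO operates on one inner channel independently
grouped_samples                                // Channel<List<Sample>>
|> map { group -> Channel.fromList(group) }    // Channel<Channel<Sample>>
|> map(FOO)                                    // run workflow FOO per inner channel
```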
FWIW, I have a tiny piece of feedback that came up after the previous discussion (which I wasn't aware of): the `fair` keyword describes what is, to my knowledge, very often called "FIFO" (first-in, first-out) in other contexts, and might have been a clearer name? (That said, perhaps not worth the change...)
@samuell I would say just submit an issue for that; it is more of an API change than a syntax change.
Reading through the suggestion in more detail now, I'm a little concerned about this one:
In my experience, there are some use cases that require run-time generated DAGs, for example when initiating pipeline structure based on values extracted as part of the workflow. This is common e.g. in machine learning, where you might run hyper-parameter tuning, which generates values that are sent to initialize downstream processes, but that might potentially also influence how the DAG is generated downstream. I've been writing about it before: https://bionics.it/posts/dynamic-workflow-scheduling Not sure how well this applies here, but I want to raise the flag about it, since it is a real limitation we have been running into with other pipeline systems (Luigi).
EDIT: Actually, I guess since we are almost definitely talking about the DAG of processes and not the DAG of tasks, a compile-time DAG would still not rule out all of dynamic scheduling (since the dataflow paradigm of Nextflow does dynamic task scheduling inherently). Still, it seems some cases of dynamic scheduling might be affected; those that require the process DAG structure to be defined based on outcomes of previous computations.
modules/local/aspera_cli/main.nf
Outdated
```groovy
Sample md5 = new Sample(meta, path("*md5"))

topic:
[ task.process, 'aspera_cli', eval('ascli --version') ] >> 'versions'
```
why brackets?
My guess is that this could be any object, but a `List` is easy to process.
yes it's just a list literal, could be any expression
Are there reasons for the above over:
```groovy
'versions' << [ task.process, 'aspera_cli', eval('ascli --version') ]
```
This is more consistent with add/append, isn't it?
that could also work
@samuell yes, I'm talking about the process (i.e. "abstract") DAG, which Nextflow already constructs before executing the pipeline. But it has to execute the script in order to do this, which limits its usefulness.
How come there are new keywords (…)?

Just another idea to consider. With a formal grammar, we don't have to adhere so closely to Groovy; we can make whatever syntax we want, as long as it can be translated to Groovy AST. So as a demonstration I have replaced (…) Notice I also changed how types are specified (…)

Is it going to be problematic if people combine Groovy and this grammar? For example in (…)

It would apply to all Nextflow code, including (…)

I understood it would apply to all, but my question was really whether it is possible for people to mix grammars, and if so what would happen, e.g. (…)

We would either drop (…)
This PR is a showcase of many language improvements we are working on. The changes vary widely from things that can be done today, to things that will be possible in upcoming releases, to things that are still being designed. I wanted to lay out a comprehensive vision for where we're going, even for things potentially far in the future, to help explain how we are thinking about new features right now.
View only the changes proposed for DSL2+: #312
New features / changes:
Use static types, record types: under development (Static types for process inputs/outputs nextflow-io/nextflow#4553). Specify process inputs and outputs as regular variable declarations with any type, including user-defined record types.
- Use `Optional<T>` (or possibly `T?`) to denote optional output (not shown in this PR).
- Use `Path` and `List<Path>` to distinguish between a single file or a list of files.
- Use the `topic:` section to send values to topics (e.g. tool versions).

Replace the pipe operator `|` with `|>`, which works not only with channels / processes / workflows but with any function call.

Formalize the `Channel` type, which is a queue channel. Treat value channels as regular values that can be used without dataflow logic. Any operator that currently returns a value channel will just return a regular value, which can be used without e.g. a `map` operator.

Treat a process as a regular function. You can call the process in the workflow body with regular values, which is like calling it with all value channels (i.e. it will execute once). Or you can call it in an operator closure, also with regular values. Calling a process in a `map` operator is like calling it with a queue channel, and you can call a process in a `reduce` operator to do process iteration. This way, you never call a process directly with channels, only with regular values. The way you call a process is exactly the way it looks in the definition (now with static types). Like before, a process can only be called once in a workflow, unless you use import aliases.

Deprecations:
- Use of `params` outside the top-level workflow -- subworkflows should receive params as explicit inputs
- `params()` and `addParams()` methods with the `include` statement -- pass params as process / workflow inputs
- `-entry` command-line option -- use params to select different subworkflows from the top-level workflow instead
- Object-method syntax for operators, e.g. `foo.collect()` -- operators are just standalone functions, you can do either `collect(foo)` or `foo |> collect`
- Value channels -- everything that was a value channel in DSL2 will appear to the user as a regular value, even though Nextflow might represent them as value channels "under the hood"
- Process `when:` section -- use conditional logic in the workflow instead
- Accessing process outputs via `PROCESS_NAME.out` -- just assign the return value of the process to a variable
- Experimental process recursion -- invoke the process in a `reduce` or `scan` operator instead
- Many operators and some channel factories can be removed, simplified, or replaced with regular functions, e.g. the `splitCsv` operator is equivalent to the `splitCsv` function with `flatMap`, and `collectFile` can be replaced by a `mergeText` function which can be used with `groupTuple` and `sort` to group and sort entries as before. The operator library can be much smaller and simpler, but it also won't be needed as much because of the other improvements around value channels and processes.

Extra

Some other improvements which are needed "behind the scenes" to make everything work:
- A new script parser (Formal grammar and parser nextflow-io/nextflow#4613) will simplify the Nextflow syntax, improve error reporting, form the basis of a language server, and enable custom syntax like `|>` and `record`.
- With static types and the params schema, Nextflow can infer the type of every variable at compile time instead of run time, and the language server can use this to display type hints in the IDE. The type of each line is commented in this PR to demonstrate what the hint-on-hover would show.
- Similarly, the config parser and Config schema nextflow-io/nextflow#4201 will make the config syntax strict and type-checked, for better error reporting and IDE tooling (i.e. code completion).
- The DAG will be constructed at compile time instead of run time, which will allow the DAG to be more comprehensive -- include params and how they connect to processes, include conditional pipeline code (e.g. if-else statements), allow `nextflow inspect` to list every container that might possibly be used, etc.