Proposal: Beyond DSL2 #309

Draft · wants to merge 10 commits into dev
Conversation

@bentsherman commented May 1, 2024

This PR is a showcase of many language improvements we are working on. The changes vary widely from things that can be done today, to things that will be possible in upcoming releases, to things that are still being designed. I wanted to lay out a comprehensive vision for where we're going, even for things potentially far in the future, to help explain how we are thinking about new features right now.

View only the changes proposed for DSL2+: #312

New features / changes:

  • Use static types and record types: under development (Static types for process inputs/outputs, nextflow-io/nextflow#4553). Specify process inputs and outputs as regular variable declarations with any type, including user-defined record types. See the sketch after this list item.

    • Paths are automatically detected and staged.
    • Inputs can have default values and can be passed by name (not shown in this PR).
    • Use Optional<T> (or possibly T?) to denote an optional output (not shown in this PR).
    • Use Path and List<Path> to distinguish between a single file and a list of files.
    • Use the topic: section to send values to topics (e.g. tool versions).
    • Record types can be imported from modules, like processes.
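
    To make this concrete, here is a rough sketch of a typed process combining the points above. It is illustrative only: the record syntax and the path() helper follow the examples elsewhere in this PR, and details may change as nextflow-io/nextflow#4553 evolves.

    record Sample {
        Map<String,?> meta
        List<Path> files
    }

    process FASTQC {
        input:
        Sample sample    // paths inside the record are detected and staged

        output:
        Path html = path("*.html")    // single-file output

        topic:
        [ task.process, 'fastqc', eval('fastqc --version') ] >> 'versions'

        script:
        """
        fastqc ${sample.files.join(' ')}
        """
    }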
  • Replace the pipe operator | with |>, which works not only with channels / processes / workflows but with any function call:

    // x, y can be any value
    // f can be function, process, operator, workflow
    x |> f == f(x)
    
    // even a closure!
    x |> { x -> f(x, y) } == f(x, y)
  • Formalize the Channel type, which is a queue channel. Treat value channels as regular values that can be used without dataflow logic, for example:

    // convert a list into a queue channel and back into a list
    vals = 1..10 |> Channel.fromList |> collect
    // `vals` is just a value, so just use it!
    println "${vals.size()}"

    Any operator that currently returns a value channel will just return a regular value, which can be used directly, without e.g. a map operator.

  • Treat process as a regular function. You can call the process in the workflow body with regular values, which is like calling it with all value channels (i.e. it will execute once). Or you can call it in an operator closure, also with regular values.

    Calling a process in a map operator is like calling it with a queue channel:

    Channel.fromPath( "inputs/*.fastq" )
      |> map { fastq -> FASTQC( fastq ) }

    You can call a process in a reduce operator to do process iteration:

    Channel.fromPath( "inputs/*.txt" )
      |> reduce { result, file -> ACCUMULATE( result, file ) }

    This way, you never call a process directly with channels, only with regular values. The way you call a process is exactly the way it looks in the definition (now with static types).

    Like before, a process can only be called once in a workflow, unless you use import aliases (see the example below).
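
    For reference, import aliases work as they already do in DSL2 (the module path and channel names below are hypothetical):

    include { FASTQC as FASTQC_RAW     } from './modules/fastqc'
    include { FASTQC as FASTQC_TRIMMED } from './modules/fastqc'

    workflow {
        raw_reads     |> map { fastq -> FASTQC_RAW( fastq ) }
        trimmed_reads |> map { fastq -> FASTQC_TRIMMED( fastq ) }
    }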

Deprecations:

  • Use of params outside the top-level workflow -- subworkflows should receive params as explicit inputs

  • params() and addParams() methods with include statement -- pass params as process / workflow inputs

  • -entry command-line option -- use params to select different subworkflows from the top-level workflow instead

  • Object-method syntax for operators e.g. foo.collect() -- operators are just standalone functions, you can do either collect(foo) or foo |> collect

  • Value channels -- everything that was a value channel in DSL2 will appear to the user as a regular value, even though Nextflow might represent them as value channels "under the hood"

  • Process when: section -- use conditional logic in the workflow instead

  • Accessing process outputs via PROCESS_NAME.out -- just assign the return value of the process to a variable

  • Experimental process recursion -- invoke process in a reduce or scan operator instead

  • Many operators and some channel factories can be removed, simplified, or replaced with regular functions. For example, the splitCsv operator is equivalent to a splitCsv function used with flatMap, and collectFile can be replaced by a mergeText function, which can be combined with groupTuple and sort to group and sort entries as before (see the sketch below). The operator library can be much smaller and simpler, and it also won't be needed as much because of the other improvements around value channels and processes.
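
    A minimal sketch of the splitCsv case, assuming the function form proposed above (mergeText is only a proposed name, so it is shown as a comment):

    Channel.fromPath( 'samplesheets/*.csv' )
      |> flatMap { csv -> splitCsv(csv, header: true) }    // Channel<Map>

    // proposed collectFile replacement, per the text above:
    // entries |> groupTuple |> map { group -> mergeText(group) }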

Extra

Some other improvements which are needed "behind the scenes" to make everything work:

  • New script parser (Formal grammar and parser nextflow-io/nextflow#4613) will simplify the Nextflow syntax, improve error reporting, form the basis of a language server, and enable custom syntax like |> and record

  • With static types and the params schema, Nextflow can infer the type of every variable at compile-time instead of run-time, and the language server can use this to display type hints in the IDE. The type of each line is commented in this PR to demonstrate what the hint-on-hover would show.

  • Similarly, the config parser and Config schema nextflow-io/nextflow#4201 will make the config syntax strict and type-checked, for better error-reporting and IDE tooling (i.e. code completion)

  • The DAG will be constructed at compile-time instead of run-time, which will allow the DAG to be more comprehensive -- include params and how they connect to processes, include conditional pipeline code (e.g. if-else statements), allow nextflow inspect to list every container that might possibly be used, etc

Comment on lines 114 to 117
|> map { meta ->
def sample = new Sample( meta, meta.fastq_aspera.tokenize(';').take(2).collect( name -> file(name) ) )
ASPERA_CLI ( sample, 'era-fasp', aspera_cli_args )
} // fastq: Channel<Sample>, md5: Channel<Sample>
@bentsherman (Author):
@mahesh-panchal to your point about dynamic args, I think we can do even better in DSL3:

|> map { meta ->
  def sample = new Sample( /* ... */ )
  ASPERA_CLI ( sample, 'era-fasp', "${meta.key}" )
}

Because we call the process in a map operator explicitly (currently it is implied), we can control how the process is invoked for each task within the operator closure, instead of passing multiple queue channels.

Member:
Yes, this specifically is much better. Treating processes like functions is soooo much better, since there's no implicit transformation going on with all the singleton/queue channel stuff. It has to be formed and then mapped. And tuples disappear too, except in channels (?).

Actually, I think what's worrying me about this syntax is the mixing of input types. An input could be a channel (e.g. MULTIQC_MAPPINGS_CONFIG ( mappings ) lower down) or it could be an input set (e.g. this dynamically defined Sample). This is already confusing to newcomers, where we commonly see people trying to use channels inside map, branch, etc.
I guess one could explain the second option as passing dynamically defined singleton channels.

@bentsherman (Author):
In this proposal, there are no "value" channels, only queue channels and regular values. So the MULTIQC_MAPPINGS_CONFIG ( mappings ) is no different because mappings is just a value. It may be an async value, and Nextflow might represent it as a value channel under the hood, but to the user it should be indistinguishable from a regular value

In other words, you cannot call a process with a channel, only with values.
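
A small sketch of that distinction, using hypothetical names (and remembering that a given process can still only be called once per workflow):

    // `mappings` is a regular value (possibly asynchronous under the hood):
    config = MULTIQC_MAPPINGS_CONFIG( mappings )         // direct call, executes once

    // per-item execution over a queue channel goes through an operator closure:
    samples |> map { sample -> SOME_PROCESS( sample ) }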

@bentsherman changed the title from "DSL2+ / DSL3 preview" to "DSL2+ / DSL3 proof-of-concept" on May 1, 2024
@bentsherman changed the title from "DSL2+ / DSL3 proof-of-concept" to "Preview: DSL2+ (and beyond)" on May 1, 2024
Comment on lines +44 to +58
SRA (
ids,
params.ena_metadata_fields ?: '',
params.sample_mapping_fields,
params.nf_core_pipeline ?: '',
params.nf_core_rnaseq_strandedness ?: 'auto',
params.download_method,
params.skip_fastq_download,
params.dbgap_key,
params.aspera_cli_args,
params.sra_fastq_ftp_args,
params.sratools_fasterqdump_args,
params.sratools_pigz_args,
params.outdir
)
Contributor:
My immediate thought is to make a map or an SraParams object to handle all these params as a single object. A map is simpler, but an object would give you the typing I'd like.

Suggested change
SRA (
ids,
params.ena_metadata_fields ?: '',
params.sample_mapping_fields,
params.nf_core_pipeline ?: '',
params.nf_core_rnaseq_strandedness ?: 'auto',
params.download_method,
params.skip_fastq_download,
params.dbgap_key,
params.aspera_cli_args,
params.sra_fastq_ftp_args,
params.sratools_fasterqdump_args,
params.sratools_pigz_args,
params.outdir
)
mySraParams = SraParams(
params.ena_metadata_fields ?: '',
params.sample_mapping_fields,
params.nf_core_pipeline ?: '',
params.nf_core_rnaseq_strandedness ?: 'auto',
params.download_method,
params.skip_fastq_download,
params.dbgap_key,
params.aspera_cli_args,
params.sra_fastq_ftp_args,
params.sratools_fasterqdump_args,
params.sratools_pigz_args,
params.outdir
)
SRA (
ids,
mySraParams
)

Contributor:
Although I imagine someone will pass the whole dang params object in 🤔 .

@bentsherman (Author):
You could make a record type 😄

Member:
An option I'd considered was to have those params as part of the record that supplied meta and files to stage.

Comment on lines 18 to 19
Sample fastq = new Sample(meta, path("*fastq.gz"))
Sample md5 = new Sample(meta, path("*md5"))
Contributor:
Can we simplify this or is it important to be explicit? Feels like a lot of boilerplate to set some outputs? I think implicit looks a bit nicer and I can't really see the downside?

Suggested change
Sample fastq = new Sample(meta, path("*fastq.gz"))
Sample md5 = new Sample(meta, path("*md5"))
fastq = Sample(meta, path("*fastq.gz"))
md5 = Sample(meta, path("*md5"))

@bentsherman (Author):
I've considered it. The proposed syntax is the most basic form that matches how variables are declared in general.

I guess you don't really need the output type on the left if it can always be inferred from the right-hand side.

Alternatively, since with this proposed syntax we always call a process with a single input and single output, instead of channels, we could also specify each record element as a separate input/output and bundle them into records in the workflow as needed. But that might be unwieldy for records with many elements.

@bentsherman (Author):
Did some refactoring today. Since process calls are so much more flexible now, I think we can simplify a few things here:

  • the output type can be omitted, inferred from the right-hand side
  • if there is only one output, the output name can be omitted because the process will just return that value directly
  • if there are multiple outputs, the process will return an "implicit" record type, similar to the process .out but with single values instead of channels.
  • I removed the use of Sample from all processes since it doesn't add much value. Instead I only bundle some things into Samples at the workflow level where it makes sense. Of course for larger record types it might be different.

Hopefully that makes the types less daunting as well. Basically the only place where they need to be explicitly declared is for function/process/workflow inputs.
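
A hypothetical before/after of those simplifications, based on the Sample outputs quoted above (illustrative only):

    // before: explicit output types
    output:
    Sample fastq = new Sample(meta, path("*fastq.gz"))
    Sample md5   = new Sample(meta, path("*md5"))

    // after: types inferred from the right-hand side
    output:
    fastq = new Sample(meta, path("*fastq.gz"))
    md5   = new Sample(meta, path("*md5"))

    // with multiple outputs, the call would return an implicit record:
    //   result = ASPERA_CLI( sample, ... )
    //   result.fastq, result.md5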

val pipeline
val strandedness
val mapping_fields
List<Map> sra_metadata
@adamrtalbot (Contributor), May 1, 2024:
Does this mean a sample of 1 no longer has to be explicitly tested if you want to use it? https://github.com/nf-core/modules/blob/5f12fc2128f419a8750c5b0620e4b54d7aa33fec/modules/nf-core/ashlar/main.nf#L27-L29

@bentsherman (Author):
yes exactly

Member:
List<Map> syntax is a bit hard/complex

Contributor:
I agree, it's a little clunky, but it's inherited directly from Groovy. Would [Map] or List(Map) look nicer?

@bentsherman (Author):
also, it's a list of maps, so you can't get much simpler than List<Map>

types/types.nf Outdated
Comment on lines 2 to 5
record Sample {
Map<String,?> meta
List<Path> files
}
Contributor: [image attachment]

Comment on lines +56 to +75
//
// MODULE: Get SRA run information for public database ids
//
|> map { id ->
SRA_IDS_TO_RUNINFO ( id, ena_metadata_fields )
} // Channel<Path>
//
// MODULE: Parse SRA run information, create file containing FTP links and read into workflow as [ meta, [reads] ]
//
|> map(SRA_RUNINFO_TO_FTP) // Channel<Path>
|> set { runinfo_ftp } // Channel<Path>
|> flatMap { tsv ->
splitCsv(tsv, header:true, sep:'\t')
} // Channel<Map>
|> map { meta ->
meta + [single_end: meta.single_end.toBoolean()]
} // Channel<Map>
|> unique // Channel<Map>
|> set { sra_metadata } // Channel<Map>

Contributor:
This actually might have the biggest impact. Piping has never been more popular, especially with biologists, because of R. Being able to write the pipeline in this functional way might help express the mental model of channels. Sorry, too much language there, but I like this a lot.

@bentsherman (Author):
Indeed. A lot of moving pieces have to come together to make this work. Does it make sense to you how I am calling the process like a function in the map operator?

Member:
It took a second to process, but this is soooo much better. This deals with what I wanted much better than how I initially thought about it from the current syntax.

Member:
what is |> used for?

Member:
ooh, I see, sorry

@mahesh-panchal (Member) left a comment:
I really like the pipe replacement: |> is easier to see, and provides some visual directionality, helping readability.

Something extra: How about also shifting:

workflow {
    workflow.onComplete {
    }
}

to

workflow {
    onStart:
    ...

    take:
    ...

    main:
    ...
    
    onComplete:
    ...
}

There are some things I really like here, but I have reservations about other things, like how channels are obfuscated with their values, and process outputs.


@@ -86,6 +102,11 @@ workflow {
)
}

publish {
directory params.outdir
Member:
Can the syntax take an = or : here, for readability? Or is this a function?

directory(params.outdir)

@bentsherman (Author):

it is a function call under the hood, so you could use parentheses here. it is the same syntax as process directives.

personally I would rather put these settings in the config file because they seem more like config, but Paolo prefers this form for now


Comment on lines 23 to 25
topic:
[ task.process, 'sratools', eval("fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+'") ] >> 'versions'
[ task.process, 'pigz', eval("pigz --version 2>&1 | sed 's/pigz //g'") ] >> 'versions'
Member:
I kind of like this, but I dislike the name topic. I don't feel like the word communicates what its function is.

It would also be nice if we could supply a regex to validate what the eval should return, for some fail-fast behavior when there's extra stuff being emitted. Where would one define a global pattern variable? E.g.

def SOFTWARE_VERSION = /\d+.../
def SHASUM = /\w{16}/ 

Or maybe this should be a class? like you can filter { Number }.

@bentsherman (Author):
topic is a term from stream processing, used to collect related events from many different sources. in this case we are sending the tool version info to a custom "versions" topic, then the workflow reads from that topic to build the versions yaml file.

eval is just a function defined in the output / topic scope, so you could wrap it in a custom validation function:

def validate( pattern, text ) {
  // ...
}

// ...
  topic:
  validate( /foo/, eval('...') ) >> 'versions'

sraCheckENAMetadataFields(ena_metadata_fields)
} else {
input = file(input)
if (!isSraId(input))
error('Ids provided via --input not recognised please make sure they are either SRA / ENA / GEO / DDBJ ids!')
Member:
Can we have a set of functions for reporting errors, tips, warnings, etc to the user without reporting script line number? As in, there should be a distinction between error messages generated for the user, and error messages generated for the developer.

And ideally something that doesn't put something into a channel if the channel is empty.

@bentsherman (Author):
I think there is an ongoing discussion for that here: nextflow-io/nextflow#4937

in any case, it can be done independently of these language improvements

//
// Prefetch sequencing reads in SRA format.
//
input = SRATOOLS_PREFETCH ( input, ncbi_settings, dbgap_key )
Member:
I don't like this.

|> operator { var -> 
    var = process1( var, ... )
    process2( var, ... )
}

This mixing is confusing.
It should be:

|> operator { var -> process1(var, ...) }
|> operator { var -> process2(var, ...) }

@bentsherman (Author):
you can keep them separate if you want. I combined them here mainly to show that it's possible. they are just functions after all, so why not be able to compose them?

.set { ch_mappings }
sra_metadata // Channel<Map>
|> collect // List<Map>
|> { sra_metadata ->
Member:
So we don't need map if we can just supply closures? Does map have a purpose then?

@bentsherman (Author):
the pipe into closure is a shorthand for this:

    index_files = SRA_TO_SAMPLESHEET (
        sra_metadata |> collect, // a.k.a. collect(sra_metadata)
        nf_core_pipeline,
        nf_core_rnaseq_strandedness,
        sample_mapping_fields
    )

it's a convenient way to keep the pipeline going when you can't express the step as a curried function call. in this case, I want to supply some extra arguments to SRA_TO_SAMPLESHEET, so I can use the closure to customize that function call instead of breaking up the pipeline.

actually I'd like to be able to curry the process call just like an operator:

    sra_metadata
        |> collect
        |> SRA_TO_SAMPLESHEET (
                nf_core_pipeline,
                nf_core_rnaseq_strandedness,
                sample_mapping_fields
        )
        |> set { index_files }

in any case, it's important to understand that the result of sra_metadata |> collect is a value (a list of meta maps), not a value channel. you can't use operators like map on a value here, only on channels. there are no more value channels, only queue channels

Member:
OK. So my general confusion is around whether it's either a stream or a value, vs it always being a stream.

Does this mean operators like transpose will be redefined, since it'll be problematic to distinguish between a stream and a value, and what follows |> could be either a channel operator or a Collection function?

@bentsherman (Author):
in order for |> to work with anything without causing ambiguities, everything needs to be typed. I've gone back and forth on whether to allow operators to accept list inputs and "cast" them to channels, but ultimately I think I would prefer to force the user to be explicit. it's also not that hard:

1..10 
  |> Channel.of // it's just an extra line
  |> map { /* ... */ }

I think this makes it perfectly clear what can go into an operator: only channels (i.e. queue channels). Anything else can be converted into a channel beforehand using Channel.of() or Channel.fromList(). So no, we won't need to change operators like transpose, and any operator that currently returns a value channel like collect will just return a regular value.

You can also clearly distinguish between the List::transpose() method and the transpose operator:

[ 1, 2, 3 ].transpose()
[ 1, 2, 3 ] |> Channel.of |> transpose

Note that operators can no longer be called using the dot syntax, and you can't use |> to call an object method, only standalone functions.

Member:
Note that operators can no longer be called using the dot syntax, and you can't use |> to call an object method, only standalone functions.

This helps with transparency and readability a lot.

@bentsherman (Author):
Also, I commented the expected type of each line off to the right so that you can tell whether something is a channel or value. Ideally the IDE tooling will be able to show these type hints in the editor

Member:
My issue with understanding the comments was that my mental model was still incorrect at the time of reading, so they just caused confusion. It makes much more sense now that my mental model is corrected.

Comment on lines 98 to 104
|> map { meta ->
new Tuple2<Map,String>( meta, meta.run_accession )
} // Channel<Tuple2<Map,String>>
|> FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
dbgap_key ? file(dbgap_key, checkIfExists: true) : [],
sratools_fasterqdump_args,
sratools_pigz_args ) // Channel<Sample>
@mahesh-panchal (Member), May 2, 2024:
So how does this compose inputs? Anything that's piped in is taken as the first channel, otherwise we need to use map?

This would otherwise be:

|> map { meta ->
                def tuple2 = new Tuple2<Map,String>( meta, meta.run_accession )
                FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
                    tuple2,
                    dbgap_key ? file(dbgap_key, checkIfExists: true) : [],
                    sratools_fasterqdump_args,
                    sratools_pigz_args )
    }

I'm not sure I'm liking the flexibility in the new syntax. This makes readability harder in my opinion.

@bentsherman (Author):
yes, the pipe source becomes the first argument in the call. this is already how it works for operators, so just extending it to functions / processes / workflows in general

but you can't call a workflow in an operator because a workflow itself contains dataflow logic. a process is more like a regular function, which is why it can be called anywhere within a workflow.

I suspect that this syntax is easier to understand for someone new to Nextflow, but possibly harder for someone used to DSL2. People have learned a lot of things in order to cope with the complexity of dataflow logic, which will need to be unlearned.

@bentsherman (Author):
this example is actually the inverse of the other one, so it can also be written as:

    sra_metadata
        // ...
        |> { sra_metadata ->
            FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS (
                sra_metadata,
                dbgap_key ? file(dbgap_key, checkIfExists: true) : [],
                sratools_fasterqdump_args,
                sratools_pigz_args )
        }

Member:
but you can't call a workflow in an operator because a workflow itself contains dataflow logic. a process is more like a regular function, which is why it can be called anywhere within a workflow.

Is there a reason this has to be the case? At the moment workflows seem just to be channel zip-ties, and I think many would really like them to be more like functions

@bentsherman (Author):
That's how I used to feel as well, but it doesn't really make sense in general. Calling a workflow in a map operator implies that you want the workflow to independently process each value from the input channel, like a process. But what if the workflow has dataflow logic like groupTuple and reduce? Then it needs to operate on the channel as a whole, not just the individual values.

Now there is a special case, which is a workflow that only calls processes and certain operators like map and filter, for example:

workflow FOO {
  take: input
  main: input |> map(PROC1) |> map(PROC2) |> set { out }
  emit: out
}

This workflow could in theory be called within an operator because it never needs the entire channel, each value is processed independently. But in that case, it could just be an operator closure!

workflow {
  // closure equivalent to workflow FOO
  input |> map { val ->
    PROC2(PROC1(val))

    // can use pipes here btw
    // val |> PROC1 |> PROC2
  }
}

Member:
But what if the workflow has dataflow logic like groupTuple and reduce ? Then it needs to operate on the channel as a whole, not just the individual values.

I'm not convinced of this. If I knew that the subworkflow acted more like a function, then I would expect these operators only to work on the subset I passed as input and not everything. I have to admit though I don't have much experience with scatter/gather implementations so outside of the naive implementation I haven't thought about it a lot.

@bentsherman (Author):
Thinking about this more, I guess you could treat workflows like functions and execute each workflow on an independent set of inputs. For example you could have a channel of channels, map it with a workflow, then each workflow invocation operates on one channel. This is actually something that has been requested before.

But I suspect it would add a lot of complexity without much benefit over what is already possible. I'll have to think on it though, maybe it'll become clearer after the first round of development
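
A purely illustrative sketch of that idea, not part of this proposal (the batches channel and workflow FOO are hypothetical):

    batches                               // Channel<Channel<Sample>>
      |> map { batch -> FOO( batch ) }    // each FOO invocation would operate on one inner channel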

@samuell commented May 2, 2024:

FWIW, I have one tiny piece of feedback that came up after the previous discussion (which I wasn't aware of):

The fair keyword describes what is, to my knowledge, very often called "FIFO" (First-In, First-Out) in other contexts, and FIFO might have been a clearer name? (That said, perhaps not worth the change...)

@bentsherman (Author):
@samuell I would say, just submit an issue for that, it is more of an API change than a syntax change

@samuell commented May 3, 2024:

Reading through the suggestion in more detail now, I'm a little concerned about this one:

The DAG will be constructed at compile-time instead of run-time, which will allow the DAG to be more comprehensive -- include params and how they connect to processes, include conditional pipeline code (e.g. if-else statements), allow nextflow inspect to list every container that might possibly be used, etc

In my experience, there are some use cases that require run-time generated DAGs, for example when initiating pipeline structure based on values extracted as part of the workflow.

This is common e.g. in machine learning, where you might run hyper-parameter tuning, which generates values that are sent to initialize downstream processes but that might also influence how the DAG is generated downstream.

I've been writing about it before: https://bionics.it/posts/dynamic-workflow-scheduling

Not sure how well this applies here, but want to raise the flag about it, since it is a real limitation we have been running into with other pipeline systems (Luigi).

EDIT: Actually, I guess since we are almost definitely talking about the DAG of processes and not the DAG of tasks, a compile-time DAG would still not rule out all of dynamic scheduling (Since the dataflow paradigm of Nextflow does dynamic task scheduling inherently). Still, it seems some cases of dynamic scheduling might be affected; those that require the process DAG structure to be defined based on outcomes of previous computations.

Sample md5 = new Sample(meta, path("*md5"))

topic:
[ task.process, 'aspera_cli', eval('ascli --version') ] >> 'versions'
Member:
why brackets?

Member:
My guess is that this could be any object but a List is easy to process

@bentsherman (Author):
yes it's just a list literal, could be any expression

@mahesh-panchal (Member), May 3, 2024:
Are there reasons for the above over

    'versions' << [ task.process, 'aspera_cli', eval('ascli --version') ]

This is more consistent with add/append, isn't it?

@bentsherman (Author):
that could also work

@bentsherman (Author):
@samuell yes, I'm talking about the process (i.e. "abstract") DAG, which Nextflow already constructs before executing the pipeline. But it has to execute the script in order to do this, which limits its usefulness.

@mahesh-panchal (Member):
How come there are new keywords (let, fn, etc.)? What's the difference from def?

@bentsherman (Author):
Just another idea to consider. With a formal grammar, we don't have to adhere so closely to Groovy; we can make whatever syntax we want, as long as it can be translated to Groovy AST. So as a demonstration I have replaced def with more specific keywords: fn for a function definition, let for a variable that can't be reassigned, and var for a variable that can be reassigned (essentially final vs def in Groovy).

Notice I also changed how types are specified: <name>: <type> instead of <type> <name>, which I personally like because it emphasizes the semantic name over the type, which is optional. See the sketch below.
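
A small illustrative sketch of those keywords and annotations (exploratory syntax, not final):

    fn double(x: Integer) {
        let base = x          // `let`: cannot be reassigned (like Groovy `final`)
        var result = base     // `var`: can be reassigned (like Groovy `def`)
        result = result * 2
        return result
    }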

@mahesh-panchal (Member):
Is it going to be problematic if people combine Groovy and this grammar? For example in exec: blocks.

@bentsherman (Author):
It would apply to all Nextflow code, including exec: blocks

@mahesh-panchal (Member):
I understood it would apply to all, but my question was really whether people could mix grammars, and if so, what would happen, e.g.:

exec:
let some_var = do_stuff
def another_thing = do_other_stuff

@bentsherman (Author):
We would either drop def in the next DSL version (a hard cut-off) or support it temporarily with a compiler warning

@bentsherman changed the title from "Preview: DSL2+ (and beyond)" to "Proposal: Beyond DSL2" on May 21, 2024