
Workflow definition & composition [RFC] #1276

Closed
pditommaso opened this issue Aug 13, 2019 · 10 comments

@pditommaso
Member

pditommaso commented Aug 13, 2019

Workflow components

This feature adds support for workflow component definitions: a sequence of processes (and other workflows) can be defined as a reusable, composable component with its own inputs and outputs.

Workflow definition

The workflow keyword allows the definition of sub-workflow components that enclose the
invocation of one or more processes, operators and other workflows. For example:

    workflow my_pipeline {
        foo()
        bar( foo.out.collect() )
    }

Once defined, it can be invoked from another (sub)workflow definition like any other function or process.
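
For instance, a sketch of invoking the my_pipeline component from an enclosing workflow (the outer name is hypothetical):

    workflow outer {
        // a sub-workflow is invoked just like a process or function
        my_pipeline()
    }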

Workflow parameters

A workflow component can access any variable and parameter defined in the outer scope:

        params.data = '/some/data/file'

        workflow my_pipeline  {
            if( params.data )
                bar(params.data)
            else
                bar(foo())
        }

Workflow inputs

A workflow component can declare one or more input channels using the get keyword. For example:

        workflow my_pipeline {
            get: data
            main:
            foo(data)
            bar(foo.out)
        }

WARN: When get is used, the beginning of the workflow body must be identified with the main keyword.

Then, the input can be specified as an argument in the workflow invocation statement:

    my_pipeline( Channel.from('/some/data') )

NOTE: Workflow inputs are channels by definition. If a basic data type is provided instead, i.e. a number, string, list, etc., it's implicitly converted to a value channel (i.e. non-consumable).
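
For example, passing a plain string and passing an explicit value channel should be equivalent (a sketch, reusing my_pipeline from above):

    // the plain string is implicitly wrapped in a value channel
    my_pipeline( '/some/data' )
    // equivalent explicit form
    my_pipeline( Channel.value('/some/data') )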

Workflow outputs

A workflow component can declare one or more output channels using the emit keyword. For example:

        workflow my_pipeline {
            main:
              foo(data)
              bar(foo.out)
            emit:
              bar.out
        }

Then, the result of the my_pipeline execution can be accessed using the out property, i.e.
my_pipeline.out. When more than one output channel is declared, use the array bracket notation to access each output component, as described for processes.
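
For instance, a sketch with two output channels (the second emit is added here purely for illustration):

    workflow my_pipeline {
        main:
          foo(data)
          bar(foo.out)
        emit:
          foo.out
          bar.out
    }

    // access each output channel by index, as with process outputs
    my_pipeline.out[0].view()
    my_pipeline.out[1].view()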

Alternatively, the output channel can be accessed using the identifier name to which it's assigned in the emit declaration:

         workflow my_pipeline {
            main:
              foo(data)
              bar(foo.out)
            emit:
              my_data = bar.out
        }

Then, the result of the above snippet can be accessed using my_pipeline.out.my_data.

Implicit workflow

A workflow definition that does not declare a name is assumed to be the main workflow and is implicitly executed. It is therefore the entry point of the workflow application.
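
A minimal sketch combining a named component with the implicit entry point:

    workflow my_pipeline {
        get: data
        main: foo(data)
        emit: foo.out
    }

    // unnamed workflow: implicitly executed as the application entry point
    workflow {
        my_pipeline( Channel.from('/some/data') )
    }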

Example

An example script is available at this link and here.

Request for comments

This feature is available as an experimental feature in the latest snapshot, 19.08.0-SNAPSHOT build 5148.

Any feedback and comments are welcome.

@mes5k
Contributor

mes5k commented Aug 28, 2019

Awesome! DSL 2 is excellent and this just helps clarify things further.

@pditommaso
Member Author

@mes5k it took some time, but it's turning out awesome 😎

@haqle314

haqle314 commented Sep 6, 2019

Hi, I don't know if I should submit a bug report for this, but I noticed some inconsistencies with identifiers under emit:. Unless they are part of a pipeline, the names of channels/processes/workflows are not recognized properly. Here's an example:

nextflow.preview.dsl = 2

process sayHello {
    output: stdout()
    shell: "echo Hello world"
}
process greet {
    input: val person
    output: val greeting
    exec:
    greeting = "Hi, ${person}."
}

workflow foo {
    emit: sayHello | view
}

workflow bar {
    emit: foo
}

workflow greet1 {
    get: data
    emit: greet(data)
}
workflow greet2 {
    get: data
    emit: data | greet
}

workflow {
    greet1(Channel.from("Alice", "Bob")) | view
    greet2(Channel.from("Alice", "Bob")) | view
    foo | view
}
  • foo executes as expected but trying to execute bar results in the error
    Missing workflow output parameter: foo. It doesn't matter whether foo is a
    workflow or a process, the same error message will appear.
  • Trying to execute greet1 will result in the error message No such variable: data.

@pditommaso
Member Author

pditommaso commented Sep 9, 2019

Thanks for trying this out! I've spotted a couple of problems in your script:

  1. the emit in workflow bar should explicitly reference foo.out
  2. the same problem applies to workflow greet1

I'm copying below the working script for your convenience.

nextflow.preview.dsl = 2

process sayHello {
    output: stdout()
    shell: "echo Hello world"
}
process greet {
    input: val person
    output: val greeting
    exec:
    greeting = "Hi, ${person}."
}

workflow foo {
    emit: sayHello | view
}

workflow bar {
    emit: foo.out
}

workflow greet1 {
    get: data
    main: greet(data)
    emit: greet.out
}
workflow greet2 {
    get: data
    emit: data | greet
}

workflow {
    greet1(Channel.from("Alice", "Bob")) | view
    greet2(Channel.from("Alice", "Bob")) | view
    foo | view
}

@haqle314

haqle314 commented Sep 9, 2019

Thanks for the script, @pditommaso. Now that I think about it, using foo.out
instead of foo makes sense. I just initially felt it wasn't very intuitive that
emit: foo | view works but emit: foo doesn't.

One thing about the script you posted is that I still have to explicitly execute
foo inside main

workflow bar {
    foo()                      // Need this line, bar.out emits null without it
    emit: foo.out
}

I'm bugging you about this because I think it's a lot more succinct to be able
to skip the workflow body, like test2 in
subworkflow-dsl2.nf

On an unrelated note, is there a particular reason why you picked get, main,
and emit as the keywords? I'd think it would be easier for users to pick up if
the workflow definition mimicked the structure of a process definition, something
like this:

workflow foo {
    input: data
    output: result
    exec:
    /* data processing steps */
}

@pditommaso
Member Author

On an unrelated note, is there a particular reason why you picked get, main,
and emit as the keywords?

Sorry for the late reply. get and emit were chosen because the workflow declaration has a syntax different from the process input and output statements. Reusing the same keywords in a different context with a different syntax could create ambiguity for the developer. However, others have expressed doubts, in particular about the get keyword. I'll open a separate thread to discuss other proposals.

@bobamess

Could you use take as a keyword instead of get? It's short, begins with a t as a counterpart to emit, and has a similar meaning to get.

@pditommaso
Member Author

Yes, also @evanfloden was proposing this. Next edge release will use take instead of get.

@darcyabjones

Hi there!

I've been looking at using multiple related workflows defined in the same package using DSL2.
I can see a few ways to do it, but my current stumbling point is when publishing results.

What I'd like to be able to do is define multiple workflows and be able to run a subset of the total pipeline using the -entry flag.
But because this won't be the main/implicit workflow, I can't use the publish block.

Here's a contrived example:

nextflow.preview.dsl = 2

process greet {
    input:
    val person

    output:
    path "greeting_${person}.txt"

    script:
    """
    echo "Hi, ${person}." > "greeting_${person}.txt"
    """
}

process reply {
    input:
    val person1
    val person2
    path "in.txt"

    output:
    path "reply_${person1}_${person2}.txt"

    script:
    """
    echo "Oh hey ${person2}." | cat in.txt - > "reply_${person1}_${person2}.txt"
    """
}

workflow greet1 {
    get: data
    main: greet(data)
    emit: greet.out
}

workflow greet2 {
    get:
        data1
        data2
    main:
        greet(data1)
        reply(data1, data2, greet.out)
    emit: reply.out
}

workflow alternate {
    main:
        greet2(Channel.from("Darcy", "Paolo"), Channel.from("Alice", "Bob"))
}

workflow {
    main:
        greet1(Channel.from("Alice", "Bob"))
        greet2(Channel.from("Alice", "Bob"), Channel.from("Darcy", "Paolo"))

    publish:
        greet1.out to: "results"
        greet2.out to: "results"
}

Running nextflow run main.nf publishes results as expected, but nextflow run main.nf -entry alternate cannot.

What do you imagine would be the best way to do this in dsl2?
The current workaround would be to use a number of conditional statements in the main workflow.
But even that would involve lots of boilerplate to do publishing (empty channels etc).

Thanks Darcy.

PS. Really loving DSL2 so far.
I can see a lot of potential for code re-use, modularity, and maybe even unit/integration testing.
Thanks for your great work!

@pditommaso
Member Author

What I'd like to be able to do is define multiple workflows and be able to run a subset of the total pipeline using the -entry flag.

That's expected. The idea of sub-workflows is to make them composable and therefore not bound to any result path. You should create a separate script with its own main workflow and include the modularised sub-workflows.
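
For example, a separate entry script along these lines (a sketch; the file name and module path are assumptions):

    // publish_greet2.nf -- hypothetical entry-point script
    nextflow.preview.dsl = 2

    include greet2 from './main.nf'   // module path is an assumption

    workflow {
        main:
            greet2(Channel.from("Darcy", "Paolo"), Channel.from("Alice", "Bob"))
        publish:
            greet2.out to: "results"
    }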
