
Proposal for supporting a dry-run like feature #1774

Closed
pditommaso opened this issue Oct 26, 2020 · 20 comments

@pditommaso
Member

pditommaso commented Oct 26, 2020

When dealing with complex pipelines deployed across heterogeneous systems, it's crucial to be able to quickly verify that all the components run as expected, especially in cloud environments.

A classic way of managing this is to have a dry-run mechanism that simulates the pipeline run, computing all the nodes (tasks) traversed in the execution DAG.

However, this is not feasible in Nextflow because, by design, tasks can contain partial output declarations, e.g. output: path('*.bam'), which captures all produced files with the .bam extension.

Therefore the expected task outputs cannot be determined without running the task itself.

This is why the golden rule for Nextflow pipelines is to include a minimal dataset that allows the complete execution of the pipeline, both locally and in a continuous integration system.

However, in some situations this can be very difficult, and even a small dataset could require too much storage and computing resources.

A possible alternative could be to add to the Nextflow process definition a command stub that mimics the expected outputs. For example:

process foo {
  input:
    path some_data
  output:
    path '*.bam'

  dryrun:
    """
    # check required tools are available
    which some_command || { echo "Missing required tools!"; exit 1; }

    # check the input file exists
    [ -f $some_data ] || { echo "Missing required data!"; exit 1; }

    # create fake bam files
    echo "foo" > gen1.bam
    echo "bar" > gen2.bam
    """

  script:
    """
    some_command --in $some_data --out gen1.bam gen2.bam
    """
}

The dryrun section is ignored unless the user specifies the -dry-run CLI option; in that case, when defined, it replaces the actual process script.

This would provide a nice alternative to quickly test the main execution logic and the deployment on the target platform without hitting the real data.

This mechanism could also be used to rapidly draft the main execution flow by providing only the task stubs, i.e. fake commands, and then replace them once the main flow works as expected.

@micans

micans commented Oct 26, 2020

Sounds very interesting. I often build pipeline structures using toy files and touch and split et cetera, so this would, for me, nicely integrate building a pipeline with testing it.

For large pipelines it might be cumbersome to have dryrun: for all processes. I'm thinking of a mode where the normal process is used by default and the dry run process only if it is present, but can't work out if this could actually work / be useful. I guess one should not write large pipelines 😸

@drpatelh

Sounds cool!! I wonder if we can also use this sort of feature to do unit testing of individual modules/processes? e.g. if we are able to stage some minimal test data from a remote repo like nf-core-testdatasets or a path relative to the module, we could then maybe have some sort of checking mechanism via md5sums or the number of lines in a file, which would be relatively easy to implement in bash?

@pditommaso
Member Author

Excellent!

I'm thinking of a mode where the normal process is used by default and the dry run process only if it is present

That's how it's expected to work when adding the -dry-run CLI option.

I wonder if we can also use this sort of feature to do unit testing of individual modules/processes?

That's slightly different; the plan is to cover unit (task) testing with another feature that allows checking the actual task result.

What about naming? I'm not super convinced about the dryrun: keyword.

@micans

micans commented Oct 26, 2020

As for naming, since it's almost Halloween, I think nothing beats

skeleton:

💀 👻 🧟

@pditommaso
Member Author

🤣 🤣 🤣

proto: ?

@drpatelh

drpatelh commented Oct 26, 2020

rehearsal:, rehearse:, practice:, prototype:

@drpatelh

Not super convinced by the name either, but I think dryrun will be the most obvious name for most people.

@drpatelh

Could have drytest: and unittest: (for the unit testing feature)?

@pditommaso
Member Author

mock:, stub:

@drpatelh

trial:, tryout:

@drpatelh

drpatelh commented Oct 26, 2020

assay:, evaluate:, appraise:, practice:, pilot:, dummy: (although the latter may offend people nowadays...)

@micans

micans commented Oct 26, 2020

So far I quite like proto and stub; my favourite is stub, as it's very descriptive of what the section is.

@rsuchecki
Contributor

I like the proposal, and stub sounds great to me. Having said that, dryrun combined with a -dry-run CLI option would minimize the cognitive load for new starters already juggling other keywords, directives and operators.

@pditommaso
Member Author

pditommaso commented Nov 1, 2020

I agree that dryrun would sound friendlier to the average user; however, a pure dry-run feature would compute which tasks to run without executing them.

Instead, this feature does launch the pipeline, replacing the process commands with a user-provided dummy implementation. I think the name should reflect this difference, to avoid further confusing users and also to stress that it can be used to quickly prototype a pipeline using temporary command stubs.

I like the word stub too, but I feel it's too technical. So far the best choices are:

a. tryout:
b. trail: trial:
c. pilot:
d. proto:

Adding @PaulHancock, who first inspired this feature last year during the Nextflow workshop at Pawsey.

@rsuchecki
Contributor

Good point about the distinction between this feature and the reasonable expectation of what dryrun might be, @pditommaso.

Should that be trial not trail? That'd be my pick I think, but either could work. Just implement both 😜

As you mentioned the test feature earlier, I wonder if it wouldn't suffice to just implement that. After all, if I define a low-bar test for a process, e.g. such that it outputs any file, then it effectively is my stub/trial and allows a dry run? The upside would be that a developer, by implementing even the simplest stub/trial, is one step closer to implementing more substantial unit tests...

In other words isn't a stub/trial just a very basic test?

@pditommaso
Member Author

Ooops yes trial not trail :D

Regarding testing, I see this more for quick runs and prototyping; the plan for testing is to provide the ability to have self-contained tests for each task running the real command, which is the most important thing to validate.

@mmatthews06
Contributor

mmatthews06 commented Nov 3, 2020

Is this intended to be like a unit test framework with mocks?

If so, I feel like genuine MagicMock-like functionality from Python would be more useful than a dry-run. Mock path outputs, mock stdout/stderr, mock value outputs, all in Groovy. Also mock a shell, and make sure shell.calledOnceWith("bwa mem -R ${inputs} ..."), or something similar. Also E2E test that shell.return_value = "sample.bam" is passed to the next process properly.

I'm mixing unit testing and "end-to-end" testing here, but hopefully that makes sense.
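A minimal Python sketch of the mocking style described above (the `shell` object and the command string are hypothetical illustrations, not an existing Nextflow API):

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for a process's shell invocation.
# Its return value plays the role of the task output passed downstream.
shell = MagicMock(return_value="sample.bam")

# The process under test would call the mocked shell instead of the real command.
result = shell("bwa mem -R rg.txt reads.fq")

# Assert the command was invoked exactly once with the expected arguments...
shell.assert_called_once_with("bwa mem -R rg.txt reads.fq")

# ...and that the mocked result is what would flow to the next process.
assert result == "sample.bam"
```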

I'm noodling through an interface and style in my head, but my first question above is more important. I saw talk of a unit test framework, so I may be in the wrong thread.

Edit: Oh, I see where you said this, now: "That's slightly different; the plan is to cover unit (task) testing with another feature that allows checking the actual task result." Did you have any other questions, or would you like a more formal proposal, though?

@ewallace

ewallace commented Nov 7, 2020

Yes please to dryrun or trial.

In a recent tutorial paper on choosing pipeline frameworks, we pointed to the lack of a dry-run as a killer feature that Nextflow currently misses.

@pditommaso
Member Author

Discussing this further, it seems the consensus is for the stub: block definition and the -stub-run command-line option.

I've merged on master a first implementation and drafted the docs here.

If you want to give it a try you can use this command

NXF_VER=20.11.0-SNAPSHOT nextflow run [your script] [-stub-run]
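For reference, the merged implementation follows the shape of the original proposal with the dryrun: block renamed to stub: (a sketch based on the drafted docs; the process, command, and file names are illustrative):

```nextflow
process foo {
  input:
    path some_data
  output:
    path '*.bam'

  // Executed in place of the script block when `-stub-run` is given
  stub:
    """
    touch gen1.bam gen2.bam
    """

  script:
    """
    some_command --in $some_data --out gen1.bam gen2.bam
    """
}
```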

@pditommaso pditommaso added this to the v21.01.0 milestone Nov 9, 2020
@pditommaso pditommaso reopened this Nov 11, 2020
@pditommaso
Member Author

This is available starting from version 20.11.0-edge.
