Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic task cleanup #3849

Draft
wants to merge 40 commits into
base: master
Choose a base branch
from
Draft

Automatic task cleanup #3849

wants to merge 40 commits into from

Conversation

bentsherman
Copy link
Member

@bentsherman bentsherman commented Apr 10, 2023

Alternative to #3818

Instead of adding a temporary option to output paths, this PR facilitates the automatic cleanup through the cleanup config option. By setting cleanup = 'eager', Nextflow will automatically delete task directories during the workflow run. Caveats are documented in the PR.

TODO:

  • wait for outputs to be published
  • warn about incompatible publish modes
  • resumability
  • refactor based on workflow output DSL
  • remove lazy, eager strategies, make aggressive strategy resumable

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
…emote paths)

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Copy link
Member Author

As I mentioned in the other PR, this eager cleanup currently won't work correctly with file publishing because the publishing is asynchronous. So for each task we need to wait for any files to be published first...

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the base branch from ben-task-graph to master April 28, 2023 18:43
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@netlify
Copy link

netlify bot commented Jul 7, 2023

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 9ce10dc
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/65c1015b846889000810a7cc

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman changed the title Add eager cleanup option Automatic task cleanup Aug 13, 2023
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 81f7cb7 to 8a43489 Compare August 20, 2023 20:13
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@JohnHadish
Copy link

JohnHadish commented Mar 21, 2024

If I understand correctly, your current method of implementing this is to wait for the output of each process to be output then perform cleanup. It may be better to wait until the end of all processes of a single run and then delete. This would probably make it easier to run since you do not need to worry about what files need to be stored. This would still offer a significant amount of space saving.

Exmaple:
I have 10000 files to process. I have a machine where I can have 100 jobs at a time. My nextflow workflow has 5 processes, each which produces files I do not care about, but need for the next step. The run begins and the first 100 files are selected to process. File 1 goes through all 5 processes. File 1 is now done, and all of its temporary files can be deleted. Something interupts the run. File 2 had just completed process 3 during the interuption. The user restarts, File 2 continues where it was, as it has all intermediate files still.

@bentsherman
Copy link
Member Author

The cleanup strategy needs to be reworked due to the upcoming workflow output DSL #4670.

If the publish definition is moved to the workflow level, the task has no idea which of its outputs will be published, and the cleanup observer can't delete a task until it knows that all of the outputs that are going to be published have been published.

A simple solution would be to mark an output for deletion when it is published (and downstream tasks are done with it, etc). The downside is that outputs that are not published are not deleted.

Thinking further, the current POC of the output DSL just appends some "publish" operators to the DAG, so I might be able to trace each process output through the DAG to see if it's connected to a publish op. That way we know if an output can "not" be published and delete it sooner. It still misses files that "could" be published but aren't at runtime, e.g. because they get filtered out by a filter op, but I suspect this is an edge case that can be avoided with good pipeline design.

Finally, we can always fall back to the existing strategy of "delete whatever is left at the end". As long as the eager cleanup can delete enough files early on, it should be enough to move many pipelines from un-runnable to runnable.

@bentsherman bentsherman mentioned this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants