
Ability to stop running pipeline. #1441

Closed
cbmckni opened this issue Jan 6, 2020 · 16 comments

@cbmckni

cbmckni commented Jan 6, 2020

New feature

I would like for the user to be able to stop a running pipeline, on any of the supported executors.

Previous work would not be deleted, and the user would be able to resume the workflow as usual.

Usage scenario

It would be very useful to be able to stop a pipeline on command. This would allow users to stop pipelines that are running incorrectly, more easily replicate transient or intermittent faults, temporarily free up resources, etc.

Suggested implementation

Implementation would involve coding wrappers for each of the executors.

We primarily use Nextflow in a Kubernetes environment. In that case, Nextflow would keep track of the running driver pods, then kill them along with all of their associated pods. This could probably be implemented fairly easily if the driver pods keep track of their associated process pods (a rough kubectl sketch is shown after the command examples below).

For this example, the command to kill a workflow submitted using

nextflow kuberun systemsgenetics/kinc-nf -v deepgtex-prp

would look something like

nextflow kuberun stop systemsgenetics/kinc-nf -v deepgtex-prp

or tags could be implemented, if they are not already:

nextflow kuberun systemsgenetics/kinc-nf -v deepgtex-prp -t workflow1

and the corresponding stop command would look something like

nextflow kuberun stop workflow1
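
Under the hood, on Kubernetes, such a stop command could do something roughly like the following (a sketch only; the runName label and the variable values are assumptions, not an existing Nextflow interface):

# Hypothetical cleanup a "stop" wrapper could perform; the runName label is
# an assumed labelling convention, not a confirmed Nextflow feature.
RUN_NAME=workflow1                  # tag / run identifier (illustrative)
DRIVER_POD=example-driver-pod       # driver pod created by kuberun (illustrative)

kubectl delete pod "$DRIVER_POD"                # stop the driver
kubectl delete pod -l "runName=$RUN_NAME"       # stop its worker pods, if so labelled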

Thoughts?

If this feature were added, it could open the door to other useful features (such as the ability to stop a workflow and resume it from an earlier process).

@pditommaso
Member

You can just press CTRL+C if it's running in the foreground, or kill it like any other job if it's running in the background.
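
For a run launched in the background, that could look something like this (the process pattern and PID are illustrative):

# Illustrative only: locate the background nextflow process for a run and
# send it a signal so its normal shutdown/cleanup handling can run.
pgrep -af 'nextflow kuberun systemsgenetics/kinc-nf'
kill <PID>    # replace <PID> with the process id printed above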

@cbmckni
Author

cbmckni commented Jan 6, 2020

@pditommaso Thanks for the quick response. The issue is that we have multiple pipelines running in the background simultaneously, and killing the driver pod does not kill the associated process pods.

Is there a way to find the process pods associated with a particular driver?
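
Something like the following is what I have in mind, assuming the pods carry a run-identifying label such as runName (I have not verified that such a label exists):

# Check which labels the pods actually carry (assumption: Nextflow sets
# a run-identifying label such as runName on the pods it creates).
kubectl get pods --show-labels

# If such a label exists, list and delete only that run's pods:
RUN_NAME=example-run
kubectl get pods -l "runName=$RUN_NAME"
kubectl delete pod -l "runName=$RUN_NAME"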

@pditommaso
Member

It should. Which executor are you using?

@pditommaso
Member

Ah sorry, Kubernetes.

@pditommaso
Member

pditommaso commented Jan 6, 2020

@Override
void kill() {
    // Skip pod deletion when cleanup has been disabled
    if( cleanupDisabled() )
        return
    if( podName ) {
        // Delete the pod that was created for this task
        log.trace "[K8s] deleting pod name=$podName"
        client.podDelete(podName)
    }
    else {
        log.debug "[K8s] Oops.. invalid delete action"
    }
}

@cbmckni
Author

cbmckni commented Jan 6, 2020

Yes, if we do a run with 100 simultaneous processes, the running pods are not killed when we delete the driver. We can delete all nf-* pods, but if there is another pipeline running, that kills its pods as well.
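
The blanket cleanup we use today looks roughly like the following, which is why it also removes pods belonging to unrelated runs:

# Delete every pod whose name starts with "nf-"; this does not distinguish
# between pipelines, so concurrent runs lose their pods too.
kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
  | grep '^nf-' \
  | xargs -r kubectl delete pod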

@pditommaso
Member

Does it happen only with a large number of pods? Is there anything useful in the log file? It may be that K8s is unable to handle such a large number of requests at once.

@cbmckni
Author

cbmckni commented Jan 6, 2020

I might know the issue. I realized that instead of killing the client process running in the background on our machines, we have been killing the driver pod on the K8s cluster. Another cause is when a transient error with the driver pod (such as a network timeout) causes it to fail, leaving the process pods hanging.

I will test and get back to you.

If this is the issue, maybe there is a way for the client process to submit a "cleanup" pod if the driver pod fails or becomes unreachable, or to resubmit a driver pod that still tracks the running process pods...
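
As a very rough sketch of that cleanup idea (the image, the runName label, and the required RBAC are all assumptions, not existing behaviour):

# One-shot cleanup pod that deletes a run's worker pods by label.
# Assumes worker pods carry a runName label and that the pod's service
# account is permitted to delete pods; neither is verified here.
RUN_NAME=example-run
kubectl run nf-cleanup --restart=Never --image=bitnami/kubectl:latest --command -- \
  kubectl delete pod -l "runName=$RUN_NAME"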

@cbmckni
Author

cbmckni commented Jan 6, 2020

Does the client process track the process pods, or just the driver pod?

@pditommaso
Member

It tries to delete each job pod one by one.

@cbmckni
Author

cbmckni commented Jan 6, 2020

Using version 19.07, neither the driver pod nor the job pods are deleted when the client process is killed, whether it is running in the foreground or the background. I will test a few more times to see whether this bug persists.

When I leave the client running but kill the driver pod, it is deleted but the job pods remain. I will also repeat this test a few times.

@cbmckni
Author

cbmckni commented Jan 6, 2020

The bug persists; I killed the client with CTRL+C.

These were runs with 10 simultaneous jobs, so I do not think it is an issue caused by a large number of requests.

@pditommaso
Member

Please run a small test with trace logging enabled and include the resulting .nextflow.log file.

nextflow -trace nextflow.k8s kuberun .. etc

@cbmckni
Author

cbmckni commented Jan 6, 2020

Here are the log files generated by the cluster and locally.

Note that while the timestamps are very different, the driver pod is the same, so it is the same run.

Also note that the local log is much smaller, presumably because it was killed early. I had to wait for the pipeline to finish before the cluster log stopped being written to.

nextflow.log.cluster.txt
nextflow.log.local.txt

@bentsherman
Member

Nextflow is normally able to clean up after itself for most executors. For example, if you kill a nextflow run on PBS, it deletes all submitted jobs via qdel. So the analogue in Kubernetes would be for nextflow to delete all submitted pods via kubectl delete pod ....

The problem, I think, is that killing a nextflow run on k8s means deleting the submitter pod, so the nextflow process running on that pod might not get the CTRL-C signal that would normally trigger cleanup. It might have to be implemented as a lifecycle hook instead.

@cbmckni In the meantime you can use a script I added to kube-runner called kube-pods.sh:

https://github.com/SystemsGenetics/kube-runner/blob/master/kube-pods.sh

It lists each pod with the associated nextflow run so you can use it as an example of how to find the worker pods for a particular run.
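
Roughly, the idea is along these lines (this is not the script itself, and the runName label is an assumption to be checked against your cluster):

# Approximation only: list each pod together with the run label Nextflow is
# assumed to attach, so worker pods can be matched to a run before deleting them.
kubectl get pods -o custom-columns=POD:.metadata.name,RUN:.metadata.labels.runName,STATUS:.status.phase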

@stale

stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
