
Ability to stop running pipeline. #1441

Closed
cbmckni opened this issue Jan 6, 2020 · 16 comments

@cbmckni

cbmckni commented Jan 6, 2020

New feature

I would like for the user to be able to stop a running pipeline, on any of the supported executors.

Previous work would not be deleted, and the user would be able to resume the workflow as usual.

Usage scenario

It would be very useful to be able to stop a pipeline on command. This would allow users to stop pipelines that are running incorrectly, more easily replicate transient or intermittent faults, temporarily free up resources, etc.

Suggested implementation

Implementation would involve coding wrappers for each of the executors.

We primarily use Nextflow in a Kubernetes environment. In that case, Nextflow would keep track of the running driver pods, then kill them along with all of their associated pods. This could probably be implemented fairly easily if the driver pods keep track of their associated process pods (a rough kubectl sketch is shown after the command examples below).

For this example, the command to kill a workflow submitted using

nextflow kuberun systemsgenetics/kinc-nf -v deepgtex-prp

would look something like

nextflow kuberun stop systemsgenetics/kinc-nf -v deepgtex-prp

or tags could be implemented, if they are not already:

nextflow kuberun systemsgenetics/kinc-nf -v deepgtex-prp -t workflow1

and the corresponding stop command would look something like

nextflow kuberun stop workflow1
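
Under the hood, on Kubernetes, such a stop command could do something roughly like the following (a sketch only; the runName label and the variable values are assumptions, not an existing Nextflow interface):

# Hypothetical cleanup a "stop" wrapper could perform; the runName label is
# an assumed labelling convention, not a confirmed Nextflow feature.
RUN_NAME=workflow1                  # tag / run identifier (illustrative)
DRIVER_POD=example-driver-pod       # driver pod created by kuberun (illustrative)

kubectl delete pod "$DRIVER_POD"                # stop the driver
kubectl delete pod -l "runName=$RUN_NAME"       # stop its worker pods, if so labelled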

Thoughts?

If this feature were added, it could open the door to other useful features (such as the ability to stop a workflow and resume it from an earlier process).

@pditommaso
Member

You can just press CTRL+C if it's running in the foreground, or kill it like any other job if it's running in the background.
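
For a run launched in the background, that could look something like this (the process pattern and PID are illustrative):

# Illustrative only: locate the background nextflow process for a run and
# send it a signal so its normal shutdown/cleanup handling can run.
pgrep -af 'nextflow kuberun systemsgenetics/kinc-nf'
kill <PID>    # replace <PID> with the process id printed above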

@cbmckni
Author

cbmckni commented Jan 6, 2020

@pditommaso Thanks for the quick response. The issue is that we have multiple pipelines running in the background simultaneously, and killing the driver pod does not kill the associated process pods.

Is there a way to find the process pods associated with a particular driver?
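
Something like the following is what I have in mind, assuming the pods carry a run-identifying label such as runName (I have not verified that such a label exists):

# Check which labels the pods actually carry (assumption: Nextflow sets
# a run-identifying label such as runName on the pods it creates).
kubectl get pods --show-labels

# If such a label exists, list and delete only that run's pods:
RUN_NAME=example-run
kubectl get pods -l "runName=$RUN_NAME"
kubectl delete pod -l "runName=$RUN_NAME"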

@pditommaso
Member

It should. Which executor are you using?

@pditommaso
Member

Ah sorry, Kubernetes.

@pditommaso
Member

pditommaso commented Jan 6, 2020

@Override
void kill() {
    // Skip pod deletion when cleanup has been disabled
    if( cleanupDisabled() )
        return
    if( podName ) {
        // Delete the pod that was created for this task
        log.trace "[K8s] deleting pod name=$podName"
        client.podDelete(podName)
    }
    else {
        log.debug "[K8s] Oops.. invalid delete action"
    }
}

@cbmckni
Author

cbmckni commented Jan 6, 2020

Yes, if we do a run with 100 simultaneous processes, the running pods are not killed when we delete the driver. We can delete all nf-* pods, but if there is another pipeline running, that kills its pods as well.
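
The blanket cleanup we use today looks roughly like the following, which is why it also removes pods belonging to unrelated runs:

# Delete every pod whose name starts with "nf-"; this does not distinguish
# between pipelines, so concurrent runs lose their pods too.
kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
  | grep '^nf-' \
  | xargs -r kubectl delete pod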

@pditommaso
Member

Does it happen only with a large number of pods? Is there anything useful in the log file? It may be that K8s is unable to handle such a large number of requests at once.

@cbmckni
Author

cbmckni commented Jan 6, 2020

I might know the issue. I realized that instead of killing the client process running in the background on our machines, we have been killing the driver pod on the K8s cluster. Another cause is when a transient error with the driver pod (such as a network timeout) causes it to fail, leaving the process pods hanging.

I will test and get back to you.

If this is the issue, maybe there is a way for the client process to submit a "cleanup" pod if the driver pod fails or becomes unreachable, or to resubmit a driver pod that still tracks the running process pods...
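
As a very rough sketch of that cleanup idea (the image, the runName label, and the required RBAC are all assumptions, not existing behaviour):

# One-shot cleanup pod that deletes a run's worker pods by label.
# Assumes worker pods carry a runName label and that the pod's service
# account is permitted to delete pods; neither is verified here.
RUN_NAME=example-run
kubectl run nf-cleanup --restart=Never --image=bitnami/kubectl:latest --command -- \
  kubectl delete pod -l "runName=$RUN_NAME"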

@cbmckni
Author

cbmckni commented Jan 6, 2020

Does the client process track the process pods, or just the driver pod?

@pditommaso
Member

It tries to delete each job pod one by one.

@cbmckni
Author

cbmckni commented Jan 6, 2020

Using version 19.07, neither the driver pod nor the job pods are deleted when the client process is killed, whether it is running in the foreground or the background. I will test a few more times to see whether this bug persists.

When I leave the client running but kill the driver pod, it is deleted but the job pods remain. I will also repeat this test a few times.

@cbmckni
Author

cbmckni commented Jan 6, 2020

The bug persists; I killed the client with CTRL+C.

These were runs with 10 simultaneous jobs, so I do not think it is an issue caused by a large number of requests.

@pditommaso
Member

Please run a small test with trace logging enabled and include the resulting .nextflow.log file.

nextflow -trace nextflow.k8s kuberun .. etc

@cbmckni
Author

cbmckni commented Jan 6, 2020

Here are the log files generated by the cluster and locally.

Note that while the timestamps are very different, the driver pod is the same, so it is the same run.

Also note that the local log is much smaller, presumably because it was killed early. I had to wait for the pipeline to finish before the cluster log stopped being written to.

nextflow.log.cluster.txt
nextflow.log.local.txt

@bentsherman
Member

Nextflow is normally able to clean up after itself for most executors. For example, if you kill a nextflow run on PBS, it deletes all submitted jobs via qdel. So the analogue in Kubernetes would be for nextflow to delete all submitted pods via kubectl delete pod ....

The problem, I think, is that killing a nextflow run on k8s means deleting the submitter pod, so the nextflow process running on that pod might not get the CTRL-C signal that would normally trigger cleanup. It might have to be implemented as a lifecycle hook instead.

@cbmckni In the meantime you can use a script I added to kube-runner called kube-pods.sh:

https://github.com/SystemsGenetics/kube-runner/blob/master/kube-pods.sh

It lists each pod with the associated nextflow run so you can use it as an example of how to find the worker pods for a particular run.
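
Roughly, the idea is along these lines (this is not the script itself, and the runName label is an assumption to be checked against your cluster):

# Approximation only: list each pod together with the run label Nextflow is
# assumed to attach, so worker pods can be matched to a run before deleting them.
kubectl get pods -o custom-columns=POD:.metadata.name,RUN:.metadata.labels.runName,STATUS:.status.phase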

@stale

stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
