Evict pod and decrease process forks if pod is restarted after exception #2787
Conversation
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
I have not encountered this issue before. These pod failures should be handled properly by the task handler: nextflow/modules/nextflow/src/main/groovy/nextflow/k8s/K8sTaskHandler.groovy Lines 330 to 354 in 5485b69
which is called periodically by the polling monitor: nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskPollingMonitor.groovy Lines 607 to 625 in 5485b69
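Conceptually, that interaction works roughly like this (a simplified Groovy sketch with invented class and method names, not the actual Nextflow code in K8sTaskHandler or TaskPollingMonitor):

```groovy
// Simplified illustration only; names are made up for the example.
class PodHandlerSketch {
    String podPhase                    // in reality this comes from the Kubernetes API

    boolean checkIfCompleted() {
        // a pod that succeeded or failed means the task is done either way;
        // returning true lets the monitor evict it from the running queue
        return podPhase == 'Succeeded' || podPhase == 'Failed'
    }
}

def runningQueue = [
    new PodHandlerSketch(podPhase: 'Running'),
    new PodHandlerSketch(podPhase: 'Failed')
]

// the polling monitor periodically walks the queue and evicts finished tasks
runningQueue.removeAll { it.checkIfCompleted() }
assert runningQueue.size() == 1
```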
I also have a sneaking suspicion that this issue might be resolved simply by using jobs instead of pods (#2751).
Ok, I found a way to reproduce the error. I force it by using a non-existent image.
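A reproduction along these lines might look like the following (a minimal sketch; the image name is made up, and it assumes the k8s executor is configured, e.g. `process.executor = 'k8s'` in `nextflow.config`):

```groovy
// Hypothetical minimal pipeline for reproducing the failure: the container
// image below does not exist, so the pod ends up in ImagePullBackOff and
// the task can never start.
process brokenTask {
    container 'docker.io/example/this-image-does-not-exist:latest'  // made-up image name

    script:
    """
    echo hello
    """
}

workflow {
    brokenTask()
}
```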
The official Nextflow version produces the following output:
You can see that the initial task (a3/9faf08) is resubmitted multiple times. This happens because the task is not evicted. Using Nextflow with my fix, the execution looks like this:
I see now. I was getting confused between k8s pod eviction and eviction in the task polling monitor. I will have to defer to @pditommaso because this change affects all executors, not just k8s.
Btw, there is an issue in the task processor, not sure where, but if a task fails with this error (ImagePullBackOff), the whole pipeline never stops. This is because the task processor sees the failed task in the submitted state, so it waits for a completion that never happens. Does this patch fix this as well?
Yes, I think it's related. Well spotted @Lehmann-Fabian 👍
Yes, it is fixed by this commit.
Happy to hear 🙂
I think this patch is a double-edged sword. I just ran a workflow on a new cluster (i.e., with an empty Docker image cache), and one node was banned from Docker Hub due to too many pulls per second, which is fair, but the Job/Pod eventually succeeded. However, Nextflow failed the whole run because there was a moment when the image pull backoff was really happening. @pditommaso any ideas what to do here? Note: my Nextflow version is a bit old (22.06), so I'm not sure whether there are fixes in later versions.
I'm not sure the problem you are reporting is related to this change.
In rare cases, the following code
nextflow/modules/nextflow/src/main/groovy/nextflow/k8s/client/K8sClient.groovy
Lines 301 to 324 in 7f7cdad
throws an exception which is then handled here:
nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskPollingMonitor.groovy
Line 537 in 7f7cdad
In those cases, the failed pod is repeatedly recognized as failed, and a new instance is started each time (often up to maxRetries times).
On the maxRetries + 1 attempt, Nextflow fails the workflow, as all retries are exhausted.
The problem here is that the failed pod is never evicted, and the exception handling happens again and again.
I can only rarely reproduce the error.
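For illustration, the kind of handling this PR describes might be sketched like this (a minimal, self-contained Groovy sketch with made-up names such as DummyHandler and markFailed, not the actual Nextflow API or the real patch): when the state check throws, the task is failed once, its handler is evicted from the running queue, and the process fork counter is decremented, so the same dead pod is not re-processed on every polling cycle.

```groovy
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical handler whose pod-state lookup always fails, standing in for
// the exception thrown by the K8s client call referenced above.
class DummyHandler {
    boolean failed = false
    void checkIfCompleted() { throw new IllegalStateException('pod state lookup failed') }
    void markFailed(Throwable t) { failed = true }
}

def runningQueue = [new DummyHandler()]
def forkCounter  = new AtomicInteger(1)   // one fork slot currently held by this task

def handler = runningQueue.first()
try {
    handler.checkIfCompleted()
}
catch (Exception e) {
    handler.markFailed(e)            // report the error to the task processor once
    runningQueue.remove(handler)     // evict the handler so it is not inspected again
    forkCounter.decrementAndGet()    // free the fork slot held by the dead pod
}

assert runningQueue.isEmpty() && forkCounter.get() == 0
```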