
Evict pod and decrease process forks if pod is restarted after exception #2787

Merged: 3 commits merged into nextflow-io:master on Apr 23, 2022

Conversation

@Lehmann-Fabian (Contributor)

In rare cases, the following code

if( waiting.reason == 'ErrImagePull' || waiting.reason == 'ImagePullBackOff' ) {
    def message = "K8s pod image cannot be pulled"
    if( waiting.message ) message += " -- $waiting.message"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}
if( waiting.reason == 'CreateContainerConfigError' ) {
    def message = "K8s pod configuration failed"
    if( waiting.message ) message += " -- $waiting.message"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}
if( waiting.reason =~ /.+Error$/ ) {
    def message = "K8s pod waiting on unknown error state"
    if( waiting.message ) message += " -- $waiting.message"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}
final status = resp.status as Map
if( status?.phase == 'Failed' ) {
    def message = "K8s pod in Failed state"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}

throws an exception, which is then handled by the TaskPollingMonitor.

In those cases, the failed pod is recognized as failed over and over again, and another new instance is started each time (often maxRetries times).
On the maxRetries + 1 attempt, Nextflow fails the workflow, as all retries are exhausted.
The problem is that the failed pod is never evicted, so the exception handling happens again and again.

I can only rarely reproduce the error.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian changed the title from "Evict pod and decrease process forks if exception is handled" to "Evict pod and decrease process forks if pod is restarted after exception" on Apr 14, 2022
@bentsherman self-requested a review on April 14, 2022 at 13:27
@bentsherman (Member) commented Apr 14, 2022

I have not encountered this issue before. Last year we changed PodUnschedulableException to extend ProcessException so that it is routed through the normal error strategy instead of causing workflow termination, and I was able to handle these errors with retries just fine. So I am hesitant to merge this fix without reproducing the error and confirming that this change fixes it.
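For context, the class relationship described above looks roughly like the following. This is a minimal sketch based on that statement, not the exact Nextflow source; the RuntimeException base class is an assumption made only for the sketch, while the (message, cause) constructor shape mirrors the calls quoted earlier.

// Minimal sketch, not the exact Nextflow source.
// Because PodUnschedulableException is a ProcessException, it is routed through
// the process errorStrategy (e.g. 'retry') instead of terminating the whole run.
class ProcessException extends RuntimeException {            // base class assumed for this sketch
    ProcessException( String message, Throwable cause ) { super(message, cause) }
}

class PodUnschedulableException extends ProcessException {
    PodUnschedulableException( String message, Throwable cause ) { super(message, cause) }
}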

These pod failures should be handled properly by the task handler:

boolean checkIfCompleted() {
    if( !podName ) throw new IllegalStateException("Missing K8s pod name - cannot check if complete")
    def state = getState()
    if( state && state.terminated ) {
        if( state.nodeTermination instanceof NodeTerminationException ) {
            // keep track of the node termination error
            task.error = (NodeTerminationException) state.nodeTermination
            // mark the task as ABORTED since the failure is caused by a node failure
            task.aborted = true
        }
        else {
            // finalize the task
            task.exitStatus = readExitFile()
            task.stdout = outputFile
            task.stderr = errorFile
        }
        status = TaskStatus.COMPLETED
        savePodLogOnError(task)
        deletePodIfSuccessful(task)
        updateTimestamps(state.terminated as Map)
        return true
    }
    return false
}

which is called periodically by the polling monitor:

if( handler.checkIfCompleted() ) {
    log.debug "Task completed > $handler"
    // decrement forks count
    handler.decProcessForks()
    // since completed *remove* the task from the processing queue
    evict(handler)
    // finalize the tasks execution
    final fault = handler.task.processor.finalizeTask(handler.task)
    // notify task completion
    session.notifyTaskComplete(handler)
    // abort the execution in case of task failure
    if (fault instanceof TaskFault) {
        session.fault(fault, handler)
    }
}

I also have a sneaking suspicion that this issue might be resolved simply by using jobs instead of pods (#2751).
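For reference, here is what "using jobs instead of pods" looks like from the user side, assuming the k8s.computeResourceType configuration option that later Nextflow releases expose for this purpose; treat the option name as an assumption here, since it is not part of this PR or of #2751 as discussed in this thread.

// nextflow.config: hypothetical sketch, assuming the computeResourceType option.
k8s {
    computeResourceType = 'Job'   // run tasks as K8s Jobs rather than bare Pods
}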

@Lehmann-Fabian (Contributor, Author)

OK, I found a way to reproduce the error: I force it by using a non-existent image.

process a {

    container 'notavalidimage'
    errorStrategy 'retry' 
    maxRetries 3

    """
    echo "Hello World"
    """

}

workflow {
    a()
}
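For completeness, a reproduction like this is typically pointed at a Kubernetes cluster with a configuration along the following lines. This is a minimal sketch: the claim name is a placeholder rather than anything taken from this thread, and the mount path simply mirrors the /pvcdata work directories visible in the logs below.

// nextflow.config: minimal sketch for running the example on Kubernetes.
// 'my-shared-pvc' is a placeholder; '/pvcdata' mirrors the work dirs in the logs.
process.executor = 'k8s'

k8s {
    storageClaimName = 'my-shared-pvc'   // ReadWriteMany PVC shared by the task pods
    storageMountPath = '/pvcdata'        // where the claim is mounted inside each pod
}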

The official Nextflow version produces the following output:

N E X T F L O W  ~  version 22.03.1-edge
Launching `/pvcdata/nextflowPodNotEvicted.nf` [distracted-dalembert] DSL2 - revision: 678c7f2318
[a3/9faf08] Submitted process > a
[a3/9faf08] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (1)
[f4/6e9b10] Re-submitted process > a
[a3/9faf08] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (2)
[f4/6e9b10] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (2)
[01/2008a8] Re-submitted process > a
[2c/6e38aa] Re-submitted process > a
[a3/9faf08] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[f4/6e9b10] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[9c/c3b88b] Re-submitted process > a
[01/2008a8] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[48/915341] Re-submitted process > a
[2c/6e38aa] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[43/743081] Re-submitted process > a
[46/a22ed4] Re-submitted process > a
Error executing process > 'a'

Caused by:
  K8s pod image cannot be pulled -- Back-off pulling image "notavalidimage"

Command executed:

  echo "Hello World"

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /pvcdata/work/a3/9faf082af36b6f63a6b08dad9f5671

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`


WARN: Killing running tasks (8)
Pod running: distracted-dalembert ... waiting for pod to stop running
Pod distracted-dalembert has changed from running state [terminated:[exitCode:1, reason:Error, startedAt:2022-04-19T09:03:13Z, finishedAt:2022-04-19T09:03:38Z, containerID:containerd://38d8f0363f4a02678f52787c1b5c7199aafbe7a789312e9b51df4294a84ca24d]]

You can see that the initial task (a3/9faf08) is resubmitted multiple times. This happens because the failed task is never evicted.
I don't think the problem is limited to Kubernetes; it affects all executors that use the TaskPollingMonitor and don't handle these exceptions themselves.
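The idea behind the fix, roughly: when the exception is handled and the task is resubmitted, the stale handler should also be removed from the polling queue and its fork slot released, mirroring the completed-task path quoted above. The sketch below is hypothetical and not the actual diff of this PR; handleTaskFailure and processor.handleException are placeholder names, while decProcessForks(), evict(), TaskFault and session.fault() are the existing pieces shown earlier in this thread.

// Hypothetical sketch of a failure path inside the TaskPollingMonitor
// (placeholder names; not the actual change set of this PR).
protected void handleTaskFailure( TaskHandler handler, Throwable error ) {
    // let the task processor apply the configured errorStrategy (e.g. 'retry');
    // 'handleException' stands in for that step
    final fault = handler.task.processor.handleException(error, handler.task)
    // release the fork slot held by the failed execution
    handler.decProcessForks()
    // remove the stale handler from the polling queue so it is not checked again
    evict(handler)
    // abort the run if the error strategy decided the failure is fatal
    if( fault instanceof TaskFault )
        session.fault(fault, handler)
}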

Using Nextflow with my fix, the execution looks like this:

N E X T F L O W  ~  version 22.03.1-edge
Launching `/pvcdata/nextflowPodNotEvicted.nf` [soggy-majorana] DSL2 - revision: 678c7f2318
[79/0ff663] Submitted process > a
[79/0ff663] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (1)
[b1/054571] Re-submitted process > a
[b1/054571] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (2)
[fa/98f20c] Re-submitted process > a
[fa/98f20c] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[03/242984] Re-submitted process > a
Error executing process > 'a'

Caused by:
  K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

Command executed:

  echo "Hello World"

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /pvcdata/work/03/2429849d206aebb3b77d99fc0fde47

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

@bentsherman (Member)

I see now. I was getting confused between k8s pod eviction and eviction in the task polling monitor. I will have to defer to @pditommaso because this change affects all executors, not just k8s.

@xhejtman (Contributor)

By the way, there is an issue in the task processor (I'm not sure exactly where): if a task fails with this error (ImagePullBackOff), the whole pipeline never stops. This is because the task processor sees the failed task in the submitted state, so it waits for a completion that never happens. Does this patch fix that as well?

@pditommaso (Member) commented Apr 23, 2022

Yes, I think it's related. Well spotted @Lehmann-Fabian 👍

@pditommaso merged commit 05ea0c8 into nextflow-io:master on Apr 23, 2022
@pditommaso added this to the 22.04.0 milestone on Apr 23, 2022
@xhejtman (Contributor)

> Yes, I think it's related. Well spotted @Lehmann-Fabian 👍

Yes, it is fixed by this commit.

@Lehmann-Fabian (Contributor, Author)

> > Yes, I think it's related. Well spotted @Lehmann-Fabian 👍
>
> Yes, it is fixed by this commit.

Happy to hear 🙂

@xhejtman (Contributor)

I think this patch is a double-edged sword. I just ran a workflow on a new cluster (i.e., with an empty Docker image cache), and one node was banned from Docker Hub for making too many pulls per second, which is fair; however, the Job/Pod eventually succeeded, yet Nextflow failed the whole run because there was a moment when the image pull back-off was actually happening.

@pditommaso any ideas what to do here? Note: my Nextflow version is a bit old (22.06), so I'm not sure whether there are relevant fixes in later versions.

@pditommaso (Member)

I'm not sure the problem you are reporting is related to this change:

https://github.com/nextflow-io/nextflow/pull/2787/files
