
Evict pod and decrease process forks if pod is restarted after exception #2787

Merged: 3 commits merged into nextflow-io:master on Apr 23, 2022

Conversation

@Lehmann-Fabian (Contributor)

In rare cases, the following code

if( waiting.reason == 'ErrImagePull' || waiting.reason == 'ImagePullBackOff' ) {
    def message = "K8s pod image cannot be pulled"
    if( waiting.message ) message += " -- $waiting.message"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}
if( waiting.reason == 'CreateContainerConfigError' ) {
    def message = "K8s pod configuration failed"
    if( waiting.message ) message += " -- $waiting.message"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}
if( waiting.reason =~ /.+Error$/ ) {
    def message = "K8s pod waiting on unknown error state"
    if( waiting.message ) message += " -- $waiting.message"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}
final status = resp.status as Map
if( status?.phase == 'Failed' ) {
    def message = "K8s pod in Failed state"
    final cause = new K8sResponseException(resp)
    throw new PodUnschedulableException(message, cause)
}

throws an exception, which is then handled by the TaskPollingMonitor.

In those cases, the failed pod is recognized as failed over and over again, and another new instance is started each time (often maxRetries times).
On the maxRetries + 1 attempt, Nextflow fails the workflow, as all retries are exhausted.
The problem is that the failed pod is never evicted, so the exception handling happens again and again.

I can only rarely reproduce the error.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian changed the title from "Evict pod and decrease process forks if exception is handled" to "Evict pod and decrease process forks if pod is restarted after exception" on Apr 14, 2022
@bentsherman self-requested a review on April 14, 2022 at 13:27
@bentsherman (Member) commented Apr 14, 2022

I have not encountered this issue before. Last year we changed PodUnschedulableException to extend ProcessException so that it is routed through the normal error strategy instead of causing workflow termination, and I was able to handle these errors with retries just fine. So I am hesitant to merge this fix without reproducing the error and confirming that this change fixes it.
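For context, the class relationship described above looks roughly like the following. This is a minimal sketch based on that statement, not the exact Nextflow source; the RuntimeException base class is an assumption made only for the sketch, while the (message, cause) constructor shape mirrors the calls quoted earlier.

// Minimal sketch, not the exact Nextflow source.
// Because PodUnschedulableException is a ProcessException, it is routed through
// the process errorStrategy (e.g. 'retry') instead of terminating the whole run.
class ProcessException extends RuntimeException {            // base class assumed for this sketch
    ProcessException( String message, Throwable cause ) { super(message, cause) }
}

class PodUnschedulableException extends ProcessException {
    PodUnschedulableException( String message, Throwable cause ) { super(message, cause) }
}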

These pod failures should be handled properly by the task handler:

boolean checkIfCompleted() {
    if( !podName ) throw new IllegalStateException("Missing K8s pod name - cannot check if complete")
    def state = getState()
    if( state && state.terminated ) {
        if( state.nodeTermination instanceof NodeTerminationException ) {
            // keep track of the node termination error
            task.error = (NodeTerminationException) state.nodeTermination
            // mark the task as ABORTED since the failure is caused by a node failure
            task.aborted = true
        }
        else {
            // finalize the task
            task.exitStatus = readExitFile()
            task.stdout = outputFile
            task.stderr = errorFile
        }
        status = TaskStatus.COMPLETED
        savePodLogOnError(task)
        deletePodIfSuccessful(task)
        updateTimestamps(state.terminated as Map)
        return true
    }
    return false
}

which is called periodically by the polling monitor:

if( handler.checkIfCompleted() ) {
    log.debug "Task completed > $handler"
    // decrement forks count
    handler.decProcessForks()
    // since completed *remove* the task from the processing queue
    evict(handler)
    // finalize the tasks execution
    final fault = handler.task.processor.finalizeTask(handler.task)
    // notify task completion
    session.notifyTaskComplete(handler)
    // abort the execution in case of task failure
    if (fault instanceof TaskFault) {
        session.fault(fault, handler)
    }
}

I also have a sneaking suspicion that this issue might be resolved simply by using jobs instead of pods (#2751).
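For reference, here is what "using jobs instead of pods" looks like from the user side, assuming the k8s.computeResourceType configuration option that later Nextflow releases expose for this purpose; treat the option name as an assumption here, since it is not part of this PR or of #2751 as discussed in this thread.

// nextflow.config: hypothetical sketch, assuming the computeResourceType option.
k8s {
    computeResourceType = 'Job'   // run tasks as K8s Jobs rather than bare Pods
}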

@Lehmann-Fabian (Contributor, Author)

OK, I found a way to reproduce the error: I force it by using a non-existent image.

process a {

    container 'notavalidimage'
    errorStrategy 'retry' 
    maxRetries 3

    """
    echo "Hello World"
    """

}

workflow {
    a()
}
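For completeness, a reproduction like this is typically pointed at a Kubernetes cluster with a configuration along the following lines. This is a minimal sketch: the claim name is a placeholder rather than anything taken from this thread, and the mount path simply mirrors the /pvcdata work directories visible in the logs below.

// nextflow.config: minimal sketch for running the example on Kubernetes.
// 'my-shared-pvc' is a placeholder; '/pvcdata' mirrors the work dirs in the logs.
process.executor = 'k8s'

k8s {
    storageClaimName = 'my-shared-pvc'   // ReadWriteMany PVC shared by the task pods
    storageMountPath = '/pvcdata'        // where the claim is mounted inside each pod
}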

The official Nextflow version produces the following output:

N E X T F L O W  ~  version 22.03.1-edge
Launching `/pvcdata/nextflowPodNotEvicted.nf` [distracted-dalembert] DSL2 - revision: 678c7f2318
[a3/9faf08] Submitted process > a
[a3/9faf08] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (1)
[f4/6e9b10] Re-submitted process > a
[a3/9faf08] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (2)
[f4/6e9b10] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (2)
[01/2008a8] Re-submitted process > a
[2c/6e38aa] Re-submitted process > a
[a3/9faf08] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[f4/6e9b10] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[9c/c3b88b] Re-submitted process > a
[01/2008a8] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[48/915341] Re-submitted process > a
[2c/6e38aa] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[43/743081] Re-submitted process > a
[46/a22ed4] Re-submitted process > a
Error executing process > 'a'

Caused by:
  K8s pod image cannot be pulled -- Back-off pulling image "notavalidimage"

Command executed:

  echo "Hello World"

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /pvcdata/work/a3/9faf082af36b6f63a6b08dad9f5671

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`


WARN: Killing running tasks (8)
Pod running: distracted-dalembert ... waiting for pod to stop running
Pod distracted-dalembert has changed from running state [terminated:[exitCode:1, reason:Error, startedAt:2022-04-19T09:03:13Z, finishedAt:2022-04-19T09:03:38Z, containerID:containerd://38d8f0363f4a02678f52787c1b5c7199aafbe7a789312e9b51df4294a84ca24d]]

You can see that the initial task (a3/9faf08) is resubmitted multiple times. This happens because the failed task is never evicted.
I don't think the problem is limited to Kubernetes; it affects all executors that use the TaskPollingMonitor and don't handle these exceptions themselves.
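The idea behind the fix, roughly: when the exception is handled and the task is resubmitted, the stale handler should also be removed from the polling queue and its fork slot released, mirroring the completed-task path quoted above. The sketch below is hypothetical and not the actual diff of this PR; handleTaskFailure and processor.handleException are placeholder names, while decProcessForks(), evict(), TaskFault and session.fault() are the existing pieces shown earlier in this thread.

// Hypothetical sketch of a failure path inside the TaskPollingMonitor
// (placeholder names; not the actual change set of this PR).
protected void handleTaskFailure( TaskHandler handler, Throwable error ) {
    // let the task processor apply the configured errorStrategy (e.g. 'retry');
    // 'handleException' stands in for that step
    final fault = handler.task.processor.handleException(error, handler.task)
    // release the fork slot held by the failed execution
    handler.decProcessForks()
    // remove the stale handler from the polling queue so it is not checked again
    evict(handler)
    // abort the run if the error strategy decided the failure is fatal
    if( fault instanceof TaskFault )
        session.fault(fault, handler)
}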

Using Nextflow with my fix, the execution looks like this:

N E X T F L O W  ~  version 22.03.1-edge
Launching `/pvcdata/nextflowPodNotEvicted.nf` [soggy-majorana] DSL2 - revision: 678c7f2318
[79/0ff663] Submitted process > a
[79/0ff663] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (1)
[b1/054571] Re-submitted process > a
[b1/054571] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (2)
[fa/98f20c] Re-submitted process > a
[fa/98f20c] NOTE: K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed -- Execution is retried (3)
[03/242984] Re-submitted process > a
Error executing process > 'a'

Caused by:
  K8s pod image cannot be pulled -- rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/notavalidimage:latest": failed to resolve reference "docker.io/library/notavalidimage:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

Command executed:

  echo "Hello World"

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /pvcdata/work/03/2429849d206aebb3b77d99fc0fde47

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

@bentsherman (Member)

I see now. I was getting confused between k8s pod eviction and eviction in the task polling monitor. I will have to defer to @pditommaso because this change affects all executors, not just k8s.

@xhejtman (Contributor)

By the way, there is an issue in the task processor (I'm not sure exactly where): if a task fails with this error (ImagePullBackOff), the whole pipeline never stops. This is because the task processor sees the failed task in the submitted state, so it waits for a completion that never happens. Does this patch fix that as well?

@pditommaso (Member) commented Apr 23, 2022

Yes, I think it's related. Well spotted @Lehmann-Fabian 👍

@pditommaso merged commit 05ea0c8 into nextflow-io:master on Apr 23, 2022
@pditommaso added this to the 22.04.0 milestone on Apr 23, 2022
@xhejtman (Contributor)

> Yes, I think it's related. Well spotted @Lehmann-Fabian 👍

Yes, it is fixed by this commit.

@Lehmann-Fabian (Contributor, Author)

> > Yes, I think it's related. Well spotted @Lehmann-Fabian 👍
>
> Yes, it is fixed by this commit.

Happy to hear 🙂

@xhejtman (Contributor)

I think this patch is a double-edged sword. I just ran a workflow on a new cluster (i.e., with an empty Docker image cache), and one node was banned from Docker Hub for making too many pulls per second, which is fair; however, the Job/Pod eventually succeeded, yet Nextflow failed the whole run because there was a moment when the image pull back-off was actually happening.

@pditommaso any ideas what to do here? Note: my Nextflow version is a bit old (22.06), so I'm not sure whether there are relevant fixes in later versions.

@pditommaso (Member)

I'm not sure the problem you are reporting is related to this change:

https://github.com/nextflow-io/nextflow/pull/2787/files
