
kubelet: respect exec probe timeouts #94115

Merged (8 commits, Nov 10, 2020)

Conversation

@andrewsykim (Member) commented Aug 20, 2020

Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>

What type of PR is this?

/kind bug

What this PR does / why we need it:
This PR fixes exec timeout issues for both dockershim and containerd.

When using the kubelet dockershim, the exec timeout is not respected, which means the timeouts for exec readiness/liveness probes are not respected either. This PR ensures that the timeout passed into RunInContainer for dockershim is respected. For readiness/liveness probes, the timeout value passed into RunInContainer is derived from the probe's timeout.

For containerd, the prober ignores timeout errors from the remote runtime's ExecSync, since it expects a utilexec.ExitCodeError for any failed probe; any other error is simply ignored by the prober. This PR updates ExecSync to return a utilexec.ExitCodeError when the gRPC error from the CRI is DeadlineExceeded.

Which issue(s) this PR fixes:

Fixes #94080

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

ACTION REQUIRED: a bug was fixed in kubelet where exec probe timeouts were not respected. Ensure that pods relying on this behavior are updated to correctly handle probe timeouts.

This change in behavior may be unexpected for some clusters and can be disabled by turning off the ExecProbeTimeout feature gate. This gate will be locked and removed in future releases so that exec probe timeouts are always respected.
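For operators who need the old behavior temporarily, the gate can be disabled via the kubelet's feature-gates flag (shown here as a command-line sketch; the same key works in the kubelet config file):

```shell
# Opt out of enforced exec probe timeouts until workloads are fixed.
# The ExecProbeTimeout gate will be locked to true and removed in a future release.
kubelet --feature-gates=ExecProbeTimeout=false
```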

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added labels kind/bug, size/M, do-not-merge/release-note-label-needed, cncf-cla: yes, needs-sig, needs-priority on Aug 20, 2020
@k8s-ci-robot added labels release-note, area/kubelet, sig/node and removed do-not-merge/release-note-label-needed, needs-sig on Aug 20, 2020
@andrewsykim (Member Author) commented Aug 20, 2020

I was able to reproduce this on a GKE cluster.

Deployed a pod whose exec probe takes 10s to pass, with timeoutSeconds set to 1. On nodes running the existing kubelet, the probe passes since the timeout is not respected.

I manually updated one of the kubelets with this change; the timeout is now respected and the readiness probe fails.

$ kubectl get po -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP         NODE                                       NOMINATED NODE   READINESS GATES
slow-deployment-6989cc5777-bc92t   1/1     Running   0          4h26m   10.4.2.5   gke-cluster-1-default-pool-9c83cf52-q55l   <none>           <none>
slow-deployment-6989cc5777-cvhhz   1/1     Running   0          4h26m   10.4.0.7   gke-cluster-1-default-pool-9c83cf52-nchv   <none>           <none>
slow-deployment-6989cc5777-f4vhj   0/1     Running   0          4h26m   10.4.1.6   gke-cluster-1-default-pool-9c83cf52-n6x4   <none>           <none>

Deployment:

$ cat slow.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-deployment
  labels:
    app: slow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: slow
  template:
    metadata:
      labels:
        app: slow
    spec:
      containers:
      - name: slow
        image: jugosag/slow:1.0.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        env:
        - name: healthCheckSlowness
          value: "10"
        readinessProbe:
          exec:
            command:
            - curl
            - -f
            - localhost:8080/slow?timeInSecs=10
          initialDelaySeconds: 2
          periodSeconds: 1
          timeoutSeconds: 1

Thanks @jgoeres for the test deployment.

count++
if count == 5 {
	klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
	break
}
andrewsykim (Member Author)

I think this count check can be removed since we now check the timeout, but I'll defer to the kubelet maintainers since I don't have context on why it was originally added.

Member

Looks like earlier we just had an infinite loop until InspectExec could tell us that the process we started had exited... the count was added to break out of it in 7748a02.

andrewsykim (Member Author)

Never mind - I think we need to keep this, since it's expected behavior for the kubelet to exit at some point with a nil error and allow the exec'd command to continue running.

andrewsykim (Member Author)

That or there should be some default timeout here, but don't want to introduce new changes here.

@dims (Member) commented Aug 20, 2020

/assign @dashpole @sjenning

@dims (Member) commented Aug 20, 2020

@andrewsykim also additional context, see #50176

@andrewsykim force-pushed the fix-dockershim-exec branch 2 times, most recently from 2bb6cf2 to 330fc4f on August 20, 2020 01:54
@SergeyKanzhelev (Member)

Is this an alternative to PR #58925?

@andrewsykim (Member Author)

/retest

@andrewsykim (Member Author)

The e2e failures look legitimate; I'll dig into them.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 20, 2020
@@ -110,28 +110,32 @@ func (*NativeExecHandler) ExecInContainer(client libdocker.Interface, container
}

ticker := time.NewTicker(2 * time.Second)
execTimeout := time.After(timeout)
andrewsykim (Member Author)

need to check for 0 timeout, since exec outside of probes will have 0 timeout

@andrewsykim andrewsykim force-pushed the fix-dockershim-exec branch 2 times, most recently from 8805bd1 to 7ee5e07 Compare August 20, 2020 16:03
@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Nov 15, 2020
Comment on lines +148 to +156

// Only limit the amount of InspectExec calls if the exec timeout was not set.
// When a timeout is not set, we stop polling the exec session after 5 attempts and allow the process to continue running.
if execTimeout == nil {
	count++
	if count == 5 {
		klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
		return nil
	}
}
@dionysius commented Apr 19, 2021

Doesn't this mean that any probe without a timeout that takes longer than 10 seconds will automatically be treated as successful, regardless of the command's actual exit status?

Member

There have been a number of follow-ups to this PR since it landed so I suggest you take a look at what's committed on the master branch.

@dionysius

Ignore my question - it was an interpretation error on my side. On master (and probably on this commit as well) client.StartExec() is blocking, so this code block runs after the command has finished in some form. The 10 seconds is only a maximum for gathering the exit info; the code on master is slightly different but has the same effect.

@PrabhuMathi commented May 4, 2021

Facing the same issue on 1.19.7 (AKS) with containerd (1.5.0-beta) as the CRI. Could you please advise which containerd or Kubernetes patch version fixes this?

The pod is stuck in Running with 0/1 ready:

readinessProbe:
  exec:
    command:
    - sleep
    - "15"
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2

Readiness probe errored: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 2s exceeded: context deadline exceeded

@matthyx (Contributor) commented May 4, 2021

@PrabhuMathi I don't think this commit has been cherry-picked in 1.19 - you'll have to use 1.20 or above

@PrabhuMathi

> @PrabhuMathi I don't think this commit has been cherry-picked in 1.19 - you'll have to use 1.20 or above

@matthyx thanks for the update. I'll set the timeout in the command itself as mentioned and proceed. I also started facing readiness probe failures after migrating to containerd - any reference on this, please?

@SergeyKanzhelev (Member)

> Also started facing readiness probe failure after migrating to Containerd any ref on this please?

As far as I remember, with containerd before 1.19 the probe will time out, but the timeout will not be treated as an error. With Docker there will be no timeout at all. Is that what you are experiencing?

@PrabhuMathi

> As far as I remember, with Containerd before 1.19, probe will timeout, but it will not be treated as an error. With Docker there will be no timeout. Is it what you experience?

Yes, that's true - with Docker the same pod runs fine, but when I move it to a containerd node I hit this issue. The actual problem I'm facing is not the timeout itself, but that the readiness probe fails with the command (sleep 15).

@matthyx (Contributor) commented May 7, 2021

@PrabhuMathi can you share your podspec?

stuartpb added a commit to stuartpb/bitnami-charts that referenced this pull request Sep 16, 2021
The timeout wrapper in health checks was added in helm/charts#11355
to work around Docker/containerd not respecting timeouts in probes
(cf. kubernetes/kubernetes#58925). The upstream issue has been fixed
since Kubernetes 1.20 (kubernetes/kubernetes#94115), and this wrapper
causes degraded behavior (i.e. any failure in the wrapped command only
gets reported as "The monitored command dumped core", without details
for the specific failure), so the original behavior should be restored.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Successfully merging this pull request may close these issues.

exec-type liveness or readiness probes ignore timeout