kubelet: respect exec probe timeouts #94115
I was able to reproduce this on a GKE cluster. I deployed a pod where the exec probe takes 10s to pass and the timeout is set to 1s. On nodes with the existing kubelet, the probe passes since the timeout is not respected. I manually updated one of the kubelets with this change, and now the timeout is respected and the readiness probe fails.
Deployment:
Thanks @jgoeres for the test deployment.
pkg/kubelet/dockershim/exec.go
Outdated
	klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
	break
	count++
	if count == 5 {
I think this count check should be removed since we're checking the timeout now, but I'll defer to kubelet maintainers since I don't have context on why it was added originally.
looks like earlier we just had an infinite loop until InspectExec can tell us that the process we started exited... the count was added to break out of it in 7748a02
Nvm, I think we need to keep this since it's expected behavior for kubelet to exit at some point with a nil error and allow the exec'd command to continue running.
That or there should be some default timeout here, but don't want to introduce new changes here.
@andrewsykim also additional context, see #50176
2bb6cf2 to 330fc4f
is it an alternative to this PR: #58925?
/retest
e2e failures look legit, will dig into it /hold
pkg/kubelet/dockershim/exec.go
Outdated
@@ -110,28 +110,32 @@ func (*NativeExecHandler) ExecInContainer(client libdocker.Interface, container
	}

	ticker := time.NewTicker(2 * time.Second)
	execTimeout := time.After(timeout)
need to check for 0 timeout, since exec outside of probes will have 0 timeout
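One way to handle the zero case, as a minimal sketch rather than the code that landed: leave the timeout channel nil when no timeout is requested. A receive from a nil channel blocks forever inside a select, whereas time.After(0) fires almost immediately, which would make every exec without a timeout look timed out. The helper names newExecTimeout and timedOut are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// newExecTimeout is a hypothetical helper: it returns a channel that fires
// after timeout, or nil when no timeout was requested (timeout <= 0).
func newExecTimeout(timeout time.Duration) <-chan time.Time {
	if timeout <= 0 {
		return nil
	}
	return time.After(timeout)
}

// timedOut reports whether execTimeout fires before wait elapses.
// A nil execTimeout never fires, so the wait branch always wins.
func timedOut(execTimeout <-chan time.Time, wait time.Duration) bool {
	select {
	case <-execTimeout:
		return true
	case <-time.After(wait):
		return false
	}
}

func main() {
	wait := 50 * time.Millisecond
	fmt.Println("no timeout set:", timedOut(newExecTimeout(0), wait))
	fmt.Println("1ms timeout:   ", timedOut(newExecTimeout(time.Millisecond), wait))
	// What the hunk above would do for a zero timeout without the check.
	fmt.Println("time.After(0): ", timedOut(time.After(0), wait))
}
```

The nil channel keeps the select statement unchanged for both cases, so the polling loop does not need a separate code path for exec calls made outside of probes.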
8805bd1 to 7ee5e07
	// Only limit the amount of InspectExec calls if the exec timeout was not set.
	// When a timeout is not set, we stop polling the exec session after 5 attempts and allow the process to continue running.
	if execTimeout == nil {
		count++
		if count == 5 {
			klog.Errorf("Exec session %s in container %s terminated but process still running!", execObj.ID, container.ID)
			return nil
		}
	}
Doesn't this mean that any probe, without timeout, taking longer than 10 seconds will automatically be no error, independent of what the actual exit of the command is?
There have been a number of follow-ups to this PR since it landed so I suggest you take a look at what's committed on the master branch.
Ignore my question - interpretation error from my side, on master (and probably on this commit as well) client.StartExec() is blocking and thus this code block happens after the command has finished in some form. It is 10 seconds maximum for gathering the exit info, although the code on master is slightly different with the same effect.
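To make the bounded polling concrete, here is a rough stand-alone sketch of the loop shape discussed in this thread. It is not the dockershim code: pollExec, its parameters, and errExecTimeout are hypothetical names, and the running callback stands in for InspectExec. In this sketch five inspections are separated by four ticker intervals, so with a 2-second ticker the wait is bounded at about 8 seconds, in the same ballpark as the 10 seconds mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errExecTimeout is a hypothetical error used only in this sketch.
var errExecTimeout = errors.New("exec timed out")

// pollExec polls running() once per interval until the process exits.
// With a positive timeout it returns errExecTimeout once the timeout fires.
// With no timeout it gives up after maxAttempts inspections and returns nil,
// leaving the process running. It returns the number of counted attempts.
func pollExec(running func() bool, interval, timeout time.Duration, maxAttempts int) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	// A nil channel never fires, so a zero timeout disables the timeout case.
	var execTimeout <-chan time.Time
	if timeout > 0 {
		execTimeout = time.After(timeout)
	}

	attempts := 0
	for {
		if !running() {
			return attempts, nil
		}
		// Only limit the number of inspections when no timeout was set.
		if execTimeout == nil {
			attempts++
			if attempts == maxAttempts {
				return attempts, nil
			}
		}
		select {
		case <-execTimeout:
			return attempts, errExecTimeout
		case <-ticker.C:
		}
	}
}

func main() {
	neverExits := func() bool { return true }

	attempts, err := pollExec(neverExits, 10*time.Millisecond, 0, 5)
	fmt.Println("no timeout:", attempts, err)

	_, err = pollExec(neverExits, 10*time.Millisecond, 30*time.Millisecond, 5)
	fmt.Println("with timeout:", err)
}
```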
Facing the same issue on 1.19.7 (AKS) with containerd (1.5.0-beta) as the CRI. Could you please advise which containerd version or Kubernetes patch release fixed this? The pod is stuck in Running 0/1 status.
@PrabhuMathi I don't think this commit has been cherry-picked to 1.19 - you'll have to use 1.20 or above
@matthyx thanks for the update. Let me set the timeout in the command itself as mentioned and proceed. I also started seeing readiness probe failures after migrating to containerd; any reference on this, please?
As far as I remember, with containerd before 1.19, the probe will time out, but it will not be treated as an error. With Docker there will be no timeout. Is that what you're seeing?
Yes, true: on Docker the same pod runs fine, but when I move it to a containerd node I hit this issue. The actual problem I'm facing is not about the timeout, though; the readiness probe is failing with the command (sleep 15)
@PrabhuMathi can you share your podspec?
The timeout wrapper in health checks was added in helm/charts#11355 to work around Docker/containerd not respecting timeouts in probes (cf. kubernetes/kubernetes#58925). The upstream issue has been fixed since Kubernetes 1.20 (kubernetes/kubernetes#94115), and this wrapper causes degraded behavior (i.e., any failure in the wrapped command only gets reported as "The monitored command dumped core", without details for the specific failure), so the original behavior should be restored.
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes exec timeout issues for both dockershim and containerd.
When using kubelet dockershim, the exec timeout is not respected, so the timeouts for exec readiness/liveness probes are not respected either. This PR ensures the timeout passed into RunInContainer for dockershim is respected. For readiness/liveness probes, the timeout value passed into RunInContainer is derived from the probe timeout.
For containerd, the prober ignores timeout errors from the remote runtime's ExecSync since it expects utilexec.ExitCodeError for any failed probe. Any other error ends up being ignored by the prober. This PR updates ExecSync to return utilexec.ExitCodeError when the gRPC error from CRI is DeadlineExceeded.
Which issue(s) this PR fixes:
Fixes #94080
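The containerd part of the change described above can be illustrated with a small stdlib-only sketch. It is not the kubelet code: exitCodeError is a hypothetical stand-in for utilexec.ExitCodeError, context.DeadlineExceeded stands in for the gRPC DeadlineExceeded status, the exit code 1 is an arbitrary choice for the example, and execSync, probeResult, slowCommand, and errTransport are made-up names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// exitCodeError is a hypothetical stand-in for utilexec.ExitCodeError.
type exitCodeError struct {
	code int
	err  error
}

func (e *exitCodeError) Error() string   { return fmt.Sprintf("exit code %d: %v", e.code, e.err) }
func (e *exitCodeError) ExitStatus() int { return e.code }

// errTransport represents any error that is not an exit-code error.
var errTransport = errors.New("transport failure")

// slowCommand simulates a probe command that takes 2 seconds unless cancelled.
func slowCommand(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(2 * time.Second):
		return nil
	}
}

// execSync runs the command under a deadline and converts a deadline error
// into an exit-code error so that the caller sees it as a failed command.
func execSync(run func(context.Context) error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := run(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return &exitCodeError{code: 1, err: err} // exit code chosen arbitrarily
	}
	return err
}

// probeResult mimics a prober that only counts exit-code errors as failures.
func probeResult(err error) string {
	if err == nil {
		return "success"
	}
	var exitErr *exitCodeError
	if errors.As(err, &exitErr) {
		return "failure"
	}
	return "ignored"
}

func main() {
	fmt.Println("timed-out probe:", probeResult(execSync(slowCommand, 10*time.Millisecond)))
	fmt.Println("other error:    ", probeResult(errTransport))
}
```

Without the conversion in execSync, the deadline error would fall into the "ignored" branch, which matches the behavior this PR sets out to fix.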
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: