New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Respect timeout in ExecInContainer #87281
Conversation
This PR is based on @rhcarvalho 's former PR. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tedyu The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-kubernetes-e2e-gce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level, I'm wondering if its truly possible for us to "respect" the timeout in ExecInContainer
, when docker
doesn't actually support killing the process? In other words, we can "respect" the timeout, by failing the readiness probe and trying it again, BUT that raises the risk of us attempting to run the process while previous attempts are still running.
Might we instead have to wait on this fix until Docker supports killing processes execing in a container?
@@ -58,6 +59,10 @@ func (d *dockerExitError) ExitStatus() int { | |||
// NativeExecHandler executes commands in Docker containers using Docker's exec API. | |||
type NativeExecHandler struct{} | |||
|
|||
// ExecInContainer executes a command in a Docker container. It may leave a | |||
// goroutine running the process in the container after the function returns | |||
// because of a timeout. However, the goroutine does not leak, it terminates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this ExecInContainer process never exits, we would leak goroutines, correct?
// Docker API doesn't support it. See | ||
// https://github.com/docker/docker/issues/9098. | ||
// For liveness probes this is probably okay, since the | ||
// container will be restarted. For readiness probes it means |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried about this behavior... specifically in the following situation:
Imagine we have a readiness probe configured to run every minute. The timeout is set to 5s, but the actual command we are executing never terminates.
Because we can't kill the process in the container, I believe we're going to create more and more processes in the container (and more and more goroutines), which seems dangerous to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was where the previous PR got stalled.
I left a comment on the linked moby issue - surki updated his branch in Dec 2019.
Also reviewed surki's commit. Hopefully that issue can get some traction.
For the readiness probe you mentioned, maybe we can adopt a threshold for the timeout so that we don't create many processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dims thoughts?
I'm curious... is the lack of timeout in the ExecInContainer
causing issues for users in production? I'm just wondering if doing nothing until this is fixed in moby
is a feasible option...
@tedyu: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
This PR solves the same problem as the PR that is currently being actively discussed #94115. The agreement from SIG node meeting today was that the code change is only a small part of the fix and we need to actively look at side effects this change may bring to the production payloads. Please join the discussion at the PR: #94115. /close |
@tedyu: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@SergeyKanzhelev: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
As #26895 stated, timeoutSeconds is ignored for probe.
This PR adds code to respect the timeout parameter passed to ExecInContainer.
Which issue(s) this PR fixes:
Fixes #26895
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: