-
Notifications
You must be signed in to change notification settings - Fork 38.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Exec probes respect timeout #27956
Conversation
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
3 similar comments
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
defer time.AfterFunc(timeout, func() { | ||
// FIXME: we should kill the process in the container, | ||
// but I couldn't find anything in the Docker API docs | ||
// on how to do it with the *Exec APIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unfortunate, but on the other hand it seems that there's an implicit timeout of 10s here (ticket frequency + count till 5 + break).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check, wondering if the call to client.StartExec
is blocking or not until the command terminates, given that we set Detach: false
.
If it blocks till the command terminates, then I don't understand the for
loop below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StartExec is blocking. It didn't used to be. This code probably wasn't updated when we switched it to blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's impossible to use the Docker API to stop or kill an Exec session, unfortunately. And they have stated they don't intend to add support to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StartExec is blocking. It didn't used to be. This code probably wasn't updated when we switched it to blocking.
IIUC that means that the for
loop below only ever runs once and count++
is never reached?
I could clean that up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discovered it to be blocking in my terminal resizing PR. It didn't used to be. It's blocking because of
return d.holdHijackedConnection(sopts.RawTerminal || opts.Tty, sopts.InputStream, sopts.OutputStream, sopts.ErrorStream, resp) |
I assume that the for loop only ever runs once.
cc @Random-Liu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ncdc Can you post a link to the docker issue or PR around killing exec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhcarvalho - It's awesome to see this PR, let's use this one and close mine. I'll review this in a little bit |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
@vishh in #26899 (comment) you asked for a new test case under Do you know if there are tests for the probes somewhere else? Thanks! |
The closest things I could find were these, but they are not testing that the timeout interrupted the probe: |
Found 2 tests in Seems that it's the place where we could add the timeout tests! However, I doubt that a single test case there will cover the 3 implementations ( |
}, | ||
}, | ||
}, | ||
}, 1, defaultObservationTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how many restarts we are to observe here, certainly at least 1. We may need to fine tune the numbers to get exactly 1 behavior. Open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly n counting is hard to do in tests. I'd rather see at least n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Need to look again, I think the test helper has exact n semantic.
The tests haven't run yet, we need a "ok to test" comment, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at the helper, it's indeed at least n semantics.
test/e2e/pods.go#L104-L110
I think checking for at least 1 restart (in the period of 2 minutes, as per defaultObservationTimeout
) should be enough. @ncdc @vishh do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WFM
@ncdc, since you worked on the Exec probe code base, may I ask your review here? |
I've also got #25273 open to add terminal resizing support for exec & attach, and our 2 PRs are going to conflict. How do you want to proceed - your PR first or mine? |
I wouldn't mind going after, since your PR is much earlier, from May. But from the last comments I'm wondering if it is going to land anytime soon? At a glance I think the conflict will be just the method signature :) |
Yeah, I'm not sure. I need to find someone who can add Windows support. Or maybe just get my PR in without Windows support for starters, and then do follow-up to fix Windows. That's probably more realistic. |
Adapting the probe tests in |
It's already there -- https://github.com/kubernetes/kubernetes/pull/27956/files#diff-92d176a1025dcbee0981bb7f16cda942R1078 Though it probably needs an update on the expected number of restarts / change the helper to "at least n" semantics.
Alright! @vishh could you please trigger a test run? |
@@ -90,15 +88,22 @@ func (*NsenterExecHandler) ExecInContainer(client DockerInterface, container *do | |||
if stderr != nil { | |||
command.Stderr = stderr | |||
} | |||
|
|||
return command.Run() | |||
if err := command.Start(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the Start
be executed outside of the if tty {
section? I don't see the command being started in the case of tty=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cmd.Run
is the same as Cmd.Start
+ Cmd.Wait
.
What we're doing here is a refactoring, replacing Run
with Start
+ Wait
(see the last line of this method) so that we can write the timeout logic only once, instead of twice, in each branch of the if statement.
@vishh thanks for the review. Sorry it took me a few days to get back, I'm in the middle of a holiday here :-) Would you mind marking this for running the tests? It's very likely we need to update the "number of restarts" we expect in the test case, would be helpful to see them run. |
a1d2aed
to
09a4d30
Compare
Rebased and updated comment to include reference to Docker issue, thanks @ncdc! |
09a4d30
to
a00e0d4
Compare
Docker 1.10.3 showed the same behavior. Next I tried 1.12.1 from Fedora rawhide. $ docker version
Client:
Version: 1.12.1
API version: 1.24
Package version: docker-1.12.1-24.git9a3752d.fc26.x86_64
Go version: go1.7.1
Git commit: 9a3752d/1.12.1
Built:
OS/Arch: linux/amd64
Server:
Version: 1.12.1
API version: 1.24
Package version: docker-1.12.1-24.git9a3752d.fc26.x86_64
Go version: go1.7.1
Git commit: 9a3752d/1.12.1
Built:
OS/Arch: linux/amd64 Still closing the connection of the exec Docker API call doesn't kill the process being executed. Here's the code I'm using, in case I'm doing something obviously wrong: |
e9e3a6f
to
88a5cff
Compare
Rebased. |
#33301 touched exec code. Quoting a TODO from that PR, I would rather have a timeout as part of the exec signature before "exec is properly defined in CRI." @ncdc @sttts how about we support timeout only for rkt and nsenter, and continue to ignore it with nativeclient (and I would gladly document that it is not supported, given a pointer to where the docs live)? |
Does the CRI proposal incorporate that notion? CRI will be the future. |
This allows us to interrupt/kill the executed command if it exceeds the timeout.
Even though we cannot kill the process, we can return early and return an error. For liveness probes, this should trigger a container restart.
HTTPGet and TCPSocket probes respect the timeout, while Exec probes used to ignore it.
88a5cff
to
f18349a
Compare
Jenkins GCI Kubemark GCE e2e failed for commit f18349a. Full PR test history. The magic incantation to run this job again is |
Jenkins unit/integration failed for commit f18349a. Full PR test history. The magic incantation to run this job again is |
@rhcarvalho PR needs rebase |
FTR, discussed with @sttts and @ncdc on IRC on the 23rd, and decided to open a new PR with the "non-controversial parts", adding the timeout to the function signatures -> #33366. Support for timeouts in nsenter and rkt is implemented here, though to effectively support the most common case, nativeclient, we'd need changes in Docker. To avoid asymmetry among the nsenter, rkt and nativeclient implementations, we left out actually observing the timeouts in all of them, at least initially. Opted for opening a new PR to preserve the history here. |
This PR hasn't been active in 30 days. It will be closed in 59 days (Dec 29, 2016). You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days |
b1fcb90
to
f18349a
Compare
…ument Automatic merge from submit-queue Add timeout argument to ExecInContainer <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, read our contributor guidelines https://github.com/kubernetes/kubernetes/blob/master/CONTRIBUTING.md and developer guide https://github.com/kubernetes/kubernetes/blob/master/docs/devel/development.md 2. If you want *faster* PR reviews, read how: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/faster_reviews.md 3. Follow the instructions for writing a release note: https://github.com/kubernetes/kubernetes/blob/master/docs/devel/pull-requests.md#release-notes --> **What this PR does / why we need it**: This is related to #26895. It brings a timeout to the signature of `ExecInContainer` so that we can take timeouts into account in the future. Unlike my first attempt in #27956, it doesn't immediately observe the timeout, because it is impossible to do it with the current state of the Docker Remote API (the default exec handler implementation). **Special notes for your reviewer**: This shares commits with #27956, but without some of them that have more controversial implications (actually supporting the timeouts). The original PR shall be closed in the current state to preserve the history (instead of dropping commits in that PR). Pinging the original people working on this change: @ncdc @sttts @vishh @dims **Release note**: <!-- Steps to write your release note: 1. Use the release-note-* labels to set the release note state (if you have access) 2. Enter your extended release note in the below block; leaving it blank means using the PR title as the release note. If no release note is required, just write `NONE`. --> ``` release-note NONE ```
Fixes #26895.
As described in the issue,
Exec
probes used to disregard the timeout setting observed byHTTPGet
andTCPSocket
probes.@dims opened #26899 to tackle it, but I think I could contribute with something more than just comments. @dims please if you see value in the changes here feel free to absorb it in your PR.
This is a bigger change than @dims's PR because we need to change interfaces to pass the timeout to the code that actually runs the command, so that the process can be killed after the timeout.
Includes an e2e test.
This change is![Reviewable](https://camo.githubusercontent.com/2d899f4291d07d3cd2fa4aaae1e3b243f164c23fce87d30a589ace0d496a444c/68747470733a2f2f72657669657761626c652e6b756265726e657465732e696f2f7265766965775f627574746f6e2e737667)