Make Exec probes respect timeout #27956

Closed

rhcarvalho
Contributor

@rhcarvalho rhcarvalho commented Jun 23, 2016

Fixes #26895.

As described in the issue, Exec probes used to disregard the timeout setting observed by HTTPGet and TCPSocket probes.

@dims opened #26899 to tackle it, but I think I could contribute something more than just comments. @dims, if you see value in the changes here, please feel free to absorb them into your PR.

This is a bigger change than @dims's PR because we need to change interfaces to pass the timeout to the code that actually runs the command, so that the process can be killed after the timeout.

Includes an e2e test.



@k8s-bot

k8s-bot commented Jun 23, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.


defer time.AfterFunc(timeout, func() {
	// FIXME: we should kill the process in the container,
	// but I couldn't find anything in the Docker API docs
	// on how to do it with the *Exec APIs.
Contributor Author

This seems unfortunate, but on the other hand there appears to be an implicit timeout of 10s here (ticker frequency + count till 5 + break).

Contributor Author

I need to check, wondering if the call to client.StartExec is blocking or not until the command terminates, given that we set Detach: false.
If it blocks till the command terminates, then I don't understand the for loop below.

Member

StartExec is blocking. It didn't used to be. This code probably wasn't updated when we switched it to blocking.

Member

It's impossible to use the Docker API to stop or kill an Exec session, unfortunately. And they have stated they don't intend to add support to do so.

Contributor Author

> StartExec is blocking. It didn't used to be. This code probably wasn't updated when we switched it to blocking.

IIUC that means that the for loop below only ever runs once and count++ is never reached?

I could clean that up.

Member

I discovered it to be blocking in my terminal resizing PR. It didn't used to be. It's blocking because of this call:

return d.holdHijackedConnection(sopts.RawTerminal || opts.Tty, sopts.InputStream, sopts.OutputStream, sopts.ErrorStream, resp)

I assume that the for loop only ever runs once.

cc @Random-Liu

Contributor

@ncdc Can you post a link to the docker issue or PR around killing exec?

Member

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jun 23, 2016
@dims
Member

dims commented Jun 23, 2016

@rhcarvalho - It's awesome to see this PR, let's use this one and close mine. I'll review this in a little bit

@rhcarvalho
Contributor Author

rhcarvalho commented Jun 24, 2016

@vishh in #26899 (comment) you asked for a new test case under test/e2e_node/. I searched for "timeout" in that directory, but didn't find any tests related to testing the timeouts for the other probe types...

Do you know if there are tests for the probes somewhere else? Thanks!

@rhcarvalho
Contributor Author

The closest things I could find were these, but they are not testing that the timeout interrupted the probe:

@rhcarvalho
Contributor Author

Found 2 tests in test/e2e/pods.go that test restarts triggered by the liveness probe:
test/e2e/pods.go#L1024-L1076

Seems that it's the place where we could add the timeout tests!

However, I doubt that a single test case there will cover the 3 implementations (docker.NsenterExecHandler, docker.NativeExecHandler, rkt.Runtime). @vishh do you have a suggestion?

},
},
},
}, 1, defaultObservationTimeout)
Contributor Author

I'm not sure how many restarts we should observe here; certainly at least 1. We may need to fine-tune the numbers to get "exactly 1" behavior. Open to suggestions.

Member

Exactly n counting is hard to do in tests. I'd rather see at least n

Contributor Author

Agreed. I need to look again; I think the test helper has "exactly n" semantics.
The tests haven't run yet; we need an "ok to test" comment, I think.

Contributor Author

I looked at the helper; it does indeed have "at least n" semantics.
test/e2e/pods.go#L104-L110

I think checking for at least 1 restart (in the period of 2 minutes, as per defaultObservationTimeout) should be enough. @ncdc @vishh do you agree?

Member

WFM

@rhcarvalho
Contributor Author

@ncdc, since you worked on the Exec probe code base, may I ask your review here?

@ncdc
Member

ncdc commented Jun 27, 2016

I've also got #25273 open to add terminal resizing support for exec & attach, and our 2 PRs are going to conflict. How do you want to proceed - your PR first or mine?

@rhcarvalho
Contributor Author

> I've also got #25273 open to add terminal resizing support for exec & attach, and our 2 PRs are going to conflict. How do you want to proceed - your PR first or mine?

I wouldn't mind going after, since your PR is much earlier, from May. But from the last comments I'm wondering if it is going to land anytime soon?

At a glance I think the conflict will be just the method signature :)

@ncdc
Member

ncdc commented Jun 27, 2016

> I wouldn't mind going after, since your PR is much earlier, from May. But from the last comments I'm wondering if it is going to land anytime soon?

Yeah, I'm not sure. I need to find someone who can add Windows support. Or maybe just get my PR in without Windows support for starters, and then do follow-up to fix Windows. That's probably more realistic.

@vishh
Contributor

vishh commented Jun 27, 2016

@rhcarvalho

Adapting the probe tests in test/e2e/pods.go sgtm! You can clone one of those tests into test/e2e_node/ directory and add timeouts. Don't bother about the various exec implementations for now. Just test against the API.
We are working on improving the framework to handle updating flags which will let us test various drivers for exec.

@rhcarvalho
Contributor Author

> Adapting the probe tests in test/e2e/pods.go sgtm! You can clone one of those tests into test/e2e_node/ directory and add timeouts.

It's already there -- https://github.com/kubernetes/kubernetes/pull/27956/files#diff-92d176a1025dcbee0981bb7f16cda942R1078

Though it probably needs an update to the expected number of restarts, or a change to the helper to give it "at least n" semantics.

> Don't bother about the various exec implementations for now.

Alright!

@vishh could you please trigger a test run?

@@ -90,15 +88,22 @@ func (*NsenterExecHandler) ExecInContainer(client DockerInterface, container *do
 	if stderr != nil {
 		command.Stderr = stderr
 	}

-	return command.Run()
+	if err := command.Start(); err != nil {
Contributor

Should the Start be executed outside of the if tty { section? I don't see the command being started in the case of tty=true

Contributor Author

Cmd.Run is the same as Cmd.Start + Cmd.Wait.

What we're doing here is a refactoring, replacing Run with Start + Wait (see the last line of this method) so that we can write the timeout logic only once, instead of twice, in each branch of the if statement.

@rhcarvalho
Contributor Author

@vishh thanks for the review. Sorry it took me a few days to get back, I'm in the middle of a holiday here :-)

Would you mind marking this for running the tests?

It's very likely we need to update the "number of restarts" we expect in the test case, so it would be helpful to see the tests run.

@rhcarvalho
Contributor Author

Rebased and updated comment to include reference to Docker issue, thanks @ncdc!

@rhcarvalho
Contributor Author

I'll upgrade Docker to 1.10.3 (latest version packaged for Fedora) and try again.

Docker 1.10.3 showed the same behavior. Next I tried 1.12.1 from Fedora rawhide.

$ docker version
Client:
 Version:         1.12.1
 API version:     1.24
 Package version: docker-1.12.1-24.git9a3752d.fc26.x86_64
 Go version:      go1.7.1
 Git commit:      9a3752d/1.12.1
 Built:
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.1
 API version:     1.24
 Package version: docker-1.12.1-24.git9a3752d.fc26.x86_64
 Go version:      go1.7.1
 Git commit:      9a3752d/1.12.1
 Built:
 OS/Arch:         linux/amd64

Still, closing the connection of the Docker exec API call doesn't kill the process being executed.

Here's the code I'm using, in case I'm doing something obviously wrong:
rhcarvalho@08b6504

@sttts
Contributor

sttts commented Sep 20, 2016

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2016
@rhcarvalho
Contributor Author

Rebased.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2016
@rhcarvalho
Contributor Author

rhcarvalho commented Sep 23, 2016

#33301 touched exec code. Quoting a TODO from that PR, I would rather have a timeout as part of the exec signature before "exec is properly defined in CRI."

@ncdc @sttts how about we support timeout only for rkt and nsenter, and continue to ignore it with nativeclient (and I would gladly document that it is not supported, given a pointer to where the docs live)?
It is disappointing because nativeclient is the default, but at least it incorporates the notion of a timeout in the many ExecInContainer signatures, allowing for a future fix.

@sttts
Contributor

sttts commented Sep 23, 2016

Does the CRI proposal incorporate that notion? CRI will be the future.

This allows us to interrupt/kill the executed command if it exceeds the timeout.

Even though we cannot kill the process, we can return early and return an error. For liveness probes, this should trigger a container restart.

HTTPGet and TCPSocket probes respect the timeout, while Exec probes used to ignore it.
@k8s-ci-robot
Contributor

Jenkins GCI Kubemark GCE e2e failed for commit f18349a. Full PR test history.

The magic incantation to run this job again is @k8s-bot kubemark gci e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins unit/integration failed for commit f18349a. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot

@rhcarvalho PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 27, 2016
@rhcarvalho
Contributor Author

FTR, discussed with @sttts and @ncdc on IRC on the 23rd, and decided to open a new PR with the "non-controversial parts", adding the timeout to the function signatures -> #33366.

Support for timeouts in nsenter and rkt is implemented here, though to effectively support the most common case, nativeclient, we'd need changes in Docker.

To avoid asymmetry among the nsenter, rkt and nativeclient implementations, we left out actually observing the timeouts in all of them, at least initially.

Opted for opening a new PR to preserve the history here.
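
For readers unfamiliar with the "signature first, behavior later" approach, here is a rough sketch of the idea. It deliberately uses a simplified, hypothetical interface (`commandRunner`, `RunInContainer`) rather than the real `ExecInContainer` signature, which takes more parameters. The implementation accepts the timeout but does not observe it yet.

```go
package main

import (
	"fmt"
	"time"
)

// commandRunner is a simplified, hypothetical stand-in for an exec interface.
// The timeout is part of the signature so that callers can already pass it.
type commandRunner interface {
	RunInContainer(containerID string, cmd []string, timeout time.Duration) ([]byte, error)
}

// ignoringRunner accepts the timeout but does not observe it yet.
type ignoringRunner struct{}

func (ignoringRunner) RunInContainer(containerID string, cmd []string, timeout time.Duration) ([]byte, error) {
	// TODO: observe timeout once the underlying runtime can interrupt the command.
	_ = timeout
	return []byte(fmt.Sprintf("ran %v in %s", cmd, containerID)), nil
}

func main() {
	var r commandRunner = ignoringRunner{}
	out, err := r.RunInContainer("abc123", []string{"cat", "/tmp/health"}, time.Second)
	fmt.Println(string(out), err)
}
```

With the parameter in place, a later change can start enforcing the timeout inside each implementation without touching any caller.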

@k8s-github-robot

This PR hasn't been active in 30 days. It will be closed in 59 days (Dec 29, 2016).

cc @ncdc @rhcarvalho @vishh

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@rhcarvalho
Contributor Author

Closing in favor of #33366 and #35893.

@rhcarvalho rhcarvalho closed this Oct 31, 2016
k8s-github-robot pushed a commit that referenced this pull request Nov 8, 2016
…ument

Automatic merge from submit-queue

Add timeout argument to ExecInContainer

**What this PR does / why we need it**: This is related to #26895. It brings a timeout to the signature of `ExecInContainer` so that we can take timeouts into account in the future. Unlike my first attempt in #27956, it doesn't immediately observe the timeout, because it is impossible to do it with the current state of the Docker Remote API (the default exec handler implementation).

**Special notes for your reviewer**: This shares commits with #27956, but without some of them that have more controversial implications (actually supporting the timeouts). The original PR shall be closed in the current state to preserve the history (instead of dropping commits in that PR).

Pinging the original people working on this change: @ncdc @sttts @vishh @dims 

**Release note**:

``` release-note
NONE
```
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.