Make Exec probes respect timeout #27956

rhcarvalho · 2016-06-23T18:11:55Z

As described in the issue, Exec probes used to disregard the timeout setting observed by HTTPGet and TCPSocket probes.

@dims opened #26899 to tackle it, but I think I can contribute something more than just comments. @dims, if you see value in the changes here, please feel free to absorb them into your PR.

This is a bigger change than @dims's PR because we need to change interfaces to pass the timeout to the code that actually runs the command, so that the process can be killed after the timeout.

Includes an e2e test.

k8s-bot · 2016-06-23T18:12:22Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

rhcarvalho · 2016-06-23T18:13:47Z

pkg/kubelet/dockertools/exec.go

+		defer time.AfterFunc(timeout, func() {
+			// FIXME: we should kill the process in the container,
+			// but I couldn't find anything in the Docker API docs
+			// on how to do it with the *Exec APIs.


This seems unfortunate, but on the other hand there seems to be an implicit timeout of 10s here (ticker frequency + count up to 5 + break).

I need to check whether the call to client.StartExec blocks until the command terminates, given that we set Detach: false.
If it does block until the command terminates, then I don't understand the for loop below.

StartExec is blocking. It didn't used to be. This code probably wasn't updated when we switched it to blocking.

It's impossible to use the Docker API to stop or kill an Exec session, unfortunately. And they have stated they don't intend to add support to do so.

> StartExec is blocking. It didn't used to be. This code probably wasn't updated when we switched it to blocking.

IIUC that means that the for loop below only ever runs once and count++ is never reached?

I could clean that up.

I discovered it to be blocking in my terminal resizing PR. It didn't used to be. It's blocking because of kubernetes/pkg/kubelet/dockertools/kube_docker_client.go, line 427 in 52ebd4e:

return d.holdHijackedConnection(sopts.RawTerminal || opts.Tty, sopts.InputStream, sopts.OutputStream, sopts.ErrorStream, resp)

I assume that the for loop only ever runs once.

cc @Random-Liu

@ncdc Can you post a link to the docker issue or PR around killing exec?

moby/moby#9098

dims · 2016-06-23T18:24:12Z

@rhcarvalho - It's awesome to see this PR, let's use this one and close mine. I'll review this in a little bit

rhcarvalho · 2016-06-24T08:02:36Z

@vishh in #26899 (comment) you asked for a new test case under test/e2e_node/. I searched for "timeout" in that directory, but didn't find any tests related to testing the timeouts for the other probe types...

Do you know if there are tests for the probes somewhere else? Thanks!

rhcarvalho · 2016-06-24T08:11:22Z

The closest things I could find were these, but they are not testing that the timeout interrupted the probe:

rhcarvalho · 2016-06-24T08:23:27Z

Found 2 tests in test/e2e/pods.go that test restarts triggered by the liveness probe:
test/e2e/pods.go#L1024-L1076

Seems that it's the place where we could add the timeout tests!

However, I doubt that a single test case there will cover the 3 implementations (docker.NsenterExecHandler, docker.NativeExecHandler, rkt.Runtime). @vishh do you have a suggestion?

rhcarvalho · 2016-06-24T08:29:30Z

test/e2e/pods.go

+					},
+				},
+			},
+		}, 1, defaultObservationTimeout)


I'm not sure how many restarts we should observe here; certainly at least 1. We may need to fine-tune the numbers to get exactly-1 behavior. Open to suggestions.

Exactly-n counting is hard to do in tests. I'd rather see at-least-n.

Agreed. I need to look again; I think the test helper has exactly-n semantics.
The tests haven't run yet; we need an "ok to test" comment, I think.

Looked at the helper, it's indeed at least n semantics.
test/e2e/pods.go#L104-L110

I think checking for at least 1 restart (in the period of 2 minutes, as per defaultObservationTimeout) should be enough. @ncdc @vishh do you agree?

rhcarvalho · 2016-06-27T15:48:19Z

@ncdc, since you worked on the Exec probe code base, may I ask your review here?

ncdc · 2016-06-27T15:53:35Z

I've also got #25273 open to add terminal resizing support for exec & attach, and our 2 PRs are going to conflict. How do you want to proceed - your PR first or mine?

rhcarvalho · 2016-06-27T17:00:11Z

> I've also got #25273 open to add terminal resizing support for exec & attach, and our 2 PRs are going to conflict. How do you want to proceed - your PR first or mine?

I wouldn't mind going after, since your PR is much earlier, from May. But from the last comments I'm wondering if it is going to land anytime soon?

At a glance I think the conflict will be just the method signature :)

ncdc · 2016-06-27T17:04:56Z

> I wouldn't mind going after, since your PR is much earlier, from May. But from the last comments I'm wondering if it is going to land anytime soon?

Yeah, I'm not sure. I need to find someone who can add Windows support. Or maybe just get my PR in without Windows support for starters, and then do follow-up to fix Windows. That's probably more realistic.

vishh · 2016-06-27T18:35:36Z

@rhcarvalho

Adapting the probe tests in test/e2e/pods.go sgtm! You can clone one of those tests into test/e2e_node/ directory and add timeouts. Don't bother about the various exec implementations for now. Just test against the API.
We are working on improving the framework to handle updating flags which will let us test various drivers for exec.

rhcarvalho · 2016-06-27T20:50:44Z

> Adapting the probe tests in test/e2e/pods.go sgtm! You can clone one of those tests into test/e2e_node/ directory and add timeouts.

It's already there -- https://github.com/kubernetes/kubernetes/pull/27956/files#diff-92d176a1025dcbee0981bb7f16cda942R1078

Though it probably needs an update on the expected number of restarts / change the helper to "at least n" semantics.

> Don't bother about the various exec implementations for now.

Alright!

@vishh could you please trigger a test run?

vishh · 2016-07-01T21:26:04Z

pkg/kubelet/dockertools/exec.go

@@ -90,15 +88,22 @@ func (*NsenterExecHandler) ExecInContainer(client DockerInterface, container *do
 		if stderr != nil {
 			command.Stderr = stderr
 		}
-
-		return command.Run()
+		if err := command.Start(); err != nil {


Should the Start be executed outside of the if tty { section? I don't see the command being started in the case of tty=true

Cmd.Run is the same as Cmd.Start + Cmd.Wait.

What we're doing here is a refactoring, replacing Run with Start + Wait (see the last line of this method) so that we can write the timeout logic only once, instead of twice, in each branch of the if statement.

rhcarvalho · 2016-07-05T19:39:23Z

@vishh thanks for the review. Sorry it took me a few days to get back, I'm in the middle of a holiday here :-)

Would you mind marking this for running the tests?

It's very likely we need to update the "number of restarts" we expect in the test case, would be helpful to see them run.

rhcarvalho · 2016-07-05T19:47:27Z

Rebased and updated comment to include reference to Docker issue, thanks @ncdc!

rhcarvalho · 2016-09-20T16:15:39Z

I'll upgrade Docker to 1.10.3 (latest version packaged for Fedora) and try again.

Docker 1.10.3 showed the same behavior. Next I tried 1.12.1 from Fedora rawhide.

$ docker version
Client:
 Version:         1.12.1
 API version:     1.24
 Package version: docker-1.12.1-24.git9a3752d.fc26.x86_64
 Go version:      go1.7.1
 Git commit:      9a3752d/1.12.1
 Built:
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.1
 API version:     1.24
 Package version: docker-1.12.1-24.git9a3752d.fc26.x86_64
 Go version:      go1.7.1
 Git commit:      9a3752d/1.12.1
 Built:
 OS/Arch:         linux/amd64

Still, closing the connection of the exec Docker API call doesn't kill the process being executed.

Here's the code I'm using, in case I'm doing something obviously wrong:
rhcarvalho@08b6504

sttts · 2016-09-20T17:43:42Z

Look at this comment: https://github.com/docker/docker/blob/3ea762b9f6ba256cf51bd2c35988f0c48bccf0b0/api/server/router/container/exec.go#L109 👎

rhcarvalho · 2016-09-23T10:09:51Z

Rebased.

rhcarvalho · 2016-09-23T10:20:03Z

#33301 touched exec code. Quoting a TODO from that PR, I would rather have a timeout as part of the exec signature before "exec is properly defined in CRI."

@ncdc @sttts how about we support timeout only for rkt and nsenter, and continue to ignore it with nativeclient (and I would gladly document that it is not supported, given a pointer to where the docs live)?
It is disappointing because nativeclient is the default, but at least it incorporates the notion of a timeout in the many ExecInContainer signatures, allowing for a future fix.

sttts · 2016-09-23T10:23:06Z

Does the CRI proposal incorporate that notion? CRI will be the future.

k8s-ci-robot · 2016-09-26T16:42:15Z

Jenkins GCI Kubemark GCE e2e failed for commit f18349a. Full PR test history.

The magic incantation to run this job again is @k8s-bot kubemark gci e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-ci-robot · 2016-09-26T16:51:15Z

Jenkins unit/integration failed for commit f18349a. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-github-robot · 2016-09-27T16:04:16Z

@rhcarvalho PR needs rebase

rhcarvalho · 2016-09-30T11:01:24Z

FTR, discussed with @sttts and @ncdc on IRC on the 23rd, and decided to open a new PR with the "non-controversial parts", adding the timeout to the function signatures -> #33366.

Support for timeouts in nsenter and rkt is implemented here, though to effectively support the most common case, nativeclient, we'd need changes in Docker.

To avoid asymmetry among the nsenter, rkt and nativeclient implementations, we left out actually observing the timeouts in all of them, at least initially.

Opted for opening a new PR to preserve the history here.

k8s-github-robot · 2016-10-30T11:19:26Z

This PR hasn't been active in 30 days. It will be closed in 59 days (Dec 29, 2016).

cc @ncdc @rhcarvalho @vishh

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

rhcarvalho · 2016-10-31T10:03:15Z

Closing in favor of #33366 and #35893.

@ncdc

…ument

Automatic merge from submit-queue

Add timeout argument to ExecInContainer

**What this PR does / why we need it**: This is related to #26895. It brings a timeout to the signature of `ExecInContainer` so that we can take timeouts into account in the future. Unlike my first attempt in #27956, it doesn't immediately observe the timeout, because it is impossible to do it with the current state of the Docker Remote API (the default exec handler implementation).

**Special notes for your reviewer**: This shares commits with #27956, but without some of them that have more controversial implications (actually supporting the timeouts). The original PR shall be closed in the current state to preserve the history (instead of dropping commits in that PR).

Pinging the original people working on this change: @ncdc @sttts @vishh @dims

**Release note**: ``` release-note NONE ```

googlebot added the cla: yes label Jun 23, 2016

rhcarvalho reviewed Jun 23, 2016

k8s-github-robot assigned vishh Jun 23, 2016

k8s-github-robot added the size/L and release-note-label-needed labels Jun 23, 2016

dims mentioned this pull request Jun 23, 2016

Add timeout to Exec Readiness Probe #26899

Closed

rhcarvalho reviewed Jun 24, 2016

vishh reviewed Jul 1, 2016

rhcarvalho force-pushed the execincontainer-timeout branch from a1d2aed to 09a4d30 Compare July 5, 2016 19:46

rhcarvalho force-pushed the execincontainer-timeout branch from 09a4d30 to a00e0d4 Compare July 5, 2016 20:18

k8s-github-robot added the needs-rebase label Sep 23, 2016

rhcarvalho force-pushed the execincontainer-timeout branch from e9e3a6f to 88a5cff Compare September 23, 2016 10:07

k8s-github-robot removed the needs-rebase label Sep 23, 2016

rhcarvalho mentioned this pull request Sep 23, 2016

Add timeout argument to ExecInContainer #33366

Merged

rhcarvalho added 6 commits September 26, 2016 18:25

Add timeout argument to ExecInContainer

e52ea26

This allows us to interrupt/kill the executed command if it exceeds the timeout.

NsenterExecHandler: kill exec'ed cmd after timeout

d82bc85

NativeExecHandler: return after timeout

6d9573a

Even though we cannot kill the process, we can return early and return an error. For liveness probes, this should trigger a container restart.

rkt.Runtime: kill exec'ed cmd after timeout

657cb46

Set timeout in Exec probes

3f8fe87

HTTPGet and TCPSocket probes respect the timeout, while Exec probes used to ignore it.

Add e2e test for exec probe with timeout

f18349a

rhcarvalho force-pushed the execincontainer-timeout branch from 88a5cff to f18349a Compare September 26, 2016 16:25

k8s-github-robot added the needs-rebase label Sep 27, 2016

k8s-github-robot mentioned this pull request Oct 6, 2016

[k8s.io] Downward API volume should update annotations on modification [Conformance] {Kubernetes e2e suite} #34014

Closed

rhcarvalho mentioned this pull request Oct 31, 2016

Implement streaming CRI methods in dockershim #35661

Merged

rhcarvalho force-pushed the execincontainer-timeout branch 2 times, most recently from b1fcb90 to f18349a Compare October 31, 2016 09:52

rhcarvalho closed this Oct 31, 2016

cnelson mentioned this pull request May 16, 2017

Wrap redis check in timeout cloud-gov/kubernetes-broker#49

Merged
