Conversation

@louyihua
Contributor

What this PR does / why we need it:
Currently, the timeout setting is only effective for the HTTP and TCP probers.
This patch adds a timeout implementation for the exec prober.

Release note:

Make the timeout setting available to the exec prober
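For context, the general idea, sketched with the Go standard library (a minimal illustration, not this PR's actual diff): bound the exec'd command with a deadline and kill it when the deadline passes.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a command and kills it if it outlives the timeout.
func runWithTimeout(timeout time.Duration, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	// CommandContext kills the process once ctx expires.
	return exec.CommandContext(ctx, name, args...).CombinedOutput()
}

func main() {
	out, err := runWithTimeout(2*time.Second, "sh", "-c", "sleep 5; echo done")
	fmt.Printf("out=%q err=%v\n", out, err)
}
```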

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 19, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 19, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: louyihua
We suggest the following additional approver: dchen1107

Assign the PR to them by writing /assign @dchen1107 in a comment when ready.

No associated issue. Update pull-request body to add a reference to an issue, or get approval with /approve no-issue

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@m1093782566
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 19, 2018
Currently, the timeout setting is only effective for the HTTP and TCP probers.
This patch adds a timeout implementation for the exec prober.
@mtaufen
Contributor

mtaufen commented Jan 19, 2018

code mostly looks good
/assign @yujuhong

@yujuhong
Contributor

Why not set the timeout on the context in the docker client? https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.1/pkg/kubelet/dockershim/libdocker/kube_docker_client.go#L462

/assign @ncdc
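A minimal sketch of that pattern: derive a per-call context with a deadline before invoking the client, so the call aborts once the deadline passes. Here slowDockerCall is a hypothetical stand-in, not the real libdocker API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowDockerCall stands in for a client method that honors a context;
// it simulates an exec that takes 5 seconds to complete.
func slowDockerCall(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("operation aborted: %w", ctx.Err())
	}
}

func main() {
	// Derive a per-call context carrying the probe's timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	fmt.Println(slowDockerCall(ctx)) // operation aborted: context deadline exceeded
}
```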

@louyihua
Contributor Author

@yujuhong The exec prober does not use this interface, so this PR does not touch it.

@yujuhong
Contributor

I think it does use the libdocker.Interface. Check again?

@louyihua
Contributor Author

@yujuhong Checked again under OpenShift 3.9.
An exec probe starts with this stack trace:

 0  0x00000000035f88ab in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime.(*runtimeServiceClient).ExecSync
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime/api.pb.go:3738
 1  0x00000000041c54c3 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/remote.(*RemoteRuntimeService).ExecSync
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/remote/remote_runtime.go:330
 2  0x000000000434b33e in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime.instrumentedRuntimeService.ExecSync
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime/instrumented_services.go:147
 3  0x0000000004368ab0 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*instrumentedRuntimeService).ExecSync
    at ./openshift-3.9.0/<autogenerated>:1
 4  0x0000000004354b8d in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).RunInContainer
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go:805
 5  0x000000000437bccf in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.(*prober).newExecInContainer.func1
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/prober.go:245
 6  0x0000000004378ec7 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.execInContainer.CombinedOutput
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/prober.go:254
 7  0x00000000043759a9 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/probe/exec.execProber.Probe
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/probe/exec/exec.go:37
 8  0x0000000004375dc0 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/probe/exec.(*execProber).Probe
    at ./openshift-3.9.0/<autogenerated>:1
 9  0x000000000437854c in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.(*prober).runProbe
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/prober.go:152
10  0x000000000437749e in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.(*prober).runProbeWithRetries
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/prober.go:129
11  0x00000000043766de in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.(*prober).probe
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/prober.go:98
12  0x000000000437b546 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.(*worker).doProbe
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/worker.go:201
13  0x000000000437ab68 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober.(*worker).run
    at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/prober/worker.go:120
14  0x0000000000460531 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:2337

At the top frame of the above stack trace, a gRPC call is invoked, and control then transfers to ExecInContainer defined in dockershim/exec.go, as the following stack trace shows:

0  0x0000000004282211 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim.(*NativeExecHandler).ExecInContainer
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/exec.go:63
1  0x0000000004280895 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim.(*streamingRuntime).exec
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/docker_streaming.go:58
2  0x0000000004280d01 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).ExecSync
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/docker_streaming.go:81
3  0x000000000431d64c in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/remote.(*dockerService).ExecSync
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/remote/docker_service.go:158
4  0x00000000035fbce6 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime._RuntimeService_ExecSync_Handler
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime/api.pb.go:4121
5  0x0000000001f8c97b in github.com/openshift/origin/vendor/google.golang.org/grpc.(*Server).processUnaryRPC
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/google.golang.org/grpc/server.go:753
6  0x0000000001f901c2 in github.com/openshift/origin/vendor/google.golang.org/grpc.(*Server).handleStream
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/google.golang.org/grpc/server.go:955
7  0x0000000001f9607f in github.com/openshift/origin/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1
   at ./openshift-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/google.golang.org/grpc/server.go:517
8  0x0000000000460531 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:2337

Neither of the above stack traces touches the libdocker package.

@yujuhong
Contributor

https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.1/pkg/kubelet/dockershim/exec.go#L61

func (*NativeExecHandler) ExecInContainer(client libdocker.Interface, ...) {

@louyihua
Contributor Author

Another PR that modifies the libdocker.Interface is proposed in #58925.

@yujuhong
Contributor

yujuhong commented Feb 6, 2018

I'd like @ncdc's input on the best place to do the timeout check.

@ncdc
Member

ncdc commented Feb 7, 2018

@yujuhong I honestly have a hard time keeping track of what the code path is, given CRI and dockershim and libdocker and whatnot. I'd like to help out, but maybe you could give me an overview of how things work in the kubelet these days?

@yujuhong
Contributor

yujuhong commented Feb 7, 2018

> @yujuhong I honestly have a hard time keeping track of what the code path is, given CRI and dockershim and libdocker and whatnot. I'd like to help out, but maybe you could give me an overview of how things work in the kubelet these days?

@ncdc this part of the code actually hasn't changed much, although it has been moved around a couple of times and is harder to trace now :-)
I think the main question is where to handle the timeout. This PR does it in ExecInContainer(), while the other PR does it at the lower level, in StartExec(). Initially I thought doing it at the lower level would let us better leverage the context timeout used in docker operations, but that doesn't seem to be the case, and the change is even larger. Given that ExecInContainer is the only caller of StartExec, maybe we should just add the code here (i.e., this PR).

@mtaufen mtaufen removed their assignment Feb 7, 2018
@dims
Member

dims commented Feb 13, 2018

/unassign @dims

@louyihua
Contributor Author

louyihua commented Mar 7, 2018

@ncdc any comments on this PR?

@ncdc
Member

ncdc commented Mar 7, 2018

My apologies for the delayed response.

I think there are 2 possible courses of action here.

  1. Control the timeout in ExecInContainer, killing the exec'd pid when you hit the timeout (this PR)

  2. Control the timeout in StartExec, killing the exec'd pid inside StartExec when you hit the timeout, so that StartExec fully owns the management of the exec'd process. I don't think the other PR is ideal as-is, because it splits the ownership of the timeout and the killing of the process between StartExec and ExecInContainer.

From a design purity standpoint, I think option 2 makes more sense, since it allows all current and future callers of StartExec to have timeout support.

Would you agree?

@yujuhong I will, however, defer to sig-node to make the final decision. I think either this PR or a re-worked #58925 that matches my option 2 would be acceptable.
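For illustration, a hedged sketch of what option 2 could look like: StartExec derives the deadline and kills the exec'd process itself, so every current and future caller gets timeout support. All signatures and helpers below are illustrative stubs, not the real libdocker API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startExec owns both the deadline and the cleanup of the exec'd
// process (option 2); runAttached and killExecProcess are stubs.
func startExec(execID string, timeout time.Duration) error {
	ctx := context.Background()
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	done := make(chan error, 1)
	go func() { done <- runAttached(execID) }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// StartExec, not its callers, reaps the timed-out process.
		killExecProcess(execID)
		return fmt.Errorf("exec %q timed out after %v", execID, timeout)
	}
}

// runAttached simulates streaming an exec session that takes 5 seconds.
func runAttached(execID string) error {
	time.Sleep(5 * time.Second)
	return nil
}

// killExecProcess simulates terminating the exec'd pid in the container.
func killExecProcess(execID string) {}

func main() {
	fmt.Println(startExec("abc123", 2*time.Second))
}
```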

@louyihua
Contributor Author

louyihua commented Mar 7, 2018

@ncdc I agree

@ncdc
Member

ncdc commented May 10, 2018

Can we close this since it looks like #58925 is the way we want to go?

@louyihua
Contributor Author

@ncdc OK, closing this.
