Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful termination fails on terminationGracePeriodSeconds > 2 minutes #31219

Closed
pracucci opened this issue Aug 23, 2016 · 9 comments
Closed
Assignees
Labels
area/client-libraries kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@pracucci
Copy link

Kubernetes version: 1.3.5

I've a pod with terminationGracePeriodSeconds set to 600 (10 minutes). When I delete the pod, if it takes more than 2 minutes to shutdown, then weird things happen (like the networking stops working after 2 minutes it's in the Terminating state). After digging a big in the Kubernetes sources, I do believe I've individuated the root cause. Please see my report below.

Extract from the node's syslog:

Aug 23 07:18:47 docker_manager.go:1326] Killing container "497c5cdb46d919092f99359f6761d06c00db40439a51f4609dbc2d2174f56a50 test-termination default/test-termination-1781375857-hj5z2" with 600 second grace period
Aug 23 07:20:47 docker_manager.go:1367] Container "497c5cdb46d919092f99359f6761d06c00db40439a51f4609dbc2d2174f56a50 test-termination default/test-termination-1781375857-hj5z2" termination failed after 2m0.000209508s: operation timeout: context deadline exceeded

When a container should be killed, the killContainer() (manager.go) is called. At some point it does:

err := dm.client.StopContainer(ID, int(gracePeriod))
if err == nil {
    glog.V(2).Infof("Container %q exited after %s", name, unversioned.Now().Sub(start.Time))
} else {
    glog.V(2).Infof("Container %q termination failed after %s: %v", name, unversioned.Now().Sub(start.Time), err)
}

Looking at StopContainer() (kube_docker_client.go) you can see it does:

err := d.client.ContainerStop(ctx, id, timeout)

Now, the d.client.ContainerStop() blocks until it completes the execution or the input timeout expires (the input timeout is the grace period - set to 600 seconds in my test).

However, the d.client instance has a defaultTimeout of 2 minutes, thus if the grace period is > 2 minutes then the ContainerStop() request times out before the grace period. If my analysis is correct, we should set a client timeout a bit higher than the input timeout (grace period), if the latter is > 2 minutes.

What's your take?

@k8s-github-robot k8s-github-robot added area/client-libraries sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 23, 2016
@sarvigalava
Copy link

Same issue here on Kubernetes 1.3.3.

We have terminationGracePeriodSeconds set to several hours in order to successfully complete data processiong.

The pod's IP field gets empty value after entering Terminating state and the ifconfig command lists only the loopback interface within container.

In the Kubernetes 1.2.0 the networking remains available while the pod is in terminating state.

@yujuhong yujuhong added the kind/bug Categorizes issue or PR as related to a bug. label Aug 23, 2016
@yujuhong yujuhong added this to the v1.4 milestone Aug 23, 2016
@yujuhong
Copy link
Contributor

Yes, the request timeout should be greater than the termination grace period. We should fix this.

/cc @kubernetes/sig-node

@dchen1107 dchen1107 added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 23, 2016
@dchen1107
Copy link
Member

@pracucci Thanks for reporting the issue with such detail debugging information.

@Random-Liu Could you please take a look?

@dims
Copy link
Member

dims commented Aug 23, 2016

@yujuhong @dchen1107 @Random-Liu may i take a stab at it?

dims added a commit to dims/kubernetes that referenced this issue Aug 23, 2016
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.

Fixes kubernetes#31219
@yujuhong
Copy link
Contributor

@dims no objection from me.

@Random-Liu
Copy link
Member

Yeah, this is a bug. @dims Thanks a lot for fixing it! :)

@Random-Liu
Copy link
Member

@yujuhong @dchen1107 This is introduced in v1.3, do we need a cherry-pick?

@dchen1107 dchen1107 assigned dims and unassigned dchen1107 and Random-Liu Aug 23, 2016
@dchen1107
Copy link
Member

@dims thanks for fixing the issue. I don't think we are going to cut another patch solely for this bug based on our patch policy. But @Random-Liu or @dims could one of you send a cherrypick pr against 1.3 branch. So that we could include the fix if we cut another patch release for 1.3?

Also @Random-Liu can we add a node-e2e test to make sure we won't have such regression in the future?

Thanks!

dims added a commit to dims/kubernetes that referenced this issue Aug 23, 2016
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.

Fixes kubernetes#31219
@dims
Copy link
Member

dims commented Aug 23, 2016

@dchen1107 @Random-Liu : here's the PR for 1.3 #31279

k8s-github-robot pushed a commit that referenced this issue Aug 25, 2016
Automatic merge from submit-queue

Increase request timeout based on termination grace period

When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.

Fixes #31219
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this issue Dec 1, 2016
When terminationGracePeriodSeconds is set to > 2 minutes (which is
the default request timeout), ContainerStop() times out at 2 minutes.
We should check the timeout being passed in and bump up the
request timeout if needed.

Fixes kubernetes#31219
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/client-libraries kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.
Projects
None yet
Development

No branches or pull requests

7 participants