Graceful termination fails on terminationGracePeriodSeconds > 2 minutes #31219

pracucci · 2016-08-23T08:31:12Z

Kubernetes version: 1.3.5

I have a pod with terminationGracePeriodSeconds set to 600 (10 minutes). When I delete the pod and it takes more than 2 minutes to shut down, weird things happen (for example, networking stops working once the pod has been in the Terminating state for 2 minutes). After digging a bit into the Kubernetes sources, I believe I've identified the root cause. Please see my report below.

Extract from the node's syslog:

Aug 23 07:18:47 docker_manager.go:1326] Killing container "497c5cdb46d919092f99359f6761d06c00db40439a51f4609dbc2d2174f56a50 test-termination default/test-termination-1781375857-hj5z2" with 600 second grace period
Aug 23 07:20:47 docker_manager.go:1367] Container "497c5cdb46d919092f99359f6761d06c00db40439a51f4609dbc2d2174f56a50 test-termination default/test-termination-1781375857-hj5z2" termination failed after 2m0.000209508s: operation timeout: context deadline exceeded

When a container needs to be killed, killContainer() (manager.go) is called. At some point it does:

err := dm.client.StopContainer(ID, int(gracePeriod))
if err == nil {
    glog.V(2).Infof("Container %q exited after %s", name, unversioned.Now().Sub(start.Time))
} else {
    glog.V(2).Infof("Container %q termination failed after %s: %v", name, unversioned.Now().Sub(start.Time), err)
}

Looking at StopContainer() (kube_docker_client.go) you can see it does:

err := d.client.ContainerStop(ctx, id, timeout)

Now, d.client.ContainerStop() blocks until the container has stopped or the input timeout expires (the input timeout is the grace period, set to 600 seconds in my test).

However, the d.client instance has a defaultTimeout of 2 minutes, thus if the grace period is > 2 minutes then the ContainerStop() request times out before the grace period. If my analysis is correct, we should set a client timeout a bit higher than the input timeout (grace period), if the latter is > 2 minutes.

What's your take?

sarvigalava · 2016-08-23T12:34:57Z

Same issue here on Kubernetes 1.3.3.

We have terminationGracePeriodSeconds set to several hours so that data processing can complete successfully.

The pod's IP field becomes empty after the pod enters the Terminating state, and the ifconfig command lists only the loopback interface inside the container.

In Kubernetes 1.2.0, networking remains available while the pod is in the Terminating state.

yujuhong · 2016-08-23T17:26:50Z

Yes, the request timeout should be greater than the termination grace period. We should fix this.

/cc @kubernetes/sig-node

dchen1107 · 2016-08-23T17:41:45Z

@pracucci Thanks for reporting the issue with such detailed debugging information.

@Random-Liu Could you please take a look?

dims · 2016-08-23T17:42:11Z

@yujuhong @dchen1107 @Random-Liu may I take a stab at it?

When terminationGracePeriodSeconds is set to > 2 minutes (which is the default request timeout), ContainerStop() times out at 2 minutes. We should check the timeout being passed in and bump up the request timeout if needed. Fixes kubernetes#31219

yujuhong · 2016-08-23T17:48:52Z

@dims no objection from me.

Random-Liu · 2016-08-23T18:11:47Z

Yeah, this is a bug. @dims Thanks a lot for fixing it! :)

Random-Liu · 2016-08-23T18:14:23Z

@yujuhong @dchen1107 This is introduced in v1.3, do we need a cherry-pick?

dchen1107 · 2016-08-23T18:23:43Z

@dims thanks for fixing the issue. Based on our patch policy, I don't think we are going to cut another patch release solely for this bug. But @Random-Liu or @dims, could one of you send a cherry-pick PR against the 1.3 branch, so that we can include the fix if we do cut another patch release for 1.3?

Also @Random-Liu can we add a node-e2e test to make sure we won't have such regression in the future?

Thanks!

dims · 2016-08-23T18:40:50Z

@dchen1107 @Random-Liu : here's the PR for 1.3 #31279

Automatic merge from submit-queue

Increase request timeout based on termination grace period

When terminationGracePeriodSeconds is set to > 2 minutes (which is the default request timeout), ContainerStop() times out at 2 minutes. We should check the timeout being passed in and bump up the request timeout if needed. Fixes #31219

k8s-github-robot added the area/client-libraries and sig/node labels Aug 23, 2016

yujuhong added the kind/bug label Aug 23, 2016

yujuhong added this to the v1.4 milestone Aug 23, 2016

dchen1107 assigned Random-Liu and dchen1107 Aug 23, 2016

dchen1107 added the priority/important-soon label Aug 23, 2016

dims mentioned this issue Aug 23, 2016

Increase request timeout based on termination grace period #31275

Merged

dchen1107 assigned dims and unassigned dchen1107 and Random-Liu Aug 23, 2016

dchen1107 assigned Random-Liu Aug 23, 2016

k8s-github-robot closed this as completed in #31275 Aug 25, 2016
