Endpoint watchers cannot assume Ready pods are "Ready" in the presence of process restarts #13364

Closed
smarterclayton opened this issue Aug 30, 2015 · 7 comments
Labels
area/kubelet area/reliability area/usability kind/design Categorizes issue or PR as related to design. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@smarterclayton
Contributor

If a pod process dies and is restarted by Kube (triggering the ready flag to be reset), there's no guarantee that a load balancer watching endpoints sees the pod go back to not-ready (kubelet updating status -> endpoints controller -> load balancer watcher) before the pod starts listening on its port again. For any process that opens its TCP port before it is actually ready, this means that when the process dies and is restarted, load balancers will continue to hit the endpoint before its readiness check passes.

One solution is to have your load balancer run the readiness check itself - but that only works a) if your load balancer supports it, and b) if you chose an HTTP or TCP check (load balancers can't run the exec check). Most load balancers do support readiness checks, although not all support the same set of options.
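For illustration, the balancer-side HTTP variant amounts to polling the same path the pod declares and only keeping backends that return 200. The sketch below is generic Go, not any particular balancer's implementation; the /ready path, timeout, and addresses are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// readyBackends keeps only the backends whose readiness path returns 200,
// so an open TCP port alone is not enough to stay in rotation.
func readyBackends(addrs []string, path string) []string {
	client := &http.Client{Timeout: 2 * time.Second}
	var healthy []string
	for _, addr := range addrs {
		resp, err := client.Get("http://" + addr + path)
		if err != nil {
			continue // refused or timed out: leave the backend out of rotation
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			healthy = append(healthy, addr)
		}
	}
	return healthy
}

func main() {
	// Placeholder addresses and path; a real balancer would take these from
	// the endpoints list and the pod's readiness probe definition.
	fmt.Println(readyBackends([]string{"10.0.1.5:8080", "10.0.1.6:8080"}, "/ready"))
}
```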

One possible fix for the exec check is to expose an endpoint on the kubelet that acts as a readiness proxy for the process - e.g. https://kubelet:10250/v1/pods/<pod name or maybe pod ip>/readinesscheck - which answers based on the kubelet's internal ready bool. This would put slightly higher load on the kubelet, and would mean that a kubelet restart results in load balancers thinking the pod is down.
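A minimal sketch of what such a readiness proxy could look like, assuming a hypothetical handler served on the kubelet's port; neither the /v1/pods/.../readinesscheck path nor these types exist in the kubelet today, they only illustrate returning the internal ready bool over HTTP:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"sync"
)

// readinessProxy is a stand-in for the proposed kubelet endpoint: it serves
// the readiness flag the kubelet already tracks for each pod.
type readinessProxy struct {
	mu    sync.RWMutex
	ready map[string]bool // pod name (or pod IP) -> last observed readiness
}

func (p *readinessProxy) setReady(pod string, ready bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.ready[pod] = ready
}

// ServeHTTP answers GET /v1/pods/<pod>/readinesscheck with 200 when the
// internal ready flag is true and 503 otherwise, so a load balancer can use
// a plain HTTP check even when the pod's own probe is an exec check.
func (p *readinessProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
	if len(parts) != 4 || parts[0] != "v1" || parts[1] != "pods" || parts[3] != "readinesscheck" {
		http.NotFound(w, r)
		return
	}
	p.mu.RLock()
	ready := p.ready[parts[2]]
	p.mu.RUnlock()
	if !ready {
		http.Error(w, "pod not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	proxy := &readinessProxy{ready: map[string]bool{}}
	proxy.setReady("jboss-1", false) // in the kubelet this would come from the prober
	log.Fatal(http.ListenAndServe(":10250", proxy))
}
```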

@smarterclayton smarterclayton added sig/network Categorizes an issue or PR as relevant to SIG Network. area/kubelet area/usability area/reliability labels Aug 30, 2015
@smarterclayton
Contributor Author

@kubernetes/goog-cluster @kubernetes/rh-cluster-infra

@lavalamp lavalamp added kind/design Categorizes issue or PR as related to design. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. team/cluster labels Aug 31, 2015
@lavalamp
Member

> no guarantee that a load balancer watching endpoints sees the pod go back to not-ready (kubelet updating status -> endpoints controller -> load balancer watcher) before the pod starts listening on its port again.

Are you sure? Kubelet should send a not-ready status, then restart, then later a ready status. The endpoint controller should remove the pod from the list, then later re-add it. The only thing not guaranteed here would be the latency between death and removal.

If the stop and restart are so fast that they both happen in a single pass of the endpoint controller, then the pod would stay in the list. If that's a problem (why would it be?), then we could change the endpoint controller to remove it, then re-add it on the next loop; the endpoint controller should know that a restart happened.
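A rough sketch of that remove-then-re-add idea, with made-up types standing in for what the endpoints controller actually tracks (the real controller works from pod status, not these structs):

```go
package main

import "fmt"

// podSnapshot is a stand-in for the per-pod state the endpoints controller
// sees on each sync pass; the real controller works from PodStatus.
type podSnapshot struct {
	IP       string
	Ready    bool
	Restarts int // sum of container restart counts
}

// endpointsForSync drops a pod for one pass if it restarted since the last
// sync, even if it still reports Ready, and lets a later pass re-add it once
// its readiness has been re-established after the restart.
func endpointsForSync(prev, cur map[string]podSnapshot) []string {
	var addrs []string
	for name, pod := range cur {
		restarted := pod.Restarts > prev[name].Restarts
		if pod.Ready && !restarted {
			addrs = append(addrs, pod.IP)
		}
	}
	return addrs
}

func main() {
	prev := map[string]podSnapshot{"jboss-1": {IP: "10.0.1.5:8080", Ready: true, Restarts: 0}}
	cur := map[string]podSnapshot{"jboss-1": {IP: "10.0.1.5:8080", Ready: true, Restarts: 1}}
	fmt.Println(endpointsForSync(prev, cur)) // []: dropped this pass despite Ready=true
}
```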

@smarterclayton
Contributor Author

So the use case is JBoss, which comes up and starts listening on port 8080 within ~5-10 seconds. However, WAR loading takes longer than that - another 10-60s depending on how big your app is. The ready check guards that latter portion. So the sequence of events is:

  1. Pod starts first time, goes ready, added to endpoints list, load balancer sees endpoints change and puts pod in rotation
  2. Pod is SIGKILLED
  3. Client hits load balancer, load balancer tries to open connection, sees it can't, takes it out of rotation
  4. Kubelet observes pod death, starts new container, updates ready status to false
  5. Load balancer gets a new request, continues to use old endpoints list, sees that the port is open, sends request to pod
  6. JBoss rejects request with 404 because app isn't loaded
  7. Load balancer observes endpoints list update, removes endpoints from rotation.
  8. JBoss finishes and marks ready, propagates out

Essentially there is always a race between steps 4 and 5: depending on the various propagation delays, the load balancer can observe port 8080 open before it observes the endpoints list purging the pod. We can minimize the window, but it's not truly zero.
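For concreteness, the behavior in steps 3-5 corresponds to balancer logic roughly like the following (hypothetical, not any real balancer): a successful TCP dial against the stale endpoints list is treated as health, so the restarted pod looks fine again the moment port 8080 reopens, well before the WAR is loaded:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// backendLooksHealthy is the passive check that produces step 5: it only asks
// whether the port accepts connections, which JBoss does long before the
// application is actually loaded.
func backendLooksHealthy(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return false // step 3: connection refused while the container is down
	}
	conn.Close()
	return true // step 5: port reopened, even though readiness is still false
}

func main() {
	// Placeholder address; in the sequence above this is the restarted pod
	// still present in the balancer's stale endpoints list.
	fmt.Println(backendLooksHealthy("10.0.1.5:8080"))
}
```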

@thockin
Member

thockin commented Aug 31, 2015

I'll fight hard to NOT put kubelet downtime in the critical path for pods.

There's a natural propagation delay here, as you point out. Could we mitigate it by forcing the kubelet to POST the not-ready status BEFORE restarting the pod? At least LBs will have a chance to respond to not-ready before the port reopens...
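A rough sketch of that ordering, with fake types standing in for the kubelet's status writer and container runtime (the real kubelet interfaces look nothing like this):

```go
package main

import (
	"context"
	"fmt"
)

// fakeAPI stands in for the kubelet's status writer and container runtime.
type fakeAPI struct{}

func (fakeAPI) SetPodReady(ctx context.Context, pod string, ready bool) error {
	fmt.Printf("status write acked: pod %s ready=%v\n", pod, ready)
	return nil
}

func (fakeAPI) RestartContainer(ctx context.Context, pod, container string) error {
	fmt.Printf("restarting %s/%s (port can only reopen after this)\n", pod, container)
	return nil
}

// restartWithNotReadyFirst publishes Ready=false and waits for that write to
// succeed before restarting the container, giving watchers a head start on
// the not-ready signal. Propagation through the endpoints controller and
// load balancers is still asynchronous, so the window shrinks but isn't zero.
func restartWithNotReadyFirst(ctx context.Context, api fakeAPI, pod, container string) error {
	if err := api.SetPodReady(ctx, pod, false); err != nil {
		return err
	}
	return api.RestartContainer(ctx, pod, container)
}

func main() {
	_ = restartWithNotReadyFirst(context.Background(), fakeAPI{}, "jboss-1", "jboss")
}
```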


@smarterclayton
Contributor Author

Hrm - not-ready has to propagate through two layers asynchronously, so there's potentially unbounded latency. You'd have to define a minimum bounce latency in the load balancer, which makes you less tolerant of network flakes.
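A sketch of what such a minimum bounce latency could look like on the balancer side (entirely hypothetical logic); the trade-off is visible here, since a transient network flake also keeps the backend out of rotation for the full hold-down period:

```go
package main

import (
	"fmt"
	"time"
)

// bounceGuard keeps a backend out of rotation for at least holdDown after it
// was last seen failing, even if its port has already reopened.
type bounceGuard struct {
	holdDown    time.Duration
	lastFailure map[string]time.Time
}

func (g *bounceGuard) markFailed(addr string) {
	g.lastFailure[addr] = time.Now()
}

func (g *bounceGuard) eligible(addr string) bool {
	failed, ok := g.lastFailure[addr]
	if !ok {
		return true
	}
	return time.Since(failed) >= g.holdDown
}

func main() {
	g := &bounceGuard{holdDown: 30 * time.Second, lastFailure: map[string]time.Time{}}
	g.markFailed("10.0.1.5:8080")
	// Even if a TCP check to this backend succeeds now, it stays out of
	// rotation until the hold-down expires.
	fmt.Println(g.eligible("10.0.1.5:8080"))
}
```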


@cmluciano

Is this issue still relevant?

@thockin
Member

thockin commented May 17, 2017

I think this is an app problem - reporting ready when it is not.
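In other words, the app-level fix is to make "port open" and "ready" coincide. A minimal sketch, with a stand-in for the slow initialization (illustrative, not JBoss-specific):

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

// loadApplication stands in for the slow part of startup (the 10-60s WAR
// load in the JBoss example above).
func loadApplication() {
	time.Sleep(2 * time.Second)
}

func main() {
	// Do all slow initialization before binding the port, so the port only
	// opens once the process can actually serve traffic.
	loadApplication()

	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.Serve(ln, nil))
}
```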
