Endpoint watchers cannot assume Ready pods are "Ready" in the presence of process restarts #13364

Closed
smarterclayton opened this issue Aug 30, 2015 · 7 comments
Labels
area/kubelet area/reliability area/usability kind/design Categorizes issue or PR as related to design. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@smarterclayton
Contributor

If a pod process dies and is restarted by Kube (triggering the ready flag to be reset), there's no guarantee that a load balancer watching endpoints sees the pod go back to not-ready (kubelet updating status -> endpoints controller -> load balancer watcher) before the pod starts listening on its port again. For any process that opens its TCP port before it is actually ready, this means that when the process dies and is restarted, load balancers will continue to hit the endpoint before its readiness check passes.

One solution is to have your load balancer run the readiness check itself - but that only works a) if your load balancer supports it, and b) if you chose an HTTP or TCP check (load balancers can't run the exec check). Most load balancers do support readiness checks, although not all support the same set of options.
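For illustration, the balancer-side HTTP variant amounts to polling the same path the pod declares and only keeping backends that return 200. The sketch below is generic Go, not any particular balancer's implementation; the /ready path, timeout, and addresses are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// readyBackends keeps only the backends whose readiness path returns 200,
// so an open TCP port alone is not enough to stay in rotation.
func readyBackends(addrs []string, path string) []string {
	client := &http.Client{Timeout: 2 * time.Second}
	var healthy []string
	for _, addr := range addrs {
		resp, err := client.Get("http://" + addr + path)
		if err != nil {
			continue // refused or timed out: leave the backend out of rotation
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			healthy = append(healthy, addr)
		}
	}
	return healthy
}

func main() {
	// Placeholder addresses and path; a real balancer would take these from
	// the endpoints list and the pod's readiness probe definition.
	fmt.Println(readyBackends([]string{"10.0.1.5:8080", "10.0.1.6:8080"}, "/ready"))
}
```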

One possible fix for the exec check is to expose an endpoint on the kubelet that acts as a readiness proxy for the process - e.g. https://kubelet:10250/v1/pods/<pod name or maybe pod ip>/readinesscheck - which answers based on the kubelet's internal ready bool. This would put slightly higher load on the kubelet, and would mean that a kubelet restart results in load balancers thinking the pod is down.
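A minimal sketch of what such a readiness proxy could look like, assuming a hypothetical handler served on the kubelet's port; neither the /v1/pods/.../readinesscheck path nor these types exist in the kubelet today, they only illustrate returning the internal ready bool over HTTP:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"sync"
)

// readinessProxy is a stand-in for the proposed kubelet endpoint: it serves
// the readiness flag the kubelet already tracks for each pod.
type readinessProxy struct {
	mu    sync.RWMutex
	ready map[string]bool // pod name (or pod IP) -> last observed readiness
}

func (p *readinessProxy) setReady(pod string, ready bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.ready[pod] = ready
}

// ServeHTTP answers GET /v1/pods/<pod>/readinesscheck with 200 when the
// internal ready flag is true and 503 otherwise, so a load balancer can use
// a plain HTTP check even when the pod's own probe is an exec check.
func (p *readinessProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	parts := strings.Split(strings.Trim(r.URL.Path, "/"), "/")
	if len(parts) != 4 || parts[0] != "v1" || parts[1] != "pods" || parts[3] != "readinesscheck" {
		http.NotFound(w, r)
		return
	}
	p.mu.RLock()
	ready := p.ready[parts[2]]
	p.mu.RUnlock()
	if !ready {
		http.Error(w, "pod not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	proxy := &readinessProxy{ready: map[string]bool{}}
	proxy.setReady("jboss-1", false) // in the kubelet this would come from the prober
	log.Fatal(http.ListenAndServe(":10250", proxy))
}
```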

@smarterclayton smarterclayton added sig/network Categorizes an issue or PR as relevant to SIG Network. area/kubelet area/usability area/reliability labels Aug 30, 2015
@smarterclayton
Contributor Author

@kubernetes/goog-cluster @kubernetes/rh-cluster-infra

@lavalamp lavalamp added kind/design Categorizes issue or PR as related to design. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/node Categorizes an issue or PR as relevant to SIG Node. team/cluster labels Aug 31, 2015
@lavalamp
Member

> no guarantee that a load balancer watching endpoints sees the pod go back to not-ready (kubelet updating status -> endpoints controller -> load balancer watcher) before the pod starts listening on its port again.

Are you sure? Kubelet should send a not-ready status, then restart, then later a ready status. The endpoint controller should remove the pod from the list, then later re-add it. The only thing not guaranteed here would be the latency between death and removal.

If the stop and restart are so fast that they both happen in a single pass of the endpoint controller, then the pod would stay in the list. If that's a problem (why would it be?), then we could change the endpoint controller to remove it, then re-add it on the next loop; the endpoint controller should know that a restart happened.
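A rough sketch of that remove-then-re-add idea, with made-up types standing in for what the endpoints controller actually tracks (the real controller works from pod status, not these structs):

```go
package main

import "fmt"

// podSnapshot is a stand-in for the per-pod state the endpoints controller
// sees on each sync pass; the real controller works from PodStatus.
type podSnapshot struct {
	IP       string
	Ready    bool
	Restarts int // sum of container restart counts
}

// endpointsForSync drops a pod for one pass if it restarted since the last
// sync, even if it still reports Ready, and lets a later pass re-add it once
// its readiness has been re-established after the restart.
func endpointsForSync(prev, cur map[string]podSnapshot) []string {
	var addrs []string
	for name, pod := range cur {
		restarted := pod.Restarts > prev[name].Restarts
		if pod.Ready && !restarted {
			addrs = append(addrs, pod.IP)
		}
	}
	return addrs
}

func main() {
	prev := map[string]podSnapshot{"jboss-1": {IP: "10.0.1.5:8080", Ready: true, Restarts: 0}}
	cur := map[string]podSnapshot{"jboss-1": {IP: "10.0.1.5:8080", Ready: true, Restarts: 1}}
	fmt.Println(endpointsForSync(prev, cur)) // []: dropped this pass despite Ready=true
}
```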

@smarterclayton
Contributor Author

So the use case is JBoss, which comes up and starts listening on port 8080 within ~5-10 seconds. However, WAR loading takes longer than that - another 10-60s depending on how big your app is. The ready check guards that latter portion. So the sequence of events is:

  1. Pod starts first time, goes ready, added to endpoints list, load balancer sees endpoints change and puts pod in rotation
  2. Pod is SIGKILLED
  3. Client hits load balancer, load balancer tries to open connection, sees it can't, takes it out of rotation
  4. Kubelet observes pod death, starts new container, updates ready status to false
  5. Load balancer gets a new request, continues to use old endpoints list, sees that the port is open, sends request to pod
  6. JBoss rejects request with 404 because app isn't loaded
  7. Load balancer observes endpoints list update, removes endpoints from rotation.
  8. JBoss finishes and marks ready, propagates out

Essentially there is always a race between steps 4 and 5: depending on the various propagation delays, the load balancer can observe port 8080 open before it observes the endpoints list purging the pod. We can minimize the window, but it's not truly zero.
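For concreteness, the behavior in steps 3-5 corresponds to balancer logic roughly like the following (hypothetical, not any real balancer): a successful TCP dial against the stale endpoints list is treated as health, so the restarted pod looks fine again the moment port 8080 reopens, well before the WAR is loaded:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// backendLooksHealthy is the passive check that produces step 5: it only asks
// whether the port accepts connections, which JBoss does long before the
// application is actually loaded.
func backendLooksHealthy(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return false // step 3: connection refused while the container is down
	}
	conn.Close()
	return true // step 5: port reopened, even though readiness is still false
}

func main() {
	// Placeholder address; in the sequence above this is the restarted pod
	// still present in the balancer's stale endpoints list.
	fmt.Println(backendLooksHealthy("10.0.1.5:8080"))
}
```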

@thockin
Member

thockin commented Aug 31, 2015

I'll fight hard to NOT put kubelet downtime in the critical path for pods.

There's a natural propagation delay here, as you point out. Could we mitigate it by forcing the kubelet to POST the not-ready status BEFORE restarting the pod? At least LBs will have a chance to respond to not-ready before the port reopens...
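A rough sketch of that ordering, with fake types standing in for the kubelet's status writer and container runtime (the real kubelet interfaces look nothing like this):

```go
package main

import (
	"context"
	"fmt"
)

// fakeAPI stands in for the kubelet's status writer and container runtime.
type fakeAPI struct{}

func (fakeAPI) SetPodReady(ctx context.Context, pod string, ready bool) error {
	fmt.Printf("status write acked: pod %s ready=%v\n", pod, ready)
	return nil
}

func (fakeAPI) RestartContainer(ctx context.Context, pod, container string) error {
	fmt.Printf("restarting %s/%s (port can only reopen after this)\n", pod, container)
	return nil
}

// restartWithNotReadyFirst publishes Ready=false and waits for that write to
// succeed before restarting the container, giving watchers a head start on
// the not-ready signal. Propagation through the endpoints controller and
// load balancers is still asynchronous, so the window shrinks but isn't zero.
func restartWithNotReadyFirst(ctx context.Context, api fakeAPI, pod, container string) error {
	if err := api.SetPodReady(ctx, pod, false); err != nil {
		return err
	}
	return api.RestartContainer(ctx, pod, container)
}

func main() {
	_ = restartWithNotReadyFirst(context.Background(), fakeAPI{}, "jboss-1", "jboss")
}
```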


@smarterclayton
Contributor Author

Hrm - not-ready has to propagate through two layers asynchronously, so there's potentially unbounded latency. You'd have to define a minimum bounce latency in the load balancer, which makes you less tolerant of network flakes.
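A sketch of what such a minimum bounce latency could look like on the balancer side (entirely hypothetical logic); the trade-off is visible here, since a transient network flake also keeps the backend out of rotation for the full hold-down period:

```go
package main

import (
	"fmt"
	"time"
)

// bounceGuard keeps a backend out of rotation for at least holdDown after it
// was last seen failing, even if its port has already reopened.
type bounceGuard struct {
	holdDown    time.Duration
	lastFailure map[string]time.Time
}

func (g *bounceGuard) markFailed(addr string) {
	g.lastFailure[addr] = time.Now()
}

func (g *bounceGuard) eligible(addr string) bool {
	failed, ok := g.lastFailure[addr]
	if !ok {
		return true
	}
	return time.Since(failed) >= g.holdDown
}

func main() {
	g := &bounceGuard{holdDown: 30 * time.Second, lastFailure: map[string]time.Time{}}
	g.markFailed("10.0.1.5:8080")
	// Even if a TCP check to this backend succeeds now, it stays out of
	// rotation until the hold-down expires.
	fmt.Println(g.eligible("10.0.1.5:8080"))
}
```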


@cmluciano

Is this issue still relevant?

@thockin
Member

thockin commented May 17, 2017

I think this is an app problem - reporting ready when it is not.
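In other words, the app-level fix is to make "port open" and "ready" coincide. A minimal sketch, with a stand-in for the slow initialization (illustrative, not JBoss-specific):

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

// loadApplication stands in for the slow part of startup (the 10-60s WAR
// load in the JBoss example above).
func loadApplication() {
	time.Sleep(2 * time.Second)
}

func main() {
	// Do all slow initialization before binding the port, so the port only
	// opens once the process can actually serve traffic.
	loadApplication()

	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.Serve(ln, nil))
}
```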
