Support health (readiness) checks #620

Closed
dbcode opened this Issue Jul 25, 2014 · 6 comments

@dbcode
dbcode commented Jul 25, 2014

GCE network load balancers can be configured with health checks (periodic HTTP requests to a user-defined endpoint), such that instances are removed from the pool if they don't respond to the health checks promptly with a 200 status code.

Kubernetes should be able to reuse the same health checks, so that if a user has created a service they wish to use from Kubernetes, their health checks do what they expect: any unhealthy instance is removed from the load-balancing pool until it is healthy again.

Ideally, if N frontends talk to M backends, this should not result in N x M health check HTTP requests per interval (i.e. each of the N frontends independently health checking each of the M backends). If that's not possible, maybe Kubernetes could transparently create and use a GCE network load balancer for each service that has more than a certain number of replicas (whether marked as "external" or not), instead of trying to do its own load balancing.
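For concreteness, here is a minimal sketch (in Go; the /healthy path and port 8080 are illustrative, not from this issue) of the kind of user-defined endpoint such a health check would poll. The checker sends a GET every interval and keeps the instance in the pool only while it gets a prompt 200 back.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// The load balancer's health checker polls this endpoint periodically;
	// anything other than a prompt 200 takes the instance out of the pool.
	http.HandleFunc("/healthy", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```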

@brendandburns
Contributor

This is already basically possible.

The kubelet implements HTTP health checks and restarts the container if its check is failing, so no task should actually stay unhealthy for very long. This means that for M backends you only do M health checks.

Taking it a step further, we could consider adding health checks to the Service polling, but in some ways that seems redundant, since only healthy tasks should be in the service pool anyway.
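As a rough illustration of the pattern described here (a sketch, not kubelet source; the URL, interval, and timeout are made up), each node runs one local prober per backend container, which is why M backends cost only M probe requests per interval no matter how many frontends exist:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// probe reports whether the endpoint answered 200 within the timeout.
func probe(url string, timeout time.Duration) bool {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// One prober per local backend container; a failing liveness check is
	// the point at which the kubelet would restart that container.
	for range time.Tick(5 * time.Second) {
		if !probe("http://localhost:8080/healthy", 2*time.Second) {
			log.Println("liveness check failed; container would be restarted")
		}
	}
}
```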

@dbcode
dbcode commented Jul 25, 2014

Can you clarify for me - let's say I have a cluster of 100 frontend containers using a backend service with 200 containers, and I have an HTTP health check on the backend service polling the URL "/healthy" every 5 seconds. How many requests to /healthy does each backend instance (container) see every 5 seconds?

Also, is restarting the container something that can be configured? I may not want to restart the container; e.g. on instance migration, I might want to just remove it from the LB pool 30-60 seconds before the migration takes place, and then put it back in once the migration is complete (thus minimizing broken connections).

@brendandburns
Contributor

The current health check is a "liveness" health check that is performed at the backend container. Thus, each of your 200 backends would see only one health check every 5 seconds.

It is important to note that this is not a "readiness" health check, which would indicate that the backend is ready to serve traffic. We don't currently have a notion of "readiness", but we will add it eventually. When we do, we'll implement it the same way, so that the health check stays at the level of the backend controller rather than the frontend service, and health checks remain 1:1 with each backend container.

For your second point, that's exactly the reason for differentiating between liveness and readiness.
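A sketch of that split, applied to the migration scenario above (the paths, port, and the /drain and /undrain hooks are hypothetical, not an existing API): liveness stays 200 as long as the process should keep running, while readiness is flipped to 503 during a planned drain so the backend leaves the pool without being restarted, and returns to the pool once the drain flag is cleared.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

func main() {
	var draining atomic.Bool

	// Liveness: only restart-worthy failures should make this non-200.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: whether this backend should receive traffic right now.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Hypothetical admin hooks: drain before a migration, undrain afterwards.
	http.HandleFunc("/drain", func(w http.ResponseWriter, r *http.Request) {
		draining.Store(true)
	})
	http.HandleFunc("/undrain", func(w http.ResponseWriter, r *http.Request) {
		draining.Store(false)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```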

@bgrant0607 bgrant0607 changed the title from Support health checks in Kubernetes load balancing pools to Support health (readiness) checks in Kubernetes load balancing pools Jul 25, 2014
@bgrant0607
Member

There are many scenarios where it is useful to differentiate between liveness and readiness:

  • Graceful draining
  • Startup latency
  • Offline for data reloading or other maintenance

And many components care about readiness (any system that disrupts pods and/or hosts, and any system that manages sets of pods): rollout tools, reschedulers, kernel updaters, worker pool managers, ...
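The startup-latency and data-reloading items above reduce to the same pattern from the process's point of view. A sketch (the 30-second sleep is a stand-in for real initialization such as loading a dataset or warming caches): the process is live from the moment it starts, but only reports ready once its initial load has finished.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var ready atomic.Bool

	// Stand-in for slow startup work: loading data, warming caches, etc.
	go func() {
		time.Sleep(30 * time.Second)
		ready.Store(true)
	}()

	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "still initializing", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```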

@bgrant0607
Member

Readiness information would also be useful during rolling service updates.

@bgrant0607 bgrant0607 referenced this issue in smarterclayton/kubernetes Sep 12, 2014
@smarterclayton smarterclayton Proposal for v1beta3 API
* Separate metadata from objects
* Identify current state of objects consistently
* Introduce BoundPod(s) as distinct from Pod to represent pods
  scheduled onto a host
* Use "spec" instead of "state"
* Rename Minion -> Node
* Add UID and Annotations on Metadata
* Treat lists differently from resources
* Remove ContainerManifest
d695810
@bgrant0607 bgrant0607 changed the title from Support health (readiness) checks in Kubernetes load balancing pools to Support health (readiness) checks Oct 2, 2014
@bgrant0607 bgrant0607 added this to the v0.9 milestone Oct 4, 2014
@bgrant0607 bgrant0607 added priority/P3 and removed priority/P2 labels Jan 9, 2015
@goltermann goltermann removed this from the v0.9 milestone Feb 6, 2015
@dbcode dbcode removed this from the v0.9 milestone Feb 6, 2015
@bgrant0607
Member

Readiness has been implemented. Yeah! Kudos to @mikedanese.
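For readers landing here later: in the API that eventually shipped, readiness probes are configured per container alongside liveness probes. A sketch using the k8s.io/api core/v1 Go types (the paths, port, periods, and thresholds are illustrative, not recommendations):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Readiness gates traffic to the pod; liveness gates container restarts.
	readiness := corev1.Probe{PeriodSeconds: 5, FailureThreshold: 3}
	readiness.HTTPGet = &corev1.HTTPGetAction{
		Path: "/ready",
		Port: intstr.FromInt(8080),
	}

	liveness := corev1.Probe{PeriodSeconds: 10, FailureThreshold: 3}
	liveness.HTTPGet = &corev1.HTTPGetAction{
		Path: "/healthz",
		Port: intstr.FromInt(8080),
	}

	container := corev1.Container{
		Name:           "backend",
		Image:          "example.com/backend:latest",
		ReadinessProbe: &readiness,
		LivenessProbe:  &liveness,
	}
	fmt.Printf("%+v\n", container)
}
```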

@bgrant0607 bgrant0607 closed this Feb 14, 2015
@metadave metadave pushed a commit to metadave/kubernetes that referenced this issue Feb 22, 2017
@redbaron @mgoodness redbaron + mgoodness Don't run Zookeeper as a shell subprocess (#620)
* Don't run Zookeeper as a shell subprocess
* Version bump
4c43bd2
@stephanwesten stephanwesten referenced this issue in kubernetes/kubernetes.github.io Mar 16, 2017
Issue with /docs/user-guide/replication-controller/ #2859
