Liveness Probes: mention that they can worsen app availability #16607

Open
hjacobs opened this issue Sep 29, 2019 · 11 comments

@hjacobs

commented Sep 29, 2019

This is a Feature Request

Change request for: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

What would you like to be added

I would like a clear warning on the Configure Liveness, Readiness and Startup Probes page that liveness probes can make things worse for applications: they can lead to cascading failures (e.g. in high-load situations when the health endpoint no longer responds and app restarts take a long time).

Some proposed text (I'm open to anything better):

Please note that liveness probes can lead to cascading failures,
e.g. causing excessive downtime due to container restarts in high-load situations.
Understand the difference between readiness and liveness probes
and when to apply them for your app.

Why is this needed

Kubernetes documentation pages are often read by inexperienced app developers who are not familiar with Kubernetes. The current page only mentions Liveness Probes as a way to increase availability for containers which get stuck, but does not say anything about the danger of using Liveness Probes. I observe app developers often using Liveness Probes in the same way as Readiness Probes (sometimes even with the exact same probe settings), which will cause more harm than good.

For more context and Zalando's recommendations for app developers, see my blog post: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
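To make the cascading-failure scenario from this request concrete, here is a toy simulation. It is only a sketch under loud assumptions: load is spread evenly over the serving replicas, an overloaded replica stops answering its health endpoint, and a restart takes a fixed number of ticks. The function name `simulate` and all numbers are hypothetical and are not taken from Kubernetes or from the blog post.

```python
def simulate(replicas, capacity, load, restart_ticks, ticks, kill_on_overload):
    """Return the number of serving replicas at each tick (toy model).

    down[i] is how many ticks replica i still needs before it serves again.
    """
    down = [0] * replicas
    down[0] = restart_ticks  # trigger: one replica starts out restarting
    history = []
    for _ in range(ticks):
        serving = [i for i in range(replicas) if down[i] == 0]
        history.append(len(serving))
        # Load is shared evenly by whatever is still serving.
        per_replica = load / len(serving) if serving else float("inf")
        # Restarting replicas move one tick closer to serving again.
        for i in range(replicas):
            if down[i] > 0:
                down[i] -= 1
        # Assumption: an overloaded replica fails its liveness probe and is restarted.
        if kill_on_overload and per_replica > capacity:
            for i in serving:
                down[i] = restart_ticks
    return history


# 4 replicas, each able to handle 100 req/s, total load 360 req/s, 3-tick restarts.
print(simulate(4, 100, 360, 3, 8, kill_on_overload=True))   # [3, 0, 0, 1, 3, 0, 0, 1]
print(simulate(4, 100, 360, 3, 8, kill_on_overload=False))  # [3, 3, 3, 4, 4, 4, 4, 4]
```

In this model, restarting overloaded replicas never lets the deployment return to four serving replicas, while leaving them running (degraded, and possibly marked not-ready) lets it recover once the first replica is back.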

@hjacobs

Author

commented Sep 29, 2019

This was triggered by this Tweet, which suggests that reading the docs is enough (it is not, because the pitfalls are not mentioned): https://twitter.com/Guillaume_Swiss/status/1178258781152563200

@hjacobs

Author

commented Sep 29, 2019

I don't want to add "never use liveness probes", but I think the documentation page is currently unbalanced: it mentions only the availability benefit of Liveness Probes and none of the downsides.

@hjacobs

Author

commented Sep 29, 2019

Note that the answer to "When should you use a liveness probe?" also does not provide any hints on potential dangers:

If the process in your Container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod’s restartPolicy.

If you’d like your Container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.

For the naive application developer this might sound good ("restart my container on failure? yes, I want this of course!"), but the documentation does not mention that there is no coordination across Pods and that PodDisruptionBudgets (PDBs) are not respected, i.e. that often all containers are restarted due to some external event or dependency (e.g. high load, a health check on a DB that has a hiccup, etc.).
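One way to avoid restarts caused by an external dependency is to keep dependency checks out of the liveness endpoint entirely. The following is a minimal sketch, not a recommended implementation: the paths `/livez` and `/readyz`, the helper `database_is_reachable`, and port 8080 are hypothetical names chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def database_is_reachable() -> bool:
    """Hypothetical dependency check; a real one would ping the database."""
    return True


def probe_status(path: str, dependency_ok: Callable[[], bool]) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        # Liveness: answering at all shows the process is not stuck.
        # Deliberately no dependency check, so a DB hiccup never triggers a restart.
        return 200
    if path == "/readyz":
        # Readiness: a failing dependency only takes this Pod out of rotation.
        return 200 if dependency_ok() else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, database_is_reachable))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocks forever; call this from the application's entry point."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

With this split, a database outage makes every replica not-ready at the same time, but none of them is restarted, so they can all resume serving as soon as the database recovers.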

@thockin self-assigned this Sep 29, 2019

@sftim

Contributor

commented Sep 30, 2019

This is a valid issue
/priority backlog

@thockin

Member

commented Sep 30, 2019

I'm OK to add something, but I'd rather phrase it in a more positive light. E.g.

Liveness probes can be a powerful way to recover from application failures, but they should be used with caution. Liveness probes must be configured carefully to ensure that they truly indicate unrecoverable application failure, for example a deadlock. A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold. This ensures that the pod is observed as not-ready for some period of time before it is hard killed.

Incorrect configuration of liveness probes can lead to cascading failures. For example, killing a pod that is under high load, rather than one that has actually crashed, can lead to failed client requests or to traffic being shifted onto other pods in the same deployment or service, thereby overloading them.

Understand the difference between readiness and liveness probes and when to apply them for your app.
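The effect of a higher `failureThreshold` on the liveness probe can be estimated with simple arithmetic. The sketch below assumes both probes hit the same endpoint and uses the real probe fields `periodSeconds` and `failureThreshold`; the endpoint path `/healthz` and the concrete values are illustrative assumptions, and the estimate ignores `timeoutSeconds`, `initialDelaySeconds` and probe scheduling jitter.

```python
def seconds_until_action(period_seconds: int, failure_threshold: int) -> int:
    """Rough upper bound on continuous failure time before the kubelet acts."""
    return period_seconds * failure_threshold


# Illustrative probe settings (field names mirror the Pod spec; values are assumptions).
readiness = {"path": "/healthz", "periodSeconds": 10, "failureThreshold": 3}
liveness = {"path": "/healthz", "periodSeconds": 10, "failureThreshold": 12}

not_ready_after = seconds_until_action(
    readiness["periodSeconds"], readiness["failureThreshold"]
)
restarted_after = seconds_until_action(
    liveness["periodSeconds"], liveness["failureThreshold"]
)

# Time the Pod spends not-ready (no traffic) before it would be restarted.
grace = restarted_after - not_ready_after
print(not_ready_after, restarted_after, grace)  # 30 120 90
```

With these example numbers, a Pod whose endpoint stops responding is taken out of rotation after roughly 30 seconds and has about 90 more seconds to recover without traffic before the container is restarted.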

@sftim

Contributor

commented Oct 1, 2019

My gut feeling is that there's enough material here for a new Task page, aimed at developers (think CKAD), that describes how to design readiness, liveness and startup probes for your workload.

The website has a lot of task pages, but (IMO) that's because there are a lot of different tasks that readers might want to do.

@hjacobs

Author

commented Oct 1, 2019

@thockin your text proposal LGTM 😄

@thockin

Member

commented Oct 1, 2019

@szuecs

Member

commented Oct 1, 2019

Thanks @thockin, very good proposed text: it doesn’t hide the problem, and it gives guidance on how and when to use liveness probes.

@hjacobs

Author

commented Oct 1, 2019

I'm happy to do a PR, but I'm not sure if it should be after the second paragraph or somewhere else (?).

@szuecs

Member

commented Oct 1, 2019

I would add a short paragraph about liveness and readiness probes just before the more verbose sections that show the how.
Maybe https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes would be a better place, with a link to it from the configuration page.
