
Panic Threshold for Readiness Probe Failures #115917

Closed
eightnoteight opened this issue Feb 21, 2023 · 20 comments

@eightnoteight commented Feb 21, 2023

What would you like to be added?

When using external dependencies inside healthchecks, or when running fixed-thread-pool applications like python+gunicorn, ruby+passenger, PHP, etc., upstream service failures tend to cause a cascading impact on the service as well.

One solution is removing the upstream services from the healthchecks, and that's how we are handling it currently, but that still doesn't cover the case where healthchecks fail due to excessive load: a sudden 2x spike in requests can push some pods to capacity, so they start failing readiness probes, which in turn causes a cascading impact where the scheduler removes the pods from the service.
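
(Illustration, not from the issue: a minimal sketch of such a local-only healthcheck, assuming a Flask app; the route name is arbitrary.)

```python
# Hypothetical sketch of a local-only readiness endpoint: it reports
# process health without calling any upstream dependency, so an
# upstream outage cannot fail this probe.
from flask import Flask

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Only local signals here; deliberately no calls to databases,
    # caches, or downstream services.
    return "ok", 200
```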

We faced similar issues with Envoy, and Envoy's unhealthy panic threshold helped a lot: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/panic_threshold
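
(For reference, this is roughly what enabling it looks like in an Envoy cluster config; the cluster name is illustrative.)

```yaml
# Envoy cluster fragment (sketch): once fewer than healthy_panic_threshold
# percent of hosts are healthy, Envoy ignores health status and balances
# across all hosts instead of piling traffic onto the few healthy ones.
clusters:
  - name: upstream_service        # illustrative name
    common_lb_config:
      healthy_panic_threshold:
        value: 50.0               # percent; 50 is Envoy's default
```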

The feature has a lot of moving parts though, i.e. it shouldn't trigger during normal scenarios like deployments, rotating the pods of a service, or interruptions in the underlying EC2 instances that replace pods in significant chunks.

References to similar outages:

  1. https://loft.sh/blog/kubernetes-readiness-probes-examples-and-common-pitfalls/#cascading-failures
  2. https://doordash.engineering/2022/08/09/how-to-handle-kubernetes-health-checks/

Why is this needed?

To avoid cascading outages when more than 30-40% of the pods of a deployment start failing readiness probes.

@eightnoteight added the kind/feature label Feb 21, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels Feb 21, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eightnoteight changed the title from "Readiness Probe Failure Panic Threshold" to "Panic Threshold for Readiness Probe Failures" Feb 21, 2023
@SergeyKanzhelev (Member)

/sig node

It's a hard issue to tackle. Do you have suggestions for how we could solve it generically? Probes are designed to be local and not account for any external information like deployment size.

One way to solve it in the application is to expose a completely separate port with a separate thread pool just for readiness checks. Another is to use an HPA based on the load the app experiences.
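
(A rough sketch of the first suggestion, with hypothetical port and class names: readiness served from its own listener and thread, isolated from the worker pool that handles traffic.)

```python
# Hypothetical sketch: readiness served on a dedicated port and thread,
# so a saturated worker pool cannot block the probe.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def start_health_server(port: int = 8081) -> None:
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    # Daemon thread: serves probes independently of the app's workers
    # and never blocks process shutdown.
    threading.Thread(target=server.serve_forever, daemon=True).start()
```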

@k8s-ci-robot added the sig/node label and removed the needs-sig label Feb 21, 2023
@eightnoteight (Author) commented Feb 21, 2023

> One way to solve it in the application is to expose a completely separate port with a separate thread pool just for readiness checks. Another is to use an HPA based on the load the app experiences.

Ha, yes. For some of our microservices we already do this, but languages like Python and Ruby come with a lot of per-thread memory overhead, so by having a separate thread pool for healthchecks we have to provision a lot more memory.

> Probes are designed to be local and not account for any external information like deployment size.

Ooh, thinking just in terms of the readiness probe: since the pod is getting removed from the k8s Service, that would be centralised logic, right? I.e. the Service resource could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes.

I'm guessing the liveness probe can't account for such external information, since the kubelet decides locally to restart the container.

@eightnoteight (Author)

> Another is to use an HPA based on the load the app experiences.

I don't think this will help much, because once the deployment enters this state it is already cascading: for x load we need y pods. If only a fraction of the y pods are healthy, they immediately get bombarded with all the traffic and go unhealthy too; the old ones recover, take all the traffic, and fail again. The incident cascades like this, and the only way to recover is to cut off all the load and reintroduce it slowly.

@SergeyKanzhelev (Member)

/sig network

For this proposal:

> I.e. the Service resource could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes.

I don't think it is generalizable enough, as it will be hard to distinguish a totally broken service from one that could come back if only we kept it in rotation.

Two other alternatives (both sketched below):

  • Switch to a TCP readiness check. TCP does not depend on language-specific pool sizes and gives a reasonable readiness signal. A startup probe can ensure the service is actually up and running.
  • Write a status message to a file, and have an exec probe check that file and validate the message is fresh. That way probes consume no HTTP resources at all.
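
(Sketches of both alternatives as probe configs; ports, paths, and the 15s freshness window are illustrative, and the exec variant assumes a GNU- or busybox-style stat in the image.)

```yaml
# Alternative 1 (sketch): TCP readiness plus an HTTP startup probe.
readinessProbe:
  tcpSocket:
    port: 8080          # passes once the socket accepts connections
startupProbe:
  httpGet:
    path: /healthz
    port: 8080          # gates initial readiness on the app actually serving
---
# Alternative 2 (sketch): the app touches /tmp/healthy periodically;
# the probe passes only if the file was modified within the last 15s,
# so probing consumes no HTTP worker at all.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test $(( $(date +%s) - $(stat -c %Y /tmp/healthy) )) -lt 15
```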

@k8s-ci-robot added the sig/network label Feb 21, 2023
@sftim (Contributor) commented Feb 21, 2023

Maybe the Gateway project could consider ways to help here, given some changes to in-tree code.

For example:

  • Service B selects the same Pods as Service A, but with a different rule for including unhealthy backends.
  • Send the traffic to Service A where possible, but if Service A marks itself as degraded then send 90% of the traffic to Service B and 10% to Service A.
  • If neither Service has any endpoints, reverse-proxy incoming HTTP traffic to https://external-sorry-page.example/
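
(One way to express Service B's "different rule" with a field that exists today is publishNotReadyAddresses, which makes a Service publish endpoints even when they fail readiness. A sketch, with illustrative names:)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  selector:
    app: my-app                    # same selector as Service A
  publishNotReadyAddresses: true   # also publish endpoints failing readiness
  ports:
    - port: 80
      targetPort: 8080
```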

@eightnoteight (Author)

> but if Service A marks itself as degraded then send 90% of the traffic to Service B and 10% to Service A.

Awesome solution. I'm just a bit confused about how the gateway would determine that Service A is degraded.

Just to confirm: is it based on the panic threshold? I.e. if 70% is the panic threshold and len(Service A) < 0.7 * len(Service B), then consider Service A degraded?
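
(In code, the decision being asked about would amount to something like this hypothetical sketch; it is not an existing API.)

```python
# Hypothetical panic-threshold decision: when too few endpoints are
# ready, ignore readiness and spread traffic over every endpoint.
def in_panic_mode(ready_endpoints: int, total_endpoints: int,
                  panic_threshold: float = 0.7) -> bool:
    if total_endpoints == 0:
        return False  # nothing to route to either way
    return ready_endpoints / total_endpoints < panic_threshold
```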

@eightnoteight (Author)

Just wanted to note another solution. @sftim, since outlier detection is already planned for the Gateway project, is it possible to use healthchecks as a source for outlier detection?

It would be something like: I won't set the readiness probe, but I'll set the outlier-detection probe.

Error-rate-based outlier detection is good, but it varies a lot between workloads, i.e. some workloads' baseline error rate is already high, so the outlier-detection mechanism needs to be tuned for each microservice.

@sftim (Contributor) commented Feb 25, 2023

> Just wanted to note another solution. @sftim, since outlier detection is already planned for the Gateway project, is it possible to use healthchecks as a source for outlier detection?
>
> It would be something like: I won't set the readiness probe, but I'll set the outlier-detection probe.
>
> Error-rate-based outlier detection is good, but it varies a lot between workloads, i.e. some workloads' baseline error rate is already high, so the outlier-detection mechanism needs to be tuned for each microservice.

I don't feel I can comment on this.

@aojea (Member) commented Feb 25, 2023

I think that Services should keep using Pod readiness as the source of truth. In this case it seems the problem is that the readiness probe is sending the wrong signal, but users can develop their own readiness strategy using readiness gates instead of adding a knob to skip readiness values:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/580-pod-readiness-gates

@thockin is the expert here; I'm sure he can expand more on this.

@eightnoteight (Author) commented Feb 26, 2023

> https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/580-pod-readiness-gates

@aojea I went through the doc. The major issue I see is that it uses an AND condition for overall readiness, i.e. a pod is ready only when both the local readiness check passes and the external readiness gates pass (please correct me if I'm wrong):

> Pod is ready == containers are ready AND conditions in ReadinessGates are True

But for central panic-threshold logic, the central readiness gate needs to override the local readiness check, i.e. even when the local readiness check is false, the pod should still be considered ready because the central readiness gate is passing.

Probably this extension could be added to the proposal, i.e. a readiness gate overriding the local behaviour.
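
(For context, a sketch of what a readiness gate looks like on a pod today; the condition type and image are illustrative. The gate is ANDed with container readiness, so as specced it can only make a pod less ready, never more.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod                # illustrative
spec:
  readinessGates:
    - conditionType: example.com/panic-threshold   # set by an external controller
  containers:
    - name: app
      image: registry.example/app:latest           # illustrative
```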

@aojea (Member) commented Feb 26, 2023

Unless I'm mixing things up, "containers are ready" means that the container is actually running, not that probes pass. If the container inside the pod is not running, it doesn't matter what you do: the app is dead.

From your description, I understand that you want a central probing mechanism rather than relying on the kubelet readiness probes; it sounds like you could do that using this feature.

@thockin (Member) commented Feb 27, 2023

> When using external dependencies inside healthchecks

Don't do that

> one solution is removing the upstream services from the healthchecks

Do that

> that still doesn't cover the case where healthchecks fail due to excessive load: a sudden 2x spike in requests can push some pods to capacity, so they start failing readiness probes

In a sense this is exactly what is intended. If your pod is SO busy that it can't service the healthcheck, I think you WANT LBs to stop sending it traffic, don't you?

> causes a cascading impact where the scheduler removes the pods from the service.

Not the scheduler, but yes.

> could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes.

In theory that is possible, but as you said above, "a lot of moving parts". We already have a fallback: if ALL pods are terminating, we ignore readiness. But if even 1 is ready, we respect readiness.

IIUC you're asking for that to a) be extended to readiness, and b) be a per-service knob?

This would probably have to be implemented in the service-proxy control planes. I say "planes" because there are several of them now: kube-proxy, kpng, antrea, Cilium, OVN, ... Every feature we add in this area comes at high cost and needs to be justified.

> Send the traffic to Service A where possible, but if Service A marks itself as degraded then send 90% of the traffic to Service B and 10% to Service A.

#56440 ?

@aojea - I don't think readinessGates can solve this; it's really about endpoint selection, if I understand correctly.

@aojea (Member) commented Feb 27, 2023

> @aojea - I don't think readinessGates can solve this; it's really about endpoint selection, if I understand correctly.

I may have misunderstood then. I thought he wanted to build a unified custom probe-handling logic... and this logic influences the endpoints, of course.

@thockin (Member) commented Mar 2, 2023

Should we call this a dup of #56440 ?

@shaneutt (Member) commented Mar 2, 2023

/assign @shaneutt

@SergeyKanzhelev (Member)

/remove-sig node

Since the idea of switching to more relaxed readiness checks is not working out here, I'm removing sig/node.

@k8s-ci-robot removed the sig/node label Mar 2, 2023
@shaneutt (Member) commented Mar 2, 2023

@eightnoteight, we discussed this in the SIG Network community meeting today; we had further questions and need a bit more information.

Our first question is whether the ability to provide a "backup selector", as referenced by @thockin above, would fulfill the need here: could you check out #56440 and let us know whether this issue can be considered a duplicate and be handled by that effort?

Secondly, we considered whether backup-selector-style functionality in Gateway API might help this kind of situation in the future (we feel Service is already overloaded, and we want to be very careful about what we add there). Are you familiar with Gateway API, and do you feel that "backup selector"-like functionality on route resources (e.g. TCPRoute) would be helpful in your case?

/triage needs-information

@k8s-ci-robot added the triage/needs-information label Mar 2, 2023
@shaneutt (Member)

Our current thinking has been that #56440 (or maybe more specifically kubernetes-sigs/gateway-api#1802) might be a good place to get functionality relevant to this situation, but since it's been a while since we've heard back, we'll consider this one closed for the time being.

/close

Please follow up in #56440 and kubernetes-sigs/gateway-api#1802 to advocate for your use case, or if you feel there's more follow-up needed here on this issue, please don't hesitate to drop a comment. 🖖

@k8s-ci-robot (Contributor)

@shaneutt: Closing this issue.

In response to this:

> Our current thinking has been that #56440 (or maybe more specifically kubernetes-sigs/gateway-api#1802) might be a good place to get functionality relevant to this situation, but since it's been a while since we've heard back, we'll consider this one closed for the time being.
>
> /close
>
> Please follow up in #56440 and kubernetes-sigs/gateway-api#1802 to advocate for your use case, or if you feel there's more follow-up needed here on this issue, please don't hesitate to drop a comment. 🖖

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
