
Panic Threshold for Readiness Probe Failures #115917

Closed
eightnoteight opened this issue Feb 21, 2023 · 20 comments

@eightnoteight commented Feb 21, 2023

What would you like to be added?

When using external dependencies inside healthchecks, or when running fixed-thread-pool applications like python+gunicorn, ruby+passenger, PHP, etc., upstream service failures tend to cause a cascading impact on the service as well.

One solution is removing the upstream services from the healthchecks, and that's how we are handling it currently, but that still doesn't cover the case where healthchecks fail due to excessive load: a sudden 2x spike in requests can push some pods to capacity, so they start failing readiness probes, which in turn causes a cascading impact where the scheduler removes the pods from the service.
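
(Illustration, not from the issue: a minimal sketch of such a local-only healthcheck, assuming a Flask app; the route name is arbitrary.)

```python
# Hypothetical sketch of a local-only readiness endpoint: it reports
# process health without calling any upstream dependency, so an
# upstream outage cannot fail this probe.
from flask import Flask

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Only local signals here; deliberately no calls to databases,
    # caches, or downstream services.
    return "ok", 200
```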

We faced similar issues with Envoy, and Envoy's unhealthy panic threshold helped a lot: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/panic_threshold
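
(For reference, this is roughly what enabling it looks like in an Envoy cluster config; the cluster name is illustrative.)

```yaml
# Envoy cluster fragment (sketch): once fewer than healthy_panic_threshold
# percent of hosts are healthy, Envoy ignores health status and balances
# across all hosts instead of piling traffic onto the few healthy ones.
clusters:
  - name: upstream_service        # illustrative name
    common_lb_config:
      healthy_panic_threshold:
        value: 50.0               # percent; 50 is Envoy's default
```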

The feature has a lot of moving parts though, i.e. it shouldn't trigger during normal scenarios like deployments, rotating the pods of a service, or interruptions in the underlying EC2 instances that replace pods in significant chunks.

References to similar outages:

  1. https://loft.sh/blog/kubernetes-readiness-probes-examples-and-common-pitfalls/#cascading-failures
  2. https://doordash.engineering/2022/08/09/how-to-handle-kubernetes-health-checks/

Why is this needed?

To avoid cascading outages when more than 30-40% of the pods of a deployment start failing readiness probes.

@eightnoteight added the kind/feature label Feb 21, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels Feb 21, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eightnoteight changed the title from "Readiness Probe Failure Panic Threshold" to "Panic Threshold for Readiness Probe Failures" Feb 21, 2023
@SergeyKanzhelev (Member)

/sig node

It's a hard issue to tackle. Do you have suggestions for how we could solve it generically? Probes are designed to be local and not account for any external information like deployment size.

One way to solve it in the application is to expose a completely separate port with a separate thread pool just for readiness checks. Another is to use an HPA based on the load the app experiences.
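
(A rough sketch of the first suggestion, with hypothetical port and class names: readiness served from its own listener and thread, isolated from the worker pool that handles traffic.)

```python
# Hypothetical sketch: readiness served on a dedicated port and thread,
# so a saturated worker pool cannot block the probe.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def start_health_server(port: int = 8081) -> None:
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    # Daemon thread: serves probes independently of the app's workers
    # and never blocks process shutdown.
    threading.Thread(target=server.serve_forever, daemon=True).start()
```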

@k8s-ci-robot added the sig/node label and removed the needs-sig label Feb 21, 2023
@eightnoteight (Author) commented Feb 21, 2023

> One way to solve it in the application is to expose a completely separate port with a separate thread pool just for readiness checks. Another is to use an HPA based on the load the app experiences.

Ha, yes. For some of our microservices we already do this, but languages like Python and Ruby come with a lot of per-thread memory overhead, so by having a separate thread pool for healthchecks we have to provision a lot more memory.

> Probes are designed to be local and not account for any external information like deployment size.

Ooh, thinking just in terms of the readiness probe: since the pod is getting removed from the k8s Service, that would be centralised logic, right? I.e. the Service resource could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes.

I'm guessing the liveness probe can't account for such external information, since the kubelet decides locally to restart the container.

@eightnoteight (Author)

> Another is to use an HPA based on the load the app experiences.

I don't think this will help much, because once the deployment enters this state it is already cascading: for x load we need y pods. If only a fraction of the y pods are healthy, they immediately get bombarded with all the traffic and go unhealthy too; the old ones recover, take all the traffic, and fail again. The incident cascades like this, and the only way to recover is to cut off all the load and reintroduce it slowly.

@SergeyKanzhelev (Member)

/sig network

For this proposal:

> I.e. the Service resource could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes.

I don't think it is generalizable enough, as it will be hard to distinguish a totally broken service from one that could come back if only we kept it in rotation.

Two other alternatives (both sketched below):

  • Switch to a TCP readiness check. TCP does not depend on language-specific pool sizes and gives a reasonable readiness signal. A startup probe can ensure the service is actually up and running.
  • Write a status message to a file, and have an exec probe check that file and validate the message is fresh. That way probes consume no HTTP resources at all.
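
(Sketches of both alternatives as probe configs; ports, paths, and the 15s freshness window are illustrative, and the exec variant assumes a GNU- or busybox-style stat in the image.)

```yaml
# Alternative 1 (sketch): TCP readiness plus an HTTP startup probe.
readinessProbe:
  tcpSocket:
    port: 8080          # passes once the socket accepts connections
startupProbe:
  httpGet:
    path: /healthz
    port: 8080          # gates initial readiness on the app actually serving
---
# Alternative 2 (sketch): the app touches /tmp/healthy periodically;
# the probe passes only if the file was modified within the last 15s,
# so probing consumes no HTTP worker at all.
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test $(( $(date +%s) - $(stat -c %Y /tmp/healthy) )) -lt 15
```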

@k8s-ci-robot added the sig/network label Feb 21, 2023
@sftim (Contributor) commented Feb 21, 2023

Maybe the Gateway project could consider ways to help here, given some changes to in-tree code.

For example:

  • Service B selects the same Pods as Service A, but with a different rule for including unhealthy backends.
  • Send the traffic to Service A where possible, but if Service A marks itself as degraded then send 90% of the traffic to Service B and 10% to Service A.
  • If neither Service has any endpoints, reverse-proxy incoming HTTP traffic to https://external-sorry-page.example/
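
(One way to express Service B's "different rule" with a field that exists today is publishNotReadyAddresses, which makes a Service publish endpoints even when they fail readiness. A sketch, with illustrative names:)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  selector:
    app: my-app                    # same selector as Service A
  publishNotReadyAddresses: true   # also publish endpoints failing readiness
  ports:
    - port: 80
      targetPort: 8080
```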

@eightnoteight (Author)

> but if Service A marks itself as degraded then send 90% of the traffic to Service B and 10% to Service A.

Awesome solution. I'm just a bit confused about how the gateway would determine that Service A is degraded.

Just to confirm: is it based on the panic threshold? I.e. if 70% is the panic threshold and len(Service A) < 0.7 * len(Service B), then consider Service A degraded?
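
(In code, the decision being asked about would amount to something like this hypothetical sketch; it is not an existing API.)

```python
# Hypothetical panic-threshold decision: when too few endpoints are
# ready, ignore readiness and spread traffic over every endpoint.
def in_panic_mode(ready_endpoints: int, total_endpoints: int,
                  panic_threshold: float = 0.7) -> bool:
    if total_endpoints == 0:
        return False  # nothing to route to either way
    return ready_endpoints / total_endpoints < panic_threshold
```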

@eightnoteight (Author)

Just wanted to note another solution. @sftim, since outlier detection is already planned for the Gateway project, is it possible to use healthchecks as a source for outlier detection?

It would be something like: I won't set the readiness probe, but I'll set the outlier-detection probe.

Error-rate-based outlier detection is good, but it varies a lot between workloads, i.e. some workloads' baseline error rate is already high, so the outlier-detection mechanism needs to be tuned for each microservice.

@sftim (Contributor) commented Feb 25, 2023

> Just wanted to note another solution. @sftim, since outlier detection is already planned for the Gateway project, is it possible to use healthchecks as a source for outlier detection?
>
> It would be something like: I won't set the readiness probe, but I'll set the outlier-detection probe.
>
> Error-rate-based outlier detection is good, but it varies a lot between workloads, i.e. some workloads' baseline error rate is already high, so the outlier-detection mechanism needs to be tuned for each microservice.

I don't feel I can comment on this.

@aojea (Member) commented Feb 25, 2023

I think that Services should keep using Pod readiness as the source of truth. In this case it seems the problem is that the readiness probe is sending the wrong signal, but users can develop their own readiness strategy using readiness gates instead of adding a knob to skip readiness values:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/580-pod-readiness-gates

@thockin is the expert here; I'm sure he can expand more on this.

@eightnoteight (Author) commented Feb 26, 2023

> https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/580-pod-readiness-gates

@aojea I went through the doc. The major issue I see is that it uses an AND condition for overall readiness, i.e. a pod is ready only when both the local readiness check passes and the external readiness gates pass (please correct me if I'm wrong):

> Pod is ready == containers are ready AND conditions in ReadinessGates are True

But for central panic-threshold logic, the central readiness gate needs to override the local readiness check, i.e. even when the local readiness check is false, the pod should still be considered ready because the central readiness gate is passing.

Probably this extension could be added to the proposal, i.e. a readiness gate overriding the local behaviour.
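
(For context, a sketch of what a readiness gate looks like on a pod today; the condition type and image are illustrative. The gate is ANDed with container readiness, so as specced it can only make a pod less ready, never more.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod                # illustrative
spec:
  readinessGates:
    - conditionType: example.com/panic-threshold   # set by an external controller
  containers:
    - name: app
      image: registry.example/app:latest           # illustrative
```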

@aojea (Member) commented Feb 26, 2023

Unless I'm mixing things up, "containers are ready" means that the container is actually running, not that probes pass. If the container inside the pod is not running, it doesn't matter what you do: the app is dead.

From your description, I understand that you want a central probing mechanism rather than relying on the kubelet readiness probes; it sounds like you could do that using this feature.

@thockin (Member) commented Feb 27, 2023

> When using external dependencies inside healthchecks

Don't do that

> one solution is removing the upstream services from the healthchecks

Do that

> that still doesn't cover the case where healthchecks fail due to excessive load: a sudden 2x spike in requests can push some pods to capacity, so they start failing readiness probes

In a sense this is exactly what is intended. If your pod is SO busy that it can't service the healthcheck, I think you WANT LBs to stop sending it traffic, don't you?

> causes a cascading impact where the scheduler removes the pods from the service.

Not the scheduler, but yes.

> could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes.

In theory that is possible, but as you said above, "a lot of moving parts". We already have a fallback: if ALL pods are terminating, we ignore readiness. But if even 1 is ready, we respect readiness.

IIUC you're asking for that to a) be extended to readiness, and b) be a per-service knob?

This would probably have to be implemented in the service-proxy control planes. I say "planes" because there are several of them now: kube-proxy, kpng, antrea, Cilium, OVN, ... Every feature we add in this area comes at high cost and needs to be justified.

> Send the traffic to Service A where possible, but if Service A marks itself as degraded then send 90% of the traffic to Service B and 10% to Service A.

#56440 ?

@aojea - I don't think readinessGates can solve this; it's really about endpoint selection, if I understand correctly.

@aojea (Member) commented Feb 27, 2023

> @aojea - I don't think readinessGates can solve this; it's really about endpoint selection, if I understand correctly.

I may have misunderstood then. I thought he wanted to build a unified custom probe-handling logic... and this logic influences the endpoints, of course.

@thockin (Member) commented Mar 2, 2023

Should we call this a dup of #56440 ?

@shaneutt (Member) commented Mar 2, 2023

/assign @shaneutt

@SergeyKanzhelev (Member)

/remove-sig node

Since the idea of switching to more relaxed readiness checks is not working out here, I'm removing sig/node.

@k8s-ci-robot removed the sig/node label Mar 2, 2023
@shaneutt (Member) commented Mar 2, 2023

@eightnoteight, we discussed this in the SIG Network community meeting today; we had further questions and need a bit more information.

Our first question is whether the ability to provide a "backup selector", as referenced by @thockin above, would fulfill the need here: could you check out #56440 and let us know whether this issue can be considered a duplicate and be handled by that effort?

Secondly, we considered whether backup-selector-style functionality in Gateway API might help this kind of situation in the future (we feel Service is already overloaded, and we want to be very careful about what we add there). Are you familiar with Gateway API, and do you feel that "backup selector"-like functionality on route resources (e.g. TCPRoute) would be helpful in your case?

/triage needs-information

@k8s-ci-robot added the triage/needs-information label Mar 2, 2023
@shaneutt (Member)

Our current thinking has been that #56440 (or maybe more specifically kubernetes-sigs/gateway-api#1802) might be a good place to get functionality relevant to this situation, but since it's been a while since we've heard back, we'll consider this one closed for the time being.

/close

Please follow up in #56440 and kubernetes-sigs/gateway-api#1802 to advocate for your use case, or if you feel there's more follow-up needed here on this issue, please don't hesitate to drop a comment. 🖖

@k8s-ci-robot (Contributor)

@shaneutt: Closing this issue.

In response to this:

> Our current thinking has been that #56440 (or maybe more specifically kubernetes-sigs/gateway-api#1802) might be a good place to get functionality relevant to this situation, but since it's been a while since we've heard back, we'll consider this one closed for the time being.
>
> /close
>
> Please follow up in #56440 and kubernetes-sigs/gateway-api#1802 to advocate for your use case, or if you feel there's more follow-up needed here on this issue, please don't hesitate to drop a comment. 🖖

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
