Panic Threshold for Readiness Probe Failures #115917
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
/sig node

It's a hard issue to tackle. Do you have any suggestions for how we can solve it generically? Probes are designed to be local and not account for any external information like a deployment's size. One way to solve it in the application is to expose a completely separate port with a separate thread pool just for readiness checks. Another is to use an HPA that is based on the load the app experiences.
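A minimal sketch of that dedicated-port pattern (image name, port numbers, and path are all illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: example/app:latest    # illustrative image
    ports:
    - containerPort: 8080        # application traffic
    - containerPort: 8081        # health-only listener backed by its own thread pool
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8081               # probe never competes with request-serving threads
      periodSeconds: 10
      failureThreshold: 3
```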
Ha yes, for some of our microservices we already do this, but languages like Python and Ruby come with a lot of memory overhead per thread, so by having a separate thread pool for healthchecks we have to provision a lot more memory.
Ooh, thinking just in terms of the readiness probe: since the pod is getting removed from the k8s Service, it would be centralised logic, right? I.e. the Service resource could choose not to remove a pod if more than 20% of the pods in the Service are failing readiness probes. I'm guessing the liveness probe can't account for such external information, as the kubelet would decide to restart the container.
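Purely as an illustration of that idea, such a knob might look like the sketch below. Note that `panicThreshold` is a hypothetical field; nothing like it exists in the Service API today:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
  # HYPOTHETICAL field, not part of Kubernetes: once fewer than 80%
  # of endpoints pass readiness, stop removing pods from the Service
  # and balance across all of them instead.
  panicThreshold: 80
```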
I don't think this will help much, because once the deployment goes into this state it is already cascading. For x load we need y pods; if only a fraction of y pods are healthy, they immediately get bombarded with all the traffic and go unhealthy too, while the old ones become healthy again. But putting all the traffic on a fraction of the pods makes them unhealthy in turn, so the incident keeps cascading, and the only way to recover is to cut off all the load and reintroduce it slowly.
/sig network

For this proposal: I don't think it is generalizable enough, as it will be hard to reason about a totally broken service vs. one that can come back if only we keep it alive. Another two alternatives:
Maybe the Gateway project could consider ways to help here, given some changes to in-tree code. For example: Service B selects the same Pods as Service A, but with a different rule for including unhealthy backends. If neither Service has any endpoints, reverse-proxy incoming HTTP traffic to https://external-sorry-page.example/
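A rough sketch of the two-Service idea using what exists today. `publishNotReadyAddresses` is a real Service field that publishes endpoints regardless of readiness; the fallback routing between the two Services (and to the sorry page) would still need gateway or controller logic that does not exist in-tree:

```yaml
# Service A: normal routing, only ready endpoints receive traffic
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
---
# Service B: same selector, but also publishes not-ready endpoints,
# so it can serve as a last-resort backend when Service A is empty
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  selector:
    app: my-app
  publishNotReadyAddresses: true
  ports:
  - port: 80
    targetPort: 8080
```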
Awesome solution. Just a bit confused on how the gateway would determine that Service A is degraded. Just to confirm: is it based on the panic threshold, i.e. if 70% is the panic threshold and …
Just wanted to note another solution. @sftim, since outlier detection is already planned for the Gateway project, is it possible to use healthchecks as a source for outlier detection? It would be something like: I won't set the readiness probe, but I'll set the outlier detection probe. Error-rate-based outlier detection is good, but it also varies a lot between workloads, i.e. some workloads' baseline error rate is already high, so the outlier detection mechanism needs to be tuned for each microservice.
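For context, error-based outlier detection on an Envoy cluster looks roughly like this (values illustrative). `max_ejection_percent` is the knob that bounds how much of the fleet can be ejected at once, which is the same concern the panic threshold addresses:

```yaml
clusters:
- name: my-app
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: my-app
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: my-app.default.svc, port_value: 8080 }
  outlier_detection:
    consecutive_5xx: 5          # eject a host after 5 consecutive 5xx responses
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 20    # never eject more than 20% of the hosts
```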
I don't feel I can comment on this.
I think that Services should keep using Pod readiness as the source of truth. In this case it seems the problem is that the readiness probe is sending the wrong signal, but users can develop their own readiness strategy using readiness gates, instead of adding a knob to skip readiness values: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/580-pod-readiness-gates

@thockin is the expert here, I'm sure he can expand more on this.
@aojea I went through the doc. The major issue I see is that the doc treats readiness gates as extra conditions that are ANDed with the local readiness check, so a gate can only make a ready pod not-ready.

But for the panic-threshold central logic, the central readiness gate needs to override the local readiness check, i.e. even when the local readiness check is false, the pod should still be considered ready because the central readiness gate is passing. Probably this extension could be added to the proposal, i.e. a readiness gate overriding the local behaviour.
Unless I'm mixing things up, from your description I understand that you want to achieve a central probing mechanism and not rely on the kubelet readiness probes; it sounds like you can do that using this feature.
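For reference, a minimal sketch of a pod using a readiness gate; the condition type is illustrative, and an external controller would have to patch the `example.com/central-ready` condition into the pod's status. As discussed above, the gate is ANDed with the kubelet probe, so it can only make the pod less ready, not override a failing local probe:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  readinessGates:
  - conditionType: "example.com/central-ready"   # illustrative condition type
  containers:
  - name: app
    image: example/app:latest                    # illustrative image
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
```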
Don't do that.

Do that.

In a sense this is exactly what is intended. If your pod is SO busy that it fails its readiness probe, taking it out of rotation is the system working as designed.

Not the scheduler, but yes.

In theory that is possible, but as you said above, "a lot of moving parts".

IIUC you're asking for that to a) be extended to readiness; and b) be a threshold rather than all-or-nothing. This would probably have to be implemented in the service proxy control plane.

#56440?

@aojea - I don't think readinessGates can solve this - it's really about how traffic gets balanced across endpoints once a large fraction of them fail readiness.
I may have misunderstood then; I thought he wanted to build a unified custom probe-handling logic ... and this logic influences the endpoints, of course.
Should we call this a dup of #56440?

/assign @shaneutt
/remove-sig node

Since the idea of switching to more relaxed readiness checks is not working here, removing sig node.
@eightnoteight we discussed this in the SIG Network community meeting today and we felt we had further questions and needed a bit more information.

Our first question was whether having the ability to provide a "backup selector" would fulfill the need here, as referenced by @thockin above. Could you check out #56440 and let us know if this could be considered a duplicate of it and be handled by that effort?

Secondly, we had considered whether having backup-selector-style functionality in Gateway API might be helpful to this kind of situation in the future (as we feel …).

/triage needs-information
Our current thinking has been that #56440 (or maybe more specifically kubernetes-sigs/gateway-api#1802) might be a good place to get functionality that would be relevant to this situation, but since it's been a while since we've heard back here, we'll consider this one closed for the time being.

/close

Please follow up in #56440 and kubernetes-sigs/gateway-api#1802 to advocate for your use case, or if you feel there's more follow-up that needs to be done here on this issue, please don't hesitate to drop a comment and follow up. 🖖
@shaneutt: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What would you like to be added?
When healthchecks depend on external services, or when the application uses a fixed thread pool (python+gunicorn, ruby+passenger, PHP, etc.), upstream service failures tend to cascade into the service itself.
One solution is removing the upstream services from the healthchecks, and that's how we are solving it currently, but that still won't solve the case where healthchecks fail due to excessive load: e.g. suddenly 2x requests arrive at the service, causing some pods to reach capacity and start failing readiness probes, which in turn causes a cascading impact where the scheduler removes the pods from the service.
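To make the failure mode concrete: with a typical probe config like the one below (values illustrative), three slow responses in a row, about 15 seconds of saturation, are enough to pull a pod out of the Service even though it is merely overloaded, not broken:

```yaml
readinessProbe:
  httpGet:
    path: /healthz        # handler may also call upstream dependencies
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 1       # a saturated thread pool easily exceeds 1s
  failureThreshold: 3     # 3 consecutive failures => pod removed from endpoints
```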
We faced similar issues with Envoy, and the unhealthy panic threshold in Envoy helped a lot: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/panic_threshold
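For reference, enabling the panic threshold on an Envoy cluster looks roughly like this (values illustrative; 50% also happens to be Envoy's default). Below the threshold, Envoy balances across all hosts, healthy or not, instead of concentrating traffic on the shrinking healthy set:

```yaml
clusters:
- name: my-app
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  common_lb_config:
    healthy_panic_threshold:
      value: 50.0          # panic once fewer than 50% of hosts are healthy
  load_assignment:
    cluster_name: my-app
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: my-app.default.svc, port_value: 8080 }
```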
The feature has a lot of moving parts though, i.e. it shouldn't trigger during normal scenarios like deployments, rotating the pods of a service, or interruptions in the underlying EC2 instances leading to replacement of pods in significant chunks.
References to similar outages -
Why is this needed?
To avoid cascading outages when more than 30-40% of the pods of a deployment start failing readiness probes.