l5d incorrectly reporting/caching endpoints after service is scaled to 0 #1635
Comments
Thanks @jsurdilla! We'll dig into this.
It seems possible to me that this issue is related to, or shares the same root cause as, #1626 ...
Thanks for linking to that issue @hawkw, that's helpful. It sounds like it; we'll run a test on the latest nightly and see if the issue still exists.
@hawkw, I'm able to reproduce this on the nightly build.
@jsurdilla yeah, that is (unfortunately!) not too surprising... #1626 also still exists after that snapshot. Just to double-check, though, are you using …
@hawkw, tested it on both …
@jsurdilla okay, thanks for confirming, I've done the same, and determined that the …
Tried it quickly, can confirm that I'm able to reproduce on …
Hi @jsurdilla and @Taik, pull request #1638 should fix this issue. We'll try to get the fix out as soon as possible!
That's amazing, looking forward to checking it out. Thank you!
As currently implemented, `stabilize()` returns a new `Var` holding the service's new address when a service is scaled down and then scaled back up, or removed and then re-created. However, the client stack for the service is created with the original `Var` and cached, so it never receives the updated address. I've changed `stabilize()` to update the previous `Var` rather than creating a new one. I've also added tests in `EndpointsNamerTest` based on events fired by Kubernetes while I was debugging this issue: one verifying that addresses are correctly updated after a scale-down/scale-up, and one verifying that references to the original `Var` are now properly updated. Fixes #1626. Fixes #1635. Fixes #1636.
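To make the mechanism concrete, here is a minimal, self-contained sketch of the caching behavior described above, using Finagle's `com.twitter.util.Var` directly. This is not linkerd's actual namer code: the addresses are made up, and a plain `String` stands in for the real `Addr` value.

```scala
import com.twitter.util.Var

object VarCachingSketch extends App {
  // A hypothetical cached client stack: it is handed a Var once at
  // creation time and keeps that reference for its whole lifetime.
  // respond() fires with the current value and on every later update.
  val addr = Var("10.0.4.19:8888")
  addr.changes.respond(a => println(s"cached client sees: $a"))

  // Buggy pattern: on scale-up, publish a brand-new Var. The cached
  // client still observes the old one, so it never sees this address.
  val freshAddr = Var("10.0.4.27:8888")

  // Fixed pattern: update the original Var in place, so every existing
  // observer, including the cached client stack, gets the new address.
  addr() = "10.0.4.27:8888" // prints: cached client sees: 10.0.4.27:8888
}
```

The key property is that a `Var` pushes updates to observers that already exist, so updating in place reaches the cached client stack, while a newly created `Var` is visible only to observers registered after the scale-up.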
The `linkerd check` command was not validating whether data plane proxies were successfully reporting metrics to Prometheus. Introduce a new check that validates that data plane proxies are found in Prometheus. This is made possible via the existing `ListPods` endpoint in the public API, which includes an `Added` field indicating that a pod's metrics were found in Prometheus. Fixes linkerd#1517.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Issue Type:
What happened:
l5d incorrectly reports the number of endpoints available after a service has been scaled all the way down to 0, and it remains in this invalid state even after scaling back up.
The linkerd logs report service failures, and the stale node IPs are not correctly removed:
What you expected to happen:
l5d should correctly report new endpoints as available as a service's replica count goes up and down, regardless of the number of replicas.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment: