Intermittent routing failures with HTTPRoute #12610
Comments
Hi @mgs255! I've been trying to reproduce this but I haven't seen the behavior you described. Of course, I don't have your full cms-api environment, but I've set up some More specifically, I installed emojivoto:
and then apply the following manifests in the emojivoto namespace:
Then by execing into a shell in an injected pod in the emojivoto namespace, I can curl
As expected, this will return a response from emojivoto half the time and bb half the time, but the response is always a 200. I think we have 3 potential avenues to explore next:
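The repro check described in the comment above (request the parent service repeatedly from a meshed pod, expect traffic to split across the backends and every response to be a 200) could be sketched roughly as follows. This is not taken from the thread: `PARENT_URL`, `fetch_status`, `tally_statuses`, and `non_200` are hypothetical names for illustration, and the URL is a placeholder for whichever Service the HTTPRoute is attached to.

```python
from collections import Counter
from typing import Callable
import urllib.error
import urllib.request

# Placeholder (assumption): replace with the parent Service of the HTTPRoute.
PARENT_URL = "http://parent-svc.example.svc.cluster.local"


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Issue one GET and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code is what we want to count.
        return err.code


def tally_statuses(fetch: Callable[[], int], attempts: int) -> Counter:
    """Call `fetch` `attempts` times and count how often each status occurs."""
    return Counter(fetch() for _ in range(attempts))


def non_200(tally: Counter) -> int:
    """Total number of responses whose status was anything other than 200."""
    return sum(count for code, count in tally.items() if code != 200)


# Example use from a shell inside an injected pod (connection-level errors
# such as DNS failures are not caught and will raise urllib.error.URLError):
#   tally = tally_statuses(lambda: fetch_status(PARENT_URL), attempts=100)
#   print(dict(tally), "non-200 responses:", non_200(tally))
```

Because the failure reported here took many hours to appear, a short run like this can only confirm the healthy case; it would need to run on a loop for a long time to catch the intermittent 500s.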
Hi @adleong, I've managed to reproduce the issue with debug logging enabled in our environment. As I mentioned previously, it is very intermittent: it ran for around 30 hours before the failure occurred. I've shared the gzipped log file with you via the Linkerd Slack as a DM attachment. All 3 services and the HTTPRoute are located in the same namespace. Hopefully this will help you get to the root cause. If there is anything else I can do to help, please let me know. Thanks!
Hi @mgs255! Thanks so much for those logs, they were super helpful. After some investigation, I think this may be due to a race condition related to how backend services are processed by the policy controller. See: #12635. Ideally, this fix will be released in this week's edge release for you to test.
@adleong that is great news, speedy work! We will keep an eye out for the next edge release. I notice that as part of that change you added some new debug logs in this area of the proxy. If we were to keep debug logging enabled for a while as we evaluate the release, could you give me a steer as to the log levels we should be targeting for this policy code? I'm not very familiar with Rust's log level specification, and globally setting debug produces very verbose output.
I typically run with a log level of
@mgs255 did you get a chance to try the latest edge release with this fix? edge-24.5.4 or later should have this.
We have been seeing this issue as well, will try the fix
@wmorgan @adleong Yes, indeed! We have been running this version in all of our production environments for at least a week now. We have alerts set up to monitor for the Thank you both for the speedy turnaround on this. Happy to keep this open for a bit longer to let it soak or close this issue now. I'll let you decide.
Awesome, thank you for the report @mgs255!
What is the issue?
We are currently running Linkerd edge-24.5.2 in our dev clusters and are using HTTPRoute objects to perform traffic splitting. We have been seeing intermittent request routing failures when HTTPRoute objects are active: all requests to the parent service fail with 500 errors. We first noticed this behaviour when we upgraded from edge-24.3.2 to edge-24.3.4.
In this case we have an HTTPRoute set up as follows:
When this occurs all requests to the parent service fail with 500 errors. Deleting or editing the HTTPRoute object temporarily restores service to a working state.
How can it be reproduced?
As mentioned, this has occurred only intermittently, but we started noticing it when we upgraded from edge-24.3.2 to edge-24.3.4.
Logs, error output, etc
When this starts to fail we see the following error messages logged in the linkerd-proxy in the calling service, in this case a pod, cms-api-gateway
output of linkerd check -o short
We do have Prometheus metrics, but it is run out of cluster.
Environment
Possible solution
No response
Additional context
I've currently enabled debug logging on the service which was previously failing and will attach additional context when/if I have it.
Would you like to work on fixing this bug?
None