-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outlier detection enforcing on 5xx status codes despite being explicitly configured not to #25220
Comments
hmm.. this looks like an Envoy level issue. It is possible that we forgot to tune some magic knob in outlier detection (one not exposed to end user). @lambdai / @lizan / @PiotrSikora thoughts? |
@Stono Could you paste the config dump?
I am not fully sure how (and whether) gateway error metric is be reported. The idea is gateway error is the TCP connect failure(not even established) between client envoy and server envoy. The connect is not even established, let alone a http request hitting the destination service. So when a gateway error occurs, we don't expect the destination service report a 5xx. |
My point is that gateway error could happens w/o the corresponding metric at the dest envoy. |
@lambdai i'm not sure what you're getting at, there are no gateway errors. They shouldn't be, because I have configured:
Which means only 502, 503 and 504 should be considered, as per the docs? |
Oh I think I understand, you're saying that the remote destination connection errors aren't being recorded? I don't know if that's the case, as you can see from my two graphs the outliers certainly follow the 500 codes? And we don’t see it on services which aren’t returning 500s I've messaged you on slack to send you any info you need |
That is my mental model. My impression is that gateway error should not be recorded by dest (consumer-gateway) but could be recorded by source envoy with 500. Basically for the source envoy in nginx pod, the envoy is rejecting a host from dest (consumer-gateway) by the error of source-envoy to dest-envoy: gateway error or a normal 5xx |
to be 100% clear, those 500 codes are coming from |
DetectorHostMonitorImpl::putResultNoLocalExternalSplit reveals the fact that the response code to downstream request(mapping to istio_request_total) is a source of outlier detection signal, BUT it's not the only one. Take a real world example: if the retry does hide an destination workload failure, the response code is 200, but the outlier detection of that failing workload is accounted. What's more, the host failure is normalized to a fake http response code from destination workload (consumer-gateway sidecar envoy is expected in this case), but that response code is not returned by destination envoy. It could be a tcp connect failure. tl;dr |
I'm not really following, but I think you're saying they are hidden TCP failures that are contributing to the envoy outlier detection? How do we prove that? Also what do you mean I'm sorry but it seems too suspicious that the two services we have that produce genuine 500 error codes also trigger outlier detection. And we have other services doing equally as much load that are not triggering it. To be clear: We have a correlation between services that return genuine 500 errors, and outlier detection being triggered? I will enable outlierLogs via sidecar injection tomorrow and see what they tell us |
FYI i've removed two variables:
No change in the amount of outlier active |
I've finally got outlier detection logs working with some hacking, here are a few samples:
And this is configured with:
Which translates to a cluster config of:
I've noticed the From reading https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#arch-overview-outlier-detection, it appears |
Seems like you're seeing outlier detection evictions due to the success rate std dev calculations, you should be able to disable that by setting This is a different detector than the consecutive 5xx one: instead of tracking how many 5xx it sees in a row, it compares the number of 5xx errors of a specific host vs all the other hosts and eject it if it falls outside of the configured std dev range. |
OK so from an Istio wrapping envoy perspective, it feels wrong to have From a users perspective, we only want outlier detections based on consecutive gateway errors (as per our config), it is very important that 500 codes do not counter as outliers for us - hence me setting |
I agree. We should not enable success rate based outlier detection. PR to fix it #25534 |
@lambdai @ramaraochavali @howardjohn please can we reopen this until the 1.5 and 1.6 cherry-picks are in? |
why is the success rate detector even active when the consecutive gateway error detector is explicitly set? |
The default is to on (enforcing 100%) when you enable outlier detection by setting other detectors - Should the default be off like enforcing_failure_percentage - may be yes. @snowp is there any specific reason on why it is enabled by default? |
PRs to 1.5 and 1.6 are merged. So closing this |
@ramaraochavali thank you for seeing this through. :) |
Bug description
Given the following DestinationRule, configured as per https://archive.istio.io/v1.5/docs/reference/config/networking/destination-rule/:
Whilst i have configured is in the
proxy-config clusters
of the upstream:You can see this service is not producing gateway errors, only
500
:However you can see outliers being active based on this:
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Outlier detection to only fire on gateway errors, not 500.
It would also be nice to be able to turn on outlier detection logging at a pod level with an annotation to debug this: #25219
Steps to reproduce the bug
Version (include the output of
istioctl version --remote
andkubectl version
andhelm version
if you used Helm)1.5.7
How was Istio installed?
Helm
Environment where bug was observed (cloud vendor, OS, etc)
GKE 1.15
The text was updated successfully, but these errors were encountered: