Outlier detection enforcing on 5xx status codes despite being explicitly configured not to #25220

Closed
Stono opened this issue Jul 5, 2020 · 20 comments · Fixed by #25534

Comments

@Stono
Contributor

Stono commented Jul 5, 2020

Bug description
Given the following DestinationRule, configured as per https://archive.istio.io/v1.5/docs/reference/config/networking/destination-rule/:

    outlierDetection:
      baseEjectionTime: 30s 
      consecutive5xxErrors: 0
      consecutiveGatewayErrors: 5
      interval: 5s
      maxEjectionPercent: 25
      minHealthPercent: 50

Which I have confirmed is in the proxy-config clusters of the upstream:

        "outlierDetection": {
            "consecutive5xx": 0,
            "interval": "5s",
            "baseEjectionTime": "30s",
            "maxEjectionPercent": 25,
            "enforcingConsecutive5xx": 0,
            "consecutiveGatewayFailure": 5,
            "enforcingConsecutiveGatewayFailure": 100
        },

You can see this service is not producing gateway errors, only 500:
Screenshot 2020-07-05 at 21 06 30

However you can see outliers being active based on this:
Screenshot 2020-07-05 at 21 06 37

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
Outlier detection to only fire on gateway errors, not 500.
It would also be nice to be able to turn on outlier detection logging at a pod level with an annotation to debug this: #25219

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
1.5.7

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
GKE 1.15

@rshriram
Member

rshriram commented Jul 6, 2020

hmm.. this looks like an Envoy level issue. It is possible that we forgot to tune some magic knob in outlier detection (one not exposed to end user). @lambdai / @lizan / @PiotrSikora thoughts?

@lambdai
Contributor

lambdai commented Jul 6, 2020

@Stono Could you paste the config dump?

> You can see this service is not producing gateway errors, only 500:

I am not fully sure how (or whether) the gateway error metric is reported. The idea is that a gateway error is a TCP connect failure (the connection is not even established) between the client Envoy and the server Envoy. Since the connection is never established, no HTTP request reaches the destination service, so when a gateway error occurs we don't expect the destination service to report a 5xx.

@lambdai
Contributor

lambdai commented Jul 6, 2020

My point is that a gateway error could happen without the corresponding metric at the dest Envoy.
Actually this is important: a gateway error is usually considered safe to retry because the HTTP request is not received by the destination.

@Stono
Contributor Author

Stono commented Jul 6, 2020

@lambdai I'm not sure what you're getting at; there are no gateway errors.
The destination service is returning status code 500 (it has a very low error rate, but it's there) and these are tripping the Envoy outlier detection.

They shouldn't be, because I have configured:

      consecutive5xxErrors: 0
      consecutiveGatewayErrors: 5

Which means only 502, 503 and 504 should be considered, as per the docs?
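
To make the expectation concrete, here is a minimal Python sketch of how the two consecutive detectors are meant to differ. It assumes, per the Envoy docs, that gateway errors are exactly 502, 503 and 504, and that a threshold of 0 disables a detector as the Istio docs describe. The function names and the simplified streak counter are illustrative only, not Envoy's implementation.

```python
GATEWAY_ERRORS = {502, 503, 504}  # assumption: the set documented by Envoy


def is_5xx(status: int) -> bool:
    return 500 <= status <= 599


def is_gateway_error(status: int) -> bool:
    return status in GATEWAY_ERRORS


def would_eject(statuses, threshold, is_failure):
    """Return True if `threshold` failures occur in a row.

    A threshold of 0 is treated as "detector disabled", which is how
    consecutive5xxErrors: 0 is expected to behave.
    """
    if threshold <= 0:
        return False
    streak = 0
    for status in statuses:
        # Any non-failure response resets the streak.
        streak = streak + 1 if is_failure(status) else 0
        if streak >= threshold:
            return True
    return False


responses = [500, 500, 500, 500, 500, 500]
print(would_eject(responses, 0, is_5xx))            # consecutive5xxErrors: 0
print(would_eject(responses, 5, is_gateway_error))  # consecutiveGatewayErrors: 5
```

Under these assumptions neither detector should eject a host that only returns 500s, which is the behaviour I expected from the config.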

@Stono
Contributor Author

Stono commented Jul 6, 2020

Oh I think I understand, you're saying that the remote destination connection errors aren't being recorded?

I don't know if that's the case; as you can see from my two graphs, the outliers certainly follow the 500 codes. And we don't see it on services which aren't returning 500s.

I've messaged you on slack to send you any info you need

@lambdai
Contributor

lambdai commented Jul 6, 2020

> you're saying that the remote destination connection errors aren't being recorded?

That is my mental model. My impression is that a gateway error would not be recorded by the dest (consumer-gateway) but could be recorded by the source Envoy as a 500.
Since the source Envoy accepted an HTTP request, it has to report a response code :)

Basically, the source Envoy in the nginx pod ejects a dest (consumer-gateway) host based on errors between the source Envoy and the dest Envoy: either a gateway error or a normal 5xx.
But istio_requests_total (the first image in the description) tracks the errors of HTTP requests hitting the source Envoy.

@Stono
Contributor Author

Stono commented Jul 6, 2020

To be 100% clear, those 500 codes are coming from consumer-gateway: the application is returning 500s (it's a bug in the app that needs fixing, but you can follow the trace all the way through to consumer-gateway as well as the errors in the app), and the graph shows reporter=source metrics.

@lambdai
Contributor

lambdai commented Jul 6, 2020

DetectorHostMonitorImpl::putResultNoLocalExternalSplit shows that the response code of the downstream request (which maps to istio_requests_total) is one source of outlier detection signal, BUT it's not the only one.

Take a real-world example: if a retry hides a destination workload failure, the response code is 200, but the failure still counts against that workload in outlier detection.

What's more, the host failure is normalized to a fake HTTP response code attributed to the destination workload (the consumer-gateway sidecar Envoy in this case), but that response code is not actually returned by the destination Envoy. It could be a TCP connect failure.

tl;dr
The requests counted in istio_requests_total by both the source sidecar Envoy and the dest sidecar Envoy feed the outlier detector. But even the union of those tracked requests could be smaller than the outlier detector's sample set.

@Stono
Contributor Author

Stono commented Jul 6, 2020

I'm not really following, but I think you're saying they are hidden TCP failures that are contributing to the envoy outlier detection? How do we prove that?

Also, what do you mean by a fake HTTP response code?

I'm sorry but it seems too suspicious that the two services we have that produce genuine 500 error codes also trigger outlier detection. And we have other services doing equally as much load that are not triggering it.

To be clear: we have a correlation between services that return genuine 500 errors and outlier detection being triggered.

I will enable outlierLogs via sidecar injection tomorrow and see what they tell us

@Stono
Contributor Author

Stono commented Jul 7, 2020

FYI I've removed two variables:

  • Rolled back to 1.5.6
  • Removed http2 envoyFilter

No change in the number of active outliers.

@Stono
Contributor Author

Stono commented Jul 14, 2020

I've finally got outlier detection logs working with some hacking, here are a few samples:

{"type":"SUCCESS_RATE","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.63.170:8305","action":"EJECT","num_ejections":1,"enforced":true,"eject_success_rate_event":{"host_success_rate":99,"cluster_average_success_rate":99,"cluster_success_rate_ejection_threshold":99},"timestamp":"2020-07-14T09:21:38.530Z"}
{"type":"CONSECUTIVE_5XX","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.63.170:8305","action":"UNEJECT","num_ejections":1,"enforced":false,"timestamp":"2020-07-14T09:22:08.533Z","secs_since_last_action":"30"}
{"type":"SUCCESS_RATE","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.59.171:8305","action":"EJECT","num_ejections":1,"enforced":true,"eject_success_rate_event":{"host_success_rate":99,"cluster_average_success_rate":99,"cluster_success_rate_ejection_threshold":99},"timestamp":"2020-07-14T09:22:17.709Z"}
{"type":"CONSECUTIVE_5XX","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.59.171:8305","action":"UNEJECT","num_ejections":1,"enforced":false,"timestamp":"2020-07-14T09:22:47.713Z","secs_since_last_action":"30"}
{"type":"SUCCESS_RATE","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.38.234:8305","action":"EJECT","num_ejections":1,"enforced":true,"eject_success_rate_event":{"host_success_rate":99,"cluster_average_success_rate":99,"cluster_success_rate_ejection_threshold":99},"timestamp":"2020-07-14T09:24:51.298Z"}
{"type":"SUCCESS_RATE","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.2.181:8305","action":"EJECT","num_ejections":1,"enforced":true,"eject_success_rate_event":{"host_success_rate":99,"cluster_average_success_rate":99,"cluster_success_rate_ejection_threshold":99},"timestamp":"2020-07-14T09:24:53.546Z"}
{"type":"CONSECUTIVE_5XX","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.2.181:8305","action":"UNEJECT","num_ejections":1,"enforced":false,"timestamp":"2020-07-14T09:25:23.548Z","secs_since_last_action":"30"}
{"type":"CONSECUTIVE_5XX","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.38.234:8305","action":"UNEJECT","num_ejections":1,"enforced":false,"timestamp":"2020-07-14T09:25:21.301Z","secs_since_last_action":"30"}

And this is configured with:

      outlierDetection:
        baseEjectionTime: 30s
        consecutive5xxErrors: 0
        consecutiveGatewayErrors: 5
        interval: 5s
        maxEjectionPercent: 25
        minHealthPercent: 50

Which translates to a cluster config of:

        "outlierDetection": {
            "consecutive5xx": 0,
            "interval": "5s",
            "baseEjectionTime": "30s",
            "maxEjectionPercent": 25,
            "enforcingConsecutive5xx": 0,
            "consecutiveGatewayFailure": 5,
            "enforcingConsecutiveGatewayFailure": 100
        }

I've noticed the CONSECUTIVE_5XX events have enforced=false, which I presume means they're only being reported? But then there is a SUCCESS_RATE ejection which is being enforced. No CONSECUTIVE_GATEWAY_FAILURE events have been logged yet.

From reading https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#arch-overview-outlier-detection, it appears SUCCESS_RATE includes 500 codes, which means my consecutive5xxErrors setting is effectively being ignored.
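
For anyone wanting to tally their own outlier logs the same way, a rough Python sketch follows. It assumes one JSON event per line with the `type`, `action` and `enforced` fields seen in the samples above; `SAMPLE_LOG` holds the first two of those sample lines and `summarize` is just a hypothetical helper name.

```python
import json
from collections import Counter

# Two of the sample events from this comment, one JSON object per line.
SAMPLE_LOG = """\
{"type":"SUCCESS_RATE","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.63.170:8305","action":"EJECT","num_ejections":1,"enforced":true,"eject_success_rate_event":{"host_success_rate":99,"cluster_average_success_rate":99,"cluster_success_rate_ejection_threshold":99},"timestamp":"2020-07-14T09:21:38.530Z"}
{"type":"CONSECUTIVE_5XX","cluster_name":"outbound|80||app.consumer-gateway.svc.cluster.local","upstream_url":"10.194.63.170:8305","action":"UNEJECT","num_ejections":1,"enforced":false,"timestamp":"2020-07-14T09:22:08.533Z","secs_since_last_action":"30"}
"""


def summarize(log_text):
    """Count events by (type, action, enforced)."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[(event["type"], event["action"], event["enforced"])] += 1
    return counts


for (etype, action, enforced), n in sorted(summarize(SAMPLE_LOG).items()):
    print(f"{etype:16} {action:8} enforced={enforced} count={n}")
```

Grouping on the `type` field makes it easy to see which detector is actually driving the enforced ejections.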

@snowp

snowp commented Jul 14, 2020

Seems like you're seeing outlier detection ejections due to the success rate std dev calculations; you should be able to disable that by setting enforcing_success_rate to 0.

This is a different detector from the consecutive 5xx one: instead of tracking how many 5xx responses it sees in a row, it compares the 5xx error rate of a specific host against all the other hosts and ejects the host if it falls outside of the configured std dev range.
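
As a rough illustration of that calculation, here is a Python sketch. It assumes the threshold formula from the Envoy docs, mean - (success_rate_stdev_factor / 1000) * stdev, with the documented default factor of 1900. The use of a population standard deviation and the host names and rates are assumptions for the example, and the real detector also requires a minimum number of hosts and a minimum request volume per interval, which this sketch ignores.

```python
from statistics import mean, pstdev


def success_rate_threshold(rates, stdev_factor=1900):
    """Ejection threshold: mean - (factor / 1000) * stdev (rates are 0-100)."""
    return mean(rates) - (stdev_factor / 1000.0) * pstdev(rates)


def outliers(host_rates, stdev_factor=1900):
    """Return the hosts whose success rate falls below the threshold."""
    threshold = success_rate_threshold(list(host_rates.values()), stdev_factor)
    return [host for host, rate in host_rates.items() if rate < threshold]


# Hypothetical per-host success rates for one interval.
rates = {
    "host-a": 100.0,
    "host-b": 100.0,
    "host-c": 100.0,
    "host-d": 100.0,
    "host-e": 98.0,  # a handful of 500s in this interval
}
print(success_rate_threshold(list(rates.values())))
print(outliers(rates))
```

With a cluster this healthy the standard deviation is tiny, so a host with only a 2% error rate already falls below the threshold. That is consistent with the logged events where host, average and threshold all show as 99.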

@Stono
Contributor Author

Stono commented Jul 14, 2020

OK, so from the perspective of Istio wrapping Envoy, it feels wrong to have SUCCESS_RATE calculations enabled under the hood, as it's counterintuitive given the current configuration options of consecutive5xx and consecutiveGatewayFailure.

From a user's perspective, we only want outlier detection based on consecutive gateway errors (as per our config). It is very important that 500 codes do not count as outliers for us, hence me setting consecutive5xx: 0

@ramaraochavali
Contributor

ramaraochavali commented Jul 15, 2020

I agree. We should not enable success rate based outlier detection. PR to fix it #25534

@Stono
Contributor Author

Stono commented Jul 15, 2020

@lambdai @ramaraochavali @howardjohn please can we reopen this until the 1.5 and 1.6 cherry-picks are in?

#25543
#25542

@rshriram rshriram reopened this Jul 16, 2020
@rshriram
Member

why is the success rate detector even active when the consecutive gateway error detector is explicitly set?

@ramaraochavali
Contributor

The default is on (enforcing 100%) when you enable outlier detection by setting other detectors. Should the default be off, like enforcing_failure_percentage? Maybe yes. @snowp is there any specific reason why it is enabled by default?
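
A small Python sketch of that effect, for illustration: it overlays the cluster config from this issue on what I understand the Envoy enforcement defaults to be (treat the values in `ASSUMED_ENFORCING_DEFAULTS` as assumptions to verify against the Envoy API reference) and lists the detectors that end up enforced. The helper names are hypothetical.

```python
# Assumed Envoy defaults (percent enforcement) when a field is left unset.
ASSUMED_ENFORCING_DEFAULTS = {
    "enforcingConsecutive5xx": 100,
    "enforcingConsecutiveGatewayFailure": 0,
    "enforcingSuccessRate": 100,
    "enforcingFailurePercentage": 0,
}


def effective_enforcement(outlier_detection):
    """Overlay explicitly configured enforcement values on the defaults."""
    merged = dict(ASSUMED_ENFORCING_DEFAULTS)
    merged.update({k: v for k, v in outlier_detection.items() if k in merged})
    return merged


def enforced_detectors(outlier_detection):
    """Names of enforcement fields that are effectively greater than zero."""
    effective = effective_enforcement(outlier_detection)
    return sorted(name for name, pct in effective.items() if pct > 0)


# The cluster config reported in this issue.
cluster_config = {
    "consecutive5xx": 0,
    "interval": "5s",
    "baseEjectionTime": "30s",
    "maxEjectionPercent": 25,
    "enforcingConsecutive5xx": 0,
    "consecutiveGatewayFailure": 5,
    "enforcingConsecutiveGatewayFailure": 100,
}
print(enforced_detectors(cluster_config))
```

Because the config never mentions the success rate detector, it stays at its assumed default and remains enforced alongside the gateway failure detector.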

@ramaraochavali
Contributor

#25566 - 1.5
#25567 - 1.6

@ramaraochavali
Contributor

PRs to 1.5 and 1.6 are merged. So closing this

@Stono
Contributor Author

Stono commented Jul 18, 2020

@ramaraochavali thank you for seeing this through. :)
