Many blackholed requests cause Envoy to consume excessive ram/cpu #25963
Comments
I suspect this is because of metrics exposed by the Istio stats filter. I'm not very familiar with the Knative setup, but does it rely on Istio metrics? If not, you can add the following install options to disable the stats emitted by Istio (see the sketch below), or you can simply delete the stats-filter-1.6 EnvoyFilter. If your system does rely on Istio metrics, you can instead edit the stats filter configuration to disable host header fallback; the exact blocks to edit are linked a few comments down.
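A minimal sketch of the install-option route, assuming the `values.telemetry.v2.enabled` toggle present in the Istio 1.6 operator values (verify against the values schema for your exact version):

```yaml
# Sketch only: disables the telemetry v2 stats filters at install time.
# Assumes the values.telemetry.v2.enabled toggle in the Istio 1.6 operator
# values; double-check against your installed IstioOperator before applying.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane   # hypothetical name
  namespace: istio-system
spec:
  values:
    telemetry:
      v2:
        enabled: false
```

Note that this removes the Istio-generated telemetry v2 metrics entirely, so it only makes sense if nothing downstream depends on them.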
Interesting. I'll play with this tomorrow (time willing) and report back. Knative has its own metrics pipelines for its own operations, but I have a hunch (need to test) that DataDog's Istio integration makes use of the Istio metrics. I'll turn it off and see what happens.
I got some time to try out the suggestions. Turning off telemetry is a little painful as it seems to break how Datadog is getting data. We are using their default integration, and I have not dug into any sort of configuration that might exist there. Removing the disable_host_header_fallback line didn't seem to help either.
Any thoughts on next steps? In the meantime, I should update the cluster to 1.6.7 (from 1.6.4).
No, you should not remove it; you should add that line.
disable_host_header_fallback was already set to true (prior to my edits), which is likely what got me turned around. Your initial diff was clear; I just got tangled up when trying to apply it.
Yeah, but did you add it to the other sections? https://github.com/istio/istio/blob/1.6.7/manifests/charts/istio-control/istio-discovery/templates/telemetryv2_1.6.yaml#L194 and https://github.com/istio/istio/blob/1.6.7/manifests/charts/istio-control/istio-discovery/templates/telemetryv2_1.6.yaml#L146 What I meant was to add this option to those two blocks.
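For anyone following along, a rough sketch of what that edit amounts to inside each of the two linked blocks (only the plugin configuration JSON is shown; the surrounding patch structure comes from the 1.6.7 chart and should be copied from the stats-filter-1.6 EnvoyFilter actually installed in istio-system):

```yaml
# Sketch only: inside each of the two linked stats filter blocks, the stats
# plugin's configuration JSON gains one extra key. Everything else in the
# patch stays as it appears in the installed EnvoyFilter.
configuration: |
  {
    "debug": "false",
    "stat_prefix": "istio",
    "disable_host_header_fallback": true
  }
```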
Apparently I need to slow down and read, because once I did that things look like they will be in a better state.
Now I need to find a way to prevent the Istio Operator from overwriting the values. I can see a few options, but I think this looks to be a much better state.
Is this something you think might be changing in a future release? Thank you immensely for your patience in the face of my apparently lacking comprehension skills.
So we did not provide a way to override the stats filter configuration, in the hope that we would have the telemetry API by 1.7, but it looks like that is not going to happen in the short term. @douglas-reid @mandarjog @kyessenov wdyt about adding a config override to the stats filter installation options? Without this, I think it would be hard for users to manage any patch on the stats filter, like this issue or any metric customization.
That configuration would only control the root-namespace EnvoyFilters, so users will still need to maintain per-namespace EnvoyFilters for specific use cases, but at least it won't be overridden when upgrading Istio.
Yeah, I would support pulling filter configs out of the EnvoyFilter resource. wdyt @mandarjog?
Adding a config override to the install seems like a decent start. FWIW, our goals for 1.7 were to design a new telemetry API, but we had a separate task to improve metrics customization.
istio/istio.io#7952 added a guide on how to use istioctl to customize metrics. It is in the preliminary docs now: https://preliminary.istio.io/latest/docs/tasks/observability/metrics/customize-metrics/
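For reference, the approach in that guide boils down to an install-time override; a rough sketch of the shape it takes, assuming the `telemetry.v2.prometheus.configOverride` values described there (worth re-checking against the published doc):

```yaml
# Sketch based on the preliminary customize-metrics guide: pushes the stats
# filter setting through install values so the operator no longer overwrites
# hand-edited EnvoyFilters on upgrade.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    telemetry:
      v2:
        prometheus:
          configOverride:
            inboundSidecar:
              disable_host_header_fallback: true
            outboundSidecar:
              disable_host_header_fallback: true
            gateway:
              disable_host_header_fallback: true
```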
Bug description
A program that makes occasional requests to services outside of the mesh, combined with the cluster having
outboundTrafficPolicy: REGISTRY_ONLY
set, will result in the Envoy process accumulating an unbounded number of stats entries (i.e., the entries exposed on /stats), eventually causing Envoy to consume lots of RAM and in some cases stop functioning altogether.
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
RAM/CPU stays bounded and Envoy remains functional.
Steps to reproduce the bug
Important context: Istio has
outboundTrafficPolicy: REGISTRY_ONLY
set.

The Knative autoscaler acquired a feature which attempts to scrape statistics directly from individual pods. This direct scraping happens on autoscaler startup, and occasionally when some internal state gets reset. This results in periodic bursts of roughly 1-10 requests being attempted to individual pods to grab stats. After the autoscaler attempts these fetches and finds out they aren't possible, it proceeds to scrape via the service (good!). As pods move around and the autoscaler's internal state gets reset, it goes back to attempting to talk to the pods directly. This causes the number of /stats entries in Envoy to grow continuously, because each attempt to scrape a pod results in an attempt to contact a unique IP. Eventually Envoy struggles to operate correctly and seems to either lock up or become so slow as to be non-functional. With the Knative autoscaler it seems to take roughly a week to cause a full breakdown of Envoy. The direct pod probes happen at a rate of about 10 every 15 minutes (a really slow burn). Lots of details in knative/serving#8761 for the curious.
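For completeness, the REGISTRY_ONLY policy mentioned above is typically set through the operator's mesh config; a minimal sketch, assuming the install.istio.io/v1alpha1 API used elsewhere in this cluster:

```yaml
# Sketch: the outbound traffic policy referenced above. With REGISTRY_ONLY,
# requests to hosts outside the service registry are routed to Envoy's
# BlackHoleCluster, and the stats filter labels its metrics per destination,
# which is the cardinality that grows without bound here.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
```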
https://github.com/nairb774/experimental/blob/28f76fcc2db73ed30c46aff7ce4b25a47515d25c/http-prober/main.go is a simplified and accelerated reproduction of the behavior of the Knative autoscaler component. It can be deployed/run with ko. Within a minute or two of running the http-prober, the /stats/prometheus page on the Envoy dashboard (istioctl dashboard envoy) takes forever to return, and the response size is well over 100MiB (in the one test I ran) with about 140k rows. The memory usage of Envoy also ballooned to about 500MiB in that minute.

Ideally the slow-burn version of the Knative autoscaler wouldn't cause Envoy to topple over (though it is just as easy to say the autoscaler is as much at fault). It doesn't look like the blackhole metrics get garbage collected if they sit idle for a long time.
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
Knative v0.16.0 (for what it is worth)
How was Istio installed?
Istio Operator (version 1.6.4)
Environment where bug was observed (cloud vendor, OS, etc)
AWS EKS 1.16
AWS Managed Node Groups (1.16.8-20200618 / ami-0d960646974cf9e5b)
Here is an operator config for Istio: istio-system.yaml.gz, which is from the cluster exhibiting the issue both via the Knative autoscaler and via the minimal program above. I'm a little apprehensive about dumping so much information from the cluster and making it publicly accessible, so I'm happy to pull specific logs/info that might be useful. I'm even happy to hop on a VC and do some live debugging/reproductions if that would help out.