
Many blackholed requests cause Envoy to consume excessive ram/cpu #25963

Closed
nairb774 opened this issue Jul 30, 2020 · 12 comments
Comments

@nairb774
Contributor

nairb774 commented Jul 30, 2020

Bug description

A program that makes occasional requests to services outside the mesh, combined with the cluster having outboundTrafficPolicy: REGISTRY_ONLY set, causes the Envoy process to accumulate an unbounded number of stats entries (visible on /stats), eventually consuming large amounts of RAM and CPU and in some cases ceasing to function altogether.

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

Ram/CPU stays bounded and Envoy remains functional

Steps to reproduce the bug

Important context: Istio has outboundTrafficPolicy:REGISTRY_ONLY set.

The Knative autoscaler acquired a feature which attempts to scrape statistics directly from individual pods. This direct scraping happens on autoscaler startup, and on occasion when some internal state gets reset. This results in periodic bursts of 1-10ish requests being attempted to individual pods to grab stats. After the autoscaler attempts to make these fetches, and finds out it isn't possible, it proceeds to scrape via the service (good!). As pods move around, and the autoscaler's internal state gets reset, it goes back to attempting to talk to the pods directly. This causes the number of /stats entries in Envoy to continuously grow because each attempt to scrape a pod results in an attempt to contact a unique IP. Eventually Envoy struggles to operate correctly, and seems to either lock up or becomes so slow as to be non-functional. With the Knative autoscaler it seems to take roughly a week to cause full breakdown of Envoy. The direct pod probes happen at a rate of about 10ish every 15 minutes (really slow burn). Lots of details in knative/serving#8761 for the curious.

https://github.com/nairb774/experimental/blob/28f76fcc2db73ed30c46aff7ce4b25a47515d25c/http-prober/main.go is a simplified and accelerated reproduction of the Knative autoscaler's behavior. It can be deployed/run with ko to simulate the autoscaler. Within a minute or two of running the http-prober, the /stats/prometheus page on the Envoy dashboard (istioctl dashboard envoy) takes a very long time to return, and the response size is well over 100MiB (in the one test I ran) at about 140k rows. Envoy's memory usage also ballooned to about 500MiB in that time.

Ideally the slow burn version of the Knative autoscaler wouldn't cause Envoy to topple over (though it is just as easy to say the autoscaler is as much at fault). It doesn't look like the blackhole metrics get garbage collected if they sit idle for a long time.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version --remote
client version: 1.6.5
control plane version: 1.6.4
data plane version: 1.6.4 (57 proxies)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Knative v0.16.0 (for what it is worth)

How was Istio installed?

Istio Operator (version 1.6.4)

Environment where bug was observed (cloud vendor, OS, etc)

AWS EKS 1.16
AWS Managed Node Groups (1.16.8-20200618 / ami-0d960646974cf9e5b)

Here is an operator config for Istio: istio-system.yaml.gz which is from the cluster exhibiting the issue both via the Knative autoscaler as well as the minimal program above. I'm a little apprehensive dumping so much information from the cluster and making it publically accessible. I'm happy to pull specific logs/info that might be useful. I'm even happy to hop on a VC and do some live debugging/reproductions if that would help out.

@bianpengyuan
Contributor

I suspect this is because of the metrics exposed by the Istio stats filter. I'm not very familiar with the Knative setup, but does it rely on Istio metrics? If not, you can add the following install options to disable the stats emitted by Istio:

  values:
    telemetry:
      enabled: false
      v2:
        enabled: false

Alternatively, you can simply delete the stats-filter-1.6 EnvoyFilter: kubectl delete -n istio-system envoyfilter stats-filter-1.6.

If your system does rely on Istio metrics, you can edit the stats filter configuration to disable host header fallback: run
kubectl edit -n istio-system envoyfilter stats-filter-1.6 and make the following change to every filter configuration block:

                 {
                   "debug": "false",
                   "stat_prefix": "istio",
+                  "disable_host_header_fallback": true
                 }
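Applied, each filter configuration block would read as follows (a sketch based on the diff above; the surrounding fields in your install may differ):

```json
{
  "debug": "false",
  "stat_prefix": "istio",
  "disable_host_header_fallback": true
}
```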

@nairb774
Contributor Author

Interesting. I'll play with this tomorrow (time willing) and report back. Knative has its own metrics pipelines for its own operations, but I have a hunch (need to test) that DataDog's Istio integration makes use of the Istio metrics. I'll turn it off and see what happens.

@nairb774
Contributor Author

nairb774 commented Aug 5, 2020

I got some time to try out the suggestions. Turning off telemetry is a little painful as it seems to break how Datadog is getting data. We are using their default integration, and I have not dug into any sort of configuration that might exist there.

Removing disable_host_header_fallback doesn't seem to help. Assuming I'm looking at the right code (https://github.com/istio/proxy/blob/1.6.4/extensions/stats/plugin.cc#L157 and https://github.com/istio/proxy/blob/1.6.4/extensions/common/context.cc#L106-L110), I would expect it to work. I restarted the istiod pods and then restarted the problematic Knative pod. I've attached all of the EnvoyFilter objects in the istio-system namespace. As a last-ditch effort, I even edited the 1.4 and 1.5 objects as well, with no success. Stats lines like the following still show up for every errant attempt by the Knative autoscaler to reach individual pods:

reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 1
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.14.241:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 1
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.32.253:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 1
...
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_bytes: P0(nan,1200.0) P25(nan,1225.0) P50(nan,1250.0) P75(nan,1275.0) P90(nan,1290.0) P95(nan,1295.0) P99(nan,1299.0) P99.5(nan,1299.5) P99.9(nan,1299.9) P100(nan,1300.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_duration_milliseconds: P0(nan,0.0) P25(nan,0.0) P50(nan,0.0) P75(nan,0.0) P90(nan,0.0) P95(nan,0.0) P99(nan,0.0) P99.5(nan,0.0) P99.9(nan,0.0) P100(nan,0.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,54.0) P25(nan,54.25) P50(nan,54.5) P75(nan,54.75) P90(nan,54.9) P95(nan,54.95) P99(nan,54.99) P99.5(nan,54.995) P99.9(nan,54.999) P100(nan,55.0)
...

Any thoughts on next steps? In the meantime, I should update the cluster to 1.6.7 (from 1.6.4).

@bianpengyuan
Contributor

bianpengyuan commented Aug 5, 2020

No, you should not remove it. You should add the line "disable_host_header_fallback": true to all filter configs. Sorry, the diff in #25963 (comment) might not have been very clear.

@nairb774
Contributor Author

nairb774 commented Aug 5, 2020

disable_host_header_fallback was already set to true (in the state prior to my edits), which is likely what got me turned around. Your initial diff was clear; I just misread it when trying to apply it.

@nairb774
Contributor Author

nairb774 commented Aug 5, 2020

Apparently I need to slow down and read, because once I did that things look like they will be in a better state.

reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 17
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_bytes: P0(nan,1200.0) P25(nan,1225.0) P50(nan,1250.0) P75(nan,1275.0) P90(nan,1290.0) P95(nan,1295.0) P99(nan,1299.0) P99.5(nan,1299.5) P99.9(nan,1299.9) P100(nan,1300.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_duration_milliseconds: P0(nan,0.0) P25(nan,0.0) P50(nan,0.0) P75(nan,11.2917) P90(nan,11.7167) P95(nan,11.8583) P99(nan,11.9717) P99.5(nan,11.9858) P99.9(nan,11.9972) P100(nan,12.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,54.0) P25(nan,54.25) P50(nan,54.5) P75(nan,54.75) P90(nan,54.9) P95(nan,54.95) P99(nan,54.99) P99.5(nan,54.995) P99.9(nan,54.999) P100(nan,55.0)

Now I need to find a way to prevent the Istio Operator from overwriting the values. I can see a few options, but I think this looks to be a much better state.

I was meant to add this option to those two blocks.

Is this something you think might be changing in a future release?

Thank you immensely for your patience in the face of my apparently lacking comprehension skills.

@bianpengyuan
Contributor

We did not provide a way to override the stats filter configuration, in the hope that we would have a telemetry API by 1.7, but it looks like that is not going to happen in the short term.

@douglas-reid @mandarjog @kyessenov wdyt about adding a config override to the stats filter installation options? Without it, I think it would be hard for users to manage any patch on the stats filter, whether for this issue or for any metric customization.

@bianpengyuan
Contributor

bianpengyuan commented Aug 6, 2020

The configuration would only control the root-namespace EnvoyFilters, so users will still need to maintain per-namespace EnvoyFilters for specific use cases, but at least those settings won't be overridden when upgrading Istio.

@kyessenov
Contributor

yeah, I would support pulling the filter configs out of the EnvoyFilter resource. wdyt @mandarjog ?

@douglas-reid
Contributor

Adding a config override to the install seems like a decent start. FWIW, our goal for 1.7 was to design a new telemetry API, but we had a separate task to improve metrics customization.

@bianpengyuan
Contributor

istio/istio.io#7952 added a guide on how to use istioctl to customize metrics. It is in the preliminary docs now: https://preliminary.istio.io/latest/docs/tasks/observability/metrics/customize-metrics/
