Approx. 20% increase in istio-proxy memory usage on 1.7 vs 1.6 #28652

Closed

Stono opened this issue Nov 6, 2020 · 11 comments

Stono (Contributor) commented Nov 6, 2020

Bug description
Hi,
I've been chatting with @howardjohn about this, and felt it warrants a GitHub issue to track.

We've recently upgraded a staging cluster with circa 500 istio-proxies on it from 1.6.13 to 1.7, and once we'd updated all the sidecars we observed about a 20-25% increase in istio-proxy memory cluster-wide:

[Screenshot 2020-11-06 at 14:31:11]

On top of the 10% added going from 1.5 -> 1.6 due to SDS, these increases are getting hard to stomach.

We can compare this cluster with our production cluster. They're identical other than traffic patterns and number of endpoints (the 1.6 cluster has twice as many endpoints and significantly more load - so if anything, its memory usage should be higher).

Looking at the min across the board gives a decent indication of proxy memory usage before they start taking load; you can see the min on 1.7 is around the 50MB mark:

[Screenshot 2020-11-06 at 15:00:44]

Whereas on 1.6 it's more like 40MB:

[Screenshot 2020-11-06 at 15:00:48]

For comparison purposes, we run a service called istio-test on all of our clusters; as you can see here, on the 1.7 cluster we're around 50MB average usage:

[Screenshot 2020-11-06 at 14:53:37]

Vs. the 1.6 cluster, which is around 40MB:

[Screenshot 2020-11-06 at 14:53:41]

Both are configured identically, as their purpose is to compare Istio releases.
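For anyone wanting to reproduce this kind of cluster-wide comparison, here is a minimal sketch that pulls per-pod istio-proxy memory from Prometheus and prints the min and average. The Prometheus address and the cAdvisor metric name below are assumptions, not details taken from this report:

    import requests

    PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical Prometheus address
    # assumes cAdvisor container metrics are scraped
    QUERY = 'container_memory_working_set_bytes{container="istio-proxy"}'

    def proxy_memory_mb():
        """Return (pod, working-set MB) for every istio-proxy container."""
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
        resp.raise_for_status()
        return [(r["metric"].get("pod", "?"), float(r["value"][1]) / 2**20)
                for r in resp.json()["data"]["result"]]

    if __name__ == "__main__":
        values = [mb for _, mb in proxy_memory_mb()]
        print(f"proxies={len(values)} "
              f"min={min(values):.1f}MB avg={sum(values)/len(values):.1f}MB")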

[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[x] User Experience
[ ] Developer Infrastructure

Expected behavior
Not a 20% increase in memory between releases.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)
1.7.5-4d93d71598da6e07cf9c68e32aaa0c4eadc308a4

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
GKE

Stono (Contributor, Author) commented Nov 6, 2020

I'm going to document the things I check here as I go.
The first thing I wanted to validate was that we hadn't had an explosion of metrics.

There are some new metrics in the 1.7 envoy:

< # TYPE envoy_cluster_upstream_rq_max_duration_reached counter
< # TYPE envoy_cluster_zone_europe_west4_a__upstream_rq counter
< # TYPE envoy_cluster_zone_europe_west4_a__upstream_rq_200 counter
< # TYPE envoy_cluster_zone_europe_west4_a__upstream_rq_completed counter
< # TYPE envoy_server_envoy_bug_failures counter

It looks as if locality metrics are now enabled by default, which is a bit annoying as we don't need them; they're also still in a strange, broken statsd format (the europe-west4-a part should be a label). See #20235.

Either way, I wouldn't consider this a large amount of additional metrics, so it wouldn't account for the 10MB base jump in the proxies.
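As a rough sketch of the comparison above, the metric families of the two versions can be diffed directly, assuming each proxy's /stats/prometheus output has been saved to a local file (the file names below are placeholders):

    def metric_families(path):
        """Return the set of metric names declared on '# TYPE' lines."""
        with open(path) as f:
            return {line.split()[2] for line in f if line.startswith("# TYPE")}

    old = metric_families("envoy-1.6-stats.txt")  # hypothetical file names
    new = metric_families("envoy-1.7-stats.txt")

    print("only in 1.7:\n" + "\n".join(sorted(new - old)))
    print("only in 1.6:\n" + "\n".join(sorted(old - new)))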

Stono (Contributor, Author) commented Nov 6, 2020

top from 1.7:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      6 istio-p+  20   0  785232  75056  48768 S   0.0  0.3   0:24.06 pilot-agent
     18 istio-p+  20   0  178772  55004  33220 S   0.3  0.2   7:38.41 envoy

top from 1.6:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      6 istio-p+  20   0  762540  56272  36640 S   0.0  0.2   2:01.13 pilot-agent
     20 istio-p+  20   0  176796  53488  32276 S   0.0  0.2  45:14.45 envoy

This seems to show the increase is in pilot-agent rather than envoy.
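A rough way to break sidecar memory down per process without eyeballing top is sketched below; it assumes it is run (or exec'd) inside the istio-proxy container and simply sums VmRSS from /proc by process name:

    import os

    def rss_by_process():
        """Return {process name: resident set size in MB} from /proc."""
        usage = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    name = f.read().strip()
                with open(f"/proc/{pid}/status") as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            # VmRSS is reported in kB
                            usage[name] = usage.get(name, 0) + int(line.split()[1]) / 1024
            except (FileNotFoundError, PermissionError):
                continue  # process exited or is not readable
        return usage

    usage = rss_by_process()
    for name in ("pilot-agent", "envoy"):
        print(f"{name}: {usage.get(name, 0.0):.1f} MB")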

Stono (Contributor, Author) commented Nov 6, 2020

Associated: #26232

@kyessenov (Contributor)

istio-agent's memory increased because the xDS proxy was added, which requires importing a whole bunch of protobufs from Envoy. I think the question is whether the value added by the proxy justifies the cost.

@mandarjog (Contributor)

Is the xDS proxy already present in 1.7?
If you look at the effect on memory, 1.7 uses 15MB of additional unshared memory. Details are in the attached bug.

@mandarjog (Contributor)

Multiple questions:

  1. Is it possible for a user to pay the cost of the xDS proxy only when they use features that require it?
  2. Is it possible to fix it?
  3. 25% is a large number, but the absolute number is what really matters, which is 50-60MB per proxy; that is why I felt we don't need to make it a release blocker. A Java app would use 100s MBs to a few GBs.
     But in aggregate Istio ends up using more memory, for sure.

@howardjohn (Member)

From 3-month-old memory: when looking into this, it's extremely challenging to reduce the binary size in 1.7.

In 1.9 we will drop the k8s imports, which I have done locally, and it's 40MB, so we definitely have a long-term path forward. The gogo API migration will also drop multiple MB.

@kyessenov (Contributor)

There's a budget for how much per-pod overhead users are going to tolerate. Keep in mind Wasm overhead, and whether it's better to spend the budget on the xDS proxy or on Wasm.

@howardjohn (Member)

We can reduce the overhead of the xDS proxy. We don't unmarshal Any, so we don't need to import all the filters. The bare cost of just importing core xDS is not that high.

Stono (Contributor, Author) commented Nov 6, 2020

A Java app would use 100s MBs to a few GBs.

I'm not sure I see the point here; you can equally consider the other extreme, where people are building Golang microservices running at 10-20MB, and adding Istio at 50-60MB to their application architecture is a significant increase. The other key word there is microservices. Larger proxies favour people with larger/monolithic services, which is counter-productive when one of Istio's biggest selling points is the observability it gives you when building large microservice architectures!

25% is a large number, but the absolute number is what really matters, which is 50-60MB per proxy; that is why I felt we don't need to make it a release blocker

"25% increase in memory usage of the data plane" rightly paints a very different picture in peoples heads than "15mb increase in memory usage per proxy". So that is why the percentage matters, because it more accurately depicts the impact to the users regardless of their number of proxies.

In real terms for me, that's over a 15GB increase cluster-wide, which isn't something to be sniffed at, especially considering we've just taken a 10GB increase to facilitate SDS. So over the course of two releases we've not far off doubled the operating overhead (memory) of Istio. This means proxy memory usage now accounts for 8-10% of all our memory usage.
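Purely to make the two framings concrete, a tiny sketch relating a per-proxy delta to the percentage and cluster-wide views (the numbers below are illustrative placeholders, not measurements from this cluster):

    def framings(baseline_mb, delta_mb, proxy_count):
        """Relate a per-proxy memory delta to the percentage and cluster-wide views."""
        pct = 100.0 * delta_mb / baseline_mb
        cluster_gb = delta_mb * proxy_count / 1024
        return pct, cluster_gb

    pct, gb = framings(baseline_mb=40, delta_mb=10, proxy_count=500)  # hypothetical values
    print(f"+{pct:.0f}% per proxy, roughly +{gb:.1f} GB across the cluster")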

Stono (Contributor, Author) commented Nov 9, 2020

I'm going to close this in favour of #26232, which seems to be tracking the regression and all associated fixes.

@howardjohn has also backported #28670 to 1.7 (awaiting merge), which will help a bit with the bloat on 1.7 (believed to land roughly halfway between what I've observed here and 1.6).

I think, for those coming into this issue, we'll have to accept some increase in 1.7, with many improvements coming in 1.8 and 1.9.

Stono closed this as completed Nov 9, 2020