Approx. 20% increase in istio-proxy memory usage on 1.7 vs 1.6 #28652

Closed

Stono opened this issue Nov 6, 2020 · 11 comments

Stono (Contributor) commented Nov 6, 2020

Bug description
Hi,
I've been chatting with @howardjohn about this, and felt it warrants a GitHub issue to track.

We've recently upgraded a staging cluster with circa 500 istio-proxies on it from 1.6.13 to 1.7, and once we'd updated all the sidecars we observed about a 20-25% increase in istio-proxy memory cluster-wide:

[Screenshot 2020-11-06 at 14:31:11]

On top of the 10% added going from 1.5 -> 1.6 due to SDS, these increases are getting hard to stomach.

We can compare this cluster with our production cluster. They're identical other than traffic patterns and number of endpoints (the 1.6 cluster has twice as many endpoints and significantly more load - so if anything, its memory usage should be higher).

Looking at the min across the board gives a decent indication of proxy memory usage before they start taking load; you can see the min on 1.7 is around the 50MB mark:

[Screenshot 2020-11-06 at 15:00:44]

Whereas on 1.6 it's more like 40MB:

[Screenshot 2020-11-06 at 15:00:48]

For comparison purposes, we run a service called istio-test on all of our clusters; as you can see here, on the 1.7 cluster we're around 50MB average usage:

[Screenshot 2020-11-06 at 14:53:37]

Vs. the 1.6 cluster, which is around 40MB:

[Screenshot 2020-11-06 at 14:53:41]

Both are configured identically, as their purpose is to compare Istio releases.
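For anyone wanting to reproduce this kind of cluster-wide comparison, here is a minimal sketch that pulls per-pod istio-proxy memory from Prometheus and prints the min and average. The Prometheus address and the cAdvisor metric name below are assumptions, not details taken from this report:

    import requests

    PROM_URL = "http://prometheus.monitoring:9090"  # hypothetical Prometheus address
    # assumes cAdvisor container metrics are scraped
    QUERY = 'container_memory_working_set_bytes{container="istio-proxy"}'

    def proxy_memory_mb():
        """Return (pod, working-set MB) for every istio-proxy container."""
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
        resp.raise_for_status()
        return [(r["metric"].get("pod", "?"), float(r["value"][1]) / 2**20)
                for r in resp.json()["data"]["result"]]

    if __name__ == "__main__":
        values = [mb for _, mb in proxy_memory_mb()]
        print(f"proxies={len(values)} "
              f"min={min(values):.1f}MB avg={sum(values)/len(values):.1f}MB")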

[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[x] User Experience
[ ] Developer Infrastructure

Expected behavior
Not a 20% increase in memory between releases.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)
1.7.5-4d93d71598da6e07cf9c68e32aaa0c4eadc308a4

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
GKE

Stono (Contributor, Author) commented Nov 6, 2020

I'm going to document the things I check here as I go.
The first thing I wanted to validate was that we hadn't had an explosion of metrics.

There are some new metrics in the 1.7 envoy:

< # TYPE envoy_cluster_upstream_rq_max_duration_reached counter
< # TYPE envoy_cluster_zone_europe_west4_a__upstream_rq counter
< # TYPE envoy_cluster_zone_europe_west4_a__upstream_rq_200 counter
< # TYPE envoy_cluster_zone_europe_west4_a__upstream_rq_completed counter
< # TYPE envoy_server_envoy_bug_failures counter

It looks as if locality metrics are now enabled by default, which is a bit annoying as we don't need them; they're also still in a strange, broken statsd format (the europe-west4-a part should be a label). See #20235.

Either way, I wouldn't consider this a large amount of additional metrics, so it wouldn't account for the 10MB base jump in the proxies.
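As a rough sketch of the comparison above, the metric families of the two versions can be diffed directly, assuming each proxy's /stats/prometheus output has been saved to a local file (the file names below are placeholders):

    def metric_families(path):
        """Return the set of metric names declared on '# TYPE' lines."""
        with open(path) as f:
            return {line.split()[2] for line in f if line.startswith("# TYPE")}

    old = metric_families("envoy-1.6-stats.txt")  # hypothetical file names
    new = metric_families("envoy-1.7-stats.txt")

    print("only in 1.7:\n" + "\n".join(sorted(new - old)))
    print("only in 1.6:\n" + "\n".join(sorted(old - new)))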

Stono (Contributor, Author) commented Nov 6, 2020

top from 1.7:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      6 istio-p+  20   0  785232  75056  48768 S   0.0  0.3   0:24.06 pilot-agent
     18 istio-p+  20   0  178772  55004  33220 S   0.3  0.2   7:38.41 envoy

top from 1.6:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      6 istio-p+  20   0  762540  56272  36640 S   0.0  0.2   2:01.13 pilot-agent
     20 istio-p+  20   0  176796  53488  32276 S   0.0  0.2  45:14.45 envoy

This seems to show the increase is in pilot-agent rather than envoy.
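A rough way to break sidecar memory down per process without eyeballing top is sketched below; it assumes it is run (or exec'd) inside the istio-proxy container and simply sums VmRSS from /proc by process name:

    import os

    def rss_by_process():
        """Return {process name: resident set size in MB} from /proc."""
        usage = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    name = f.read().strip()
                with open(f"/proc/{pid}/status") as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            # VmRSS is reported in kB
                            usage[name] = usage.get(name, 0) + int(line.split()[1]) / 1024
            except (FileNotFoundError, PermissionError):
                continue  # process exited or is not readable
        return usage

    usage = rss_by_process()
    for name in ("pilot-agent", "envoy"):
        print(f"{name}: {usage.get(name, 0.0):.1f} MB")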

Stono (Contributor, Author) commented Nov 6, 2020

Associated: #26232

@kyessenov (Contributor)

istio-agent's memory increased because the xDS proxy was added, which requires importing a whole bunch of protobufs from Envoy. I think the question is whether the value added by the proxy justifies the cost.

@mandarjog (Contributor)

Is the xDS proxy already present in 1.7?
If you look at the effect on memory, 1.7 uses 15MB of additional unshared memory. Details are in the attached bug.

@mandarjog (Contributor)

Multiple questions:

  1. Is it possible for a user to pay the cost of the xDS proxy only when they use features that require it?
  2. Is it possible to fix it?
  3. 25% is a large number, but the absolute number is what really matters, which is 50-60MB per proxy; that is why I felt we don't need to make it a release blocker. A Java app would use 100s MBs to a few GBs.
     But in aggregate Istio ends up using more memory, for sure.

@howardjohn (Member)

From 3-month-old memory: when looking into this, it's extremely challenging to reduce the binary size in 1.7.

In 1.9 we will drop the k8s imports, which I have done locally, and it's 40MB, so we definitely have a long-term path forward. The gogo API migration will also drop multiple MB.

@kyessenov (Contributor)

There's a budget for how much per-pod overhead users are going to tolerate. Keep in mind Wasm overhead, and whether it's better to spend the budget on the xDS proxy or on Wasm.

@howardjohn (Member)

We can reduce the overhead of the xDS proxy. We don't unmarshal Any, so we don't need to import all the filters. The bare cost of just importing core xDS is not that high.

Stono (Contributor, Author) commented Nov 6, 2020

A Java app would use 100s MBs to a few GBs.

I'm not sure I see the point here; you can equally consider the other extreme, where people are building Golang microservices running at 10-20MB, and adding Istio at 50-60MB to their application architecture is a significant increase. The other key word there is microservices. Larger proxies favour people with larger/monolithic services, which is counter-productive when one of Istio's biggest selling points is the observability it gives you when building large microservice architectures!

25% is a large number, but the absolute number is what really matters, which is 50-60MB per proxy; that is why I felt we don't need to make it a release blocker

"25% increase in memory usage of the data plane" rightly paints a very different picture in peoples heads than "15mb increase in memory usage per proxy". So that is why the percentage matters, because it more accurately depicts the impact to the users regardless of their number of proxies.

In real terms for me, that's over a 15GB increase cluster-wide, which isn't something to be sniffed at, especially considering we've just taken a 10GB increase to facilitate SDS. So over the course of two releases we've not far off doubled the operating overhead (memory) of Istio. This means proxy memory usage now accounts for 8-10% of all our memory usage.
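Purely to make the two framings concrete, a tiny sketch relating a per-proxy delta to the percentage and cluster-wide views (the numbers below are illustrative placeholders, not measurements from this cluster):

    def framings(baseline_mb, delta_mb, proxy_count):
        """Relate a per-proxy memory delta to the percentage and cluster-wide views."""
        pct = 100.0 * delta_mb / baseline_mb
        cluster_gb = delta_mb * proxy_count / 1024
        return pct, cluster_gb

    pct, gb = framings(baseline_mb=40, delta_mb=10, proxy_count=500)  # hypothetical values
    print(f"+{pct:.0f}% per proxy, roughly +{gb:.1f} GB across the cluster")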

Stono (Contributor, Author) commented Nov 9, 2020

I'm going to close this in favour of #26232, which seems to be tracking the regression and all associated fixes.

@howardjohn has also backported #28670 to 1.7 (awaiting merge), which will help a bit with the bloat on 1.7 (believed to land roughly halfway between what I've observed here and 1.6).

I think, for those coming into this issue, we'll have to accept some increase in 1.7, with many improvements coming in 1.8 and 1.9.

Stono closed this as completed Nov 9, 2020