High RAM usage for istio-proxy #8247

Closed
Stono opened this Issue Aug 26, 2018 · 8 comments

@Stono
Contributor

Stono commented Aug 26, 2018

Hey,
The bump to 1.0.1 has significantly improved my boot times, yay, but I'm noticing a slightly higher memory footprint for the sidecars; they're all circa 220 MB(ish) on our preprod cluster:

Numbers:

  • VirtualServices: 45
  • ServiceEntry: 14
  • Services: 76
  • Pods: 214

[screenshot: sidecar memory usage, preprod cluster, 2018-08-26 08:32:33]

Adding 250ish MB of RAM to the memory footprint of every pod, some of which only use around 500 MB themselves, is quite a bit of overhead.

Interestingly, they're significantly lower in dev (which has fewer apps, and thus fewer services), so there appears to be a correlation:

Numbers:

  • VirtualServices: 22
  • ServiceEntry: 14
  • Services: 52
  • Pods: 139

[screenshot: sidecar memory usage, dev cluster, 2018-08-26 08:45:21]

Stono changed the title from "Potentially exponential growth problem for istio-proxy RAM usage" to "High RAM usage for istio-proxy" on Aug 26, 2018

@Stono

Contributor

Stono commented Aug 26, 2018

OK, I've done some more digging in the dev cluster and it appears correlation =/= causation: I've found some istio-proxy pods on the less utilised cluster also using 2.5x the RAM of others:

[screenshot: istio-proxy memory usage, dev cluster, 2018-08-26 08:56:21]

What I have noticed is that all the istio-proxies using more RAM have some event, at different points in their history, where RAM usage really jumped:

[screenshot: RAM usage jump over time, 2018-08-26 08:57:21]

[screenshot: RAM usage jump over time, 2018-08-26 08:58:31]

I've attached the log from the second screenshot's istio-proxy, which I must add is quite spammy with cluster reloads and gRPC disconnections?

istio-proxy.log

@rshriram

Member

rshriram commented Aug 27, 2018

cc @PiotrSikora. The LCTrie memory optimizations don't seem to help.

@PiotrSikora

Member

PiotrSikora commented Aug 29, 2018

@Stono compared to #7912 and my experiments, those numbers look awfully high for the number of services in your clusters. How many CPUs are on the host systems running those containers? Could you run these commands for me?

$ kubectl exec -it <pod> -c istio-proxy -- grep -c ^processor /proc/cpuinfo
$ kubectl exec -it <pod> -c istio-proxy -- pgrep -w envoy | wc -l

There are two major sources of high RAM usage in the proxy:

  1. Per-cluster/service overhead (~80% of which is allocated for stats), which is currently proportional to the number of services in the cluster,
  2. Per-worker-thread overhead for each cluster/service, which compounds (1).

By default, the proxy starts a worker thread for each CPU on the host system, which means that the more powerful the machine you're using, the more RAM the proxy is going to consume by default.
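A rough way to see where that memory is going is via the Envoy admin endpoint; a quick sketch, assuming the Istio default admin port 15000, that curl is present in the proxy image, and that this Envoy build exposes the server.memory_* stats:

$ kubectl exec -it <pod> -c istio-proxy -- curl -s localhost:15000/stats | grep 'server.memory'
$ kubectl exec -it <pod> -c istio-proxy -- curl -s localhost:15000/stats | grep -c '^cluster\.'

The first command shows heap usage as Envoy accounts for it; the second counts per-cluster stat lines, which gives a feel for how much of the footprint scales with the number of services.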

If you're not processing thousands of requests per second, then you might try lowering the number of worker threads to something more reasonable, even 1 or 2, using the global.proxy.concurrency setting, e.g.:

helm install install/kubernetes/helm/istio --name istio --namespace istio-system --set global.proxy.concurrency=1

Unfortunately, the change didn't make it into the 1.0.1 release, so you'd need to use a daily build from 20180828 or newer to play with it now.

@Stono

Contributor

Stono commented Aug 29, 2018

❯ kubectl exec -it $pod -c istio-proxy -- grep -c ^processor /proc/cpuinfo
12
❯ kubectl exec -it $pod -c istio-proxy -- pgrep -w envoy | wc -l
      30

Unfortunately, I can't use daily releases on these systems.

I'm confused, however, that istio-proxy creates threads based on the CPUs on the host rather than looking at the cgroup allocation for that pod (we would ideally want to limit the amount of CPU a proxy can use). Otherwise you run a real risk of noisy neighbours.
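For what it's worth, the host CPU count the proxy sees and the CPU quota actually granted to the sidecar's cgroup can be compared from inside the container; a quick sketch, assuming cgroup v1 (a quota of -1 means no CPU limit is set):

❯ kubectl exec -it $pod -c istio-proxy -- grep -c ^processor /proc/cpuinfo
❯ kubectl exec -it $pod -c istio-proxy -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us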

Nodes are by nature multi-tenant; based on our node size we get this sort of setup:

❯ k get pods --all-namespaces -o wide | grep gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
conference-application       conference-application-6b754b56c8-4jkrm                    2/2       Running   0          6h        10.198.0.26    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
consumer-platform            consumer-platform-5c78cd6bc9-js86x                         2/2       Running   0          2d        10.198.0.5     gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
core-system                  prometheus-node-exporter-v68sm                             1/1       Running   0          5d        10.193.32.2    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
core-system                  weave-scope-agent-brmhq                                    1/1       Running   0          5d        10.193.32.2    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
data-platform                alertmanager-0                                             2/2       Running   0          5d        10.198.0.2     gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
data-platform                cloudwatch-billing-exporter-5b698f9dbc-g9vcn               1/1       Running   0          5d        10.198.0.14    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  calico-node-4hzlh                                          2/2       Running   0          5d        10.193.32.2    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  calico-typha-horizontal-autoscaler-5545fbd5d6-9qgrq        1/1       Running   0          5d        10.198.0.11    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  filebeat-preprod-gtrww                                     1/1       Running   0          5d        10.198.0.10    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  ip-masq-agent-s594r                                        1/1       Running   0          5d        10.193.32.2    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  kube-dns-788979dc8f-29scp                                  4/4       Running   0          5d        10.198.0.13    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  kube-proxy-gke-delivery-platform-cpu1ram2-f11b16a5-nkfb    1/1       Running   0          5d        10.193.32.2    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
kube-system                  metrics-server-v0.2.1-7486f5bd67-clfm5                     2/2       Running   0          5d        10.198.0.6     gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
location-service-soap        location-service-soap-56f7dd754f-5glkh                     2/2       Running   0          23h       10.198.0.19    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
private-advert-service       private-advert-service-57d7c6f859-w5pm8                    2/2       Running   0          1h        10.198.0.49    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
sauron-web-partners          sauron-web-partners-5659987d69-62cpj                       2/2       Running   0          49m       10.198.0.52    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
sauron-web                   sauron-web-549884d69b-2gq5k                                2/2       Running   0          40m       10.198.0.53    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
search-solr                  zoonavigator-6874bcc68b-5qg8j                              1/1       Running   0          5d        10.198.0.17    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
search-solr                  zoonavigator-api-f6cb74f4d-w4rq9                           1/1       Running   0          5d        10.198.0.4     gke-delivery-platform-cpu1ram2-f11b16a5-nkfb
vehicle-metric-service       vehicle-metric-service-6f5df59fdb-w2ps8                    2/2       Running   0          1d        10.198.0.12    gke-delivery-platform-cpu1ram2-f11b16a5-nkfb

7 of these apps have Istio sidecars, so that's 7 proxies each creating 12 worker threads?

@mandarjog

Contributor

mandarjog commented Aug 29, 2018

@Stono, you can update the sidecar-injector ConfigMap and add

  • --concurrency
  • "1"

and then restart a pod so it gets re-injected.
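Roughly, assuming the default istio-sidecar-injector ConfigMap name and layout (the surrounding args shown here are illustrative, not verbatim):

$ kubectl -n istio-system edit configmap istio-sidecar-injector

Then, in the injected istio-proxy container's args list inside the template, add the two entries:

  args:
  - proxy
  - sidecar
  ...
  - --concurrency
  - "1"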

@Stono

Contributor

Stono commented Aug 30, 2018

Will give it a go today @mandarjog - ta

@Stono

Contributor

Stono commented Aug 30, 2018

Rolling out concurrency=1 seems to have significantly reduced our memory use; below shows sidecar memory in MB:

[screenshot: sidecar memory usage in MB after concurrency=1, 2018-08-30 13:53:17]
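As a sanity check, the earlier thread-count command can be rerun against a re-injected pod; the count should drop well below the 30 seen before, since only the worker threads scale with concurrency (exact totals will vary by Envoy build):

❯ kubectl exec -it $pod -c istio-proxy -- pgrep -w envoy | wc -l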

This does feel like a mitigation though; I would still question having the proxy default to threads === cores on the node. Imagine someone running 64-core nodes with 32 apps (thus 32 proxies) on them, it'd be pretty catastrophic :)

Feels like it'd be more sensible to set a fixed, lower default and then document how to tweak it for high throughput?

@PiotrSikora

Member

PiotrSikora commented Aug 30, 2018

@Stono great, thanks for testing!

Yes, I totally agree that we should have 1 or 2 proxy threads by default, at least for the sidecar; otherwise we're stealing resources away from the workload. Adding global.proxy.concurrency (and soon annotations for it) was just a prerequisite to dialing it down.
