
istio pilot cpu spikes #23031

Closed
turbotankist opened this issue Apr 17, 2020 · 9 comments
Labels
area/perf and scalability
lifecycle/automatically-closed: Indicates a PR or issue that has been closed automatically.
lifecycle/stale: Indicates a PR or issue hasn't been manipulated by an Istio team member for a while.

Comments

@turbotankist

turbotankist commented Apr 17, 2020

istio-pilot CPU usage spikes too high:

kubectl top po -n istio-system -l app=pilot 
NAME                           CPU(cores)   MEMORY(bytes)   
istio-pilot-86bcb99c64-56ftw   8347m        491Mi           
istio-pilot-86bcb99c64-kc9qt   11512m       595Mi           
istio-pilot-86bcb99c64-pzdh7   11306m       511Mi           
istio-pilot-86bcb99c64-qvl58   10464m       343Mi           
istio-pilot-86bcb99c64-wt857   3694m        1978Mi
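
To put the numbers above in perspective, the five pilot pods together use 45323m, roughly 45 cores. The following is a minimal sketch for totalling the CPU column of `kubectl top` output. The function name `sum_millicores` is hypothetical, and it assumes CPU values are either millicores with an `m` suffix or whole cores.

```shell
# Sum the CPU(cores) column of `kubectl top pods` output, in millicores.
# Reads the table on stdin and skips the header line.
# ASSUMPTION: values end in "m" (millicores) or are whole cores.
sum_millicores() {
  awk 'NR > 1 && NF >= 2 {
         cpu = $2
         if (cpu ~ /m$/) { sub(/m$/, "", cpu); total += cpu }
         else            { total += cpu * 1000 }
       }
       END { printf "%d\n", total }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl top po -n istio-system -l app=pilot | sum_millicores
```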

Expected behavior
https://archive.istio.io/v1.4/docs/ops/deployment/performance-and-scalability/
I expected lower CPU usage.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

istioctl version --remote                    <aws:px-dev>
client version: 1.4.6
citadel version: 1.4.5
citadel version: 1.4.5
galley version: 1.4.5
galley version: 1.4.5
ingressgateway version: 1.4.5
ingressgateway version: 1.4.5
ingressgateway version: 1.4.5
ingressgateway version: 1.4.5
ingressgateway version: 1.4.5
pilot version: 1.4.5
pilot version: 1.4.5
pilot version: 1.4.5
pilot version: 1.4.5
pilot version: 1.4.5
policy version: 1.4.5
sidecar-injector version: 1.4.5
sidecar-injector version: 1.4.5
telemetry version: 1.4.5
waf-gateway-private version: 
waf-gateway-private version: 
waf-gateway-public version: 
waf-gateway-public version: 
waf-gateway-solaris version: 
waf-gateway-solaris version: 
waf-gateway-vpn version: 
waf-gateway-vpn version: 
data plane version: 1.4.5 (700 proxies)

How was Istio installed?

    helm upgrade --namespace istio-system --install istio ./helm/istio/istio \
    -f ./helm/istio/istio/values.yaml \
    --set pilot.image=docker.io/cilium/istio_pilot:{{ istio_version }} \
    --set sidecarInjectorWebhook.enabled=true \
    --set global.controlPlaneSecurityEnabled=true \
    --set global.mtls.enabled=true \
    --set global.proxy.image=docker.io/cilium/istio_proxy:{{ istio_version }} \
    --set ingress.enabled=false \
    --set egressgateway.enabled=true \
    --set global.proxy.resources.requests.cpu=50m \
    --set global.proxy.resources.requests.memory=96Mi \
    --set global.proxy.resources.limits.cpu=1000m \
    --set global.proxy.resources.limits.memory=960Mi

Environment where bug was observed (cloud vendor, OS, etc)
AWS Elastic Kubernetes Service (EKS)

@howardjohn
Member

We cannot really help with performance issues without a lot more info: the number of services, pods, and namespaces, whether you are using Sidecar, the rate of config change, the rate of endpoint change, and so on. See https://github.com/istio/istio/wiki/Analyzing-Istio-Performance as well. I also recommend using Istio 1.5, which has more performance improvements.
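
The sizing information requested above can be gathered roughly as follows. This is a sketch, not an official procedure: the helper names `count_kind` and `mesh_size` are hypothetical, and it only relies on `kubectl get` with `--all-namespaces` and `--no-headers`. It does not measure the rate of config or endpoint change.

```shell
# Count resources of one kind across all namespaces.
# stderr is dropped so "No resources found" does not pollute the count.
count_kind() {
  kubectl get "$1" --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Print "<kind><TAB><count>" for every kind passed as an argument.
mesh_size() {
  for kind in "$@"; do
    printf '%s\t%s\n' "$kind" "$(count_kind "$kind")"
  done
}

# Usage:
#   mesh_size services pods namespaces \
#     sidecars.networking.istio.io gateways.networking.istio.io
```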

@turbotankist
Author

The problem is still occurring:

image

image

image

Maybe I can share some metrics from Pilot?
Could the large number of gateways also be a cause of this problem? We have about 300-500 namespaces, and each has its own Gateway config.

@howardjohn
Member

Yes, a large amount of config contributes to the cost. Istio 1.6, releasing next week, also has very substantial performance improvements. It would be very helpful to get a CPU profile following https://github.com/istio/istio/wiki/Analyzing-Istio-Performance, otherwise we will just be guessing where time is spent.
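
For readers who want to capture such a profile, here is a rough sketch. It assumes Pilot serves Go's standard `net/http/pprof` endpoints on a debug port; the port (8080 here), the output file name, and the helper names `pprof_url` and `capture_profile` are assumptions for illustration, so check the wiki page above for the procedure that matches your Istio version.

```shell
# Build the URL of Go's standard CPU-profile endpoint (net/http/pprof).
# $1 = local port, $2 = sampling duration in seconds
pprof_url() {
  printf 'http://localhost:%s/debug/pprof/profile?seconds=%s\n' "$1" "$2"
}

# Forward a pilot pod's debug port locally and save the raw CPU profile.
# ASSUMPTION: port 8080 exposes the debug endpoints; adjust for your version.
# $1 = pod name, $2 = port (default 8080), $3 = seconds (default 30)
capture_profile() {
  pod=$1; port=${2:-8080}; seconds=${3:-30}
  kubectl -n istio-system port-forward "$pod" "$port:$port" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give port-forward a moment to start listening
  curl -s -o "pilot-cpu-$pod.pb.gz" "$(pprof_url "$port" "$seconds")"
  kill "$pf_pid"
}

# Usage:
#   capture_profile istio-pilot-86bcb99c64-56ftw
#   go tool pprof -top pilot-cpu-istio-pilot-86bcb99c64-56ftw.pb.gz
```

Keeping the raw `.pb.gz` file, rather than only a rendered image, lets others explore the profile themselves.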

@turbotankist
Author

I've got a CPU profile:
image
image
pilot.txt

@howardjohn
Member

Thanks! If you still have it around, can you upload the raw tar.gz of the profile? It helps to dig around a bit more.

But ultimately, right now there don't look to be any major areas that are going wrong. It just looks like standard high load on Pilot.

@infa-ddeore

infa-ddeore commented Jul 30, 2020

@howardjohn

I am also facing this issue. istiod (1.5.8) CPU usage spiked when active xDS connections increased from ~185 to ~220.

The istiod pods scaled up to the HPA maximum of 5. I raised the HPA maximum to 10, and they then scaled to 10.

image
image
image

Here is a CPU profile from one of the istiod pods:
pprof.pilot-discovery.samples.cpu.001.pb.gz

I am trying to find out:

  1. why istiod needed so much CPU for only ~220 xDS connections
  2. why the xDS request size increased so much all of a sudden

@howardjohn
Member

Look at the "Pilot Errors" section - you have a huge error rate, which is likely causing weird things to happen. Look at ACK ERROR in the istiod logs to see what is going wrong

@infa-ddeore

> Look at the "Pilot Errors" section - you have a huge error rate, which is likely causing weird things to happen. Look at ACK ERROR in the istiod logs to see what is going wrong

Thanks for the pointer. The issue had resolved on its own by the time I posted the comment; see the drop at the end of all the graphs.

The istiod logs show two types of ACK ERROR, with 600k+ occurrences in the last 3 days.

This one occurs because the Kubernetes secret referenced in the Gateway object is missing:

2020-07-31T04:25:33.876552Z\twarn\tads\tADS:LDS: ACK ERROR 10.xx.xx.xx:35438 router~10.xx.xx.xx~usergateway-7844d4c765-8nplq.dev-ns~dev-ns.svc.cluster.local-5135 Internal:Error adding/updating listener(s) 0.0.0.0_10002: Invalid path: /etc/istio/ingressgateway-ca-certs/ca-chain.cert.pem

I guess the next error occurs because istiod is at 1.5.8 while the user gateway is still at 1.4.x, so the gateway is not getting the certificate on its filesystem:

timestamp":1596172249487,"log":"2020-07-31T05:10:49.486763Z\twarn\tads\tADS:CDS: ACK ERROR 10.22.10.65:33788 router~10.22.10.65~some-namespace-istio-custom-ingress-gateways-98d8db596-8bwst.some-namespace~some-namespace.svc.cluster.local-5268 Internal:Error adding/updating cluster(s) outbound|443||kubernetes.default.svc.cluster.local: Invalid path: /etc/certs/cert-chain.pem, outbound|44134||tiller-deploy.kube-system.svc.cluster.local: Invalid path: /etc/certs/cert-chain.pem, outbound|80||
....
....
....

Do you know how an increased error rate would have caused the issue?

Because the issue has resolved, I don't know whether the increased error rate caused it. I will update here if the issue is reproducible.
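
A quick way to see which xDS type dominates the error count is to tally the `ADS:<type>: ACK ERROR` marker that appears in the log lines above. This is a minimal sketch under that assumption about the log format; the function name `count_ack_errors` is hypothetical and `<istiod-pod>` is a placeholder.

```shell
# Count ACK ERROR log lines per xDS type (LDS, CDS, ...) from logs on stdin.
# Prints one "<type> <count>" line per type, sorted by type name.
# ASSUMPTION: each error line contains "ADS:<TYPE>: ACK ERROR".
count_ack_errors() {
  grep -o 'ADS:[A-Z]*: ACK ERROR' \
    | sed 's/^ADS:\([A-Z]*\):.*/\1/' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage:
#   kubectl logs -n istio-system <istiod-pod> | count_ack_errors
```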

@istio-policy-bot added the lifecycle/stale label Oct 29, 2020
@istio-policy-bot

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2020-07-30. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

@istio-policy-bot added the lifecycle/automatically-closed label Nov 13, 2020