Istio Sidecar consuming high CPU, Istio 1.3.0-1.3.3 #18229
Comments
From a different deployment with the same issue: `strace -f -r -p 13144`
Same issue.
I'm also experiencing this: istio-proxy is using the whole CPU, and it autoscaled to 5 (using 5 vCPU). I made a simple install on GKE using these parameters (no further customization).
We recently upgraded from 1.2.0 to 1.3.1 and are experiencing the same issue. After some investigation, it seems the issue only appears for us when Istio acts as an HTTP proxy. For example, at the moment we only have liveness and readiness checks running, no real traffic. In the proxy logs I can see the behaviour switch when I add 'name: http' to the port section of the service definition (see the sketch below). In the log section, Service2 and service3 have 'name: http' in their port definitions; the others do not. CPU usage was high with Istio 1.3.1; after downgrading back to Istio 1.2.0 the issue disappeared. Istio is deployed via helm template generation.
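To make the convention being referenced concrete: Istio uses the Service port name to decide the protocol, so a port named http is proxied as HTTP. A minimal hypothetical Service (not the commenter's actual manifest) showing the difference:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: service2            # hypothetical, standing in for the "service2" mentioned above
spec:
  selector:
    app: service2
  ports:
  - name: http              # named "http": Istio treats traffic on this port as HTTP
    port: 80
    targetPort: 8080
  - port: 9090              # unnamed: the protocol is not declared via the port name
    targetPort: 9090
```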
We are also facing the issue after an upgrade from 1.2.4 to 1.3.3; as a temporary fix we applied a workaround. Furthermore, and I double-checked it, the ingress gateways are not affected on our clusters.
Everyone, would you mind experimenting with a different proxy image that includes gdb and Envoy's debug symbols, so we can try to see whether some kind of infinite loop shows up in a backtrace?

If you're manually injecting Istio, you can replace the proxy image in the injected sidecar (as a reminder, the injected manifest can be generated with istioctl kube-inject). You also need to add the capability gdb requires (SYS_PTRACE) to the existing securityContext, as shown in the sketch below.

Once this is done and your pods are running, the goal is to attach gdb to the running envoy PID in the istio-proxy container and share the output of info threads, to see if there's a pattern across your different use cases. You can also Ctrl-C and continue a few times to figure out whether one of the threads is looping abnormally (everything described here looks like some kind of infinite loop).

I've built these images for the various 1.3.x proxyv2 releases; you can find them on the docker.io hub. Just FYI, you can build the same image from a simple Dockerfile template: adjust the version in the FROM tag (1.3.4 in this example) and pick the Envoy symbol hash indicated in istio.deps of the same version, e.g. https://github.com/istio/istio/blob/1.3.4/istio.deps shows c33dc49585e5e7b5f616c8b5377a5f1f52505e20 as the proxy "lastStableSHA".
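A sketch of the securityContext change being described, using the gdb image tag mentioned in the next comment (the container spec is abbreviated; adapt it to your own injected manifest):

```yaml
# In the injected Deployment/Pod spec, for the istio-proxy container:
containers:
- name: istio-proxy
  image: docker.io/francoispesce/proxyv2:1.3.1-gdb   # gdb-enabled proxy image from this thread
  securityContext:
    capabilities:
      add:
      - SYS_PTRACE        # lets gdb attach to the running envoy process
```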
I did the test on one of our services, in one of our environments, with the image docker.io/francoispesce/proxyv2:1.3.1-gdb:
I did the test on my production env where this issue happens, on a pod with a high-CPU envoy; for comparison, I also ran it on a pod without high envoy CPU usage. Strangely, after quitting the gdb process, the envoy process was killed (it did not show up any more). Moreover, on previous debugging in our case we were able to see (using …)
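For anyone repeating this test, a sketch of the gdb session being described (standard gdb commands; detaching explicitly before quitting is my own assumption and may or may not avoid the envoy process being killed as reported above):

```sh
# Open a shell in the istio-proxy container of the affected pod (debug image with gdb).
kubectl -n "$namespace" exec -it "$pod" -c istio-proxy -- /bin/sh

# Inside the container: attach gdb to the envoy process.
ENVOY_PID=$(pgrep -o envoy)   # -o picks the oldest matching PID; assumes pgrep is present
gdb -p "$ENVOY_PID"

# Inside gdb:
#   (gdb) info threads            # thread list requested in the comment above
#   (gdb) thread apply all bt     # backtraces for every thread
#   (gdb) detach                  # release the process explicitly
#   (gdb) quit
```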
Everyone, thanks again for your help. I'd like to ask again: can you share some of your configuration dumps? I'm curious about any network details (are there any UDP listeners configured, for example).
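For anyone who wants to share one, two common ways to pull a sidecar configuration dump (standard Istio/Envoy tooling; pod and namespace names are placeholders):

```sh
# Dump Envoy's full configuration via its admin endpoint inside the sidecar.
kubectl -n "$namespace" exec "$pod" -c istio-proxy -- \
  curl -s localhost:15000/config_dump > config_dump.json

# Or inspect just the listeners (e.g. to look for unexpected UDP listeners) via istioctl.
istioctl proxy-config listeners "$pod" -n "$namespace"
```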
@StefanCenusa |
@lambdai found the bug tonight; a fix will be out shortly. In the meantime, we think setting global.proxy.protocolDetectionTimeout = 0 will address the issue for unpatched 1.3 versions. See https://istio.io/docs/reference/config/installation-options/#global-options
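For Helm-based installs, the override would look something like this (a sketch only, assuming a Helm 2 style istio chart from the 1.3 release archive; the two setting names are the ones quoted later in this thread):

```sh
# Regenerate the Istio manifest with protocol detection effectively disabled, then re-apply it.
helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
  --set global.proxy.protocolDetectionTimeout=0s \
  --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s \
  > istio.yaml
kubectl apply -f istio.yaml
```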
@duderino I've set it; I guess I'll wait for the fix to try it out...
@duderino didn't work for me either |
Try setting both pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s and global.proxy.protocolDetectionTimeout=0s. To verify it applied correctly, check the generated configuration (see the sketch below).
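A hedged way to do that check (this assumes the value lands in the istio mesh-config ConfigMap and in Pilot's environment; exact resource and field names may differ across installs):

```sh
# Mesh config should now carry the zero timeout.
kubectl -n istio-system get configmap istio -o yaml | grep protocolDetectionTimeout

# Pilot should have the environment override.
kubectl -n istio-system get deployment istio-pilot -o yaml \
  | grep -A1 PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT
```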
@howardjohn after applying this patch, is it supposed to "auto-magically" start using less CPU? EDIT: it worked without further intervention.
@howardjohn @lambdai can you elaborate on the underlying root cause? Is it that the memory buffers allocated for protocol detection are not freed appropriately after timeout? |
I have redeployed Istio 1.3.1 into one of our environments with these settings and can confirm that the high CPU usage disappears on the sidecars. |
Experienced the same issue with istio |
This is fixed in 1.3.5 (released today) and 1.4.0 (releasing soon), and mitigations are provided for other releases, so I think it's safe to close this. Thanks everyone for helping us track this down.
I'll be honest, I'm a little new to Istio here, but I was having (I think) issues similar to those described in this thread. I updated from 1.3.2 to 1.3.5 and still had high CPU, but after applying the Helm values here the CPU usage seems to have come way down. I'll continue to monitor and verify, but wanted to chime in hoping it's helpful.
@jl-gogovapps it's possible you did not update the sidecars? See https://istio.io/docs/setup/upgrade/steps/#sidecar-upgrade; you need to actually restart the pods.
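A minimal sketch of restarting workloads so they pick up the upgraded sidecar (assumes kubectl 1.15+ for rollout restart; names are placeholders):

```sh
# Restart the deployment so its pods are re-injected with the new sidecar image.
kubectl -n "$namespace" rollout restart deployment "$deployment"
kubectl -n "$namespace" rollout status deployment "$deployment"

# Confirm the proxy image actually changed on a pod.
kubectl -n "$namespace" get pod "$pod" \
  -o jsonpath='{.spec.containers[?(@.name=="istio-proxy")].image}'
```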
Awesome news that it is fixed in 1.3.5. It would be appreciated if 1.3.5's release notes mentioned it.
@nak3 see
Oh! That is the one. I see the CVE is linked to this now. Thank you, and sorry for bothering you @howardjohn
(@howardjohn) Either way, that's good to know!
So, in release 1.3.5 and above, it's not necessary to keep these two settings?
I need to use Istio 1.0.5 because I am working on an OpenShift cluster. Can anybody help me with how to set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s and global.proxy.protocolDetectionTimeout=0s in that case, since the manifest command is not there in istioctl 1.0.5?
That feature isn't in 1.0, so you can't disable it, since it doesn't exist.
I tried to use --set values.gateways.istio-ingressgateway.resources.requests.cpu=500m but it doesn't work with Istio 1.5.2.
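In case it helps: with 1.5-era istioctl installs, gateway resources are usually set through an IstioOperator overlay rather than the old Helm values path. A hedged sketch (field names as I understand the IstioOperator API; double-check against the 1.5.2 docs):

```yaml
# Overlay applied with something like: istioctl manifest apply -f overlay.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          requests:
            cpu: 500m        # the CPU request the comment above was trying to set
```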
Bug description
We are running Istio v1.3.3 (this also happened on v1.3.0), and most of our applications are running fine. However, in one of our deployments the CPU of the sidecar is pegged at 1.0 for the life of the pod. This happens for the majority of the pods in that deployment, but not all. The pod contains nginx, istio-proxy, and an application container, all of which have normal CPU use except istio-proxy. I turned on debug logging on the sidecar; however, the logs are not really interesting.
I was able to collect an strace from the PID of the problematic pod.
Here is the output of `kubectl -n $namespace exec -it $pod -c istio-proxy -- top` on a pod with a busy envoy:
There has been a forum thread about it, but it is ongoing without any resolution. So far the only workaround has been to downgrade the sidecar to 1.2.6:
https://discuss.istio.io/t/istio-sidecar-consuming-high-cpu/3894
Any ideas on how I can troubleshoot further?
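For reference, a sketch of how the strace above can be collected (assumptions: node access is available, strace is installed on the node, and the envoy process is visible from the host PID namespace):

```sh
# From the node hosting the pod: find the envoy PID (container processes are visible on the host).
ENVOY_PID=$(pgrep -o envoy)   # -o = oldest matching process; adjust if several envoys run on the node

# Attach with relative timestamps (-r) and follow forks (-f), matching the trace shared above.
sudo strace -f -r -p "$ENVOY_PID"
```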
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Sidecar CPU usage does not spike.
Steps to reproduce the bug
Unable to reproduce
Version (include the output of istioctl version --remote and kubectl version)
citadel version: 1.3.0-1.3.3
galley version: 1.3.0-1.3.3
ingressgateway version: 1.3.0-1.3.3
pilot version: 1.3.0-1.3.3
policy version: 1.3.0-1.3.3
sidecar-injector version: 1.3.0-1.3.3
telemetry version: 1.3.0-1.3.3
How was Istio installed?
helm chart
Environment where bug was observed (cloud vendor, OS, etc)
amazon eks