
Istio Sidecar consuming high CPU Istio 1.3.0-1.3.3 #18229

Closed
crhuber opened this issue Oct 23, 2019 · 30 comments · Fixed by envoyproxy/envoy#8922

@crhuber

crhuber commented Oct 23, 2019

Bug description
We are running Istio v1.3.3 (this also happened on v1.3.0), and most of our applications are running fine. However, in one of our deployments the CPU on the sidecar is pegged at 1.0 for the life of the pod. This happens for the majority of the pods in the same deployment, but not all. The pod contains nginx, istio-proxy and an application container, all of which have normal CPU use except istio-proxy. I turned on debug logging on the sidecar, but the logs are not really interesting:

 istio-proxy [2019-09-25 09:45:06.322][27][debug][client] [external/envoy/source/common/http/codec_client.cc:95] [C1145026] response complete
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:214] Called Mixer::Filter : onDestroy state: 2
 istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:199] [C1145026] response complete
 istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:237] [C1145026] moving to ready
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:228] Called Mixer::Filter : log
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [./src/envoy/http/mixer/report_data.h:144] No dynamic_metadata found for filter envoy.filters.http.rbac
 istio-proxy [2019-09-25 09:45:06.323][27][debug][filter] [src/istio/control/client_context_base.cc:138] Report attributes: attributes {
 istio-proxy   key: "connection.mtls"
 istio-proxy   value {
 istio-proxy     bool_value: false
 istio-proxy   }
 istio-proxy }
 istio-proxy attributes {
 istio-proxy   key: "context.protocol"
 istio-proxy   value {
 istio-proxy     string_value: "http"
 istio-proxy   }
 istio-proxy }
 istio-proxy attributes {
 istio-proxy   key: "context.prox
 istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:520] [C1145208] remote close
 istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:190] [C1145208] closing socket: 0
 istio-proxy [2019-09-25 09:45:06.324][27][debug][main] [external/envoy/source/server/connection_handler_impl.cc:80] [C1145208] adding to cleanup list
 istio-proxy [2019-09-25 09:45:06.324][27][debug][filter] [src/envoy/tcp/mixer/filter.cc:100] [C1138233] Called tcp filter onRead bytes: 17
 istio-proxy [2019-09-25 09:45:06.324][28][debug][router] [external/envoy/source/common/router/router.cc:1002] [C637330][S2924433164096281138] upstream headers complete: end_stream=false
 istio-proxy [2019-09-25 09:45:06.324][28][debug][filter] [src/envoy/http/mixer/filter.cc:141] Called Mixer::Filter : encodeHeaders 2
 istio-proxy [2019-09-25 09:45:06.324][28][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1435] [C637330][S2924433164096281138] encoding headers via codec (end_stream=false):

I was able to collect a strace from the PID of the problematic pod:

sudo strace -c -f -p 5296
strace: Process 5296 attached with 15 threads
^Cstrace: Process 5296 detached
strace: Process 5329 detached
strace: Process 5330 detached
strace: Process 5331 detached
strace: Process 5332 detached
strace: Process 5338 detached
strace: Process 5340 detached
strace: Process 5341 detached
strace: Process 5342 detached
strace: Process 5343 detached
strace: Process 5344 detached
strace: Process 5345 detached
strace: Process 5346 detached
strace: Process 5347 detached
strace: Process 6356 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.39    0.176384         256       687        44 futex
  9.04    0.021728          49       436           nanosleep
  8.82    0.021201         122       173           epoll_pwait
  6.45    0.015504        3100         5         2 restart_syscall
  0.64    0.001548          30        50           write
  0.50    0.001208          11       109        50 read
  0.31    0.000753          75        10           sched_yield
  0.30    0.000715          14        50        25 accept4
  0.24    0.000567          22        25           close
  0.16    0.000385           7        50           epoll_ctl
  0.08    0.000181           7        25           getsockname
  0.07    0.000178           7        25           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.240352                  1645       121 total

Here is the output of kubectl -n $namespace exec -it $pod -c istio-proxy -- top on a pod with a busy envoy:

top - 15:33:18 up 18:25,  0 users,  load average: 1.58, 1.56, 1.53
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 22.5 us, 15.0 sy,  0.0 ni, 62.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  7807704 total,  1493152 free,  3275896 used,  3038656 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4147820 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                                                                                                              
   21 istio-p+  20   0  228232 112156  19180 S 104.7  1.4  51:18.67 envoy                                                                                                                                                                                                                                                                                                
    1 istio-p+  20   0  138296  25400  20892 S   0.0  0.3   0:07.84 pilot-agent                                                                                                                                                                                                                                                                                          
   33 istio-p+  20   0   36836   3016   2584 R   0.0  0.0   0:00.00 top

There has been a forum thread about this, but it has been ongoing without any resolution. So far the only workaround has been to downgrade the sidecar to 1.2.6:

https://discuss.istio.io/t/istio-sidecar-consuming-high-cpu/3894

Any ideas on how I can troubleshoot further?

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
Sidecar CPU usage should not spike.

Steps to reproduce the bug
Unable to reproduce

Version (include the output of istioctl version --remote and kubectl version)

citadel version: 1.3.0-1.3.3
galley version: 1.3.0-1.3.3
ingressgateway version: 1.3.0-1.3.3
pilot version: 1.3.0-1.3.3
policy version: 1.3.0-1.3.3
sidecar-injector version: 1.3.0-1.3.3
telemetry version: 1.3.0-1.3.3

How was Istio installed?
helm chart

Environment where bug was observed (cloud vendor, OS, etc)
amazon eks

@crhuber
Author

crhuber commented Oct 23, 2019

From a different deployment with the same issue:
sudo strace -c -f -p 13144

strace: Process 13144 attached with 15 threads
^Cstrace: Process 13144 detached
strace: Process 13207 detached
strace: Process 13208 detached
strace: Process 13209 detached
strace: Process 13210 detached
strace: Process 13211 detached
strace: Process 13213 detached
strace: Process 13218 detached
strace: Process 13219 detached
strace: Process 13220 detached
strace: Process 13221 detached
strace: Process 13222 detached
strace: Process 13225 detached
strace: Process 13226 detached
strace: Process 15569 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 63.45    0.008818          90        97         6 futex
 15.72    0.002184          28        77           nanosleep
 11.23    0.001561          57        27           epoll_pwait
  6.82    0.000948         189         5         2 restart_syscall
  1.17    0.000162          20         8           write
  0.50    0.000070          17         4           close
  0.42    0.000058           3        19         8 read
  0.36    0.000050           6         8         4 accept4
  0.21    0.000029           3         8           epoll_ctl
  0.09    0.000012           3         4           setsockopt
  0.04    0.000005           1         4           getsockname
  0.00    0.000000           0         1           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00    0.013897                   262        20 total

strace -f -r -p 13144

strace: Process 13144 attached with 15 threads
[pid 15569]      0.000000 futex(0xc0001864c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13226]      0.000020 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 13225]      0.000025 futex(0xc000369648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13222]      0.000009 waitid(P_PID, 21,  <unfinished ...>
[pid 13221]      0.000027 futex(0xc000368f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13220]      0.000031 futex(0xc0003444c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13219]      0.000010 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 13218]      0.000009 epoll_pwait(5,  <unfinished ...>
[pid 13213]      0.000023 futex(0x25efb58, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13211]      0.000025 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 13210]      0.000028 read(4,  <unfinished ...>
[pid 13209]      0.000031 futex(0xc000070848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13208]      0.000010 futex(0x25efc40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13207]      0.000027 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 13144]      0.000028 futex(0x25d3088, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 13218]      2.516289 <... epoll_pwait resumed> [{EPOLLIN, {u32=3188154256, u64=140122201218960}}], 128, -1, NULL, 0) = 1
[pid 13218]      0.000049 futex(0x25d2430, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 13218]      0.000036 accept4(3,  <unfinished ...>
[pid 13207]      0.000016 <... restart_syscall resumed> ) = 0
[pid 13207]      0.000033 nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 13218]      0.000028 <... accept4 resumed> {sa_family=AF_INET6, sin6_port=htons(39410), inet_pton(AF_INET6, "::ffff:172.24.90.191", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 6
[pid 13218]      0.000082 epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3188153632, u64=140122201218336}} <unfinished ...>
[pid 13207]      0.000040 <... nanosleep resumed> NULL) = 0
[pid 13218]      0.000012 <... epoll_ctl resumed> ) = 0
[pid 13218]      0.000028 getsockname(6,  <unfinished ...>
[pid 13207]      0.000016 nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 13218]      0.000014 <... getsockname resumed> {sa_family=AF_INET6, sin6_port=htons(15020), inet_pton(AF_INET6, "::ffff:172.24.69.216", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [112->28]) = 0
[pid 13218]      0.000039 setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 13218]      0.000056 futex(0xc000070848, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13207]      0.000018 <... nanosleep resumed> NULL) = 0
[pid 13218]      0.000010 <... futex resumed> ) = 1
[pid 13209]      0.000012 <... futex resumed> ) = 0
[pid 13207]      0.000014 nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 13218]      0.000021 accept4(3,  <unfinished ...>
[pid 13209]      0.000014 epoll_pwait(5,  <unfinished ...>
[pid 13218]      0.000011 <... accept4 resumed> 0xc0000a6bb0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13209]      0.000113 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=3188153632, u64=140122201218336}}], 128, 0, NULL, 0) = 1
[pid 13207]      0.000018 <... nanosleep resumed> NULL) = 0
[pid 13209]      0.000011 epoll_pwait(5,  <unfinished ...>
[pid 13218]      0.000011 read(6,  <unfinished ...>
[pid 13207]      0.000011 nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 13218]      0.000012 <... read resumed> "GET /healthz/ready HTTP/1.1\r\nHos"..., 4096) = 130
[pid 13218]      0.000053 futex(0xc000368f48, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 13221]      0.000070 <... futex resumed> ) = 0
[pid 13207]      0.000014 <... nanosleep resumed> NULL) = 0
[pid 13221]      0.000013 futex(0xc000369648, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13207]      0.000016 nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 13225]      0.000013 <... futex resumed> ) = 0
[pid 13221]      0.000010 <... futex resumed> ) = 1
[pid 13225]      0.000010 futex(0xc0001864c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 13221]      0.000012 read(6,  <unfinished ...>
[pid 15569]      0.000011 <... futex resumed> ) = 0
[pid 13225]      0.000009 <... futex resumed> ) = 1

@elvizlai

same issue

@fpesce
Contributor

fpesce commented Oct 30, 2019

@elvizlai / @crhuber Would you mind sharing some of your configuration dump?
Possibly using the dump_kubernetes.sh tool that we're working on in this pull request: #18013?

@gustavovalverde

I'm also experiencing this: Istio is using the whole CPU, and it autoscaled to 5 (using 5 vCPUs).

I made a simple install on GKE using these parameters (no further customization):

helm install install/kubernetes/helm/istio-init --name istio-init --namespace istio-system --set certmanager.enabled=true

helm install install/kubernetes/helm/istio-cni --name istio-cni --namespace kube-system \
    --set cniBinDir=/home/kubernetes/bin \

helm install install/kubernetes/helm/istio --name istio --namespace istio-system \
    --set istio_cni.enabled=true \
    --set kiali.enabled=true \
    --set grafana.enabled=true \
    --set tracing.enabled=true \
    --set gateways.istio-ingressgateway.sds.enabled=true \
    --set gateways.istio-ingressgateway.resources.requests.cpu=500m \
    --set gateways.istio-ingressgateway.resources.requests.memory=256Mi \
    --set pilot.traceSampling=30 \
    --set sidecarInjectorWebhook.rewriteAppHTTPProbe=true \
    --set gateways.istio-ingressgateway.loadBalancerIP=""

@sl31pn1r

sl31pn1r commented Nov 4, 2019

We recently upgraded from 1.2.0 to 1.3.1 and are experiencing the same issue.

I did some investigation and it seems that, for us, the issue only appears when Istio acts as an HTTP proxy.
The sidecar uses anywhere between 1 and 2 CPU cores if the service definition has name: http in the port section (as the documentation recommends). If the sidecar acts only as a TCP proxy there is no issue.

For example, at the moment we only have liveness and readiness checks running, no real traffic.

In the proxy logs I can see the switch when I add 'name: http' to the service definition port section:

   - name: http
     nodePort: 30920
     port: 80
     protocol: TCP
     targetPort: 6450

Log section:

[2019-11-04T10:44:57.965Z] "- - -" 0 - "-" "-" 0 0 4002 - "-" "-" "-" "-" "127.0.0.1:6450" inbound|80||service2.service2.svc.cluster.local 127.0.0.1:44098 172.XX.XX.56:6450 172.XX.XX.4:61737 - -
[2019-11-04T10:45:03.751Z] "GET /liveness HTTP/1.1" 200 - "-" "-" 0 32 6 6 "-" "kube-probe/1.14" "3254795b-0327-469e-9fec-866184147f52" "172.XX.XX.56:6450" "127.0.0.1:6450" inbound|80|http|service2.service2.svc.cluster.local - 172.XX.XX.56:6450 172.XX.XX..35:53446 - default
[2019-11-04T10:45:03.870Z] "GET /readiness HTTP/1.1" 200 - "-" "-" 0 215 5 4 "-" "kube-probe/1.14" "85ef2b38-03c3-4519-8601-ff0fc28745fd" "172.XX.XX.56:6450" "127.0.0.1:6450" inbound|80|http|service2.service2.svc.cluster.local - 172.XX.XX.56:6450 172.XX.XX.35:53452 - default

Service2 and service3 have 'name: http' in the port definition; the others do not.

CPU usage with Istio 1.3.1:

NAMESPACE          NAME                                      CPU(cores)   MEMORY(bytes)
service1           service1-65f7d9d8c4-26cd4                 8m           111Mi
service2           service2-7f75b699d4-29q85                 1115m        166Mi
service3           service3-6fc5b956cf-sbmbt                 1502m        108Mi
service4           service4-6f594f98c-zdq5p                  10m          169Mi

Downgraded back to Istio 1.2.0 and the issue disappeared.
CPU usage under 1.2.0:

NAMESPACE          NAME                                      CPU(cores)   MEMORY(bytes)
service1           service1-65f7d9d8c4-zdq5p                 8m           109Mi
service2           service2-7f75b699d4-hp5dq                 7m           169Mi
service3           service3-6fc5b956cf-xt9bc                 2m           100Mi
service4           service4-6f594f98c-s9np6                  7m           170Mi

Istio is deployed with the following helm template generation:

helm template /istio-${ISTIO_VERSION}/install/kubernetes/helm/istio \
                --name istio \
                --namespace istio-system \
                --set gateways.istio-ingressgateway.serviceAnnotations."service\.beta\.kubernetes\.io/azure-dns-label-name"="$DNS_LABEL_NAME" \
                --set gateways.istio-egressgateway.enabled='true' \
                --set global.outboundTrafficPolicy.mode="REGISTRY_ONLY" \
                --set global.proxy.accessLogFile="/dev/stdout" \
                --set global.controlPlaneSecurityEnabled=true \
                --set mixer.adapters.useAdapterCRDs=false \
                --set grafana.enabled=true \
                --set tracing.enabled=true \
                --set kiali.enabled=true \
                --set "kiali.dashboard.jaegerURL=http://jaeger-query:16686" \
                --set "kiali.dashboard.grafanaURL=http://grafana:3000" \
                > /home/admin/istio-$ISTIO_VERSION.yaml

edit: fixed bad formatting

@hendrikmaus

We are also facing the issue after an upgrade from 1.2.4 to 1.3.3; as a temporary fix, we used the sidecar.istio.io/proxyImage annotation to downgrade the istio-proxy image to 1.2.4. This has been working out for the past 24 hours without other issues.
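
For reference, a minimal sketch of that workaround on a Deployment's pod template might look like the following; the names and image tag here are illustrative placeholders, not taken from this thread:

# Sketch: pin the injected sidecar to an older proxy image via the
# sidecar.istio.io/proxyImage annotation (all names and tags are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        sidecar.istio.io/proxyImage: docker.io/istio/proxyv2:1.2.4
    spec:
      containers:
      - name: my-app
        image: my-app:latest   # placeholder application image

Note that the annotation only takes effect at injection time, so the pods have to be re-created for the older proxy image to be picked up.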

Furthermore, and I double checked it, the ingress gateways are not affected on our clusters.

@fpesce
Contributor

fpesce commented Nov 5, 2019

Everyone, would you mind experimenting with a different proxy image that includes gdb and Envoy's symbols, so we can try to see whether some kind of infinite loop is identifiable in a backtrace?

If you're manually injecting Istio, you can replace the image in the injected sidecar (as a reminder, the injected manifest can be generated with istioctl kube-inject -f yourconf.yaml; see bookinfo for an example).
If you're running Istio 1.3.3, that means replacing:

        image: docker.io/istio/proxyv2:1.3.3

by

        image: docker.io/francoispesce/proxyv2:1.3.3-gdb

You need to add the capability required for gdb (SYS_PTRACE) to the existing securityContext, as shown here:

        securityContext:
          readOnlyRootFilesystem: true
          runAsUser: 1337
          capabilities:
            add:
            - SYS_PTRACE

Once this is done and your pods are running, the goal is to attach gdb to the running envoy PID in the istio-proxy container and share the output of info threads, to see if there's a pattern across your different use cases. You could also Ctrl-C and continue to figure out whether one of the threads is looping abnormally (everything described here looks like some kind of infinite loop).
Here's a quick example of how to run gdb and get this information:

$ kubectl exec -it yourpod -c istio-proxy -- bash

istio-proxy@yourpod:/$ ps auxw
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
istio-p+      1  1.2  0.1 138216 23196 ?        Ssl  07:17   0:00 /usr/local/bin/pilot-agent proxy sidecar --domain default.svc.cluster.local --configPath /etc/istio/proxy --binaryPath /usr/local/bin/envoy --ser
istio-p+     17 11.4  0.2 151420 35892 ?        Sl   07:17   0:02 /usr/local/bin/envoy -c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster produc
istio-p+     31  1.0  0.0  18260  3300 pts/0    Ss   07:17   0:00 bash
istio-p+     40  0.0  0.0  34424  2792 pts/0    R+   07:17   0:00 ps auxw
istio-proxy@yourpod:/$ sudo gdb -p 17

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fcbde882d00 (LWP 17) "envoy" 0x00007fcbdda71a13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7fcbdd969700 (LWP 20) "default-executo" 0x00007fcbddd41360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7fcbdd168700 (LWP 21) "resolver-execut" 0x00007fcbddd41360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7fcbdc967700 (LWP 22) "grpc_global_tim" 0x00007fcbddd41709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7fcbdc166700 (LWP 23) "envoy" 0x00007fcbdda71a13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7fcbdb965700 (LWP 24) "envoy" 0x00007fcbdda71a13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7fcbdb164700 (LWP 25) "envoy" 0x00007fcbdda71a13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7fcbda963700 (LWP 26) "envoy" 0x00007fcbdda71a13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7fcbda162700 (LWP 27) "envoy" 0x00007fcbddd41360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7fcbd9961700 (LWP 28) "envoy" 0x00007fcbddd41360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7fcbd9160700 (LWP 29) "envoy" 0x00007fcbdda6b4d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6

I've built a bunch of these images for the various 1.3.x proxyv2 versions; you can find them on Docker Hub.

Just FYI, you can build the same image following this simple Dockerfile template:

FROM istio/proxyv2:1.3.4
RUN apt-get update && \
    apt-get install --no-install-recommends -y \
      gdb \
   && apt-get upgrade -y \
   && apt-get clean \
   && rm -rf  /var/log/*log /var/lib/apt/lists/* /var/log/apt/* /var/lib/dpkg/*-old /var/cache/debconf/*-old

ADD https://storage.googleapis.com/istio-build/proxy/envoy-symbol-c33dc49585e5e7b5f616c8b5377a5f1f52505e20.tar.gz /tmp/
RUN rm /usr/local/bin/envoy \
  && tar -xvzf /tmp/envoy-symbol-c33dc49585e5e7b5f616c8b5377a5f1f52505e20.tar.gz --directory / \
  && rm /tmp/envoy-symbol-c33dc49585e5e7b5f616c8b5377a5f1f52505e20.tar.gz

You have to adjust the version in the FROM tag (1.3.4 in this example) and pick the envoy symbol hash indicated in the istio.deps of the same version; e.g. https://github.com/istio/istio/blob/1.3.4/istio.deps shows c33dc49585e5e7b5f616c8b5377a5f1f52505e20 as the proxy "lastStableSHA".

@sl31pn1r

sl31pn1r commented Nov 5, 2019

"> Once this is done and your pods running, the goal would be to attach a gdb to the running pid of envoy in the istio-proxy container, and share the output of info threads, to see if there's a pattern in your different use cases. You could also Ctrl-C and continue to figure out if one of the thread is looping abnormally (everything described here look like some kind of infinite loop)."

I did the test on one of our services, in one of our environments, with the image docker.io/francoispesce/proxyv2:1.3.1-gdb:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f358f17dd00 (LWP 15) "envoy" 0x00007f358e36ca13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7f358e264700 (LWP 18) "default-executo" 0x00007f358e63c360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f358da63700 (LWP 19) "resolver-execut" 0x00007f358e63c360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f358d262700 (LWP 20) "grpc_global_tim" 0x00007f358e63c709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f358ca61700 (LWP 22) "envoy" 0x00007f358e36ca13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7f358c260700 (LWP 23) "envoy" 0x00007f358e36ca13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7f358ba5f700 (LWP 25) "envoy" 0x00007f358e3664d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7f358b25e700 (LWP 26) "envoy" 0x00007f358e36ca13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7f358a25c700 (LWP 27) "envoy" 0x00007f358e63c360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7f358aa5d700 (LWP 28) "envoy" 0x00007f358e36ca13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  11   Thread 0x7f3589a5b700 (LWP 29) "envoy" 0x00007f358e63c360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  12   Thread 0x7f358925a700 (LWP 32) "envoy" 0x00007f358e3664d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6

@StefanCenusa

I did the test on my production environment where this issue happens, using docker.io/francoispesce/proxyv2:1.3.3-gdb.

On a pod with a high-CPU envoy:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f1a5998ed00 (LWP 24) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7f1a58a75700 (LWP 25) "default-executo" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f1a58274700 (LWP 26) "resolver-execut" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f1a57a73700 (LWP 27) "grpc_global_tim" 0x00007f1a58e4d709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f1a57272700 (LWP 28) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7f1a56a71700 (LWP 29) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7f1a56270700 (LWP 30) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7f1a55a6f700 (LWP 31) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7f1a5526e700 (LWP 32) "envoy" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7f1a54a6d700 (LWP 33) "envoy" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7f1a5426c700 (LWP 34) "envoy" 0x00007f1a58b774d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6

For comparison, on a pod where envoy was not using high CPU I got this:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f940423ed00 (LWP 22) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7f9403325700 (LWP 23) "default-executo" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f9402b24700 (LWP 24) "resolver-execut" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f9402323700 (LWP 25) "grpc_global_tim" 0x00007f94036fd709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f9401b22700 (LWP 26) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7f9401321700 (LWP 27) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7f9400b20700 (LWP 28) "envoy" 0x00007f94034274d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7f940031f700 (LWP 29) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7f93ffb1e700 (LWP 30) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  10   Thread 0x7f93ff31d700 (LWP 31) "envoy" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7f93feb1c700 (LWP 32) "envoy" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

Strangely, after quitting the gdb process, the envoy process was killed (it no longer showed up in ps auxw).

Moreover, during earlier debugging in our case we were able to see (using strace on the node) that envoy was in a loop of epoll_wait() on a socket belonging to an established connection from another pod in the mesh. The weird thing was that only 2 segments had gone in/out on that connection.

strace -c -f -p $pid

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.69   17.048433          38    448634           epoll_wait
 29.14    7.028070       42594       165        27 futex
  0.17    0.040000       40000         1         1 restart_syscall
  0.00    0.000000           0        58           read
  0.00    0.000000           0        59           write
  0.00    0.000000           0         1           close
  0.00    0.000000           0        41        20 readv
  0.00    0.000000           0        20           writev
  0.00    0.000000           0         9           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00   24.116503                448988        48 total

Running strace -f -r -p $pid on the node where the busy envoy runs, we see lots of:

[pid 26651]      0.000036 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000036 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000037 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000038 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000042 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000045 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000562 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000051 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 1) = 1
[pid 26651]      0.000038 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 1) = 1
tcp    ESTAB      0      0      x.x.x.202:15006              x.x.x.203:36810               users:(("envoy",pid=29,fd=603))
     skmem:(r0,rb12582912,t0,tb12582912,f0,w0,o0,bl0) ts sack cubic wscale:9,9 rto:204 rtt:0.013/0.006 mss:8899 cwnd:10 segs_in:2 send 54763.1Mbps lastsnd:38913588 lastrcv:38913588 lastack:38913588 rcv_space:26697

@fpesce
Contributor

fpesce commented Nov 6, 2019

Everyone, thanks again for your help. I'd like to ask again if you can share some of your configuration dump?
You can send it to me at [my-github-handle] @ google.com.

I'm curious about any network details (are there any UDP listeners configured, for example).

@lambdai
Contributor

lambdai commented Nov 6, 2019

@StefanCenusa
Could you share the config_dump of the busy sidecar proxy? I'm starting to think you have non-trivial configs/filters, which is why I cannot reproduce this.
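
One way to capture it is via the sidecar's Envoy admin port (15000); the pod and namespace names below are placeholders:

# Dump the sidecar's Envoy configuration to a local file (placeholder pod/namespace):
kubectl -n <namespace> exec <pod> -c istio-proxy -- \
  curl -s localhost:15000/config_dump > config_dump.json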

@duderino
Contributor

duderino commented Nov 6, 2019

@lambdai found the bug tonight. A fix will be out shortly.

In the meantime, we think setting global.proxy.protocolDetectionTimeout = 0 will address the issue for unpatched 1.3 versions. See https://istio.io/docs/reference/config/installation-options/#global-options
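
If you manage the install through Helm values, a minimal sketch of that override as a values.yaml entry would be the following (key path as named above; verify it against your chart version):

# Sketch of the workaround as a Helm values override for unpatched 1.3.x installs:
global:
  proxy:
    protocolDetectionTimeout: 0s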

@StefanCenusa

@duderino I've set it to 0ms but am still having the issue. Moreover, my pilot was already running with:

  # if protocol sniffing is enabled for outbound
  enableProtocolSniffingForOutbound: false
  # if protocol sniffing is enabled for inbound
  enableProtocolSniffingForInbound: false

I guess I'll wait for the fix to try it out...

@sergii-s

sergii-s commented Nov 6, 2019

@duderino didn't work for me either

@howardjohn
Member

howardjohn commented Nov 6, 2019

Try --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s --set global.proxy.protocolDetectionTimeout=0s. Somehow we managed to make this extremely complicated to configure...

To verify it applied correctly, run curl localhost:15000/config_dump | grep listener_filters_timeout. Everything should be "listener_filters_timeout": 0
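
Put together, one possible invocation is sketched below, assuming a Helm 2 release named istio installed from the chart path used earlier in this thread; adjust the release name, chart path and namespaces to your setup:

# Sketch: apply both workaround settings on an existing Helm release (names/paths assumed):
helm upgrade istio install/kubernetes/helm/istio \
  --namespace istio-system \
  --reuse-values \
  --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s \
  --set global.proxy.protocolDetectionTimeout=0s

# Verify from inside an injected pod (placeholder pod/namespace):
kubectl -n <namespace> exec <pod> -c istio-proxy -- \
  curl -s localhost:15000/config_dump | grep listener_filters_timeout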

@gustavovalverde

gustavovalverde commented Nov 6, 2019

@howardjohn after applying this patch, is it supposed to "auto-magically" start using less CPU?

EDIT: It worked without further intervention

@nrjpoddar
Member

@howardjohn @lambdai can you elaborate on the underlying root cause? Is it that the memory buffers allocated for protocol detection are not freed appropriately after timeout?

@sl31pn1r

sl31pn1r commented Nov 8, 2019

Try --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s --set global.proxy.protocolDetectionTimeout=0s

I have redeployed Istio 1.3.1 into one of our environments with these settings and can confirm that the high CPU usage disappears on the sidecars.

@bzurkowski

Experienced the same issue with Istio 1.3.4. The workaround of disabling the protocol detection timeout worked for me.

@howardjohn
Member

This is fixed in 1.3.5 (released today) and 1.4.0 (releasing soon), and mitigations are provided for other releases, so I think it's safe to close this. Thanks everyone for helping us track this down.

@jl-gogovapps

This is fixed in 1.3.5 (released today) and 1.4.0 (releasing soon), and mitigations are provided for other releases, so I think it's safe to close this. Thanks everyone for helping us track this down.

I'll be honest, I'm a little new to Istio here, but I was having (I think) similar issues to those described in this thread. I updated from 1.3.2 to 1.3.5 and still had high CPU, but after applying the Helm values here the CPU usage seems to have come way down. I'll continue to monitor/verify, but wanted to chime in here hoping it's helpful.

@howardjohn
Member

@jl-gogovapps it's possible you did not update the sidecars? See https://istio.io/docs/setup/upgrade/steps/#sidecar-upgrade; you need to actually restart the pods.
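
One way to force the pods to be re-created so they pick up the new sidecar image is to restart the owning workload; this is only a sketch (placeholder names, and kubectl rollout restart requires kubectl 1.15+):

# Re-create the pods so the injector adds the new sidecar image:
kubectl -n <namespace> rollout restart deployment <deployment-name>

# Afterwards, check which images a pod is actually running:
kubectl -n <namespace> get pod <pod> -o jsonpath='{.spec.containers[*].image}'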

@nak3
Member

nak3 commented Nov 13, 2019

Awesome news that it is fixed in 1.3.5. It would be appreciated if the 1.3.5 release notes mentioned it.

@howardjohn
Member

@nak3 see

Security update
ISTIO-SECURITY-2019-006 A DoS vulnerability has been discovered in Envoy.
CVE-2019-18817: An infinite loop can be triggered in Envoy if the option continue_on_listener_filters_timeout is set to True, which is the case in Istio. This vulnerability could be leveraged for a DoS attack.
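
For context, both options sit directly on the Envoy listener. A rough, illustrative sketch of the relevant v2 Listener fields (not an actual Istio config_dump; names are approximate) looks like:

# Illustrative sketch only, not a real config_dump:
name: virtualInbound                         # placeholder listener name
listener_filters:
- name: envoy.listener.http_inspector        # protocol-detection listener filter
listener_filters_timeout: 0s                 # the value the workaround above sets
continue_on_listener_filters_timeout: true   # the option named in CVE-2019-18817

If the timeout is set to 0 it is disabled entirely, which is presumably why the workaround above avoids the buggy code path.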

@nak3
Member

nak3 commented Nov 13, 2019

Oh! So that is this one. I see that the CVE is linked to this issue now. Thank you, and sorry for bothering you @howardjohn.

@jl-gogovapps

jl-gogovapps commented Nov 13, 2019

(@howardjohn)
Thanks! Unfortunately I'm not sure how to tell. I applied the changes (1.3.2 -> 1.3.5) through helm upgrade. However, I did not observe whether the pods restarted, or whether only the new Istio services were rolled out. I did observe the pods roll when I applied the configuration changes, which I did after the upgrade.

Either way, that's good to know!

@maskshell

maskshell commented Jan 2, 2020

So, in release 1.3.5 and above, it's not necessary to keep these two settings?

pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s 
global.proxy.protocolDetectionTimeout=0s

@howardjohn

@dudiakash

I need to use Istio 1.0.5 because I am working on an OpenShift cluster. Can anybody help me with how to set
pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s
global.proxy.protocolDetectionTimeout=0s

in that case, since the manifest command is not available in istioctl 1.0.5?

@howardjohn
Member

howardjohn commented Jan 8, 2020 via email

@c4m4

c4m4 commented Jul 3, 2020

I tried to use --set values.gateways.istio-ingressgateway.resources.requests.cpu=500m but it doesn't work with Istio 1.5.2.
