
[v2.2dev8] looping stream #662

Closed
wdauchy opened this issue May 29, 2020 · 45 comments
Labels
status: fixed This issue is a now-fixed bug. type: bug This issue describes a bug.

Comments

@wdauchy
Contributor

wdauchy commented May 29, 2020

Output of haproxy -vv and uname -a

HA-Proxy version 2.2-dev8 2020/05/22 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.4.42-2.el7.x86_64 #1 SMP Wed May 20 17:07:45 UTC 2020 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits -DTCP_USER_TIMEOUT=18
  OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_BACKTRACE=1 USE_LINUX_TPROXY=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_ZLIB=1 USE_TFO=1 USE_NS=1 USE_SYSTEMD=1

Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL -LUA +FUTEX +ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=32).
Built with OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
Running on OpenSSL version : OpenSSL 1.1.1g  21 Apr 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with network namespace support.
Built with gcc compiler version 4.8.5 20150623 (Red Hat 4.8.5-39)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.32 2012-11-30
Running on PCRE version : 8.32 2012-11-30
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Running on zlib version : 1.2.7
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with the Prometheus exporter as a service

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP       side=FE|BE     mux=H2
            fcgi : mode=HTTP       side=BE        mux=FCGI
       <default> : mode=HTTP       side=FE|BE     mux=H1
       <default> : mode=TCP        side=FE|BE     mux=PASS

Available services :
        prometheus-exporter

Available filters :
        [SPOE] spoe
        [CACHE] cache
        [FCGI] fcgi-app
        [TRACE] trace
        [COMP] compression

on top of v2.2dev8 we have the following patches:

63a8738 MEDIUM: pools: directly free objects when pools are too much crowded
a1e4f8c MINOR: pools: compute an estimate of each pool's average needed objects
56192cc BUG/MEDIUM: checks: Don't add a tcpcheck ruleset twice in the shared tree

What's the configuration?

global
    log 127.0.0.1 format rfc5424 local0 info
    daemon
    stats socket /var/lib/haproxy/stats level admin mode 600 user haproxy group haproxy expose-fd listeners
    stats timeout 2m
    maxconn 524288
    user haproxy
    group haproxy
    set-dumpable
    nbproc 1
    tune.bufsize 33792
    tune.ssl.default-dh-param 2048
    ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384
    ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_256_GCM_SHA384
    ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
    ssl-default-server-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES256-SHA
    ssl-default-server-ciphersuites TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_256_GCM_SHA384
    ssl-server-verify none
    hard-stop-after 120s
    nbthread 16
    cpu-map auto:1/1-16 0-15
    spread-checks 5
    chroot /etc/haproxy/chroot
    strict-limits
    no busy-polling

defaults
    mode http
    log global
    option httplog
    option http-keep-alive
    option forwardfor except 127.0.0.0/8
    option redispatch 1
    option http-ignore-probes
    retries 3
    retry-on conn-failure empty-response response-timeout 0rtt-rejected
    timeout http-request 10s
    timeout queue 1s
    timeout connect 10s
    timeout client 60s
    timeout server 60s
    timeout http-keep-alive 10s
    timeout check 5s
    maxconn 524288
    balance roundrobin
    http-reuse always
    default-server inter 5s fastinter 1s fall 3 slowstart 20s observe layer7 error-limit 5 on-error fail-check pool-purge-delay 10s tfo
    http-check expect status 200

frontend fe_foo
    bind x.x.x.x:80 name http_ip4 process 1/all tfo namespace haproxy
    bind x.x.x.x:443 name https_ip4 process 1/all ssl crt /etc/haproxy/tls/ alpn h2,http/1.1 tfo allow-0rtt namespace haproxy

    acl http  ssl_fc,not
    acl https ssl_fc
    monitor-uri /delivery/monitor/lb-check
    errorfile 200 /etc/haproxy/errorfiles/200.http
    tcp-request content capture fc_rtt len 10
    capture request header Host len 64
    capture request header user-agent len 128
    capture request header cf-ray len 20
    log-format "%tr %TR/%Tw/%Ta %ac/%fc/%bc/%sc/%rc %sq/%bq %{+Q}r"
    log-format-sd [user@51719\ cli_ip=\"%ci\"\ cli_port=\"%cp\"\ ftd=\"%f\"\ ftd_ip=\"%fi\"\ ftd_port=\"%fp\"\ bkd=\"%b\"\ srv=\"%si:%sp\"\ status=\"%ST\"\ bytes_r=\"%B\"\ tsc=\"%tsc\"\ sslv=\"%sslv\"\ sslc=\"%sslc\"\ h_host=\"%[capture.req.hdr(1)]\"\ meth=\"%HM\"\ version=\"%HV\"\ h_ua=\"%[capture.req.hdr(2)]\"\ cf_dc=\"%[capture.req.hdr(3),word(2,-)]\"\ fc_rtt=\"%[capture.req.hdr(0)]\"\ time_tr=\"%Tr\"\ time_tc=\"%Tc\"\ time_tu=\"%Tu\"]
    http-response set-log-level silent if { rand(100) ge 1 }

    http-request del-header connection if { req.hdr(connection) close }
    http-request set-header x-proto ssl if https
    http-request set-header x-forwarded-proto https if https
    http-request set-header x-tls-sessionid %[ssl_fc_session_id,hex] if https
    http-request add-header x-client-ip %[src]
    http-request set-header client-ip %[src]

    http-request set-header x-real-host %[req.hdr(host)]
    http-request set-header x-server-address %[dst]
    acl content_encoding_header_exists req.hdr(content-encoding) -m found
    http-request set-header x-original-content-encoding %[req.hdr(content-encoding)] if content_encoding_header_exists
    acl host_header_exists req.hdr(host) -m found
    http-request set-header host %[req.hdr(host),field(1,:)] if host_header_exists
    acl ::disabled_http_methods method CONNECT
    http-request deny if ::disabled_http_methods
    http-request disable-l7-retry if ! METH_GET
    http-request redirect scheme https code 302 if !{ ssl_fc } METH_GET
    http-request redirect scheme https code 307 if !{ ssl_fc } ! METH_GET

    acl static-web_url_prefix path_beg -i /.well-known/security.txt
    use_backend be_static-web if static-web_url_prefix
    use_backend %[base,lower,map_beg("/etc/haproxy/maps/fe_foo")]

    default_backend be_foo

backend be_foo
    option httpchk GET /v1/agent/health/service/name/foo?format=text HTTP/1.0
    http-check send-state
    default-server check port 8500
    server srv0 10.0.0.0:31918 weight 10
    server srv1 10.0.0.1:31875 weight 10
    server-template srv 2-27 0.0.0.0:0 check disabled

I might provide more info if needed

Actual behavior

We are quite regularly getting infinite loops in v2.2dev8:

 A bogus STREAM [0x7fcbea72aa00] is spinning at 104116 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7fcbea72aa00 src=x.x.x.x fe=fe_foo be=be_foo dst=srv21 rqf=48c4a060 rqa=0 rpf=c0044060 rpa=4000000 sif=EST,88000 sib=CLO,280190 af=(nil),0 csf=0x7fcc12431e80,104000 ab=(nil),0 csb=0x7fcc0ecf5ac0,0 cof=0x7fcbc8383240,80003300:H2(0x7fcc04ed8600)/SSL(0x7fcbc0cf9100)/tcpv4(25654) cob=0x7fcbaa7fab80,c2300:H1(0x7fcb99709fe0)/RAW((nil))/tcpv4(51) filters={}]
info thread
  Id   Target Id         Frame 
  16   Thread 0x7fcc34bf6700 (LWP 22296) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  15   Thread 0x7fcc35bf8700 (LWP 22294) activity_count_runtime () at include/proto/activity.h:87
  14   Thread 0x7fcc33bf4700 (LWP 22298) 0x000056390c56746f in next_timer_expiry () at src/task.c:304
  13   Thread 0x7fcc399ff700 (LWP 22287) thread_harmless_end () at include/common/hathreads.h:563
  12   Thread 0x7fcc343f5700 (LWP 22297) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  11   Thread 0x7fcc36bfa700 (LWP 22292) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  10   Thread 0x7fcc373fb700 (LWP 22291) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  9    Thread 0x7fcc353f7700 (LWP 22295) 0x00007ffcfb9f2919 in clock_gettime ()
  8    Thread 0x7fcc3a7fd700 (LWP 22286) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  7    Thread 0x7fcc38bfe700 (LWP 22288) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  6    Thread 0x7fcc3b7ff700 (LWP 22284) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  5    Thread 0x7fcc3affe700 (LWP 22285) 0x000056390c5671ba in wake_expired_tasks () at src/task.c:212
  4    Thread 0x7fcc363f9700 (LWP 22293) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  3    Thread 0x7fcc37bfc700 (LWP 22290) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
  2    Thread 0x7fcc81dce440 (LWP 22283) 0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
* 1    Thread 0x7fcc383fd700 (LWP 22289) 0x00007fcc7fb3c387 in raise () from /lib64/libc.so.6
p stopping
$1 = 1
Thread 16 (Thread 0x7fcc34bf6700 (LWP 22296)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7fcc35bf8700 (LWP 22294)):
#0  activity_count_runtime () at include/proto/activity.h:87
#1  _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:182
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7fcc33bf4700 (LWP 22298)):
#0  0x000056390c56746f in next_timer_expiry () at src/task.c:304
#1  0x000056390c5171c7 in run_poll_loop () at src/haproxy.c:2858
#2  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#3  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7fcc399ff700 (LWP 22287)):
#0  thread_harmless_end () at include/common/hathreads.h:563
#1  _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:201
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fcc343f5700 (LWP 22297)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fcc36bfa700 (LWP 22292)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fcc373fb700 (LWP 22291)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fcc353f7700 (LWP 22295)):
#0  0x00007ffcfb9f2919 in clock_gettime ()
#1  0x00007fcc7fc1a7fd in clock_gettime () from /lib64/libc.so.6
#2  0x000056390c439a50 in now_cpu_time () at include/common/time.h:538
#3  tv_leaving_poll (timeout=<optimized out>, interrupted=1) at include/common/time.h:602
#4  _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:199
#5  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#6  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#7  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fcc3a7fd700 (LWP 22286)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fcc38bfe700 (LWP 22288)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fcc3b7ff700 (LWP 22284)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fcc3affe700 (LWP 22285)):
#0  0x000056390c5671ba in wake_expired_tasks () at src/task.c:212
#1  0x000056390c517074 in run_poll_loop () at src/haproxy.c:2804
#2  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#3  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fcc363f9700 (LWP 22293)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fcc37bfc700 (LWP 22290)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#4  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fcc81dce440 (LWP 22283)):
#0  0x00007fcc7fc04eb3 in epoll_wait () from /lib64/libc.so.6
#1  0x000056390c439964 in _do_poll (p=<optimized out>, exp=<optimized out>, wake=<optimized out>) at src/ev_epoll.c:186
#2  0x000056390c517060 in run_poll_loop () at src/haproxy.c:2861
#3  0x000056390c5174ec in run_thread_poll_loop (data=data@entry=0x0) at src/haproxy.c:2979
#4  0x000056390c43726e in main (argc=<optimized out>, argv=0x7ffcfb9bb308) at src/haproxy.c:3681

Thread 1 (Thread 0x7fcc383fd700 (LWP 22289)):
#0  0x00007fcc7fb3c387 in raise () from /lib64/libc.so.6
#1  0x00007fcc7fb3da78 in abort () from /lib64/libc.so.6
#2  0x000056390c473350 in stream_dump_and_crash (obj=obj@entry=0x7fcbea72ab66, rate=104116) at src/stream.c:2696
#3  0x000056390c4755e5 in process_stream (t=<optimized out>, context=0x7fcbea72aa00, state=<optimized out>) at src/stream.c:1517
#4  0x000056390c567666 in run_tasks_from_list (list=list@entry=0x56390c94c540 <task_per_thread+832>, max=max@entry=149) at src/task.c:368
#5  0x000056390c567e5e in process_runnable_tasks () at src/task.c:516
#6  0x000056390c51707b in run_poll_loop () at src/haproxy.c:2807
#7  0x000056390c5174ec in run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2979
#8  0x00007fcc81154ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fcc7fc048dd in clone () from /lib64/libc.so.6

I can provide the coredump privately if there is not enough information.

@wdauchy wdauchy added status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug. labels May 29, 2020
@wtarreau
Member

wtarreau commented May 29, 2020

Thank you William. Could you please issue "p *(struct stream*)0x7fcbea72aa00" so that we get all the elements from that stream? It's interesting to see that the other side has closed; I suspect there's an unclean shutdown situation that's not correctly handled by process_stream().

@wdauchy
Contributor Author

wdauchy commented May 29, 2020

here it is

 p *(struct stream*)0x7fcbea72aa00
$1 = {flags = 3086, uniq_id = 10800265, target = 0x7fcc543b7000, req = {flags = 1220845664, analysers = 0, buf = {size = 0, area = 0x0, data = 0, head = 0}, pipe = 0x0, output = 0, to_forward = 0, last_read = 39614, xfer_large = 0 '\000', xfer_small = 0 '\000', total = 2437, rex = 0, wex = 0, rto = 60000, wto = 60000, analyse_exp = 0}, res = {flags = 3221504096, analysers = 67108864, buf = {size = 33792, area = 0x7fcb8fed41c0 "Ѓ", data = 33792, head = 0}, pipe = 0x0, output = 11807, to_forward = 0, last_read = 39749, xfer_large = 0 '\000', xfer_small = 0 '\000', total = 199836, rex = 0, wex = 1623295397, rto = 60000, 
    wto = 60000, analyse_exp = 0}, be = 0x7fcc54331800, sess = 0x7fcbe1b35940, srv_conn = 0x0, pend_pos = 0x0, txn = 0x7fcbadaa4da0, task = 0x7fcbb4beac60, pending_events = 0, priority_class = 0, priority_offset = 0, list = {n = 0x7fcbf7003a08, p = 0x7fcb8be58408}, by_srv = {n = 0x7fcc543b71c8, p = 0x7fcc543b71c8}, back_refs = {n = 0x7fcbea72ab28, p = 0x7fcbea72ab28}, buffer_wait = {target = 0x7fcbea72aa00, wakeup_cb = 0x56390c46be70 <stream_buf_available>, list = {next = 0x7fcbea72ab48, prev = 0x7fcbea72ab48}}, call_rate = {curr_sec = 1590761135, curr_ctr = 100000, prev_ctr = 12437}, store_count = 0, obj_type = OBJ_TYPE_STREAM, 
  store = {{ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}, {ts = 0x0, table = 0x0}}, target_addr = 0x7fcbdb72d480, stkctr = {{entry = 0, table = 0x0}, {entry = 0, table = 0x0}, {entry = 0, table = 0x0}}, strm_flt = {filters = {n = 0x7fcbea72ac20, p = 0x7fcbea72ac20}, current = {0x0, 0x0}, flags = 0, nb_req_data_filters = 0 '\000', nb_rsp_data_filters = 0 '\000', offset = {0, 0}}, req_cap = 0x7fcbaf73fa40, res_cap = 0x0, vars_txn = {head = {n = 0x7fcbea72ac68, p = 0x7fcbea72ac68}, scope = SCOPE_TXN, 
    size = 0, rwlock = 0}, vars_reqres = {head = {n = 0x7fcbea72ac88, p = 0x7fcbea72ac88}, scope = SCOPE_RES, size = 0, rwlock = 0}, si = {{state = SI_ST_EST, prev_state = SI_ST_EST, flags = 557056, end = 0x7fcc12431e80, ops = 0x56390c83a720 <si_conn_ops>, exp = 0, err_type = 0, conn_retries = 0, hcto = 0, wait_event = {tasklet = 0x7fcbf038e2c0, events = 2}, l7_buffer = {size = 0, area = 0x0, data = 0, head = 0}}, {state = SI_ST_CLO, prev_state = SI_ST_CLO, flags = 2621840, end = 0x7fcc0ecf5ac0, ops = 0x56390c83a720 <si_conn_ops>, exp = 0, err_type = 0, conn_retries = 3, hcto = 0, wait_event = {tasklet = 0x7fcbdaebd000, 
        events = 0}, l7_buffer = {size = 33792, area = 0x7fcb8fd53a80 "Ѓ", data = 2682, head = 2682}}}, logs = {logwait = 25860, level = -1, accept_date = {tv_sec = 1590761076, tv_usec = 754500}, tv_accept = {tv_sec = 1590761081, tv_usec = 185155}, t_handshake = 0, t_idle = 53597, tv_request = {tv_sec = 1590761134, tv_usec = 782624}, t_queue = 53597, t_connect = 53597, t_data = 53730, t_close = 0, srv_queue_pos = 0, prx_queue_pos = 0, bytes_in = 0, bytes_out = 0}, do_log = 0x56390c503780 <strm_log>, srv_error = 0x56390c4a0dd0 <http_return_srv_error>, pcli_next_pid = 0, pcli_flags = 0, unique_id = {ptr = 0x0, len = 0}, 
  current_rule_list = 0x7fcc52c36870, current_rule = 0x0, hlua = 0x0, dns_ctx = {dns_requester = 0x0, hostname_dn = 0x0, hostname_dn_len = 0, parent = 0x0}}

@wtarreau
Member

l7_buffer = {size = 33792, ...

I hadn't noticed the L7 retries, but if we have this buffer and a close on the other side, I suspect we may be in the middle of attempting a retry; otherwise the buffer would have been released. If you can try again with the "retry-on" line commented out, to confirm or refute that hypothesis, it would be nice. If it happens to be the trigger, we'll ping Olivier.

@wdauchy
Contributor Author

wdauchy commented May 29, 2020

I started a new test without the retry-on config line. It seems to trigger the same loop; to be confirmed when I get a core.

@wdauchy
Contributor Author

wdauchy commented May 29, 2020

(confirmed, the new core is the same)

@wdauchy
Contributor Author

wdauchy commented May 30, 2020

I tried to do a bisection, but don't put too much value on it, as I'm not 100% sure of the results:

commit 065a025 is the first bad commit
("MEDIUM: connection: don't stop receiving events in the FD handler")

# bad: [5a753bd7b78b0f5a4945f3929c721171b954f1fb] [RELEASE] Released version 2.2-dev4
# good: [32bf97fb6048e0fb7afe8c336e6a1594fbde9430] [RELEASE] Released version 2.2-dev3
git bisect start 'v2.2-dev4' 'v2.2-dev3'
# bad: [cfca1422c7de05e34b72d2d57bc057b215aa9f8f] MINOR: ssl: reach a ckch_store from a sni_ctx
git bisect bad cfca1422c7de05e34b72d2d57bc057b215aa9f8f
# bad: [55a6c4f34de59cc3d5d5079968e382c6fc303ef9] BUILD: tools: remove obsolete and conflicting trace() from standard.c
git bisect bad 55a6c4f34de59cc3d5d5079968e382c6fc303ef9
# good: [1ed3781e2130a0a7c7d4b35e58990a6d30d5f538] MINOR: fd: merge the read and write error bits into RW error
git bisect good 1ed3781e2130a0a7c7d4b35e58990a6d30d5f538
# bad: [493d9dc6bac2c28a48382bcc240cd787447a909b] MEDIUM: mux-h1: do not blindly wake up the tasklet at end of request anymore
git bisect bad 493d9dc6bac2c28a48382bcc240cd787447a909b
# good: [b045bb221a3d14e9b674265591e2def0e7deb919] MINOR: mux-h1: Remove useless case-insensitive comparisons
git bisect good b045bb221a3d14e9b674265591e2def0e7deb919
# good: [7e59c0a5e13df4611bb000d18465d363819373f2] MEDIUM: connection: make the subscribe() call able to wakeup if ready
git bisect good 7e59c0a5e13df4611bb000d18465d363819373f2
# bad: [065a025610fe86691ea3780330182d502d603ec1] MEDIUM: connection: don't stop receiving events in the FD handler
git bisect bad 065a025610fe86691ea3780330182d502d603ec1
# first bad commit: [065a025610fe86691ea3780330182d502d603ec1] MEDIUM: connection: don't stop receiving events in the FD handler

@wtarreau
Member

wtarreau commented Jun 3, 2020

I'm still failing to understand how that could wake the stream up, given that it's at the opposite end of the chain. Given that the input is closed (both sides have CF_SHUTR), what might happen is that we do not disable reading despite being closed and the event is then reported multiple times, but that would be the wrong reason, just a symptom. Here both sides have CF_SHUTR, so I don't understand why the stream doesn't stop and abort completely.

@wtarreau
Member

wtarreau commented Jun 3, 2020

In fact I'm still wondering what the impact of retries can be here (and not necessarily L7). Maybe we're attempting a retry just after the client has aborted and we ignore this condition. I'll need to try to model this case to understand if (and how) it can happen.

@wtarreau
Member

So after analysis it turns out that your bisect was correct. There is a fundamental flaw in the asynchronous connection wake-up sequence, which we thought we had worked around a few commits later but hadn't completely. The problem is that we don't know if there's still someone listening on a connection, to decide whether or not to disable an event. Initially someone is subscribed, then as soon as it's woken up it gets unsubscribed. It's comparable to EPOLL_ONESHOT. So the read event strikes, the subscriber (the mux) is woken up, it receives some data, fills its buffer and doesn't subscribe again. We purposely don't disable polling, hoping that these data will be quickly forwarded and we can get back to reading again.
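
As a point of reference for the EPOLL_ONESHOT comparison above, here is a minimal, self-contained sketch of plain epoll (not HAProxy code) showing the "fire once, then re-arm explicitly" behaviour: once the event has been delivered, the FD stays silent until it is re-armed, much like a subscriber that gets unsubscribed as soon as it is woken up.

#include <unistd.h>
#include <sys/epoll.h>

/* Minimal illustration of EPOLL_ONESHOT semantics: an event is reported
 * once, then the fd stays silent until it is explicitly re-armed. */
static int arm(int epfd, int fd, int op)
{
    struct epoll_event ev;

    ev.events  = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev);
}

int handle_io(int epfd)
{
    struct epoll_event evs[16];
    int n = epoll_wait(epfd, evs, 16, 1000);

    for (int i = 0; i < n; i++) {
        char buf[4096];
        ssize_t r = read(evs[i].data.fd, buf, sizeof(buf));

        /* Only re-arm (the "re-subscribe" step) once the data could be
         * consumed. If the consumer is stalled we simply don't re-arm,
         * and the kernel will not report the same event in a loop. */
        if (r > 0)
            arm(epfd, evs[i].data.fd, EPOLL_CTL_MOD);
    }
    return n;
}

The initial registration would use arm(epfd, fd, EPOLL_CTL_ADD); the point is only to show the fire-once / explicit re-arm cycle that the subscribe/unsubscribe model follows.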

The problem is that forwarding requires someone to actually consume the received data, and if the channel's buffer is full because the client stalled, this doesn't happen. In that case nobody disables polling at all: the mux hasn't resubscribed due to the buffer-full condition and will not be woken up again.

In your case it's more visible because it happens on the very last block and the server has closed. As such it contains a shutdown event that will be reported forever. This one has a special effect since it makes the mux wake up the upper-layer stream, which is then seen as spinning. But disabling polling only on POLL_HUP would be pointless, as I managed to make it loop as well by retrieving a 1 GB file and pressing Ctrl-S in curl right at the beginning of the transfer, which causes the transfer to stall with the system buffers full. In that case neither the stream nor the mux is woken up, only the conn_fd_handler(), which hopes "someone" will stop the FD, but there's nobody there anymore.

For now the only solution is to revert this change. This tells us two things that we've been suspecting for a while but that are now proven by this issue:

  • we MUST have a way to know that a connection event is still in the queue waiting to be processed. This could be as simple as setting a flag on the connection that the mux handler clears (see the sketch after this list), or it could consist of not removing the subscribers anymore. This will allow us to disable events for which there's nobody listening anymore and simply ignore them as long as the event lies in the queue.
  • we really need a low-level mux-to-mux forwarding mechanism so that forwarding happens with very low latency and we no longer need to disable FD polling, because we won't be waiting for a tasklet to be scheduled.
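
Purely as an illustration of the first idea (the names and struct layout below are hypothetical, not HAProxy's actual connection API): the poller sets a "pending event" flag when it queues a wake-up, the mux handler clears it when it runs, and the poller stops receiving when it sees a previous event was never consumed.

#include <stdbool.h>

/* Hypothetical connection object, only the fields needed for the sketch;
 * this is NOT HAProxy's struct connection. */
struct conn_sketch {
    bool ev_pending;   /* an I/O event was queued but not yet handled */
    bool rd_enabled;   /* receiving currently enabled at the poller level */
};

/* Called from the FD/poller layer when a read event fires. */
void on_read_event(struct conn_sketch *c)
{
    if (c->ev_pending) {
        /* The previous event was never consumed (buffer full, nobody
         * listening): stop receiving instead of re-reporting it forever. */
        c->rd_enabled = false;
        return;
    }
    c->ev_pending = true;
    /* ... schedule the mux tasklet here ... */
}

/* Called when the mux handler actually runs and processes the event. */
void mux_io_handler(struct conn_sketch *c)
{
    c->ev_pending = false;
    /* ... read data, forward it, resubscribe if more is expected ... */
}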

I'm going to work on fixing this now and will push patches after thorough testing.

I still think there's a minor glitch at the stream-interface level based on the flag combinations I've seen but I don't think it has a real impact. CCing @cognet who could be interested in this side effect since we got this initial idea together. Let's now see what we can do for 2.3 :-)

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

thank you very much for the full explanation, I learned a lot and still have to digest the commits you mentioned. As I'm spending some time on sockets for io_uring, it is useful information for me to understand the interaction between raw_sock/connection/listener/mux.
I'm also glad to read my bisect was somehow useful!

haproxy-mirror pushed a commit that referenced this issue Jun 17, 2020
This effectively reverts the two following commits:

  6f95f6e ("OPTIM: connection: disable receiving on disabled events when the run queue is too high")
  065a025 ("MEDIUM: connection: don't stop receiving events in the FD handler")

The problem as reported in issue #662 is that when the event signals
the readiness of input data that has to be forwarded over a congested
stream, the mux will read data and wake the stream up to forward them,
but the buffer full condition makes this impossible immediately, then
nobody in the chain will be able to disable the event after it was
first reported. And given we don't know at the connection level whether
an event was already reported or not, we can't decide anymore to
forcefully stop it if for any reason its processing gets delayed.

The problem is magnified in issue #662 by the fact that a shutdown is
reported with pending data occupying the buffer. The shutdown will
strike in loops and cause the upper layer stream to be notified until
it's handled, but with a buffer full it's not possible to call cs_recv()
hence to purge the event.

All this can only be handled optimally by implementing a lower-layer,
direct mux-to-mux forwarding that will not require any scheduling. This
way no wake-up will be needed and the event will be instantly handled
or paused for a long time.

For now let's simply revert these optimizations. Running a 1 MB transfer
test over H2 using 8 connections each having 32 streams with a limited
link of 320 Mbps shows the following profile before this fix:

   calls  syscall       (100% CPU)
  ------  -------
  259878  epoll_wait
  519759  clock_gettime
   17277  sendto
   17129  recvfrom
     672  epoll_ctl

And the following one after the fix:

   calls  syscall       (2-3% CPU)
  ------  -------
   17201  sendto
   17357  recvfrom
    2304  epoll_wait
    4609  clock_gettime
    1200  epoll_ctl

Thus the behavior is much better.

No backport is needed as these patches were only in 2.2-dev.

Many thanks to William Dauchy for reporting a lot of details around this
difficult issue.
@wtarreau
Member

I've just pushed the fix. On a specific test (client-side bandwidth limitations), I was seeing 100% CPU before the fix and 2-3% now. Quite a big difference!

@wtarreau
Member

Regarding the sock/connection/mux etc. interactions, it's still on my list to document the new architecture. It's an amazingly difficult task, especially since it changes while you look at it. But having a reference point in time would definitely help us a lot!

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

Regarding the sock/connection/mux etc. interactions, it's still on my list to document the new architecture. It's an amazingly difficult task, especially since it changes while you look at it. But having a reference point in time would definitely help us a lot!

oh yes, I would be your first reader of such documentation.

Thanks for the patch, I will test it right away!

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

@wtarreau to give you some quick feedback: it's not very conclusive on my side. CPU consumption is very high (> 10 cores instead of 5 at this hour). I'm giving it some time to see whether it will abort by itself or a looping stream will be detected.

@haproxy-mirror

haproxy-mirror commented Jun 17, 2020 via email

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

no coredump so far, but all cores are now busy:

thread_id: 10 (1..16)
date_now: 1592411771.062205
loops: 307410054 [ 13049249 19647718 20046573 20515893 21215659 21193036 19618610 17370365 13165272 17598872 23578763 18114423 22931492 20314943 16612548 22436638 ]
wake_tasks: 145390194 [ 6340149 9433794 9607467 9535973 10235675 9783964 9328344 8448367 6039860 8197755 11153758 8701350 10618719 9553427 7650763 10760829 ]
wake_signal: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_exp: 140545 [ 4229 9390 9780 9582 10837 10379 9500 7250 4134 6467 13006 7945 11224 9009 5891 11922 ]
poll_drop: 1074427 [ 39991 69727 71484 71126 78059 76105 70363 60416 38268 58125 90353 65467 83348 68004 50463 83128 ]
poll_dead: 132 [ 16 9 12 9 9 11 4 8 18 9 6 3 3 4 7 4 ]
poll_skip: 94151205 [ 5682689 5907945 5886320 5900059 6009002 5888463 5855714 5962857 5601209 5843371 5981219 5876056 5977168 5981895 5809298 5987940 ]
fd_lock: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
conn_dead: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
stream: 428074948 [ 25582751 27193559 26946743 26837482 27467230 26718753 26504284 27529946 25074052 26481089 27334483 26521662 27248349 27263416 26050463 27320686 ]
pool_fail: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
empty_rq: 36247254 [ 1341764 1961669 2053570 2625587 2034767 2837928 2165245 2007268 2014812 2485174 2568667 1902387 3130101 2486722 2443708 2187885 ]
long_rq: 131334955 [ 5733295 8522982 8671861 8608108 9238824 8823295 8423443 7665813 5449383 7422025 10057353 7857445 9588008 8634097 6924218 9714805 ]
ctxsw: 1762476906 [ 104622558 111787935 111033819 110740775 113363102 110326527 109421764 112861393 102397210 108770630 113186171 108964598 112877196 112231266 106795794 113096168 ]
tasksw: 535482183 [ 32071626 33945899 33671482 33578224 34345549 33465313 33191784 34321504 31474868 33097933 34214675 33181184 34090158 34042994 32621684 34167306 ]
cpust_ms_tot: 1436 [ 119 108 84 83 103 84 96 81 89 84 86 80 91 81 82 85 ]
cpust_ms_1s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_15s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
avg_loop_us: 1496 [ 2264 1994 543 2031 2669 285 2042 352 2368 2142 241 2135 445 2017 2247 161 ]
accepted: 6821301 [ 25578 197283 226263 469178 204082 623182 254415 224700 555532 804810 519863 172653 1195240 374878 705615 268029 ]
accq_pushed: 6821301 [ 426078 426248 425557 426599 425959 427015 426581 425670 425022 425348 427237 426883 428419 427056 425207 426422 ]
accq_full: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
accq_ring: 22 [ 2 1 0 1 2 0 3 1 3 2 0 4 1 0 2 0 ]
Name: HAProxy
Version: 2.2-dev9
Release_date: 2020/06/11
Nbthread: 16
Nbproc: 1
Process_num: 1
Pid: 18770
Uptime: 0d 0h41m59s
Uptime_sec: 2519
Memmax_MB: 0
PoolAlloc_MB: 633
PoolUsed_MB: 610
PoolFailed: 0
Ulimit-n: 1054075
Maxsock: 1054075
Maxconn: 524288
Hard_maxconn: 524288
CurrConns: 100547
CumConns: 7648586
CumReq: 207046464
MaxSslConns: 0
CurrSslConns: 82768
CumSslConns: 6647167
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 2792
ConnRateLimit: 0
MaxConnRate: 2876
SessRate: 2804
SessRateLimit: 0
MaxSessRate: 2869
SslRate: 2728
SslRateLimit: 0
MaxSslRate: 2774
SslFrontendKeyRate: 2626
SslFrontendMaxKeyRate: 2704
SslFrontendSessionReuse_pct: 4
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 2489816
SslCacheMisses: 2488877
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 121016
Run_queue: 3949
Idle_pct: 2
node: lb-consul33-par
Stopping: 0
Jobs: 102052
Unstoppable Jobs: 0
Listeners: 1510
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 0
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 563310043510
BytesOutRate: 254820704
DebugCommandsIssued: 0

@wtarreau
Member

I agree that for ~2 Gbps even in SSL, 100% on 16 threads is pretty high. I find that the run queue is quite large, at roughly 4k entries.

I'm extremely surprised by the poll_skip counter: it accounts for 1/4 of the calls to process_stream, and as documented in the code it should only happen when the FD was migrated, so that leaves me with doubts, as I'm unsure what could really be causing this.

When you say that it's using twice as many cores, is it compared to 2.1 or compared to the latest properly working 2.2?

@wtarreau
Member

Or, said differently, have you seen 2.2 report a reasonable idle_pct value in "show info"?

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

yes, sorry for the lack of precision. I'm always comparing a node with v2.1 and another with the latest v2.2 to make sure I have an idea of the expected average usage at the time of the test (basically because, depending on the hour, the usage can drastically change). That being said, I can probably put a v2.2dev3 back and compare again.

@wtarreau
Member

OK that's quite useful info! So at least we can say that 2.2 still has a problem that 2.1 doesn't have, and that prevents it from going idle.

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

yes. I also don't see how I could try a bisect, because I guess we are now talking about a different issue (should I open a new entry?), i.e. the looping stream issue seems resolved, as I haven't seen the message yet.
maybe I could manually apply the revert each time I try a new bisection, but I fear mixing commits.

@wtarreau
Member

I'm suspecting we have two bugs in fd_takeover(). In the non-double-CAS case, if we fail to grab the FD because we're competing on it with other threads, we prepare ret=-1 but we still perform fd_stop_recv() on this FD that we don't own. In the double-CAS case, we do not disable receiving. I'd at least do something like calling fd_stop_recv() only when ret == 0 in both cases. @cognet any idea on this?
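
For illustration only, a simplified sketch of the intended control flow (hypothetical helper and field names, not the actual fd_takeover() code): the receive side is only stopped once the compare-and-swap has actually succeeded, i.e. once we own the FD.

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-FD bookkeeping, not the real fdtab entry. */
struct fd_entry {
    _Atomic(void *) owner;   /* which connection currently owns this FD */
    bool recv_enabled;
};

/* Stand-in for fd_stop_recv(): disable read polling on the FD. */
static void stop_recv(struct fd_entry *fd)
{
    fd->recv_enabled = false;
}

/* Try to take over an FD owned by `expected`. Returns 0 on success, -1 if
 * another thread won the race. The point of the fix discussed above: only
 * touch the FD's polling state once we actually own it. */
int takeover_sketch(struct fd_entry *fd, void *expected, void *new_owner)
{
    void *cur = expected;

    if (!atomic_compare_exchange_strong(&fd->owner, &cur, new_owner))
        return -1;       /* lost the race: do NOT call stop_recv() here */

    stop_recv(fd);       /* we own it now, safe to disable receiving */
    return 0;
}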

@wtarreau
Member

Regarding the bisect, it would be a real pain due to too many other interleaved bugs that could cause similar patterns. I'd rather add more info in the code to improve debugging :-/

@TimWolla
Member

maybe I could manually apply the revert each time I try a new bisection, but I fear mixing commits.

Just to chime in here (because I can't say anything about the issue itself 😃): You technically could apply the revert manually using the patch command, resulting in a dirty tree, but not changing anything about commits. Possibly performing the bisection with git stash / git stash pop in between the steps.

@wtarreau
Member

Sure but the two problems are 1) that the code around this area has changed quite a bit over the versions, making the patch not apply at all most of the time, or requiring different sets of patches to be reverted, and 2) that other issues can definitely cause the same symptoms, adding quite some noise in the observation.

@cognet
Contributor

cognet commented Jun 17, 2020

I'm suspecting we have two bugs in fd_takeover(). In the non-double-CAS case, if we fail to grab the FD because we're competing on it with other threads, we prepare ret=-1 but we still perform fd_stop_recv() on this FD that we don't own. In the double-CAS case, we do not disable receiving. I'd at least do something like calling fd_stop_recv() only when ret == 0 in both cases. @cognet any idea on this?

I think you're correct on both counts, I'm going to fix that.

@wtarreau
Member

Cool, that might warrant a new test once it's addressed, because the high number of poll_skip in the output could indicate that we've left unstoppable FDs behind during HTTP reuse. I've added a "poll_io" counter in the activity report to see how many of the wakeups are caused by poll() activity. If it gets very high it could indicate a stuck FD.

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

@cognet I don't know whether you were finished with the fd fixes, but I took the liberty of applying the two fixes you just sent. I've also applied the new poll_io counter.
the issue is still present though.

thread_id: 4 (1..16)
date_now: 1592421751.972891
ctxsw: 189842997 [ 11925076 11810149 11709261 11888720 12085118 11639617 11447141 12275783 11876325 11835357 12071512 11739921 11883305 12021507 11742852 11891353 ]
tasksw: 53996890 [ 3412998 3350448 3332863 3384485 3442673 3318699 3259267 3484180 3392722 3358439 3425896 3340052 3371637 3411058 3338965 3372508 ]
empty_rq: 10546733 [ 846379 460635 800889 492364 599022 468041 772409 626454 473013 886276 549223 762806 952372 491223 903403 462224 ]
long_rq: 20954609 [ 1249952 1310075 1281458 1320123 1348188 1281514 1258381 1352803 1259166 1303302 1370586 1285270 1355753 1347476 1277735 1352827 ]
loops: 66619946 [ 4074700 4013317 4287936 4051518 4208302 3944445 4238774 4171824 3707208 4341630 4254758 4226841 4595230 4082323 4307330 4113810 ]
wake_tasks: 24536772 [ 1453040 1540660 1506771 1548184 1574196 1505947 1482708 1577025 1470217 1524646 1604562 1506761 1585421 1576690 1495252 1584692 ]
wake_signal: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_io: 56537915 [ 3529330 3353987 3654153 3392064 3555291 3300388 3601122 3548950 3144174 3740786 3571657 3604310 3952824 3431149 3716696 3441034 ]
poll_exp: 18287 [ 912 1151 1179 1169 1166 1214 1208 1167 991 1115 1240 1150 1262 1161 999 1203 ]
poll_drop_fd: 186366 [ 11367 11630 11473 11562 11556 11616 11547 11532 11480 11661 11927 11844 11861 11809 11728 11773 ]
poll_dead_fd: 27 [ 0 2 0 2 0 1 1 2 3 2 3 2 2 1 6 0 ]
poll_skip_fd: 10809806 [ 678056 671939 669266 677469 685053 665926 660316 693349 679419 673916 684165 669607 673107 682399 673359 672460 ]
conn_dead: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
stream_calls: 41058906 [ 2600644 2547211 2529635 2575706 2623223 2521847 2465603 2657556 2585019 2550734 2606875 2537383 2562497 2595592 2531921 2567460 ]
pool_fail: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_tot: 1825 [ 117 132 117 116 109 109 121 122 115 118 113 103 103 106 106 118 ]
cpust_ms_1s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_15s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
avg_loop_us: 70 [ 79 130 49 51 52 54 58 55 106 76 63 58 67 92 78 58 ]
accepted: 1358062 [ 345902 2885 71367 5369 18548 3392 56228 22006 6025 158036 11913 60210 200593 5382 387554 2652 ]
accq_pushed: 1358062 [ 84941 84900 84702 84428 85094 84884 84911 84687 84491 84947 84903 85147 84782 85093 85368 84784 ]
accq_full: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
accq_ring: 1 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
Name: HAProxy
Version: 2.2-dev9
Release_date: 2020/06/11
Nbthread: 16
Nbproc: 1
Process_num: 1
Pid: 3681
Uptime: 0d 0h07m29s
Uptime_sec: 449
Memmax_MB: 0
PoolAlloc_MB: 635
PoolUsed_MB: 624
PoolFailed: 0
Ulimit-n: 1054077
Maxsock: 1054077
Maxconn: 524288
Hard_maxconn: 524288
CurrConns: 109327
CumConns: 1558102
CumReq: 20344004
MaxSslConns: 0
CurrSslConns: 99694
CumSslConns: 1381513
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 3348
ConnRateLimit: 0
MaxConnRate: 3418
SessRate: 3347
SessRateLimit: 0
MaxSessRate: 3419
SslRate: 3251
SslRateLimit: 0
MaxSslRate: 3304
SslFrontendKeyRate: 3105
SslFrontendMaxKeyRate: 3232
SslFrontendSessionReuse_pct: 4
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 0
SslCacheLookups: 535731
SslCacheMisses: 535502
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 125036
Run_queue: 39
Idle_pct: 28
node: lb-consul33-par
Stopping: 0
Jobs: 110842
Unstoppable Jobs: 0
Listeners: 1510
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 0
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 74261543252
BytesOutRate: 228546592
DebugCommandsIssued: 0

@wtarreau
Member

Nice, so that shows that 85% of the wakeups come from I/O activity. I cannot achieve such high ratios; even with empty objects I'm at 76%. This could indicate a stuck FD. My poll_skip_fd count can reach 50% of poll_io though, due to aggressive reuse on the server side. If I put "http-reuse never" the poll_skip_fd stays at zero. So the ~20% you're seeing might not be that much of an issue after all.

If you're in a situation where it's possible to try with http-reuse never, that could be extremely informative on possible causes.

@wdauchy
Contributor Author

wdauchy commented Jun 17, 2020

we are indeed using http-reuse always; I tried with http-reuse never but the issue is the same.

thread_id: 2 (1..16)
date_now: 1592423253.505343
ctxsw: 155879999 [ 9845909 9578257 9659771 9721725 9852534 10109043 9804333 9945204 9739280 9754827 9764633 9457144 9656991 9875887 9539995 9574466 ]
tasksw: 43187400 [ 2730150 2651292 2674817 2692305 2724744 2798857 2711898 2758046 2718824 2705701 2703819 2618870 2668041 2725893 2642557 2661586 ]
empty_rq: 5693280 [ 302553 551607 262764 431547 518093 533671 295912 505557 259884 387799 229196 260759 223315 236273 233459 460891 ]
long_rq: 13125273 [ 816384 810251 817731 830578 847511 870958 845559 849924 775101 822595 816486 802338 820055 850508 793073 756221 ]
loops: 46955045 [ 2795300 3201366 2889934 3093809 3227886 3250179 2956676 3200548 2577357 2962956 2805650 2812090 2783952 2861281 2748111 2787950 ]
wake_tasks: 23822317 [ 1458577 1499969 1500596 1518455 1544942 1569780 1530416 1548838 1385203 1488405 1486541 1465173 1482724 1528241 1454900 1359557 ]
wake_signal: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_io: 35980364 [ 2142992 2495964 2168956 2381179 2508688 2546278 2241369 2490824 1983842 2284153 2106127 2116239 2091996 2161102 2064085 2196570 ]
poll_exp: 7936 [ 426 526 531 553 533 547 542 517 356 551 500 517 434 502 467 434 ]
poll_drop_fd: 117548 [ 7398 7570 7518 7619 7737 7558 7517 7318 6721 7463 7147 7478 7353 7241 7191 6719 ]
poll_dead_fd: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_skip_fd: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
conn_dead: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
stream_calls: 34728687 [ 2200747 2124969 2153966 2167565 2191661 2250859 2184744 2217146 2185064 2176097 2173532 2103524 2146602 2194253 2119224 2138734 ]
pool_fail: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_tot: 2904 [ 220 166 177 168 158 183 179 179 178 166 185 183 191 175 197 199 ]
cpust_ms_1s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_15s: 3 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 ]
avg_loop_us: 99 [ 72 73 63 69 68 80 77 95 114 124 77 81 127 190 101 173 ]
accepted: 1003037 [ 16858 255434 8298 42202 94708 165884 13871 87155 11844 34067 3557 8902 3153 4624 4320 248160 ]
accq_pushed: 1003037 [ 62830 62927 62793 62635 63089 62832 62378 62052 62847 62734 62694 62554 62584 62730 62654 62704 ]
accq_full: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
accq_ring: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]

@wtarreau
Member

OK, and now we clearly see that poll_skip is zero, so this cannot be caused by spurious wakeups from migrated FDs. The ratio of process_stream calls to cumreq is clean (2x), so for now I'm ruling out the possibility of a stream-interface issue causing streams to loop.

The ratio of polled I/O to loops still seems high, but it's well balanced across all threads, so either it's normal or the problem is generalized. The low value for avg_loop_us is good for latency but could also indicate that the majority of wakeups do nothing (such as a call to conn_fd_handler() and an immediate return).

Among the things I'm thinking about that could possibly help, in order of complexity:

  • running "perf top" to see if it spots hot locations which could direct us at an anomaly
  • running strace on the process to see if epoll_wait() reports the same FD in loops, and if so, in which state
  • rebuild with ARCH_FLAGS="-pg" to enable profiling. It will require stopping the process cleanly (kill -USR1), and that it's not chrooted, so that gmon.out can be emitted. This will help figure out the number of calls to each function and produce a call graph.

@rnsanchez
Contributor

* rebuild with ARCH_FLAGS="-pg" to enable profiling. It will require stopping the process cleanly (kill -USR1), and that it's not chrooted, so that gmon.out can be emitted. This will help figure out the number of calls to each function and produce a call graph.

Why not simply perf record with call-graph, for a few seconds? No need to rebuild/restart, then.

# perf record -g --call-graph dwarf -F 233 -t <thread_ID> -- sleep 10

It seems that symbols are included as well, judging from the gdb output. DWARF, in my experience recording HAProxy with perf, yields the best results.

@wtarreau
Member

It is different. perf gives you time-oriented information based on sampling; it will hardly spot short-lived functions. Here it will say that most of the time is spent in the poll() loop, in the scheduler, in process_stream(), etc. I'm really interested in seeing whether we find ourselves calling conn_fd_handler() too many times or not, and the number of times it's called without causing other tasklets to wake up. perf doesn't have the notion of call counts, only sample counts. That doesn't make it bad at all, quite the opposite: it's a fantastic tool for spotting time killers. But here we're trying to spot a wrong "if" condition.

@rnsanchez
Contributor

True, indeed. I was under the impression that the culprit might be hot code, where perf call-graphs could be of use (even with the inherent imprecision of sampling-based profiling).

Based on your latest description, it sounds like a good place for eBPF + uprobes. That, though, would rank at the far end of the complexity list.

Thank you for your patience and sorry for the noise. :-)

@wtarreau
Member

You don't have to be sorry, your proposal was totally valid; it's just that the context is not suitable. But it could have been a time saver, so don't be shy next time :-) By the way, just like I don't use perf for counting, I don't use gprof for timing; it's totally inaccurate at that as well! In short: counting => gprof, timing => perf.

@wtarreau
Member

@wdauchy I'd also be interested in knowing if reverting 5d7dcc2 ("OPTIM: epoll: always poll for recv if neither active nor ready") changes anything (it still reverts fine).

@wtarreau
Member

One interesting point to note regarding performance is the cost of activating L7 retries. On my machines, with just 4 threads over 2 cores and your bufsize, the performance drops from 191k to 125k req/s. This is caused by the requirement to memcpy() the request buffer. I suspect it could be made smarter by using HTX functions which perform a sparse copy, but at least I wanted to mention this as a warning.

@wtarreau
Member

wtarreau commented Jun 18, 2020

@wdauchy I have a present for you :-) I'm attaching a patch that adds experimental support for edge-triggered mode. This way you don't need to revert the patch above, and this would once and for all rule out any polling-related issues. It only does this for connections, not other FDs, since we know that not everything is compatible. But the connection layers should be OK now in 2.2. This will require the explicit addition of tune.fd.edge-triggered on in the global section. For me it eliminates all epoll_ctl() calls and increases the I/O performance by up to 14%. Note that it's still experimental and may exhibit some bugs by causing some connections to freeze. I couldn't find any such occurrence but I prefer to warn you.
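
For reference, enabling it is a one-line addition to the existing global section; a minimal excerpt of what that could look like with the configuration posted above (only the relevant lines shown):

global
    # ... existing settings from the configuration above ...
    nbthread 16
    # experimental, requires the attached 2.2-dev patch; only affects connections
    tune.fd.edge-triggered on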

0001-MEDIUM-fd-add-experimental-support-for-edge-triggere.patch.txt

@wdauchy
Contributor Author

wdauchy commented Jun 18, 2020

thanks @wtarreau
for now I get "unknown keyword 'tune.fd.dge-triggered' in 'global' section"; I will have a look at how to fix that if you do not have the fix right now.

@wtarreau
Member

Ah I mistyped it, sorry! It's "tune.fd.edge-triggered on" (a missing "e").

@wdauchy
Contributor Author

wdauchy commented Jun 18, 2020

wow - I'm tired, I didn't even re-read what I typed

@wdauchy
Contributor Author

wdauchy commented Jun 18, 2020

@wtarreau thanks for your patch. For now my results are as follows:

  • reverting 5d7dcc2
  • applying MEDIUM-fd-add-experimental-support-for-edge-triggere

neither of those two tests is conclusive: the process progressively uses twice as many cores compared to a node with v2.1, and after some more time almost all 16 threads. I will go through the perf experimentation to find out more information.

thread_id: 6 (1..16)
date_now: 1592495423.929595
ctxsw: 1477206655 [ 91308814 93309804 92072137 91973279 92948094 93280018 94427221 91754384 89066672 91551270 92814078 92532031 92987843 94088614 92340608 90751788 ]
tasksw: 439721703 [ 27252783 27746133 27530320 27337525 27551457 27766360 28083625 27320232 26682556 27271991 27514490 27591098 27620933 27949193 27586342 26916665 ]
empty_rq: 38880439 [ 1547327 3243932 2110435 3044578 3312441 2684995 3100671 1840746 1479545 2640367 2145871 2805698 2045114 2464803 2010661 2403255 ]
long_rq: 151053606 [ 8204777 9755068 9275897 9540384 9930551 9706611 9640964 9110647 8015128 9170288 10328137 9223154 10080481 10033813 9022009 10015697 ]
loops: 368173274 [ 18560595 24700780 22496287 24082505 25386157 24110911 24090741 21724053 18051905 22250459 25235785 22704252 24312411 24338541 21071633 25056259 ]
wake_tasks: 168130314 [ 9086238 10852764 10334066 10635300 11060707 10814094 10717142 10158325 8905155 10193985 11521339 10242055 11231220 11162515 10009424 11205985 ]
wake_signal: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
poll_io: 321225900 [ 16236832 21615165 19657622 21072184 22139677 21066338 21119502 18850950 15907764 19551474 21699744 19959493 21012232 21163802 18489220 21683901 ]
poll_exp: 152662 [ 6755 10005 9647 9848 10895 10377 9747 9155 6497 8206 11564 8913 10778 10376 8243 11656 ]
poll_drop_fd: 1572605 [ 90361 99021 98326 99074 99928 99366 98407 97314 93826 97635 100934 98301 100922 100447 97635 101108 ]
poll_dead_fd: 134 [ 12 9 9 11 5 9 11 12 11 4 1 3 7 7 15 8 ]
poll_skip_fd: 81704324 [ 5131514 5110973 5113188 5083563 5096452 5140169 5188025 5092121 5041382 5096065 5066603 5132383 5102578 5157862 5140931 5010515 ]
conn_dead: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
stream_calls: 347866958 [ 21551528 21987984 21782664 21607331 21788557 21976719 22250841 21604966 21075331 21559615 21773914 21834416 21859494 22138074 21840519 21235005 ]
pool_fail: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_tot: 2064 [ 175 128 126 133 115 117 135 120 114 140 130 121 139 125 128 118 ]
cpust_ms_1s: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
cpust_ms_15s: 5 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 ]
avg_loop_us: 158 [ 338 72 89 96 164 247 80 76 230 210 72 220 114 91 332 97 ]
accepted: 5843225 [ 44854 1594411 103874 564790 917398 269956 563743 47284 34245 498436 52539 671871 45305 190686 117550 126283 ]
accq_pushed: 5843225 [ 364740 366168 366177 366291 365423 364638 364667 365144 365132 365358 364378 364682 365147 365176 364567 365537 ]
accq_full: 0 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
accq_ring: 3 [ 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 ]

@wtarreau
Member

Thanks. So now it's certain the issue is not related to polling anymore. I'd really love to be able to reproduce it, as I'm out of ideas now :-(

@wtarreau
Member

There is something very strange, by the way. Even with edge-triggered polling, it's still I/O events that make up 85% of the wakeups. But these cannot strike more than once. So that leaves us with a few possibilities:

  • for an unknown reason, I/O processing could be much more expensive than in 2.1, so that the same amount of I/O manages to saturate the system. That doesn't match our observations in the lab nor what you observe at the beginning since you're saying the load progressively grows;
  • another FD event (level-triggered) could be stuck and induce this. There aren't that many (DNS, listeners, inter-thread messages). For listeners it wouldn't make much sense, or you'd be rejecting traffic. For DNS I think you don't use it. Inter-thread messages happen with the multi-threaded accept queue (accq_pushed is only 2% of the wakeups), and with cross-thread activity, which is quite limited (wakeups when queues are deep).
  • we could still be facing a stuck event that rings and is not handled (a shutdown?). I'd almost be tempted to put a call counter into the fdtab to detect if this happens (a rough sketch of such a counter follows this list).
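
Purely as an illustration of that last idea (hypothetical field and function names, not the actual fdtab layout): increment a per-FD counter every time the I/O handler runs, and dump FDs whose handler is invoked far more often than it finds work.

#include <stdio.h>

#define MAX_FDS 4096   /* illustrative size only */

/* Hypothetical per-FD debug counters, not the real fdtab entry. */
struct fd_debug {
    unsigned long calls;    /* times the I/O handler ran for this FD */
    unsigned long useful;   /* times it actually found work to do */
};

static struct fd_debug fd_dbg[MAX_FDS];

/* Call at the top of the per-FD I/O handler. */
void fd_dbg_note(int fd, int had_work)
{
    fd_dbg[fd].calls++;
    if (had_work)
        fd_dbg[fd].useful++;
}

/* Dump FDs whose handler runs far more often than it finds work: a strong
 * hint of a stuck, never-disabled event. */
void fd_dbg_dump(void)
{
    for (int fd = 0; fd < MAX_FDS; fd++)
        if (fd_dbg[fd].calls > 1000 &&
            fd_dbg[fd].calls > 10 * (fd_dbg[fd].useful + 1))
            printf("fd %d: %lu calls, %lu useful\n",
                   fd, fd_dbg[fd].calls, fd_dbg[fd].useful);
}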

If, once the issue happens, you can dump a "show sess all" and a "show fd" into a file, that would be awesome; I could try to see if some streams are called many times, for what apparent reason, and whether they're attached to an FD in a strange state. These will be huge (multi-megabytes) and will contain some partially private data, so don't post them here!

@wdauchy
Contributor Author

wdauchy commented Jul 2, 2020

  • the looping stream is now fixed.
  • the CPU consumption is now fixed by the latest HEAD of the repo; we had to switch to private conversations given the number of issues we encountered along the way.

Thanks!

@wdauchy wdauchy closed this as completed Jul 2, 2020
@TimWolla TimWolla added status: fixed This issue is a now-fixed bug. and removed status: needs-triage This issue needs to be triaged. labels Jul 3, 2020