
HAProxy 2.6.14 - former worker SIGABRT after reload #2222

Closed
chipflake opened this issue Jul 19, 2023 · 8 comments
Labels
status: fixed This issue is a now-fixed bug. type: bug This issue describes a bug.

Comments

@chipflake

Detailed Description of the Problem

Originally asked in the forum.

We run HAProxy in master-worker mode. On reloads, sometimes the former worker crashes with SIGABRT instead of exiting cleanly.

This started after we updated from 2.4.22 to 2.6.14. In the past 20 days we have seen this 3 times, each time on a different host.

Backtrace from the latest crash is here: gdb.txt (attachment because of length)
If needed, I can provide the backtrace and logs from the two crashes prior to this one.

Expected Behavior

The former worker should exit cleanly.

Steps to Reproduce the Behavior

Unfortunately we don't know how to reproduce it, other than that it happens rarely on reloads.

Do you have any idea what may have caused this?

In the backtrace, some threads with stuck=1 seem to be waiting on the listener and proxy locks. There have been some recent changes to the listener functions, such as:
fed93d3 BUG/MEDIUM: listener: read-lock the listener during accept()
9bfa34f MINOR: listener: add relax_listener() function

It might also be related to the concurrency bug fixes introduced in 2.6.10: https://www.mail-archive.com/haproxy@formilux.org/msg43304.html

I also see these bug fixes for 2.6, after the 2.6.14 release:

however I'm not sure they apply here.

Do you have an idea how to solve the issue?

No response

What is your configuration?

# The configurations are complicated and contain confidential info. 
# Here are the global and default sections if they're of any help.

global
  log /dev/log len 65535 daemon
  log-send-hostname

  stats socket "/haproxy/stats.sock" mode 0600 expose-fd listeners level admin
  maxconn 100000
  hard-stop-after 45s

  h1-accept-payload-with-any-method

  server-state-file /haproxy/server-state

  ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
  ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
  ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

  ssl-default-server-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
  ssl-default-server-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
  ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

  ssl-dh-param-file /dhparam.pem

  lua-load /haproxy/reflectconninfo.lua

resolvers default
  parse-resolv-conf
  timeout resolve 1m

defaults
  log global
  mode http

  option httplog
  option forwardfor if-none
  option dontlognull
  option redispatch
  hash-type consistent

  retries 3
  maxconn 100000

  timeout connect  5000
  timeout client  50000
  timeout server  50000
  timeout check   10000

  default-server agent-inter 5s
  balance roundrobin

  errorfile 502 /haproxy/err-502.http
  errorfile 503 /haproxy/err-502.http

Output of haproxy -vv

HAProxy version 2.6.14-5188364 2023/06/09 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2027.
Known bugs: http://www.haproxy.org/bugs/bugs-2.6.14.html
Running on: Linux 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_PCRE=1 USE_OPENSSL=1 USE_LUA=yes USE_ZLIB=1 USE_PROMEX=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE +LIBCRYPT +LINUX_SPLICE +LINUX_TPROXY +LUA -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OT +PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -QUIC +RT -SLZ -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL +ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=16).
Built with OpenSSL version : OpenSSL 1.1.1n  15 Mar 2022
Running on OpenSSL version : OpenSSL 1.1.1n  15 Mar 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.4.2
Built with the Prometheus exporter as a service
Built with network namespace support.
Support for malloc_trim() is enabled.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 10.2.1 20210110

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace

Last Outputs and Backtraces

Some lines have been omitted.
See attachments for backtrace.

[NOTICE]   (1833336) : Reloading HAProxy
[NOTICE]   (1833336) : haproxy version is 2.6.14-5188364
[NOTICE]   (1833336) : path to executable is /usr/local/sbin/haproxy
[NOTICE]   (1833336) : New worker (1894172) forked
[NOTICE]   (1833336) : Loading success.
Thread 16 is about to kill the process.
 >Thread 1 : id=0x7f7067fe3f40 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/1    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=69026828312 now=71203876590 diff=2177048278
             curr_task=0
             call trace(15):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x7f706874ef44 [89 c2 f7 da 3d 00 f0 ff]: libpthread:pthread_kill+0x34/0x5d
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd6b57 [eb 9b 0f 1f 80 00 00 00]: stop_listener+0x1a7/0x1fb
             | 0x555ac8e35aee [48 8b 83 c8 01 00 00 48]: protocol_stop_now+0x9e/0xff
             | 0x555ac8da1413 [48 8b 1d 76 7d 17 00 49]: main+0x1284a3
             | 0x555ac8dbca09 [48 89 ef e8 ef 15 08 00]: main+0x143a99
             | 0x555ac8e3dd8b [48 8b 03 48 89 df 4c 39]: __signal_process_queue+0x9b/0x13a
 >Thread 2 : id=0x7f7067fd8700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/2    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=59029507968 now=61067577169 diff=2038069201
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
 >Thread 3 : id=0x7f7065711700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/3    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=57578788046 now=59998892665 diff=2420104619
             curr_task=0
             call trace(12):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd62da [48 8b 85 30 12 00 00 48]: resume_listener+0xba/0x28d
             | 0x555ac8dd6561 [48 8b 7c 24 08 eb b2 ba]: relax_listener+0xb1/0xc2
             | 0x555ac8dd78f4 [e8 87 ec ff ff 4d 85 e4]: listener_accept+0x864/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
  Thread 4 : id=0x7f7064f10700 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/4    stuck=0 prof=0 harmless=1 wantrdv=0
             cpu_ns: poll=66042720837 now=68382860850 diff=2340140013
             curr_task=0
 >Thread 5 : id=0x7f705ffff700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/5    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=58890052321 now=60949154099 diff=2059101778
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
  Thread 6 : id=0x7f705f7fe700 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/6    stuck=0 prof=0 harmless=1 wantrdv=0
             cpu_ns: poll=57827362990 now=60181718036 diff=2354355046
             curr_task=0
  Thread 7 : id=0x7f705effd700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/7    stuck=0 prof=0 harmless=1 wantrdv=1
             cpu_ns: poll=57274843457 now=59173907824 diff=1899064367
             curr_task=0x555acb9e6310 (task) calls=1 last=0
               fct=0x555ac8da1970(manage_proxy) ctx=0x555ac9e5fda0
 >Thread 8 : id=0x7f705e7fc700 act=1 glob=1 wq=1 rq=1 tl=0 tlsz=0 rqsz=1
      1/8    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=63200541761 now=65146690127 diff=1946148366
             curr_task=0
             call trace(11):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd5c47 [eb a7 4c 8d 4b 40 48 8d]: main+0x15ccd7
             | 0x555ac8dd7cfa [45 85 ff 0f 84 88 f5 ff]: listener_accept+0xc6a/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
 >Thread 9 : id=0x7f705dffb700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/9    stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=56731722240 now=58834144668 diff=2102422428
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
  Thread 10: id=0x7f705d7fa700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/10   stuck=0 prof=0 harmless=1 wantrdv=0
             cpu_ns: poll=58474957036 now=60454353822 diff=1979396786
             curr_task=0
  Thread 11: id=0x7f705cff9700 act=0 glob=0 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/11   stuck=0 prof=0 harmless=1 wantrdv=0
             cpu_ns: poll=60082936594 now=62116149224 diff=2033212630
             curr_task=0
 >Thread 12: id=0x7f705c7f8700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/12   stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=64463209874 now=66438612175 diff=1975402301
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
 >Thread 13: id=0x7f705bff7700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/13   stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=58565095510 now=60822343604 diff=2257248094
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
 >Thread 14: id=0x7f705b7f6700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/14   stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=58438716408 now=60707730435 diff=2269014027
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
 >Thread 15: id=0x7f705aff5700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/15   stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=61365942138 now=63756730471 diff=2390788333
             curr_task=0
             call trace(10):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
*>Thread 16: id=0x7f705a7f4700 act=1 glob=1 wq=1 rq=0 tl=0 tlsz=0 rqsz=0
      1/16   stuck=1 prof=0 harmless=0 wantrdv=0
             cpu_ns: poll=57914739679 now=60227610965 diff=2312871286
             curr_task=0
             call trace(14):
             | 0x555ac8de0f5b [89 44 24 04 85 c0 75 15]: ha_dump_backtrace+0x2b/0x2d0
             | 0x555ac8de19f6 [48 8b 05 d3 ad 1b 00 48]: debug_handler+0x66/0x10b
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x7f7068751fe1 [48 8b 84 24 08 01 00 00]: libpthread:raise+0x141/0x179
             | 0x555ac8de0387 [64 48 8b 53 10 64 48 8b]: main+0x167417
             | 0x555ac8de03d8 [0f 1f 84 00 00 00 00 00]: main+0x167468
             | 0x7f7068752140 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13140
             | 0x555ac8dd7b88 [e9 69 f7 ff ff 0f 1f 00]: listener_accept+0xaf8/0xf2d
             | 0x555ac8e1db3d [48 8b 35 bc 48 0f 00 48]: fd_update_events+0x32d/0x41d
             | 0x555ac8c7d05a [49 39 dd 0f 84 9d 00 00]: main+0x40ea
[NOTICE]   (1833336) : haproxy version is 2.6.14-5188364
[NOTICE]   (1833336) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (1833336) : Former worker (1894096) exited with code 134 (Aborted)

Additional Information

No response

@chipflake chipflake added status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug. labels Jul 19, 2023
@lukastribus
Member

I also see these bug fixes for 2.6, after the 2.6.14 release:

* http://git.haproxy.org/?p=haproxy-2.6.git;a=commit;h=33ea988c553ecb001e44f8897191ee0d02daab17

* http://git.haproxy.org/?p=haproxy-2.6.git;a=commit;h=98860fad7ebca4f7d42f5d354405841486eb3164

however I'm not sure they apply here.

You can just use the tarball of the 2.6 snapshot I linked to in the forum, so you don't have to apply any patches or do anything in git (just download the tarball and recompile):

http://www.haproxy.org/download/2.6/src/snapshot/haproxy-ss-20230623.tar.gz

@capflam
Member

capflam commented Jul 20, 2023

It seems there is an ABBA locking problem here. Looking at your GDB trace, Thread 3 is waiting for the listener's lock while it already holds the proxy's lock, in the stop_listener() function. Several other threads are waiting for the same listener's lock. And it is pretty common for listener functions to take the proxy's lock, then the protocol's lock, and finally the listener's lock.

However, the pattern seems to be inverted in Thread 5. The relax_listener() function acquires the listener's lock and then calls resume_listener(). The latter follows the first pattern: it tries to acquire the proxy's lock before the listener's lock.

So to sum up, Thread 5 owns the listener's lock and is waiting for the proxy's lock, while Thread 3 owns the proxy's lock and is waiting for the listener's lock. Several other threads are blocked too, waiting for the listener's lock.

I must investigate a bit to figure out how to fix the bug. It was introduced by commit bcad7e6, but it is part of a long series, so I must review the changes first.

@capflam capflam added status: reviewed This issue was reviewed. A fix is required. and removed status: needs-triage This issue needs to be triaged. labels Jul 20, 2023
@capflam
Member

capflam commented Jul 20, 2023

Here is a patch:
0001-BUG-MEDIUM-listener-Acquire-proxy-s-lock-in-relax_li.patch.txt

@chipflake, do you have any way to test it?

@capflam capflam added status: feedback required The developers are waiting for a reply from the reporter. dev This issue affects the HAProxy development branch. 2.4 This issue affects the HAProxy 2.4 stable branch. 2.6 This issue affects the HAProxy 2.6 stable branch. 2.7 This issue affects the HAProxy 2.7 stable branch. 2.8 This issue affects the HAProxy 2.8 stable branch. and removed status: reviewed This issue was reviewed. A fix is required. labels Jul 20, 2023
@chipflake
Author

@capflam thank you. I can test the patch from next week. We will need to wait at least a couple of weeks to confirm that the issue has been solved though. (since so far we've seen 3 crashes in ~20 days)

@wtarreau
Member

In any case I think the patch is correct and fixes a real issue, so it would be fine to merge it (which could possibly make your tests even easier). If for any reason the problem was still there, or another one surfaced, we'd know that this one is addressed at least, and it would simplify the analysis.

@capflam
Member

capflam commented Jul 21, 2023

I agree, I will merge it.

haproxy-mirror pushed a commit that referenced this issue Jul 21, 2023
…essary

Listener functions must follow a common locking pattern:

  1. Get the proxy's lock if necessary
  2. Get the protocol's lock if necessary
  3. Get the listener's lock if necessary

We must take care to respect this order to avoid any ABBA issue. However, an
issue was introduced in the commit bcad7e6 ("MINOR: listener: add
relax_listener() function"). relax_listener() gets the listener's lock and if
resume_listener() is called, the proxy's lock is then acquired.

So to fix the issue, the proxy's lock is first acquired in relax_listener(),
if necessary.

This patch should fix the issue #2222. It must be backported as far as 2.4
because the above commit is marked to be backported there.
@capflam capflam added status: fixed This issue is a now-fixed bug. status: feedback required The developers are waiting for a reply from the reporter. and removed status: feedback required The developers are waiting for a reply from the reporter. dev This issue affects the HAProxy development branch. labels Jul 21, 2023
@capflam
Member

capflam commented Jul 21, 2023

I marked the issue as fixed for now, waiting for your feedback. I'll try to backport it to ease your tests. Thanks!

@capflam capflam removed 2.8 This issue affects the HAProxy 2.8 stable branch. 2.7 This issue affects the HAProxy 2.7 stable branch. labels Jul 21, 2023
FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Jul 22, 2023
…essary


(cherry picked from commit ff1c803)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 6844af6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
@capflam capflam removed 2.6 This issue affects the HAProxy 2.6 stable branch. 2.4 This issue affects the HAProxy 2.4 stable branch. labels Jul 24, 2023
@capflam
Member

capflam commented Jul 24, 2023

The fix was backported to all versions. I'm closing the issue, but of course reopen it if the problem is still there. Thanks!

@capflam capflam closed this as completed Jul 24, 2023
@capflam capflam removed the status: feedback required The developers are waiting for a reply from the reporter. label Jul 24, 2023
FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Jul 25, 2023
…essary


(cherry picked from commit ff1c803)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 6844af6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 6859790)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Jul 25, 2023
…essary


(cherry picked from commit ff1c803)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 6844af6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 6859790)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit fbc1119)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Aug 8, 2023
…essary


(cherry picked from commit ff1c803)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>