
QUIC debug message gets logged #1737

Closed
Tristan971 opened this issue Jun 8, 2022 · 25 comments
Labels
status: needs-triage (This issue needs to be triaged.), type: bug (This issue describes a bug.)

Comments

@Tristan971
Member

Tristan971 commented Jun 8, 2022

Detailed Description of the Problem

After moving to a 2.6.0 build with QUIC enabled, the following gets sprinkled in the logs occasionally:

2022-06-08T12:20:10+01:00       REMAINING QEL RX PKTS:
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c202310 type=4 pn=33
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c1f6e20 type=4 pn=34
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c0e40c0 type=4 pn=35
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c1db2c0 type=4 pn=36
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c2261c0 type=4 pn=37
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c2413d0 type=4 pn=38
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c237ea0 type=4 pn=39
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c0c62f0 type=4 pn=40
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c0bd250 type=4 pn=41
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c2311e0 type=4 pn=42
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c1e94e0 type=4 pn=43
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c0663e0 type=4 pn=44
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c1264a0 type=4 pn=45
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c1d0650 type=4 pn=46
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c0c42f0 type=4 pn=47
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c26d450 type=4 pn=48
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c278250 type=4 pn=49
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c25c090 type=4 pn=50
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c347970 type=4 pn=51
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c370e70 type=4 pn=52
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c10e100 type=4 pn=53
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c2f4fe0 type=4 pn=54
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c0efd60 type=4 pn=55
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c3252f0 type=4 pn=56
2022-06-08T12:20:10+01:00       pkt@0x7f3c6c1e8510 type=4 pn=57

It's pretty rare and not sustained, but when I mentioned it to @wtarreau he suggested it looked like a debug message that wasn't expected to be logged.

Expected Behavior

Assuming it is indeed a debug message, I expect it not to be logged.

Steps to Reproduce the Behavior

Unclear. At a minimum you need traffic going over a quic4 frontend, since after removing that (while keeping the same haproxy+quictls builds) the messages stopped.
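For context, a quic4 frontend here means something along these lines; this is an illustrative sketch only, with made-up names, paths, and ports, not our actual configuration:

frontend fe_example
    mode http
    bind :443 ssl crt /etc/haproxy/example.pem alpn h2,http/1.1
    bind quic4@:443 ssl crt /etc/haproxy/example.pem alpn h3
    # advertise HTTP/3 so that browsers retry over QUIC on later requests
    http-response set-header alt-svc "h3=\":443\"; ma=3600"
    default_backend be_example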

Do you have any idea what may have caused this?

No response

Do you have an idea how to solve the issue?

No response

What is your configuration?

global
    daemon
    user haproxy
    group haproxy
    log /dev/log local0
    log /dev/log local1 notice
    ...

...

defaults defaults-base
    log global
    option log-separate-errors
    ...

...

Output of haproxy -vv

HAProxy version 2.6.0-a1efc04+mangadex-2ad5d4c 2022-06-07T11:55+00:00 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2027.
Known bugs: http://www.haproxy.org/bugs/bugs-2.6.0.html
Running on: Linux 5.4.143-1-pve #1 SMP PVE 5.4.143-1 (Tue, 28 Sep 2021 09:10:37 +0200) x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment -DMAX_SESS_STKCTR=5
  OPTIONS = USE_PCRE2=1 USE_PCRE2_JIT=1 USE_STATIC_PCRE2=1 USE_LINUX_TPROXY=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_TFO=1 USE_DL=1 USE_SYSTEMD=1 USE_QUIC=1 USE_PROMEX=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : +EPOLL -KQUEUE +NETFILTER -PCRE -PCRE_JIT +PCRE2 +PCRE2_JIT +POLL +THREAD +BACKTRACE -STATIC_PCRE +STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H -ENGINE +GETADDRINFO +OPENSSL +LUA +ACCEPT4 -CLOSEFROM -ZLIB +SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS -51DEGREES -WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL -PROCCTL +THREAD_DUMP -EVPORTS -OT +QUIC +PROMEX -MEMORY_PROFILING

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1o+quic-mangadex-  7 Jun 2022
Running on OpenSSL version : OpenSSL 1.1.1o+quic-mangadex-  7 Jun 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.6
Built with the Prometheus exporter as a service
Built with network namespace support.
Support for malloc_trim() is enabled.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.40 2022-04-14
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 8.3.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
       quic : mode=HTTP  side=FE     mux=QUIC  flags=HTX|NO_UPG|FRAMED
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [SPOE] spoe
        [TRACE] trace

Last Outputs and Backtraces

No response

Additional Information

No particular patch applied to the source. Running on Ubuntu 20.04, typical L7->L7 workload.

Tristan971 added the status: needs-triage and type: bug labels on Jun 8, 2022
@haproxyFred
Contributor

Hello, when we get such traces, the risk is that the connection may block. Is that the case?

@Tristan971
Member Author

Tristan971 commented Jun 9, 2022

the risk is that the connection may block. Is that the case?

Ah, so... I was going to raise a third issue to keep things well sorted, but since it's perhaps related: I've noticed that some routes occasionally cause stalled requests.

It seems quite inconsistent, but frequent enough to happen a few percent of the time. In the browser, those "stalled" requests are either not visible at all in the network tab (as if they were never initiated) or they show up without any response body/headers and never finish.

So... Yes perhaps?

Is there some set of flags that would help debug this further on haproxy's side, or something else I could provide to help with testing?

@haproxyFred
Contributor

Well, it is difficult to identify why a connection or requests seem stalled from the client's point of view. I prefer to rely only on events seen from haproxy's point of view.

Such traces are printed on stderr when the RX buffer is full. In that situation the RX QUIC packets are dropped. This does not mean the connection cannot progress; if it does not progress, there is a bug on our side. To know whether the connection can progress, we must look at the packet numbers that are logged. As printed, these are the remaining RX packets that could not be dequeued while we were trying to enqueue a new RX packet. So the question is: do these packet numbers progress (increase), or are they endlessly printed with the same values (i.e. the situation cannot be unblocked)?
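To make that mechanism concrete, here is a tiny stand-alone C model (not haproxy source code; the structure names and the queue capacity are made up): enqueueing into a full fixed-capacity RX queue fails, the packet is dropped, and the packets still pending are dumped in the same style as the log above.

#include <stdio.h>
#include <stdlib.h>

struct rx_pkt {
    int type;                /* packet type (type=4 in the log above) */
    unsigned long pn;        /* packet number */
};

#define RXQ_CAP 4            /* tiny capacity so the overflow path is easy to hit */

struct rx_queue {
    struct rx_pkt *pkts[RXQ_CAP];
    size_t count;
};

/* Try to enqueue a packet; on overflow, dump what is still waiting to be dequeued. */
static int rxq_enqueue(struct rx_queue *q, struct rx_pkt *pkt)
{
    if (q->count == RXQ_CAP) {
        fprintf(stderr, "REMAINING QEL RX PKTS:\n");
        for (size_t i = 0; i < q->count; i++)
            fprintf(stderr, "pkt@%p type=%d pn=%lu\n",
                    (void *)q->pkts[i], q->pkts[i]->type, q->pkts[i]->pn);
        return 0;            /* caller must drop the packet */
    }
    q->pkts[q->count++] = pkt;
    return 1;
}

int main(void)
{
    struct rx_queue q = { .count = 0 };

    /* Feed more packets than the queue can hold to trigger the dump. */
    for (unsigned long pn = 33; pn <= 40; pn++) {
        struct rx_pkt *p = malloc(sizeof(*p));
        if (!p)
            return 1;
        p->type = 4;
        p->pn = pn;
        if (!rxq_enqueue(&q, p))
            free(p);         /* dropped, just as haproxy drops RX packets when its buffer is full */
    }
    return 0;
}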

@Tristan971
Member Author

Tristan971 commented Jun 9, 2022

The packet numbers don't seem to repeat to me, but the messages are also somewhat rare in the first place.

However, it definitely ends up causing multi-second stalls on the user side until the browser falls back to h2 (well, Chrome does; Edge seems to never give up...), but that could be a different issue altogether.

@haproxyFred
Contributor

OK. So I would say this is more of a performance issue than a bug. RX performance is something we need to improve.

@Tristan971
Member Author

Tristan971 commented Jun 9, 2022

If it's of any use, this is the CPU usage (%) on the two small LBs I used for testing; the spikes aren't HAProxy but Varnish bans.

@haproxyFred
Contributor

Due to the latest architectural modifications, I suspect a bottleneck appeared at the connection level when the RX buffer attached to a connection is full. The lowest RX layer, which recvfrom()s the datagrams, is multithreaded, but only one thread may handle the QUIC packets attached to a given connection. That is where the performance issue lies, I think. It does not mean there is not enough CPU available.

Do you have information about the ratio of POST to GET requests in this traffic? A QUIC haproxy listener should not receive very many QUIC packets, except when handling POST requests.

@Tristan971
Member Author

Tristan971 commented Jun 9, 2022

Do you have information about the ratio of POST to GET requests in this traffic?

Exclusively GETs, or nearly so, as I ended up not advertising H3 for our API at all since it was much more affected by the stalling (which would make sense given your remark about POSTs).

Example test run I did for #1738 (method spread):

Though yesterday when I originally tested it I didn't manually exclude the API, and it didn't change the spread that much (still overwhelmingly GETs)

All services:

(sorry for the color mismatch between the two screenshots, Grafana isn't being very helpful on that...)

@Tristan971
Member Author

Maybe this view is mildly more readable:

[screenshot: Grafana Loki explore view]

@wtarreau
Member

wtarreau commented Jun 9, 2022

Just a naive question, Fred: do you think the debug message itself can cause a pause that induces a subsequent burst and fills the next buffer? I mean, printfs are quite serialized and can slow everything down, so if a full buffer is only seen once in a while, it is possible that the rest of the condition is caused by printf itself slowing the whole thing down.

@haproxyFred
Contributor

haproxyFred commented Jun 9, 2022

Well, of course the calls to printf() do not speed things up. We should replace them with something else that would not hide this issue. I also think it would be interesting to know how many packets could not be handled fast enough to trigger the first dump of packets. Indeed, that first dump could not have been affected by any printf slowdown, right? I mean that if there is a first list of packets to print, it is because the RX buffer is already full, and it is the same thread that enqueues the packet and prints its list of not-yet-dequeued packets.

@Tristan971
Member Author

[screenshot: Grafana Loki explore view]

Here I am trying to correlate the queued RX packets messages with QUIC requests per method.

@haproxyFred
Contributor

haproxyFred commented Jun 9, 2022

@Tristan971 I do not know if you are familiar with C syntax and file patching. I am able to get similar traces, but only when I use POST requests. The traces disappear (with 100000 x 1024-byte POST requests) when I quadruple the connection RX buffer size (up to 64KB) with the attached patch file.
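The attached patch itself is not reproduced here; as a rough illustration only, the change amounts to something like the following. The macro name and the 16KB default are assumptions, not the actual contents of conn_rx_buf.txt:

/* Hypothetical sketch only: grow the per-connection RX buffer fourfold. */
#define QUIC_CONN_RX_BUFSZ_DEFAULT  (1UL << 14)   /* 16KB (assumed default) */
#define QUIC_CONN_RX_BUFSZ          (1UL << 16)   /* 64KB, i.e. quadrupled */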

@Tristan971
Member Author

I can at least read it and apply patches yeah 😅

I don't see an attached file though?

@haproxyFred
Contributor

haproxyFred commented Jun 9, 2022

here is the patch file: conn_rx_buf.txt

@Tristan971
Member Author

I will have a try in a short bit and let you know 👍

@Tristan971
Member Author

Tristan971 commented Jun 9, 2022

Alright, so it does indeed make the RX messages go away (build here), or at least much rarer (tried with the API on QUIC, so we'd get enough POSTs to be statistically significant without waiting for hours).

[screenshot: Grafana Loki explore view]

I have no idea whether that is an appropriate fix for closing this issue or if it was just an experiment, so I'll let you be the judge of that :-)

--

On a slightly different note, it does not, however, solve the never-ending/stalled requests (unsure whether you expected it to). Shall I open a separate issue for those? (I was thinking that waiting on #1737 and #1738 was a better approach, in case they were root/contributing causes of such a generic problem as "never-ending requests".)

For reference, that's what the browser shows (i.e. it's pretty useless info).

If you want, I have kept the quic4 binding up, so I can share the requests that failed (though you mentioned that client-side data is mostly useless, and it seems half-random).

[screenshot: browser view of the stalled requests]

@haproxyFred
Contributor

I have no idea whether that is an appropriate fix for closing this issue or if it was just an experiment, so I'll let you be the judge of that :-)

No, this is not an appropriate fix, but it is useful for separating the root causes of the problems. Now we know there are still stalled requests even without filling the RX buffers. That is not news, though: we already know that our QUIC implementation does not work well with big HTTP applications.

I think you should open a new issue for these stalled requests with more information about the HTTP statuses. If not already done, you should also enable the "stats show-modules" option to get access to the QUIC transport and H3 statistical counters from the stats socket.
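A minimal sketch of that, assuming a stats socket is already configured; the socket path and the stats listener below are illustrative, not taken from the configuration in this issue:

global
    stats socket /var/run/haproxy.sock mode 660 level admin

listen stats
    bind 127.0.0.1:8404
    mode http
    stats enable
    stats uri /stats
    stats show-modules

The extra QUIC transport and H3 counters can then be retrieved from the stats socket as Fred describes, or viewed on the stats page.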

@Tristan971
Member Author

No, this is not an appropriate fix, but it is useful for separating the root causes of the problems. Now we know there are still stalled requests even without filling the RX buffers. That is not news, though: we already know that our QUIC implementation does not work well with big HTTP applications.

I see, thanks and good luck with that then 😅

If not already done, you should also enable the "stats show-modules" option to get access to the QUIC transport and H3 statistical counters from the stats socket.

I was in fact not aware of this option at all. I'll do that, gather some statistics, and open a separate issue then.

@haproxyFred
Contributor

Just to let you know, we are working on this issue, but it is not an easy one to fix.

@Tristan971
Member Author

No worries, nothing pressing :-)

haproxy-mirror pushed a commit that referenced this issue Jun 30, 2022
First we add a loop around recvfrom() in the lowest-level I/O handler
quic_sock_fd_iocb() to collect as many datagrams as possible during its
tasklet wakeup, with a limit: we recvfrom() at most "maxpollevents" datagrams.
Furthermore we add a local task list in the datagram handler
quic_lstnr_dghdlr() which is passed to the first datagram parser, qc_lstnr_pkt_rcv().
The latter only identifies the connection associated with the datagrams, then
wakes up the higher-level packet parser I/O handlers (quic_conn.*io_cb()) once
it is done, thanks to tasklet_wakeup_after() which from now on replaces
tasklet_wakeup(). This should drastically reduce the latency and the chances
of filling the RX buffers at the QUIC connection level, as reported in
GH #1737 by Tristan.

These modifications depend on this commit:
    "MINOR: task: Add tasklet_wakeup_after()"

Must be backported to 2.6 with the previous commit.
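A stand-alone sketch of the bounded receive-loop idea described in the commit above; the function name, the budget constant, and the buffer size are placeholders, not haproxy's actual quic_sock_fd_iocb():

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_DGRAMS_PER_WAKEUP 200   /* stands in for the "maxpollevents" budget */

/* Drain a non-blocking UDP socket in a bounded loop instead of reading a
 * single datagram per wakeup. Returns how many datagrams were read before
 * the socket dried up or the budget was spent, or -1 on error. */
static int drain_quic_socket(int fd)
{
    char buf[2048];                 /* enough for a ~1500-byte QUIC datagram */
    int count = 0;

    while (count < MAX_DGRAMS_PER_WAKEUP) {
        ssize_t ret = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT, NULL, NULL);
        if (ret < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;              /* nothing left to read for now */
            return -1;
        }
        /* here haproxy would hand the datagram over to the per-connection
         * handler, which is woken up once the batch has been collected */
        count++;
    }
    return count;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    printf("drained %d datagram(s)\n", drain_quic_socket(fd));
    close(fd);
    return 0;
}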
haproxy-mirror pushed a commit that referenced this issue Jun 30, 2022
With ~1500-byte QUIC datagrams, we can handle fewer than 200 datagrams,
which is less than the default maxpollevents value. This should reduce
the chances of filling the connections' RX buffers, as reported by
Tristan in GH #1737.

Must be backported to 2.6.
@haproxyFred
Contributor

@Tristan971 FYI: with the current development version of haproxy, no messages are printed to stderr anymore in case of RX buffer overrun. Instead, we added a new stats counter for the total number of packets dropped because of RX buffer overrun.

@Tristan971
Member Author

I was about to try it out 🙂 mangadex-pub/haproxy@fbec8df

FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Jul 21, 2022
First we add a loop around recvfrom() in the lowest-level I/O handler
quic_sock_fd_iocb() to collect as many datagrams as possible during its
tasklet wakeup, with a limit: we recvfrom() at most "maxpollevents" datagrams.
Furthermore we add a local task list in the datagram handler
quic_lstnr_dghdlr() which is passed to the first datagram parser, qc_lstnr_pkt_rcv().
The latter only identifies the connection associated with the datagrams, then
wakes up the higher-level packet parser I/O handlers (quic_conn.*io_cb()) once
it is done, thanks to tasklet_wakeup_after() which from now on replaces
tasklet_wakeup(). This should drastically reduce the latency and the chances
of filling the RX buffers at the QUIC connection level, as reported in
GH haproxy#1737 by Tristan.

These modifications depend on this commit:
    "MINOR: task: Add tasklet_wakeup_after()"

Must be backported to 2.6 with the previous commit.

(cherry picked from commit 1b0707f)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
FireBurn pushed a commit to FireBurn/haproxy that referenced this issue Jul 21, 2022
With ~1500-byte QUIC datagrams, we can handle fewer than 200 datagrams,
which is less than the default maxpollevents value. This should reduce
the chances of filling the connections' RX buffers, as reported by
Tristan in GH haproxy#1737.

Must be backported to 2.6.

(cherry picked from commit 649b3fd)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
@a-denoyelle
Contributor

Closed with backport done.
