QUIC debug message gets logged #1737
Hello, when we get such traces, the risk is that the connection may block. Is that the case?
Ah, so... I was going to raise a third issue to keep things well sorted, but since it's perhaps related: I've noticed that some routes occasionally cause stalled requests. It seems quite inconsistent, but frequent enough to happen a noticeable percentage of the time. Within the browser, those "stalled" requests are either not visible at all in the network tab (as if they were never initiated) or they show up without any response body/headers and never finish. So... yes, perhaps? Is there some set of flags that would help debug this more on haproxy's side, or something else I could provide to help testing?
Well, it is difficult to identify why a connection or requests seem stalled from the client's point of view; I prefer to rely only on events from haproxy's point of view. Such traces are printed on stderr when the RX buffer is full. In that situation, the RX QUIC packets are dropped. This does not mean the connection cannot progress; if it does not progress, there is a bug on our side. To know whether the connection can progress, we must look at the packet numbers which are logged. As printed, these are the remaining RX packets which could not be dequeued while we were trying to enqueue a new RX packet. So the question is: do these packet numbers progress (increase), or are they endlessly printed with the same values (i.e., the situation cannot be unblocked)?
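To make that concrete, here is a minimal, self-contained sketch of the behaviour being described. This is not haproxy's actual code; the capacity, names, and message format are invented for illustration:

```c
/* Illustrative only: a fixed-size per-connection RX buffer that drops a
 * new packet when full and dumps the packet numbers still waiting to be
 * dequeued, so successive dumps can be compared for progress. */
#include <stdio.h>
#include <stdint.h>

#define RXBUF_SLOTS 8 /* hypothetical capacity */

struct rxbuf {
    uint64_t pn[RXBUF_SLOTS]; /* packet numbers awaiting processing */
    size_t count;
};

/* Returns 0 on success, -1 if the packet had to be dropped. */
static int rxbuf_enqueue(struct rxbuf *rx, uint64_t pn)
{
    if (rx->count == RXBUF_SLOTS) {
        fprintf(stderr, "RX buffer full, dropping pn=%llu; pending:",
                (unsigned long long)pn);
        for (size_t i = 0; i < rx->count; i++)
            fprintf(stderr, " %llu", (unsigned long long)rx->pn[i]);
        fputc('\n', stderr);
        return -1;
    }
    rx->pn[rx->count++] = pn;
    return 0;
}
```

If a dump like this shows the same packet numbers time after time, the consumer side is stuck; if the numbers keep moving, packets are merely arriving faster than they are drained.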
OK. So I would say that this is more a performance issue than a bug. This is something we need to improve: RX performance.
Due to the recent architectural modifications, I suspect a bottleneck appeared at the connection level when the RX buffer attached to a connection is full. The lowest RX part, which recvfrom()s the datagrams, is multithreaded, but only one thread may handle the QUIC packets attached to a given connection. This is where I think the performance issue lies; it does not mean there is not enough CPU available. Do you have information about the ratio between the POST and GET methods used during such traffic? A QUIC haproxy listener should not receive very many QUIC packets except when handling POST requests.
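To picture that fan-in, here is a conceptual sketch, again not haproxy code, with invented capacity and names: any receiver thread may feed a connection's queue, but only the single owning thread drains it, so the queue can fill up even while other CPUs sit idle.

```c
/* Illustrative only: many producers, one consumer per connection. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define CONN_QUEUE_SLOTS 64 /* hypothetical per-connection capacity */

struct conn_queue {
    pthread_mutex_t lock;
    void *pkts[CONN_QUEUE_SLOTS];
    size_t head, count;
};

static struct conn_queue q = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Called from any receiver thread; fails when the queue is full,
 * which is the "RX buffer full" situation discussed above. */
static bool conn_enqueue(struct conn_queue *cq, void *pkt)
{
    bool ok = false;
    pthread_mutex_lock(&cq->lock);
    if (cq->count < CONN_QUEUE_SLOTS) {
        cq->pkts[(cq->head + cq->count) % CONN_QUEUE_SLOTS] = pkt;
        cq->count++;
        ok = true;
    }
    pthread_mutex_unlock(&cq->lock);
    return ok;
}

/* Called only from the one thread that owns this connection. */
static void *conn_dequeue(struct conn_queue *cq)
{
    void *pkt = NULL;
    pthread_mutex_lock(&cq->lock);
    if (cq->count) {
        pkt = cq->pkts[cq->head];
        cq->head = (cq->head + 1) % CONN_QUEUE_SLOTS;
        cq->count--;
    }
    pthread_mutex_unlock(&cq->lock);
    return pkt;
}
```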
Exclusively GETs, or nearly so, as I have ended up not advertising H3 for our API altogether, since it was much more affected by the stalling (which would make sense given your remark about POSTs). Example test run I did for #1738. Though yesterday, when I originally tested it, I didn't manually exclude the API, and it didn't change the spread much (still overwhelmingly GETs). (Sorry for the color mismatch between the two screenshots; Grafana isn't being very helpful on that...)
Just a naive question, Fred: do you think the debug message itself can cause a pause that induces a subsequent burst and fills the next buffer? I mean, printf() calls are quite serialized and can slow everything down, so if a full buffer is only seen once in a while, it could be that the rest of the condition is caused by printf() itself slowing the whole thing down.
Well, of course the calls to printf() do not accelerate the process. We should replace them with something else that would not hide this issue. I also think it would be interesting to know how many packets could not be treated fast enough to trigger the first dump of packets. Indeed, that first dump could not have been impacted by any printf() slowdown, right? I mean that if there is a first list of packets to print, it is because the RX buffer is already full, and it is the same thread which enqueues the packet and prints the list of not-yet-dequeued packets.
@Tristan971 I do not know if you are familiar with C syntax and file patching. I am able to get similar traces, but only when I use POST requests. The traces disappear (with 100000 x 1024-byte POST requests) when I quadruple the connection RX buffer size (so up to 64KB), thanks to the attached patch file.
I can at least read it and apply patches, yeah 😅 I don't see an attached file though?
Here is the patch file: conn_rx_buf.txt
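For readers following along without the attachment, here is a hypothetical illustration of the kind of one-line change such a patch could make. The macro name matches haproxy's connection RX buffer size constant, but its location and original value here are assumptions, not taken from the actual patch:

```c
/* Hypothetical sketch only; the real change is in conn_rx_buf.txt. */
-#define QUIC_CONN_RX_BUFSZ (1UL << 14) /* 16KB: assumed default */
+#define QUIC_CONN_RX_BUFSZ (1UL << 16) /* 64KB: quadrupled for testing */
```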
I will give it a try in a short bit and let you know 👍
Alright, so it does indeed make the RX messages go away (build here), or at least makes them much rarer (tried with the API on QUIC, so we'd get enough POSTs to be statistically significant without waiting for hours). I have no idea whether that is an appropriate fix for closing this issue or if it was just an experiment, so I'll let you be the judge of that :-) -- On a slightly different note, it does not, however, solve the never-ending/stalled requests (unsure if you expected it to). Shall I open a separate issue for those? (I was thinking that waiting on #1737 and #1738 was a better way, in case they were root/contributing causes of such a generic problem as "never-ending requests".) For reference, that's what the browser shows (i.e., it's pretty useless info). If you want, I have kept the quic4 binding up, so I can share the requests that failed (though you mentioned that client-side data is mostly useless, and it seems half-random).
No, this is not an appropriate fix, but it is useful for separating the roots of the problems. Now we know there are remaining stalled requests even without filling the RX buffers. But this is not news: we already know that our QUIC implementation does not work well with big HTTP applications. I think you should open new issues for these stalled requests with more information about the HTTP statuses. If not already done, you should also enable the "stats show-modules" option to get access to the QUIC transport and h3 statistical counters from the stats socket.
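As a pointer for anyone reproducing this, a minimal configuration sketch follows. The socket path, port, and directive placement are assumptions based on the usual stats keywords, not taken from the thread:

```
global
    stats socket /var/run/haproxy.sock mode 600 level admin

listen stats
    bind :8404
    stats enable
    stats uri /stats
    stats show-modules   # expose extra per-module (e.g. QUIC, h3) counters
```

The counters can then be read over the socket, for example with `echo "show stat" | socat stdio /var/run/haproxy.sock`.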
I see, thanks and good luck with that then 😅
I was in fact not aware of this option at all. I'll do that, gather some statistics, and open a separate issue then.
Just to let you know: we are working on this issue, but it is not an easy one to fix.
No worries, nothing pressing :-)
First we add a loop around recvfrom() in the lowest-level I/O handler, quic_sock_fd_iocb(), to collect as many datagrams as possible during its tasklet wakeup, with a limit: we recvfrom() at most "maxpollevents" datagrams. Furthermore we add a local task list to the datagram handler quic_lstnr_dghdlr(), which is passed to the first datagram parser, qc_lstnr_pkt_rcv(). This latter parser only identifies the connection associated with the datagrams, then wakes up the highest-level packet parser I/O handlers (quic_conn.*io_cb()) after it is done, thanks to the call to tasklet_wakeup_after() which from now on replaces the call to tasklet_wakeup(). This should drastically reduce the latency and the chances of filling the RX buffers at the QUIC connection level, as reported in GH #1737 by Tristan. These modifications depend on this commit: "MINOR: task: Add tasklet_wakeup_after()" Must be backported to 2.6 with the previous commit.
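A rough sketch of the bounded drain loop this commit describes; it is illustrative only, not quic_sock_fd_iocb() itself, and the limit constant and handler signature are invented:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

#define MAX_DGRAMS_PER_WAKEUP 200 /* stand-in for "maxpollevents" */

/* Drain up to MAX_DGRAMS_PER_WAKEUP datagrams from a non-blocking
 * socket in one wakeup, instead of one datagram per wakeup. */
static int drain_socket(int fd, void (*handle)(const char *buf, ssize_t len))
{
    char buf[2048]; /* large enough for a ~1500-byte QUIC datagram */
    int n = 0;

    while (n < MAX_DGRAMS_PER_WAKEUP) {
        ssize_t ret = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
        if (ret < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;  /* socket fully drained */
            return -1;  /* real error */
        }
        handle(buf, ret);
        n++;
    }
    return n; /* caller re-wakes itself if the limit was reached */
}
```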
With ~1500-byte QUIC datagrams, we can handle fewer than 200 datagrams, which is less than the default maxpollevents value (for instance, a 64KB buffer holds only about 64 * 1024 / 1500, i.e. roughly 43 such datagrams). This should reduce the chances of filling the connections' RX buffers, as reported by Tristan in GH #1737. Must be backported to 2.6.
@Tristan971 FYI: with the current development version of haproxy, in case of an RX buffer overrun, no messages will be printed on stderr anymore. Instead, we added a new stats counter for the total number of packets dropped because of RX buffer overruns.
I was about to try it out 🙂 mangadex-pub/haproxy@fbec8df
First we add a loop around recvfrom() in the lowest-level I/O handler, quic_sock_fd_iocb(), to collect as many datagrams as possible during its tasklet wakeup, with a limit: we recvfrom() at most "maxpollevents" datagrams. Furthermore we add a local task list to the datagram handler quic_lstnr_dghdlr(), which is passed to the first datagram parser, qc_lstnr_pkt_rcv(). This latter parser only identifies the connection associated with the datagrams, then wakes up the highest-level packet parser I/O handlers (quic_conn.*io_cb()) after it is done, thanks to the call to tasklet_wakeup_after() which from now on replaces the call to tasklet_wakeup(). This should drastically reduce the latency and the chances of filling the RX buffers at the QUIC connection level, as reported in GH haproxy#1737 by Tristan. These modifications depend on this commit: "MINOR: task: Add tasklet_wakeup_after()" Must be backported to 2.6 with the previous commit. (cherry picked from commit 1b0707f) Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
With ~1500-byte QUIC datagrams, we can handle fewer than 200 datagrams, which is less than the default maxpollevents value. This should reduce the chances of filling the connections' RX buffers, as reported by Tristan in GH haproxy#1737. Must be backported to 2.6. (cherry picked from commit 649b3fd) Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
Closed with backport done.
Detailed Description of the Problem
After moving to a 2.6.0 build with QUIC enabled, the following gets sprinkled in the logs occasionally:
It is pretty rare and not sustained, but when I mentioned it to @wtarreau, he suggested it looked like a debug message and wasn't expected.
Expected Behavior
Assuming it is indeed a debug message, I expect it not to be logged.
Steps to Reproduce the Behavior
Unclear. At a minimum you need traffic going over a quic4 frontend: after removing that (while keeping the same haproxy+quictls builds), the messages stopped.
Do you have any idea what may have caused this?
No response
Do you have an idea how to solve the issue?
No response
What is your configuration?
Output of haproxy -vv
Last Outputs and Backtraces
No response
Additional Information
No particular patch applied to the source. Running on Ubuntu 20.04, typical L7->L7 workload.