[v2.2dev8] looping stream #662
Thank you William. Could you please issue "p *(struct stream*)0x7fcbea72aa00" so that we get all the elements from that stream? It's interesting to see that the other side has closed; I suspect there's an unclean shutdown situation that's not correctly handled by process_stream().
here it is
I didn't notice the L7 retries, but here, if we have this buffer and a close on the other side, I suspect we might be in the process of trying to retry, otherwise the buffer would have been released. If you can try again with the "retry-on" line commented out, to confirm or refute the hypothesis, that would be nice. If it happens to be the trigger, we'll ping Olivier.
I started a new test without the "retry-on" line.
(confirmed, the new core is the same)
I tried to do a bisection, but don't put too much value on it, I'm not 100% sure of the results: commit 065a025 is the first bad commit
I'm still failing to understand how that could wake the stream up given that it's at the opposite end of the chain. Given that the input is closed (both sides have CF_SHUTR), what might happen is that we do not disable reading despite being closed, and the event would then be reported multiple times, but that would be the wrong cause and just a symptom. Here both sides have CF_SHUTR, so I don't understand why the stream doesn't stop and abort completely.
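To make the flag reasoning above concrete, here is a minimal C sketch of the termination condition being discussed. CF_SHUTR is the flag name from the comment; the structures, the flag value and the stream_should_abort() helper are hypothetical simplifications, not HAProxy's actual process_stream() logic.

```c
#define CF_SHUTR 0x00000001   /* read shutdown seen (illustrative value) */

struct channel { unsigned int flags; };
struct stream  { struct channel req, res; };

/* Once both directions have seen a read shutdown, nothing more can
 * flow either way, so the stream should abort instead of being woken
 * up over and over. */
static int stream_should_abort(const struct stream *s)
{
    return (s->req.flags & CF_SHUTR) && (s->res.flags & CF_SHUTR);
}
```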
In fact I'm still wondering what the impact of retries can be here (and not necessarily L7). Maybe we're attempting a retry just after the client has aborted and we ignore this condition. I'll need to try to model this case to understand if (and how) it can happen.
So after analysis it turns out that your bisect was correct. There is a fundamental flaw in the asynchronous connection wake-up sequence, which we thought we had worked around a few commits later, but didn't completely. The problem is that we don't know if there's still someone listening on a connection when deciding whether or not to disable an event. Initially someone is subscribed, then as soon as it's woken up it gets unsubscribed. It's comparable to EPOLL_ONESHOT. So the read event strikes, the subscriber (the mux) is woken up, receives some data, fills its buffer and doesn't subscribe again. We purposely don't disable polling, hoping that this data will be quickly forwarded so we can get back to reading again.

The problem is that forwarding requires someone to actually consume the received data, and if the channel's buffer is full because the client stalled, this doesn't happen. In this case nobody disables polling at all: the mux hasn't resubscribed due to the buffer-full condition and will not be woken up again. In your case it's more visible because it happens on the very last block and the server has closed. As such it contains a shutdown event that will be reported forever. This one has a special effect since it makes the mux wake up the upper layer stream, which is seen as spinning.

But disabling polling only on POLL_HUP would be pointless, as I managed to make it loop as well by retrieving a 1 GB file and pressing Ctrl-S in curl immediately at the beginning of the transfer, which causes the transfer to stall with system buffers full. In this case neither the stream nor the mux is woken up. Only conn_fd_handler() runs, hoping "someone" will stop the FD, but there's nobody there anymore. For now the only solution is to revert this change. This tells us two things that we've been suspecting over time but are now proven by this issue:
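As an illustration of the one-shot subscription model described above, here is a minimal C sketch under simplified assumptions; the subscriber/conn_events types and event_ready() are invented names, and HAProxy's real subscription API differs.

```c
#include <stddef.h>

/* one-shot subscriber, comparable to EPOLL_ONESHOT: consumed when fired */
struct subscriber {
    void (*wake)(void *ctx);
    void  *ctx;
};

struct conn_events {
    struct subscriber *recv_subs;  /* NULL once the event has been delivered */
};

static void event_ready(struct conn_events *ev)
{
    struct subscriber *s = ev->recv_subs;

    if (!s)
        return;             /* nobody listening anymore; since polling was
                             * deliberately left enabled, the event keeps
                             * firing and nobody is left to disable it */

    ev->recv_subs = NULL;   /* one-shot: unsubscribed upon wake-up */
    s->wake(s->ctx);        /* the mux reads; if its buffer filled up, it
                             * does NOT resubscribe here */
}
```

The stuck state is the one where recv_subs is NULL while polling is still enabled: the poller keeps reporting the event, but there is no subscriber left to either consume it or disable it.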
I'm going to work on fixing this now and will push patches after thorough testing. I still think there's a minor glitch at the stream-interface level, based on the flag combinations I've seen, but I don't think it has a real impact. CCing @cognet, who could be interested in this side effect since we got this initial idea together. Let's now see what we can do for 2.3 :-)
Thank you very much for the full explanation, I learned a lot and still have to process the commits you mentioned. As I'm spending some time on sockets for io_uring, this is useful information for me to understand the interaction between raw_sock/connection/listener/mux.
This effectively reverts the two following commits:

6f95f6e ("OPTIM: connection: disable receiving on disabled events when the run queue is too high")
065a025 ("MEDIUM: connection: don't stop receiving events in the FD handler")

The problem, as reported in issue #662, is that when an event signals the readiness of input data that has to be forwarded over a congested stream, the mux will read data and wake the stream up to forward them, but the buffer full condition makes this impossible immediately, and then nobody in the chain will be able to disable the event after it was first reported. And given that we don't know at the connection level whether an event was already reported or not, we cannot decide anymore to forcefully stop it if for any reason its processing gets delayed.

The problem is magnified in issue #662 by the fact that a shutdown is reported with pending data occupying the buffer. The shutdown will strike in loops and cause the upper layer stream to be notified until it's handled, but with a buffer full it's not possible to call cs_recv(), hence to purge the event.

All this can only be handled optimally by implementing a lower-layer, direct mux-to-mux forwarding that will not require any scheduling. This way no wake-up will be needed and the event will be instantly handled or paused for a long time. For now let's simply revert these optimizations.

Running a 1 MB transfer test over H2 using 8 connections having each 32 streams with a limited link of 320 Mbps shows the following profile before this fix:

     calls  syscall        (100% CPU)
    ------  -------
    259878  epoll_wait
    519759  clock_gettime
     17277  sendto
     17129  recvfrom
       672  epoll_ctl

And the following one after the fix:

     calls  syscall        (2-3% CPU)
    ------  -------
     17201  sendto
     17357  recvfrom
      2304  epoll_wait
      4609  clock_gettime
      1200  epoll_ctl

Thus the behavior is much better. No backport is needed as these patches were only in 2.2-dev.

Many thanks to William Dauchy for reporting a lot of details around this difficult issue.
I've just pushed the fix. On a specific test (client-side bandwidth limitations), I was seeing 100% CPU before the fix and 2-3% now. Quite a big difference!
Regarding the sock/connection/mux etc. interactions, I still have a project to document the new architecture. It's amazingly difficult work, especially since it changes while you look at it. But having a reference point in time would definitely help us a lot!
Oh yes, I would be your first reader of such documentation. Thanks for the patch, I will test it right away!
@wtarreau to give you quick feedback, it's not very conclusive on my side. CPU consumption is very high (> 10 cores instead of 5 at this hour). I'm giving it some time to see whether it will abort by itself or a looping stream will be detected.
Hmm, that's bad, it's possible that there's something else :-(
Could you please emit "show activity" on the CLI as well as "show info"?
Thanks,
Willy
no coredump so far, but all cores are now busy:
I agree that for ~2 Gbps, even in SSL, 100% on 16 threads is pretty high. I find that the run queue is quite large, at roughly 4k entries. I'm extremely surprised by the poll_skip counter: it accounts for 1/4 of the calls to process_stream and, as documented in the code, it should only happen when the FD was migrated, so that leaves me with doubts, as I'm unsure what could really be causing this. When you say that it's using twice as many cores, is that compared to 2.1 or to the latest properly working 2.2?
Or said differently, have you seen 2.2 report a reasonable idle_pct value in "show info"?
Yes, sorry for the lack of precision: I'm always comparing a node with v2.1 and another with the latest v2.2 to make sure I have an idea of the expected average usage at the time of the test (basically because, depending on the hour, the usage can change drastically). That being said, I can probably put a v2.2dev3 back and compare again.
OK, that's quite useful info! So at least we can say that 2.2 still has a problem that 2.1 doesn't have, and that prevents it from going idle.
Yes. I also don't see how I could try a bisect, because I guess we are now talking about a different issue (should I open a new entry?), i.e. the looping stream issue seems resolved, as I have not seen the message yet.
I'm suspecting we have two bugs in fd_takeover(). In the non-double-CAS case, if we fail to grab the FD because we're competing on it with other threads, we prepare ret=-1 but we still perform fd_stop_recv() on this FD that we don't own. In the double-CAS case, we do not disable receiving. I'd at least do something like calling fd_stop_recv() only when ret == 0 in both cases. @cognet any idea on this?
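A minimal sketch of the proposed fix, reduced to its control flow; fd_stop_recv() is the real function named above, while try_grab_fd() is a hypothetical stand-in for the CAS/double-CAS ownership logic, which in reality uses atomics and more state.

```c
/* assumed available from the fd layer */
extern void fd_stop_recv(int fd);
/* hypothetical stand-in for the (double-)CAS ownership grab */
extern int try_grab_fd(int fd, void *expected_owner);

/* simplified sketch: only touch the FD's polling state when the
 * takeover actually succeeded */
int fd_takeover_sketch(int fd, void *expected_owner)
{
    int ret = -1;

    if (try_grab_fd(fd, expected_owner))
        ret = 0;

    if (ret == 0)
        fd_stop_recv(fd);   /* we now own the FD: safe to stop receiving */

    /* on failure, never call fd_stop_recv() on an FD that may still
     * belong to another thread */
    return ret;
}
```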
Regarding the bisect, it would be a real pain due to too many other interleaved bugs that could cause similar patterns. I'd rather add more info in the code to improve debugging :-/
Just to chime in here (because I can't say anything about the issue itself 😃): You technically could apply the revert manually using the |
Sure, but the two problems are 1) that the code around this area has changed quite a bit over the versions, making the patch not apply at all most of the time, or requiring different sets of patches to be reverted, and 2) that other issues can definitely cause the same symptoms, adding quite some noise to the observation.
I think you're correct on both counts, I'm going to fix that.
Cool, that might warrant a new test once addressed, because given the high number of poll_skip in the output, it could indicate that we've left unstoppable FDs behind during HTTP reuse. I've added a "poll_io" counter in the activity report to see how many of the wakeups are caused by poll() activity. If it gets very high, it could indicate a stuck FD.
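For readers unfamiliar with the activity report, this is roughly what such a per-thread counter looks like; the poll_io name comes from the comment above, while the struct layout, MAX_THREADS and the tid parameter are illustrative simplifications, not HAProxy's actual code.

```c
#define MAX_THREADS 64   /* illustrative limit */

/* per-thread activity counters */
struct activity {
    unsigned int poll_io;   /* wakeups where the poller reported I/O */
    /* ... other counters such as loops, poll_skip, ... */
};

static struct activity activity[MAX_THREADS];

/* called from the polling loop, once per wakeup */
static void account_poll_wakeup(int tid, int nb_events)
{
    if (nb_events > 0)
        activity[tid].poll_io++;   /* this wakeup was due to I/O events */
}
```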
@cognet I don't know whether you were finished with the fd fixes but I took the liberty to apply the two fixes you just sent. I've also applied the new poll_io counter.
Nice, so that shows that 85% of the wakeups come from I/O activity. I cannot achieve such high ratios; even with empty objects I'm at 76%. This could indicate a stuck FD. My poll_skip_fd count can reach 50% of poll_io though, due to aggressive reuse on the server side. If I put "http-reuse never", poll_skip_fd stays at zero. So the ~20% you're seeing might not be that much of an issue after all. If you're in a situation where it's possible to try with "http-reuse never", that could be extremely informative about possible causes.
we are indeed using
OK, and now we clearly see that poll_skip is zero, so that cannot be caused by spurious wakeups from migrated FDs. The ratio of process_calls to cumreq is clean (2x), so for now I'm ruling out the possibility of a stream-interface issue causing streams to loop. The ratio of polled I/O to loops still seems high, but it's well balanced across all threads, so either it's normal or the problem is generalized. The low value for avg_loop_us is good for latency, but could also indicate that the majority of wakeups do nothing (such as a call to conn_fd_handler() and a return). Among the things I'm thinking about that could possibly help, in order of complexity:
Why not simply
It seems that symbols are included as well, from gdb output. DWARF, in my experience
It is different.
True, indeed. I was under the impression that the culprit might be hot code, where
Based on your latest description, it sounds like a good place for eBPF + uprobe. That, though, could rank at the far end of the list. Thank you for your patience and sorry for the noise. :-)
You don't have to be sorry, your proposal was totally valid, it's just that the context is not suitable. But it could have been a time saver, so don't be shy next time :-) By the way, just like I don't use perf for counting, I don't use gprof for timing, it's totally inaccurate as well! In short, counting => gprof, timing => perf.
One interesting point to note regarding performance is the cost of activating L7 retries. On my machines, with just 4 threads over 2 cores and your bufsize, the performance drops from 191k to 125k req/s. This is caused by the requirement to memcpy() the request buffer. I suspect it could be made smarter by using HTX functions, which will perform a sparse copy, but at least I wanted to mention this as a warning.
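To illustrate where the cost comes from, here is a conceptual C sketch with invented names (retry_ctx and save_request_for_retry() are hypothetical, not HAProxy's actual retry code): to be able to replay the request on a new server connection, a private copy has to be kept before the buffer is handed off for sending.

```c
#include <stdlib.h>
#include <string.h>

/* hypothetical context keeping a replayable copy of the request */
struct retry_ctx {
    char  *saved;
    size_t len;
};

static int save_request_for_retry(struct retry_ctx *rc,
                                  const char *buf, size_t len)
{
    rc->saved = malloc(len);
    if (!rc->saved)
        return -1;
    memcpy(rc->saved, buf, len);  /* the per-request memcpy() whose cost
                                   * shows up in the numbers above */
    rc->len = len;
    return 0;
}
```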
@wdauchy I have a present for you :-) I'm attaching a patch that adds experimental support for edge-triggered mode. This way you don't need to revert the patch above, and this would once and for all rule out any polling-related issues. It only does it for connections, not other FDs, since we know that not everything is compatible. But the connection layers should be OK now in 2.2. This will require the explicit addition of
0001-MEDIUM-fd-add-experimental-support-for-edge-triggere.patch.txt
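For background, this is what edge-triggered mode looks like at the epoll level; the sketch below uses the standard Linux epoll API, not HAProxy's code.

```c
#include <sys/epoll.h>

/* register fd in edge-triggered mode on an existing epoll instance */
int register_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = { 0 };

    ev.events  = EPOLLIN | EPOLLET;  /* EPOLLET: one notification per
                                      * readiness transition, instead of
                                      * one per poll while data remains */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

The flip side is that with EPOLLET the consumer must drain the socket until it sees EAGAIN, otherwise no further event is reported; that property is exactly why this mode rules out the "event reported forever" loops discussed earlier in this issue.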
thanks @wtarreau |
Ah I mistyped it, sorry! It's "tune.fd.edge-triggered on" (a missing "e"). |
wow - I'm tired, I did not even re-read myself |
@wtarreau thanks for your patch. For now my results are as follows:
Those two tests are not conclusive: the process progressively uses twice as many cores as a node running v2.1, and after some more time uses almost all 16 threads. I will go through the perf experimentation to find out more information.
Thanks. So now it's certain the issue is not related to polling anymore. I'd really love to be able to reproduce it, as I'm out of ideas now :-(
There is something very strange, by the way. Even with edge-triggered polling, it is still I/O events that make up 85% of the wakeups. But these cannot strike more than once. So that leaves us with a few possibilities:
If, once the issue happens, you can issue a "show sess all" and a "show fd" into a file, that would be awesome: I could try to see if some streams are called many times, for what apparent reason, and if they're attached to an FD in a strange state. These will be huge (multi-megabytes) and will contain some partially private data, so don't post them here!
Thanks!
Output of haproxy -vv and uname -a
On top of v2.2dev8 we have the following patches:
63a8738 MEDIUM: pools: directly free objects when pools are too much crowded
a1e4f8c MINOR: pools: compute an estimate of each pool's average needed objects
56192cc BUG/MEDIUM: checks: Don't add a tcpcheck ruleset twice in the shared tree
What's the configuration?
I might provide more info if needed
Actual behavior
We are getting quite regular infinite loops in v2.2dev8:
I can provide a coredump privately if there is not enough information.