linux: exploit eventfd in EPOLLET mode to avoid syscall per wakeup #4400
Conversation
My initial reaction is that this is not how edge triggering would usually work for epoll, so I am skeptical of how it would work here (it would work if this were kevent, for which there is a different PR open now, but it is not). The man page for eventfd does not seem to mention being permitted to skip the read syscall. eventfd_read and eventfd_write are documented to be thin wrappers around read/write, so I would prefer not to add those wrappers either. Is there an undocumented Linux kernel bug or feature here whereby edge triggering on eventfd does not follow normal edge-trigger semantics (which normally require 2 read calls every time: one to drain and a second to hit EAGAIN and re-arm the edge trigger)?
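For reference, a minimal sketch of the read-until-EAGAIN drain being described, as conventionally used with edge-triggered, non-blocking fds. The function name is illustrative; this is not libuv code:

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Conventional ET re-arm: keep reading until the fd is fully drained.
 * For a non-semaphore eventfd, the first successful read already resets
 * the counter to zero, so the second read exists only to observe EAGAIN. */
static void drain_eventfd(int fd) {
  uint64_t val;
  for (;;) {
    ssize_t n = read(fd, &val, sizeof(val));
    if (n == sizeof(val))
      continue;                       /* consumed the counter; read again */
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
      break;                          /* drained; the edge is re-armed */
    if (n == -1 && errno == EINTR)
      continue;                       /* interrupted; retry */
    break;                            /* unexpected result; bail out */
  }
}
```

That second read returning EAGAIN is the per-wakeup syscall cost under discussion.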
I'm not sure what this "normal edge trigger semantics" looks like in your mind, but I'm sure that "which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger" is wrong; I don't think we need to re-arm the events of the eventfd. Back to the rationale behind this PR: the main difference is between LT and ET notification. I'll quote this paragraph from the man pages:
Every time we write to the eventfd, the wakeup callback fires and queues the item; it is only for level-triggered items that epoll_wait() re-inserts the item into the ready list afterwards, as this branch of the kernel's eventpoll code shows:

```c
else if (!(epi->event.events & EPOLLET)) {
	/*
	 * If this file has been added with Level
	 * Trigger mode, we need to insert back inside
	 * the ready list, so that the next call to
	 * epoll_wait() will check again the events
	 * availability. At this point, no one can insert
	 * into ep->rdllist besides us. The epoll_ctl()
	 * callers are locked out by
	 * ep_send_events() holding "mtx" and the
	 * poll callback will queue them in ep->ovflist.
	 */
	list_add_tail(&epi->rdllink, &ep->rdllist);
	ep_pm_stay_awake(epi);
}
```
Just in case, one more thing about EPOLLET: given the kernel behavior above, reading the eventfd until EAGAIN is returned is not required here.
I think that's the common understanding when using an fd that points to a socket, for example, right? Is this different in the case of eventfd?
Yes, we often do that when working with sockets, and that's legit.
To clarify, sockets and eventfd share the same implementation of readiness notification inside epoll.
Hum, I still don't get it. Let's say we have the loop thread plus 4 others.
Where does EPOLLET come into play here? Is this an optimization for the case in which we read the value before all threads have incremented it? So say we read 2, then another thread performs the write and we'd be woken up again to read a 1, and so on?
Sorry, I don't get your example, because I don't see how it has anything to do with EPOLLET.
Perhaps some sample code would help here. Granted, if one of the threads performs the write after the loop has read the value, a new wakeup will happen, and that is OK, because we don't know how much later that was. What am I missing?
If you were talking about N threads calling uv_async_send to wake one specific thread t1, where the number of wakeups is less than N because t1 read the eventfd and reset it to zero, then that's true regardless of whether we use the old implementation with LT or the new one with ET. If you want to trigger exactly N wakeups, you need to specify EFD_SEMAPHORE.
Again, this will happen with either LT or ET.
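To illustrate the EFD_SEMAPHORE distinction, a small self-contained demo; the behavior follows eventfd(2), but the program itself is only an illustration, not libuv code:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void) {
  uint64_t one = 1, val;
  int fd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);

  for (int i = 0; i < 3; i++)
    write(fd, &one, sizeof(one));        /* counter is now 3 */

  int reads = 0;
  while (read(fd, &val, sizeof(val)) == sizeof(val))
    reads++;                             /* each read returns 1, decrements */

  printf("successful reads: %d\n", reads);  /* 3 with EFD_SEMAPHORE; without
                                               it, one read returns 3 and
                                               resets the counter to zero */
  close(fd);
  return 0;
}
```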
Correct, that was the case I was making.
Right, but uv_async is defined so that it wakes up at least once, regardless of the number of times uv_async_send() is called.
Ok. So then, can you please elaborate on what case this PR helps with?
To sum up: before this PR, we register the eventfd in level-triggered mode and read(2) it on every wakeup; with this PR, we register it with EPOLLET and skip that read.
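A simplified sketch of that before/after contrast (hypothetical handler names; not libuv's actual uv__async_io()):

```c
#include <stdint.h>
#include <unistd.h>

/* Before: level-triggered. The counter must be consumed on every wakeup,
 * otherwise epoll_wait() keeps reporting the fd as readable. */
void on_wakeup_lt(int wakeup_fd) {
  uint64_t val;
  read(wakeup_fd, &val, sizeof(val));   /* one read(2) per wakeup */
  /* ... run the pending async callbacks ... */
}

/* After: edge-triggered. New writes still produce new wakeups, so the
 * counter can be left alone and the read(2) disappears entirely. */
void on_wakeup_et(int wakeup_fd) {
  (void) wakeup_fd;                     /* no read needed */
  /* ... run the pending async callbacks ... */
}
```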
So, what I am pointing out is that this contradicts the documentation for epoll_wait.
What is strange is that it gives an exception for stream-oriented files, but then makes a claim that directly contradicts the documentation for eventfd.

To be clear, I am content to assume that this is a kernel bug that we can (ab)use in our favor for eventfd, but it would be good to make sure we can actually rely on the behavior of epoll_wait not exactly following this part of its documentation. According to the documentation for epoll, it seems a new call to epoll_wait should not report the eventfd again until the counter has been read.
In what case would that manifest? When someone calls uv_async_send we'll get a wakeup and read from the fd, thus resetting it to zero; we cannot avoid that. So in what circumstance would we leave the counter non-zero until it's called again?
This would be true under one specific circumstance: the interval between two writes is extremely small. So if you issued a new write right after the previous one without a gap, eventfd in ET mode would be woken up only once. By contrast, if you issued a new write some time after the previous write, it would be woken up twice. This is because Linux coalesces multiple ready events on the same file descriptor that fall within a short time window and only triggers the wakeup callback once. As for the confusing description in the man pages, it's just a common pattern of working with EPOLLET. And I don't think this is a kernel bug; based on the source code of Linux, it's solid, and it looks more like a feature to me.
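The timing claim can be compressed into a tiny demo. This is illustrative only: with a thread actually blocked in epoll_wait(), whether one or two wakeups occur depends on the gap between the writes, which is the point made above:

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void) {
  int efd = eventfd(0, EFD_NONBLOCK);
  int ep = epoll_create1(0);
  struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = efd };
  epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

  uint64_t one = 1, val;
  write(efd, &one, sizeof(one));        /* two writes, back to back */
  write(efd, &one, sizeof(one));

  int n = epoll_wait(ep, &ev, 1, 0);    /* both writes surface as one event */
  read(efd, &val, sizeof(val));
  printf("events=%d counter=%llu\n", n, (unsigned long long) val);  /* 1, 2 */

  close(efd);
  close(ep);
  return 0;
}
```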
We can avoid that by using EPOLLET.
It looks like the kernel may not care whether the read actually ever happened, as it doesn't have a cheap way to check whether the item in the queue is currently LT or ET, so it just sets it after every write? I am not entirely certain of how the kernel decides when to clear the bit (for LT).
Sorry if it feels like I am ignoring that you made that comment before, but the documentation for this function does not seem to line up with your description; instead it repeatedly states that reading the file is required afterwards, while reiterating that this "common pattern of working with EPOLLET" that you are using should not be relied upon, as it may work in most cases but is not guaranteed to work correctly all of the time or in all cases. Do you know if there is any kernel documentation for eventfd that might clarify this?
What's your main concern here? I'm still not quite sure.
The man page for epoll_wait specifically says this PR is not correct, but the kernel implementation may allow it. Do we trust that the Linux kernel developers will never update the implementation to more closely follow the man page, given that the kernel documentation for this appears to be non-existent?
There is actually also a second, more complicated concern, which is that this syscall must have sequentially-consistent ordering. The read/write pair used to guarantee that, but does epoll_wait also guarantee it? (edit: we can probably fix this by adding explicit fences, if necessary)
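A sketch of the ordering concern with C11 atomics. The names are hypothetical and this is not libuv's actual implementation; it only shows where a seq_cst operation (or an explicit fence) would stand in for the ordering the read/write pair used to provide:

```c
#include <stdatomic.h>

static atomic_int pending;

void producer_send(void) {
  /* ... enqueue work for the loop thread ... */
  atomic_store(&pending, 1);   /* seq_cst store: published before the wakeup */
  /* write(wakeup_fd, &one, sizeof(one)); */
}

void consumer_wakeup(void) {
  /* Returned from epoll_wait(); with this PR there is no read(2) of the
   * eventfd here, so the atomic provides the ordering instead. */
  if (atomic_exchange(&pending, 0)) {
    /* ... process the queued work ... */
  }
}
```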
I actually don't think this PR contradicts the man page.
When we receive a readable event from the eventfd, all we need to know is that at least one write happened since the last wakeup; we never care about the counter's actual value.
Maybe this could make things more interesting: https://lwn.net/Articles/865400/
Alright, yeah, that email from Linus stating that this PR relies on a kernel bug that probably won't get fixed (https://lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/) is probably compelling.
Note that this stated usage is still wrong (per the documentation and implementation), but this case is fundamentally different because it never reads from the fd and only ever cares about the occurrence of the write syscall itself (which Linus's email indicates is a kernel bug that probably won't get fixed), never about the quantity of data written (which is not an existing kernel bug, but may be an existing reliability issue in some applications).
Well, it is shocking to me... Normally you don't expect something like epoll, whose design and source code are so sophisticated, to have been fundamentally broken since it was implemented. But thanks for the link; it's frustrating but useful.
Sorry, why would it be wrong to read a socket fd in ET mode until EAGAIN is returned? Could you clarify that with more context?
If you do read until EAGAIN, it is correct, but it wastes a syscall every time before you go back to calling epoll_wait.
Oh, I think we were just not on the same page about that. I was specifically talking about EAGAIN with ET mode; it's clear that we don't have to read until EAGAIN in LT mode. Now we are in sync.
So, what are we going to do with this PR? Does it still seem worth going through with?
I'd say some benchmarks would help decide.
Ping @libuv/collaborators
I see @saghul gave his approval on this a long time ago, but I guess that will not suffice to merge this PR. Since @bnoordhuis and @vtjnash seem to have had a lot of other work on their plates recently and no time to review here, I'm wondering if there are other available collaborators @libuv/collaborators who could chime in and review this PR?
Another two weeks have passed and yet I still haven't heard back from @libuv/collaborators. I wonder what's going on; could anyone help make progress on this PR? It's been dangling for a long time. @bnoordhuis @vtjnash @saghul
Another two weeks have passed, still zero response... Helplessly and despairingly ping @bnoordhuis @vtjnash @saghul @libuv/collaborators
Sorry for the delay. I was AFK for a long time because of Real World Reasons. Cost/benefit trade-off: I'm somewhat worried this change may cause hard-to-debug regressions in existing programs, whereas the benefits won't be very visible to most users.
I'll happily step aside if other maintainers think it's a good idea, but on the whole, I'd be in favor of making no changes. Aside: I played with switching libuv to edge-triggered mode wholesale about 10 years ago, but I rapidly came to the conclusion that ET is not always faster and is frequently more trouble than it's worth in programs where not all components are equally well-behaved. It works great when you're nginx; it's got a lot of sharp edges when you're node.js.
Thank you for sharing your insightful thoughts on this! I think the fact that nginx, netty, and tokio have employed this methodology for years should give us more confidence here; I wouldn't worry so much, although I do understand your concern. Nonetheless, what I have in mind is: would it make more sense to gate this optimization behind a compile-time define in CFLAGS, making it opt-in?
Well, I don't know. It's one of those "diminishing returns" things. It probably doesn't pay to sink too much effort into it, simply because for most users it's a non-issue, and any bugs that pop up are likely very subtle. The last time I tried to improve async handle performance, it caused fairness issues with a particular scheduling policy on single-core Linux machines - and only there. It took more time to track down (and then a lot of back and forth to mitigate) than I'd like to admit, and I'm in no rush to do that again. :-)
What is blocking you then? I am in favor of merging this (and #4330). Like Ben, I have had Real World Reasons that have kept me away for a long time, but I am slowly making my way back around to the things I have been putting off for a while. I would like to see a documentation update, so that kernel developers might think twice before breaking this in the future, but given the number of other significant projects that would also be broken by this, I see no reason to avoid this now.
AFAIK, this was because you were coding it correctly (reading/writing until EAGAIN), as the documentation specifies is required, rather than exploiting the implementation details the way the other big projects do. It might be interesting to try again with ET now, given that we know the kernel is now probably committed to supporting it indefinitely (torvalds/linux@3b84482)
This is an insufficiently strong claim for this PR, though, which would require the stronger claim that "an event WILL be generated for each chunk of data received".
The CI failure seems to be unrelated, though it's not one that we have seen happen before:

```
not ok 336 - tcp_reuseport
# exit code 134
# Output from process `tcp_reuseport`:
# Assertion failed in ../../test/test-tcp-reuseport.c on line 232: `thread_loop1_accepted > 0` (0 > 0)
```
This was supposed to be fixed by #4417; it's uncanny. Maybe the test code is to blame. I'll try to find out what's really going on.
Something like "an event WILL be generated for each chunk of data received" in the kernel docs would be best for our use case. However, from my point of view, I do think that the current wording of "multiple events can be generated upon receipt of multiple chunks of data" (along with those commits, the statements from Linus, and the successful applications in other renowned projects) should be sufficient to endorse this PR. But I will try to send a patch (which I hope won't get in the way of this PR) to the kernel docs over the weekend.
Register the eventfd with EPOLLET to enable edge-triggered notification, where we're able to eliminate the overhead of reading the eventfd via a system call on each wakeup event. When the eventfd counter reaches the maximum value of an unsigned 64-bit integer, which may not happen for the entire lifetime of the process, we rewind the counter and retry.

This optimization saves one system call on each event-loop wakeup, eliminating the overhead of read(2) as well as the extra latency for each epoll wakeup.

---------

Signed-off-by: Andy Pan <i@andypan.me>
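A sketch of the "rewind the counter and retry" behavior the message describes, assuming a non-blocking eventfd; this is illustrative rather than libuv's exact code:

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Post a wakeup. If the 64-bit counter is saturated, write(2) fails with
 * EAGAIN on a non-blocking eventfd; read it back to zero and try again. */
void wakeup_write(int fd) {
  static const uint64_t one = 1;
  uint64_t gobble;

  while (write(fd, &one, sizeof(one)) == -1) {
    if (errno == EAGAIN) {
      read(fd, &gobble, sizeof(gobble));  /* rewind the counter */
      continue;
    }
    if (errno == EINTR)
      continue;
    break;  /* unexpected error; give up */
  }
}
```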
Thank you for merging this PR! I sent a patch to the Linux man-pages a few days ago and got my first round of review from the current maintainer, but it seems to need input from another kernel mailing list <linux-api@vger.kernel.org>, so I've CCed it and am still waiting. I'll update the info here when there is any progress.
Just for notice: my patch to the man-pages was approved and placed in the merge queue by the current maintainer ten days ago; it was supposed to land after another patch in line. I'm still waiting for the maintainer to get it onto the mainline. This is the mailing list thread: https://lore.kernel.org/all/20240801-epoll-et-desc-v5-1-7fcb9260a3b2@andypan.me/
Nice!
At last, my patch got merged into the mainline: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=71988df59d2585cac1147a6f785df65693b7d77f
Congrats! That is awesome 🤩
For the moment, edge-triggered epoll generates an event for each receipt of a chunk of data, that is to say, epoll_wait() will return and tell us a monitored file descriptor is ready whenever there has been new activity on that FD since we were last informed about it. This is not a real _edge_ implementation for epoll, but it's been working this way for years and plenty of projects rely on it to eliminate the overhead of one read(2) system call per wakeup event.

There are several renowned open-source projects relying on this feature for their notification functionality (with eventfd): register the eventfd with EPOLLET and avoid calling read(2) on the eventfd when there is a wakeup event (the eventfd being written). Examples: nginx [1], netty [2], tokio [3], libevent [4], etc. [5] These projects are widely used in today's Internet infrastructure. Thus, changing this behavior of epoll ET would fundamentally break them and cause a significant negative impact. Linux changed it for pipes before [6], breaking some Android libraries, and that change got "reverted" somehow. [7] [8]

Nevertheless, the paragraph in the manual pages describing this characteristic of epoll ET seems ambiguous; I think a more explicit sentence should be used to clarify it. We've recently been improving the notification mechanism for libuv by exploiting this feature with eventfd, which brings us a significant performance boost. [9] Therefore, we (as well as the maintainers of nginx, netty, tokio, etc.) would have a sense of security in building an enhanced notification function on this feature if the man pages guaranteed that this implementation of epoll ET is retained for backward compatibility.

[1]: https://github.com/nginx/nginx/blob/efc6a217b92985a1ee211b6bb7337cd2f62deb90/src/event/modules/ngx_epoll_module.c#L386-L457
[2]: netty/netty#9192
[3]: https://github.com/tokio-rs/mio/blob/309daae21ecb1d46203a7dbc0cf4c80310240cba/src/sys/unix/waker.rs#L111-L143
[4]: https://github.com/libevent/libevent/blob/525f5d0a14c9c103be750f2ca175328c25505ea4/event.c#L2597-L2614
[5]: libuv/libuv#4400 (comment)
[6]: https://lkml.iu.edu/hypermail/linux/kernel/2010.1/04363.html
[7]: torvalds/linux@3a34b13
[8]: torvalds/linux@3b84482
[9]: libuv/libuv#4400 (comment)

Signed-off-by: Andy Pan <i@andypan.me>
Cc: <linux-api@vger.kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Message-ID: <20240801-epoll-et-desc-v5-1-7fcb9260a3b2@andypan.me>
Signed-off-by: Alejandro Colomar <alx@kernel.org>
This reverts commit e5cb1d3. Reason: bisecting says it breaks dnstap. Also revert commit 2713454 ("kqueue: use EVFILT_USER for async if available") because otherwise the first commit doesn't revert cleanly, with enough conflicts in src/unix/async.c that I'm not comfortable fixing those up manually. Fixes: #4584