linux: exploit `eventfd` in `EPOLLET` mode to avoid syscall per wakeup #4400

panjf2000 · 2024-05-09T02:37:12Z

Register the eventfd with EPOLLET to enable edge-triggered notification where we're able to eliminate the overhead of reading the eventfd via system call on each wakeup event.

When the eventfd counter reaches the maximum value of the unsigned 64-bit, which may not happen for the entire lifetime of the process, we rewind the counter and retry.

This optimization saves one system call on each event-loop wakeup, eliminating the overhead of read(2) as well as the extra latency for each epoll wakeup.

Signed-off-by: Andy Pan i@andypan.me

vtjnash · 2024-05-09T03:38:38Z

My initial reaction is that this is not how edge triggering would usually work for epoll, so I am skeptical how it would work here (it would work if this was kevent, as there is a different PR open now for that, but it is not). The man page for eventfd does not seem to mention being permitted you to skip the read syscall. The eventfd_read and eventfd_write are documented to be thin wrappers around the read/write, so I would prefer to stick without those added wrapper as well.

Is this an undocumented linux kernel bug or feature here that edge trigger on eventfd does not follow normal edge trigger semantics (which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger)?

panjf2000 · 2024-05-09T05:10:35Z

Is this an undocumented linux kernel bug or feature here that edge trigger on eventfd does not follow normal edge trigger semantics (which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger)?

I'm not sure what this "normal edge trigger semantics" looks like in your mind. But I'm sure that your "which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger" is wrong, I don't think we need to rearm the events of EPOLLET when we didn't specify the EPOLLONESHOT.

Back to the rationale behind this PR, the main difference between EPOLLLT and EPOLLET is that ET completely relies on the wakeup callback mechanism to add events to the ready list and remove them after the poll while LT doesn't absolutely rely on the wakeup callback mechanism but will always add events back to the ready list after the poll, which means that programs with EPOLLET mode get notified only when there is new ready event occurring where a wakeup callback associated with each epoll entry will be triggered and it ends up calling ep_poll_callback() to add the event to the ready list, if that event is ignored (in our case, we don't read the eventfd), they won't be notified again until the next event arrives (in our case, we write new data to the eventfd) because unlike LT this event won't be added back in ready list under ET mode.

I'll quote this paragraph from the man pages¹:

Since even with edge-triggered epoll, multiple events can be
generated upon receipt of multiple chunks of data, the caller has
the option to specify the EPOLLONESHOT flag, to tell epoll to
disable the associated file descriptor after the receipt of an
event with epoll_wait(2). When the EPOLLONESHOT flag is
specified, it is the caller's responsibility to rearm the file
descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.

Every time we write to the eventfd with EPOLLET, we get a wakeup event and are notified by epoll_wait(), but we're not obligated to read it because that event won't retain in the ready list after this wakeup²:

		else if (!(epi->event.events & EPOLLET)) {
			/*
			 * If this file has been added with Level
			 * Trigger mode, we need to insert back inside
			 * the ready list, so that the next call to
			 * epoll_wait() will check again the events
			 * availability. At this point, no one can insert
			 * into ep->rdllist besides us. The epoll_ctl()
			 * callers are locked out by
			 * ep_send_events() holding "mtx" and the
			 * poll callback will queue them in ep->ovflist.
			 */
			list_add_tail(&epi->rdllink, &ep->rdllist);
			ep_pm_stay_awake(epi);
		}

panjf2000 · 2024-05-09T05:38:45Z

Just in case, one more thing about EPOLLET and EAGAIN, we all know that one of the most common pattern of working with EPOLLET is to read until EAGAIN is returned, this often misleads people into thinking that they have to drain out the underlying buffer to get notified when there is new data arriving, but this is a classic misunderstanding because even if we don't drain out the buffer, we will still get notified whenever a new event occurs (it will trigger the wakeup callback). The man pages only suggested this way to prevent programs from hanging and never getting a chance to consume the leftover data in the underlying buffer when the remote peer stops sending data for some reason.

Therefore, reading until EAGAIN is returned is not mandatory for all the use cases of EPOLLET.

saghul · 2024-05-09T09:38:44Z

Just in case, one more thing about EPOLLET and EAGAIN, we all know that one of the most common pattern of working with EPOLLET is to read until EAGAIN is returned, this often misleads people into thinking that they have to drain out the underlying buffer to get notified when there is new data arriving, but this is a classic misunderstanding because even if we don't drain out the buffer, we will still get notified whenever a new event occurs (it will trigger the wakeup callback). The man pages only suggested this way to prevent programs from hanging and never getting a chance to consume the leftover data in the underlying buffer when the remote peer stops sending data for some reason.

Therefore, reading until EAGAIN is returned is not mandatory for all the use cases of EPOLLET.

I think that's the common understanding when using a fd that point to a socket for example, right? Is this different in case of eventfd?

panjf2000 · 2024-05-09T09:52:25Z

I think that's the common understanding when using a fd that point to a socket for example, right?

Yes, we often do that when working with sockets, and that's legit.

Is this different in case of eventfd?

To clarify, sockets and eventfd share the same implementation of EPOLLET under the hood and we should use the same pattern I described above on both of them when we care about the data on the underlying kernel buffer and need to read and use that data. But when it comes to the use case of merely using eventfd as a notification mechanism (also the exact use case in this PR), we neither care about the data on the underlying buffer nor need to read and use that data, that's why we can use the implementation in this PR to eliminate read(2) system call on each wakeup event and libuv will continue to work correctly.

saghul · 2024-05-09T09:59:59Z

Hum, I still don't get it. Let's say we have the loop thread plus other 4.

Each of the 4 non-loop threads call uv_async_send
Assuming the value they pass is 1, chances are we'd read a 4
The loop threads gets the wakeup, performs one read, gets the 4 and the counter would be reset to 0

Where does EPOLLET play here? Is this an optimization for the case in which we read the value before all threads have incremented it? So say we read 2, then another thread peforms the write and we'd be woken up again to read a 1 and so on?

panjf2000 · 2024-05-09T10:16:20Z

Hum, I still don't get it. Let's say we have the loop thread plus other 4.

Each of the 4 non-loop threads call uv_async_send

Assuming the value they pass is 1, chances are we'd read a 4

The loop threads gets the wakeup, performs one read, gets the 4 and the counter would be reset to 0

Where does EPOLLET play here? Is this an optimization for the case in which we read the value before all threads have incremented it? So say we read 2, then another thread peforms the write and we'd be woken up again to read a 1 and so on?

Sorry, I don't get your example because I don't see this example has anything to do with EPOLLET. The improvement from this PR is that we don't need to perform a read(2) for each wakeup event. Other than that, I think the new impl will behave the same way the old impl used to do.

saghul · 2024-05-09T10:24:38Z

Perhaps some sample code and the perf output would help here then. If N threads call uv_async_send there won't necessarily be N wakeups, the loop will be awakeen so long as the counter is > 0, but then it will be reset to 0, so it will be < N.

Granted, if one of the threads performs the write after the loop has read the value a new wakeup will happen, and that is ok, because we don't know how much later that was.

What am I missing?

panjf2000 · 2024-05-09T10:34:12Z

Perhaps some sample code and the perf output would help here then. If N threads call uv_async_send there won't necessarily be N wakeups, the loop will be awakeen so long as the counter is > 0, but then it will be reset to 0, so it will be < N.

If you were talking about N threads calling uv_async_send per thread to wake another one specific thread t1 and the wakeup times is less than N because t1 read eventfd and reset it to zero, then it's true regardless of the old implementation with LT or the new one with ET. If you want to trigger extract N times of wakeup, you need to specify EFD_SEMAPHORE for the eventfd.

Granted, if one of the threads performs the write after the loop has read the value a new wakeup will happen, and that is ok, because we don't know how much later that was.

Again, this will happen with either LT or ET.

saghul · 2024-05-09T11:00:14Z

If you were talking about N threads calling uv_async_send per thread to wake another one specific thread t1 and the wakeup times is less than N because t1 read eventfd and reset it to zero, then it's true regardless of the old implementation with LT or the new one with ET.

Correct, that was the case I was making.

f you want to trigger extract N times of wakeup, you need to specify EFD_SEMAPHORE for the eventfd.

Right, but uv_async is defined so it wakes up at least once, regardless of the number of times uv_async_send was called.

Again, this will happen with either LT or ET.

Ok.

So then, can you please elaborate on what case this PR helps with?

panjf2000 · 2024-05-09T11:05:32Z

To sum up, before this PR, we use eventfd in LT mode and we need to perform a read(2) for every wakeup event and this extra system call read(2) is mandatory and unavoidable, if we don't read the eventfd to reset it to zero, the kernel will keep waking up epoll_wait(). After this PR, we use eventfd in ET mode and that allows us to avoid calling read(2) to reset eventfd for every wakeup event, this helps us eliminate this system call overhead because under ET mode the kernel won't wake up the epoll_wait() again even if we don't read the eventfd to reset it to zero until next time we call uv_async_send from any threads.

vtjnash · 2024-05-09T14:18:31Z

Every time we write to the eventfd with EPOLLET, we get a wakeup event and are notified by epoll_wait

So, what I am pointing out is that this contradicts the documentation for epoll_wait, which states that if you do this, it will work most of the time, but not be reliable or robust:

• Do I need to continuously read/write a file descriptor until
EAGAIN when using the EPOLLET flag (edge-triggered behavior)?

      Receiving an event from [epoll_wait(2)](https://man7.org/linux/man-pages/man2/epoll_wait.2.html) should suggest to you
      that such file descriptor is ready for the requested I/O
      operation.  You must consider it ready until the next
      (nonblocking) read/write yields EAGAIN.  When and how you will
      use the file descriptor is entirely up to you.

      For packet/token-oriented files (e.g., datagram socket,
      terminal in canonical mode), the only way to detect the end of
      the read/write I/O space is to continue to read/write until
      EAGAIN.

      For stream-oriented files (e.g., pipe, FIFO, stream socket),
      the condition that the read/write I/O space is exhausted can
      also be detected by checking the amount of data read from /
      written to the target file descriptor.  For example, if you
      call [read(2)](https://man7.org/linux/man-pages/man2/read.2.html) by asking to read a certain amount of data and
      [read(2)](https://man7.org/linux/man-pages/man2/read.2.html) returns a lower number of bytes, you can be sure of
      having exhausted the read I/O space for the file descriptor.
      The same is true when writing using [write(2)](https://man7.org/linux/man-pages/man2/write.2.html).  (Avoid this
      latter technique if you cannot guarantee that the monitored
      file descriptor always refers to a stream-oriented file.)

What is strange is that it gives an exception for stream-oriented files, but then makes a claim that directly contradicts the documentation for read on a stream (it also may return a short read if the requested size is larger than the atomic size).

To be clear, I am content to assume that this is a kernel bug that we can (ab)use in our favor for eventfd, but it would be good to make sure we can actually rely on the behavior of epoll_wait not exactly following this part of the documentation for it. According to the documentation for epoll, it seems a new call to write while a read is already possible should not trigger a new event until after the old event has been read by the event loop and discarded.

saghul · 2024-05-09T15:01:23Z

To sum up, before this PR, we use eventfd in LT mode and we need to perform a read(2) for every wakeup event and this extra system call read(2) is mandatory and unavoidable, if we don't read the eventfd to reset it to zero, the kernel will keep waking up epoll_wait(). After this PR, we use eventfd in ET mode and that allows us to avoid calling read(2) to reset eventfd for every wakeup event, this helps us eliminate this system call overhead because under ET mode the kernel won't wake up the epoll_wait() again even if we don't read the eventfd to reset it to zero until next time we call uv_async_send from any threads.

In what case would that manifest? When someonee calls uv_async_send we'll get a wakeup and read from the fd, thus reseting it to zero, we cannot avoid that. So in what circumstance would be leave the counter to not zero intil called again?

panjf2000 · 2024-05-09T15:03:48Z

it seems a new call to write while a read is already possible should not trigger a new event until after the old event has been read by the event loop and discarded.

This would be true under one specific circumstance: the interval between two writes is extremely small. So if you issued a new write right after the previous without a gap, eventfd in ET mode would be waked up only once. By contrast, if you issued a new write after some times from the previous write, it would be waked up twice. This is because linux will converge multiple ready events on a same file descriptor that are in the short time windows and only trigger the wakeup callback once.

As for the confusing description on man pages, it's just a common pattern of working with EPOLLET, we have to analyze the situation in the real world on a case-by-case basis, like I said before: #4400 (comment) and #4400 (comment). In our case of using eventfd, we only care about the wakeup event, reading the eventfd or not on each wakeup event doesn't matter as long as we received the event as expected.

And I don't think this is a kernel bug, based on the source code of Linux, it's solid and it look more like a functionality to me.
@vtjnash

panjf2000 · 2024-05-09T15:05:12Z

To sum up, before this PR, we use eventfd in LT mode and we need to perform a read(2) for every wakeup event and this extra system call read(2) is mandatory and unavoidable, if we don't read the eventfd to reset it to zero, the kernel will keep waking up epoll_wait(). After this PR, we use eventfd in ET mode and that allows us to avoid calling read(2) to reset eventfd for every wakeup event, this helps us eliminate this system call overhead because under ET mode the kernel won't wake up the epoll_wait() again even if we don't read the eventfd to reset it to zero until next time we call uv_async_send from any threads.

In what case would that manifest? When someonee calls uv_async_send we'll get a wakeup and read from the fd, thus reseting it to zero, we cannot avoid that. So in what circumstance would be leave the counter to not zero intil called again?

We can avoid that by using EPOLLET on eventfd, this is what this PR does.

vtjnash · 2024-05-09T15:06:17Z

It looks like the kernel may not care if the read actually ever happened, as it doesn't have a cheap way to filter this for whether the item in the queue is currently LT or ET, so it just sets it after every write? I am not entirely certain of how the kernel decides when to clear the bit (for LT)
https://github.com/torvalds/linux/blob/45db3ab70092637967967bfd8e6144017638563c/fs/eventfd.c#L273-L274

vtjnash · 2024-05-09T15:11:35Z

like I said before

Sorry if it feels like I am ignoring that you made that comment before, but the documentation for this function does not seem to line up with your description, and instead repeatedly states that reading the file is required afterwards, while reiterating that this "common pattern of working with EPOLLET" that you are using should not to be relied upon, as it may work in most cases, but is not guaranteed to work correctly all of the time or in all cases.

Do you know if there is any kernel documentation for eventfd that might clarify this?

panjf2000 · 2024-05-09T15:15:50Z

What's your main concern here? I'm still not quite sure.

vtjnash · 2024-05-09T15:25:04Z

The man page for epoll_wait specifically says this PR is not correct, but the kernel implementation may allow it. Do we trust that the linux kernel developers will never update the implementation to more closely follow the man page, given that the kernel documentation for this appears to be non-existent?

vtjnash · 2024-05-09T15:27:09Z

There is actually also a second more complicated concern for it, which is that this syscall must have sequentially-consistent ordering. The read/write pair used to guarantee that, but does epoll_wait also guarantee that? (edit: we can probably fix this by adding explicit fences, if necessary)

panjf2000 · 2024-05-09T15:45:24Z

The man page for epoll_wait specifically says this PR is not correct, but the kernel implementation may allow it. Do we trust that the linux kernel developers will never update the implementation to more closely follow the man page, given that the kernel documentation for this appears to be non-existent?

I actually don't think this PR contradicts the man page.

Receiving an event from epoll_wait(2) should suggest to you
that such file descriptor is ready for the requested I/O
operation. You must consider it ready until the next
(nonblocking) read/write yields EAGAIN. When and how you will
use the file descriptor is entirely up to you.

When we receive a readable event from the eventfd, the man pages say that we must consider it ready until the next EAGAIN, it never says we can't get a new ready event again (and epoll in ET mode will always send a new ready event, and I don't think it's done by accident after all the kernel releases). If we want to read and use the data on the kernel buffer, we must read until EAGAIN is returned, but if we don't care about that data, then we don't read it and just wait for the next ready event.

panjf2000 · 2024-05-09T15:57:39Z

Maybe this could make things more interesting: https://lwn.net/Articles/865400/

vtjnash · 2024-05-09T16:33:58Z

Alright, yeah, that email from Linus stating this this PR relies on a kernel bug that probably won't get fixed (https://lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/) is probably compelling

vtjnash · 2024-05-09T16:51:28Z

Yes, we often do that when working with sockets, and that's legit.

Note that this stated usage is still wrong (per the documentation and implementation), but that this case is fundamentally different because it never reads from the fd and only ever cares about the occurrence of the write syscall itself (which Linus's email indicates is a kernel bug that probably won't get fixed), but never about the quantity of data written (which is not an existing kernel bug, but may be an existing reliability issue in some applications).

panjf2000 · 2024-05-09T17:00:11Z

Alright, yeah, that email from Linus stating this this PR relies on a kernel bug that probably won't get fixed (lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com) is probably compelling

Well, it is shocking to me... Normally you don't expect something like epoll whose design and source code are that sophisticated to be fundamentally broken when it was implemented. But thanks for the link, it's frustrating but useful.

panjf2000 · 2024-05-09T17:02:43Z

Yes, we often do that when working with sockets, and that's legit.

Note that this stated usage is still wrong (per the documentation and implementation), but that this case is fundamentally different because it never reads from the fd and only ever cares about the occurrence of the write syscall itself (which Linus's email indicates is a kernel bug that probably won't get fixed), but never about the quantity of data written (which is not an existing kernel bug, but may be an existing reliability issue in some applications).

Sorry, why would it be wrong when reading a socket fd in ET mode until EAGAIN is returned? Could you clarify that with more contexts?

vtjnash · 2024-05-09T17:10:57Z

If you do read until EAGAIN, it is correct, but wastes a syscall every time you go to call read (as you could have used LT instead, and called epoll_wait, and gotten the same info simultaneously for every socket instead of just one)

panjf2000 · 2024-05-09T17:16:46Z

If you do read until EAGAIN, it is correct, but wastes a syscall every time you go to call read (as you could have used LT instead, and called epoll_wait, and gotten the same info simultaneously for every socket instead of just one)

Oh, I think we were just not on the same page about that. I was specifically talking about EAGAIN with ET mode, it's clear that we don't have to read until EAGAIN in LT mode. Now we are in sync.

panjf2000 · 2024-05-09T17:19:27Z

So, what are we going to do with this PR? Does it still seem worth going through for libuv?

saghul · 2024-05-09T17:42:57Z

I'd say some benchmarks would help decide.

panjf2000 · 2024-06-11T16:50:45Z

Ping @libuv/collaborators

panjf2000 · 2024-06-19T03:49:50Z

I see @saghul gave his approval on this a long time ago but I guess that will not suffice to merge this PR. Since @bnoordhuis @vtjnash seemed to have got a lot of other work on their plates recently and have no time to make the review here, I'm wondering if there are other available collaborators @libuv/collaborators who could chime in and review this PR?

panjf2000 · 2024-07-04T02:57:17Z

I see @saghul gave his approval on this a long time ago but I guess that will not suffice to merge this PR. Since @bnoordhuis @vtjnash seemed to have got a lot of other work on their plates recently and have no time to make the review here, I'm wondering if there are other available collaborators @libuv/collaborators who could chime in and review this PR?

Another two weeks have passed and yet I still haven't heard back from @libuv/collaborators. I wonder what's going on, could anyone help make progress on this PR? It's been dangling around for a long time. @bnoordhuis @vtjnash @saghul

panjf2000 · 2024-07-17T06:45:41Z

Another two weeks have passed, still zero response...

Helplessly and despairingly ping @bnoordhuis @vtjnash @saghul @libuv/collaborators

bnoordhuis · 2024-07-18T07:41:50Z

Sorry for the delay. I was AFK for a long time because of Real World Reasons.

Cost/benefit trade-off: I'm somewhat worried this change may cause hard-to-debug regressions in existing programs, whereas the benefits won't be very visible to most users. Reasons why:

async handle performance reached a level long ago where it's Good Enough for most programs that an extra 10% doesn't really matter
million_async is an unrepresentative micro-benchmark because most programs don't create a million async handles, they only create a handful, oftentimes just one. Even if you make it 10x faster, it's still a wash.

I'll happily step aside if other maintainers think it's a good idea, but on the whole, I'd be in favor of making no changes.

Aside: I played with switching libuv to edge-triggered mode wholesale about 10 years ago but I rapidly came to the conclusion that ET is not always faster and frequently more trouble than it's worth in programs where not all components are equally well-behaved.

It works great when you're nginx; it's got a lot of sharp edges when you're node.js.

panjf2000 · 2024-07-18T09:00:50Z

Thank you for sharing the insightful thoughts on this!

I think that something like nginx, netty, and tokio employing this methodology for years should give us more confidence here, I wouldn't worry so much, although I may understand your concern.

Nonetheless, what I'm having in mind is: would it make more sense to enable this optimization via CFLAGS (something like UV_ASYNC_POLL_ET)? Think of it as some experimental feature, the same rules apply to #4330

bnoordhuis · 2024-07-19T20:13:48Z

Well, I don't know. It's one of those "diminishing returns" things. It probably doesn't pay to sink too much effort in it, simply because for most users it's a non-issue, and any bugs that pop up are likely very subtle.

The last time I tried to improve async handle performance, it caused fairness issues with a particular scheduling policy on single-core Linux machines - and only there.

It took more time to track down (and then a lot of back and forth to mitigate) than I'd like to admit, and I'm in no rush to do that again. :-)

vtjnash · 2024-07-22T20:28:52Z

I would if I could.

What is blocking you then? The process to use git send-email or b4 to https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/CONTRIBUTING.d/patches per https://people.kernel.org/monsieuricon/sending-a-kernel-patch-with-b4-part-1 takes extra steps compared to Github's model, but is no less open.

I am in favor of merging this (and #4330). Like Ben, I have had Real World Reasons that have kept me away for a long time, but I am slowly making my way back around to the things I have been putting off for a while. I would like to see a documentation update, so that kernel developers might think twice before breaking this in the future, but given the number of other significant projects that would also be broken by this, I see no reason to avoid this now.

Aside: I played with switching libuv to edge-triggered mode wholesale about 10 years ago but I rapidly came to the conclusion that ET is not always faster and frequently more trouble than it's worth in programs where not all components are equally well-behaved

AFAIK, this was because you were coding it correctly (waiting to read/write EAGAIN), as the documentation specifies is required, rather than being exploiting the implementation details that the other big projects do. It might be interesting now to try again with ET, given that we know the kernel is now probably committed to supporting it indefinitely (torvalds/linux@3b84482)

panjf2000 · 2024-07-22T22:20:35Z

What is blocking you then?

I didn't know the first thing about submitting a patch to the kernel docs, I assumed it would take a long process to review before it's done. Back then I was hoping for a quick review of this PR followed by an approval or disapproval so I can move on from this. To be frank, I didn't expect such a long pause on this PR.
linux: exploit eventfd in EPOLLET mode to avoid syscall per wakeup #4400 (comment)

vtjnash · 2024-07-25T16:31:20Z

"multiple events can"

This is an insufficiently strong claim for this PR though, which would require the stronger claim that "an event WILL be generated for each chunk of data received"

vtjnash

The CI failure seems to be unrelated, though not one that we have seen happen before

not ok 336 - tcp_reuseport
# exit code 134
# Output from process `tcp_reuseport`:
# Assertion failed in ../../test/test-tcp-reuseport.c on line 232: `thread_loop1_accepted > 0` (0 > 0)

panjf2000 · 2024-07-25T22:36:14Z

The CI failure seems to be unrelated, though not one that we have seen happen before
not ok 336 - tcp_reuseport
# exit code 134
# Output from process `tcp_reuseport`:
# Assertion failed in ../../test/test-tcp-reuseport.c on line 232: `thread_loop1_accepted > 0` (0 > 0)

This was supposed to be fixed by #4417, it's uncanny. Maybe it should be blamed on the test code. I'll try to find out what's really going on.

panjf2000 · 2024-07-25T22:50:29Z

"multiple events can"

This is an insufficiently strong claim for this PR though, which would require the stronger claim that "an event WILL be generated for each chunk of data received"

Something like "an event WILL be generated for each chunk of data received" in the kernel docs would be the best for our use case. However, from my point of view, I do deem that the current version of "multiple events can be generated upon receipt of multiple chunks of data" (along with those commits and statements from Linus and the successful applications in other renowned projects) should be sufficient to endorse this PR.

But I will try to send a patch (which I hope won't get in the way of this PR) to the kernel docs over the weekend.

Register the eventfd with EPOLLET to enable edge-triggered notification where we're able to eliminate the overhead of reading the eventfd via system call on each wakeup event. When the eventfd counter reaches the maximum value of the unsigned 64-bit, which may not happen for the entire lifetime of the process, we rewind the counter and retry. This optimization saves one system call on each event-loop wakeup, eliminating the overhead of read(2) as well as the extra latency for each epoll wakeup. --------- Signed-off-by: Andy Pan <i@andypan.me>

panjf2000 · 2024-07-30T00:24:27Z

Thank you for merging this PR! I sent a patch to the Linux man-pages a few days ago and I've got my first round of review from the current maintainer, but it seemed to need more from another kernel email group <linux-api@vger.kernel.org>, so I've CCed it and still waiting for them. I'll update the info here when there is any progress.

panjf2000 · 2024-08-12T10:33:33Z

Thank you for merging this PR! I sent a patch to the Linux man-pages a few days ago and I've got my first round of review from the current maintainer, but it seemed to need more from another kernel email group <linux-api@vger.kernel.org>, so I've CCed it and still waiting for them. I'll update the info here when there is any progress.

Just for notice, my patch to the man-pages was approved and placed in the merging queue by the current maintainer ten days ago, it was supposed to be after another patch in line.

I'm still waiting for the maintainer to get it landed on the mainline, this is the mailing list: https://lore.kernel.org/all/20240801-epoll-et-desc-v5-1-7fcb9260a3b2@andypan.me/

saghul · 2024-08-12T11:00:28Z

Nice!

panjf2000 · 2024-08-22T01:13:04Z

Thank you for merging this PR! I sent a patch to the Linux man-pages a few days ago and I've got my first round of review from the current maintainer, but it seemed to need more from another kernel email group <linux-api@vger.kernel.org>, so I've CCed it and still waiting for them. I'll update the info here when there is any progress.

Just for notice, my patch to the man-pages was approved and placed in the merging queue by the current maintainer ten days ago, it was supposed to be after another patch in line.

I'm still waiting for the maintainer to get it landed on the mainline, this is the mailing list: lore.kernel.org/all/20240801-epoll-et-desc-v5-1-7fcb9260a3b2@andypan.me

At last, my patch gets merged into the mainline: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=71988df59d2585cac1147a6f785df65693b7d77f

vtjnash · 2024-08-22T01:35:46Z

Congrats! That is awesome 🤩

For the moment, the edge-triggered epoll generates an event for each receipt of a chunk of data, that is to say, epoll_wait() will return and tell us a monitored file descriptor is ready whenever there is a new activity on that FD since we were last informed about that FD. This is not a real _edge_ implementation for epoll, but it's been working this way for years and plenty of projects are relying on it to eliminate the overhead of one system call of read(2) per wakeup event. There are several renowned open-source projects relying on this feature for notification function (with eventfd): register eventfd with EPOLLET and avoid calling read(2) on the eventfd when there is wakeup event (eventfd being written). Examples: nginx [1], netty [2], tokio [3], libevent [4], ect. [5] These projects are widely used in today's Internet infrastructures. Thus, changing this behavior of epoll ET will fundamentally break them and cause a significant negative impact. Linux has changed it for pipe before [6], breaking some Android libraries, which had got "reverted" somehow. [7] [8] Nevertheless, the paragraph in the manual pages describing this characteristic of epoll ET seems ambiguous, I think a more explict sentence should be used to clarify it. We're improving the notification mechanism for libuv recently by exploiting this feature with eventfd, which brings us a significant performance boost. [9] Therefore, we (as well as the maintainers of nginx, netty, tokio, etc.) would have a sense of security to build an enhanced notification function based on this feature if there is a guarantee of retaining this implementation of epoll ET for the backward compatibility in the man pages. [1]: https://github.com/nginx/nginx/blob/efc6a217b92985a1ee211b6bb7337cd2f62deb90/src/event/modules/ngx_epoll_module.c#L386-L457 [2]: netty/netty#9192 [3]: https://github.com/tokio-rs/mio/blob/309daae21ecb1d46203a7dbc0cf4c80310240cba/src/sys/unix/waker.rs#L111-L143 [4]: https://github.com/libevent/libevent/blob/525f5d0a14c9c103be750f2ca175328c25505ea4/event.c#L2597-L2614 [5]: libuv/libuv#4400 (comment) [6]: https://lkml.iu.edu/hypermail/linux/kernel/2010.1/04363.html [7]: torvalds/linux@3a34b13 [8]: torvalds/linux@3b84482 [9]: libuv/libuv#4400 (comment) Signed-off-by: Andy Pan <i@andypan.me> Cc: <linux-api@vger.kernel.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Message-ID: <20240801-epoll-et-desc-v5-1-7fcb9260a3b2@andypan.me> Signed-off-by: Alejandro Colomar <alx@kernel.org>

This reverts commit e5cb1d3. Reason: bisecting says it breaks dnstap. Also revert commit 2713454 ("kqueue: use EVFILT_USER for async if available") because otherwise the first commit doesn't revert cleanly, with enough conflicts in src/unix/async.c that I'm not comfortable fixing those up manually. Fixes: libuv#4584

This reverts commit e5cb1d3. Reason: bisecting says it breaks dnstap. Also revert commit 2713454 ("kqueue: use EVFILT_USER for async if available") because otherwise the first commit doesn't revert cleanly, with enough conflicts in src/unix/async.c that I'm not comfortable fixing those up manually. Fixes: #4584

#4585)" This reverts commit 18d48bc.

panjf2000 force-pushed the eventfd-et branch from 07811e8 to 67890e9 Compare July 4, 2024 02:46

panjf2000 force-pushed the eventfd-et branch from 67890e9 to 621c3fd Compare July 12, 2024 09:59

panjf2000 force-pushed the eventfd-et branch from 621c3fd to 7de675b Compare July 22, 2024 22:29

vtjnash approved these changes Jul 25, 2024

View reviewed changes

panjf2000 force-pushed the eventfd-et branch from 7de675b to 9295a88 Compare July 28, 2024 14:29

vtjnash merged commit e5cb1d3 into libuv:v1.x Jul 29, 2024
26 checks passed

panjf2000 deleted the eventfd-et branch July 30, 2024 00:44

Mno-hime mentioned this pull request Oct 16, 2024

libuv 1.49.0 breaks dnstap test in BIND 9 #4584

Closed

vtjnash added a commit that referenced this pull request Oct 18, 2024

Revert "Revert "linux: eliminate a read on eventfd per wakeup (#4400)" (

ced7797

#4585)" This reverts commit 18d48bc.

linux: exploit eventfd in EPOLLET mode to avoid syscall per wakeup #4400

linux: exploit eventfd in EPOLLET mode to avoid syscall per wakeup #4400

Uh oh!

Conversation

panjf2000 commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Footnotes

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

saghul commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

saghul commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

saghul commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

saghul commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

saghul commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vtjnash commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

panjf2000 commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

vtjnash commented May 9, 2024

Uh oh!

panjf2000 commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

panjf2000 commented May 9, 2024

Uh oh!

saghul commented May 9, 2024

Uh oh!

panjf2000 commented Jun 11, 2024

Uh oh!

panjf2000 commented Jun 19, 2024

Uh oh!

linux: exploit `eventfd` in `EPOLLET` mode to avoid syscall per wakeup #4400

linux: exploit `eventfd` in `EPOLLET` mode to avoid syscall per wakeup #4400

panjf2000 commented May 9, 2024 •

edited

Loading

panjf2000 commented May 9, 2024 •

edited

Loading

panjf2000 commented May 9, 2024 •

edited

Loading

saghul commented May 9, 2024 •

edited

Loading

panjf2000 commented May 9, 2024 •

edited

Loading

vtjnash commented May 9, 2024 •

edited

Loading

vtjnash commented May 9, 2024 •

edited

Loading

panjf2000 commented May 9, 2024 •

edited

Loading

panjf2000 commented May 9, 2024 •

edited

Loading

panjf2000 commented Jul 18, 2024 •

edited

Loading

vtjnash commented Jul 22, 2024 •

edited

Loading

panjf2000 commented Aug 12, 2024 •

edited

Loading