Dedicated event for EPOLLHUP, EPOLLERR #1038

bschlinker · 2020-06-18T18:08:38Z

I'd like to add another event type that enables notification of EPOLLHUP and EPOLLERR without requiring a subscription to EV_READ or EV_WRITE.

I've done a rough cut of this and would like to upstream.

Why this is useful:

EPOLLHUP and EPOLLERR are currently delivered as (EV_READ | EV_WRITE)
In some cases, the application may not be subscribed for EV_READ or EV_WRITE, but still want to be notified of EPOLLHUP / EPOLLERR:
- For instance, a server in the process of closing a connection may unsubscribe from EV_READ messages to put back pressure on the transport and may not be subscribed to EV_WRITE as there are no writes pending.
- However, the server may want to be notified of connection failures, or of events such as socket timestamps being delivered via the socket error queue.

Questions:

Two new events or one: At the moment I've added a single event, EV_HUPERR, that captures EPOLLHUP and EPOLLERR. I could split into two separate events, but I think most use cases will want both. Looking for feedback on whether flexibility should be prioritized
Name of the event: Can be EV_HUPERR, or EV_ERRHUP, or anything else. There was some relevant discussion in libevent hides POLLERR #495

Seems related to #345; using a new issue since that one is 4 years old.

The text was updated successfully, but these errors were encountered:

azat · 2020-06-18T22:22:46Z

Hi, good question, but let's look at the existing bits.

(all above about epoll)

For instance, a server in the process of closing a connection may unsubscribe from EV_READ messages to put back pressure on the transport and may not be subscribed to EV_WRITE as there are no writes pending.

EV_CLOSE (EPOLLRDHUP) not the same?

However, the server may want to be notified of connection failures, or of events such as socket timestamps being delivered via the socket error queue.

And EPOLLERR is always monitored (you don't need to specify it, so it will be monitored explicitly with EPOLLRDHUP)

bschlinker · 2020-10-04T21:00:12Z

Thanks for the reply -- sorry for the late follow up from my side.

EV_CLOSE (EPOLLRDHUP) not the same?

My understanding is that a socket can close without a EPOLLRDHUP event occurring. There's some useful discussion in on this in a StackOverflow answer ([1]), which matches my review of the kernel.
Messages in the socket error queue result in a EPOLLERR message ([2]).
Socket errors (such as a RST being received), also result in EPOLLERR ([3], [4], [5]).

Based on this, I don't think EPOLLRDHUP is the right signal.

And EPOLLERR is always monitored (you don't need to specify it, so it will be monitored explicitly with EPOLLRDHUP)

I understand that EPOLLERR and EPOLLHUP are always added to the list of monitored events in eventpoll.c ([6]) and understand the idea of subscribing to EV_CLOSED events to get these signals.

However, I don't think this approach will work given the current implementation:

If only EV_CLOSED is subscribed to, epoll_dispatch will set EV_READ and EV_WRITE when EPOLLERR and/or EPOLLHUP signals are raised ([7]).
However, no event will get to the application because EV_READ and EV_WRITE are not subscribed to, and thus the check in evmap_io_active_ on ev->ev_events will fail ([8], [9]).

Thus, it appears if only the EV_CLOSED event is subscribed to the application does not get notified of events in the socket error queue, or socket errors (such as resets).

Based on the above, there doesn't seem to be a way to listen for EPOLLERR or EPOLLHUP messages unless read and write events are also listened to -- and the application may not want to this if it intentionally wants to put back pressure on the transport and has nothing to write.

This can be resolved by either setting the EV_CLOSED event when there is a EPOLLERR or EPOLLHUP pending (as is done for EV_READ and EV_WRITE ([10])), or by adding a new event. Let me know if I've misunderstood something during my trace.

bschlinker · 2020-10-11T15:45:16Z

@azat Assuming I haven't misunderstood something during my trace, I think there's two options:

Modify the EV_CLOSED event to also be set when there is a EPOLLERR or EPOLLHUP event.
Create one or two new events that are explicitly for EPOLLERR and EPOLLHUP.

Would be great to get your feedback as I'm adding socket timestamp support to a socket library that is also open source, and want to ensure that the change is upstream compatible.

One concern about option (1) is possible side effects on existing use cases. In addition, dependencies, such as the open source socket library I'm modifying, will need to be able to determine whether the libevent version includes the modification or they won't be able to operate safely.

azat · 2020-10-11T21:41:37Z

@bschlinker hi!
Sorry for the delay and many thanks for such verbose descriptions!

I've done a rough cut of this and would like to upstream.

Yes, please! (actually I missed this paragraph, sorry)

Does your changes modifies all backends or only epoll?

Based on this, I don't think EPOLLRDHUP is the right signal.

Indeed reasonable

Thus, it appears if only the EV_CLOSED event is subscribed to the application does not get notified of events in the socket error queue, or socket errors (such as resets).

Yes this is tricky (actually comment for it says that it requires EV_FEATURE_EARLY_CLOSE, but who should care about comments, everything should just make sense)

This can be resolved by either setting the EV_CLOSED event when there is a EPOLLERR or EPOLLHUP pending (as is done for EV_READ and EV_WRITE ([10])), or by adding a new event. Let me know if I've misunderstood something during my trace.

Totally agree, let's go with a new event

One concern about option (1) is possible side effects on existing use cases. In addition, dependencies, such as the open source socket library I'm modifying, will need to be able to determine whether the libevent version includes the modification or they won't be able to operate safely.

Yes, any change (especially in the core part) should be backward compatible, so modifying EV_CLOSED is not an option

Create one or two new events that are explicitly for EPOLLERR and EPOLLHUP.

AFAICS it maybe useful, not for subscribing but for distinguishing in the event callback EPOLERR from EPOLLHUP (although he need to always subscribe to both to check hang up on error), for example polling of sysfs will return POLLERR|POLLPRI 1

Also we need to check how this works in other backends, and if the behavior cannot be common for all we can add some feature like EV_FEATURE_EARLY_CLOSE to give the user information about this.

So I would try add two separate events and see how it will looks...

bschlinker · 2020-10-11T22:01:31Z

Thanks for the detailed response! Sounds like we're on the same page.

Does your changes modifies all backends or only epoll?

Unfortunately I have only done this for the epoll backend -- have never looked through the code for the others. I also did my initial patch on an older version of libevent that has a fork to add event EV_PRI for EPOLLPRI [1]; let me look into whether I need to upstream that as well.

I will start creating EV_ERR and EV_HUP; let me know if you have other preferred names.

[1]: Relevant: https://github.com/facebook/folly/blob/master/folly/io/async/EventHandler.h#L47

azat · 2020-10-11T22:05:37Z

I will start creating EV_ERR and EV_HUP; let me know if you have other preferred names.

EV_ERR defined in win32 AFAICS, EV_ERROR is not, let use it then

azat · 2020-11-01T18:45:46Z

@bschlinker by the way do you have any ETA? I'm planning to make a 2.2 alpha release (#1094) and want to include this change too

Summary: Adding support for write and socket timestamps by introducing `ByteEvent` that can be delivered to observers. `AsyncTransport::WriteFlags` has long had timestamping related flags, such as `TIMESTAMP_TX`, but the code required to act on these flags only existed in proxygen. This diff generalizes the approach so that it works for other use cases of `AsyncSocket`. This diff is long, but much of it is unit tests designed to prevent regressions given the trickiness of socket timestamps and `ByteEvent`. **Each `ByteEvent` contains:** - Type (WRITE, SCHED, TX, ACK) - Byte stream offset that the timestamp is for (relative to the raw byte stream, which means after SSL in the case of AsyncSSLSocket) - `steady_clock` timestamp recorded by AsyncSocket when generating the `ByteEvent` - For SCHED, TX, and ACK events, if available, hardware and software (kernel) timestamps **How `ByteEvent` are used:** - Support is enabled when an observer is attached with the `byteEvents` config flag set. If the socket does not support timestamps, the observer is notified through the `byteEventsUnavailable` callback. Otherwise, `byteEventsEnabled` is called - When the application writes to a socket with `ByteEvent` support enabled and a relevant `WriteFlag`, SCHED/TX/ACK `ByteEvent` are requested from the kernel, and WRITE `ByteEvent` are generated by the socket for the *last byte* in the write. - If the entire write buffer cannot be written at once, then additional `ByteEvent` will also be generated for the last byte in each write. - This means that if the application wants to timestamp a specific byte, it must break up the write buffer before handing it to `AsyncSocket` such that the byte to timestamp is the last byte in the write buffer. - When socket timestamps arrive from the kernel via the socket error queue, they are transformed into `ByteEvent` and passed to observers **Caveats:** 1. Socket timestamps received from the kernel contain the byte's offset in the write stream. This counter is a `uint32_t`, and thus rolls over every ~4GB. When transforming raw timestamp into `ByteEvent`, we correct for this and transform the raw offset into an offset relative to the raw byte offset stored by `AsyncSocket` (returned via `getRawBytesWritten()`). 2. At the moment, a read callback must be installed to receive socket timestamps due to epoll's behavior. I will correct this with a patch to epoll, see libevent/libevent#1038 (comment) for details 3. If a msghdr's ancillary data contains a timestamping flag (such as `SOF_TIMESTAMPING_TX_SOFTWARE`), the kernel will enqueue a socket error message containing the byte offset of the write ( `SO_EE_ORIGIN_TIMESTAMPING`) even if timestamping has not been enabled by an associated call to `setsockopt`. This creates a problem: 1. If an application was to use a timestamp `WriteFlags` such as `TIMESTAMP_TX` without enabling timestamping, and if `AsyncSocket` transformed such `WriteFlags` to ancillary data by default, it could create a situation where epoll continues to return `EV_READ` (due to items in the socket error queue), but `AsyncSocket` would not fetch anything from the socket error queue. 2. To prevent this scenario, `WriteFlags` related to timestamping are not translated into msghdr ancillary data unless timestamping is enabled. This required adding a boolean to `getAncillaryData` and `getAncillaryDataSize`. Differential Revision: D24094832 fbshipit-source-id: e3bec730ddd1fc1696023d8c982ae02ab9b5fb7d

nspring · 2022-03-14T18:54:45Z

@azat I started working on this issue and think I have a reasonable cut at it.

This raised two questions:

EV_ERROR is defined on mac, but it only warns -- do you have a suggestion for another name? or let it warn?:

  CC       kqueue.lo
In file included from kqueue.c:65:
In file included from ./event-internal.h:41:
In file included from ./minheap-internal.h:33:
./include/event2/event.h:986:9: warning: 'EV_ERROR' macro redefined [-Wmacro-redefined]
#define EV_ERROR        0x200
        ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/event.h:170:9: note: previous definition is here
#define EV_ERROR            0x4000      /* error, data contains errno */
        ^

epolltable-internal.h is 2.8MB with 5 events, up from 40K. This feels extra large to me, and likely easily compressed by omitting comments or using the preprocessor; do you have a suggestion? Leave it alone? Cut until it reaches a target file size? Just drop the comments and see what happens?

Any other feedback is welcome of course.

azat · 2022-07-09T19:45:08Z

Sorry for such a huge delay.

EV_ERROR is defined on mac, but it only warns -- do you have a suggestion for another name? or let it warn?:

EV_EVENT maybe ?

epolltable-internal.h is 2.8MB with 5 events, up from 40K. This feels extra large to me, and likely easily compressed by omitting comments or using the preprocessor; do you have a suggestion? Leave it alone? Cut until it reaches a target file size? Just drop the comments and see what happens?

Where do you see 2.8MB?
I think 40K for this generated file is OK, comments could be useful.

P.S. it is generated via make_epoll_table.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dedicated event for EPOLLHUP, EPOLLERR #1038

Dedicated event for EPOLLHUP, EPOLLERR #1038

bschlinker commented Jun 18, 2020

azat commented Jun 18, 2020

bschlinker commented Oct 4, 2020 •

edited

Loading

bschlinker commented Oct 11, 2020

azat commented Oct 11, 2020

bschlinker commented Oct 11, 2020

azat commented Oct 11, 2020

azat commented Nov 1, 2020 •

edited

Loading

nspring commented Mar 14, 2022

azat commented Jul 9, 2022

Dedicated event for EPOLLHUP, EPOLLERR #1038

Dedicated event for EPOLLHUP, EPOLLERR #1038

Comments

bschlinker commented Jun 18, 2020

azat commented Jun 18, 2020

bschlinker commented Oct 4, 2020 • edited Loading

bschlinker commented Oct 11, 2020

azat commented Oct 11, 2020

bschlinker commented Oct 11, 2020

azat commented Oct 11, 2020

azat commented Nov 1, 2020 • edited Loading

nspring commented Mar 14, 2022

azat commented Jul 9, 2022

bschlinker commented Oct 4, 2020 •

edited

Loading

azat commented Nov 1, 2020 •

edited

Loading