Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedicated event for EPOLLHUP, EPOLLERR #1038

Open
Tracked by #1094
bschlinker opened this issue Jun 18, 2020 · 9 comments
Open
Tracked by #1094

Dedicated event for EPOLLHUP, EPOLLERR #1038

bschlinker opened this issue Jun 18, 2020 · 9 comments

Comments

@bschlinker
Copy link

I'd like to add another event type that enables notification of EPOLLHUP and EPOLLERR without requiring a subscription to EV_READ or EV_WRITE.

I've done a rough cut of this and would like to upstream.

Why this is useful:

  • EPOLLHUP and EPOLLERR are currently delivered as (EV_READ | EV_WRITE)
  • In some cases, the application may not be subscribed for EV_READ or EV_WRITE, but still want to be notified of EPOLLHUP / EPOLLERR:
    • For instance, a server in the process of closing a connection may unsubscribe from EV_READ messages to put back pressure on the transport and may not be subscribed to EV_WRITE as there are no writes pending.
    • However, the server may want to be notified of connection failures, or of events such as socket timestamps being delivered via the socket error queue.

Questions:

  • Two new events or one: At the moment I've added a single event, EV_HUPERR, that captures EPOLLHUP and EPOLLERR. I could split into two separate events, but I think most use cases will want both. Looking for feedback on whether flexibility should be prioritized
  • Name of the event: Can be EV_HUPERR, or EV_ERRHUP, or anything else. There was some relevant discussion in libevent hides POLLERR #495

Seems related to #345; using a new issue since that one is 4 years old.

@azat
Copy link
Member

azat commented Jun 18, 2020

Hi, good question, but let's look at the existing bits.

(all above about epoll)

For instance, a server in the process of closing a connection may unsubscribe from EV_READ messages to put back pressure on the transport and may not be subscribed to EV_WRITE as there are no writes pending.

EV_CLOSE (EPOLLRDHUP) not the same?

However, the server may want to be notified of connection failures, or of events such as socket timestamps being delivered via the socket error queue.

And EPOLLERR is always monitored (you don't need to specify it, so it will be monitored explicitly with EPOLLRDHUP)

@bschlinker
Copy link
Author

bschlinker commented Oct 4, 2020

Thanks for the reply -- sorry for the late follow up from my side.

EV_CLOSE (EPOLLRDHUP) not the same?

  • My understanding is that a socket can close without a EPOLLRDHUP event occurring. There's some useful discussion in on this in a StackOverflow answer ([1]), which matches my review of the kernel.
  • Messages in the socket error queue result in a EPOLLERR message ([2]).
  • Socket errors (such as a RST being received), also result in EPOLLERR ([3], [4], [5]).

Based on this, I don't think EPOLLRDHUP is the right signal.

And EPOLLERR is always monitored (you don't need to specify it, so it will be monitored explicitly with EPOLLRDHUP)

I understand that EPOLLERR and EPOLLHUP are always added to the list of monitored events in eventpoll.c ([6]) and understand the idea of subscribing to EV_CLOSED events to get these signals.

However, I don't think this approach will work given the current implementation:

  • If only EV_CLOSED is subscribed to, epoll_dispatch will set EV_READ and EV_WRITE when EPOLLERR and/or EPOLLHUP signals are raised ([7]).
  • However, no event will get to the application because EV_READ and EV_WRITE are not subscribed to, and thus the check in evmap_io_active_ on ev->ev_events will fail ([8], [9]).

Thus, it appears if only the EV_CLOSED event is subscribed to the application does not get notified of events in the socket error queue, or socket errors (such as resets).


Based on the above, there doesn't seem to be a way to listen for EPOLLERR or EPOLLHUP messages unless read and write events are also listened to -- and the application may not want to this if it intentionally wants to put back pressure on the transport and has nothing to write.

This can be resolved by either setting the EV_CLOSED event when there is a EPOLLERR or EPOLLHUP pending (as is done for EV_READ and EV_WRITE ([10])), or by adding a new event. Let me know if I've misunderstood something during my trace.

@bschlinker
Copy link
Author

@azat Assuming I haven't misunderstood something during my trace, I think there's two options:

  1. Modify the EV_CLOSED event to also be set when there is a EPOLLERR or EPOLLHUP event.
  2. Create one or two new events that are explicitly for EPOLLERR and EPOLLHUP.

Would be great to get your feedback as I'm adding socket timestamp support to a socket library that is also open source, and want to ensure that the change is upstream compatible.

One concern about option (1) is possible side effects on existing use cases. In addition, dependencies, such as the open source socket library I'm modifying, will need to be able to determine whether the libevent version includes the modification or they won't be able to operate safely.

@azat
Copy link
Member

azat commented Oct 11, 2020

@bschlinker hi!
Sorry for the delay and many thanks for such verbose descriptions!

I've done a rough cut of this and would like to upstream.

Yes, please! (actually I missed this paragraph, sorry)

Does your changes modifies all backends or only epoll?

Based on this, I don't think EPOLLRDHUP is the right signal.

Indeed reasonable

Thus, it appears if only the EV_CLOSED event is subscribed to the application does not get notified of events in the socket error queue, or socket errors (such as resets).

Yes this is tricky (actually comment for it says that it requires EV_FEATURE_EARLY_CLOSE, but who should care about comments, everything should just make sense)

This can be resolved by either setting the EV_CLOSED event when there is a EPOLLERR or EPOLLHUP pending (as is done for EV_READ and EV_WRITE ([10])), or by adding a new event. Let me know if I've misunderstood something during my trace.

Totally agree, let's go with a new event

One concern about option (1) is possible side effects on existing use cases. In addition, dependencies, such as the open source socket library I'm modifying, will need to be able to determine whether the libevent version includes the modification or they won't be able to operate safely.

Yes, any change (especially in the core part) should be backward compatible, so modifying EV_CLOSED is not an option

Create one or two new events that are explicitly for EPOLLERR and EPOLLHUP.

AFAICS it maybe useful, not for subscribing but for distinguishing in the event callback EPOLERR from EPOLLHUP (although he need to always subscribe to both to check hang up on error), for example polling of sysfs will return POLLERR|POLLPRI 1

Also we need to check how this works in other backends, and if the behavior cannot be common for all we can add some feature like EV_FEATURE_EARLY_CLOSE to give the user information about this.

So I would try add two separate events and see how it will looks...

@bschlinker
Copy link
Author

Thanks for the detailed response! Sounds like we're on the same page.

Does your changes modifies all backends or only epoll?

Unfortunately I have only done this for the epoll backend -- have never looked through the code for the others. I also did my initial patch on an older version of libevent that has a fork to add event EV_PRI for EPOLLPRI [1]; let me look into whether I need to upstream that as well.

I will start creating EV_ERR and EV_HUP; let me know if you have other preferred names.

[1]: Relevant: https://github.com/facebook/folly/blob/master/folly/io/async/EventHandler.h#L47

@azat
Copy link
Member

azat commented Oct 11, 2020

I will start creating EV_ERR and EV_HUP; let me know if you have other preferred names.

EV_ERR defined in win32 AFAICS, EV_ERROR is not, let use it then

@azat
Copy link
Member

azat commented Nov 1, 2020

@bschlinker by the way do you have any ETA? I'm planning to make a 2.2 alpha release (#1094) and want to include this change too

facebook-github-bot pushed a commit to facebook/folly that referenced this issue Mar 1, 2021
Summary:
Adding support for write and socket timestamps by introducing `ByteEvent` that can be delivered to observers.
`AsyncTransport::WriteFlags` has long had timestamping related flags, such as `TIMESTAMP_TX`, but the code required to act on these flags only existed in proxygen. This diff generalizes the approach so that it works for other use cases of `AsyncSocket`.

This diff is long, but much of it is unit tests designed to prevent regressions given the trickiness of socket timestamps and `ByteEvent`.

**Each `ByteEvent` contains:**
- Type (WRITE, SCHED, TX, ACK)
- Byte stream offset that the timestamp is for (relative to the raw byte stream, which means after SSL in the case of AsyncSSLSocket)
- `steady_clock` timestamp recorded by AsyncSocket when generating the `ByteEvent`
- For SCHED, TX, and ACK events, if available, hardware and software (kernel) timestamps

**How `ByteEvent` are used:**
- Support is enabled when an observer is attached with the `byteEvents` config flag set. If the socket does not support timestamps, the observer is notified through the `byteEventsUnavailable` callback. Otherwise, `byteEventsEnabled` is called
- When the application writes to a socket with `ByteEvent` support enabled and a relevant `WriteFlag`, SCHED/TX/ACK `ByteEvent` are requested from the kernel, and WRITE `ByteEvent` are generated by the socket for the *last byte* in the write.
  - If the entire write buffer cannot be written at once, then additional `ByteEvent` will also be generated for the last byte in each write.
  - This means that if the application wants to timestamp a specific byte, it must break up the write buffer before handing it to `AsyncSocket` such that the byte to timestamp is the last byte in the write buffer.
- When socket timestamps arrive from the kernel via the socket error queue, they are transformed into `ByteEvent` and passed to observers

**Caveats:**
1. Socket timestamps received from the kernel contain the byte's offset in the write stream. This counter is a `uint32_t`, and thus rolls over every ~4GB. When transforming raw timestamp into `ByteEvent`, we correct for this and transform the raw offset into an offset relative to the raw byte offset stored by `AsyncSocket` (returned via `getRawBytesWritten()`).
2. At the moment, a read callback must be installed to receive socket timestamps due to epoll's behavior. I will correct this with a patch to epoll, see libevent/libevent#1038 (comment) for details
3. If a msghdr's ancillary data contains a timestamping flag (such as `SOF_TIMESTAMPING_TX_SOFTWARE`), the kernel will enqueue a socket error message containing the byte offset of the write ( `SO_EE_ORIGIN_TIMESTAMPING`) even if timestamping has not been enabled by an associated call to `setsockopt`. This creates a problem:
    1. If an application was to use a timestamp `WriteFlags` such as `TIMESTAMP_TX` without enabling timestamping, and if `AsyncSocket` transformed such `WriteFlags` to ancillary data by default, it could create a situation where epoll continues to return `EV_READ` (due to items in the socket error queue), but `AsyncSocket` would not fetch anything from the socket error queue.
    2. To prevent this scenario, `WriteFlags` related to timestamping are not translated into msghdr ancillary data unless timestamping is enabled. This required adding a boolean to `getAncillaryData` and `getAncillaryDataSize`.

Differential Revision: D24094832

fbshipit-source-id: e3bec730ddd1fc1696023d8c982ae02ab9b5fb7d
dotconnor pushed a commit to 5448C8/folly that referenced this issue Mar 19, 2021
Summary:
Adding support for write and socket timestamps by introducing `ByteEvent` that can be delivered to observers.
`AsyncTransport::WriteFlags` has long had timestamping related flags, such as `TIMESTAMP_TX`, but the code required to act on these flags only existed in proxygen. This diff generalizes the approach so that it works for other use cases of `AsyncSocket`.

This diff is long, but much of it is unit tests designed to prevent regressions given the trickiness of socket timestamps and `ByteEvent`.

**Each `ByteEvent` contains:**
- Type (WRITE, SCHED, TX, ACK)
- Byte stream offset that the timestamp is for (relative to the raw byte stream, which means after SSL in the case of AsyncSSLSocket)
- `steady_clock` timestamp recorded by AsyncSocket when generating the `ByteEvent`
- For SCHED, TX, and ACK events, if available, hardware and software (kernel) timestamps

**How `ByteEvent` are used:**
- Support is enabled when an observer is attached with the `byteEvents` config flag set. If the socket does not support timestamps, the observer is notified through the `byteEventsUnavailable` callback. Otherwise, `byteEventsEnabled` is called
- When the application writes to a socket with `ByteEvent` support enabled and a relevant `WriteFlag`, SCHED/TX/ACK `ByteEvent` are requested from the kernel, and WRITE `ByteEvent` are generated by the socket for the *last byte* in the write.
  - If the entire write buffer cannot be written at once, then additional `ByteEvent` will also be generated for the last byte in each write.
  - This means that if the application wants to timestamp a specific byte, it must break up the write buffer before handing it to `AsyncSocket` such that the byte to timestamp is the last byte in the write buffer.
- When socket timestamps arrive from the kernel via the socket error queue, they are transformed into `ByteEvent` and passed to observers

**Caveats:**
1. Socket timestamps received from the kernel contain the byte's offset in the write stream. This counter is a `uint32_t`, and thus rolls over every ~4GB. When transforming raw timestamp into `ByteEvent`, we correct for this and transform the raw offset into an offset relative to the raw byte offset stored by `AsyncSocket` (returned via `getRawBytesWritten()`).
2. At the moment, a read callback must be installed to receive socket timestamps due to epoll's behavior. I will correct this with a patch to epoll, see libevent/libevent#1038 (comment) for details
3. If a msghdr's ancillary data contains a timestamping flag (such as `SOF_TIMESTAMPING_TX_SOFTWARE`), the kernel will enqueue a socket error message containing the byte offset of the write ( `SO_EE_ORIGIN_TIMESTAMPING`) even if timestamping has not been enabled by an associated call to `setsockopt`. This creates a problem:
    1. If an application was to use a timestamp `WriteFlags` such as `TIMESTAMP_TX` without enabling timestamping, and if `AsyncSocket` transformed such `WriteFlags` to ancillary data by default, it could create a situation where epoll continues to return `EV_READ` (due to items in the socket error queue), but `AsyncSocket` would not fetch anything from the socket error queue.
    2. To prevent this scenario, `WriteFlags` related to timestamping are not translated into msghdr ancillary data unless timestamping is enabled. This required adding a boolean to `getAncillaryData` and `getAncillaryDataSize`.

Differential Revision: D24094832

fbshipit-source-id: e3bec730ddd1fc1696023d8c982ae02ab9b5fb7d
nspring added a commit to nspring/libevent that referenced this issue Mar 14, 2022
nspring added a commit to nspring/libevent that referenced this issue Mar 14, 2022
@nspring
Copy link

nspring commented Mar 14, 2022

@azat I started working on this issue and think I have a reasonable cut at it.

This raised two questions:

  1. EV_ERROR is defined on mac, but it only warns -- do you have a suggestion for another name? or let it warn?:
  CC       kqueue.lo
In file included from kqueue.c:65:
In file included from ./event-internal.h:41:
In file included from ./minheap-internal.h:33:
./include/event2/event.h:986:9: warning: 'EV_ERROR' macro redefined [-Wmacro-redefined]
#define EV_ERROR        0x200
        ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/event.h:170:9: note: previous definition is here
#define EV_ERROR            0x4000      /* error, data contains errno */
        ^
  1. epolltable-internal.h is 2.8MB with 5 events, up from 40K. This feels extra large to me, and likely easily compressed by omitting comments or using the preprocessor; do you have a suggestion? Leave it alone? Cut until it reaches a target file size? Just drop the comments and see what happens?

Any other feedback is welcome of course.

@azat
Copy link
Member

azat commented Jul 9, 2022

Sorry for such a huge delay.

EV_ERROR is defined on mac, but it only warns -- do you have a suggestion for another name? or let it warn?:

EV_EVENT maybe ?

epolltable-internal.h is 2.8MB with 5 events, up from 40K. This feels extra large to me, and likely easily compressed by omitting comments or using the preprocessor; do you have a suggestion? Leave it alone? Cut until it reaches a target file size? Just drop the comments and see what happens?

Where do you see 2.8MB?
I think 40K for this generated file is OK, comments could be useful.

P.S. it is generated via make_epoll_table.py

wengxt added a commit to fcitx/fcitx5 that referenced this issue Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants