fix(fast_io): prevent io_uring SEND deadlock under TCP backpressure (#1872) #3551
Merged
…1872)

Without a readiness gate, an IORING_OP_SEND on a back-pressured TCP socket can sit in the kernel until the send buffer drains. While that SQE is pending, the writer ring's submit_and_wait() does not return, starving the concurrent RECV completion side and producing the apparent deadlock reported in #1872.

Mirror upstream rsync's bidirectional select() strategy (io.c:perform_io) inside submit_send_batch by gating each batch with a PollAdd(POLLOUT) SQE plus a linked Timeout. SEND SQEs are submitted only after the socket reports writable, so submit_and_wait is bounded by socket readiness rather than by the peer draining the connection. Transient EAGAIN/ETIME results re-arm the readiness wait instead of failing.

Adds a Linux-only regression test that prefills a small TCP socket buffer until EAGAIN, then drives a concurrent SEND and RECV, each on its own io_uring ring, under a 20 s wall-clock guard. The test skips gracefully when io_uring is not available.
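A minimal sketch of the CQE handling this commit describes, assuming io_uring's convention that a failed CQE carries a negated errno in `res`, and that the linked timeout's own CQE is told apart by `user_data` and simply reaped. `GateAction`, `on_poll_cqe` and `on_send_cqe` are hypothetical names, not fast_io's API; the errno values are hard-coded Linux numbers so the sketch needs no `libc` crate.

```rust
// Linux errno values (asm-generic), hard-coded for this sketch.
// Real code would use libc::EAGAIN and friends.
const EAGAIN: i32 = 11; // EWOULDBLOCK has the same value on Linux
const ETIME: i32 = 62;
const ECANCELED: i32 = 125;

/// What a (hypothetical) gated send loop does after reaping a CQE.
#[derive(Debug, PartialEq, Eq)]
enum GateAction {
    /// POLLOUT fired: it is now safe to submit the batched SEND SQEs.
    SubmitSends,
    /// Transient condition: arm a fresh PollAdd(POLLOUT) + linked timeout.
    RearmPoll,
    /// A SEND completed and wrote this many bytes.
    Sent(usize),
    /// Anything else is surfaced to the caller as an I/O error (positive errno).
    Fatal(i32),
}

/// Classify the CQE of the PollAdd(POLLOUT) SQE.
/// A non-negative `res` is the returned event mask.
fn on_poll_cqe(res: i32) -> GateAction {
    match res {
        r if r >= 0 => GateAction::SubmitSends,
        // The linked timeout fired and cancelled the poll: the socket is not
        // broken, it just has no room yet, so wait again.
        r if r == -ETIME || r == -ECANCELED => GateAction::RearmPoll,
        r => GateAction::Fatal(-r),
    }
}

/// Classify the CQE of a SEND SQE issued after the gate opened.
fn on_send_cqe(res: i32) -> GateAction {
    match res {
        r if r >= 0 => GateAction::Sent(r as usize),
        // Buffer filled up again between the poll and the send: re-gate.
        r if r == -EAGAIN => GateAction::RearmPoll,
        r => GateAction::Fatal(-r),
    }
}
```

With a linked timeout, whichever SQE finishes first cancels the other, so an expired wait shows up as `-ECANCELED` on the poll (and `-ETIME` on the timeout); both lead to a re-arm rather than an error.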
…1872) Same unreachable-pattern fix the prior commit made in batching.rs, applied to the prefill loop in tests.rs. EWOULDBLOCK == EAGAIN on Linux, so a separate EWOULDBLOCK match arm is unreachable and fails the build under -D warnings.
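A small sketch of the two ways to avoid the duplicate arm, assuming Linux errno numbering. `is_would_block_raw` and `is_would_block` are hypothetical helpers for illustration; the second relies on std mapping both EAGAIN and EWOULDBLOCK to `io::ErrorKind::WouldBlock`.

```rust
use std::io;

/// Linux value of EAGAIN (asm-generic errno); EWOULDBLOCK is defined as the same number.
const EAGAIN: i32 = 11;

/// Raw-errno check with a single arm. Adding a second arm for an EWOULDBLOCK
/// constant would match the same value and trip the `unreachable_patterns` lint.
fn is_would_block_raw(code: i32) -> bool {
    matches!(code, EAGAIN)
}

/// Portable alternative: std already folds EAGAIN/EWOULDBLOCK into WouldBlock.
fn is_would_block(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::WouldBlock
}
```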
The previous payload formula `(i % 250) + 1` produced byte 0xAB at i=170 (170 + 1 = 171 = 0xAB), which collides with PREFILL_MARKER (0xAB). The drain side relies on this byte never appearing in the payload to recognize the boundary between the saturating prefill and the io_uring writer output. Replace the mapping with one that walks 1..=255 while skipping the marker byte, so the debug_assert on line 1040 holds for any index.
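One way to build such a mapping, as a sketch only: 1..=255 minus the marker leaves 254 values, so cycle through 254 slots and shift every slot at or above the marker up by one. `payload_byte` and the drain-side `split_prefill` are hypothetical names, not the helpers in tests.rs.

```rust
/// Marker byte written by the saturating prefill (value taken from the commit text).
const PREFILL_MARKER: u8 = 0xAB;

/// Payload byte for index `i`: cycles through 1..=255, never emitting PREFILL_MARKER.
fn payload_byte(i: usize) -> u8 {
    let v = (i % 254) as u8 + 1; // 1..=254
    // Values at or above the marker move up by one: 171..=254 become 172..=255.
    let b = if v >= PREFILL_MARKER { v + 1 } else { v };
    debug_assert_ne!(b, PREFILL_MARKER);
    b
}

/// Hypothetical drain-side helper: count the leading prefill bytes and return
/// the rest, which must be writer output. This only works because
/// payload_byte() can never produce the marker.
fn split_prefill(buf: &[u8]) -> (usize, &[u8]) {
    let n = buf.iter().take_while(|&&b| b == PREFILL_MARKER).count();
    (n, &buf[n..])
}
```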
Force-pushed from 4e3eed2 to 58e4329.
oferchen added a commit that referenced this pull request on May 5, 2026.
Summary
- `IORING_OP_SEND` on a back-pressured TCP socket could sit in the kernel until the send buffer drained, holding `submit_and_wait` and starving any concurrent RECV completion side. This produced an apparent deadlock during daemon-mode bidirectional transfers.
- Add a `PollAdd(POLLOUT)` readiness gate (with a linked timeout) in front of every `submit_send_batch` flush. SEND SQEs only enter the kernel once the socket reports writable, so `submit_and_wait` is bounded by readiness rather than peer drain. This mirrors upstream rsync's `select()`-based bidirectional I/O loop in `io.c:perform_io`.
- `EAGAIN`/`EWOULDBLOCK` SEND CQEs and `ETIME`/`ECANCELED` poll CQEs re-arm the readiness wait instead of surfacing fatal errors.

Why poll-add gating instead of an interleaved peek loop
Each `IoUringSocketWriter`/`IoUringSocketReader` already owns a private ring, so the SEND ring never carries RECV SQEs. The cleanest containment for the deadlock is therefore inside the writer's own `submit_send_batch`: defer the SEND submission until the kernel signals room. This adds zero coupling between the writer and reader rings and keeps the existing API surface intact, whereas an interleaved peek/timeout loop would require sharing CQE handling between two rings that today have no shared state.

Test plan
- `cargo nextest run -p fast_io --all-features -E 'test(io_uring) or test(socket) or test(send)'` (Linux must exercise `test_socket_send_no_deadlock_under_backpressure_1872`; macOS/Windows skip via `#[cfg(target_os = "linux")]`).

The new regression test:
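- Sets `SO_SNDBUF`/`SO_RCVBUF` to 4 KiB.
- Prefills the socket buffer in non-blocking mode until `EAGAIN`, then restores blocking mode.
- Drives a concurrent SEND and RECV, each on its own io_uring ring, under a 20 s wall-clock guard (`IoUringPolicy::Enabled`).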
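The prefill and wall-clock guard steps can be sketched with std alone, as an approximation of the test's structure rather than its code. `prefill_until_would_block` and `with_deadline` are hypothetical helpers; shrinking `SO_SNDBUF`/`SO_RCVBUF` is omitted because std has no setter for it (it would need a `setsockopt` call, e.g. via `libc`), and a plain blocking read stands in for the io_uring RECV side.

```rust
use std::io::{self, Write};
use std::net::TcpStream;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Write `marker` bytes in non-blocking mode until the kernel reports the
/// socket buffer full (EAGAIN, surfaced as WouldBlock), then restore
/// blocking mode. Returns the number of bytes queued.
fn prefill_until_would_block(stream: &mut TcpStream, marker: u8) -> io::Result<usize> {
    stream.set_nonblocking(true)?;
    let chunk = [marker; 4096];
    let mut queued = 0;
    loop {
        match stream.write(&chunk) {
            Ok(0) => return Err(io::ErrorKind::WriteZero.into()),
            Ok(n) => queued += n,
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => break,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    stream.set_nonblocking(false)?;
    Ok(queued)
}

/// Run `f` on a helper thread and give up after `limit`. On timeout the
/// helper is left running; a test would simply fail at that point.
fn with_deadline<T, F>(limit: Duration, f: F) -> Result<T, &'static str>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore that.
        let _ = tx.send(f());
    });
    match rx.recv_timeout(limit) {
        Ok(v) => Ok(v),
        Err(mpsc::RecvTimeoutError::Timeout) => Err("deadline exceeded: possible deadlock"),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err("worker thread panicked"),
    }
}
```

A deadlock then shows up as a bounded test failure instead of a hung CI job.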