async: always defer task wakes via ngx_post_event#295

Open
CVanF5 wants to merge 2 commits into nginx:main from CVanF5:fix/schedule-defer-wakes

@CVanF5

CVanF5 commented May 29, 2026

Fixes #294.

schedule() ran runnable.run() synchronously when a task was woken from
outside its own poll. But Waker::wake() may be called from any context,
including a Drop that is holding a lock the woken task also needs — e.g. h2's
Streams::drop — and re-polling inline then re-enters that task on the caller's
stack and deadlocks on the held lock. Since the executor can't tell whether a
wake's caller is holding such a lock, it shouldn't re-poll inside wake() at
all. This always defers the wake via ngx_post_event (one event-loop tick).

  • fix: drop the synchronous runnable.run(); always SCHEDULER.schedule().
  • test: freestanding reproducer (only async_task) — synchronous re-poll
    reproduces the held-lock deadlock signature, deferred re-poll avoids it; no
    NGINX event loop needed.

Verified on Linux + macOS aarch64: cargo test / clippy --all-targets -Dwarnings
/ fmt --check all clean.

CVanF5 and others added 2 commits May 29, 2026 10:34
`schedule()` ran `runnable.run()` synchronously when a task was woken
from outside its own poll (`woken_while_running == false`). That violates
the `Waker::wake()` contract (wakes must be non-blocking and
non-re-entrant): when a wake fires from a `Drop` that holds a lock the
woken task also needs — e.g. h2's `Streams::drop` waking its `Connection`
task while holding `Arc<Mutex<Inner>>` — the synchronous re-poll re-enters
and deadlocks on that lock.

Always defer the wake via `ngx_post_event` instead; the runnable is
re-polled on the next event-loop tick by `ngx_event_process_posted`. On
the single-threaded event loop that is one worker-local list insert — one
tick of latency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a freestanding test in `async_::spawn` (no deps beyond `async_task`)
reproducing the deadlock fixed by the previous commit: a `Drop` impl
wakes a parked task while holding a lock. With synchronous re-poll the
re-poll finds the lock still held — the deadlock signature, surfaced via
`Mutex::try_lock` returning `WouldBlock` so the test cannot hang; with
deferred wakes the re-poll acquires the lock cleanly. The test supplies
its own `schedule` functions, so no NGINX event loop is required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
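
The reproducer idea described in the commit above can be sketched with the standard library alone. This is an illustrative sketch, not the test from the PR: `WakesOnDrop`, `repoll`, and both scheduling functions are hypothetical names, and plain closures stand in for the `async_task` schedule functions. It only shows why `Mutex::try_lock` reports `WouldBlock` when the re-poll runs on the dropper's stack, and why it succeeds once the re-poll is queued.

```rust
use std::cell::{Cell, RefCell};
use std::collections::VecDeque;
use std::sync::{Arc, Mutex, TryLockError};

/// Stand-in for re-polling the woken task, which needs `shared` to progress.
/// Returns false when the lock is still held (the deadlock signature).
fn repoll(shared: &Mutex<u32>) -> bool {
    match shared.try_lock() {
        Ok(mut guard) => {
            *guard += 1;
            true
        }
        Err(TryLockError::WouldBlock) => false,
        Err(TryLockError::Poisoned(_)) => panic!("lock poisoned"),
    }
}

/// Hypothetical library type whose `Drop` wakes a task while holding its own lock.
struct WakesOnDrop<'a> {
    shared: Arc<Mutex<u32>>,
    wake: &'a dyn Fn(),
}

impl Drop for WakesOnDrop<'_> {
    fn drop(&mut self) {
        let _guard = self.shared.lock().unwrap();
        (self.wake)(); // called with the lock still held
    }
}

/// Inline scheduling: the wake re-polls immediately, on the dropper's stack.
fn inline_repoll_progresses() -> bool {
    let shared = Arc::new(Mutex::new(0));
    let task_lock = Arc::clone(&shared);
    let progressed = Cell::new(false);
    let wake = || progressed.set(repoll(&task_lock));
    drop(WakesOnDrop { shared, wake: &wake });
    progressed.get()
}

/// Deferred scheduling: the wake only enqueues; the re-poll runs on the "next tick".
fn deferred_repoll_progresses() -> bool {
    let shared = Arc::new(Mutex::new(0));
    let task_lock = Arc::clone(&shared);
    let queue = RefCell::new(VecDeque::new());
    let wake = || queue.borrow_mut().push_back(Arc::clone(&task_lock));
    drop(WakesOnDrop { shared, wake: &wake });
    // The drop has returned and released the lock; now drain the queue.
    let pending: Vec<_> = queue.borrow_mut().drain(..).collect();
    !pending.is_empty() && pending.iter().all(|lock| repoll(lock))
}
```

Calling `inline_repoll_progresses()` yields `false` because the lock is still held during the wake, while `deferred_repoll_progresses()` yields `true`. Using `try_lock` rather than `lock` keeps the sketch from hanging in the inline case.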
@avahahn

avahahn commented May 29, 2026

I was asked to review this to identify if it could cause any headaches within NGINX WASM.
Without commenting on anything else, I can say that all tests pass in NGINX WASM with this fix and that I don't think this design will cause any issues for the WASM module.

I appreciate that this change will also make the interleaving of WASM guest execution (or any rust async code) with other components in the NGINX worker process easier to reason about.

@pschyska

What are the chances 😲?
I had the same issue, but in ngx-tickle, and with hyper, not h2. We used to inherit this inline-run design as well, and came to the same conclusion (that we must always enqueue). I wanted to report it here because I suspected it would apply, and was trying to make a reproducer, but now that you have reported it independently I guess it's clear.

I can't do an official review here but: lgtm, and thanks!

@bavshin-f5
Member

> violating the Waker contract (wakes must be non-blocking and non-re-entrant)

Do you have a source for this? Neither the current documentation nor the design doc at https://github.com/rust-lang/rfcs/blob/master/text/2592-futures.md includes such a contract.
The only relevant resource I found is this old discussion, where one of the tokio developers confirms that such behavior is possible and should be expected. This thread also implies that the behavior in h2 is not expected.


There's a very good reason why the scheduler was implemented like this: all IO events and callbacks from the nginx event loop must be handled synchronously. Otherwise, we'll start observing some rare but nonetheless amusing consequences: operation results can be freed at the callback exit, file descriptors can be incorrectly registered for subsequent operations on the event loop, etc.
Common definitions of AsyncRead cannot even be efficiently implemented within the fully deferred scheduling model, because it would require eagerly reading the socket data into a temporary buffer before waking the task. And then potentially unregistering the file descriptor from nginx event methods, if the deferred poll() decides that further reads are unnecessary.

It would make sense to extend the info.woken_while_running condition to other tasks, because we only care about synchronous polling of nginx IO event wakeups. I'm not sure what's the best way to implement that other than an atomic lock on the scheduler.

@CVanF5
Author

CVanF5 commented Jun 2, 2026

Thanks @bavshin-f5, you're right about the wording. "Contract" was the wrong word; it was more the general idea that wake() should be non-blocking, that the executor shouldn't run the task inline inside wake(). I'll fix the wording in the PR.

I don't think inline re-poll can be made safe in general. The failure is deterministic: when wake() is called from code that is holding a non-reentrant lock, and re-polling the woken task tries to take that same lock, the re-poll runs synchronously on the caller's stack while the lock is still held. It then can't acquire the lock, and the caller can't release it until the re-poll returns...deadlock.

We've seen it with h2, and now it looks like hyper too, so I don't think it's specific to one library. I'd expect it to surface in any library that wakes a task while holding one of its own private locks that the woken task then tries to acquire, which I think is a legal thing for a library to do: the Waker API permits waking from any context, and the executor has no way to detect that the caller is holding such a lock. Fixing it upstream library-by-library would be whack-a-mole across an open-ended set; the executor is the one place where deferring is robust against all of them at once.

On extending woken_while_running (or guarding the scheduler with a lock): I don't think it can be made safe without collapsing back into always-defer. The signal the guard can test, "are we nested inside a poll?", is not the question we need answered to prevent a deadlock. That question is: "is the waker's caller holding a lock the task will re-acquire?" And that lock is private to the library and invisible to the executor. A lock-holding wake can arrive nested or un-nested. The guard would need to defer every wake it can't prove lock-free, and at the decision point the scheduler has nothing to prove that with, because it can't see the lock. Therefore I think the safest thing to do is to defer everything.

On the synchronous-IO concern: you know the code better than I do, so correct me if I've read it wrong. As I understand it, your concern assumes a design where the nginx event callback itself does the IO (reads the socket, owns the result, touches fd registration), so a task must run before the callback returns. If that's the model, I agree deferral breaks it. However, when I run grep -rn "wake_by_ref\|\.wake()" src/async_/, I find only two production sites that wake a task: Sleep::timer_handler and Resolver::handler. Neither seems to need the task to run before the callback returns. Sleep's handler only wakes. Resolver's handler stages its result into the future's own state (this.complete = …) and consumes the ctx before waking, and your comment there ("wake last … because wake may poll Resolution future on current stack") orders the writes ahead of the wake, so it looks correct to me whether the re-poll is inline or deferred. Have I read it wrong, or am I missing a planned future integration where a task must run before the callback returns?
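
The "stage the result, wake last" ordering discussed above can be sketched as follows. This is a hedged illustration, not the crate's `Resolver`: `Slot`, `Resolution`, `handler`, and `CountingWaker` are hypothetical stand-ins, and the address string is placeholder data. The callback copies the result into state owned by the future before waking, so the value is available at poll time whether the re-poll happens inline or after the callback has returned.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// State shared between the (hypothetical) event callback and the future.
#[derive(Default)]
struct Slot {
    complete: Option<String>,
    waker: Option<Waker>,
}

struct Resolution(Rc<RefCell<Slot>>);

impl Future for Resolution {
    type Output = String;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<String> {
        let mut slot = self.0.borrow_mut();
        match slot.complete.take() {
            Some(value) => Poll::Ready(value),
            None => {
                slot.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Event callback: `callback_owned` is only valid until the callback returns.
fn handler(slot: &Rc<RefCell<Slot>>, callback_owned: &str) {
    let waker = {
        let mut s = slot.borrow_mut();
        s.complete = Some(callback_owned.to_owned()); // owned copy outlives the callback
        let w = s.waker.take();
        w
    }; // the borrow is released here, before the wake
    if let Some(w) = waker {
        w.wake(); // wake last: all writes are already visible to the next poll
    }
}

/// Waker that only counts wakes, standing in for a deferring scheduler.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Returns (number of wakes, resolved value) for a poll that runs after the callback.
fn demo() -> (usize, String) {
    let slot = Rc::new(RefCell::new(Slot::default()));
    let mut fut = Box::pin(Resolution(Rc::clone(&slot)));
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(Arc::clone(&counter));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    {
        let temp = String::from("192.0.2.1"); // freed when the callback scope ends
        handler(&slot, &temp);
    }
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(value) => (counter.0.load(Ordering::SeqCst), value),
        Poll::Pending => panic!("result should have been staged before the wake"),
    }
}
```

Here `demo()` returns one wake and the copied value even though the callback's buffer has already been dropped by the time the second poll runs. Whether every nginx-backed future can afford such a copy is the open question in this thread; the sketch only shows the ordering.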

For what it's worth, enqueuing the woken task instead of re-polling it inline is what the mainstream executors already do. I checked three: Tokio's LocalSet pushes the woken task onto its queue (Shared::schedule → task_push_back/queue.push_back); async-executor's schedule closure is state.queue.push(runnable); state.notify();; and futures' LocalPool waker just sets a flag and unparks the run loop (ThreadNotify::wake_by_ref), which then polls the pool. None poll a future inline from inside wake(). So deferring isn't unusual.
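
The enqueue-only pattern described above can be illustrated with `std::task::Wake`. This is a minimal sketch under stated assumptions, not code from Tokio, async-executor, futures, or this PR: `EnqueueWaker`, `WaitForFlag`, and the `VecDeque` run queue are hypothetical. The waker is invoked while a lock that the task's poll also takes is held, and no poll happens until the run loop drains the queue.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

type Queue = Arc<Mutex<VecDeque<usize>>>;

/// Waker for task `id`: waking only records the id; it never polls.
struct EnqueueWaker {
    id: usize,
    queue: Queue,
}

impl Wake for EnqueueWaker {
    fn wake(self: Arc<Self>) {
        self.queue.lock().unwrap().push_back(self.id);
    }
}

struct Shared {
    ready: bool,
    waker: Option<Waker>,
    polls: usize,
}

/// Future that takes the shared lock on every poll and completes once `ready` is set.
struct WaitForFlag(Arc<Mutex<Shared>>);

impl Future for WaitForFlag {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.0.lock().unwrap();
        s.polls += 1;
        if s.ready {
            Poll::Ready(())
        } else {
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

/// Returns (polls counted right after wake(), polls counted after draining the queue).
fn demo() -> (usize, usize) {
    let queue: Queue = Arc::new(Mutex::new(VecDeque::new()));
    let shared = Arc::new(Mutex::new(Shared { ready: false, waker: None, polls: 0 }));
    let mut task = Box::pin(WaitForFlag(Arc::clone(&shared)));
    let waker = Waker::from(Arc::new(EnqueueWaker { id: 0, queue: Arc::clone(&queue) }));
    let mut cx = Context::from_waker(&waker);

    // The first poll parks the task.
    assert!(task.as_mut().poll(&mut cx).is_pending());

    // Some other code completes the operation and wakes while holding the lock.
    {
        let mut s = shared.lock().unwrap();
        s.ready = true;
        if let Some(w) = s.waker.take() {
            w.wake(); // only enqueues, so the held lock is never re-entered
        }
    }
    let polls_after_wake = shared.lock().unwrap().polls;

    // "Next tick": drain the run queue and poll what was woken.
    loop {
        let next = queue.lock().unwrap().pop_front();
        match next {
            Some(_id) => {
                let _ = task.as_mut().poll(&mut cx);
            }
            None => break,
        }
    }
    let polls_after_tick = shared.lock().unwrap().polls;
    (polls_after_wake, polls_after_tick)
}
```

`demo()` returns `(1, 2)`: the poll count is unchanged right after `wake()` and increases only when the queue is drained. If the waker polled inline instead, the poll would try to take the lock that the waking code still holds.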

I appreciate your time on this, either way.

@pchickey
Contributor

pchickey commented Jun 3, 2026

> > violating the Waker contract (wakes must be non-blocking and non-re-entrant)
>
> Do you have a source for this? Neither the current documentation nor the design doc at https://github.com/rust-lang/rfcs/blob/master/text/2592-futures.md includes such a contract.
> The only relevant resource I found is this old discussion, where one of the tokio developers confirms that such behavior is possible and should be expected. This thread also implies that the behavior in h2 is not expected.

I talked to @alexcrichton, who created Waker, about this. As far as he knows, there is nothing in the Rust project docs that says you can't poll a future on the call stack of Waker::wake(), and in the early days of building executors, many implementations did so. In practice, all executors that did so no longer do, and the Rust ecosystem contains lots of Future impls that, whether through locks or unsafe, call wake on the assumption that no major side effects (such as polling a Future) will happen synchronously. Given that this issue bites us on real code today with a deadlock, and could bite with less detectable bad behavior in the future, I think the best way to treat this is that Waker has an invariant which is not reflected in the Rust docs, and that's a deficiency of those docs.

@bavshin-f5
Member

bavshin-f5 commented Jun 3, 2026

The main difference between all the mainstream async executors and ours is that the entire nginx is doing its work behind the scenes. It's not a simple event loop like libevent, libuv or mio; it's a complete, complex server application that accepts connections, handles errors, and can decide to drop our async task context without notification, before we even have a chance to handle this gracefully.
Deferred wakeup can be processed when it's no longer safe to access the event context or when the underlying IO object state has already changed, and it is not a theoretical issue. It is my experience from building an async connection wrapper for nginx-acme, which included some interesting, rare and really hard to debug crashes.

If we have to respect the undocumented Waker invariant, as an nginx developer, I no longer believe it is possible to build a safe async abstraction over the nginx event loop. We'll need to drop everything we currently have, and likely return to the approach of running tokio as a sidecar. I'd prefer to find an approach that does not require such drastic measures.

There are some less important issues, such as always queuing events to ngx_posted_next_events, which wakes the event loop immediately and eliminates idle wait, but that's just non-critical performance stuff.


Interestingly enough, a similar deadlock was reported to h2 in 2021. It was stated directly that there's a problem in h2, but the change in tokio that caused the task to be shut down too early got reverted, and nobody bothered to fix h2.

Now, if you close the client connection with some very fortunate timing, the socket error will be processed on the next event loop iteration before the scheduled async task wakeups. The connection will be destroyed, and all the tasks will be dropped without polling. This sounds oddly similar to the scenario Alice described, and depending on the drop order it is possible to get the same deadlock.

Edit: I realized that the scenario as described should not be possible, because we have already left the call stack that held the lock.

@pschyska

pschyska commented Jun 4, 2026

> The main difference between all the mainstream async executors and ours is that the entire nginx is doing its work behind the scenes. It's not a simple event loop like libevent, libuv or mio; it's a complete, complex server application that accepts connections, handles errors, and can decide to drop our async task context without notification, before we even have a chance to handle this gracefully. Deferred wakeup can be processed when it's no longer safe to access the event context or when the underlying IO object state has already changed, and it is not a theoretical issue. It is my experience from building an async connection wrapper for nginx-acme, which included some interesting, rare and really hard to debug crashes.
>
> If we have to respect the undocumented Waker invariant, as an nginx developer, I no longer believe it is possible to build a safe async abstraction over the nginx event loop. We'll need to drop everything we currently have, and likely return to the approach of running tokio as a sidecar. I'd prefer to find an approach that does not require such drastic measures.

Don't throw out the baby with the bathwater.
I think two concerns are getting conflated here.

1. nginx as an async executor.

This is valuable on its own, so handlers can be async and we can do non-blocking i/o, which is crucial.

We could use hickory-resolver instead of the Resolution future and hyper on TokioIo instead of PeerConnection, although my recent benchmarks show that nginx futures might be preferable for efficiency reasons.

Worst case, with just an nginx executor and a bridge into its epoll which wakes it in response to external events (e.g. eventfd), you have another epoll loop in the tokio runtime thread. The nginx epoll would handle the incoming connection fds, etc. and tokio's epoll the fds for hyper clients, file i/o, etc. [1]

2. futures wrapping nginx internals like Resolution, PeerConnection

If nginx frees resources after the completion callback returns, the future itself should move the read into the callback, so the data is available at poll time.

I don't know of any concrete examples of this, but you clearly have something in mind? Maybe you can share one, which would make it easier for me to wrap my head around.

I believe read/recv handlers don't generally require this, e.g. PeerConnection can recv() at poll time, nothing in nginx recv()'s our data away from that socket.

In my benchmarks, I'm using your PeerConnection with both ngx::async_::spawn() and ngx_tickle::spawn(), and the latter already defers wakeups exactly like this PR proposes. So futures in ngx_tickle's spawn are always polled after the completion handler has returned (necessarily, because there is only one thread in this scenario). [2] I haven't seen a segfault or similar issues with that approach across probably hundreds of millions of wrk{,2} requests (that would have uncovered rare issues), FWIW.

In summary, I would say that 1. is a necessity, and 2. is a nice-to-have, and that the issues you are mentioning are related to 2.

So how to get 1., i.e. the ability to have handlers async?
Just a sidecar runtime like the compat runtime or tokio won't work well: you could not pin futures to the main thread, and could not use any of the nginx objects like Request.
You really want to be able to have some futures running in the nginx executor.

use std::sync::OnceLock;

use async_compat::CompatExt;
use ngx::core::Status;
use ngx::ffi::{
    NGX_HTTP_MODULE, ngx_array_push, ngx_command_t, ngx_conf_t, ngx_http_handler_pt,
    ngx_http_module_t, ngx_http_phases_NGX_HTTP_PRECONTENT_PHASE, ngx_int_t, ngx_module_t,
};
use ngx::http::{self, HTTPStatus, HttpModule, Request};
use ngx::http::{HttpModuleMainConf, NgxHttpCoreModule};
use ngx::{http_request_handler, ngx_modules};
use tokio::runtime::Runtime;

use ngx_tickle::prelude::*;

static UPSTREAM: &str = "example.com";

async fn async_handler_compat(request: &mut Request) {
    // future 1, poll #1 -> nginx thread, safe to use Request, and other !Send
    let response = reqwest::get(format!("{UPSTREAM}/{}", request.path()))
        // reqwest starts a hyper driver task internally using tokio::spawn().
        // The compat runtime is a tokio new_current_thread in a secondary thread, and it sets up a
        // context that makes global tokio::spawn() work, using that runtime.
        // Let's call that future 2 -> compat thread
        .compat()
        .await
        // future 1, poll #2 -> nginx thread again
        .unwrap();
    request.add_header_out("X-example-status", &format!("{}", response.status()));
    finalize_request(request, HTTPStatus::NO_CONTENT.into());
}

http_request_handler!(entry_handler_compat, |request: &mut http::Request| {
    request.spawn(async_handler_compat).unwrap(); // RequestSpawn, see below
    Status::NGX_AGAIN
});

async fn async_handler_tokio(request: &mut Request) {
    // future 1, poll #1 -> nginx thread (safe to use Request, and other !Send)
    let path = request.path().to_str().unwrap().to_string();
    let response = tokio_runtime()
        .spawn(async move {
            // future 2 -> tokio thread, *not* safe to use !Send, but the compiler will stop you:
            // reqwest::get(format!("{UPSTREAM}/{}", request.path()))
            //
            // error: future cannot be sent between threads safely
            //    --> examples/compat.rs:45:36
            //     |
            //  45 |     let response = tokio_runtime().spawn(async move {
            //     |                                    ^^^^^ future created by async block is not `Send`
            //     |
            //     = help: within `{async block@examples/compat.rs:45:42: 45:52}`, the trait `Send` is not implemented for `*mut u8`
            // note: captured value is not `Send` because `&mut` references cannot be sent unless their referent is `Send`
            //    --> examples/compat.rs:46:47
            //     |
            //  46 |         reqwest::get(format!("{UPSTREAM}/{}", request.path()))
            //     |                                               ^^^^^^^ has type `&mut ngx::http::Request` which is not `Send`, because `ngx::http::Request` is not `Send`
            // note: required by a bound in `Runtime::spawn`
            //    --> /home/p/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.52.3/src/runtime/runtime.rs:241:21
            //     |
            // 239 |     pub fn spawn<F>(&self, future: F) -> JoinHandle<F::Output>
            //     |            ----- required by a bound in this associated function
            // 240 |     where
            // 241 |         F: Future + Send + 'static,
            //     |                     ^^^^ required by this bound in `Runtime::spawn`

            // again, reqwest spawns an internal hyper driver task, let's call that future 2.1 -> tokio thread
            reqwest::get(format!("{UPSTREAM}/{}", path)).await.unwrap()
        })
        .await
        .unwrap();

    // future 1, poll #2 -> nginx thread again
    request.add_header_out("X-example-status", &format!("{}", response.status()));
    finalize_request(request, HTTPStatus::NO_CONTENT.into());
}

http_request_handler!(entry_handler_tokio, |request: &mut http::Request| {
    request.spawn(async_handler_tokio).unwrap(); // RequestSpawn, see below
    Status::NGX_AGAIN
});

static RUNTIME: OnceLock<Runtime> = OnceLock::new();
fn tokio_runtime() -> &'static Runtime {
    RUNTIME.get_or_init(|| {
        // or new_current_thread, but doesn't make a difference at this point
        tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .expect("runtime")
    })
}

In this example, future 1 runs in the nginx executor. It can use &mut Request and other !Send, and must be polled on the main thread. future 2 (and 2.1) run in tokio, and are polled on that thread. This almost works today, but when future 2/2.1 wakes future 1, it will do so from the other runtime's thread, which runs future 1 in that thread in-line, and not the main thread. This is why #218 (and ngx-tickle) exists. [3]
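
The cross-thread concern can be illustrated with a waker that never polls on the calling thread and instead notifies the thread that owns the task. This is a rough sketch only and not how ngx-tickle or #218 is implemented: `NotifyOwner` and `OwnerThreadOnly` are hypothetical, an `mpsc` channel stands in for an eventfd-style notification, and a plain helper thread stands in for the tokio runtime thread.

```rust
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, ThreadId};

/// Waking sends a message to the owning thread instead of polling in place.
struct NotifyOwner(Mutex<Sender<()>>);

impl Wake for NotifyOwner {
    fn wake(self: Arc<Self>) {
        // Ignore send errors: the owning loop may already be gone.
        let _ = self.0.lock().unwrap().send(());
    }
}

/// A !Send future that records which thread polls it; completes on the second poll.
struct OwnerThreadOnly {
    _not_send: Rc<()>, // makes the future !Send, like one borrowing a Request
    polled_on: Vec<ThreadId>,
}

impl Future for OwnerThreadOnly {
    type Output = Vec<ThreadId>;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        self.polled_on.push(thread::current().id());
        if self.polled_on.len() == 1 {
            // Hand the waker to another thread, which wakes us from there.
            let waker = cx.waker().clone();
            thread::spawn(move || waker.wake());
            Poll::Pending
        } else {
            Poll::Ready(std::mem::take(&mut self.polled_on))
        }
    }
}

/// Returns true if every poll ran on the thread that owns the future.
fn polls_stay_on_owner_thread() -> bool {
    let (tx, rx) = channel();
    let waker = Waker::from(Arc::new(NotifyOwner(Mutex::new(tx))));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(OwnerThreadOnly { _not_send: Rc::new(()), polled_on: Vec::new() });
    let owner = thread::current().id();
    loop {
        if let Poll::Ready(ids) = fut.as_mut().poll(&mut cx) {
            return ids.len() == 2 && ids.iter().all(|id| *id == owner);
        }
        // Block until some thread wakes us; stands in for the event loop being interrupted.
        rx.recv().expect("a waker is still alive");
    }
}
```

`polls_stay_on_owner_thread()` returns `true`: the helper thread only sends a notification, and both polls happen on the owning thread. A real integration would still need a thread-safe way to interrupt the nginx event loop, which this sketch does not attempt.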

!Send prevents you from accidentally borrowing Request in future 2: the compiler will stop you. That is why you want something like RequestSpawn to get rid of the 'static requirement. If you do the AtomicPtr dance, you are also making Request Send in addition to making it 'static.
Then the compiler will not stop you from using it in another thread, and you know how quickly things go wrong if you try to use nginx state from the wrong thread.

For some context: we are using the ngx-tickle approach for a customer project now.

We use some tokio futures, mainly reqwest and file i/o, and I wrote a RequestBody future to read the client body with the helpful pointers you've given me in #222. In it, I'm just collecting buf pointers in the completion handler, but only call ngx_http_finalize_request (for the ngx_http_read_client_request_body "subrequest") after I have read them fully.

We have an extensive black-box test suite (external to that repo), are now in production, and do load-testing regularly. There are no issues in the "twilight zone" (segfaults, deadlocks, ...) that would arise from unsafe behaviour.
I'm not saying there weren't any, there were plenty, but they turned out to be misuse of the nginx APIs in futures: calling finalize/free functions at the wrong time, or not at all, for example.
Or hitting a rare UAF during request cancellation initiated from the nginx side due to timeouts during load tests.

So I do know what you are talking about, but the crashes you mean were, in our experience, category 2 — futures misusing the nginx APIs — not the scheduler itself.

Now, after derailing this discussion with my thread-safe schedule/Waker stuff again 🙂 , let me bring it back:
The one genuine scheduler bug we hit is exactly the issue this PR addresses:

A very rare deadlock that happens when hyper connections are dropped, related to a lock guarding its connection pool that is held while notifying a task waiting for a connection. ngx-tickle had the exact same recursive-poll behavior, so the waking of a connection waiter ran in-line, and the woken task tried to acquire the same lock re-entrantly, which it is not prepared for.
I fixed it by switching to always-enqueue — which is what this PR does.

Footnotes

  1. A minor efficiency issue here would be that timers aren't synchronized: a tokio timer would not set a deadline on the nginx epoll, so when it fires it would have to do an additional eventfd write (or similar) to ping nginx. There might be a way to integrate the timers, but I haven't found it yet. I tried to come up with a way to single-step a tokio runtime, i.e. to run one step every time the nginx event loop turns. Then, when forwarding tokio epoll events and synchronizing its timers to nginx, it would be possible to get rid of the secondary thread driving the tokio runtime. I think it used to be possible (tokio-core) but no longer is. I haven't given up yet 🙂

  2. I actually have unpublished numbers for ngx::async_::spawn() with this PR merged-in, because I was curious how it might impact performance. I'm happy to share them if you are interested, but IIRC it didn't move the needle much.

  3. Note that just deferring wakeup as in this PR doesn't resolve the thread-safety issue on its own: the code would still run unsafe { ngx_post_event(&raw mut self.event, &raw mut ngx_posted_next_events) } from the tokio thread, since that is the thread calling the Waker, and this isn't safe. ngx-tickle's notify has a dual role: it interrupts nginx's epoll and moves the task back to the main thread.

@pchickey
Contributor

pchickey commented Jun 4, 2026

> If we have to respect the undocumented Waker invariant, as an nginx developer, I no longer believe it is possible to build a safe async abstraction over the nginx event loop. We'll need to drop everything we currently have, and likely return to the approach of running tokio as a sidecar. I'd prefer to find an approach that does not require such drastic measures.

We do have to respect the Waker invariant. Dropping everything we currently have isn't a realistic approach: we have multiple projects in progress that depend on async Rust on top of the nginx event loop, and switching to a sidecar would amount to a rewrite of those projects and would require escalation way up the business chain for the disruption to deliverables it would cause.

There are already numerous problems with using Rust safely on top of nginx's C abstractions, and any additional problems this creates will need to be tackled systematically.

> Common definitions of AsyncRead cannot even be efficiently implemented within the fully deferred scheduling model, because it would require eagerly reading the socket data into a temporary buffer before waking the task. And then potentially unregistering the file descriptor from nginx event methods, if the deferred poll() decides that further reads are unnecessary.

I don't love AsyncRead's design, but it's an immovable part of the world this project is living in, so the next step is to start building the functionality to make this possible, and to make it reasonable for users of these async-Rust-specific abstractions to understand and manage the costs of doing things the way they must.

Development

Successfully merging this pull request may close these issues.

schedule() re-polls synchronously on wake, violating the Waker contract

5 participants