Proposal: make the Windows asio backend work like epoll and kqueue #5465

SeanTAllen · 2026-06-11T15:15:34Z

SeanTAllen
Jun 11, 2026
Maintainer

This is a proposal to replace how the runtime does socket I/O on Windows: drop completion-based overlapped I/O (IOCP) and use readiness notifications instead — the same model the epoll and kqueue backends use everywhere else. Reads and writes, TCP and UDP, one model on every platform. The difference between the two models: under completions you hand the kernel a buffer and get told later when the operation has finished with it; under readiness the kernel tells you an operation would succeed right now, and you perform it synchronously with your own buffer.
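To make the distinction concrete, here is a minimal sketch of the readiness side, written against POSIX poll and recv because that is the model the proposal wants Windows to match. The helper names (set_nonblocking, read_when_ready) are invented for illustration and are not runtime functions. The point to notice is that the buffer is only touched inside recv: the kernel holds no pointer to it before or after the call.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: put a descriptor into non-blocking mode. */
static int set_nonblocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0)
    return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Readiness model: first ask the kernel whether a read would succeed,
 * then perform the read synchronously into a caller-owned buffer.
 * Returns the byte count from recv (0 means the peer closed), or -1
 * if the socket was not readable within timeout_ms or poll failed.
 * When -1 comes from the poll step, buf has not been touched. */
static ssize_t read_when_ready(int fd, void* buf, size_t len, int timeout_ms)
{
  assert(buf != NULL);
  struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
  if (poll(&pfd, 1, timeout_ms) <= 0)
    return -1;
  return recv(fd, buf, len, 0);
}
```

Under the completion model the equivalent call would return while the kernel still holds buf, which is exactly the window the rest of this proposal is about.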

The rationale, in order of weight:

Memory safety: under the current model the kernel writes into Pony buffers after calls return; no reference capability can describe that (the capability analysis is in #5461). Today, not corrupting an in-flight buffer is the user's burden, enforced by nothing but care. That burden should be the runtime's.
One model for users: anyone writing event-driven code against the runtime's ASIO interface today gets readiness semantics on POSIX and completion semantics on Windows; with this change, OS-agnostic ASIO code becomes the easy path.
Real backpressure: Windows TCP writes currently throttle on an in-flight-count heuristic the source itself calls "rather arbitrary"; readiness gives writability as the kernel reports it, like every other platform.
Maintenance: packages/net carries a divergent Windows twin of nearly every read/write path, plus a Windows-only concurrency-survival subsystem in the runtime, all of which gets deleted by this change.

There's precedent: Rust's mio did this (the wepoll technique), libuv's sockets work this way, and since Windows Server 2022 there's a documented API for it. What follows is the investigation: what we actually have today, an alternative I considered and why I don't think it's enough, what a readiness backend requires, and the choices I'd make.

What we actually have today

On Linux, the runtime's asio thread sits in epoll_wait, the kernel tells it which sockets are ready, and it sends events to the interested actors. All socket handling runs through that one Pony-owned thread.

Windows has no equivalent of that today. The asio thread exists there too (asio/iocp.c), but it never touches a socket: it waits on exactly two things, its own wakeup signal and console stdin, and handles timers and signals on the side. Socket I/O takes a completely different path. When the stdlib reads or writes, the runtime starts an overlapped operation: it hands the kernel a buffer, and the call returns immediately, before the work is done. When the kernel finishes (however much later that is), Windows runs a callback of ours on one of its own thread-pool threads, and that callback, on a thread the Pony runtime doesn't own or manage, is what sends the asio event to the actor (BindIoCompletionCallback, socket.c:516-520, 541-544). There is no Pony-owned waiting point for sockets at all: no completion port of ours, no loop pulling socket events (there is no CreateIoCompletionPort or GetQueuedCompletionStatus call anywhere in the runtime). Every socket event the runtime delivers on Windows is sent from a thread the runtime doesn't manage.

That architecture has consequences independent of any switch:

A subsystem exists to survive completions firing on foreign threads. Every event carries a Windows-only refcounted token (iocp_token_t, event.h:11-31) so that an iocp_callback arriving after pony_asio_event_destroy doesn't use freed memory. The protocol spans event.c, socket.c:214-302, and a release hook in actor.c:477-498. Foreign-thread sends also require pony_register_thread() on thread-pool threads (event.c:165-179).
UDP send completions are discarded. WSASendTo is issued fire-and-forget with a NULL event (socket.c:456-481, 376-379). If a send fails after queuing, nothing is reported. No error, no backpressure.
TCP write backpressure is a guess. Backpressure is applied only if the overlapped submit itself fails (which a full socket send buffer doesn't cause) and is released once fewer than 16 writes are in flight. The comment in tcp_connection.pony:767-770: "The choice of 16 is rather arbitrary and probably needs to be tuned." POSIX gets exact backpressure from the kernel; Windows gets a magic number.
The listener has a latent liveness gap. Exactly one AcceptEx is outstanding at a time, and the C failure path closes the pre-created socket and sends no event (socket.c:336-340); a failed accept silently stops the listener from accepting. (The stdlib's ns == -1 handling branch is unreachable from the current C.)
The kernel writes into GC-managed memory while actors run. An in-flight TCP read targets the connection's iso read buffer; an in-flight UDP read additionally writes a NetAddress object. The teardown sequencing in tcp_connection and udp_socket exists to order close around these in-flight kernel writes. This is the retention window described in #5461 ("Round 2 plan: honest FFI reference capabilities for stdlib + tools (#4925)"): the window no capability can describe.
The cross-platform asio contract is only half-implemented on Windows. ASIO_ONESHOT (a subscription mode where an event fires once, then stays disarmed until explicitly re-armed) is ignored; pony_asio_event_resubscribe_read/_write don't exist in iocp.c (the stdlib's calls are inside ifdef not windows); the readable/writeable event fields are never touched by the backend. The stdlib compensates by requesting different event flags per platform.
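The refcounted-token idea from the first item in this list can be sketched as follows. This is an illustration of the general refcount-plus-dead-flag pattern only: the type and function names are hypothetical, and it does not reproduce the actual layout of iocp_token_t or the real protocol in event.c. The tokens_freed counter exists purely so the lifetime can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical token shared by an event's owner and in-flight operations. */
typedef struct {
  atomic_int refs;   /* one for the owner, one per in-flight operation */
  atomic_bool dead;  /* set once the owner has destroyed the event */
} token_t;

static int tokens_freed = 0; /* demo-only: counts frees for observation */

static token_t* token_create(void)
{
  token_t* t = malloc(sizeof *t);
  assert(t != NULL);
  atomic_init(&t->refs, 1);     /* the owner's reference */
  atomic_init(&t->dead, false);
  return t;
}

/* Taken when an operation is submitted to the kernel. */
static void token_acquire(token_t* t)
{
  atomic_fetch_add(&t->refs, 1);
}

/* Whoever drops the last reference frees the token. */
static void token_release(token_t* t)
{
  if (atomic_fetch_sub(&t->refs, 1) == 1) {
    free(t);
    tokens_freed++;
  }
}

/* Owner side: mark the event dead, then drop the owner's reference. */
static void event_destroy(token_t* t)
{
  atomic_store(&t->dead, true);
  token_release(t);
}

/* Foreign-thread side: returns true if the event should still be
 * delivered. The token stays valid here even after event_destroy,
 * because the in-flight operation holds its own reference. */
static bool completion_arrived(token_t* t)
{
  bool deliver = !atomic_load(&t->dead);
  token_release(t);
  return deliver;
}
```

A completion that arrives after event_destroy finds the dead flag set, skips delivery, and frees the token on its way out. The sketch covers only the lifetime question; making the delivery itself safe against a concurrent destroy takes more care than is shown here.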

An alternative: keep IOCP and transfer buffer ownership explicitly

There is a way to fix the memory-unsafety while staying on completions, and it belongs in this discussion because it's today's model with the ownership transfer made explicit. Completion I/O's contract is ownership transfer: the buffer belongs to the kernel from submission until completion. Expressed properly, a read would consume the buffer at submit, the runtime would own and root it while in flight, and the completion would hand it back. That's where Rust's io_uring runtimes landed (tokio-uring's reads take a Vec<u8> by value and the completion returns it), and it works. For us it means runtime machinery: completions that can carry an object back (today's _event_notify, the behavior the runtime invokes on an actor to deliver an asio event, carries only an event id and two U32s) and runtime rooting of in-flight buffers. But it's buildable.

I don't think it's sufficient, because it fixes only half the problem. The buffer stops being a safety hazard, but IOCP stays semantically different from epoll and kqueue, and every user writing OS-agnostic ASIO code still has to handle that difference: completion model on Windows, readiness model everywhere else, two shapes of event handling in every cross-platform library that touches the runtime's event system. The point of taking complexity into the runtime is an easier model for users, and the round-trip design adds the runtime complexity but still leaves users with two models. Readiness everywhere fixes the same unsafety and removes the split: one ASIO contract on every platform, the easiest path for writing OS-agnostic event-driven Pony. The trade-off is stated below: the floor would become Windows 11.

What readiness means for us

The cross-platform contract a readiness backend must implement already exists and is already exercised: it's exactly what epoll.c and kqueue.c do. One-shot events, resubscribe_read/resubscribe_write to re-arm, set_readable/set_writeable state, reads and writes as synchronous non-blocking calls that run until WOULDBLOCK. The POSIX read loop in tcp_connection (_pending_reads, lines 978-1059) is the spec; the Windows backend has to implement the semantics that loop depends on.
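As a rough picture of the "run until WOULDBLOCK" part of that contract, the following POSIX sketch drains a non-blocking socket and reports why it stopped. The names (sock_result_t, try_recv, drain_readable) are made up for this example and the tiny chunk size is only there to force several iterations; the real loop lives in tcp_connection and differs in detail. Treating a zero-byte read (peer closed) as an error is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical three-way result, mirroring an OK/RETRY/ERROR contract. */
typedef enum { SOCK_OK, SOCK_RETRY, SOCK_ERROR } sock_result_t;

static int set_nonblocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0)
    return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* One synchronous, non-blocking read attempt. */
static sock_result_t try_recv(int fd, void* buf, size_t len, size_t* got)
{
  assert(got != NULL);
  *got = 0;
  ssize_t n = recv(fd, buf, len, 0);
  if (n > 0) {
    *got = (size_t)n;
    return SOCK_OK;
  }
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
    return SOCK_RETRY; /* nothing left right now: re-arm and wait */
  return SOCK_ERROR;   /* assumption: peer close (n == 0) or hard error */
}

/* Read until the kernel says "would block". Returns total bytes read and
 * stores the result that ended the loop in *last: SOCK_RETRY means the
 * caller should resubscribe for read, SOCK_ERROR means tear down. */
static size_t drain_readable(int fd, sock_result_t* last)
{
  char chunk[4];
  size_t total = 0;
  size_t got = 0;
  while ((*last = try_recv(fd, chunk, sizeof chunk, &got)) == SOCK_OK)
    total += got;
  return total;
}
```

The caller's only decision after the loop is whether to re-arm (SOCK_RETRY) or close (SOCK_ERROR), which is the shape a one-shot subscription expects.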

The work splits into three layers:

Runtime, socket layer (socket.c): sockets become non-blocking. Today no Windows socket is; they're blocking-but-overlapped, and not one FIONBIO exists in the tree. pony_os_recv/recvfrom/send/writev lose their overlapped Windows variants and become what they are on POSIX: synchronous calls returning OK/RETRY/ERROR. Accept drops the AcceptEx machinery (pre-created duplicated sockets, SO_UPDATE_ACCEPT_CONTEXT, the extension-pointer fetch) for a plain accept-on-readiness loop. Connect drops ConnectEx and its mandatory pre-bind for connect-plus-poll-for-writable, which also removes the path where a connect completion is reported to the stdlib as a writeable event, and the SO_CONNECT_TIME special case the stdlib uses to tell a successful connect from a failed one (both arrive as that same event today).

Runtime, event delivery: something has to turn "socket became ready" into the ASIO_READ/ASIO_WRITE events the stdlib consumes, with one-shot semantics and re-arm. This is where the three implementation candidates below differ.

Stdlib (packages/net): the Windows halves get deleted, not rewritten. The inventory says roughly: in tcp_connection, the dual pending-write buffers and _pending_sent, _complete_writes, _complete_reads, _queue_read, the dual writev FFI declarations (WSABUF vs iovec tuple order), the per-platform event-flag selection, and the connect special case all go; the POSIX loops become universal. In udp_socket, _read_from, _complete_reads, _start_next_read, and the deferred-unsubscribe close sequence go. In tcp_listener, the completion-driven accept collapses into the POSIX loop. The foreign-thread token subsystem in the runtime becomes deletable too, if event sends move onto the asio thread (see "Who waits on the port" below).
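For the connect change in the socket layer above, "connect-plus-poll-for-writable" looks roughly like this on POSIX. It is a sketch under stated assumptions, not the proposed Windows code: all three function names are hypothetical, it only targets the IPv4 loopback address, and error handling is minimal. The part worth noticing is that success and failure are told apart by asking the socket for SO_ERROR once it reports writable, instead of by a special case on the event.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Demo helper: listen on 127.0.0.1 with a kernel-chosen port. */
static int listen_loopback(uint16_t* port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;
  struct sockaddr_in addr;
  memset(&addr, 0, sizeof addr);
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  socklen_t len = sizeof addr;
  if (bind(fd, (struct sockaddr*)&addr, sizeof addr) != 0 ||
      listen(fd, 1) != 0 ||
      getsockname(fd, (struct sockaddr*)&addr, &len) != 0) {
    close(fd);
    return -1;
  }
  *port = ntohs(addr.sin_port);
  return fd;
}

/* Start a non-blocking connect to 127.0.0.1:port. Returns the socket,
 * or -1 if the attempt failed immediately. */
static int start_connect(uint16_t port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) != 0) {
    close(fd);
    return -1;
  }
  struct sockaddr_in addr;
  memset(&addr, 0, sizeof addr);
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  if (connect(fd, (struct sockaddr*)&addr, sizeof addr) != 0 &&
      errno != EINPROGRESS) {
    close(fd);
    return -1;
  }
  return fd;
}

/* Wait for writability, then read the connect outcome from SO_ERROR.
 * Returns 0 on success, an errno value on failure, -1 on timeout. */
static int finish_connect(int fd, int timeout_ms)
{
  assert(fd >= 0);
  struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
  if (poll(&pfd, 1, timeout_ms) <= 0)
    return -1;
  int err = 0;
  socklen_t len = sizeof err;
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0)
    return errno;
  return err;
}
```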

What doesn't change: everything on the Windows asio thread that isn't socket I/O, because none of it goes through IOCP today. Timers are waitable-timer callbacks delivered straight to the asio thread; signals are CRT signal handlers; stdin is a console handle the asio thread waits on directly; and process pipes never touched sockets or asio events at all (they poll named pipes on a timer). A change to how sockets are handled doesn't reach any of it. The one shared obligation is noisy-event accounting (the count of live I/O subscriptions that keeps the runtime from shutting down while I/O is outstanding), which must be preserved identically.

Three ways to get readiness on Windows

1. ProcessSocketNotifications — the documented one. Since Windows Server 2022 / Windows 11 (build 20348), Winsock has a documented socket-readiness API: register sockets with a completion port you own, receive SOCK_NOTIFY_EVENT_IN/OUT/HANGUP/ERR as ordinary completion packets (the event mask is carried in the bytes-transferred field), dequeue with GetQueuedCompletionStatusEx. It supports level-triggered one-shot registration (level-triggered: notified whenever the condition holds, not only on the transition into it), which is Microsoft's recommended pattern for multithreaded servers and exactly the shape of our ASIO_ONESHOT contract. Lifecycle is documented and manageable: one port per socket, deregister-and-wait-for-REMOVE before freeing context, closesocket auto-deregisters without a REMOVE so event lifetimes need care. Our README's supported-Windows is Windows 11 only and CI runs windows-11/windows-2025, so every supported target has this API. The catch: Windows 10 (currently best-effort, x86) doesn't have it; the consequence is covered below.

2. AFD poll, wepoll-style — undocumented, widely shipped. IOCTL_AFD_POLL (0x12024) issued via NtDeviceIoControlFile against helper handles opened on \Device\Afd, completions delivered to a port, NtCancelIoFileEx to cancel. The event vocabulary is richer than candidate 1 (POLL_RECEIVE, POLL_SEND, POLL_DISCONNECT, POLL_ABORT, POLL_ACCEPT, POLL_CONNECT_FAIL, POLL_LOCAL_CLOSE), it works back to Vista, and it's the core mechanism under wepoll, libuv, and mio: years of production use of an API Microsoft never documented. Implementation details are well mapped: base-socket unwrapping via SIO_BASE_HANDLE (with fallbacks, because some layered service providers, third-party software inserted into the Winsock stack, break it), helper-handle pooling (wepoll runs 32 sockets per AFD handle), one-shot-by-nature with explicit re-arm. And calling ntdll costs us nothing new at link time: every compiled Pony program already links ntdll.lib (it's in the compiler's default link line). One limitation matters below: an AFD poll reports "the condition holds now," which is level-triggered behavior. Persistent edge-triggered behavior has no clean expression on it. That's not my inference; it's wepoll's own documented limitation ("Edge-triggered (EPOLLET) mode isn't supported"), unchanged across years of production use.

3. AFD poll per-socket — the simplified undocumented one. Len Holgate demonstrated in 2024 that the same IOCTL_AFD_POLL can be issued on the socket's own handle, skipping the shared helper handles and their bookkeeping entirely. Same undocumented dependency, less machinery. Less production history than wepoll's approach.

Which mechanism: ProcessSocketNotifications

My choice is ProcessSocketNotifications, for one reason that outweighs everything else: it's the only option that gives users the same affordances on Windows that epoll and kqueue give them everywhere else, and that is incredibly important. The asio subsystem isn't just plumbing for packages/net — it's a runtime interface user code builds on. On Linux and the BSDs, a subscription without ASIO_ONESHOT gets persistent edge-triggered semantics (EPOLLET / EV_CLEAR: notified once each time the socket becomes ready, with no re-arm needed); with it, one-shot plus re-arm (notified once, then nothing until explicitly re-armed). Offering that same surface on Windows is the same principle as the memory-safety argument: the runtime takes the complexity on so users get the model. ProcessSocketNotifications supports one-shot, persistent, level, and edge triggering natively: everything the posix backends offer. The AFD route cannot do edge. None of the implementations built on it support EPOLLET (wepoll, the reference for the technique, documents the limitation outright; the Rust libraries that reimplemented the technique share it), and it appears that making it happen would be painful and bug-prone. The only thing that pain would buy is Windows 10 support.
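For readers who haven't used both subscription modes, here is a small Linux-only sketch of the two behaviors described above, using epoll directly. The function names are hypothetical and this is not code from epoll.c; it only shows the semantics a Windows backend would need to reproduce. Passing EPOLLONESHOT gives one-shot-plus-re-arm; passing EPOLLET gives persistent edge-triggered delivery.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Subscribe fd for read events. mode is EPOLLONESHOT or EPOLLET. */
static int subscribe_read(int ep, int fd, uint32_t mode)
{
  struct epoll_event ev = { .events = EPOLLIN | mode, .data = { .fd = fd } };
  return epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
}

/* Re-arm a one-shot subscription that has already fired. */
static int rearm_read(int ep, int fd, uint32_t mode)
{
  struct epoll_event ev = { .events = EPOLLIN | mode, .data = { .fd = fd } };
  return epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
}

/* Number of subscriptions with a pending event, waiting up to timeout_ms. */
static int ready_count(int ep, int timeout_ms)
{
  assert(ep >= 0);
  struct epoll_event out[4];
  return epoll_wait(ep, out, 4, timeout_ms);
}
```

With EPOLLONESHOT, a subscription that has fired stays silent until rearm_read is called, even if unread data is still sitting in the socket; after the re-arm it fires again because the data is still there. With EPOLLET, no re-arm is needed: the subscription fires again when new data arrives, but not merely because old data remains unread.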

For the stdlib's own usage the candidates are equivalent (the stdlib subscribes one-shot and re-arms after each event, and that pattern maps onto either mechanism with the same number of syscalls), so user affordances are the deciding criterion.

The trade-off: same functionality for user code to interface with the runtime on all supported platforms — but Windows 11 and Windows Server 2022 become the floor (the API appears in build 20348). Our supported-platforms list is already Windows 11 only and CI runs Windows Server 2025, so nothing changes for supported targets. With this change, Windows 10 support is dropped entirely. It's a best-effort tier today (x86 only); every Windows release before the required API existed falls below the floor: no fallback path, no degraded mode, and the platform documentation changes accordingly in the same PR. Wepoll-style AFD is the fallback if the documented API turns out to have problems in practice, at the cost of the edge mode.

Who waits on the port

With ProcessSocketNotifications, readiness notifications arrive as packets on a completion port. Some thread has to wait on that port, pull packets off, and send the corresponding ASIO_READ/ASIO_WRITE events to actors. Windows offers two ways to arrange that:

The asio thread waits on the port. We create the port, and the asio thread blocks on it waiting for packets (GetQueuedCompletionStatusEx) instead of blocking on its current two handles: the same job it does today, and the same job the asio thread does on Linux sitting in epoll_wait. When a packet arrives, it wakes, reads which socket became ready for what, and sends the asio event, all from the one thread the runtime already dedicates to this. Two details need engineering, both manageable. Timers: today they're delivered to the asio thread as APCs, which Windows only delivers while a thread is in an "alertable" wait; the port wait has an alertable variant, so timers keep working unchanged. Stdin: a console handle can't be attached to a completion port, so stdin needs a side path; Windows can watch the handle and post a packet to the port when input arrives (RegisterWaitForSingleObject).
Windows' thread pool delivers the packets. Windows can instead run a callback of ours on one of its own threads whenever a packet arrives — this is exactly how today's socket completions are delivered (BindIoCompletionCallback). Less change, but our code keeps running on threads the runtime doesn't manage.

I think the Pony-owned port is the right call, and the ground is correctness. When our code runs on Windows' threads, an event can be delivered at the same moment the runtime is tearing that event down. We know those races are real, because the refcount/dead-flag machinery described earlier exists specifically to manage them. With the asio thread waiting on its own port, exactly one thread touches events (the same arrangement every POSIX backend has), so those races can't happen, and the machinery that managed them gets deleted instead of maintained. ProcessSocketNotifications requires a caller-created port anyway, so the architecture I'd want and the one the API requires are the same.

User-visible changes

Backpressure becomes real. throttled/unthrottled fire on actual socket writability on every platform, not a 16-in-flight heuristic.
UDP send failures stop being silently discarded.
Muted-connection close detection changes. (Muting is receive-side backpressure: a TCPConnection can be muted to stop reading.) Today Windows notices a peer close while muted (the queued read surfaces it) and POSIX doesn't (documented in tcp_connection.pony:197-202). Readiness makes Windows match POSIX. If we'd rather keep close-detection-while-muted, it looks doable on both platforms under readiness — but that's a behavior decision, not an implementation detail.
Several \exhaustive\ match arms in the net stdlib that are documented "unreachable on Windows" become reachable; the _SocketResultDecoder OK/Retry/Error contract means the same thing everywhere.
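The first item, backpressure driven by actual writability, can be illustrated with a POSIX sketch like the one below. It is a simplified model with invented names (writer_t, write_some, on_writable, is_writable_now), not the stdlib's logic: the writer throttles exactly when the kernel refuses more data, and unthrottles when a write-readiness notification arrives, with no in-flight counter anywhere.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical writer state: throttled means "stop sending for now". */
typedef struct { bool throttled; } writer_t;

static int set_nonblocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0)
    return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Send as much as the kernel accepts right now. If the send buffer is
 * full, mark the writer throttled and return the bytes sent so far. */
static size_t write_some(writer_t* w, int fd, const char* data, size_t len)
{
  assert(w != NULL);
  size_t sent = 0;
  while (sent < len) {
    ssize_t n = send(fd, data + sent, len - sent, MSG_NOSIGNAL);
    if (n > 0) {
      sent += (size_t)n;
      continue;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
      w->throttled = true; /* the kernel's own signal, not a heuristic */
    break;
  }
  return sent;
}

/* Called when a write-readiness event is delivered for the socket. */
static void on_writable(writer_t* w)
{
  w->throttled = false;
}

/* Demo stand-in for the event system: is fd writable at this instant? */
static bool is_writable_now(int fd)
{
  struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
  return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLOUT) != 0;
}
```

In a real backend the on_writable call would come from the asio event rather than from a poll, but the throttle and unthrottle points would sit in the same places.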

What this does for memory safety

Under readiness, recv and recvfrom write only during the call (the kernel never retains a pointer past return), so reference capabilities can describe everything that happens to socket memory, on every platform. The in-flight-kernel-write teardown sequencing goes away, no GC-managed memory gets written outside any capability's description, and the safety-by-discipline requirement on Windows disappears: there is no in-flight buffer left to be careful with.
