Proposal: make the Windows asio backend work like epoll and kqueue #5465
SeanTAllen
started this conversation in
Runtime
This is a proposal to replace how the runtime does socket I/O on Windows: drop completion-based overlapped I/O (IOCP) and use readiness notifications instead — the same model the epoll and kqueue backends use everywhere else. Reads and writes, TCP and UDP, one model on every platform. The difference between the two models: under completions you hand the kernel a buffer and get told later when the operation has finished with it; under readiness the kernel tells you an operation would succeed right now, and you perform it synchronously with your own buffer.
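To make the readiness half of that distinction concrete, here is a minimal POSIX C sketch, not runtime code. The function name readiness_read and its return codes are invented for illustration. It asks the kernel whether a read would succeed, then performs the read synchronously into a buffer the caller owns the whole time.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Readiness model (illustrative helper, not a runtime function):
 * ask the kernel whether a read would succeed now, then perform the read
 * synchronously into a buffer the caller owns throughout.
 * Returns bytes read, 0 if the peer closed, -1 on error, or -2 if the
 * socket did not become readable within timeout_ms. */
ssize_t readiness_read(int fd, void *buf, size_t len, int timeout_ms)
{
  assert(buf != NULL);

  struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

  int n = poll(&pfd, 1, timeout_ms);
  if (n < 0)
    return -1; /* poll itself failed */
  if (n == 0)
    return -2; /* not ready: no buffer was ever handed to the kernel */

  /* The kernel writes into buf only for the duration of this call. */
  return recv(fd, buf, len, 0);
}
```

Under completions there is no equivalent of the -2 case: the buffer is already with the kernel by the time the caller learns anything.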
The rationale, in order of weight:
- Memory safety: under the current model the kernel writes into Pony buffers after calls return; no reference capability can describe that (the capability analysis is in #5461). Today, not corrupting an in-flight buffer is the user's burden, enforced by nothing but care. That burden should be the runtime's.
- One model for users: anyone writing event-driven code against the runtime's ASIO interface today gets readiness semantics on POSIX and completion semantics on Windows; with this change, OS-agnostic ASIO code becomes the easy path.
- Real backpressure: Windows TCP writes currently throttle on an in-flight-count heuristic the source itself calls "rather arbitrary"; readiness gives writability as the kernel reports it, like every other platform.
- Maintenance: packages/net carries a divergent Windows twin of nearly every read/write path, plus a Windows-only concurrency-survival subsystem in the runtime, all of which gets deleted by this change.
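The backpressure point can be sketched in POSIX C. This is an illustration under assumptions, not the stdlib's write path: write_result_t, set_nonblocking, and write_until_throttled are hypothetical names. The idea is that a non-blocking send either accepts bytes or reports would-block, and that report is the throttle signal; no in-flight count is involved.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result type for this sketch. */
typedef struct {
  size_t sent;    /* bytes the kernel accepted */
  bool throttled; /* kernel reported the socket is not writable right now */
  bool failed;    /* any other failure */
} write_result_t;

/* Put fd into non-blocking mode; returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0)
    return -1;
  return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Send until everything is written or the kernel says "would block". */
write_result_t write_until_throttled(int fd, const char *data, size_t len)
{
  assert(data != NULL);
  write_result_t r = { 0, false, false };

  while (r.sent < len) {
    ssize_t n = send(fd, data + r.sent, len - r.sent, 0);
    if (n > 0) {
      r.sent += (size_t)n;
      continue;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
      r.throttled = true; /* backpressure as the kernel reports it */
    else
      r.failed = true;
    break;
  }
  return r;
}
```

A caller would keep the unsent tail, subscribe for writability, and resume from r.sent when the write event arrives.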
There's precedent: Rust's mio did this (the wepoll technique), libuv's sockets work this way, and since Windows Server 2022 there's a documented API for it. What follows is the investigation: what we actually have today, an alternative I considered and why I don't think it's enough, what a readiness backend requires, and the choices I'd make.
What we actually have today
On Linux, the runtime's asio thread sits in epoll_wait, the kernel tells it which sockets are ready, and it sends events to the interested actors. All socket handling runs through that one Pony-owned thread.
Windows has no equivalent of that today. The asio thread exists there too (asio/iocp.c), but it never touches a socket: it waits on exactly two things, its own wakeup signal and console stdin, and handles timers and signals on the side. Socket I/O takes a completely different path. When the stdlib reads or writes, the runtime starts an overlapped operation: it hands the kernel a buffer, and the call returns immediately, before the work is done. When the kernel finishes (however much later that is), Windows runs a callback of ours on one of its own thread-pool threads, and that callback, on a thread the Pony runtime doesn't own or manage, is what sends the asio event to the actor (BindIoCompletionCallback, socket.c:516-520, 541-544). There is no Pony-owned waiting point for sockets at all: no completion port of ours, no loop pulling socket events (there is no CreateIoCompletionPort or GetQueuedCompletionStatus call anywhere in the runtime). Every socket event the runtime delivers on Windows is sent from a thread the runtime doesn't manage.
That architecture has consequences independent of any switch:
- A refcount/dead-flag token protocol exists (iocp_token_t, event.h:11-31) so that an iocp_callback arriving after pony_asio_event_destroy doesn't use freed memory. The protocol spans event.c, socket.c:214-302, and a release hook in actor.c:477-498. Foreign-thread sends also require pony_register_thread() on thread-pool threads (event.c:165-179).
- WSASendTo is issued fire-and-forget with a NULL event (socket.c:456-481, 376-379). If a send fails after queuing, nothing is reported. No error, no backpressure.
- Only one AcceptEx is outstanding at a time, and the C failure path closes the pre-created socket and sends no event (socket.c:336-340); a failed accept silently stops the listener accepting. (The stdlib's ns == -1 handling branch is unreachable from the current C.)
- While a read is in flight, the kernel writes into an iso read buffer; an in-flight UDP read additionally writes a NetAddress object. The teardown sequencing in tcp_connection and udp_socket exists to order close around these in-flight kernel writes. This is the retention window from #5461 (Round 2 plan: honest FFI reference capabilities for stdlib + tools (#4925)): the window no capability can describe.
- ASIO_ONESHOT (a subscription mode where an event fires once, then stays disarmed until explicitly re-armed) is ignored; pony_asio_event_resubscribe_read/_write don't exist in iocp.c (the stdlib's calls are inside ifdef not windows); the readable/writeable event fields are never touched by the backend. The stdlib compensates by requesting different event flags per platform.
An alternative: keep IOCP and transfer buffer ownership explicitly
There is a way to fix the memory-unsafety while staying on completions, and it belongs in this discussion because it's today's model with the ownership transfer made explicit. Completion I/O's contract is ownership transfer: the buffer belongs to the kernel from submission until completion. Expressed properly, a read would consume the buffer at submit, the runtime would own and root it while in flight, and the completion would hand it back. That's where Rust's io_uring runtimes landed (tokio-uring's reads take a Vec<u8> by value and the completion returns it), and it works. For us it means runtime machinery: completions that can carry an object back (today's _event_notify, the behavior the runtime invokes on an actor to deliver an asio event, carries only an event id and two U32s) and runtime rooting of in-flight buffers. But it's buildable.
I don't think it's sufficient, because it fixes only half the problem. The buffer stops being a safety hazard, but IOCP stays semantically different from epoll and kqueue, and every user writing OS-agnostic ASIO code still has to handle that difference: completion model on Windows, readiness model everywhere else, two shapes of event handling in every cross-platform library that touches the runtime's event system. The point of taking complexity into the runtime is an easier model for users, and the round-trip design adds the runtime complexity but still leaves users with two models. Readiness everywhere fixes the same unsafety and removes the split: one ASIO contract on every platform, the easiest path for writing OS-agnostic event-driven Pony. The trade-off is stated below: the floor would become Windows 11.
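For reference, the ownership round trip of that alternative can be modeled in a few lines of C. This is only a sketch: buffer_t, read_op_t, read_submit, and read_complete are hypothetical names, and C can't enforce the hand-off the way consuming an iso would in Pony, so the sketch emulates it by emptying the caller's handle at submit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types: a buffer handle and an in-flight read operation. */
typedef struct {
  char *data;
  size_t cap;
} buffer_t;

typedef struct {
  buffer_t held;  /* buffer owned by the operation while in flight */
  bool in_flight;
} read_op_t;

/* Submit: the operation takes the buffer and the caller's handle is
 * emptied, so the caller has nothing left to touch while the kernel may
 * still be writing into the memory. */
void read_submit(read_op_t *op, buffer_t *buf)
{
  assert(op != NULL && buf != NULL);
  assert(!op->in_flight && buf->data != NULL);

  op->held = *buf;
  op->in_flight = true;
  buf->data = NULL;
  buf->cap = 0;
}

/* Complete: ownership comes back to the caller with the completion. */
buffer_t read_complete(read_op_t *op)
{
  assert(op != NULL && op->in_flight);

  buffer_t out = op->held;
  op->held.data = NULL;
  op->held.cap = 0;
  op->in_flight = false;
  return out;
}
```

The held field is the part that costs runtime machinery: something has to keep that memory rooted for as long as in_flight is true.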
What readiness means for us
The cross-platform contract a readiness backend must implement already exists and is already exercised: it's exactly what epoll.c and kqueue.c do. One-shot events, resubscribe_read/resubscribe_write to re-arm, set_readable/set_writeable state, reads and writes as synchronous non-blocking calls that run until WOULDBLOCK. The POSIX read loop in tcp_connection (_pending_reads, lines 978-1059) is the spec; the Windows backend has to implement the semantics that loop depends on.
The work splits into three layers:
- Runtime, socket layer (socket.c): sockets become non-blocking. Today no Windows socket is; they're blocking-but-overlapped, and not one FIONBIO exists in the tree. pony_os_recv/recvfrom/send/writev lose their overlapped Windows variants and become what they are on POSIX: synchronous calls returning OK/RETRY/ERROR. Accept drops the AcceptEx machinery (pre-created duplicated sockets, SO_UPDATE_ACCEPT_CONTEXT, the extension-pointer fetch) for a plain accept-on-readiness loop. Connect drops ConnectEx and its mandatory pre-bind for connect-plus-poll-for-writable, which also removes the path where a connect completion is reported to the stdlib as a writeable event, and the SO_CONNECT_TIME special case the stdlib uses to tell a successful connect from a failed one (both arrive as that same event today).
- Runtime, event delivery: something has to turn "socket became ready" into the ASIO_READ/ASIO_WRITE events the stdlib consumes, with one-shot semantics and re-arm. This is where the three implementation candidates below differ.
- Stdlib (packages/net): the Windows halves get deleted, not rewritten. The inventory says roughly: in tcp_connection, the dual pending-write buffers and _pending_sent, _complete_writes, _complete_reads, _queue_read, the dual writev FFI declarations (WSABUF vs iovec tuple order), the per-platform event-flag selection, and the connect special case all go; the POSIX loops become universal. In udp_socket, _read_from, _complete_reads, _start_next_read, and the deferred-unsubscribe close sequence go. In tcp_listener, the completion-driven accept collapses into the POSIX loop. The foreign-thread token subsystem in the runtime becomes deletable too, if event sends move onto the asio thread (see "Who waits on the port" below).
What doesn't change: everything on the Windows asio thread that isn't socket I/O, because none of it goes through IOCP today. Timers are waitable-timer callbacks delivered straight to the asio thread; signals are CRT signal handlers; stdin is a console handle the asio thread waits on directly; and process pipes never touched sockets or asio events at all (they poll named pipes on a timer). A change to how sockets are handled doesn't reach any of it. The one shared obligation is noisy-event accounting (the count of live I/O subscriptions that keeps the runtime from shutting down while I/O is outstanding), which must be preserved identically.
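As a rough illustration of the socket-layer contract described in the first layer above (synchronous calls returning OK/RETRY/ERROR, run until would-block), here is a minimal POSIX C sketch. The names try_recv, drain, and the RES_* codes are made up for this example; they are not the signatures of pony_os_recv or anything else in the runtime.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical tri-state result, mirroring the OK/RETRY/ERROR shape. */
typedef enum { RES_OK, RES_RETRY, RES_ERROR } sock_result_t;

/* One synchronous, non-blocking receive. On RES_OK, *got holds the byte
 * count; RES_OK with *got == 0 means the peer closed the connection. */
sock_result_t try_recv(int fd, char *buf, size_t len, size_t *got)
{
  assert(buf != NULL && got != NULL);
  *got = 0;

  ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
  if (n >= 0) {
    *got = (size_t)n;
    return RES_OK;
  }
  if (errno == EAGAIN || errno == EWOULDBLOCK)
    return RES_RETRY; /* nothing to read now: re-arm and wait for an event */
  return RES_ERROR;
}

/* Read until the kernel reports would-block, the peer closes, or an error
 * occurs. *total accumulates the bytes seen. A real loop would hand each
 * chunk on before reusing buf; this sketch only counts. */
sock_result_t drain(int fd, char *buf, size_t len, size_t *total)
{
  assert(total != NULL);
  *total = 0;

  for (;;) {
    size_t got;
    sock_result_t r = try_recv(fd, buf, len, &got);
    if (r != RES_OK)
      return r;       /* RES_RETRY or RES_ERROR */
    if (got == 0)
      return RES_OK;  /* orderly close by the peer */
    *total += got;
  }
}
```

The buffer is only ever written inside recv, which is the property the rest of the proposal leans on.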
Three ways to get readiness on Windows
1. ProcessSocketNotifications: the documented one. Since Windows Server 2022 / Windows 11 (build 20348), Winsock has a documented socket-readiness API: register sockets with a completion port you own, receive SOCK_NOTIFY_EVENT_IN/OUT/HANGUP/ERR as ordinary completion packets (the event mask is carried in the bytes-transferred field), dequeue with GetQueuedCompletionStatusEx. It supports level-triggered one-shot registration (level-triggered: notified whenever the condition holds, not only on the transition into it), which is Microsoft's recommended pattern for multithreaded servers and exactly the shape of our ASIO_ONESHOT contract. Lifecycle is documented and manageable: one port per socket, deregister-and-wait-for-REMOVE before freeing context, closesocket auto-deregisters without a REMOVE so event lifetimes need care. Our README's supported-Windows is Windows 11 only and CI runs windows-11/windows-2025, so every supported target has this API. The catch: Windows 10 (currently best-effort, x86) doesn't have it; the consequence is covered below.
2. AFD poll, wepoll-style: undocumented, widely shipped. IOCTL_AFD_POLL (0x12024) issued via NtDeviceIoControlFile against helper handles opened on \Device\Afd, completions delivered to a port, NtCancelIoFileEx to cancel. The event vocabulary is richer than candidate 1 (POLL_RECEIVE, POLL_SEND, POLL_DISCONNECT, POLL_ABORT, POLL_ACCEPT, POLL_CONNECT_FAIL, POLL_LOCAL_CLOSE), it works back to Vista, and it's the core mechanism under wepoll, libuv, and mio: years of production use of an API Microsoft never documented. Implementation details are well mapped: base-socket unwrapping via SIO_BASE_HANDLE (with fallbacks, because some layered service providers, third-party software inserted into the Winsock stack, break it), helper-handle pooling (wepoll runs 32 sockets per AFD handle), one-shot-by-nature with explicit re-arm. And calling ntdll costs us nothing new at link time: every compiled Pony program already links ntdll.lib (it's in the compiler's default link line). One limitation matters below: an AFD poll reports "the condition holds now," which is level-triggered behavior. Persistent edge-triggered behavior has no clean expression on it. That's not my inference; it's wepoll's own documented limitation ("Edge-triggered (EPOLLET) mode isn't supported"), unchanged across years of production use.
3. AFD poll per-socket: the simplified undocumented one. Len Holgate demonstrated in 2024 that the same IOCTL_AFD_POLL can be issued on the socket's own handle, skipping the shared helper handles and their bookkeeping entirely. Same undocumented dependency, less machinery. Less production history than wepoll's approach.
Which mechanism: ProcessSocketNotifications
My choice is ProcessSocketNotifications, for one reason that outweighs everything else: it's the only option that gives users the same affordances on Windows that epoll and kqueue give them everywhere else, and that is incredibly important. The asio subsystem isn't just plumbing for packages/net; it's a runtime interface user code builds on. On Linux and the BSDs, a subscription without ASIO_ONESHOT gets persistent edge-triggered semantics (EPOLLET/EV_CLEAR: notified once each time the socket becomes ready, with no re-arm needed); with it, one-shot plus re-arm (notified once, then nothing until explicitly re-armed). Offering that same surface on Windows is the same principle as the memory-safety argument: the runtime takes the complexity on so users get the model. ProcessSocketNotifications supports one-shot, persistent, level, and edge triggering natively: everything the POSIX backends offer. The AFD route cannot do edge. None of the implementations built on it support EPOLLET (wepoll, the reference for the technique, documents the limitation outright; the Rust libraries that reimplemented the technique share it), and it appears that making it happen would be painful and bug-prone. The only thing that pain would buy is Windows 10 support.
For the stdlib's own usage the candidates are equivalent (the stdlib subscribes one-shot and re-arms after each event, and that pattern maps onto either mechanism with the same number of syscalls), so user affordances are the deciding criterion.
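The one-shot-plus-re-arm pattern the stdlib relies on can be modeled without any OS calls. The following C sketch is a toy state machine, not backend code: oneshot_sub_t and the sub_* functions are hypothetical names. It models a level-triggered one-shot subscription, where delivery disarms the subscription and a re-arm fires again immediately if the condition still holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a level-triggered one-shot subscription. */
typedef struct {
  bool armed;    /* the next notification will be delivered */
  bool ready;    /* the condition currently holds (e.g. data is readable) */
  int delivered; /* events sent to the subscriber so far */
} oneshot_sub_t;

/* Deliver an event if the subscription is armed and the condition holds. */
static void deliver_if_due(oneshot_sub_t *s)
{
  if (s->armed && s->ready) {
    s->armed = false; /* one-shot: disarm on delivery */
    s->delivered++;
  }
}

/* A new subscription starts armed with the condition not yet holding. */
void sub_init(oneshot_sub_t *s)
{
  assert(s != NULL);
  s->armed = true;
  s->ready = false;
  s->delivered = 0;
}

/* The kernel-side condition changed (or was observed again). */
void sub_set_ready(oneshot_sub_t *s, bool ready)
{
  assert(s != NULL);
  s->ready = ready;
  deliver_if_due(s);
}

/* Re-arm: level-triggered, so this fires at once if still ready. */
void sub_rearm(oneshot_sub_t *s)
{
  assert(s != NULL);
  s->armed = true;
  deliver_if_due(s);
}
```

A persistent edge-triggered subscription would differ in both places: it would never disarm, and it would deliver only when ready changes from false to true.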
The trade-off: same functionality for user code to interface with the runtime on all supported platforms — but Windows 11 and Windows Server 2022 become the floor (the API appears in build 20348). Our supported-platforms list is already Windows 11 only and CI runs Windows Server 2025, so nothing changes for supported targets. With this change, Windows 10 support is dropped entirely. It's a best-effort tier today (x86 only); every Windows release before the required API existed falls below the floor: no fallback path, no degraded mode, and the platform documentation changes accordingly in the same PR. Wepoll-style AFD is the fallback if the documented API turns out to have problems in practice, at the cost of the edge mode.
Who waits on the port
With ProcessSocketNotifications, readiness notifications arrive as packets on a completion port. Some thread has to wait on that port, pull packets off, and send the corresponding ASIO_READ/ASIO_WRITE events to actors. Windows offers two ways to arrange that:
- A Pony-owned port. The asio thread waits on the port (GetQueuedCompletionStatusEx) instead of blocking on its current two handles: the same job it does today, and the same job the asio thread does on Linux sitting in epoll_wait. When a packet arrives, it wakes, reads which socket became ready for what, and sends the asio event, all from the one thread the runtime already dedicates to this. Two details need engineering, both manageable. Timers: today they're delivered to the asio thread as APCs, which Windows only delivers while a thread is in an "alertable" wait; the port wait has an alertable variant, so timers keep working unchanged. Stdin: a console handle can't be attached to a completion port, so stdin needs a side path; Windows can watch the handle and post a packet to the port when input arrives (RegisterWaitForSingleObject).
- Windows' thread pool. Windows keeps running our callback on its own thread-pool threads, as it does today (BindIoCompletionCallback). Less change, but our code keeps running on threads the runtime doesn't manage.
I think the Pony-owned port is the right call, and the ground is correctness. When our code runs on Windows' threads, an event can be delivered at the same moment the runtime is tearing that event down. We know those races are real, because the refcount/dead-flag machinery described earlier exists specifically to manage them. With the asio thread waiting on its own port, exactly one thread touches events (the same arrangement every POSIX backend has), so those races can't happen, and the machinery that managed them gets deleted instead of maintained. ProcessSocketNotifications requires a caller-created port anyway, so the architecture I'd want and the one the API requires are the same.
User-visible changes
- throttled/unthrottled fire on actual socket writability on every platform, not a 16-in-flight heuristic.
- A TCPConnection can be muted to stop reading. Today Windows notices a peer close while muted (the queued read surfaces it) and POSIX doesn't (documented in tcp_connection.pony:197-202). Readiness makes Windows match POSIX. If we'd rather keep close-detection-while-muted, it looks doable on both platforms under readiness, but that's a behavior decision, not an implementation detail.
- \exhaustive\ match arms in the net stdlib that are documented "unreachable on Windows" become reachable; the _SocketResultDecoder OK/Retry/Error contract means the same thing everywhere.
What this does for memory safety
Under readiness, recv and recvfrom write only during the call (the kernel never retains a pointer past return), so reference capabilities can describe everything that happens to socket memory, on every platform. The in-flight-kernel-write teardown sequencing goes away, no GC-managed memory gets written outside any capability's description, and the safety-by-discipline requirement on Windows disappears: there is no in-flight buffer left to be careful with.