Fix potential broken state after transaction timeout #14455
chemwolf6922 wants to merge 4 commits into microsoft:master
Pull request overview
This PR fixes a broken channel state that occurs after a transaction timeout in WSL's socket-based IPC protocol. The issue (#14193, #14055) manifests after laptop sleep/hibernate, where a channel's expected sequence number gets desynchronized, causing all subsequent communication to fail until wsl --shutdown.
Changes:
- Replace independent sender/receiver sequence counters with an echo-back mechanism: the responder echoes back the request's sequence number in its reply, preventing desync after timeouts.
- Add a magic number field to `MESSAGE_HEADER` for early framing corruption detection, and skip stale (timed-out) replies in the receive loop.
- Zero-initialize a `Reply` union in `binfmt.cpp` to ensure the new `MessageMagic` default initializer doesn't cause issues with raw `read()` calls.
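The magic-number check in the list above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `DemoHeader`, `kDemoMagic` (including its value), `ParseResult`, and `ParseHeader` are hypothetical names, and the real `MESSAGE_HEADER` layout is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical magic value, chosen only for this sketch.
constexpr std::uint32_t kDemoMagic = 0x4D534721u;

// Hypothetical header; the real MESSAGE_HEADER has a different layout.
struct DemoHeader
{
    std::uint32_t magic = kDemoMagic; // default initializer, as the PR describes for MessageMagic
    std::uint32_t type = 0;
    std::uint32_t size = 0;
    std::uint32_t sequence = 0;
};

// memcpy from a raw byte buffer is only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<DemoHeader>::value, "header must be trivially copyable");

enum class ParseResult
{
    Ok,
    TooShort,
    BadMagic
};

// Copies a header out of a raw receive buffer and rejects it early if the
// magic field does not match, before any other field is trusted.
inline ParseResult ParseHeader(const unsigned char* buffer, std::size_t length, DemoHeader& out)
{
    assert(buffer != nullptr);
    if (length < sizeof(DemoHeader))
    {
        return ParseResult::TooShort;
    }

    std::memcpy(&out, buffer, sizeof(DemoHeader));
    if (out.magic != kDemoMagic)
    {
        return ParseResult::BadMagic; // framing is corrupted; do not interpret the rest
    }

    return ParseResult::Ok;
}
```

Checking the magic before the size or sequence fields means a misaligned read fails immediately instead of being interpreted as a plausible-looking header.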
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/shared/inc/lxinitshared.h` | Added `Magic` constant and `MessageMagic` field to `MESSAGE_HEADER`; updated `static_assert` for new struct size. |
| `src/shared/inc/SocketChannel.h` | Rewrote send/receive sequence logic to echo-back model; added stale message skipping loop; replaced `m_received_messages` with `m_expected_reply_sequence` / `m_pending_reply_sequence`. |
| `src/shared/inc/socketshared.h` | Added magic number validation in `RecvMessage` before processing header. |
| `src/linux/init/binfmt.cpp` | Zero-initialized `Reply` union to handle new `MessageMagic` default member initializer. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR fixes a broken channel state issue in WSL's SocketChannel that occurs after a transaction timeout (e.g., when resuming from sleep). Previously, a timeout would increment the expected message ID on the receiver side, but the sender wouldn't use that incremented ID, causing a permanent ID desync and locking the channel. The fix replaces independent sequence tracking with an echo-back mechanism where the responder echoes back the request's sequence number in its reply, and the requester skips stale replies from previously timed-out transactions.
Changes:
- Added a magic number field to `MESSAGE_HEADER` and validated it in `RecvMessage` to detect framing corruption early.
- Replaced independent sequence counters with an echo-back sequence mechanism in `SocketChannel` using `m_expected_reply_sequence` and `m_pending_reply_sequence`, with a loop to skip stale replies.
- Zero-initialized a union in `binfmt.cpp` to ensure the new `MessageMagic` field is properly initialized when reading responses.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/shared/inc/lxinitshared.h` | Added `Magic` constant and `MessageMagic` field to `MESSAGE_HEADER`; updated `static_assert` for `LX_GNS_SET_PORT_LISTENER` size. |
| `src/shared/inc/socketshared.h` | Added magic number validation in `RecvMessage` after reading the header. |
| `src/shared/inc/SocketChannel.h` | Replaced send/receive sequence tracking with echo-back mechanism; added stale-reply skip loop; removed sequence parameter from `ValidateMessageHeader`. |
| `src/linux/init/binfmt.cpp` | Zero-initialized `Reply` union to ensure `MessageMagic` defaults correctly. |
…om:chemwolf6922/WSL into fix-broken-state-after-transaction-timeout
Summary of the Pull Request
In issue #14193, the user's log shows a channel in a broken state. After an earlier transaction timed out, the expected message ID was incremented, but the sender side never used that ID, so it kept sending messages with ID N-1. This locked the channel completely.

This happens because the expected ID is incremented before any message is received. The naive fix would be to move the `++` to after a message has actually been received. But that causes another problem: the sender might eventually reply after the timeout, and the channel would still end up in a broken state.
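Both failure modes can be reproduced in a toy model. The sketch below is a simplification under stated assumptions, not the actual `SocketChannel` code: `Message`, `IndependentSender`, `IndependentReceiver`, and the two helper functions are hypothetical names, an in-memory `std::deque` stands in for the socket, and an empty queue stands in for a timeout.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Message
{
    std::uint32_t sequence;
    int payload;
};

// Peer that numbers outgoing messages with its own independent counter.
struct IndependentSender
{
    std::uint32_t next;
    Message Send(int payload) { return Message{next++, payload}; }
};

// Receiver that tracks the expected ID independently of the sender.
struct IndependentReceiver
{
    std::uint32_t expected;
    bool increment_before_receive; // true models the old ordering, false the naive fix

    // Returns true when a message with the expected ID was received.
    // An empty wire models a timeout.
    bool Receive(std::deque<Message>& wire, Message& out)
    {
        const std::uint32_t want = expected;
        if (increment_before_receive)
        {
            ++expected; // old ordering: advances even if nothing arrives
        }
        if (wire.empty())
        {
            return false; // timeout
        }
        out = wire.front();
        wire.pop_front();
        if (out.sequence != want)
        {
            return false; // ID mismatch
        }
        if (!increment_before_receive)
        {
            ++expected; // naive fix: advance only after a real receive
        }
        return true;
    }
};

// One receive times out (the peer sent nothing), then the peer sends
// `attempts` messages. Returns how many of them the receiver accepts.
inline int SuccessfulReceivesAfterTimeout(bool increment_before_receive, int attempts)
{
    IndependentSender peer{0};
    IndependentReceiver receiver{0, increment_before_receive};
    std::deque<Message> wire;
    Message out{};
    receiver.Receive(wire, out); // times out
    int ok = 0;
    for (int i = 0; i < attempts; ++i)
    {
        wire.push_back(peer.Send(i));
        if (receiver.Receive(wire, out))
        {
            ++ok;
        }
    }
    return ok;
}

// Naive fix with a late reply: request 1 times out, its reply (payload 100)
// arrives late, followed by the reply to request 2 (payload 200).
// Returns the payload the receiver accepts as the answer to request 2.
inline int PayloadAcceptedAfterLateReply()
{
    IndependentSender peer{0};
    IndependentReceiver receiver{0, false};
    std::deque<Message> wire;
    Message out{};
    receiver.Receive(wire, out);    // request 1 times out
    wire.push_back(peer.Send(100)); // late reply to request 1
    wire.push_back(peer.Send(200)); // reply to request 2
    const bool accepted = receiver.Receive(wire, out);
    assert(accepted);
    (void)accepted;
    return out.payload;
}
```

In this model the old ordering accepts nothing after a single timeout, while the naive fix recovers when the peer stays silent but hands the caller the stale payload 100 as the answer to the second request when the peer replies late.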
So I started guessing at the purpose of this exact ID matching. It seems to me that (please correct me if I'm wrong):
To ensure all of these can be achieved while solving this broken state issue, I made the following changes:
For the echo-back part: the normal practice would be to tie a message ID to each request while it is being handled, and use that ID to send the reply. But since:
I decided to use the latest received request ID for the current reply. Please see the comment in the code for more details on this logic. This avoids passing the ID to each handler.
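A rough sketch of the echo-back idea with stale-reply skipping follows. It is an illustration under assumptions, not the code in `SocketChannel.h`: the `Requester` and `Responder` classes and their members are hypothetical, a `std::deque` stands in for the socket, an empty queue stands in for a timeout, and sequence wraparound is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Message
{
    std::uint32_t sequence;
    int payload;
};

// Responder side: remembers the ID of the latest request it received and
// echoes that ID in the reply, so handlers never need to be passed the ID.
class Responder
{
public:
    void OnRequest(const Message& request) { latest_request_sequence_ = request.sequence; }
    Message Reply(int payload) const { return Message{latest_request_sequence_, payload}; }

private:
    std::uint32_t latest_request_sequence_ = 0;
};

// Requester side: stamps each request with a fresh ID and only accepts a
// reply that echoes the ID of the request currently in flight.
class Requester
{
public:
    enum class Result
    {
        Ok,
        Timeout,
        Unexpected
    };

    Message MakeRequest(int payload)
    {
        expected_reply_sequence_ = next_sequence_++;
        return Message{expected_reply_sequence_, payload};
    }

    Result ReceiveReply(std::deque<Message>& wire, Message& out)
    {
        while (!wire.empty())
        {
            out = wire.front();
            wire.pop_front();
            if (out.sequence == expected_reply_sequence_)
            {
                return Result::Ok;
            }
            if (out.sequence > expected_reply_sequence_)
            {
                return Result::Unexpected; // a reply to a request that was never sent
            }
            // Older ID: a reply to a transaction that already timed out. Skip it.
        }
        return Result::Timeout;
    }

private:
    std::uint32_t next_sequence_ = 0;
    std::uint32_t expected_reply_sequence_ = 0;
};
```

With this shape, a late reply to a timed-out request carries an older ID and is discarded by the loop, so the reply to the current request is still matched correctly.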
This should solve the latter part of the original issue, where WSL is stuck in a broken state and requires a restart to recover. It will not solve the initial problem, where a sleep may cause the message timeout.
PR Checklist
Closes: #14193 (WSL2 crashes on waking up from sleep), #14055 (WSL 2.6.3.0: Terminal crash after hibernation/sleep with [process exited with code 1]), #14014 (Error code: Wsl/Service/E_UNEXPECTED)
Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
Tests: Added/updated if needed and all pass
All but 8 unit tests pass. Six of the failures were caused by GPO settings or PowerShell issues. The other two (CGroupv1 and CaseSensitivity) failed but seem unrelated to the changes. I have appended logs for the failed tests at the end.
I'm also dogfooding this build right now.
Localization: All end user facing strings can be localized
Dev docs: Added/updated if needed
Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx
Detailed Description of the Pull Request / Additional comments
Validation Steps Performed
Failed unit tests