Skip to content

Fix potential broken state after transaction timeout#14455

Open
chemwolf6922 wants to merge 4 commits intomicrosoft:masterfrom
chemwolf6922:fix-broken-state-after-transaction-timeout
Open

Fix potential broken state after transaction timeout#14455
chemwolf6922 wants to merge 4 commits intomicrosoft:masterfrom
chemwolf6922:fix-broken-state-after-transaction-timeout

Conversation

@chemwolf6922
Copy link

@chemwolf6922 chemwolf6922 commented Mar 17, 2026

Summary of the Pull Request

In this issue #14193, the user's log shows a channel with a broken state. Where after a previous transaction timeout. The expected message id got incremented. But the sender side did not use that message id at all. So it kept sending messages with N-1 id. Causing the channel to be locked completely.
image

This is because the original expected id got incremented before any message is received. So the naive fix to this would be moving the ++ after actually received the message. But that causes another problem where the sender might eventually reply after the timeout. And the channel will still end up in a broken state.

So I started guessing the purpose of this exact id matching. And seems me that (please correct me if I'm wrong):

  1. Ensure that the replied message matches the request in a transaction.
  2. The exact match, as a side effect, checks if the message header is malformed. (This might just be me thinking too much...)

To ensure all these can be achieved when solving this broken state issue, I made the following changes:

  1. Echo back the request id in the reply, instead of tracking the ids separately, to avoid id desync.
  2. When receiving the message, skip any that's older than the anticipated reply. So the previous timeout but actually delivered messages do not break the state.
  3. (This is probably redundant). Add a random magic to the message header so the implicit message header format check from the id is kept. (Since the echo back logic cannot serve this role anymore).

For the echo back part. The normal practice would be tie a message id to each request when it's being handled. And use that to send the reply. But since:

  1. That will require too much change in the code base and I'm afraid I may break things. And
  2. The channels in WSL are unidirectional (one end requests, the other replies / silently receives). And not concurrent.

I decided to use the latest received request id for the current reply. Please see the comment in the code for more details on this logic. This avoids passing the id to each handler.

This should solve the later part of the original issue, where wsl is stuck in a broken state and require restart to be fixed. But will not solve the initial problem. Where a sleep may cause this message timeout.

PR Checklist

Detailed Description of the Pull Request / Additional comments

Validation Steps Performed

Failed unit tests


UnitTests::UnitTests::Warnings [Failed]
Error: Verify: AreEqual(expectedWarnings, warnings) - Values (, wsl: Due to a current compatibility issue with Global Secure Access Client, DNS Tunneling is disabled.
) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::Warnings::<lambda_1>::operator (), Line: 2022]

UnitTests::UnitTests::KernelModules [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D9868B9) Exception(57) tid(8178) 8000FFFF Catastrophic failure
    Msg:[Command "C:\WINDOWS\system32\wsl.exe echo ok"returned unexpected exit code (0 != -1). Stdout: 'ok
'Stderr: 'wsl: The .wslconfig setting 'wsl2.kernel' is disabled by the computer policy.
'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::VersionFlavorParsing [Failed]
WSL1 is disabled by the computer policy.
Please run 'wsl.exe --set-version tmpdistro 2' to upgrade to WSL2.
Error code: Wsl/Service/CreateInstance/WSL_E_WSL1_DISABLED
Error: Verify: AreEqual(LxsstuLaunchWsl(std::format(L"-d {} cat /etc/os-release || true", Distro).c_str()), 0L) - Values (4294967295, 0) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::VersionFlavorParsing::<lambda_1>::operator (), Line: 3924]

UnitTests::UnitTests::CaseSensitivity [Failed]
Verify: IsFalse(getCaseSensitivity(std::format(L"{}/l1/l2/l3-other", testDir)))
Error: Caught std::exception: D:\workspace\WSL\src\windows\common\filesystem.cpp(339)\wsltests.dll!00007FFD0D2E0169: (caller: 00007FFD0D07A634) Exception(11) tid(1a18) C0000101     [`anonymous-namespace'::EnsureCaseSensitiveDirectoryRecursive::<lambda_1>::operator ()(NtSetInformationFile(Directory, &IoStatus, &CaseInfo, sizeof(CaseInfo), FileCaseSensitiveInformation))]

UnitTests::UnitTests::CustomModulesVhd [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D987DAF) Exception(1) tid(3f90) 8000FFFF Catastrophic failure
    Msg:[Command "Powershell -NoProfile -Command "$acl = Get-Acl 'D:\workspace\WSL\test-modules.vhd' ; $acl.RemoveAccessRuleAll((New-Object System.Security.AccessControl.FileSystemAccessRule(\"Everyone\", \"Read\", \"None\", \"None\", \"Allow\"))); Set-Acl -Path 'D:\workspace\WSL\test-modules.vhd' -AclObject $acl""returned unexpected exit code (1 != 0). Stdout: ''Stderr: 'Get-Acl : The 'Get-Acl' command was found in the module 'Microsoft.PowerShell.Security', but the module could not be loaded. For more informa
tion, run 'Import-Module Microsoft.PowerShell.Security'.
At line:1 char:8
+ $acl = Get-Acl 'D:\workspace\WSL\test-modules.vhd' ; $acl.RemoveAcces ...
+        ~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Get-Acl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

You cannot call a method on a null-valued expression.
At line:1 char:54
+ ... ules.vhd' ; $acl.RemoveAccessRuleAll((New-Object System.Security.Acce ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvokeMethodOnNull

Set-Acl : The 'Set-Acl' command was found in the module 'Microsoft.PowerShell.Security', but the module could not be loaded. For more informa
tion, run 'Import-Module Microsoft.PowerShell.Security'.
At line:1 char:190
+ ... sRule("Everyone", "Read", "None", "None", "Allow"))); Set-Acl -Path ' ...
+                                                           ~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Set-Acl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::BrokenDistroImport [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D987DAF) Exception(1) tid(13dc8) 8000FFFF Catastrophic failure
    Msg:[Command "Powershell -NoProfile -Command "New-Vhd EmptyVhd.vhdx  -SizeBytes 20MB""returned unexpected exit code (1 != 0). Stdout: ''Stderr: 'New-Vhd : Failed to create the virtual hard disk.
The system failed to create 'D:\workspace\WSL\EmptyVhd.vhdx'.
Failed to create the virtual hard disk.
The system failed to create 'D:\workspace\WSL\EmptyVhd.vhdx': The file exists. (0x80070050).
At line:1 char:1
+ New-Vhd EmptyVhd.vhdx  -SizeBytes 20MB
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [New-VHD], VirtualizationException
    + FullyQualifiedErrorId : OperationFailed,Microsoft.Vhd.PowerShell.Cmdlets.NewVhd

'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::WslDebug [Failed]
wsl: The .wslconfig setting 'wsl2.kernelCommandLine' is disabled by the computer policy.
Error: Verify: AreEqual(LxsstuLaunchWsl(L"dmesg | grep -iF 'vmbus_send_tl_connect_request'"), 0L) - Values (1, 0) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::WslDebug, Line: 6425]

UnitTests::UnitTests::CGroupv1 [Failed]
Error: Verify: AreEqual(out, expected) - Values (/sys/fs/cgroup/unified cgroup2 cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate
, ) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::CGroupv1::<lambda_1>::operator (), Line: 6435]

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a broken channel state that occurs after a transaction timeout in WSL's socket-based IPC protocol. The issue (#14193, #14055) manifests after laptop sleep/hibernate, where a channel's expected sequence number gets desynchronized, causing all subsequent communication to fail until wsl --shutdown.

Changes:

  • Replace independent sender/receiver sequence counters with an echo-back mechanism: the responder echoes back the request's sequence number in its reply, preventing desync after timeouts.
  • Add a magic number field to MESSAGE_HEADER for early framing corruption detection, and skip stale (timed-out) replies in the receive loop.
  • Zero-initialize a Reply union in binfmt.cpp to ensure the new MessageMagic default initializer doesn't cause issues with raw read() calls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/shared/inc/lxinitshared.h Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for new struct size.
src/shared/inc/SocketChannel.h Rewrote send/receive sequence logic to echo-back model; added stale message skipping loop; replaced m_received_messages with m_expected_reply_sequence / m_pending_reply_sequence.
src/shared/inc/socketshared.h Added magic number validation in RecvMessage before processing header.
src/linux/init/binfmt.cpp Zero-initialized Reply union to handle new MessageMagic default member initializer.

You can also share your feedback on Copilot code review. Take the survey.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 17, 2026 09:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a broken channel state issue in WSL's SocketChannel that occurs after a transaction timeout (e.g., when resuming from sleep). Previously, a timeout would increment the expected message ID on the receiver side, but the sender wouldn't use that incremented ID, causing a permanent ID desync and locking the channel. The fix replaces independent sequence tracking with an echo-back mechanism where the responder echoes back the request's sequence number in its reply, and the requester skips stale replies from previously timed-out transactions.

Changes:

  • Added a magic number field to MESSAGE_HEADER and validated it in RecvMessage to detect framing corruption early.
  • Replaced independent sequence counters with an echo-back sequence mechanism in SocketChannel using m_expected_reply_sequence and m_pending_reply_sequence, with a loop to skip stale replies.
  • Zero-initialized a union in binfmt.cpp to ensure the new MessageMagic field is properly initialized when reading responses.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/shared/inc/lxinitshared.h Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for LX_GNS_SET_PORT_LISTENER size.
src/shared/inc/socketshared.h Added magic number validation in RecvMessage after reading the header.
src/shared/inc/SocketChannel.h Replaced send/receive sequence tracking with echo-back mechanism; added stale-reply skip loop; removed sequence parameter from ValidateMessageHeader.
src/linux/init/binfmt.cpp Zero-initialized Reply union to ensure MessageMagic defaults correctly.

You can also share your feedback on Copilot code review. Take the survey.

Feng Wang added 2 commits March 17, 2026 17:31
…om:chemwolf6922/WSL into fix-broken-state-after-transaction-timeout
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants