Fix WSLC exec hang on fast runc failure (e.g. invalid user/group) #40550
Merged
benhillis merged 1 commit on May 15, 2026
WSLCContainerImpl::Exec polls Docker's exec inspect endpoint after
StartExec to learn whether the user process is running or has already
failed. The Running branch was guarded by `state.Pid.has_value()`,
which is meaningless because Docker's wire schema declares Pid as a
non-nullable Go int that is 0 until runc forks the user process - so
the JSON always contains `"Pid": 0` and nlohmann always deserializes
that as `optional<int>(0)` with `has_value() == true`.
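
To illustrate why the presence check can never be false, here is a minimal sketch using only the standard library. `ParsePid`, `OldGuard` and `NewGuard` are hypothetical names introduced for illustration; `ParsePid` stands in for the JSON deserialization step and is not the nlohmann or WSLC code. It only mimics a field that is always present on the wire.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for deserializing Docker's "Pid" field.
// The field is always present in the response, so the optional is always engaged.
std::optional<int> ParsePid(int wireValue)
{
    return std::optional<int>(wireValue);
}

// The old guard: true for every response, including Pid == 0.
bool OldGuard(const std::optional<int>& pid)
{
    return pid.has_value();
}

// The corrected guard: true only once a real process has been forked.
bool NewGuard(int pid)
{
    return pid > 0;
}
```

With these definitions, `OldGuard(ParsePid(0))` is true while `NewGuard(0)` is false, which is the distinction the fix relies on.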
When runc fails before forking (e.g. `-u root:badgid`), Docker
briefly reports `{Running: true, Pid: 0, ExitCode: null}` in the
window between logging the error and running its deferred cleanup that
sets `Running=false, ExitCode=126`. The polling loop accepted Pid=0
as a valid PID, called SetPid(0), broke out, and returned the process
to wslc. wslc then waited on the exit event forever, because Docker
never emits an `exec_die` event when the user process never spawned.
Change InspectExec.Pid from `std::optional<int>` to `int` to match
the wire format, and check `state.Pid > 0` at the call site. With
this change the loop continues polling on Pid=0; on the next iteration
Docker has settled state and the existing ExitCode branch fires with
the correct exit code (126).
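
The corrected loop behavior can be sketched as follows. This is a simplified model under stated assumptions, not the WSLC implementation: `InspectExecState`, `ExecResult` and `PollExec` are hypothetical names, the exact branch conditions in the real code may differ, and a scripted vector of responses replaces the real HTTP polling and sleep.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical mirror of the fields the loop reads from exec inspect.
struct InspectExecState
{
    bool Running;
    int Pid;                     // plain int: 0 until the user process is forked
    std::optional<int> ExitCode; // nullable on the wire
};

// Hypothetical result: exactly one member is set when polling settles.
struct ExecResult
{
    std::optional<int> Pid;
    std::optional<int> ExitCode;
};

// Walks a scripted sequence of inspect responses in place of real polling.
ExecResult PollExec(const std::vector<InspectExecState>& responses)
{
    for (const auto& state : responses)
    {
        // Running branch: only a positive Pid means the process really exists.
        if (state.Running && state.Pid > 0)
        {
            return {state.Pid, std::nullopt};
        }

        // ExitCode branch: the exec finished (or failed) before we saw it running.
        if (!state.Running && state.ExitCode.has_value())
        {
            return {std::nullopt, state.ExitCode};
        }

        // Transient state such as {Running: true, Pid: 0}: keep polling.
    }

    return {};
}
```

Feeding it the transient response followed by the settled one, `{true, 0, std::nullopt}` then `{false, 0, 126}`, yields an exit code of 126 and no PID, instead of a PID of 0.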
Verified against the failing test
WSLCE2EContainerExecTests::WSLCE2E_Container_Exec_UserOption_InvalidGroup_Fails,
which is the regression test for this bug.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
Hit this test failure in the release pipeline - I suspect it's just a somewhat tight race window.
dkbennett approved these changes on May 15, 2026
OneBlue approved these changes on May 15, 2026
OneBlue (Collaborator) left a comment:
Great catch! We might also want to add an assert that Pid > 0 in SetPid() to avoid hanging if we ever hit something similar in the future.
benhillis added a commit that referenced this pull request on May 16, 2026
Defense-in-depth follow-up to PR #40550. The exec polling loop in WSLCContainerImpl::Exec now correctly filters Pid > 0 before calling SetPid, but a future caller that bypassed that check would silently hang the process wait (because Docker never emits exec_die for a process that never spawned). Assert at the lowest level so any such regression fires loudly in Debug builds.
Suggested by @OneBlue in the PR #40550 review.
Co-authored-by: Ben Hillis <benhill@ntdev.microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
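
A minimal sketch of what such a low-level assertion could look like. `ExecProcessControl` and `GetPid` are hypothetical, simplified stand-ins for the real process-control class, and the standard `assert` macro is used here as an assumption; the actual follow-up commit may use a different assertion macro.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified stand-in for the process-control object.
class ExecProcessControl
{
public:
    void SetPid(int pid)
    {
        // Active only when NDEBUG is not defined (typically Debug builds).
        assert(pid > 0 && "SetPid called before the user process was forked");
        m_pid = pid;
    }

    std::optional<int> GetPid() const
    {
        return m_pid;
    }

private:
    std::optional<int> m_pid; // empty until a real PID is known
};
```

Placing the check inside `SetPid` means every caller is covered, so a future code path that skips the `Pid > 0` filter fails immediately in Debug builds rather than hanging on an exit event that never arrives.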
Summary
Fixes a hang in `wslc container exec` when the exec'd process fails before runc forks it (e.g. `wslc container exec -u root:badgid id`). The wslc client hangs forever instead of returning exit code 126 and the "unable to find group badgid" error.

Root cause
`WSLCContainerImpl::Exec` polls Docker's exec inspect endpoint after `StartExec` to learn whether the user process is running or has already failed. The Running branch was guarded by `state.Pid.has_value()`.

This guard is meaningless because Docker's wire schema (`backend.ExecInspect` in moby) declares `Pid` as a non-nullable Go `int` that is 0 until runc forks the user process. Before the fork, the JSON response therefore always contains `"Pid": 0`, and nlohmann always deserializes that as `optional<int>(0)` with `has_value() == true`.

When runc fails before forking (invalid user/group, missing binary, etc.), Docker briefly reports `{"Running": true, "Pid": 0, "ExitCode": null}` in the small window between logging the error and running its deferred cleanup that sets `Running=false, ExitCode=126`. The polling loop:
- accepted `Pid=0` as a valid PID
- called `SetPid(0)`, broke out, and returned the process to wslc
- left wslc waiting on the exit event forever, because Docker never emits an `exec_die` event when the user process never spawned (containerd's process-exit event stream never fires)

Forensics
Confirmed via process dumps and ETL trace from a failing CloudTest run:
- `DockerExecProcessControl` in the wslcsession dump has `m_pid = 0` (with `has_value() == true`) and `m_exitedCode` unset; its exit event was never signaled.
- The trace shows `exec_create`, `exec_start`, a dockerd ERROR "unable to find group badgid", and one `GET /exec/{id}/json` returning 200. No `exec_die` ever.
- By contrast, in the `NameGroupRoot` case, where the user process actually runs, `exec_die` is emitted normally.

Fix
Change `InspectExec.Pid` from `std::optional<int>` to plain `int` to match the wire format (Go `int` is non-nullable), and check `state.Pid > 0` at the call site. With this change the loop continues polling on `Pid=0`; on the next 100ms iteration Docker has settled state and the existing `ExitCode` branch fires with the correct exit code 126.

`ExitCode` remains `std::optional<int>` because moby's `backend.ExecInspect.ExitCode` is `*int` (genuinely nullable).

Validation
The existing E2E test `WSLCE2EContainerExecTests::WSLCE2E_Container_Exec_UserOption_InvalidGroup_Fails` is the regression test for this bug. With the fix it should pass reliably; without it, it hangs until the test host is killed.