Fix system.vhd loss during failed MSI upgrade #40524
Pull request overview
This PR hardens WSL against unrecoverable failures when an MSI major upgrade fails mid-install by (1) moving old-product removal into the MSI transaction for rollback safety and (2) adding a dedicated, localized error path when required packaged VHD files (e.g., system.vhd, modules.vhd) are missing so users get an actionable message instead of a generic HCS failure.
Changes:
- Adjust WiX `MajorUpgrade` scheduling so `RemoveExistingProducts` runs inside the MSI transaction (rollback restores the previous install on failure).
- Introduce `WSL_E_SYSTEM_DISTRO_MISSING` and a localized message for missing packaged files.
- Add runtime existence checks in VM startup paths and wire the new HRESULT into common error-string handling.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| msipackage/package.wix.in | Schedules MajorUpgrade removal to occur inside the MSI transaction for rollback protection. |
| src/windows/service/inc/wslservice.idl | Adds new HRESULT WSL_E_SYSTEM_DISTRO_MISSING (0x33). |
| src/windows/service/exe/WslCoreVm.cpp | Replaces debug-only asserts with production checks that throw a user-facing localized error when packaged VHDs are missing. |
| src/windows/service/exe/HcsVirtualMachine.cpp | Adds packaged-file existence validation before attaching boot VHDs (currently without setting a user-facing message). |
| src/windows/common/wslutil.cpp | Adds the new HRESULT to common error mappings and returns a localized fallback string (currently hard-coded to system.vhd). |
| localization/strings/en-US/Resources.resw | Adds MessageSystemDistroMissing localized string resource. |
| THROW_HR_IF(WSL_E_SYSTEM_DISTRO_MISSING, !wsl::windows::common::filesystem::FileExists(rootVhdPath.c_str())); | ||
| } | ||
|
|
||
| THROW_HR_IF(WSL_E_SYSTEM_DISTRO_MISSING, !wsl::windows::common::filesystem::FileExists(kernelModulesPath.c_str())); |
By design — HcsVirtualMachine.cpp doesn't have ExecutionContext. The fallback in GetErrorString now uses GetMsiPackagePath() so users get a meaningful message.
| return Localization::MessageInvalidDistributionTar(); | ||
|
|
||
| case WSL_E_SYSTEM_DISTRO_MISSING: | ||
| return Localization::MessageSystemDistroMissing(L"system.vhd"); |
Fixed — now uses GetMsiPackagePath().value_or(L\) instead of hard-coding 'system.vhd'. WslCoreVm.cpp paths set the specific missing file path via THROW_HR_WITH_USER_ERROR_IF before this fallback is reached.
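The two-tier behaviour described in this reply (a specific message set at the throw site, otherwise a fallback built from the MSI package path) can be sketched roughly as follows. This is a minimal illustration and not the code in `wslutil.cpp`: `MissingDistroErrorString` and its parameters are hypothetical, `MessageSystemDistroMissing` is a stand-in for the localized string, and the empty-string fallback is a placeholder because the `value_or` argument is not legible in the comment above.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for Localization::MessageSystemDistroMissing.
inline std::wstring MessageSystemDistroMissing(const std::wstring& path) {
    return L"A required WSL file is missing: " + path;
}

// Prefer the message recorded at the throw site (the WslCoreVm.cpp paths);
// otherwise fall back to a message built from the package path, if known.
inline std::wstring MissingDistroErrorString(
    const std::optional<std::wstring>& userError,        // assumption: set by the thrower
    const std::optional<std::wstring>& msiPackagePath) { // assumption: may be empty
    if (userError.has_value()) {
        return *userError;
    }
    // Placeholder fallback value; the real argument is not shown in the thread.
    return MessageSystemDistroMissing(msiPackagePath.value_or(L""));
}
```

With this shape, callers that lack an execution context (as the reply says of `HcsVirtualMachine.cpp`) still produce a message that names the package location rather than a hard-coded file name.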
| THROW_HR_WITH_USER_ERROR_IF( | ||
| WSL_E_SYSTEM_DISTRO_MISSING, | ||
| Localization::MessageSystemDistroMissing(m_vmConfig.SystemDistroPath), | ||
| !wsl::windows::common::filesystem::FileExists(m_vmConfig.SystemDistroPath.c_str())); |
Manual runtime rollback test documented in PR comments validates the core scenario (locked file → install failure → rollback → VHDs intact). Automated test would require deleting system.vhd from a live install — happy to add in a follow-up PR.
|
Thanks for investigating, would it be possible to try to root cause the issue instead of a band-aid? A slightly better error message isn’t going to help users that get into this state. |
✅ Runtime Rollback Test PASSED
Ran the test on an admin terminal with the built MSI: Method: Locked Results:
Post-test: Log evidence: MSI rollback policy confirmed enabled ( This confirms |
|
@benhillis Good call — re-reading the PR description, it buries the root cause fix. Let me clarify: This PR does fix the root cause, not just error messages. There are two changes:
I'll update the PR description to make this clearer. The one-line |
6c159b2 to
2ad3626
Problem: MSI MajorUpgrade with default scheduling removes old system.vhd before the new install completes. If the new install fails, the file is permanently lost and WSL2 breaks with a cryptic error. The default system.vhd path only had a debug-only WI_ASSERT, so production builds gave no actionable error message.

Fix 1 - Atomic MSI rollback: Change MajorUpgrade Schedule to 'afterInstallInitialize' so old product removal occurs inside the MSI transaction. If the new install fails, Windows Installer rolls back and restores the old files.

Fix 2 - Production existence checks:
- Add WSL_E_SYSTEM_DISTRO_MISSING error code (0x80040333)
- Replace WI_ASSERT in WslCoreVm.cpp with THROW_HR_WITH_USER_ERROR_IF that provides a clear, localized message with the missing file path
- Add existence checks for both system.vhd and modules.vhd in HcsVirtualMachine.cpp before SCSI disk attach
- Add existence check for default modules.vhd in WslCoreVm.cpp
- Only validate default/package paths; user-provided overrides (RootVhdOverride, WSL_SYSTEM_DISTRO_PATH) use existing error paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
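
The file summary lists the new error as 0x33 while the commit message gives 0x80040333. The two are consistent if 0x33 is an offset composed into a standard HRESULT. The sketch below shows that arithmetic; the base value 0x0300 is an assumption inferred from the two numbers quoted in this PR, and `MakeHresult` is a local helper mirroring the usual severity/facility/code layout, not a declaration taken from `wslservice.idl`.

```cpp
#include <cstdint>

// Standard HRESULT layout: bit 31 = severity, bits 16-26 = facility,
// bits 0-15 = code.
constexpr std::uint32_t MakeHresult(std::uint32_t severity,
                                    std::uint32_t facility,
                                    std::uint32_t code) {
    return (severity << 31) | (facility << 16) | code;
}

constexpr std::uint32_t kSeverityError = 1;     // SEVERITY_ERROR
constexpr std::uint32_t kFacilityItf = 4;       // FACILITY_ITF
constexpr std::uint32_t kWslErrorBase = 0x0300; // assumption: inferred, see above

// Offset 0x33 from the file summary, composed into the full HRESULT.
constexpr std::uint32_t kSystemDistroMissing =
    MakeHresult(kSeverityError, kFacilityItf, kWslErrorBase + 0x33);

static_assert(kSystemDistroMissing == 0x80040333u,
              "offset 0x33 should map to the HRESULT quoted in the commit message");
```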
2ad3626 to
05c4925
Testing Update: Before/After Analysis

Reproducing the bug (BEFORE)
Attempted multiple approaches to reproduce the file-loss scenario with the stock MSI:
The real-world failure conditions from #40488 (AV kernel-mode filter drivers, disk full, power loss, system crashes) cannot be safely simulated in a controlled test.

MSI Sequence Analysis (definitive evidence)
Queried the `InstallExecuteSequence` table from both MSIs:
Stock: `RemoveExistingProducts` at seq 1401 runs before the transaction (1500). Old product is removed with no rollback safety net.
Fixed: `RemoveExistingProducts` at seq 1501 runs inside the transaction (1500–6600). On failure, the MSI engine rolls back the entire transaction, restoring the old product.

Rollback test (AFTER, our fix)
Already posted above: locked `wslservice.exe` → msiexec failed (1603) → all VHDs restored with matching SHA256 hashes → WSL remained functional.

Error message update
Updated the user-facing error to give an actionable recovery command: |
Updated analysis: This fix also prevents the reboot-pending VHD deletion scenario
After further investigation and consulting the WiX MajorUpgrade documentation and RemoveExistingProducts action reference, I've confirmed that

Failure Mode 1: Hard failure (exit 1603)
Already documented: rollback restores old product.

Failure Mode 2: File-in-use → reboot-pending delete nukes new file
This is a real customer scenario where:
Why With With This is also confirmed by the Microsoft docs which state that |
Correction to my earlier comment (refined analysis with MSI internals research)
After deeper research into MSI's file handling internals, I need to refine my earlier analysis: MSI's
|
|
Hello! Could you please provide more logs to help us better diagnose your issue? To collect WSL logs, download and execute collect-wsl-logs.ps1 in an administrative powershell prompt: The script will output the path of the log file once done. Once completed please upload the output files to this GitHub issue. See Collect WSL logs (recommended method). If you choose to email these logs instead of attaching them to the bug, please send them to wsl-gh-logs@microsoft.com with the GitHub issue number in the subject, and include a link to your GitHub issue comment in the message body. Thank you! |
Fix system.vhd loss during failed MSI upgrade
Fixes #40488
Problem
The default WiX `MajorUpgrade` scheduling (`afterInstallValidate`) places `RemoveExistingProducts` outside the MSI transaction. If the new installation fails after the old product was removed, the machine has neither version installed, including critical files like `system.vhd`.
Per WiX documentation:
Fix
1. MSI Rollback Protection: `msipackage/package.wix.in`
Changed `MajorUpgrade Schedule` to `afterInstallInitialize`, moving `RemoveExistingProducts` inside the MSI transaction.
Per WiX documentation:
This is a one-line change with no impact on successful upgrades. On failure, the old product is restored instead of leaving nothing installed.
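
The effect of the scheduling change can be checked against the sequence numbers reported in the testing comment on this PR (transaction spanning 1500 to 6600, `RemoveExistingProducts` at 1401 in the stock MSI and 1501 in the fixed one). The following is a small sketch of that comparison only; the constant and function names are illustrative and not part of the WSL sources or the Windows Installer API.

```cpp
// Transaction boundaries as reported in this PR's InstallExecuteSequence
// comparison: InstallInitialize opens the transaction, InstallFinalize
// closes it.
constexpr int kInstallInitialize = 1500;
constexpr int kInstallFinalize = 6600;

// True when an action at `sequence` runs strictly between the two
// boundaries, i.e. where a failed install can roll its effects back.
constexpr bool RunsInsideTransaction(int sequence) {
    return sequence > kInstallInitialize && sequence < kInstallFinalize;
}

// Stock MSI: removal happens before the transaction opens.
static_assert(!RunsInsideTransaction(1401), "stock scheduling is outside the transaction");
// Fixed MSI: removal happens just after the transaction opens.
static_assert(RunsInsideTransaction(1501), "fixed scheduling is inside the transaction");
```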
2. Defense-in-depth: Runtime existence checks (`WslCoreVm.cpp`, `HcsVirtualMachine.cpp`)
Added explicit file existence checks before VM boot. If packaged VHDs are missing for any reason, users get a clear error with recovery instructions instead of a cryptic HCS failure:
3. New error code: `WSL_E_SYSTEM_DISTRO_MISSING` (0x80040333)

Testing
Rollback test: Built MSI with fix → locked `wslservice.exe` to force failure → `msiexec /i wsl.msi /qn` → exit 1603 → all VHDs restored with identical SHA256 hashes → WSL functional post-rollback.

Files Changed
- `msipackage/package.wix.in`: `Schedule="afterInstallInitialize"`
- `src/windows/service/inc/wslservice.idl`: `WSL_E_SYSTEM_DISTRO_MISSING`
- `src/windows/service/exe/WslCoreVm.cpp`
- `src/windows/service/exe/HcsVirtualMachine.cpp`
- `src/windows/common/wslutil.cpp`
- `localization/strings/en-US/Resources.resw`: `MessageSystemDistroMissing`