Fix system.vhd loss during failed MSI upgrade #40524

Draft
yeelam-gordon wants to merge 1 commit into master from fix/system-vhd-rollback-and-checks

Conversation

@yeelam-gordon
Contributor

@yeelam-gordon yeelam-gordon commented May 13, 2026

Fix system.vhd loss during failed MSI upgrade

Fixes #40488

Problem

The default WiX MajorUpgrade scheduling (afterInstallValidate) places RemoveExistingProducts outside the MSI transaction. If the new installation fails after the old product was removed, the machine has neither version installed — including critical files like system.vhd.

Per WiX documentation:

afterInstallValidate: "if the installation of the upgrade product fails, the machine will have neither version installed."

Fix

1. MSI Rollback Protection: msipackage/package.wix.in

Changed MajorUpgrade Schedule to afterInstallInitialize, moving RemoveExistingProducts inside the MSI transaction.

Per WiX documentation:

afterInstallInitialize: "if the installation of the upgrade product fails, Windows Installer also rolls back the removal of the installed product — in other words, reinstalls it."

This is a one-line change with no impact on successful upgrades. On failure, the old product is restored instead of leaving nothing installed.

2. Defense-in-depth: Runtime existence checks in WslCoreVm.cpp, HcsVirtualMachine.cpp

Added explicit file existence checks before VM boot. If packaged VHDs are missing for any reason, users get a clear error with recovery instructions instead of a cryptic HCS failure:

A required WSL package file is missing: '{path}'. Please run 'wsl --update' to restore it.
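
The check described above can be sketched with only the C++ standard library. This is an illustration under stated assumptions, not the PR's code: the actual change uses the WIL macro THROW_HR_WITH_USER_ERROR_IF and the repository's FileExists helper, while `MissingPackageFileError`, `FormatMissingFileMessage`, and `RequirePackagedFile` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical exception type standing in for WSL_E_SYSTEM_DISTRO_MISSING.
struct MissingPackageFileError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Builds the user-facing text quoted in the PR description.
inline std::string FormatMissingFileMessage(const std::filesystem::path& path)
{
    return "A required WSL package file is missing: '" + path.string() +
           "'. Please run 'wsl --update' to restore it.";
}

// Throws before VM boot if a packaged file (for example system.vhd) is absent.
inline void RequirePackagedFile(const std::filesystem::path& path)
{
    assert(!path.empty()); // callers are expected to pass a resolved path
    std::error_code ec;    // non-throwing overload: errors are treated as "missing"
    if (!std::filesystem::exists(path, ec))
    {
        throw MissingPackageFileError(FormatMissingFileMessage(path));
    }
}
```

Calling `RequirePackagedFile` on each packaged VHD path before the disk is attached turns a late, generic HCS failure into an early error that names the missing file.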

3. New error code: WSL_E_SYSTEM_DISTRO_MISSING (0x80040333)
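
For readers unfamiliar with HRESULT layout, the sketch below shows one way the value 0x80040333 decomposes. It assumes the code is built like a Win32 MAKE_HRESULT with SEVERITY_ERROR and FACILITY_ITF, and that WSL error codes start at a base of 0x300 so that the 0x33 offset mentioned in the review summary yields 0x333. The base value and all constant names here are assumptions inferred from the numbers in this PR, not copied from wslservice.idl.

```cpp
#include <cstdint>

// Mirrors the Win32 MAKE_HRESULT layout: severity in bit 31,
// facility starting at bit 16, and the code in the low 16 bits.
constexpr std::uint32_t MakeHresult(std::uint32_t severity, std::uint32_t facility, std::uint32_t code)
{
    return (severity << 31) | (facility << 16) | code;
}

constexpr std::uint32_t kSeverityError = 1;                // SEVERITY_ERROR
constexpr std::uint32_t kFacilityItf = 4;                  // FACILITY_ITF
constexpr std::uint32_t kWslErrorBase = 0x300;             // assumed base for WSL error codes
constexpr std::uint32_t kSystemDistroMissingOffset = 0x33; // offset from the review summary

constexpr std::uint32_t kWslESystemDistroMissing =
    MakeHresult(kSeverityError, kFacilityItf, kWslErrorBase + kSystemDistroMissingOffset);

static_assert(kWslESystemDistroMissing == 0x80040333u, "matches the value in the PR description");
```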

Testing

Rollback test: Built MSI with fix → locked wslservice.exe to force failure → msiexec /i wsl.msi /qn → exit 1603 → all VHDs restored with identical SHA256 hashes → WSL functional post-rollback.

Files Changed

| File | Change |
|---|---|
| msipackage/package.wix.in | Schedule="afterInstallInitialize" |
| src/windows/service/inc/wslservice.idl | WSL_E_SYSTEM_DISTRO_MISSING |
| src/windows/service/exe/WslCoreVm.cpp | VHD existence checks with localized error |
| src/windows/service/exe/HcsVirtualMachine.cpp | VHD existence checks |
| src/windows/common/wslutil.cpp | Error string mapping |
| localization/strings/en-US/Resources.resw | MessageSystemDistroMissing |

Copilot AI review requested due to automatic review settings May 13, 2026 14:54
Contributor

Copilot AI left a comment

Pull request overview

This PR hardens WSL against unrecoverable failures when an MSI major upgrade fails mid-install by (1) moving old-product removal into the MSI transaction for rollback safety and (2) adding a dedicated, localized error path when required packaged VHD files (e.g., system.vhd, modules.vhd) are missing so users get an actionable message instead of a generic HCS failure.

Changes:

  • Adjust WiX MajorUpgrade scheduling so RemoveExistingProducts runs inside the MSI transaction (rollback restores the previous install on failure).
  • Introduce WSL_E_SYSTEM_DISTRO_MISSING and a localized message for missing packaged files.
  • Add runtime existence checks in VM startup paths and wire the new HRESULT into common error-string handling.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| msipackage/package.wix.in | Schedules MajorUpgrade removal to occur inside the MSI transaction for rollback protection. |
| src/windows/service/inc/wslservice.idl | Adds new HRESULT WSL_E_SYSTEM_DISTRO_MISSING (0x33). |
| src/windows/service/exe/WslCoreVm.cpp | Replaces debug-only asserts with production checks that throw a user-facing localized error when packaged VHDs are missing. |
| src/windows/service/exe/HcsVirtualMachine.cpp | Adds packaged-file existence validation before attaching boot VHDs (currently without setting a user-facing message). |
| src/windows/common/wslutil.cpp | Adds the new HRESULT to common error mappings and returns a localized fallback string (currently hard-coded to system.vhd). |
| localization/strings/en-US/Resources.resw | Adds MessageSystemDistroMissing localized string resource. |

Comment on lines +229 to +232
THROW_HR_IF(WSL_E_SYSTEM_DISTRO_MISSING, !wsl::windows::common::filesystem::FileExists(rootVhdPath.c_str()));
}

THROW_HR_IF(WSL_E_SYSTEM_DISTRO_MISSING, !wsl::windows::common::filesystem::FileExists(kernelModulesPath.c_str()));
Contributor Author

By design — HcsVirtualMachine.cpp doesn't have ExecutionContext. The fallback in GetErrorString now uses GetMsiPackagePath() so users get a meaningful message.

Comment thread src/windows/common/wslutil.cpp Outdated
return Localization::MessageInvalidDistributionTar();

case WSL_E_SYSTEM_DISTRO_MISSING:
return Localization::MessageSystemDistroMissing(L"system.vhd");
Contributor Author

Fixed — now uses GetMsiPackagePath().value_or(L\) instead of hard-coding 'system.vhd'. WslCoreVm.cpp paths set the specific missing file path via THROW_HR_WITH_USER_ERROR_IF before this fallback is reached.

Comment on lines +1428 to +1431
THROW_HR_WITH_USER_ERROR_IF(
WSL_E_SYSTEM_DISTRO_MISSING,
Localization::MessageSystemDistroMissing(m_vmConfig.SystemDistroPath),
!wsl::windows::common::filesystem::FileExists(m_vmConfig.SystemDistroPath.c_str()));
Contributor Author

Manual runtime rollback test documented in PR comments validates the core scenario (locked file → install failure → rollback → VHDs intact). Automated test would require deleting system.vhd from a live install — happy to add in a follow-up PR.

@benhillis
Member

Thanks for investigating, would it be possible to try to root cause the issue instead of a band-aid? A slightly better error message isn’t going to help users that get into this state.

@yeelam-gordon
Contributor Author

✅ Runtime Rollback Test PASSED

Ran the test on an admin terminal with the built MSI:

Method: Locked wslservice.exe exclusively → ran msiexec /i wsl.msi /qn → install failed (exit 1603) → rollback triggered

Results:

| File | Status | Hash Match |
|---|---|---|
| system.vhd | ✅ EXISTS | ✅ SHA256 matches baseline |
| modules.vhd | ✅ EXISTS | ✅ SHA256 matches baseline |
| wsl.exe | ✅ EXISTS | |
| wslservice.exe | ✅ EXISTS | |

Post-test: wsl --version returns 2.6.3.0 — WSL fully functional after failed upgrade.

Log evidence: MSI rollback policy confirmed enabled (DisableRollback = 0), transaction rolled back all changes including old product removal.

This confirms Schedule="afterInstallInitialize" correctly places RemoveExistingProducts inside the MSI transaction, preventing VHD loss on failed upgrades.

@yeelam-gordon
Contributor Author

@benhillis Good call — re-reading the PR description, it buries the root cause fix. Let me clarify:

This PR does fix the root cause, not just error messages. There are two changes:

  1. Root cause fix (package.wix.in): Changed MajorUpgrade Schedule to afterInstallInitialize. This moves RemoveExistingProducts inside the MSI transaction (seq 1501, between InstallInitialize at 1500 and InstallFinalize at 6600). Without this fix, the old product's files (including system.vhd) are deleted before the new install starts — if the new install fails, the files are gone with no rollback path. With this fix, if the new install fails, the entire transaction rolls back and restores the original files.

    We confirmed this with a runtime test: locked a file to force install failure → MSI exited 1603 → system.vhd and modules.vhd survived with identical SHA256 hashes.

  2. Defense-in-depth (runtime checks): The error message changes are secondary — they ensure that if VHDs are somehow missing (e.g., antivirus quarantine, manual deletion), users get "The system distribution virtual disk '...' is missing. Please reinstall or repair WSL." instead of a cryptic HCS error.

I'll update the PR description to make this clearer. The one-line Schedule change is the actual fix; the runtime checks are belt-and-suspenders.

@yeelam-gordon yeelam-gordon force-pushed the fix/system-vhd-rollback-and-checks branch from 6c159b2 to 2ad3626 Compare May 14, 2026 01:52
Problem: MSI MajorUpgrade with default scheduling removes old system.vhd
before the new install completes. If the new install fails, the file is
permanently lost and WSL2 breaks with a cryptic error. The default
system.vhd path only had a debug-only WI_ASSERT, so production builds
gave no actionable error message.

Fix 1 - Atomic MSI rollback:
Change MajorUpgrade Schedule to 'afterInstallInitialize' so old product
removal occurs inside the MSI transaction. If the new install fails,
Windows Installer rolls back and restores the old files.

Fix 2 - Production existence checks:
- Add WSL_E_SYSTEM_DISTRO_MISSING error code (0x80040333)
- Replace WI_ASSERT in WslCoreVm.cpp with THROW_HR_WITH_USER_ERROR_IF
  that provides a clear, localized message with the missing file path
- Add existence checks for both system.vhd and modules.vhd in
  HcsVirtualMachine.cpp before SCSI disk attach
- Add existence check for default modules.vhd in WslCoreVm.cpp
- Only validate default/package paths; user-provided overrides
  (RootVhdOverride, WSL_SYSTEM_DISTRO_PATH) use existing error paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yeelam-gordon yeelam-gordon force-pushed the fix/system-vhd-rollback-and-checks branch from 2ad3626 to 05c4925 Compare May 14, 2026 03:37
@yeelam-gordon
Contributor Author

Testing Update — Before/After Analysis

Reproducing the bug (BEFORE)

Attempted multiple approaches to reproduce the file-loss scenario with the stock MSI:

  • User-level file lock on wslservice.exe → MSI Restart Manager killed the locking process
  • SYSTEM-level scheduled task lock → Restart Manager killed it too
  • Deny-delete ACL → SYSTEM/admin bypasses it
  • Restart Manager disabled via MSIRESTARTMANAGERCONTROL=Disable → Still succeeded

The real-world failure conditions from #40488 (AV kernel-mode filter drivers, disk full, power loss, system crashes) cannot be safely simulated in a controlled test.

MSI Sequence Analysis (definitive evidence)

Queried the InstallExecuteSequence table from both MSIs:

| Action | Stock MSI (2.7.3) | Fixed MSI (this PR) |
|---|---|---|
| InstallInitialize (transaction start) | 1500 | 1500 |
| RemoveExistingProducts | 1401 | 1501 |
| InstallFinalize (transaction end) | 6600 | 6600 |

Stock: RemoveExistingProducts at seq 1401 runs before the transaction (1500). Old product is removed with no rollback safety net.

Fixed: RemoveExistingProducts at seq 1501 runs inside the transaction (1500–6600). On failure, the MSI engine rolls back the entire transaction, restoring the old product.
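
The comparison above reduces to a single range check on the sequence numbers. The following is a minimal sketch using the values from the table; `ExecuteSequence` and `RemovalIsTransactional` are hypothetical names introduced for illustration, not part of any MSI or WiX API.

```cpp
// Sequence numbers as reported in the InstallExecuteSequence comparison.
struct ExecuteSequence
{
    int installInitialize;      // transaction start
    int removeExistingProducts; // old product removal
    int installFinalize;        // transaction end
};

// True when the old product removal falls strictly between the start and end
// of the MSI transaction, which is the condition for rollback to cover it.
constexpr bool RemovalIsTransactional(const ExecuteSequence& seq)
{
    return seq.removeExistingProducts > seq.installInitialize &&
           seq.removeExistingProducts < seq.installFinalize;
}

constexpr ExecuteSequence kStockMsi{1500, 1401, 6600};
constexpr ExecuteSequence kFixedMsi{1500, 1501, 6600};

static_assert(!RemovalIsTransactional(kStockMsi), "stock MSI removes the old product before the transaction");
static_assert(RemovalIsTransactional(kFixedMsi), "fixed MSI removes the old product inside the transaction");
```

A check of this shape could serve as a build-time guard against the schedule regressing, assuming the sequence numbers are extracted from the built MSI by some other tool.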

Rollback test (AFTER — our fix)

Already posted above: locked wslservice.exe → msiexec failed (1603) → all VHDs restored with matching SHA256 hashes → WSL remained functional.

Error message update

Updated the user-facing error to give an actionable recovery command:
A required WSL package file is missing: '{path}'. Please run 'wsl --update' to restore it.

@yeelam-gordon
Contributor Author

Updated analysis: This fix also prevents the reboot-pending VHD deletion scenario

After further investigation and consulting the WiX MajorUpgrade documentation and RemoveExistingProducts action reference, I've confirmed that afterInstallInitialize addresses both failure modes:

Failure Mode 1: Hard failure (exit 1603)

Already documented — rollback restores old product.

Failure Mode 2: File-in-use → reboot-pending delete nukes new file

This is a real customer scenario where:

  1. Old VHD is locked during upgrade → MSI can't delete it → registers MoveFileEx pending delete on the path
  2. New VHD is installed successfully at the same path
  3. After reboot, the stale pending delete fires and removes the new VHD

Why afterInstallInitialize prevents this:

With afterInstallValidate (stock), the old product removal runs as a separate, completed operation — its pending delete targets the original path independently of what the new install does later.

With afterInstallInitialize, both removal and install are in the same execution script/transaction. MSI's file costing engine sees system.vhd being both removed and installed, and treats it as a coordinated file replacement — not independent delete + create. For locked files during a replacement, MSI renames the old file to a temp name, writes the new file at the original path, and schedules deletion of only the temp file (not the original path).

This is also confirmed by the Microsoft docs which state that afterInstallValidate is "inefficient because all reused files have to be recopied" — implying that with afterInstallInitialize, reused files are handled in-place rather than through delete+recreate.

@yeelam-gordon
Contributor Author

Correction to my earlier comment (refined analysis with MSI internals research)

After deeper research into MSI's file handling internals, I need to refine my earlier analysis:

MSI's .rbf mechanism (applies to BOTH schedulings on success)

MSI never registers the original file path for pending deletion. When encountering a locked file, it:

  1. Renames foo.dll → foo.dll.rbf (this succeeds because NTFS + FILE_SHARE_DELETE allows rename while file is loaded)
  2. Frees the original path
  3. Installs new foo.dll at the freed path
  4. Registers MoveFileEx(foo.dll.rbf, NULL, MOVEFILE_DELAY_UNTIL_REBOOT), which deletes the .rbf backup only

So in the normal success path, the "reboot-pending delete nukes new file" scenario does NOT occur through MSI's standard mechanism. Both schedulings are safe.
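
The four steps above can be simulated on any filesystem to show why the new file survives. This sketch follows the description in this comment rather than Windows Installer's actual internals: `ReplaceInUseFile` is a hypothetical helper, and `pendingDeletes` is a plain vector standing in for the reboot-time delete list.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Moves the old file aside, writes the new one at the original path, and
// queues only the backup for deletion at the next (simulated) reboot.
inline void ReplaceInUseFile(const fs::path& target, const std::string& newContents, std::vector<fs::path>& pendingDeletes)
{
    assert(fs::exists(target)); // there must be an old file to move aside

    fs::path backup = target;
    backup += ".rbf";           // foo.dll -> foo.dll.rbf
    fs::rename(target, backup); // steps 1 and 2: the original path is now free

    {
        std::ofstream out(target, std::ios::binary | std::ios::trunc);
        out << newContents;     // step 3: new file at the original path
    }

    pendingDeletes.push_back(backup); // step 4: only the backup is queued
}
```

Because the queued path is the backup and never the original, replaying `pendingDeletes` after a reboot cannot remove the newly installed file. The VHD case discussed next differs because the rename in the first step is the operation that fails.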

Where VHDs are special (and why this fix matters)

However, VHDs are data files, not executables. When a VHD is attached/mounted by Hyper-V, it's opened without FILE_SHARE_DELETE — meaning even the rename step fails. In this edge case:

  • With afterInstallValidate: the old product's removal runs outside the transaction. If the .rbf rename fails, the file operation has undefined behavior with no rollback coverage.
  • With afterInstallInitialize: if the file operation fails, the entire transaction fails and rolls back. Machine restored to previous state.

The fix provides this guarantee

The combination of:

  1. ServiceControl Stop="both" Wait="yes" → stops WSLService, releasing VHD handles in normal cases
  2. Schedule="afterInstallInitialize" → if anything still fails, full rollback restores old product

This ensures the user's requirement: fail and rollback, never "succeed now, break after reboot."

Source: MSI Logging of Reboot Requests, Concurrent Installations

@microsoft-github-policy-service
Contributor

Hello! Could you please provide more logs to help us better diagnose your issue?

To collect WSL logs, download and execute collect-wsl-logs.ps1 in an administrative powershell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The script will output the path of the log file once done.

Once completed please upload the output files to this GitHub issue.

See Collect WSL logs (recommended method).

If you choose to email these logs instead of attaching them to the bug, please send them to wsl-gh-logs@microsoft.com with the GitHub issue number in the subject, and include a link to your GitHub issue comment in the message body.

Thank you!

Development

Successfully merging this pull request may close these issues.

Cannot start WSL: Wsl/Service/CreateInstance/CreateVm/MountVhd/HCS/ERROR_FILE_NOT_FOUND