Skip to content

feat: unified install integrity — .mcpp_ok marker + auto-cleanup#73

Merged
Sunrisepeak merged 11 commits into
mainfrom
feat/install-recovery
May 22, 2026
Merged

feat: unified install integrity — .mcpp_ok marker + auto-cleanup#73
Sunrisepeak merged 11 commits into
mainfrom
feat/install-recovery

Conversation

@Sunrisepeak
Copy link
Copy Markdown
Member

@Sunrisepeak Sunrisepeak commented May 22, 2026

Summary

Unified mechanism for detecting and recovering from interrupted installs. Fixes the "Ctrl+C then permanently stuck" problem for all package types.

How it works

  • .mcpp_ok marker file written after every verified successful install
  • is_install_complete() is strict marker-only — no legacy heuristic fallback on the resolve/install path
  • clean_incomplete_install() is also strict marker-only — any directory without .mcpp_ok is treated as residue from an interrupted install
  • After a successful copy from ~/.xlings/, the content is validated via looks_complete_legacy() (structural check) before writing the marker
  • clean_all_incomplete() (used by mcpp self init) preserves legacy packages (no marker but valid layout) — user-visible asset protection only, separate from the install path

Trade-offs

  • Half-extracted residue: cleaned automatically on next attempt (the fix)
  • Pre-upgrade legacy installs: trigger one-time reinstall when first used; typically a fast copy_xpkg_from_global from ~/.xlings/ (no real download)
  • mcpp self init: preserves legacy packages so users don't lose them

Changes

  • src/fallback/install_integrity.cppm — unified integrity API
  • resolve_xpkg_path() — strict marker check → clean residue → install → copy fallback (with content validation)
  • mcpp self init — scans and cleans incomplete xpkgs (legacy-aware)
  • Bootstrap (ensure_patchelf/ensure_ninja) — verify actual binary before marking complete
  • toolchain install — explicit bootstrap integrity check at entry

User experience

# Before: Ctrl+C → permanently stuck
$ mcpp toolchain install llvm   # Ctrl+C during download
$ mcpp toolchain install llvm   # error: exit 1 (forever)

# After: Ctrl+C → auto-recovers
$ mcpp toolchain install llvm   # Ctrl+C during download
$ mcpp toolchain install llvm   # "Cleaning incomplete install..." → re-downloads → success

Test plan

  • self init cleans incomplete, keeps complete legacy
  • toolchain install writes .mcpp_ok marker after success
  • build + run works normally
  • toolchain install fails early if bootstrap is incomplete
  • xlings install errors preserved (not masked by generic "payload missing")
  • copy fallback validates content before marking complete
  • CI: Linux, macOS, Windows all pass

Unified mechanism for detecting and recovering from interrupted installs
(Ctrl+C, network failure, kill -9). Applies to all package types:
toolchains, bootstrap tools, and modular libraries.

src/fallback/install_integrity.cppm:
  - is_install_complete(): check .mcpp_ok marker or backward-compat heuristic
  - mark_install_complete(): write .mcpp_ok after verified install
  - clean_incomplete_install(): remove directory if incomplete
  - clean_all_incomplete(): scan xpkgs/ and clean all residue

resolve_xpkg_path() now:
  1. Check complete (marker or heuristic) → use
  2. Clean incomplete residue → install → mark complete
  3. Install failed → clean residue → copy fallback → mark complete
  4. All failed → clear error with hint

mcpp self init now scans and cleans incomplete xpkgs.
Bootstrap tools (patchelf/ninja) get .mcpp_ok after ensure.
1. is_install_complete: handle mcpplibs nested layout (single subdir
   with src/ or mcpp.toml), prevents false deletion of old packages

2. copy_from_global: check return value and verify completeness
   before marking .mcpp_ok; clean partial copies on failure

3. Restore !inst error propagation in resolve_xpkg_path — don't
   mask xlings launch/protocol errors behind generic "payload missing"

4. Per-command bootstrap gating: check_base_init is deferred to
   get_cfg(requireBootstrap=true) in build/run/toolchain-install,
   not run globally in load_or_init. Light commands (self env,
   toolchain list) skip the check.
1. toolchain install now calls check_base_init() before proceeding,
   failing early if patchelf/ninja bootstrap is incomplete

2. Preserve original xlings install error (exit code + message) in
   final error instead of masking with generic "payload missing"

3. One-time legacy migration: migrate_legacy_installs() scans xpkgs
   and writes .mcpp_ok markers for old complete packages on first run.
   is_install_complete() still has legacy fallback for un-migrated
   packages but migration ensures it's rarely needed.

4. Clean trailing whitespace in .agents/docs/*.md
Round-3 review fixes:

1. Bootstrap marker: mark_install_complete() only called after
   verifying the actual binary (bin/patchelf, bin/ninja) exists,
   not just after ensure_*() returns (which may have failed).

2. Remove automatic migrate_legacy_installs() from load_or_init().
   Heuristic-based marker writing could stamp half-extracted packages
   as complete. Legacy heuristic remains in is_install_complete() as
   read-only fallback (won't delete old packages), but .mcpp_ok is
   only written on explicit success paths or via `mcpp self init`.
Round-4 review fixes:

1. clean_incomplete_install() now uses marker-only check:
   - Has .mcpp_ok → keep (verified install)
   - No marker but looks_complete_legacy → keep (pre-upgrade package)
   - No marker, no legacy content → clean (genuinely incomplete)
   This prevents half-extracted packages that happen to have bin/lib
   from escaping cleanup.

2. Remove migrate_legacy_installs() — it was dead code (declared but
   never called). The legacy fallback in is_install_complete() handles
   old packages read-only without writing markers.
Round-4 review fix:

clean_incomplete_install() now uses STRICT marker-only semantics.
Used on the resolve/install path for the CURRENT target — absence of
.mcpp_ok unambiguously means the install attempt was incomplete.
A half-extracted dir with bin/ would otherwise escape cleanup and
corrupt subsequent installs.

clean_all_incomplete() (global scan via `mcpp self init`) keeps the
legacy-aware behavior: packages without marker but with legacy content
dirs are preserved for backward compatibility with pre-upgrade installs.

is_install_complete() retains the legacy fallback for read-only compat
in resolve_xpkg_path() — old packages are recognized as usable, but
this doesn't shield them from explicit cleanup on the install path.
Round-5 review fix:

is_install_complete() is now strict marker-only. No more legacy
heuristic fallback on the resolve/install path.

Rationale: from directory layout alone we cannot distinguish a
legacy-complete install (bin/ exists, full) from a half-extracted
residue (bin/ exists, partial). Adopting the latter silently
corrupts the user's toolchain. Strict semantics close this gap.

Cost: upgrade users do a one-time reinstall per toolchain. In
practice this hits the fast copy_xpkg_from_global() fallback that
reuses ~/.xlings/, so it's rarely a real download.

clean_all_incomplete() (mcpp self init) still preserves legacy
packages (no marker + legacy layout) as user-visible assets —
that's a separate concern from the resolve path's strictness.

looks_complete_legacy() is now exported for explicit legacy-aware
call sites (currently only clean_all_incomplete uses it).
Round-6 review fix:

After Round-5 made is_install_complete() strict marker-only, the copy
fallback path broke:

    bool copyOk = copy_xpkg_from_global(verdir);
    if (copyOk && is_install_complete(verdir)) {   // ← always false
        mark_install_complete(...);                //   never reached
        return make_payload();
    }
    clean_incomplete_install(verdir);              // ← wipes the copy

copy_xpkg_from_global() doesn't (and can't) write .mcpp_ok in the
copied directory, so the marker-only check would always fail, and the
just-copied package would be immediately wiped, returning "xpkg payload
missing".

Fix: validate the copied content via looks_complete_legacy() (the
structural heuristic) before writing the marker. This is safe in this
context because:

  1. Step 2 of the resolve chain already cleaned any pre-existing
     residue using strict marker-only semantics — so anything at
     verdir now MUST be the result of our just-completed copy.

  2. copy_xpkg_from_global() only returns true on a clean recursive
     copy (no partial copies reach this branch).

  3. The heuristic validates that the source actually had content
     (rules out copying from an empty/broken global xlings dir).

This restores the documented "copy_xpkg_from_global is the typical fast
fallback" behavior that Round-5 unintentionally broke.
Round-7 review fix:

Previously the copy fallback ran after ANY xlings install failure
(exitCode != 0), copying whatever was in ~/.xlings/ and validating
with looks_complete_legacy(). That heuristic only checks for top-level
bin/lib/include/share — a half-extracted residue in the GLOBAL xlings
directory (which we cannot clean) would pass this check and get
marked as complete, permanently masking the broken install.

Fix: split into three branches based on what xlings reported.

  - exitCode == 0 && verdir exists       → normal success, mark complete
  - exitCode == 0 && verdir missing      → XLINGS_HOME propagation bug;
                                           this is the ONLY scenario where
                                           we trust the global location
                                           enough to fall through to copy
  - exitCode != 0                        → genuine install failure;
                                           propagate the original error
                                           without trying global copy
                                           (global may also be residue
                                           from the same failure, and
                                           looks_complete_legacy can't
                                           tell them apart)

Also clarifies the autoInstall=false branch: still allow copy from
global if the user previously installed via system xlings (no install
attempt to confuse the state).
The autoInstall=false branch was performing copy_xpkg_from_global
recovery without a "this session's xlings install reported success"
witness, falling outside the safety boundary established in round-7.

Currently no caller passes autoInstall=false, so this is a no-op
cleanup that removes a future foot-gun: anyone adding such a caller
would inadvertently re-introduce the "half-extracted residue marked
as complete" window.

Semantic: when the caller explicitly disables auto-install, do not
perform any implicit recovery — return "payload missing" so the
caller (and the user) sees the truth instead of a silently-recovered
possibly-broken package.
@Sunrisepeak Sunrisepeak merged commit fd6f133 into main May 22, 2026
3 checks passed
@Sunrisepeak Sunrisepeak deleted the feat/install-recovery branch May 22, 2026 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant