
@cburroughs cburroughs commented Jan 20, 2026

This whole thing is a disclaimer because I don't know Rust and this is all LLM. My hope is that upon looking at the code, it either looks obviously correct or, if not, we abort before a human spends too much time on it.

I asked Claude to look at the last hundred commits on main and pull the logs for MissingDigest errors (there were 2). It then came up with a "plan" for fixing them (in a comment below). The short summary is: "The backtracking mechanism (designed to retry with caches disabled) fails to trigger because the MissingDigest error type is lost during error propagation."

I re-ran the CI workflow over a dozen times, which I take as a hopeful sign, but I did not burn hundreds of re-runs to statistically "prove" a fix.

ref #21497

@cburroughs cburroughs self-assigned this Jan 20, 2026
@cburroughs cburroughs added the category:internal CI, fixes for not-yet-released features, etc. label Jan 20, 2026
@cburroughs cburroughs changed the title WIP yolo MissingDigest speculative fix for sporadic MissingDigest errors Jan 21, 2026
@cburroughs
Contributor Author

Plan

Fix: Missing Digest Errors - Enable Backtracking Recovery

 Problem Summary

 Sporadic "Missing digest: Was not present in the local store" errors occur during sandbox materialization. The backtracking mechanism (designed to retry with caches disabled) fails to trigger because the MissingDigest error type is lost during error propagation.

 CI Failures Found:
 - Run #20851300165 (2026-01-09) - macOS14-ARM64
 - Run #19943400854 (2025-12-04) - macOS14-ARM64

 Why Does This Appear macOS-Specific?

 Short answer: It may not be. With only 2 failures in 200 workflow runs, this could be coincidental.

 Contributing factors that might make macOS more susceptible:

 1. Copy-on-Write vs Hard Links: macOS uses fclonefileat (CoW copy) instead of hard links for file materialization (see store/src/lib.rs:1566-1576). While both should be reliable, CoW has different failure modes.
 2. Different CI Environment: macOS runners may have different:
   - Store size limits (causing more eviction)
   - Remote cache settings (more cache hits with stale digests)
   - I/O scheduling (exposing race conditions)
 3. Filesystem Semantics: APFS has different sync/caching behavior than ext4.

 However, the underlying bug affects ALL platforms:
 - The error type conversion bug (StoreError::MissingDigest → String → lost type) exists in platform-agnostic code
 - Any platform can hit this if a cached result references a missing digest
 - The fix enables proper recovery regardless of the root cause

 Root Cause

 The error type information is lost at local.rs:753:
 store.materialize_directory(...).await
     .map_err(|se| se.enrich("...").to_string())?;  // .to_string() loses type!

 Current (broken) error flow:
 1. StoreError::MissingDigest(msg, digest)
 2. → .to_string() → String
 3. → CapturedWorkdirError::Fatal(String)
 4. → ProcessError::Unclassified(String)
 5. → Failure::Throw (not Failure::MissingDigest!)
 6. → maybe_backtrack doesn't catch it, no retry happens
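The broken flow above can be reproduced in miniature. The following is only a sketch under stated assumptions: `Digest`, `StoreError`, and `Failure` are simplified stand-ins rather than the real Pants types, and `classify_after_stringify` and `would_backtrack` are hypothetical helpers that mimic steps 2 through 6.

```rust
use std::fmt;

// Simplified stand-in for the real digest type (field layout is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Digest {
    pub hash: String,
    pub size_bytes: usize,
}

// Stand-in for the store error with the two variants named in the plan.
#[derive(Debug, Clone, PartialEq)]
pub enum StoreError {
    MissingDigest(String, Digest),
    Unclassified(String),
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::MissingDigest(s, d) => write!(f, "{s}: {d:?}"),
            Self::Unclassified(s) => write!(f, "{s}"),
        }
    }
}

// Stand-in for the engine-level failure type.
#[derive(Debug, PartialEq)]
pub enum Failure {
    MissingDigest(String, Digest),
    Throw(String),
}

/// Hypothetical helper mimicking the broken path: the error is turned into a
/// String first, so the only thing left to build afterwards is a generic Throw.
pub fn classify_after_stringify(err: StoreError) -> Failure {
    let msg: String = err.to_string(); // the variant is erased here
    Failure::Throw(msg)
}

/// Hypothetical stand-in for the retry check: only MissingDigest qualifies.
pub fn would_backtrack(failure: &Failure) -> bool {
    matches!(failure, Failure::MissingDigest(..))
}
```

Passing a `StoreError::MissingDigest` through `classify_after_stringify` yields a `Failure::Throw` whose message still mentions the digest, but `would_backtrack` returns false because the variant is gone.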

 Solution

 Preserve MissingDigest type through the entire error chain.

 Files to Modify

 1. src/rust/process_execution/src/local.rs

 Add MissingDigest variant to CapturedWorkdirError (~line 354):
 pub enum CapturedWorkdirError {
     Timeout { ... },
     Retryable(String),
     Fatal(String),
     MissingDigest(String, hashing::Digest),  // NEW
 }

 Update Display impl (~line 369):
 Self::MissingDigest(s, d) => write!(f, "Missing digest: {s}: {d:?}"),

 Add From<StoreError> impl:
 impl From<StoreError> for CapturedWorkdirError {
     fn from(err: StoreError) -> Self {
         match err {
             StoreError::MissingDigest(s, d) => Self::MissingDigest(s, d),
             StoreError::Unclassified(s) => Self::Fatal(s),
         }
     }
 }

 Update prepare_workdir (~line 753):
 // FROM:
 .map_err(|se| se.enrich("...").to_string())?;

 // TO:
 .map_err(|se| CapturedWorkdirError::from(se.enrich("...")))?;

 Update prepare_workdir_digest (~lines 590 and 655) similarly.
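To see the local.rs changes working together, here is a minimal self-contained sketch. Everything in it is a stand-in: the `enrich` body is an assumption (it prefixes the message and keeps the variant), `materialize_directory` is a hypothetical synchronous stub, and the real functions are async and take more arguments.

```rust
// Simplified stand-in for hashing::Digest (field layout is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Digest {
    pub hash: String,
    pub size_bytes: usize,
}

#[derive(Debug)]
pub enum StoreError {
    MissingDigest(String, Digest),
    Unclassified(String),
}

impl StoreError {
    /// Assumed behavior: prefix the message while keeping the variant intact.
    pub fn enrich(self, prefix: &str) -> Self {
        match self {
            Self::MissingDigest(s, d) => Self::MissingDigest(format!("{prefix}: {s}"), d),
            Self::Unclassified(s) => Self::Unclassified(format!("{prefix}: {s}")),
        }
    }
}

// The Timeout variant is omitted here because its fields are elided in the plan.
#[derive(Debug)]
pub enum CapturedWorkdirError {
    Retryable(String),
    Fatal(String),
    MissingDigest(String, Digest),
}

impl From<StoreError> for CapturedWorkdirError {
    fn from(err: StoreError) -> Self {
        match err {
            StoreError::MissingDigest(s, d) => Self::MissingDigest(s, d),
            StoreError::Unclassified(s) => Self::Fatal(s),
        }
    }
}

/// Hypothetical stub for the store call: fails when the digest is absent.
fn materialize_directory(digest_present: bool) -> Result<(), StoreError> {
    if digest_present {
        Ok(())
    } else {
        Err(StoreError::MissingDigest(
            "Was not present in the local store".to_string(),
            Digest { hash: "abc123".to_string(), size_bytes: 1750 },
        ))
    }
}

/// Converts with From instead of to_string(), so the variant survives.
pub fn prepare_workdir(digest_present: bool) -> Result<(), CapturedWorkdirError> {
    materialize_directory(digest_present).map_err(|se| {
        CapturedWorkdirError::from(
            se.enrich("An error occurred when attempting to materialize a working directory"),
        )
    })?;
    Ok(())
}
```

With these stand-ins, `prepare_workdir(false)` returns `Err(CapturedWorkdirError::MissingDigest(..))` with the enriched message, rather than a `Fatal` string.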

 2. src/rust/process_execution/src/lib.rs

 Update From<CapturedWorkdirError> for ProcessError (~line 128):
 impl From<CapturedWorkdirError> for ProcessError {
     fn from(err: CapturedWorkdirError) -> Self {
         match err {
             CapturedWorkdirError::MissingDigest(s, d) => ProcessError::MissingDigest(s, d),
             _ => ProcessError::Unclassified(err.to_string()),
         }
     }
 }

 No Changes Required

 These files already have correct handling:
 - src/rust/engine/src/context.rs - maybe_backtrack already handles Failure::MissingDigest
 - src/rust/engine/src/python.rs - From<ProcessError> for Failure already maps MissingDigest
 - src/rust/fs/store/src/lib.rs - StoreError already has MissingDigest variant

 Expected Result After Fix

 New error flow:
 1. StoreError::MissingDigest(msg, digest)
 2. → CapturedWorkdirError::MissingDigest(msg, digest) (type preserved!)
 3. → ProcessError::MissingDigest(msg, digest)
 4. → Failure::MissingDigest(msg, digest)
 5. → maybe_backtrack catches it, invalidates source process, retries with caches disabled
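The new flow can be sketched end to end as follows. This is an illustration under assumptions, not the engine's actual code: the four error types are reduced to two variants each, `run_process` is a hypothetical process that fails only while caches are enabled, and `run_with_backtracking` is a much-simplified stand-in for what maybe_backtrack does.

```rust
// Simplified stand-in for the digest type.
#[derive(Debug, Clone, PartialEq)]
pub struct Digest(pub String);

#[derive(Debug)]
pub enum StoreError {
    MissingDigest(String, Digest),
    Unclassified(String),
}

#[derive(Debug)]
pub enum CapturedWorkdirError {
    Fatal(String),
    MissingDigest(String, Digest),
}

#[derive(Debug)]
pub enum ProcessError {
    Unclassified(String),
    MissingDigest(String, Digest),
}

#[derive(Debug, PartialEq)]
pub enum Failure {
    Throw(String),
    MissingDigest(String, Digest),
}

// Each conversion keeps MissingDigest as its own variant.
impl From<StoreError> for CapturedWorkdirError {
    fn from(err: StoreError) -> Self {
        match err {
            StoreError::MissingDigest(s, d) => Self::MissingDigest(s, d),
            StoreError::Unclassified(s) => Self::Fatal(s),
        }
    }
}

impl From<CapturedWorkdirError> for ProcessError {
    fn from(err: CapturedWorkdirError) -> Self {
        match err {
            CapturedWorkdirError::MissingDigest(s, d) => Self::MissingDigest(s, d),
            CapturedWorkdirError::Fatal(s) => Self::Unclassified(s),
        }
    }
}

impl From<ProcessError> for Failure {
    fn from(err: ProcessError) -> Self {
        match err {
            ProcessError::MissingDigest(s, d) => Self::MissingDigest(s, d),
            ProcessError::Unclassified(s) => Self::Throw(s),
        }
    }
}

/// Hypothetical process run: a stale cache entry points at a missing digest.
fn run_process(use_cache: bool) -> Result<&'static str, Failure> {
    if use_cache {
        let store_err = StoreError::MissingDigest(
            "Was not present in the local store".to_string(),
            Digest("abc123".to_string()),
        );
        let workdir_err: CapturedWorkdirError = store_err.into();
        let process_err: ProcessError = workdir_err.into();
        Err(process_err.into())
    } else {
        Ok("ran without cache")
    }
}

/// Simplified stand-in for backtracking: retry once with caches disabled,
/// but only when the failure is still typed as MissingDigest.
pub fn run_with_backtracking() -> Result<&'static str, Failure> {
    match run_process(true) {
        Err(Failure::MissingDigest(..)) => run_process(false),
        other => other,
    }
}
```

Because every hop matches on the variant instead of stringifying, the retry arm in `run_with_backtracking` is reachable. If any hop collapsed the error to a string, the first attempt's failure would fall through the `other` arm unchanged.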

 Testing

 1. Build the Rust code: ./cargo build -p process_execution
 2. Run existing tests: ./cargo test -p process_execution
 3. The fix should be validated by CI runs - the sporadic failures should trigger backtracking instead of failing


cburroughs commented Jan 21, 2026

  1. src/rust/process_execution/src/local.rs
  - Added use hashing::Digest; import
  - Added StoreError to imports from store
  - Added MissingDigest(String, Digest) variant to CapturedWorkdirError enum
  - Updated Display impl to handle the new variant
  - Added From<StoreError> for CapturedWorkdirError impl that preserves MissingDigest
  - Updated prepare_workdir to use CapturedWorkdirError::from() instead of .to_string()
  - Updated prepare_workdir_digest (2 locations) to use CapturedWorkdirError::from()

  2. src/rust/process_execution/src/lib.rs
  - Updated From<CapturedWorkdirError> for ProcessError to match on MissingDigest and preserve it

  How the Fix Works

  Before (broken):
  StoreError::MissingDigest → .to_string() → String → Fatal(String) → Unclassified → Throw
                                                      (type lost!)

  After (fixed):
  StoreError::MissingDigest → CapturedWorkdirError::MissingDigest → ProcessError::MissingDigest → Failure::MissingDigest
                                                    (type preserved!)

  Now when a digest is missing during sandbox materialization, the maybe_backtrack function in context.rs will catch the Failure::MissingDigest error and trigger the backtracking mechanism to retry with caches disabled.

@cburroughs cburroughs marked this pull request as ready for review January 21, 2026 21:52
@cburroughs
Contributor Author

And before GitHub GCs them, an example log of what is allegedly fixed: https://github.com/pantsbuild/pants/actions/runs/20851300165/job/59906772145

  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/scheduler.py", line 598, in product_request
    return self.execute(request)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/scheduler.py", line 539, in execute
    self._raise_on_error(throws)
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/scheduler.py", line 523, in _raise_on_error
    raise ExecutionError(
pants.engine.internals.scheduler.ExecutionError: 1 Exception encountered:

Engine traceback:
  in root
    ..
  in pants.init.plugin_resolver.resolve_plugins
    ..

Traceback (most recent call last):
  File "/Users/runner/work/pants/pants/src/python/pants/init/plugin_resolver.py", line 80, in resolve_plugins
    plugins_pex = await create_venv_pex(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/rules.py", line 69, in wrapper
    return await call
           ^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 78, in __await__
    result = yield self
             ^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/util_rules/pex.py", line 1142, in create_venv_pex
    venv_pex_result = await build_pex(seeded_venv_request, **implicitly())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/rules.py", line 69, in wrapper
    return await call
           ^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 78, in __await__
    result = yield self
             ^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/backend/python/util_rules/pex.py", line 823, in build_pex
    result = await fallible_to_exec_result_or_raise(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/rules.py", line 69, in wrapper
    return await call
           ^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/internals/selectors.py", line 78, in __await__
    result = yield self
             ^^^^^^^^^^
  File "/Users/runner/work/pants/pants/src/python/pants/engine/intrinsics.py", line 110, in execute_process
    return await native_engine.execute_process(process, process_execution_environment)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
native_engine.IntrinsicError: An error occurred when attempting to materialize a working directory at "/private/var/folders/pr/_3h4tm913_l56k6hrhpm89z00000gn/T/pants-sandbox-ewggr3": Was not present in the local store: Digest { hash: Fingerprint<49424048c0d1f8c1edf1bac6375c7984a6d7d16c83997eeafcb9de3bfd7edf40>, size_bytes: 1750 }
