Fix PDS disk boot failures (MIR-788)#666

Merged
evanphx merged 2 commits into main from mir-788-investigate-pds-disk-boot-failures
Mar 12, 2026

Conversation

Contributor

@evanphx evanphx commented Mar 12, 2026

Summary

A user deployed with size = 20 instead of size_gb = 20 in their miren.toml disk config. TOML silently ignored the unknown field, leaving SizeGB=0. This caused a cascade of failures:

  1. Sandbox stuck PENDING forever: configureVolumes() failed every boot attempt ("disk does not exist and no size specified"), but the sandbox was never marked DEAD. The pool controller counts PENDING sandboxes as "actual" capacity, so it didn't create replacements — but the stale-PENDING monitor would mark them STOPPED after 5 minutes, then the pool would create a new one. Since STOPPED sandboxes don't trigger the crash counter, the exponential backoff never kicked in.

  2. Lease carousel after fix: After the user fixed the config, old stuck sandboxes' leases conflicted with new ones from the redeployment, causing ~6 minutes of lease churn before stabilizing.

  3. Orphaned disk on app delete: ensureDisk() never set CreatedBy on disk entities, so DeleteAppTransitive couldn't find the disk when the app was deleted. The user had to manually clean up the disk and its lease.

Changes

  • Reject unknown TOML fields (appconfig/appconfig.go): Enable DisallowUnknownFields() on all decoder paths so typos like size instead of size_gb produce a clear parse error at deploy time.

  • Mark sandbox DEAD on any boot failure (controllers/sandbox/sandbox.go): Restructured createSandbox() to use a named return with a top-level defer — every error path now transitions the sandbox to DEAD. This is a catch-all that covers configureVolumes, buildSpec, NewContainer, port health checks, and any future failure modes. The pool controller's existing exponential crash-backoff (10s → 20s → 40s → ... → 15min) then kicks in properly.

  • Set CreatedBy on disk entities (controllers/sandbox/volume.go): Pass the app ID through to ensureDisk() so DeleteAppTransitive can find and clean up disks when an app is deleted.

  • Validate disk existence at deploy time (servers/build/build.go): When size_gb is 0 (meaning "use existing disk"), query the entity store to verify the disk exists before creating any sandboxes. Fails with: disk "X" does not exist; set size_gb to auto-create it.

Test plan

  • go test ./appconfig/... — unknown field rejection tests pass
  • go test ./servers/build/... — disk existence validation tests pass (missing disk, existing disk, auto-create)
  • go vet and make lint clean
  • make test full suite in iso

… DEAD, track disk ownership

A user deployed with `size = 20` instead of `size_gb = 20` in their disk
config. TOML silently ignored the unknown field, leaving SizeGB=0. This
caused a cascade:

1. configureVolumes() failed every boot ("disk does not exist and no size
   specified"), but the sandbox stayed PENDING — never marked DEAD — so
   the pool controller kept retrying it forever (no crash-backoff).

2. After the user fixed the config, a new deploy created the disk
   successfully, but the old stuck sandboxes' leases conflicted with new
   ones, causing a "lease carousel" for ~6 minutes.

3. After deleting the app, the disk and its lease survived because
   ensureDisk() never set CreatedBy, so DeleteAppTransitive couldn't
   find it. The user had to manually clean up.

Four fixes:

- Reject unknown TOML fields: enable DisallowUnknownFields() on all
  decoder paths so typos like `size` are caught at parse time.

- Mark sandbox DEAD on any createSandbox error: use a named return with
  a top-level defer so every failure path — configureVolumes, buildSpec,
  NewContainer, port health checks — transitions the sandbox to DEAD.
  This lets the pool controller's exponential crash-backoff kick in
  instead of retrying broken sandboxes forever.

- Set CreatedBy on disk entities: pass the app ID through to ensureDisk()
  so DeleteAppTransitive can find and clean up disks when an app is
  deleted.

- Validate disk existence at deploy time: when size_gb is 0, query the
  entity store to verify the disk exists before creating any sandboxes.
  Returns a clear error: "disk X does not exist; set size_gb to
  auto-create it".
@evanphx evanphx requested a review from a team as a code owner March 12, 2026 02:56

coderabbitai bot commented Mar 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between e5e8665 and a1d7e23.

📒 Files selected for processing (1)
  • appconfig/appconfig.go

📝 Walkthrough

Adds strict TOML decoding (DisallowUnknownFields) and post-decode validation (ac.Validate) for app config parsing, plus tests verifying unknown-field rejection. Changes SandboxController.createSandbox to a named-error return and adds a defer that marks sandboxes DEAD and persists status on errors. Threads appID into disk provisioning: ensureDisk now accepts appID and newly created storage.Disk records CreatedBy. Adds validateDiskConfigs to BuildFromTar to fail when disks referenced with size_gb=0 are missing, with unit tests covering disk validation and creation paths.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
appconfig/appconfig_test.go (1)

1140-1200: Add a file-backed strictness test.

These cases only exercise Parse(), but the deploy-time path changed in this PR is also LoadAppConfigUnder(). A regression there would keep .miren/app.toml loads permissive while this suite still passes. A temp-dir test that writes AppConfigPath would close that gap.

controllers/sandbox/sandbox.go (1)

846-852: Consider potential race with existing status updates.

The defer unconditionally sets co.Status = compute.DEAD and calls meta.Update(). If the sandbox was already marked DEAD or STOPPED by another goroutine (e.g., task exit monitoring), this will still attempt an update.

This is likely fine since:

  1. The error condition means the sandbox never reached RUNNING
  2. The meta.Update uses the in-memory meta which tracks revision

However, if meta.Update fails (e.g., OCC conflict), the error is silently ignored. Consider logging update failures:

🛠️ Optional: Log meta.Update failures
 defer func() {
     if err != nil {
         c.Log.Error("sandbox boot failed, marking DEAD", "id", co.ID, "err", err)
         co.Status = compute.DEAD
-        meta.Update(co.Encode())
+        if updateErr := meta.Update(co.Encode()); updateErr != nil {
+            c.Log.Warn("failed to mark sandbox DEAD", "id", co.ID, "updateErr", updateErr)
+        }
     }
 }()
controllers/sandbox/volume.go (1)

136-148: Consider failing sandbox creation when version lookup fails for disk provisioning.

When the version lookup fails, appID remains empty and the disk is created without ownership. While this is logged as a warning, it means:

  1. Disks won't be cleaned up by DeleteAppTransitive
  2. The original orphaned disk problem persists for these cases

Since sb.Spec.Version is set, a lookup failure likely indicates a transient issue. Consider failing the sandbox creation to ensure retries with proper ownership:

🛠️ Optional: Fail on version lookup error when disks require ownership
 if sb.Spec.Version != "" {
     versionResp, err := c.EAC.Get(ctx, sb.Spec.Version.String())
     if err != nil {
         c.Log.Warn("failed to get app version for disk lease",
             "version", sb.Spec.Version,
             "error", err)
+        // If this sandbox has disks that need ownership tracking, fail early
+        if len(sb.Spec.Volume) > 0 {
+            return nil, fmt.Errorf("cannot provision disks without app context: %w", err)
+        }
     } else {
appconfig/appconfig.go (1)

127-129: LoadAppConfig() still uses a permissive toml decoder, while LoadAppConfigUnder() and Parse() enable strict decoding. Call DisallowUnknownFields() on the decoder in LoadAppConfig() as well (or refactor all three loaders onto a shared decode helper) so unknown keys are rejected consistently on every path.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 407cb88 and e5e8665.

📒 Files selected for processing (6)
  • appconfig/appconfig.go
  • appconfig/appconfig_test.go
  • controllers/sandbox/sandbox.go
  • controllers/sandbox/volume.go
  • servers/build/build.go
  • servers/build/build_test.go

LoadAppConfigUnder() and Parse() already had strict decoding enabled,
but LoadAppConfig() was missed. All three paths are now consistent.
Contributor

@phinze phinze left a comment


Claude and I went through this together — traced the full failure cascade from the TOML typo through the PENDING loop and the orphaned disk. Really nice work addressing this at every layer.

One thing: we agree with CodeRabbit that LoadAppConfig() at appconfig.go:100 should also get DisallowUnknownFields() to keep all three loader paths consistent. Otherwise this is good to go!

@evanphx evanphx merged commit 66b6371 into main Mar 12, 2026
11 checks passed
@evanphx evanphx deleted the mir-788-investigate-pds-disk-boot-failures branch March 12, 2026 18:54