Conversation
… DEAD, track disk ownership
A user deployed with `size = 20` instead of `size_gb = 20` in their disk
config. TOML silently ignored the unknown field, leaving SizeGB=0. This
caused a cascade:
1. configureVolumes() failed every boot ("disk does not exist and no size
specified"), but the sandbox stayed PENDING — never marked DEAD — so
the pool controller kept retrying it forever (no crash-backoff).
2. After the user fixed the config, a new deploy created the disk
successfully, but the old stuck sandboxes' leases conflicted with new
ones, causing a "lease carousel" for ~6 minutes.
3. After deleting the app, the disk and its lease survived because
ensureDisk() never set CreatedBy, so DeleteAppTransitive couldn't
find it. The user had to manually clean up.
Four fixes:
- Reject unknown TOML fields: enable DisallowUnknownFields() on all
decoder paths so typos like `size` are caught at parse time.
- Mark sandbox DEAD on any createSandbox error: use a named return with
a top-level defer so every failure path — configureVolumes, buildSpec,
NewContainer, port health checks — transitions the sandbox to DEAD.
This lets the pool controller's exponential crash-backoff kick in
instead of retrying broken sandboxes forever.
- Set CreatedBy on disk entities: pass the app ID through to ensureDisk()
so DeleteAppTransitive can find and clean up disks when an app is
deleted.
- Validate disk existence at deploy time: when size_gb is 0, query the
entity store to verify the disk exists before creating any sandboxes.
Returns a clear error: "disk X does not exist; set size_gb to
auto-create it".
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough

Adds strict TOML decoding (DisallowUnknownFields) and post-decode validation (ac.Validate) for app config parsing, plus tests verifying unknown-field rejection. Changes SandboxController.createSandbox to a named-error return and adds a defer that marks sandboxes DEAD and persists status on errors. Threads appID into disk provisioning: ensureDisk now accepts appID, and newly created storage.Disk records CreatedBy. Adds validateDiskConfigs to BuildFromTar to fail when disks referenced with size_gb=0 are missing, with unit tests covering disk validation and creation paths.
Actionable comments posted: 1
🧹 Nitpick comments (3)
appconfig/appconfig_test.go (1)
1140-1200: Add a file-backed strictness test. These cases only exercise Parse(), but the deploy-time path changed in this PR is also LoadAppConfigUnder(). A regression there would keep .miren/app.toml loads permissive while this suite still passes. A temp-dir test that writes AppConfigPath would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@appconfig/appconfig_test.go` around lines 1140-1200: add a new temp-dir-backed test that writes the same TOML strings to a file and calls LoadAppConfigUnder() (not just Parse()) to exercise the deploy-time path: create a temp dir with t.TempDir(), write the test TOML into a file named using the AppConfigPath basename, set the package-level AppConfigPath to that filename (saving and deferring restore of the original), call LoadAppConfigUnder(tempDir) and assert it errors for the unknown-field cases and succeeds for the valid case (checking ac.Services["database"].Disks[0].SizeGB == 20); ensure file writes use os.WriteFile and cleanup/restoration is done with defer.

controllers/sandbox/sandbox.go (1)
846-852: Consider potential race with existing status updates. The defer unconditionally sets co.Status = compute.DEAD and calls meta.Update(). If the sandbox was already marked DEAD or STOPPED by another goroutine (e.g., task exit monitoring), this will still attempt an update. This is likely fine since:

- The error condition means the sandbox never reached RUNNING
- The meta.Update uses the in-memory meta, which tracks revision

However, if meta.Update fails (e.g., OCC conflict), the error is silently ignored. Consider logging update failures:

🛠️ Optional: Log meta.Update failures

```diff
 defer func() {
 	if err != nil {
 		c.Log.Error("sandbox boot failed, marking DEAD", "id", co.ID, "err", err)
 		co.Status = compute.DEAD
-		meta.Update(co.Encode())
+		if updateErr := meta.Update(co.Encode()); updateErr != nil {
+			c.Log.Warn("failed to mark sandbox DEAD", "id", co.ID, "updateErr", updateErr)
+		}
 	}
 }()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/sandbox/sandbox.go` around lines 846-852: the defer currently force-sets co.Status = compute.DEAD and calls meta.Update(), which can race with other goroutines; modify the deferred closure in the sandbox boot path (the anonymous defer where co and meta are captured) to first check the current stored status (e.g., skip if already compute.DEAD or compute.STOPPED) before mutating co.Status, and when calling meta.Update() capture and log any returned error (use c.Log.Error with context "sandbox meta update failed", "id", co.ID, "err", err) so OCC/update failures are not silently ignored; keep the existing behavior of setting DEAD when appropriate but add the conditional guard and an error log for meta.Update() failures.

controllers/sandbox/volume.go (1)
136-148: Consider failing sandbox creation when version lookup fails for disk provisioning. When the version lookup fails, appID remains empty and the disk is created without ownership. While this is logged as a warning, it means:

- Disks won't be cleaned up by DeleteAppTransitive
- The original orphaned disk problem persists for these cases

Since sb.Spec.Version is set, a lookup failure likely indicates a transient issue. Consider failing the sandbox creation to ensure retries with proper ownership:

🛠️ Optional: Fail on version lookup error when disks require ownership

```diff
 if sb.Spec.Version != "" {
 	versionResp, err := c.EAC.Get(ctx, sb.Spec.Version.String())
 	if err != nil {
 		c.Log.Warn("failed to get app version for disk lease", "version", sb.Spec.Version, "error", err)
+		// If this sandbox has disks that need ownership tracking, fail early
+		if len(sb.Spec.Volume) > 0 {
+			return nil, fmt.Errorf("cannot provision disks without app context: %w", err)
+		}
 	} else {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/sandbox/volume.go` around lines 136-148: when sb.Spec.Version is set and c.EAC.Get(ctx, sb.Spec.Version.String()) returns an error, stop sandbox creation instead of proceeding with an empty appID; change the current warning-only path in controllers/sandbox/volume.go so that the handler returns or propagates an error (using the same error from c.EAC.Get or a wrapped error) rather than falling through to disk creation. Specifically, in the block around variable appID and the call to c.EAC.Get, replace the Warn log-only branch with a return err (or wrap with context) so any disk provisioning that depends on appID is not executed when version lookup fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@appconfig/appconfig.go`:
- Around line 127-129: LoadAppConfig() currently uses a permissive toml decoder
whereas LoadAppConfigUnder() and Parse() enable strict decoding; update
LoadAppConfig() to call DisallowUnknownFields() on its toml.Decoder (or refactor
all three loaders to use a shared decode helper) so unknown keys are rejected
consistently. Locate the toml.NewDecoder(...) usage in LoadAppConfig (and the
other occurrence noted near the second decoder) and either chain
.DisallowUnknownFields() on the decoder before Decode(&ac) or extract a helper
like decodeAppConfig(decoder, &ac) that applies DisallowUnknownFields() and
performs dec.Decode(&ac). Ensure the change is applied to both decoder
occurrences so all loaders behave identically.
---
Nitpick comments:
In `@appconfig/appconfig_test.go`:
- Around line 1140-1200: Add a new temp-dir-backed test that writes the same
TOML strings to a file and calls LoadAppConfigUnder() (not just Parse()) to
exercise the deploy-time path: create a temp dir with t.TempDir(), write the
test TOML into a file named using AppConfigPath basename, set the package-level
AppConfigPath to that filename (saving and deferring restore of the original),
call LoadAppConfigUnder(tempDir) and assert it errors for the unknown-field
cases and succeeds for the valid case (checking
ac.Services["database"].Disks[0].SizeGB == 20); ensure file writes use
os.WriteFile and cleanup/restoration is done with defer.
In `@controllers/sandbox/sandbox.go`:
- Around line 846-852: The defer currently force-sets co.Status = compute.DEAD
and calls meta.Update() which can race with other goroutines; modify the
deferred closure in the sandbox boot path (the anonymous defer where co and meta
are captured) to first check the current stored status (e.g., skip if already
compute.DEAD or compute.STOPPED) before mutating co.Status, and when calling
meta.Update() capture and log any returned error (use c.Log.Error with context
"sandbox meta update failed", "id", co.ID, "err", err) so OCC/update failures
are not silently ignored; keep the existing behavior of setting DEAD when
appropriate but add the conditional guard and an error log for meta.Update()
failures.
In `@controllers/sandbox/volume.go`:
- Around line 136-148: When sb.Spec.Version is set and c.EAC.Get(ctx,
sb.Spec.Version.String()) returns an error, stop sandbox creation instead of
proceeding with an empty appID; change the current warning-only path in
controllers/sandbox/volume.go so that the handler returns or propagates an error
(using the same error from c.EAC.Get or a wrapped error) rather than falling
through to disk creation. Specifically, in the block around variable appID and
the call to c.EAC.Get, replace the Warn log-only branch with a return err (or
wrap with context) so any disk provisioning that depends on appID is not
executed when version lookup fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 037eac88-d949-452a-a92d-77b1322005d1
📒 Files selected for processing (6)
appconfig/appconfig.go
appconfig/appconfig_test.go
controllers/sandbox/sandbox.go
controllers/sandbox/volume.go
servers/build/build.go
servers/build/build_test.go
LoadAppConfigUnder() and Parse() already had strict decoding enabled, but LoadAppConfig() was missed. All three paths are now consistent.
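The project's TOML library isn't named in this thread, so as an illustration of the strict-decoding pattern being made consistent here, this sketch uses Go's standard encoding/json decoder, which exposes an identically named DisallowUnknownFields method; the DiskConfig type and decodeStrict helper are hypothetical names, not code from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type DiskConfig struct {
	Name   string `json:"name"`
	SizeGB int    `json:"size_gb"`
}

// decodeStrict rejects unknown keys, so a typo like "size" instead of
// "size_gb" becomes a parse error instead of a silently zeroed field.
func decodeStrict(input string, v any) error {
	dec := json.NewDecoder(strings.NewReader(input))
	dec.DisallowUnknownFields()
	return dec.Decode(v)
}

func main() {
	var dc DiskConfig

	// Typo: "size" is not a known field. A permissive decoder would leave
	// SizeGB == 0; the strict decoder reports the unknown key.
	err := decodeStrict(`{"name": "data", "size": 20}`, &dc)
	fmt.Println(err != nil) // true

	// The correct field name decodes cleanly.
	err = decodeStrict(`{"name": "data", "size_gb": 20}`, &dc)
	fmt.Println(err, dc.SizeGB) // <nil> 20
}
```

Routing every loader through one helper like this is what keeps the three paths from drifting apart again.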
phinze left a comment
Claude and I went through this together — traced the full failure cascade from the TOML typo through the PENDING loop and the orphaned disk. Really nice work addressing this at every layer.
One thing: we agree with CodeRabbit that LoadAppConfig() at appconfig.go:100 should also get DisallowUnknownFields() to keep all three loader paths consistent. Otherwise this is good to go!
Summary

A user deployed with size = 20 instead of size_gb = 20 in their miren.toml disk config. TOML silently ignored the unknown field, leaving SizeGB=0. This caused a cascade of failures:

1. Sandbox stuck PENDING forever: configureVolumes() failed every boot attempt ("disk does not exist and no size specified"), but the sandbox was never marked DEAD. The pool controller counts PENDING sandboxes as "actual" capacity, so it didn't create replacements — but the stale-PENDING monitor would mark them STOPPED after 5 minutes, then the pool would create a new one. Since STOPPED sandboxes don't trigger the crash counter, the exponential backoff never kicked in.
2. Lease carousel after fix: After the user fixed the config, old stuck sandboxes' leases conflicted with new ones from the redeployment, causing ~6 minutes of lease churn before stabilizing.
3. Orphaned disk on app delete: ensureDisk() never set CreatedBy on disk entities, so DeleteAppTransitive couldn't find the disk when the app was deleted. The user had to manually clean up the disk and its lease.

Changes

- Reject unknown TOML fields (appconfig/appconfig.go): Enable DisallowUnknownFields() on all decoder paths so typos like size instead of size_gb produce a clear parse error at deploy time.
- Mark sandbox DEAD on any boot failure (controllers/sandbox/sandbox.go): Restructured createSandbox() to use a named return with a top-level defer — every error path now transitions the sandbox to DEAD. This is a catch-all that covers configureVolumes, buildSpec, NewContainer, port health checks, and any future failure modes. The pool controller's existing exponential crash-backoff (10s → 20s → 40s → ... → 15min) then kicks in properly.
- Set CreatedBy on disk entities (controllers/sandbox/volume.go): Pass the app ID through to ensureDisk() so DeleteAppTransitive can find and clean up disks when an app is deleted.
- Validate disk existence at deploy time (servers/build/build.go): When size_gb is 0 (meaning "use existing disk"), query the entity store to verify the disk exists before creating any sandboxes. Fails with: disk "X" does not exist; set size_gb to auto-create it.

Test plan

- go test ./appconfig/... — unknown field rejection tests pass
- go test ./servers/build/... — disk existence validation tests pass (missing disk, existing disk, auto-create)
- go vet and make lint clean
- make test full suite in iso