Conversation
…e routing
Four bugs manifested during rapid successive deploys:
1. Pool sandbox visibility: listSandboxes queried by SandboxSpec.Version,
which changes during pool reuse, causing the pool to lose visibility of
existing sandboxes. Now queries by entity kind and filters by pool label,
which never changes.
2. False crash detection: Short-lived sandboxes in decommissioned pools
(desired=0, no references) triggered crash detection. Now skips crash
detection for pools being intentionally torn down.
3. Stale lease routing: httpingress cached leases pointing to dead sandboxes
for up to 30s (renewal cycle), causing empty replies and timeouts.
- Activator now pushes SandboxInvalidation events when sandboxes leave
RUNNING state
- httpingress watches the invalidation channel and immediately drops
affected cached leases
- RenewLease rejects STOPPED/DEAD sandboxes as a backup
4. Slow traffic shift: waitForPoolReady polled pool.ReadyInstances (updated
on 10s resync) instead of checking sandbox status directly. Now also
checks sandbox entities which are updated immediately when sandboxes boot,
with a 500ms poll interval.
📝 Walkthrough
Adds a SandboxInvalidation event and exposes invalidations via AppActivator.Invalidations(). The local activator publishes invalidations when sandboxes transition away from RUNNING, and RenewLease is rejected for non-RUNNING sandboxes. The HTTP ingress server listens for invalidations to evict and release affected leases. Pool readiness now also detects RUNNING sandboxes directly; sandbox listing is based on pool labels, and crash-cooldown is skipped for decommissioned pools. Tests and harnesses add helpers (NewTestLease, mock Invalidations, Miren.RunCmd, harness.HTTPGet) and a zero-downtime blackbox deploy test. CLAUDE.md build instructions are clarified.
📝 Coding Plan
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/activator/activator.go (1)
1014-1020: ⚠️ Potential issue | 🟠 Major: Emit invalidation on sandbox delete events too
Invalidations are only emitted on status transitions. A direct delete event currently removes tracking without invalidating cached leases, so stale routes can survive until renewal/error fallback.
🔧 Suggested fix
```diff
 if op.IsDelete() {
 	// Entity was deleted - clean up from tracking
 	// The ID should still be available in the operation even without the entity
 	if op.HasEntityId() {
-		a.removeSandboxFromTracking(entity.Id(op.EntityId()))
+		sandboxID := entity.Id(op.EntityId())
+		a.emitInvalidation(sandboxID)
+		a.removeSandboxFromTracking(sandboxID)
 	}
 	return nil
 }

+func (a *localActivator) emitInvalidation(sandboxID entity.Id) {
+	select {
+	case a.invalidationCh <- SandboxInvalidation{SandboxID: sandboxID}:
+	default:
+		a.log.Warn("invalidation channel full, dropping notification", "sandbox", sandboxID)
+	}
+}

-	if oldStatus == compute_v1alpha.RUNNING && oldStatus != sb.Status {
-		select {
-		case a.invalidationCh <- SandboxInvalidation{SandboxID: sb.ID}:
-		default:
-			a.log.Warn("invalidation channel full, dropping notification", "sandbox", sb.ID)
-		}
-	}
+	if oldStatus == compute_v1alpha.RUNNING && oldStatus != sb.Status {
+		a.emitInvalidation(sb.ID)
+	}
```
Also applies to: 1100-1108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/activator/activator.go` around lines 1014 - 1020, When handling op.IsDelete() (and the similar block at lines ~1100-1108) you remove tracking with a.removeSandboxFromTracking(entity.Id(op.EntityId())) but never emit a lease invalidation; change the delete branch to emit the same invalidation used for status transitions for that sandbox ID before calling removeSandboxFromTracking (i.e., invoke the existing invalidation helper used elsewhere in this file/module for sandbox status changes, passing entity.Id(op.EntityId())), so cached leases/routes are invalidated on delete as well.
🧹 Nitpick comments (2)
servers/httpingress/httpingress.go (1)
186-207: Avoid calling `ReleaseLease` while holding `h.mu`
`invalidateSandboxLeases` holds the server mutex across external `ReleaseLease` calls. If release is slow, cache operations block behind this path.
🔧 Suggested refactor
```diff
 func (h *Server) invalidateSandboxLeases(ctx context.Context, sandboxID entity.Id) {
-	h.mu.Lock()
-	defer h.mu.Unlock()
+	var toRelease []*activator.Lease
+	h.mu.Lock()
 	for app, ar := range h.apps {
 		var kept []*lease
 		for _, l := range ar.leases {
-			if l.Lease.Sandbox().ID == sandboxID {
+			sb := l.Lease.Sandbox()
+			if sb != nil && sb.ID == sandboxID {
 				h.Log.Info("invalidating cached lease for stopped sandbox",
 					"app", app, "sandbox", sandboxID, "url", l.Lease.URL)
-				h.aa.ReleaseLease(ctx, l.Lease)
+				toRelease = append(toRelease, l.Lease)
 			} else {
 				kept = append(kept, l)
 			}
 		}
@@
 		ar.leases = kept
 	}
+	h.mu.Unlock()
+
+	for _, lease := range toRelease {
+		_ = h.aa.ReleaseLease(ctx, lease)
+	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@servers/httpingress/httpingress.go` around lines 186 - 207, invalidateSandboxLeases currently holds h.mu while calling h.aa.ReleaseLease which can block other ops; change it to first collect the leases to release without calling external code (iterate h.apps/ar.leases under h.mu and build a slice of toRelease entries and the updated kept slices or app deletions), update h.apps/ar.leases state while still holding h.mu, then unlock h.mu and iterate the collected toRelease slice to call h.aa.ReleaseLease(ctx, lease). Use the same identifiers (invalidateSandboxLeases, h.mu, h.apps, ar.leases, h.aa.ReleaseLease) so the critical section only touches in-memory maps and the potentially slow ReleaseLease runs outside the mutex.
servers/httpingress/lease_test.go (1)
48-50: Use a stable mock invalidation channel
Returning a new channel on each call makes it difficult to inject invalidation events in tests that call `Invalidations()` and send from a different point.
🔧 Suggested refactor
```diff
 type mockActivator struct {
 	mu           sync.Mutex
 	renewCount   int
 	releaseCount int
 	renewErr     error
 	releasedURLs []string
 	renewedURLs  []string
+	invalidationCh chan activator.SandboxInvalidation
 }
@@
 func (m *mockActivator) Invalidations() <-chan activator.SandboxInvalidation {
-	return make(chan activator.SandboxInvalidation)
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.invalidationCh == nil {
+		m.invalidationCh = make(chan activator.SandboxInvalidation, 8)
+	}
+	return m.invalidationCh
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@servers/httpingress/lease_test.go` around lines 48 - 50, The mockActivator.Invalidations implementation returns a fresh channel each call which prevents tests from sending events into the same channel; change mockActivator to hold a persistent field (e.g. invalidations chan activator.SandboxInvalidation) initialized once (in the mockActivator constructor or when the struct is created) and have Invalidations() return that field; also expose or document the field so tests can send invalidation events into mockActivator.invalidations and close it when done.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 953-958: hasRunningSandboxForPool currently masks listing errors
by returning false, causing waitForPoolReady to treat errors as “not ready”;
change hasRunningSandboxForPool to return (bool, error) (or otherwise surface
the underlying error) and update its callers (including the checks in
waitForPoolReady around the l.hasRunningSandboxForPool calls at the shown
locations) to handle and propagate errors instead of treating them as a negative
result—i.e., if the list call returns an error, return that error up to
waitForPoolReady so the caller can abort/handle the failure rather than timing
out and cleaning up.
In `@controllers/sandboxpool/manager_test.go`:
- Around line 1306-1311: The test mutates pool.SandboxSpec.Version in-memory but
never persists it to the test store, so reconciliation doesn't exercise the
version-update path; update the stored entity before calling reconcilePool by
saving the modified pool into the test datastore (i.e., persist the pool after
setting pool.SandboxSpec.Version = entity.Id("ver-2")) so that
NewManager(...)/reconcilePool(...) operates on the updated pool state.
In `@controllers/sandboxpool/manager.go`:
- Around line 324-330: The current listSandboxes implementation uses a global
indexAttr (entity.Ref(entity.EntityKind, compute_v1alpha.KindSandbox)) and
m.eac.List to fetch all sandboxes then filters in-memory, which is
O(pools×sandboxes); instead introduce and use a stable pool-membership indexed
attribute (e.g. "poolRef" or "sandboxPoolRef") on Sandbox entities and query by
that attribute so the backend returns only sandboxes for the given pool. Update
the Sandbox creation/ownership code to set the poolRef attribute (stable pool ID
or entity.Ref for the pool), then change listSandboxes to build an indexAttr for
sandbox-by-pool (referencing compute_v1alpha.KindSandbox and the poolRef value)
and call m.eac.List with that attribute/value so you avoid global scans (adjust
places using indexAttr, m.eac.List, and entity.Ref accordingly).
In `@servers/httpingress/httpingress.go`:
- Around line 176-181: The select loop reading from the invalidation channel can
spin if the channel from Invalidations() is closed because receiving yields zero
values; update the receive case to use the comma-ok form (inv, ok := <-ch) and
if ok is false treat the closure as a termination condition (e.g., return or
break the goroutine) before calling invalidateSandboxLeases(ctx, inv.SandboxID),
so the handler cleanly exits when the channel is closed; make this change in the
goroutine/select that currently reads from ch and calls invalidateSandboxLeases.
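The comma-ok termination described in that comment can be sketched as follows. Names are illustrative: `drainInvalidations` stands in for the ingress goroutine, and `handle` for `invalidateSandboxLeases`.

```go
package main

import "fmt"

// drainInvalidations handles each event and, when the channel is
// closed, exits the loop instead of spinning on zero values.
func drainInvalidations(ch <-chan string, handle func(string)) int {
	handled := 0
	for {
		id, ok := <-ch
		if !ok {
			return handled // channel closed: terminate cleanly
		}
		handle(id)
		handled++
	}
}

func main() {
	ch := make(chan string, 2)
	ch <- "sb-1"
	ch <- "sb-2"
	close(ch)
	fmt.Println(drainInvalidations(ch, func(string) {}))
}
```

Without the `ok` check, a plain receive on a closed channel returns the zero value immediately, so the loop would busy-spin calling the handler with empty IDs.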
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bae1d254-49aa-4f87-b309-012d6f47382d
📒 Files selected for processing (6)
- components/activator/activator.go
- controllers/deployment/launcher.go
- controllers/sandboxpool/manager.go
- controllers/sandboxpool/manager_test.go
- servers/httpingress/httpingress.go
- servers/httpingress/lease_test.go
```go
// Always query all sandboxes by kind and filter by pool label.
// We used to query by pool.SandboxSpec.Version, but that field changes
// during pool reuse (when a new deploy has identical specs), causing the
// pool to lose visibility of its existing sandboxes.
indexAttr := entity.Ref(entity.EntityKind, compute_v1alpha.KindSandbox)

resp, err := m.eac.List(ctx, indexAttr)
```
Global sandbox scans in listSandboxes will become a hot-path bottleneck
This now lists all sandboxes for every pool reconciliation/monitor pass and filters in memory. At scale, this creates O(pools × sandboxes) load and can significantly increase reconcile latency.
Consider introducing an indexed pool-membership attribute (stable pool ID ref) so lookups stay version-agnostic but query-selective.
Acknowledged, but we're going to leave this as-is for now. A few reasons:
- The old code already did global scans for addon pools (the `else` branch that fell back to listing by kind) — this isn't a new pattern; we're just making it consistent.
- At current scale, the entity store handles these queries fine. Pool reconciliation runs on a 10s resync, not on every request.
- Adding an indexed pool-ref attribute is the right optimization if this becomes a bottleneck, but it touches the entity schema and sandbox creation paths — that's a larger change we'd rather do intentionally than as part of a bug fix PR.
- `hasRunningSandboxForPool` in the launcher does the same global scan — if we index, both should be updated together.
If we see reconcile latency climb as sandbox counts grow, we'll revisit with a pool-ref index.
@phinze, understood — those are all solid reasons. The consistency argument is particularly compelling, and coordinating a pool-ref index across both listSandboxes and hasRunningSandboxForPool in a dedicated follow-up is the right call.
✏️ Learnings added
Learnt from: phinze
Repo: mirendev/runtime PR: 672
File: controllers/sandboxpool/manager.go:324-330
Timestamp: 2026-03-13T18:36:08.403Z
Learning: In mirendev/runtime, the global sandbox scan pattern in `listSandboxes` (controllers/sandboxpool/manager.go) and `hasRunningSandboxForPool` (controllers/deployment/launcher.go) is intentional and acceptable at current scale. Pool reconciliation runs on a 10s resync, not on every request. The old code already used global scans for addon pools. The right optimization—adding a stable pool-ref indexed attribute on Sandbox entities—intentionally deferred because it touches the entity schema and sandbox creation paths, and both functions should be updated together when that work is done. Do not flag global-scan / O(pools×sandboxes) patterns in these two functions until a dedicated indexing effort is undertaken.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: phinze
Repo: mirendev/runtime PR: 256
File: controllers/sandboxpool/manager.go:428-563
Timestamp: 2025-10-23T18:56:26.936Z
Learning: Optimistic concurrency control issues in the SandboxPoolManager (controllers/sandboxpool/manager.go) should be deferred to issue MIR-219, which tracks a holistic review of the optimistic concurrency locking system. Do not suggest adding Revision field checks or optimistic locking fixes for SandboxPoolManager Put operations until that effort is complete.
Learnt from: phinze
Repo: mirendev/runtime PR: 294
File: components/activator/activator.go:707-722
Timestamp: 2025-11-05T19:00:27.973Z
Learning: In the Activator (components/activator/activator.go), the ~300ms retry delay during App.ActiveVersion switches is acceptable. This occurs when findPoolInStore temporarily fails to find a reused pool because ReferencedByVersions hasn't been updated yet by DeploymentLauncher. The delay is transient and self-heals once the launcher updates the pool references. This is a known optimization opportunity but not blocking for persistent sandbox pool functionality.
Learnt from: phinze
Repo: mirendev/runtime PR: 651
File: controllers/sandbox/create_saga_test.go:63-70
Timestamp: 2026-03-09T23:00:17.977Z
Learning: In controllers/sandbox/create_saga_test.go, phinze prefers keeping the mockEntityStore.PatchSandbox mock simple (no revision validation or meta.Revision mutation) until tests are added that exercise multi-patch revision ordering. Do not suggest tightening the mock's revision enforcement unless such tests are present in the PR.
Learnt from: phinze
Repo: mirendev/runtime PR: 637
File: controllers/deployment/launcher.go:214-220
Timestamp: 2026-03-02T01:19:30.972Z
Learning: In controllers/deployment/launcher.go, the `cleanupOldVersionPools` function operates at the version level, not the service level. It scales down pools belonging to old versions regardless of which service they're for. If one service fails to create a pool during reconciliation, the other services' old pools still need cleanup once their new pools are ready—cleanup should not be skipped per-service.
Learnt from: phinze
Repo: mirendev/runtime PR: 637
File: controllers/deployment/launcher.go:848-853
Timestamp: 2026-03-02T01:19:45.274Z
Learning: In Go polling loops that call EAC.Get (EntityAccessClient.Get) in mirendev/runtime, avoid wrapping every call with a per-call context.WithTimeout. Instead rely on the parent context for cancellation and perform manual deadline checks between iterations (e.g., time.Now().After(deadline)) while also handling cancellation in the select on the parent context. This provides sufficient timeout enforcement without the overhead and complexity of creating a new child context per iteration. Apply this guideline to polling patterns in controllers/deployment/launcher.go and similar polling scenarios across the repository.
Test that invalidateSandboxLeases correctly evicts cached leases when sandboxes die, only affects matching sandboxes, and works across multiple apps pointing at the same sandbox. Also adds activator.NewTestLease helper for constructing leases with sandbox references in tests outside the activator package.
TestZeroDowntimeDeploy fires continuous HTTP requests through httpingress while triggering a redeploy, verifying no requests fail during the transition. This exercises the proactive lease invalidation end-to-end. Also adds RunCmd and HTTPGet harness helpers for making HTTP requests from inside the dev container via curl.
Actionable comments posted: 1
🧹 Nitpick comments (4)
blackbox/harness/miren.go (1)
105-135: Consider extracting shared command-execution plumbing from `Run`/`RunCmd`.
Env setup, output capture, exit-code handling, and logging are duplicated, which increases drift risk when one path changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@blackbox/harness/miren.go` around lines 105 - 135, The Run and RunCmd implementations duplicate command-execution plumbing (env setup, stdout/stderr capture, exit-code handling, and logging); extract that shared logic into a single helper (e.g., runCommand or executeCmd) and have Run and RunCmd call it. The helper should accept the exec.Cmd or args, perform cmd.Env = append(cmd.Environ(), "TERM=dumb"), wire bytes.Buffers to cmd.Stdout/cmd.Stderr, run cmd, determine exitCode from exec.ExitError, build and return the existing Result struct, and perform the same m.t.Logf logging so callers (Run/RunCmd) only prepare the cmd/args and forward to the helper.
blackbox/zero_downtime_test.go (2)
69-81: Empty conditional block creates confusion about 502/503 tolerance.
Lines 69-72 contain an empty `if` block with a comment stating that 502/503 responses are "noted" and "may be acceptable," but the block does nothing. The subsequent code (lines 73-80) records all errors and all non-200 responses as failures, causing the test to fail at line 106.
This contradiction makes the test's intent unclear. If the goal is strict zero-downtime (no 502/503 allowed), remove the empty block and misleading comment. If some transient errors during transition are acceptable, implement actual tolerance logic.
Suggested fix (assuming strict zero-downtime is intended)
```diff
 code, _, err := harness.HTTPGet(m, host, "/")
 totalRequests.Add(1)
-if err != nil || (code != 200 && code != 502 && code != 503) {
-	// 502/503 during brief transition are noted but may be acceptable
-	// in some edge cases; we track them all
-}
 if err != nil {
 	failMu.Lock()
 	failedResults = append(failedResults, fmt.Sprintf("request #%d: error: %v", totalRequests.Load(), err))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@blackbox/zero_downtime_test.go` around lines 69 - 81, The empty if block around checking (code != 200 && code != 502 && code != 503) is misleading and contradicts the subsequent failure recording logic; either remove that empty conditional and its comment so failures are recorded strictly (zero-downtime), or implement explicit tolerance by skipping appending to failedResults for 502/503. Update the section that appends to failedResults (references: failedResults, totalRequests.Load(), failMu) so it only records non-200 responses when you intend to treat them as failures, or explicitly ignore 502/503 before acquiring failMu if transient errors are acceptable.
89: Use `filepath.Join` for path construction.
String concatenation for paths is less idiomatic than `filepath.Join`, which handles path separators correctly.
Suggested fix
```diff
-m.MustRun("deploy", "-a", name, "-d", m.ContainerPath(c.TestdataDir+"/go-server"), "-f", "-e", "DEPLOY_VERSION=v2")
+m.MustRun("deploy", "-a", name, "-d", m.ContainerPath(filepath.Join(c.TestdataDir, "go-server")), "-f", "-e", "DEPLOY_VERSION=v2")
```
Add `"path/filepath"` to imports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@blackbox/zero_downtime_test.go` at line 89, Replace the string-concatenated path in the m.MustRun call with filepath.Join: call m.ContainerPath(filepath.Join(c.TestdataDir, "go-server")) inside the existing m.MustRun invocation (the function call referencing m.MustRun and m.ContainerPath with c.TestdataDir and "go-server"); also add the "path/filepath" import to the file's imports. Ensure only the path construction is changed and the rest of the m.MustRun arguments remain identical.
servers/httpingress/lease_test.go (1)
50-52: Return a stable invalidations channel from the mock.
On line 51, creating a new channel on every call makes subscriptions non-deterministic and hard to exercise in tests that need to push invalidation events.
💡 Proposed refactor
```diff
 type mockActivator struct {
 	mu           sync.Mutex
 	renewCount   int
 	releaseCount int
 	renewErr     error
 	releasedURLs []string
 	renewedURLs  []string
+	invalidCh    chan activator.SandboxInvalidation
 }
@@
 func (m *mockActivator) Invalidations() <-chan activator.SandboxInvalidation {
-	return make(chan activator.SandboxInvalidation)
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.invalidCh == nil {
+		m.invalidCh = make(chan activator.SandboxInvalidation, 1)
+	}
+	return m.invalidCh
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@servers/httpingress/lease_test.go` around lines 50 - 52, The mockActivator.Invalidations method currently creates a new channel on each call which makes subscriptions flaky; change mockActivator to hold a persistent channel field (e.g., invalidations chan activator.SandboxInvalidation) initialized once (in the mockActivator constructor or struct literal) and have Invalidations() return that stored channel so all callers share a stable invalidation stream.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@blackbox/harness/miren.go`:
- Around line 88-100: RunCmd can panic in ModeLocal because it indexes
args[0]/args[1:] without checking length; add a guard at the start of the
ModeLocal branch in Miren.RunCmd to check len(args) > 0 and, if not, call
m.t.Fatal or m.t.Fatalf with a clear message (e.g., "RunCmd: no args provided
for ModeLocal") so the test fails cleanly instead of panicking; update the
switch case handling for ModeLocal in the RunCmd method to perform this
validation before constructing exec.Command.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 21d6246e-2c6d-46c0-9a7b-94a95f4b59bc
📒 Files selected for processing (5)
- blackbox/harness/http.go
- blackbox/harness/miren.go
- blackbox/zero_downtime_test.go
- components/activator/testing.go
- servers/httpingress/lease_test.go
- hasRunningSandboxForPool returns (bool, error) instead of swallowing list failures silently
- watchInvalidations guards against channel closure to prevent busy loop
- Clarify invalidation condition (compare sb.Status not oldStatus)
- Fix blackbox HTTPGet: use HTTPS on 443 (port 80 redirects), drop -o /dev/stdout that caused curl exit 23
- CLAUDE.md: document that binary must be built inside dev container
- Persist pool version update in store before reconciliation in TestManagerPoolReuse_NoOrphanedSandboxes so the test exercises the actual production scenario (store has ver-2, sandbox has ver-1)
- Guard RunCmd against empty args to prevent panic in local mode
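The "compare sb.Status not oldStatus" clarification above can be sketched as a pure predicate. Type and constant names here are illustrative stand-ins for the repo's compute status enum.

```go
package main

import "fmt"

type Status int

const (
	RUNNING Status = iota
	STOPPED
	DEAD
)

// leftRunning captures the clarified condition: invalidate only when a
// sandbox transitions out of RUNNING. Given oldStatus == RUNNING,
// checking newStatus != RUNNING is equivalent to oldStatus != newStatus,
// but it reads unambiguously as "the new status is no longer RUNNING".
func leftRunning(oldStatus, newStatus Status) bool {
	return oldStatus == RUNNING && newStatus != RUNNING
}

func main() {
	fmt.Println(leftRunning(RUNNING, STOPPED)) // leaving RUNNING: invalidate
	fmt.Println(leftRunning(STOPPED, RUNNING)) // coming up: no invalidation
	fmt.Println(leftRunning(RUNNING, RUNNING)) // no transition: no invalidation
}
```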
🧹 Nitpick comments (2)
blackbox/harness/miren.go (1)
95-139: Consider consolidating duplicated exec/capture/log logic shared with `Run`.
`Run` and `RunCmd` now have near-identical command execution flow (env setup, run, exit-code handling, output capture, logging). Extracting a shared helper would reduce drift risk.
♻️ Suggested refactor sketch
```diff
+func (m *Miren) execAndCapture(logPrefix string, cmd *exec.Cmd, args []string) *Result {
+	m.t.Helper()
+	cmd.Env = append(cmd.Environ(), "TERM=dumb")
+
+	var stdout, stderr bytes.Buffer
+	cmd.Stdout = &stdout
+	cmd.Stderr = &stderr
+
+	err := cmd.Run()
+	exitCode := 0
+	if err != nil {
+		if exitErr, ok := err.(*exec.ExitError); ok {
+			exitCode = exitErr.ExitCode()
+		} else {
+			m.t.Fatalf("failed to execute command: %v", err)
+		}
+	}
+
+	r := &Result{ExitCode: exitCode, Stdout: stdout.String(), Stderr: stderr.String()}
+	m.t.Logf("%s %s → exit %d", logPrefix, strings.Join(args, " "), exitCode)
+	if r.Stdout != "" {
+		m.t.Logf("stdout: %s", r.Stdout)
+	}
+	if r.Stderr != "" {
+		m.t.Logf("stderr: %s", r.Stderr)
+	}
+	return r
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@blackbox/harness/miren.go` around lines 95 - 139, Extract the duplicated command-execution logic from RunCmd and Run into a shared helper (e.g., runCommand or executeCmd) that accepts either an *exec.Cmd or the args+mode and returns the Result struct (ExitCode, Stdout, Stderr) and any non-ExitError error; move the common pieces—setting TERM in cmd.Env, wiring cmd.Dir, capturing stdout/stderr via bytes.Buffer, running cmd, extracting exit code from exec.ExitError, and logging via m.t.Logf—into that helper, then have Run and RunCmd build the appropriate cmd (using devExec for ModeDev and args for ModeLocal) and call the helper to get the Result; keep Result, Run, RunCmd, ModeDev, ModeLocal, devExec, and m.cluster.Mode names so callers remain unchanged.
controllers/sandboxpool/manager_test.go (1)
1323-1325: Use `require.Len` before indexing `sandboxes[0]` (line 1323).
If the length assertion fails, continuing to `sandboxes[0]` can panic and hide the real failure signal.
Suggested patch
```diff
-assert.Equal(t, 1, len(sandboxes), "existing sandbox should still be visible after version update")
+require.Len(t, sandboxes, 1, "existing sandbox should still be visible after version update")
 assert.Equal(t, entity.Id("ver-1"), sandboxes[0].Spec.Version, "original sandbox should be the one found")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/sandboxpool/manager_test.go` around lines 1323 - 1325, Replace the unsafe length assertion with a required length check so the test fails immediately instead of panicking on indexing: call require.Len(t, sandboxes, 1) (or insert require.Len before the existing index access) to ensure sandboxes has one element before accessing sandboxes[0], then keep the existing assertion comparing sandboxes[0].Spec.Version to entity.Id("ver-1").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2924d27b-f8ba-4fe3-bc22-d3ce1d9c65ed
📒 Files selected for processing (2)
- blackbox/harness/miren.go
- controllers/sandboxpool/manager_test.go
phinze left a comment:
Pushed a few tests and fixes, and approving to merge this sucker. Lots of good improvements!
Summary
Fixes multiple bugs that manifest during rapid successive deploys. The scenario: deploy an app 3 times in quick succession — deploy 1 (initial), deploy 2 (identical code, reuses pool), deploy 3 (changed code, new pool).
Bug 1: Pool loses visibility of its own sandboxes during reuse
Root cause: `listSandboxes()` in the sandbox pool controller queried sandboxes by `pool.SandboxSpec.Version` (an indexed ref). During pool reuse, the deployment launcher updates `SandboxSpec.Version` to the new deploy's version. This made the pool lose visibility of its existing sandboxes: they still had the old version in their spec, but the query was now looking for the new version.

Consequence: The pool thinks `current=0` and creates a duplicate sandbox. The original sandbox becomes orphaned: still running, still in DNS, but invisible to its pool and never cleaned up.

Fix: `listSandboxes()` now always queries sandboxes by entity kind and filters by pool label. The pool label is set at sandbox creation and never changes, making it a reliable membership indicator. The version-based query was an optimization that broke correctness during pool reuse.

Bug 2: False crash detection on decommissioned pools
Root cause: When the old pool is decommissioned (desired=0, no version references), its short-lived duplicate sandbox (created by bug 1, lived ~20s) is retired. The crash detection logic counts any sandbox that died within 60s of creation as a "quick crash" — it doesn't distinguish between intentional retirement and actual crashes.
Consequence: The pool enters crash cooldown with exponential backoff, even though no actual crash occurred. This is cosmetic in isolation but could delay future pool reuse.
Fix: Skip crash detection entirely when a pool is being intentionally decommissioned (`DesiredInstances == 0` AND `len(ReferencedByVersions) == 0`). All sandbox deaths in this state are expected, not crashes.

Bug 3: httpingress routes to dead sandboxes for up to 30+ seconds
Root cause: httpingress caches leases (sandbox routing info) and only invalidates them in two ways: (a) when a proxy attempt fails with a connection error, or (b) on a 30-second periodic renewal cycle. Additionally, `RenewLease()` in the activator didn't check sandbox status; it only verified the sandbox was still tracked. Since STOPPED/DEAD sandboxes are deliberately kept in tracking for fail-fast detection, renewals always succeeded even for dead sandboxes.

Consequence: After old sandboxes are retired, httpingress continues routing requests to the dead IP for up to 30+ seconds. Users see "Empty reply from server" (curl error 52), timeouts, and "no route to host" errors until the lease finally expires or a connection error triggers invalidation.
Fix (three layers):
1. New `SandboxInvalidation` event type and `Invalidations()` channel on the `AppActivator` interface. When `watchSandboxes()` detects a sandbox transitioning from RUNNING to any other state, it pushes a notification.
2. httpingress runs a `watchInvalidations()` goroutine that listens on this channel and immediately drops all cached leases pointing to the invalidated sandbox.
3. `RenewLease()` now rejects leases for non-RUNNING sandboxes, so the 30s periodic renewal acts as a backup invalidation mechanism.

Bug 4: ~10 second delay before traffic shifts to new version
Root cause: `waitForPoolReady()` in the deployment launcher polls `pool.ReadyInstances` every 2 seconds. But `ReadyInstances` is only updated when the sandbox pool controller reconciles, which happens on a 10-second resync interval. So even though the new sandbox becomes RUNNING almost immediately (~200ms after container start), the deployment launcher doesn't see it for up to 10 seconds.

Consequence: Old pool cleanup (and thus old sandbox retirement) is delayed by ~10 seconds after the new sandbox is ready. During this window, the old sandbox is still serving traffic with stale code.
Fix: `waitForPoolReady()` now also calls `hasRunningSandboxForPool()`, which queries sandbox entities directly. Sandbox status is updated immediately when the sandbox boots, bypassing the pool controller's 10s resync cycle. The poll interval is reduced from 2s to 500ms since the entity queries are cheap.

Bug sequence (from server logs)
Files changed
- `controllers/sandboxpool/manager.go`: `listSandboxes` to query by kind+pool label instead of version; skip crash detection for decommissioned pools
- `controllers/sandboxpool/manager_test.go`: `TestManagerVersionFiltering` for new pool-label-based membership; update `listSandboxesForPool` helper; add `TestManagerPoolReuse_NoOrphanedSandboxes` and `TestManagerDecommissionedPool_NoCrashDetection`
- `components/activator/activator.go`: `SandboxInvalidation` type and `Invalidations()` on the `AppActivator` interface; send invalidation when sandbox leaves RUNNING; reject `RenewLease` for non-RUNNING sandboxes
- `controllers/deployment/launcher.go`: `waitForPoolReady` checks sandbox entities directly via `hasRunningSandboxForPool`; reduce poll interval to 500ms
- `servers/httpingress/httpingress.go`: `watchInvalidations` goroutine and `invalidateSandboxLeases` to immediately drop cached leases for dead sandboxes
- `servers/httpingress/lease_test.go`: add `Invalidations()` to mock activator

Test plan
- `TestManagerPoolReuse_NoOrphanedSandboxes` — verifies old-version sandboxes remain visible after pool version update
- `TestManagerDecommissionedPool_NoCrashDetection` — verifies DEAD sandboxes in decommissioned pools don't trigger crash detection
- `make lint` clean (0 issues)