
Fix rapid deploys causing orphaned sandboxes and stale routing #672

Merged
phinze merged 5 commits into main from mir-797-investigate-rapid-deploys-showing-up-as-sandbox-po
Mar 13, 2026

Conversation

@evanphx
Contributor

@evanphx evanphx commented Mar 13, 2026

Summary

Fixes multiple bugs that manifest during rapid successive deploys. The scenario: deploy an app 3 times in quick succession — deploy 1 (initial), deploy 2 (identical code, reuses pool), deploy 3 (changed code, new pool).

Bug 1: Pool loses visibility of its own sandboxes during reuse

Root cause: listSandboxes() in the sandbox pool controller queried sandboxes by pool.SandboxSpec.Version (an indexed ref). During pool reuse, the deployment launcher updates SandboxSpec.Version to the new deploy's version. This made the pool lose visibility of its existing sandboxes — they still had the old version in their spec but the query was now looking for the new version.

Consequence: The pool thinks current=0, creates a duplicate sandbox. The original sandbox becomes orphaned — still running, still in DNS, but invisible to its pool and never cleaned up.

Fix: listSandboxes() now always queries sandboxes by entity kind and filters by pool label. The pool label is set at sandbox creation and never changes, making it a reliable membership indicator. The version-based query was an optimization that broke correctness during pool reuse.

Bug 2: False crash detection on decommissioned pools

Root cause: When the old pool is decommissioned (desired=0, no version references), its short-lived duplicate sandbox (created by bug 1, lived ~20s) is retired. The crash detection logic counts any sandbox that died within 60s of creation as a "quick crash" — it doesn't distinguish between intentional retirement and actual crashes.

Consequence: The pool enters crash cooldown with exponential backoff, even though no actual crash occurred. This is cosmetic in isolation but could delay future pool reuse.

Fix: Skip crash detection entirely when a pool is being intentionally decommissioned (DesiredInstances == 0 AND len(ReferencedByVersions) == 0). All sandbox deaths in this state are expected, not crashes.

Bug 3: httpingress routes to dead sandboxes for 30+ seconds

Root cause: httpingress caches leases (sandbox routing info) and only invalidates them in two ways: (a) when a proxy attempt fails with a connection error, or (b) on a 30-second periodic renewal cycle. Additionally, RenewLease() in the activator didn't check sandbox status — it only verified the sandbox was still tracked. Since STOPPED/DEAD sandboxes are deliberately kept in tracking for fail-fast detection, renewals always succeeded even for dead sandboxes.

Consequence: After old sandboxes are retired, httpingress continues routing requests to the dead IP for 30+ seconds. Users see "Empty reply from server" (curl error 52), timeouts, and "no route to host" errors until the lease finally expires or a connection error triggers invalidation.

Fix (three layers):

  1. Proactive invalidation: Added SandboxInvalidation event type and Invalidations() channel to the AppActivator interface. When watchSandboxes() detects a sandbox transitioning from RUNNING to any other state, it pushes a notification. httpingress runs a watchInvalidations() goroutine that listens on this channel and immediately drops all cached leases pointing to the invalidated sandbox.
  2. RenewLease status check: RenewLease() now rejects leases for non-RUNNING sandboxes, so the 30s periodic renewal acts as a backup invalidation mechanism.
  3. Existing retry logic: The existing cached-lease retry path (invalidate + acquire fresh on connection error) continues to work as a last resort.

Bug 4: ~10 second delay before traffic shifts to new version

Root cause: waitForPoolReady() in the deployment launcher polls pool.ReadyInstances every 2 seconds. But ReadyInstances is only updated when the sandbox pool controller reconciles, which happens on a 10-second resync interval. So even though the new sandbox becomes RUNNING almost immediately (~200ms after container start), the deployment launcher doesn't see it for up to 10 seconds.

Consequence: Old pool cleanup (and thus old sandbox retirement) is delayed by ~10 seconds after the new sandbox is ready. During this window, the old sandbox is still serving traffic with stale code.

Fix: waitForPoolReady() now also calls hasRunningSandboxForPool(), which queries sandbox entities directly. Sandbox status is updated immediately when the sandbox boots, bypassing the pool controller's 10s resync cycle. Poll interval reduced from 2s to 500ms since the entity queries are cheap.

Bug sequence (from server logs)

Deploy 1: Pool A (v1) → S1 (v1) created, boots, serves traffic ✓
Deploy 2: Pool A reused, SandboxSpec.Version updated to v2
         → listSandboxes queries v2, misses S1 (has v1) → creates S2 (v2)
         → S1 still running, serving traffic, but invisible to pool
Deploy 3: Pool B (v3) → S3 (v3) created (new pool because code changed)
         → Pool A decommissioned (desired=0)
         → S2 retired (lived ~20s) → crash detection triggers falsely
         → S1 STILL RUNNING, orphaned, in DNS, never cleaned up
         → httpingress still has cached lease for S1's IP
         → ~10s delay before Pool B's ReadyInstances updates
         → requests to dead S1 IP return empty replies / timeouts

Files changed

Files changed:

  • controllers/sandboxpool/manager.go — Fix listSandboxes to query by kind+pool label instead of version; skip crash detection for decommissioned pools
  • controllers/sandboxpool/manager_test.go — Update TestManagerVersionFiltering for new pool-label-based membership; update listSandboxesForPool helper; add TestManagerPoolReuse_NoOrphanedSandboxes and TestManagerDecommissionedPool_NoCrashDetection
  • components/activator/activator.go — Add SandboxInvalidation type and Invalidations() to AppActivator interface; send invalidation when sandbox leaves RUNNING; reject RenewLease for non-RUNNING sandboxes
  • controllers/deployment/launcher.go — waitForPoolReady checks sandbox entities directly via hasRunningSandboxForPool; reduce poll interval to 500ms
  • servers/httpingress/httpingress.go — Add watchInvalidations goroutine and invalidateSandboxLeases to immediately drop cached leases for dead sandboxes
  • servers/httpingress/lease_test.go — Add Invalidations() to mock activator

Test plan

  • All existing sandboxpool tests pass (18 tests)
  • All existing httpingress lease tests pass
  • All existing deployment launcher tests pass
  • New test: TestManagerPoolReuse_NoOrphanedSandboxes — verifies old-version sandboxes remain visible after pool version update
  • New test: TestManagerDecommissionedPool_NoCrashDetection — verifies DEAD sandboxes in decommissioned pools don't trigger crash detection
  • make lint clean (0 issues)
  • Manual test: 3 rapid deploys in dev environment — no orphaned sandboxes, no false crashes, immediate lease invalidation visible in logs, traffic shifts to new version within ~1s

…e routing

Three bugs manifested during rapid successive deploys:

1. Pool sandbox visibility: listSandboxes queried by SandboxSpec.Version,
   which changes during pool reuse, causing the pool to lose visibility of
   existing sandboxes. Now queries by entity kind and filters by pool label,
   which never changes.

2. False crash detection: Short-lived sandboxes in decommissioned pools
   (desired=0, no references) triggered crash detection. Now skips crash
   detection for pools being intentionally torn down.

3. Stale lease routing: httpingress cached leases pointing to dead sandboxes
   for up to 30s (renewal cycle), causing empty replies and timeouts.
   - Activator now pushes SandboxInvalidation events when sandboxes leave
     RUNNING state
   - httpingress watches the invalidation channel and immediately drops
     affected cached leases
   - RenewLease rejects STOPPED/DEAD sandboxes as a backup

4. Slow traffic shift: waitForPoolReady polled pool.ReadyInstances (updated
   on 10s resync) instead of checking sandbox status directly. Now also
   checks sandbox entities which are updated immediately when sandboxes boot,
   with a 500ms poll interval.
@evanphx evanphx requested a review from a team as a code owner March 13, 2026 01:58

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough


Adds a SandboxInvalidation event and exposes invalidations via AppActivator.Invalidations(). The local activator publishes invalidations when sandboxes transition away from RUNNING and RenewLease is rejected for non-RUNNING sandboxes. The HTTP ingress server listens for invalidations to evict and release affected leases. Pool readiness now also detects RUNNING sandboxes directly; sandbox listing is based on pool labels and crash-cooldown is skipped for decommissioned pools. Tests and harnesses add helpers (NewTestLease, mock Invalidations, Miren.RunCmd, harness.HTTPGet) and a zero-downtime blackbox deploy test. CLAUDE.md build instructions clarified.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/activator/activator.go (1)

1014-1020: ⚠️ Potential issue | 🟠 Major

Emit invalidation on sandbox delete events too

Invalidations are only emitted on status transitions. A direct delete event currently removes tracking without invalidating cached leases, so stale routes can survive until renewal/error fallback.

🔧 Suggested fix
        if op.IsDelete() {
                // Entity was deleted - clean up from tracking
                // The ID should still be available in the operation even without the entity
                if op.HasEntityId() {
-                       a.removeSandboxFromTracking(entity.Id(op.EntityId()))
+                       sandboxID := entity.Id(op.EntityId())
+                       a.emitInvalidation(sandboxID)
+                       a.removeSandboxFromTracking(sandboxID)
                }
                return nil
        }
+func (a *localActivator) emitInvalidation(sandboxID entity.Id) {
+       select {
+       case a.invalidationCh <- SandboxInvalidation{SandboxID: sandboxID}:
+       default:
+               a.log.Warn("invalidation channel full, dropping notification", "sandbox", sandboxID)
+       }
+}
-               if oldStatus == compute_v1alpha.RUNNING && oldStatus != sb.Status {
-                       select {
-                       case a.invalidationCh <- SandboxInvalidation{SandboxID: sb.ID}:
-                       default:
-                               a.log.Warn("invalidation channel full, dropping notification", "sandbox", sb.ID)
-                       }
-               }
+               if oldStatus == compute_v1alpha.RUNNING && oldStatus != sb.Status {
+                       a.emitInvalidation(sb.ID)
+               }

Also applies to: 1100-1108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/activator/activator.go` around lines 1014 - 1020, When handling
op.IsDelete() (and the similar block at lines ~1100-1108) you remove tracking
with a.removeSandboxFromTracking(entity.Id(op.EntityId())) but never emit a
lease invalidation; change the delete branch to emit the same invalidation used
for status transitions for that sandbox ID before calling
removeSandboxFromTracking (i.e., invoke the existing invalidation helper used
elsewhere in this file/module for sandbox status changes, passing
entity.Id(op.EntityId())), so cached leases/routes are invalidated on delete as
well.
🧹 Nitpick comments (2)
servers/httpingress/httpingress.go (1)

186-207: Avoid calling ReleaseLease while holding h.mu

invalidateSandboxLeases holds the server mutex across external ReleaseLease calls. If release is slow, cache operations block behind this path.

🔧 Suggested refactor
 func (h *Server) invalidateSandboxLeases(ctx context.Context, sandboxID entity.Id) {
-       h.mu.Lock()
-       defer h.mu.Unlock()
+       var toRelease []*activator.Lease
+       h.mu.Lock()

        for app, ar := range h.apps {
                var kept []*lease
                for _, l := range ar.leases {
-                       if l.Lease.Sandbox().ID == sandboxID {
+                       sb := l.Lease.Sandbox()
+                       if sb != nil && sb.ID == sandboxID {
                                h.Log.Info("invalidating cached lease for stopped sandbox",
                                        "app", app, "sandbox", sandboxID, "url", l.Lease.URL)
-                               h.aa.ReleaseLease(ctx, l.Lease)
+                               toRelease = append(toRelease, l.Lease)
                        } else {
                                kept = append(kept, l)
                        }
                }
@@
                        ar.leases = kept
                }
        }
+       h.mu.Unlock()
+
+       for _, lease := range toRelease {
+               _ = h.aa.ReleaseLease(ctx, lease)
+       }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/httpingress.go` around lines 186 - 207,
invalidateSandboxLeases currently holds h.mu while calling h.aa.ReleaseLease
which can block other ops; change it to first collect the leases to release
without calling external code (iterate h.apps/ar.leases under h.mu and build a
slice of toRelease entries and the updated kept slices or app deletions), update
h.apps/ar.leases state while still holding h.mu, then unlock h.mu and iterate
the collected toRelease slice to call h.aa.ReleaseLease(ctx, lease). Use the
same identifiers (invalidateSandboxLeases, h.mu, h.apps, ar.leases,
h.aa.ReleaseLease) so the critical section only touches in-memory maps and the
potentially slow ReleaseLease runs outside the mutex.
servers/httpingress/lease_test.go (1)

48-50: Use a stable mock invalidation channel

Returning a new channel on each call makes it difficult to inject invalidation events in tests that call Invalidations() and send from a different point.

🔧 Suggested refactor
 type mockActivator struct {
        mu           sync.Mutex
        renewCount   int
        releaseCount int
        renewErr     error
        releasedURLs []string
        renewedURLs  []string
+       invalidationCh chan activator.SandboxInvalidation
 }
@@
 func (m *mockActivator) Invalidations() <-chan activator.SandboxInvalidation {
-       return make(chan activator.SandboxInvalidation)
+       m.mu.Lock()
+       defer m.mu.Unlock()
+       if m.invalidationCh == nil {
+               m.invalidationCh = make(chan activator.SandboxInvalidation, 8)
+       }
+       return m.invalidationCh
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/lease_test.go` around lines 48 - 50, The
mockActivator.Invalidations implementation returns a fresh channel each call
which prevents tests from sending events into the same channel; change
mockActivator to hold a persistent field (e.g. invalidations chan
activator.SandboxInvalidation) initialized once (in the mockActivator
constructor or when the struct is created) and have Invalidations() return that
field; also expose or document the field so tests can send invalidation events
into mockActivator.invalidations and close it when done.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 953-958: hasRunningSandboxForPool currently masks listing errors
by returning false, causing waitForPoolReady to treat errors as “not ready”;
change hasRunningSandboxForPool to return (bool, error) (or otherwise surface
the underlying error) and update its callers (including the checks in
waitForPoolReady around the l.hasRunningSandboxForPool calls at the shown
locations) to handle and propagate errors instead of treating them as a negative
result—i.e., if the list call returns an error, return that error up to
waitForPoolReady so the caller can abort/handle the failure rather than timing
out and cleaning up.

In `@controllers/sandboxpool/manager_test.go`:
- Around line 1306-1311: The test mutates pool.SandboxSpec.Version in-memory but
never persists it to the test store, so reconciliation doesn't exercise the
version-update path; update the stored entity before calling reconcilePool by
saving the modified pool into the test datastore (i.e., persist the pool after
setting pool.SandboxSpec.Version = entity.Id("ver-2")) so that
NewManager(...)/reconcilePool(...) operates on the updated pool state.

In `@controllers/sandboxpool/manager.go`:
- Around line 324-330: The current listSandboxes implementation uses a global
indexAttr (entity.Ref(entity.EntityKind, compute_v1alpha.KindSandbox)) and
m.eac.List to fetch all sandboxes then filters in-memory, which is
O(pools×sandboxes); instead introduce and use a stable pool-membership indexed
attribute (e.g. "poolRef" or "sandboxPoolRef") on Sandbox entities and query by
that attribute so the backend returns only sandboxes for the given pool. Update
the Sandbox creation/ownership code to set the poolRef attribute (stable pool ID
or entity.Ref for the pool), then change listSandboxes to build an indexAttr for
sandbox-by-pool (referencing compute_v1alpha.KindSandbox and the poolRef value)
and call m.eac.List with that attribute/value so you avoid global scans (adjust
places using indexAttr, m.eac.List, and entity.Ref accordingly).

In `@servers/httpingress/httpingress.go`:
- Around line 176-181: The select loop reading from the invalidation channel can
spin if the channel from Invalidations() is closed because receiving yields zero
values; update the receive case to use the comma-ok form (inv, ok := <-ch) and
if ok is false treat the closure as a termination condition (e.g., return or
break the goroutine) before calling invalidateSandboxLeases(ctx, inv.SandboxID),
so the handler cleanly exits when the channel is closed; make this change in the
goroutine/select that currently reads from ch and calls invalidateSandboxLeases.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bae1d254-49aa-4f87-b309-012d6f47382d

📥 Commits

Reviewing files that changed from the base of the PR and between 7be361d and 25c558c.

📒 Files selected for processing (6)
  • components/activator/activator.go
  • controllers/deployment/launcher.go
  • controllers/sandboxpool/manager.go
  • controllers/sandboxpool/manager_test.go
  • servers/httpingress/httpingress.go
  • servers/httpingress/lease_test.go

Comment on lines +324 to 330
// Always query all sandboxes by kind and filter by pool label.
// We used to query by pool.SandboxSpec.Version, but that field changes
// during pool reuse (when a new deploy has identical specs), causing the
// pool to lose visibility of its existing sandboxes.
indexAttr := entity.Ref(entity.EntityKind, compute_v1alpha.KindSandbox)

resp, err := m.eac.List(ctx, indexAttr)

@coderabbitai coderabbitai bot Mar 13, 2026


⚠️ Potential issue | 🟠 Major

Global sandbox scans in listSandboxes will become a hot-path bottleneck

This now lists all sandboxes for every pool reconciliation/monitor pass and filters in memory. At scale, this creates O(pools × sandboxes) load and can significantly increase reconcile latency.

Consider introducing an indexed pool-membership attribute (stable pool ID ref) so lookups stay version-agnostic but query-selective.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/sandboxpool/manager.go` around lines 324 - 330, The current
listSandboxes implementation uses a global indexAttr
(entity.Ref(entity.EntityKind, compute_v1alpha.KindSandbox)) and m.eac.List to
fetch all sandboxes then filters in-memory, which is O(pools×sandboxes); instead
introduce and use a stable pool-membership indexed attribute (e.g. "poolRef" or
"sandboxPoolRef") on Sandbox entities and query by that attribute so the backend
returns only sandboxes for the given pool. Update the Sandbox creation/ownership
code to set the poolRef attribute (stable pool ID or entity.Ref for the pool),
then change listSandboxes to build an indexAttr for sandbox-by-pool (referencing
compute_v1alpha.KindSandbox and the poolRef value) and call m.eac.List with that
attribute/value so you avoid global scans (adjust places using indexAttr,
m.eac.List, and entity.Ref accordingly).

Contributor

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged, but we're going to leave this as-is for now. A few reasons:

  1. The old code already did global scans for addon pools (the else branch that fell back to listing by kind) — this isn't a new pattern, we're just making it consistent.
  2. At current scale, the entity store handles these queries fine. Pool reconciliation runs on a 10s resync, not on every request.
  3. Adding an indexed pool-ref attribute is the right optimization if this becomes a bottleneck, but it touches the entity schema and sandbox creation paths — that's a larger change we'd rather do intentionally rather than as part of a bug fix PR.
  4. hasRunningSandboxForPool in the launcher does the same global scan — if we index, both should be updated together.

If we see reconcile latency climb as sandbox counts grow, we'll revisit with a pool-ref index.

Copy link
Copy Markdown

The reason will be displayed to describe this comment to others. Learn more.

@phinze, understood — those are all solid reasons. The consistency argument is particularly compelling, and coordinating a pool-ref index across both listSandboxes and hasRunningSandboxForPool in a dedicated follow-up is the right call.


✏️ Learnings added
Learnt from: phinze
Repo: mirendev/runtime PR: 672
File: controllers/sandboxpool/manager.go:324-330
Timestamp: 2026-03-13T18:36:08.403Z
Learning: In mirendev/runtime, the global sandbox scan pattern in `listSandboxes` (controllers/sandboxpool/manager.go) and `hasRunningSandboxForPool` (controllers/deployment/launcher.go) is intentional and acceptable at current scale. Pool reconciliation runs on a 10s resync, not on every request. The old code already used global scans for addon pools. The right optimization—adding a stable pool-ref indexed attribute on Sandbox entities—intentionally deferred because it touches the entity schema and sandbox creation paths, and both functions should be updated together when that work is done. Do not flag global-scan / O(pools×sandboxes) patterns in these two functions until a dedicated indexing effort is undertaken.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: phinze
Repo: mirendev/runtime PR: 256
File: controllers/sandboxpool/manager.go:428-563
Timestamp: 2025-10-23T18:56:26.936Z
Learning: Optimistic concurrency control issues in the SandboxPoolManager (controllers/sandboxpool/manager.go) should be deferred to issue MIR-219, which tracks a holistic review of the optimistic concurrency locking system. Do not suggest adding Revision field checks or optimistic locking fixes for SandboxPoolManager Put operations until that effort is complete.

Learnt from: phinze
Repo: mirendev/runtime PR: 294
File: components/activator/activator.go:707-722
Timestamp: 2025-11-05T19:00:27.973Z
Learning: In the Activator (components/activator/activator.go), the ~300ms retry delay during App.ActiveVersion switches is acceptable. This occurs when findPoolInStore temporarily fails to find a reused pool because ReferencedByVersions hasn't been updated yet by DeploymentLauncher. The delay is transient and self-heals once the launcher updates the pool references. This is a known optimization opportunity but not blocking for persistent sandbox pool functionality.

Learnt from: phinze
Repo: mirendev/runtime PR: 651
File: controllers/sandbox/create_saga_test.go:63-70
Timestamp: 2026-03-09T23:00:17.977Z
Learning: In controllers/sandbox/create_saga_test.go, phinze prefers keeping the mockEntityStore.PatchSandbox mock simple (no revision validation or meta.Revision mutation) until tests are added that exercise multi-patch revision ordering. Do not suggest tightening the mock's revision enforcement unless such tests are present in the PR.

Learnt from: phinze
Repo: mirendev/runtime PR: 637
File: controllers/deployment/launcher.go:214-220
Timestamp: 2026-03-02T01:19:30.972Z
Learning: In controllers/deployment/launcher.go, the `cleanupOldVersionPools` function operates at the version level, not the service level. It scales down pools belonging to old versions regardless of which service they're for. If one service fails to create a pool during reconciliation, the other services' old pools still need cleanup once their new pools are ready—cleanup should not be skipped per-service.

Learnt from: phinze
Repo: mirendev/runtime PR: 637
File: controllers/deployment/launcher.go:848-853
Timestamp: 2026-03-02T01:19:45.274Z
Learning: In Go polling loops that call EAC.Get (EntityAccessClient.Get) in mirendev/runtime, avoid wrapping every call with a per-call context.WithTimeout. Instead rely on the parent context for cancellation and perform manual deadline checks between iterations (e.g., time.Now().After(deadline)) while also handling cancellation in the select on the parent context. This provides sufficient timeout enforcement without the overhead and complexity of creating a new child context per iteration. Apply this guideline to polling patterns in controllers/deployment/launcher.go and similar polling scenarios across the repository.

phinze added 2 commits March 13, 2026 12:31
Test that invalidateSandboxLeases correctly evicts cached leases when
sandboxes die, only affects matching sandboxes, and works across
multiple apps pointing at the same sandbox.

Also adds activator.NewTestLease helper for constructing leases with
sandbox references in tests outside the activator package.
TestZeroDowntimeDeploy fires continuous HTTP requests through httpingress
while triggering a redeploy, verifying no requests fail during the
transition. This exercises the proactive lease invalidation end-to-end.

Also adds RunCmd and HTTPGet harness helpers for making HTTP requests
from inside the dev container via curl.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
blackbox/harness/miren.go (1)

105-135: Consider extracting shared command-execution plumbing from Run/RunCmd.

Env setup, output capture, exit-code handling, and logging are duplicated, which increases drift risk when one path changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blackbox/harness/miren.go` around lines 105 - 135, The Run and RunCmd
implementations duplicate command-execution plumbing (env setup, stdout/stderr
capture, exit-code handling, and logging); extract that shared logic into a
single helper (e.g., runCommand or executeCmd) and have Run and RunCmd call it.
The helper should accept the exec.Cmd or args, perform cmd.Env =
append(cmd.Environ(), "TERM=dumb"), wire bytes.Buffers to cmd.Stdout/cmd.Stderr,
run cmd, determine exitCode from exec.ExitError, build and return the existing
Result struct, and perform the same m.t.Logf logging so callers (Run/RunCmd)
only prepare the cmd/args and forward to the helper.
blackbox/zero_downtime_test.go (2)

69-81: Empty conditional block creates confusion about 502/503 tolerance.

Lines 69-72 contain an empty if block with a comment stating that 502/503 responses are "noted" and "may be acceptable," but the block does nothing. The subsequent code (lines 73-80) records all errors and all non-200 responses as failures, causing the test to fail at line 106.

This contradiction makes the test's intent unclear. If the goal is strict zero-downtime (no 502/503 allowed), remove the empty block and misleading comment. If some transient errors during transition are acceptable, implement actual tolerance logic.

Suggested fix (assuming strict zero-downtime is intended)
 			code, _, err := harness.HTTPGet(m, host, "/")
 			totalRequests.Add(1)
 
-			if err != nil || (code != 200 && code != 502 && code != 503) {
-				// 502/503 during brief transition are noted but may be acceptable
-				// in some edge cases; we track them all
-			}
 			if err != nil {
 				failMu.Lock()
 				failedResults = append(failedResults, fmt.Sprintf("request #%d: error: %v", totalRequests.Load(), err))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blackbox/zero_downtime_test.go` around lines 69 - 81, The empty if block
around checking (code != 200 && code != 502 && code != 503) is misleading and
contradicts the subsequent failure recording logic; either remove that empty
conditional and its comment so failures are recorded strictly (zero-downtime),
or implement explicit tolerance by skipping appending to failedResults for
502/503. Update the section that appends to failedResults (references:
failedResults, totalRequests.Load(), failMu) so it only records non-200
responses when you intend to treat them as failures, or explicitly ignore
502/503 before acquiring failMu if transient errors are acceptable.

89-89: Use filepath.Join for path construction.

String concatenation for paths is less idiomatic than filepath.Join, which handles path separators correctly.

Suggested fix
-	m.MustRun("deploy", "-a", name, "-d", m.ContainerPath(c.TestdataDir+"/go-server"), "-f", "-e", "DEPLOY_VERSION=v2")
+	m.MustRun("deploy", "-a", name, "-d", m.ContainerPath(filepath.Join(c.TestdataDir, "go-server")), "-f", "-e", "DEPLOY_VERSION=v2")

Add "path/filepath" to imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blackbox/zero_downtime_test.go` at line 89, Replace the string-concatenated
path in the m.MustRun call with filepath.Join: call
m.ContainerPath(filepath.Join(c.TestdataDir, "go-server")) inside the existing
m.MustRun invocation (the function call referencing m.MustRun and
m.ContainerPath with c.TestdataDir and "go-server"); also add the
"path/filepath" import to the file's imports. Ensure only the path construction
is changed and the rest of the m.MustRun arguments remain identical.
servers/httpingress/lease_test.go (1)

50-52: Return a stable invalidations channel from the mock.

On Line 51, creating a new channel on every call makes subscriptions non-deterministic and hard to exercise in tests that need to push invalidation events.

💡 Proposed refactor
 type mockActivator struct {
 	mu           sync.Mutex
 	renewCount   int
 	releaseCount int
 	renewErr     error
 	releasedURLs []string
 	renewedURLs  []string
+	invalidCh    chan activator.SandboxInvalidation
 }
@@
 func (m *mockActivator) Invalidations() <-chan activator.SandboxInvalidation {
-	return make(chan activator.SandboxInvalidation)
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.invalidCh == nil {
+		m.invalidCh = make(chan activator.SandboxInvalidation, 1)
+	}
+	return m.invalidCh
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/lease_test.go` around lines 50 - 52, The
mockActivator.Invalidations method currently creates a new channel on each call
which makes subscriptions flaky; change mockActivator to hold a persistent
channel field (e.g., invalidations chan activator.SandboxInvalidation)
initialized once (in the mockActivator constructor or struct literal) and have
Invalidations() return that stored channel so all callers share a stable
invalidation stream.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@blackbox/harness/miren.go`:
- Around line 88-100: RunCmd can panic in ModeLocal because it indexes
args[0]/args[1:] without checking length; add a guard at the start of the
ModeLocal branch in Miren.RunCmd to check len(args) > 0 and, if not, call
m.t.Fatal or m.t.Fatalf with a clear message (e.g., "RunCmd: no args provided
for ModeLocal") so the test fails cleanly instead of panicking; update the
switch case handling for ModeLocal in the RunCmd method to perform this
validation before constructing exec.Command.

---



ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 21d6246e-2c6d-46c0-9a7b-94a95f4b59bc

📥 Commits

Reviewing files that changed from the base of the PR and between 25c558c and 2377903.

📒 Files selected for processing (5)
  • blackbox/harness/http.go
  • blackbox/harness/miren.go
  • blackbox/zero_downtime_test.go
  • components/activator/testing.go
  • servers/httpingress/lease_test.go

phinze added 2 commits March 13, 2026 13:25
- hasRunningSandboxForPool returns (bool, error) instead of swallowing
  list failures silently
- watchInvalidations guards against channel closure to prevent busy loop
- Clarify invalidation condition (compare sb.Status not oldStatus)
- Fix blackbox HTTPGet: use HTTPS on 443 (port 80 redirects), drop
  -o /dev/stdout that caused curl exit 23
- CLAUDE.md: document that binary must be built inside dev container
- Persist pool version update in store before reconciliation in
  TestManagerPoolReuse_NoOrphanedSandboxes so the test exercises the
  actual production scenario (store has ver-2, sandbox has ver-1)
- Guard RunCmd against empty args to prevent panic in local mode

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
blackbox/harness/miren.go (1)

95-139: Consider consolidating duplicated exec/capture/log logic shared with Run.

Run and RunCmd now have near-identical command execution flow (env setup, run, exit-code handling, output capture, logging). Extracting a shared helper would reduce drift risk.

♻️ Suggested refactor sketch
+func (m *Miren) execAndCapture(logPrefix string, cmd *exec.Cmd, args []string) *Result {
+	m.t.Helper()
+	cmd.Env = append(cmd.Environ(), "TERM=dumb")
+
+	var stdout, stderr bytes.Buffer
+	cmd.Stdout = &stdout
+	cmd.Stderr = &stderr
+
+	err := cmd.Run()
+	exitCode := 0
+	if err != nil {
+		if exitErr, ok := err.(*exec.ExitError); ok {
+			exitCode = exitErr.ExitCode()
+		} else {
+			m.t.Fatalf("failed to execute command: %v", err)
+		}
+	}
+
+	r := &Result{ExitCode: exitCode, Stdout: stdout.String(), Stderr: stderr.String()}
+	m.t.Logf("%s %s → exit %d", logPrefix, strings.Join(args, " "), exitCode)
+	if r.Stdout != "" { m.t.Logf("stdout: %s", r.Stdout) }
+	if r.Stderr != "" { m.t.Logf("stderr: %s", r.Stderr) }
+	return r
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@blackbox/harness/miren.go` around lines 95 - 139, Extract the duplicated
command-execution logic from RunCmd and Run into a shared helper (e.g.,
runCommand or executeCmd) that accepts either an *exec.Cmd or the args+mode and
returns the Result struct (ExitCode, Stdout, Stderr) and any non-ExitError
error; move the common pieces—setting TERM in cmd.Env, wiring cmd.Dir, capturing
stdout/stderr via bytes.Buffer, running cmd, extracting exit code from
exec.ExitError, and logging via m.t.Logf—into that helper, then have Run and
RunCmd build the appropriate cmd (using devExec for ModeDev and args for
ModeLocal) and call the helper to get the Result; keep Result, Run, RunCmd,
ModeDev, ModeLocal, devExec, and m.cluster.Mode names so callers remain
unchanged.
controllers/sandboxpool/manager_test.go (1)

1323-1325: Use require.Len before indexing sandboxes[0] (Line 1323).

If the length assertion fails, continuing to sandboxes[0] can panic and hide the real failure signal.

Suggested patch
-	assert.Equal(t, 1, len(sandboxes), "existing sandbox should still be visible after version update")
+	require.Len(t, sandboxes, 1, "existing sandbox should still be visible after version update")
 	assert.Equal(t, entity.Id("ver-1"), sandboxes[0].Spec.Version, "original sandbox should be the one found")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/sandboxpool/manager_test.go` around lines 1323 - 1325, Replace
the unsafe length assertion with a required length check so the test fails
immediately instead of panicking on indexing: call require.Len(t, sandboxes, 1)
(or insert require.Len before the existing index access) to ensure sandboxes has
one element before accessing sandboxes[0], then keep the existing assertion
comparing sandboxes[0].Spec.Version to entity.Id("ver-1").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2924d27b-f8ba-4fe3-bc22-d3ce1d9c65ed

📥 Commits

Reviewing files that changed from the base of the PR and between 0cee6ef and 63b63a9.

📒 Files selected for processing (2)
  • blackbox/harness/miren.go
  • controllers/sandboxpool/manager_test.go

Contributor

@phinze phinze left a comment


Pushed a few tests and fixes, and approving to merge this sucker. Lots of good improvements!

@phinze phinze merged commit 765443a into main Mar 13, 2026
11 checks passed
@phinze phinze deleted the mir-797-investigate-rapid-deploys-showing-up-as-sandbox-po branch March 13, 2026 19:02