Skip to content

MIR-1159: keep ephemeral pools at 1 sandbox and fix lease URL races#821

Merged
evanphx merged 1 commit into
mainfrom
mir-1159-ephemeral-deploys-launching-multiple-sandboxes
May 27, 2026
Merged

MIR-1159: keep ephemeral pools at 1 sandbox and fix lease URL races#821
evanphx merged 1 commit into
mainfrom
mir-1159-ephemeral-deploys-launching-multiple-sandboxes

Conversation

@evanphx
Copy link
Copy Markdown
Contributor

@evanphx evanphx commented May 23, 2026

Summary

On garden, the ephemeral preview deploy for mirendev-preview ended up with a sandbox pool of desired_instances: 8 and requests timing out in the browser. Two distinct bugs were involved:

  1. Ephemeral pools scaled like normal auto-mode pools. Every saturating burst caused the activator to bump DesiredInstances, ratcheting the pool up under traffic instead of queueing on the single preview sandbox.
  2. The activator could return a Lease with an empty URL, and could leave waiters asleep when the URL became available outside of a status transition. Combined with a 50s cap that was too tight for cold image pulls, requests never reached the new sandbox.

Ephemeral pool: cap at 1 instance (commit 28644c7)

  • New SandboxPool.ephemeral bool. Set by Launcher.ensurePoolForService whenever the backing AppVersion.EphemeralLabel != "", and for those versions desiredInstances is forced to 1 regardless of the configured concurrency mode.
  • findMatchingPool now filters by ephemeral status so a non-ephemeral version cannot inherit an ephemeral pool (or vice versa) and end up with the wrong scaling policy.
  • Activator.AcquireLease short-circuits to incrementPool=false for ephemeral pools, and requestPoolCapacity defensively refuses to patch DesiredInstances when Ephemeral && DesiredInstances >= 1.
  • On pool reuse for an ephemeral version, any stale DesiredInstances > 1 is clamped back to 1.

Tests cover both the pool-creation path and the activator's refusal to scale.

Activator request routing: fix lease URL races and bump wait budget (commit 4823db8)

  • waitForSandbox.checkForSandbox now requires s.url != "" on both the inner scan and the locked double-check, matching the predicate already present in AcquireLease. Previously, a sandbox tracked as RUNNING before its URL had been populated could be returned as a Lease with URL: "", causing httpingress to fail proxying with unsupported protocol scheme "".
  • watchSandboxes captures oldURL before updating, and now also fires the notification fan-out when a RUNNING sandbox's URL transitions from empty to non-empty. Without this, waiters that woke on the status flip but saw url="" stayed asleep until the 30s fallback ticker.
  • The per-request wait cap goes from 50s to 120s. Default PortWaitTimeout is 15s and a first-time image pull on a cold node can take 30-90s, so 50s was returning ErrPoolTimeout while sandboxes were still booting. Stays comfortably under httpingress' leaseAcquisitionTimeout (2m). The fallback ticker moves from 30s to 60s to keep the same "half-way" semantics.

Test plan

  • make lint clean
  • make test (iso): 4422 passes, 0 fails
  • New unit tests pass:
    • TestCreatePoolForVersionEphemeral
    • TestNonEphemeralPoolUnaffected
    • TestFindMatchingPoolFiltersByEphemeral
    • TestRequestPoolCapacityRefusesEphemeralPool
    • TestRequestPoolCapacityNonEphemeralStillScales
    • TestWaitForSandboxSkipsEmptyURL
    • TestAcquireLeaseSucceedsAfterURLArrives
  • Manual e2e on dev cluster:
    • Deploy an ephemeral version, hammer the preview URL, confirm desired_instances stays at 1 and requests succeed.
    • Confirm a non-ephemeral deploy still scales normally.
  • On garden: existing pre-fix ephemeral pools keep ephemeral=false until they're recreated. Either let TTL expire them or backfill ephemeral=true on pools whose backing AppVersion.EphemeralLabel != "".

Notes

  • Saturated previews will queue on the single sandbox up to the 120s wait budget. This is acceptable for preview environments — predictable resource cost over high concurrency.
  • The hasPending race between requestPoolCapacity patching the store and watchSandboxes reflecting the new PENDING sandbox is the underlying reason normal auto-mode pools can also overshoot. The ephemeral guard sidesteps it for previews; addressing it for normal pools is a separate follow-up.

@evanphx evanphx requested a review from a team as a code owner May 23, 2026 00:05
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 23, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR replaces the single-value DesiredInstances() strategy with a MinInstances()/MaxInstances() capacity model, adds an EphemeralStrategy and NewStrategyForVersion(), applies version-aware strategies across the activator and pool manager, enforces non-empty sandbox URLs before issuing leases (with wait time increased to 120s), caps pool DesiredInstances using strategy.MaxInstances(), uses strategy.MinInstances() when seeding or scaling down pools, and adds tests validating these behaviors.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
components/activator/activator_test.go (1)

2675-2750: ⚡ Quick win

Add an uncached-pool ephemeral regression test.

This test pre-seeds activator.pools[key], so it only validates the cached-state guard. Please add a case where activator.pools is empty and the pool is found via store lookup, to ensure ephemeral pools still refuse increment there.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/activator/activator_test.go` around lines 2675 - 2750,
TestRequestPoolCapacityRefusesEphemeralPool only exercises the cached-path
because activator.pools is pre-seeded; add a second subcase where
activator.pools is empty to force a store lookup: create the same ephemeral
SandboxPool in the in-memory server (as already done), clear activator.pools (or
create a fresh localActivator with empty pools), then call
activator.requestPoolCapacity(ctx, testVer, "web") and assert the returned pool
has DesiredInstances == 1, stored entity remains Ephemeral with DesiredInstances
1, and the revision did not change; reference the existing test function name
TestRequestPoolCapacityRefusesEphemeralPool, the activator.requestPoolCapacity
method, the activator.pools map and the verKey used to locate the pool to
implement this additional branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/activator/activator_test.go`:
- Around line 3001-3003: The test's call to activator.waitForSandbox is using an
unbounded context (ctx) and can hang; wrap the call with a short timeout context
instead by creating a context.WithTimeout (e.g., a few seconds) and passing that
context to waitForSandbox, ensuring you call the returned cancel() via defer to
clean up; update the invocation of activator.waitForSandbox(ctx, testVer, "web",
false) to use the timeout context variable and keep error/lease handling
unchanged.

In `@components/activator/activator.go`:
- Around line 602-611: The ephemeral-pool guard is only applied for the
cached-path (the block checking state.pool.Ephemeral and DesiredInstances) but
not applied in the cache-miss / foundPoolWithRev code path where
DesiredInstances is incremented; update the code path that handles
foundPoolWithRev (the branch that patches/increments DesiredInstances after a
cache miss) to perform the same guard: if pool.Ephemeral &&
pool.DesiredInstances >= 1 then log a debug message (including pool.ID and
service) and return the pool without incrementing or patching, thereby
preventing ephemeral pools from being scaled above 1 via the foundPoolWithRev
path.

---

Nitpick comments:
In `@components/activator/activator_test.go`:
- Around line 2675-2750: TestRequestPoolCapacityRefusesEphemeralPool only
exercises the cached-path because activator.pools is pre-seeded; add a second
subcase where activator.pools is empty to force a store lookup: create the same
ephemeral SandboxPool in the in-memory server (as already done), clear
activator.pools (or create a fresh localActivator with empty pools), then call
activator.requestPoolCapacity(ctx, testVer, "web") and assert the returned pool
has DesiredInstances == 1, stored entity remains Ephemeral with DesiredInstances
1, and the revision did not change; reference the existing test function name
TestRequestPoolCapacityRefusesEphemeralPool, the activator.requestPoolCapacity
method, the activator.pools map and the verKey used to locate the pool to
implement this additional branch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a496ffe-0be8-4480-bef2-5cc414776db1

📥 Commits

Reviewing files that changed from the base of the PR and between 0611dce and 4823db8.

📒 Files selected for processing (6)
  • api/compute/compute_v1alpha/schema.gen.go
  • api/compute/schema.yml
  • components/activator/activator.go
  • components/activator/activator_test.go
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go

Comment thread components/activator/activator_test.go
Comment thread components/activator/activator.go
Copy link
Copy Markdown
Contributor

@phinze phinze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, it's us. You know who and you know what.

The fixes are clean but we think we see a better abstraction here worth chasing. Going to request-changes on the pool.Ephemeral field specifically. The incident pieces themselves (empty-URL predicate, late-URL notification, wait budget bump) look right and should land as-is.

Here's the thread we pulled on: ephemeral is already a really well-localized AppVersion-level concept (ephemeral_label, ephemeral_ttl, ephemeral_expires_at, plus pkg/ephemeral and the GC controller). Every existing site derives ephemeral-ness on demand from the version. The new pool bool would be the first place we copy it onto a sub-entity, with the clamp-on-reuse code in the launcher to keep the copy in sync. CodeRabbit also caught a missed check site in the cache-miss path, which is decent evidence the bool-and-conditionals approach is fragile in the way you'd predict.

The cleaner shape uses an abstraction already in the codebase: route ephemeral through the strategy interface at construction. FixedStrategy{numInstances: 1} covers the "don't scale" semantic for free (its checkCapacity always returns true, so requestPoolCapacity is never reached). The activator changes here mostly disappear, schema stays untouched.

Grabbing the pen on this one. Going to push a commit or two showing what we mean. Hold off from your side while we cook, and we'll be back with another review when they're up.

Copy link
Copy Markdown
Contributor

@phinze phinze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi again from us. Pushed three commits to demonstrate what we meant.

0182906b (floor) routes ephemeral versions through FixedStrategy{1} at strategy construction. Equivalent behavior to what this PR shipped, without the pool.Ephemeral schema field, the three activator guards, the launcher clamp, or the find-matching filter. Strategy gains a NewStrategyForVersion(ver, service, svc) wrapper that does the dispatch in one place.

33412293 (stretch) introduces EphemeralStrategy proper: scale-to-zero capped at 1, with a default 15-minute idle window so browsing a preview tab doesn't churn through cold starts. The cap mechanism becomes MinInstances()/MaxInstances() on the strategy interface, and the activator's >= MaxPoolSize checks now consult strategy.MaxInstances() so the cap is type-driven per pool. The launcher consults strategy.MinInstances() for seed and reuse, so it no longer needs to know anything about ephemeral routing.

5be88ca4 picks up CodeRabbit's suggestion to bound the wait context in TestAcquireLeaseSucceedsAfterURLArrives so a regression fails in 2s rather than hanging for 120s.

A couple of things that came up while we were in here but didn't act on:

  • The requestPoolCapacity cache layer has a non-trivial amount of "is the cache stale?" defensive logic. We bumped into exactly the smell that MIR-596 already tracks; out of scope here.
  • The bare "web" literal used as the routing-target check in NewStrategyForVersion is one of many sites in the codebase that hardcode that string. Filed MIR-1169 to track sweeping it into a named constant; left bare in our changes to match the surrounding convention.

Take a look when you have a moment. Happy to talk through any of it.

@phinze phinze force-pushed the mir-1159-ephemeral-deploys-launching-multiple-sandboxes branch from 5be88ca to e65f091 Compare May 26, 2026 23:29
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/concurrency/strategy.go`:
- Around line 180-182: EphemeralStrategy.releaseCapacity is a no-op while
ConcurrencyTracker.AcquireLease increments used, so ephemeral sandboxes never
return to Used()==0 and prevent scaling to zero; update
EphemeralStrategy.releaseCapacity to decrement the tracker by the given size
(call the appropriate ConcurrencyTracker release method or subtract from its
used counter), perform a nil check on tracker, and clamp/validate so used never
goes negative; keep the change contained to the
EphemeralStrategy.releaseCapacity implementation so
MinInstances()/ScaleDownDelay() semantics are restored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 96d06cc4-05f9-430e-b1a3-5221524c0c09

📥 Commits

Reviewing files that changed from the base of the PR and between 5be88ca and e65f091.

📒 Files selected for processing (7)
  • components/activator/activator.go
  • components/activator/activator_test.go
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go
  • controllers/sandboxpool/manager.go
  • pkg/concurrency/strategy.go
  • pkg/concurrency/strategy_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/concurrency/strategy_test.go
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go
  • components/activator/activator.go

Comment thread pkg/concurrency/strategy.go
@phinze phinze force-pushed the mir-1159-ephemeral-deploys-launching-multiple-sandboxes branch from e65f091 to 01c2233 Compare May 27, 2026 00:03
On garden, the ephemeral preview deploy for mirendev-preview ended up
with a sandbox pool of desired_instances: 8 and requests timing out
in the browser. Two distinct bugs were involved:

1. Ephemeral pools scaled like normal auto-mode pools. Every
   saturating burst caused the activator to bump DesiredInstances,
   ratcheting the pool up under traffic instead of queueing on the
   single preview sandbox.

2. The activator could return a Lease with an empty URL, and could
   leave waiters asleep when the URL became available outside of a
   status transition. Combined with a 50s cap that was too tight for
   cold image pulls, requests never reached the new sandbox.

## Empty-URL lease + wait budget (originally by Evan)

waitForSandbox's checkForSandbox now requires url != "" on both the
inner scan and the locked double-check, matching the predicate
already present in AcquireLease. watchSandboxes captures oldURL
before updating, and fires the notification fan-out when a RUNNING
sandbox's URL transitions from empty to non-empty. Per-request wait
cap goes from 50s to 120s (default PortWaitTimeout is 15s; first-time
image pull on a cold node can take 30-90s; stays under httpingress'
leaseAcquisitionTimeout of 2m). Fallback ticker moves from 30s to
60s to keep "half-way" semantics. TestAcquireLeaseSucceedsAfterURLArrives
bounds its wait context to 2s so a regression in the URL-arrival
notification path fails fast rather than hanging for 120s (per
CodeRabbit).

## Ephemeral routing via the strategy interface

The first cut added a pool.Ephemeral schema field plus scattered "if
Ephemeral" guards in the activator and launcher. That mirrors
version-level state onto a sub-entity and creates a synchronization
burden, and we found we'd missed a guard site (the cache-miss path in
requestPoolCapacity). This consolidates the dispatch into a single
decision at strategy construction.

NewStrategyForVersion routes the web service of an ephemeral version
to EphemeralStrategy{}; supporting services (db, worker, etc.) on the
same ephemeral version honor their configured ServiceConcurrency like
a normal deploy. EphemeralStrategy mirrors FixedStrategy's capacity
model (HasCapacity always true, no per-sandbox tracking, because
there's no scaling alternative for an overloaded preview sandbox so
traffic just routes at the one instance), with two distinguishing
differences: MinInstances=0 so the sandboxpool manager scales the
pool to zero when idle, and ScaleDownDelay defaults to 15 minutes so
browsing a preview tab doesn't churn through cold starts. The
configured ScaleDownDelay on the service still wins if set.

The strategy interface gains MinInstances()/MaxInstances(), the
bounds the strategy enforces on pool size: Auto returns [0,
MaxPoolSize], Fixed [N, N], Ephemeral [0, 1]. The activator's three
ceiling checks in requestPoolCapacity consult strategy.MaxInstances()
instead of the MaxPoolSize const, so the cap is type-driven per pool.
When a pool is at cap, requestPoolCapacity returns the cached pool
without erroring; the caller waits for a sandbox via waitForSandbox's
polling loop. MaxPoolSize moves from activator to pkg/concurrency.

The launcher consults strategy.MinInstances() for both initial pool
seeding and update-on-reuse, so it no longer needs to know about
ephemeral routing for the scaling decision. The only ephemeral check
that remains at the launcher is in findMatchingPool: ephemeral
versions always get an isolated pool, since sharing a pool with a
non-ephemeral version would mean the pool's scaling behavior depends
on which version is making the request.
@phinze phinze force-pushed the mir-1159-ephemeral-deploys-launching-multiple-sandboxes branch from 01c2233 to 946ecb2 Compare May 27, 2026 17:28
@evanphx evanphx merged commit d88b862 into main May 27, 2026
17 checks passed
@evanphx evanphx deleted the mir-1159-ephemeral-deploys-launching-multiple-sandboxes branch May 27, 2026 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants