MIR-1159: keep ephemeral pools at 1 sandbox and fix lease URL races#821
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR replaces the single-value DesiredInstances() strategy with a MinInstances()/MaxInstances() capacity model, adds an EphemeralStrategy and NewStrategyForVersion(), applies version-aware strategies across the activator and pool manager, enforces non-empty sandbox URLs before issuing leases (with wait time increased to 120s), caps pool DesiredInstances using strategy.MaxInstances(), uses strategy.MinInstances() when seeding or scaling down pools, and adds tests validating these behaviors. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
components/activator/activator_test.go (1)
2675-2750: ⚡ Quick winAdd an uncached-pool ephemeral regression test.
This test pre-seeds
activator.pools[key], so it only validates the cached-state guard. Please add a case whereactivator.poolsis empty and the pool is found via store lookup, to ensure ephemeral pools still refuse increment there.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/activator/activator_test.go` around lines 2675 - 2750, TestRequestPoolCapacityRefusesEphemeralPool only exercises the cached-path because activator.pools is pre-seeded; add a second subcase where activator.pools is empty to force a store lookup: create the same ephemeral SandboxPool in the in-memory server (as already done), clear activator.pools (or create a fresh localActivator with empty pools), then call activator.requestPoolCapacity(ctx, testVer, "web") and assert the returned pool has DesiredInstances == 1, stored entity remains Ephemeral with DesiredInstances 1, and the revision did not change; reference the existing test function name TestRequestPoolCapacityRefusesEphemeralPool, the activator.requestPoolCapacity method, the activator.pools map and the verKey used to locate the pool to implement this additional branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/activator/activator_test.go`:
- Around line 3001-3003: The test's call to activator.waitForSandbox is using an
unbounded context (ctx) and can hang; wrap the call with a short timeout context
instead by creating a context.WithTimeout (e.g., a few seconds) and passing that
context to waitForSandbox, ensuring you call the returned cancel() via defer to
clean up; update the invocation of activator.waitForSandbox(ctx, testVer, "web",
false) to use the timeout context variable and keep error/lease handling
unchanged.
In `@components/activator/activator.go`:
- Around line 602-611: The ephemeral-pool guard is only applied for the
cached-path (the block checking state.pool.Ephemeral and DesiredInstances) but
not applied in the cache-miss / foundPoolWithRev code path where
DesiredInstances is incremented; update the code path that handles
foundPoolWithRev (the branch that patches/increments DesiredInstances after a
cache miss) to perform the same guard: if pool.Ephemeral &&
pool.DesiredInstances >= 1 then log a debug message (including pool.ID and
service) and return the pool without incrementing or patching, thereby
preventing ephemeral pools from being scaled above 1 via the foundPoolWithRev
path.
---
Nitpick comments:
In `@components/activator/activator_test.go`:
- Around line 2675-2750: TestRequestPoolCapacityRefusesEphemeralPool only
exercises the cached-path because activator.pools is pre-seeded; add a second
subcase where activator.pools is empty to force a store lookup: create the same
ephemeral SandboxPool in the in-memory server (as already done), clear
activator.pools (or create a fresh localActivator with empty pools), then call
activator.requestPoolCapacity(ctx, testVer, "web") and assert the returned pool
has DesiredInstances == 1, stored entity remains Ephemeral with DesiredInstances
1, and the revision did not change; reference the existing test function name
TestRequestPoolCapacityRefusesEphemeralPool, the activator.requestPoolCapacity
method, the activator.pools map and the verKey used to locate the pool to
implement this additional branch.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1a496ffe-0be8-4480-bef2-5cc414776db1
📒 Files selected for processing (6)
api/compute/compute_v1alpha/schema.gen.goapi/compute/schema.ymlcomponents/activator/activator.gocomponents/activator/activator_test.gocontrollers/deployment/launcher.gocontrollers/deployment/launcher_test.go
phinze
left a comment
There was a problem hiding this comment.
Hi, it's us. You know who and you know what.
The fixes are clean but we think we see a better abstraction here worth chasing. Going to request-changes on the pool.Ephemeral field specifically. The incident pieces themselves (empty-URL predicate, late-URL notification, wait budget bump) look right and should land as-is.
Here's the thread we pulled on: ephemeral is already a really well-localized AppVersion-level concept (ephemeral_label, ephemeral_ttl, ephemeral_expires_at, plus pkg/ephemeral and the GC controller). Every existing site derives ephemeral-ness on demand from the version. The new pool bool would be the first place we copy it onto a sub-entity, with the clamp-on-reuse code in the launcher to keep the copy in sync. CodeRabbit also caught a missed check site in the cache-miss path, which is decent evidence the bool-and-conditionals approach is fragile in the way you'd predict.
The cleaner shape uses an abstraction already in the codebase: route ephemeral through the strategy interface at construction. FixedStrategy{numInstances: 1} covers the "don't scale" semantic for free (its checkCapacity always returns true, so requestPoolCapacity is never reached). The activator changes here mostly disappear, schema stays untouched.
Grabbing the pen on this one. Going to push a commit or two showing what we mean. Hold off from your side while we cook, and we'll be back with another review when they're up.
phinze
left a comment
There was a problem hiding this comment.
Hi again from us. Pushed three commits to demonstrate what we meant.
0182906b (floor) routes ephemeral versions through FixedStrategy{1} at strategy construction. Equivalent behavior to what this PR shipped, without the pool.Ephemeral schema field, the three activator guards, the launcher clamp, or the find-matching filter. Strategy gains a NewStrategyForVersion(ver, service, svc) wrapper that does the dispatch in one place.
33412293 (stretch) introduces EphemeralStrategy proper: scale-to-zero capped at 1, with a default 15-minute idle window so browsing a preview tab doesn't churn through cold starts. The cap mechanism becomes MinInstances()/MaxInstances() on the strategy interface, and the activator's >= MaxPoolSize checks now consult strategy.MaxInstances() so the cap is type-driven per pool. The launcher consults strategy.MinInstances() for seed and reuse, so it no longer needs to know anything about ephemeral routing.
5be88ca4 picks up CodeRabbit's suggestion to bound the wait context in TestAcquireLeaseSucceedsAfterURLArrives so a regression fails in 2s rather than hanging for 120s.
A couple of things that came up while we were in here but didn't act on:
- The
requestPoolCapacitycache layer has a non-trivial amount of "is the cache stale?" defensive logic. We bumped into exactly the smell that MIR-596 already tracks; out of scope here. - The bare
"web"literal used as the routing-target check inNewStrategyForVersionis one of many sites in the codebase that hardcode that string. Filed MIR-1169 to track sweeping it into a named constant; left bare in our changes to match the surrounding convention.
Take a look when you have a moment. Happy to talk through any of it.
5be88ca to
e65f091
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/concurrency/strategy.go`:
- Around line 180-182: EphemeralStrategy.releaseCapacity is a no-op while
ConcurrencyTracker.AcquireLease increments used, so ephemeral sandboxes never
return to Used()==0 and prevent scaling to zero; update
EphemeralStrategy.releaseCapacity to decrement the tracker by the given size
(call the appropriate ConcurrencyTracker release method or subtract from its
used counter), perform a nil check on tracker, and clamp/validate so used never
goes negative; keep the change contained to the
EphemeralStrategy.releaseCapacity implementation so
MinInstances()/ScaleDownDelay() semantics are restored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 96d06cc4-05f9-430e-b1a3-5221524c0c09
📒 Files selected for processing (7)
components/activator/activator.gocomponents/activator/activator_test.gocontrollers/deployment/launcher.gocontrollers/deployment/launcher_test.gocontrollers/sandboxpool/manager.gopkg/concurrency/strategy.gopkg/concurrency/strategy_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/concurrency/strategy_test.go
- controllers/deployment/launcher.go
- controllers/deployment/launcher_test.go
- components/activator/activator.go
e65f091 to
01c2233
Compare
On garden, the ephemeral preview deploy for mirendev-preview ended up
with a sandbox pool of desired_instances: 8 and requests timing out
in the browser. Two distinct bugs were involved:
1. Ephemeral pools scaled like normal auto-mode pools. Every
saturating burst caused the activator to bump DesiredInstances,
ratcheting the pool up under traffic instead of queueing on the
single preview sandbox.
2. The activator could return a Lease with an empty URL, and could
leave waiters asleep when the URL became available outside of a
status transition. Combined with a 50s cap that was too tight for
cold image pulls, requests never reached the new sandbox.
## Empty-URL lease + wait budget (originally by Evan)
waitForSandbox's checkForSandbox now requires url != "" on both the
inner scan and the locked double-check, matching the predicate
already present in AcquireLease. watchSandboxes captures oldURL
before updating, and fires the notification fan-out when a RUNNING
sandbox's URL transitions from empty to non-empty. Per-request wait
cap goes from 50s to 120s (default PortWaitTimeout is 15s; first-time
image pull on a cold node can take 30-90s; stays under httpingress'
leaseAcquisitionTimeout of 2m). Fallback ticker moves from 30s to
60s to keep "half-way" semantics. TestAcquireLeaseSucceedsAfterURLArrives
bounds its wait context to 2s so a regression in the URL-arrival
notification path fails fast rather than hanging for 120s (per
CodeRabbit).
## Ephemeral routing via the strategy interface
The first cut added a pool.Ephemeral schema field plus scattered "if
Ephemeral" guards in the activator and launcher. That mirrors
version-level state onto a sub-entity and creates a synchronization
burden, and we found we'd missed a guard site (the cache-miss path in
requestPoolCapacity). This consolidates the dispatch into a single
decision at strategy construction.
NewStrategyForVersion routes the web service of an ephemeral version
to EphemeralStrategy{}; supporting services (db, worker, etc.) on the
same ephemeral version honor their configured ServiceConcurrency like
a normal deploy. EphemeralStrategy mirrors FixedStrategy's capacity
model (HasCapacity always true, no per-sandbox tracking, because
there's no scaling alternative for an overloaded preview sandbox so
traffic just routes at the one instance), with two distinguishing
differences: MinInstances=0 so the sandboxpool manager scales the
pool to zero when idle, and ScaleDownDelay defaults to 15 minutes so
browsing a preview tab doesn't churn through cold starts. The
configured ScaleDownDelay on the service still wins if set.
The strategy interface gains MinInstances()/MaxInstances(), the
bounds the strategy enforces on pool size: Auto returns [0,
MaxPoolSize], Fixed [N, N], Ephemeral [0, 1]. The activator's three
ceiling checks in requestPoolCapacity consult strategy.MaxInstances()
instead of the MaxPoolSize const, so the cap is type-driven per pool.
When a pool is at cap, requestPoolCapacity returns the cached pool
without erroring; the caller waits for a sandbox via waitForSandbox's
polling loop. MaxPoolSize moves from activator to pkg/concurrency.
The launcher consults strategy.MinInstances() for both initial pool
seeding and update-on-reuse, so it no longer needs to know about
ephemeral routing for the scaling decision. The only ephemeral check
that remains at the launcher is in findMatchingPool: ephemeral
versions always get an isolated pool, since sharing a pool with a
non-ephemeral version would mean the pool's scaling behavior depends
on which version is making the request.
01c2233 to
946ecb2
Compare
Summary
On
garden, the ephemeral preview deploy formirendev-previewended up with a sandbox pool ofdesired_instances: 8and requests timing out in the browser. Two distinct bugs were involved:DesiredInstances, ratcheting the pool up under traffic instead of queueing on the single preview sandbox.Ephemeral pool: cap at 1 instance (commit 28644c7)
SandboxPool.ephemeralbool. Set byLauncher.ensurePoolForServicewhenever the backingAppVersion.EphemeralLabel != "", and for those versionsdesiredInstancesis forced to1regardless of the configured concurrency mode.findMatchingPoolnow filters by ephemeral status so a non-ephemeral version cannot inherit an ephemeral pool (or vice versa) and end up with the wrong scaling policy.Activator.AcquireLeaseshort-circuits toincrementPool=falsefor ephemeral pools, andrequestPoolCapacitydefensively refuses to patchDesiredInstanceswhenEphemeral && DesiredInstances >= 1.DesiredInstances > 1is clamped back to 1.Tests cover both the pool-creation path and the activator's refusal to scale.
Activator request routing: fix lease URL races and bump wait budget (commit 4823db8)
waitForSandbox.checkForSandboxnow requiress.url != ""on both the inner scan and the locked double-check, matching the predicate already present inAcquireLease. Previously, a sandbox tracked asRUNNINGbefore its URL had been populated could be returned as a Lease withURL: "", causinghttpingressto fail proxying withunsupported protocol scheme "".watchSandboxescapturesoldURLbefore updating, and now also fires the notification fan-out when a RUNNING sandbox's URL transitions from empty to non-empty. Without this, waiters that woke on the status flip but sawurl=""stayed asleep until the 30s fallback ticker.PortWaitTimeoutis 15s and a first-time image pull on a cold node can take 30-90s, so 50s was returningErrPoolTimeoutwhile sandboxes were still booting. Stays comfortably under httpingress'leaseAcquisitionTimeout(2m). The fallback ticker moves from 30s to 60s to keep the same "half-way" semantics.Test plan
make lintcleanmake test(iso): 4422 passes, 0 failsTestCreatePoolForVersionEphemeralTestNonEphemeralPoolUnaffectedTestFindMatchingPoolFiltersByEphemeralTestRequestPoolCapacityRefusesEphemeralPoolTestRequestPoolCapacityNonEphemeralStillScalesTestWaitForSandboxSkipsEmptyURLTestAcquireLeaseSucceedsAfterURLArrivesdesired_instancesstays at 1 and requests succeed.garden: existing pre-fix ephemeral pools keepephemeral=falseuntil they're recreated. Either let TTL expire them or backfillephemeral=trueon pools whose backingAppVersion.EphemeralLabel != "".Notes
hasPendingrace betweenrequestPoolCapacitypatching the store andwatchSandboxesreflecting the new PENDING sandbox is the underlying reason normal auto-mode pools can also overshoot. The ephemeral guard sidesteps it for previews; addressing it for normal pools is a separate follow-up.