
Eliminate 502s during deployment rollovers#637

Merged
phinze merged 7 commits into main from
phinze/zero-downtime-rollover
Mar 2, 2026

Conversation


@phinze phinze commented Mar 1, 2026

I deployed a service and hit a 502 during the rollover. The httpingress was holding a cached lease to the old sandbox's IP, which had already been killed. The new sandbox was running fine, but the ingress didn't know about it yet — the very next request after the 502 would've succeeded, but the unlucky one got an error page.

Two things conspired to cause this: the httpingress would proxy to a cached lease, get a connection error, invalidate the lease, and immediately return 502 without retrying. And over in the launcher, reconcileAppVersion was scaling down the old pool the instant it created the new one, without waiting for the new sandbox to actually boot.

The fix is two complementary changes. First, the httpingress now retries once when a cached lease fails with a connection error — it invalidates the stale lease and loops back, which acquires a fresh lease to the new sandbox. The proxyToLease method gains a writeErrorResponse flag so the first attempt can suppress the 502 and give the retry a chance. DoRequest (the internal/non-proxy path) gets the same treatment.

Second, the launcher now waits for new pools to have at least one ready instance before scaling down old version pools. ensurePoolForService returns the new pool ID so reconcileAppVersion can poll ReadyInstances before calling cleanupOldVersionPools. If it times out (60s default), it logs a warning and proceeds — the httpingress retry covers any remaining gap.

Together these mean you should never see a 502 during a normal rollover: the launcher won't kill the old sandbox until the new one is ready, and even if something slips through the cracks, the ingress retries transparently.

phinze added 2 commits March 1, 2026 12:43
During a deployment rollover, the httpingress can hold a cached lease
pointing at a sandbox IP that's already been killed. The old behavior
was to proxy the request, get EHOSTUNREACH, invalidate the lease, and
return 502 — the very next request would succeed, but the unlucky
first one got a user-visible error.

Now proxyToLease takes a writeErrorResponse flag so the caller can
suppress the 502 on the first attempt. serveAuthenticatedRequest
runs a retry loop (maxRetries=1): if a cached lease fails with a
connection error, it invalidates and tries again — the second
iteration finds no cache, acquires a fresh lease to the new sandbox,
and the request goes through. DoRequest gets the same treatment for
internal (non-HTTP-proxy) requests.

forwardToLease is removed since its logic folded naturally into the
retry loop.

The other half of the 502-during-deploy problem: reconcileAppVersion
was calling cleanupOldVersionPools (which sets DesiredInstances=0)
right after ensurePoolForService, without waiting for the new sandbox
to actually boot. This created a window where the old sandbox was
dead and the new one wasn't serving yet.

ensurePoolForService now returns the pool ID when it creates a new
pool (empty ID when reusing an existing one). reconcileAppVersion
collects those IDs and calls waitForPoolReady on each — polling the
entity store every 2s until ReadyInstances > 0 or a configurable
timeout (default 60s) expires. On timeout it logs a warning and
proceeds; the httpingress retry from the previous commit covers any
remaining gap.

PoolReadyTimeout is exposed on the Launcher struct so tests can use
a short value (100ms) instead of waiting a full minute.
@phinze phinze requested a review from a team as a code owner March 1, 2026 18:43

coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

  • Launcher: added a PoolReadyTimeout field, changed ensurePoolForService to return a created pool ID, added waitForPoolReady, updated reconciliation to collect new pool IDs and poll them before cleanup, and added newTestLauncher and tests for pool-wait behavior.
  • HTTP ingress: added invalidateAppLeases and retry-on-stale-lease logic, changed proxyToLease to accept writeErrorResponse, added executeInternalRequest and buildInternalResponse, and added tests for retry, timeout, and lease-invalidation behaviors.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 231-235: The current handler treats any error from
l.waitForPoolReady as a harmless timeout; change it to only proceed with cleanup
when the error is a genuine timeout (errors.Is(err, context.DeadlineExceeded) or
your package's Timeout error), otherwise surface/return/log the error and stop
cleanup; update the block around the call to l.waitForPoolReady (referencing
l.waitForPoolReady, poolID, l.PoolReadyTimeout) to: call waitForPoolReady, if
err == nil continue, else if errors.Is(err, context.DeadlineExceeded) log the
non-blocking timeout warning and proceed, else return or log an error and abort
cleanup so failures like EAC.Get or context.Canceled are not treated as benign;
apply the same change to the other identical handling at the second location
(lines referenced as 862-864).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d65de7f and ac5e7b0.

📒 Files selected for processing (5)
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go
  • servers/httpingress/httpingress.go
  • servers/httpingress/httpingress_test.go
  • servers/httpingress/lease_test.go


@evanphx evanphx left a comment


I think I hit this same thing this weekend. Requests right after a deploy hung, and I had to wait a bit for one to go through.

phinze added 2 commits March 1, 2026 14:53
Remove redundant comments that restate what the code does, drop the
"Fix 1" PR-narrative reference from waitForPoolReady, and collapse
identical if/else branches in DoRequest's cached lease error path.

- Distinguish context cancellation from timeout in waitForPoolReady
  caller — cancelled contexts now stop reconciliation instead of being
  swallowed
- Invalidate all cached leases for an app on stale connection error,
  ensuring retry acquires a fresh lease even with multiple cached leases
@phinze phinze force-pushed the phinze/zero-downtime-rollover branch from ac5e7b0 to e4c6d65 Compare March 1, 2026 20:59

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
servers/httpingress/httpingress.go (2)

961-966: Consider using invalidateAppLeases for consistency with HTTP path.

In serveAuthenticatedRequest (line 565), a stale cached lease triggers invalidateAppLeases to clear all cached leases for the app. Here, only the single lease is invalidated via invalidateLease.

During rollover, multiple cached leases may point to the same dead sandbox. While this path immediately falls through to acquire a fresh lease, subsequent DoRequest calls could still pick up other stale cached leases.

For consistency and to fully clear stale state on rollover detection:

♻️ Suggested change
		if err != nil && isProxyConnectionError(err) {
			// Stale cached lease — invalidate and fall through to acquire fresh
-			h.invalidateLease(context.Background(), appId, curLease)
+			h.invalidateAppLeases(context.Background(), appId)
			h.Log.Warn("stale lease on internal request, retrying with fresh lease",
				"stale_url", curLease.Lease.URL, "app", appId)
			curLease = nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/httpingress.go` around lines 961 - 966, Replace the
single-lease invalidation with the app-wide invalidation to match the logic in
serveAuthenticatedRequest: when isProxyConnectionError(err) is detected in the
rollover path, call invalidateAppLeases(ctx, appId) instead of
invalidateLease(ctx, appId, curLease) so all cached leases for the app are
cleared (keep the same warning log and set curLease = nil afterwards); ensure
you use the same context semantics as surrounding code and reference
invalidateAppLeases, invalidateLease, serveAuthenticatedRequest, and DoRequest
when making the change.

1113-1115: Minor: Internal requests use http.Client.Timeout instead of configured transport.

The executeInternalRequest creates a new http.Client with h.config.RequestTimeout rather than using h.transport (which has ResponseHeaderTimeout and idleTimeoutConn wrapping). This is simpler but loses the per-read idle timeout that protects against slow-loris patterns.

For internal requests between trusted components, this is likely acceptable. If internal requests later need streaming support or finer-grained timeout control, consider switching to h.transport.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/httpingress.go` around lines 1113 - 1115, The internal
request client in executeInternalRequest currently builds a new http.Client with
only Timeout (client := &http.Client{ Timeout: h.config.RequestTimeout }) which
bypasses h.transport and its ResponseHeaderTimeout/idleTimeoutConn; change
executeInternalRequest to use the existing h.transport by creating the client as
&http.Client{ Transport: h.transport, Timeout: h.config.RequestTimeout } (or
otherwise reuse h.transport directly) so internal requests retain the
transport-level read/idle timeouts and any connection wrapping used elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@servers/httpingress/httpingress.go`:
- Around line 961-966: Replace the single-lease invalidation with the app-wide
invalidation to match the logic in serveAuthenticatedRequest: when
isProxyConnectionError(err) is detected in the rollover path, call
invalidateAppLeases(ctx, appId) instead of invalidateLease(ctx, appId, curLease)
so all cached leases for the app are cleared (keep the same warning log and set
curLease = nil afterwards); ensure you use the same context semantics as
surrounding code and reference invalidateAppLeases, invalidateLease,
serveAuthenticatedRequest, and DoRequest when making the change.
- Around line 1113-1115: The internal request client in executeInternalRequest
currently builds a new http.Client with only Timeout (client := &http.Client{
Timeout: h.config.RequestTimeout }) which bypasses h.transport and its
ResponseHeaderTimeout/idleTimeoutConn; change executeInternalRequest to use the
existing h.transport by creating the client as &http.Client{ Transport:
h.transport, Timeout: h.config.RequestTimeout } (or otherwise reuse h.transport
directly) so internal requests retain the transport-level read/idle timeouts and
any connection wrapping used elsewhere.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ac5e7b0 and e4c6d65.

📒 Files selected for processing (5)
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go
  • servers/httpingress/httpingress.go
  • servers/httpingress/httpingress_test.go
  • servers/httpingress/lease_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/deployment/launcher_test.go

Consistent with the authenticated request path: when a stale lease is
detected on an internal request, invalidate all cached leases for the
app rather than just the one that failed.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@servers/httpingress/httpingress.go`:
- Around line 541-570: The retry-on-proxy-connection-error path around the lease
loop (useLease, proxyToLease, invalidateAppLeases) currently retries for all
HTTP methods; change it to only retry when the incoming request is idempotent
(GET, HEAD, OPTIONS, TRACE) or when the request body is replayable (req.GetBody
!= nil) to avoid duplicate non-idempotent writes; before calling continue on
isProxyConnectionError(err) check req.Method and whether req.GetBody is non-nil
(or another body-replay capability) and only invalidateAppLeases/continue when
replay-safe, otherwise treat the proxy error as fatal and return the error
response.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e4c6d65 and 6e6f60b.

📒 Files selected for processing (1)
  • servers/httpingress/httpingress.go

Only proceed with cleanup on timeout (DeadlineExceeded). Other errors
like entity store failures now propagate to the caller instead of being
silently treated as timeouts.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 848-853: The loop in waitForPoolReady uses the original ctx for
l.EAC.Get so a blocking Get can ignore the manual deadline; wrap each call (or
the whole loop) with a cancellable context that enforces the timeout—e.g.,
compute remaining time against the earlier deadline and call
context.WithDeadline or context.WithTimeout to create childCtx, pass childCtx to
l.EAC.Get (instead of ctx), defer cancel the child context, and translate
context.DeadlineExceeded / canceled into a PoolReadyTimeout error; reference
l.EAC.Get, ctx, deadline, timeout and waitForPoolReady when making the change.
- Around line 214-220: The loop that calls ensurePoolForService currently logs
errors and continues but later runs the old-pool cleanup which can scale down
pools for services that failed reconciliation; modify the logic in the
reconciliation flow (the loop calling ensurePoolForService in launcher.go) so
that when ensurePoolForService returns an error you do NOT schedule or perform
cleanup for that specific service—either collect failed service IDs/names in a
failedServices set and skip their old-pool cleanup, or abort the cleanup phase
entirely by returning the error from the reconciliation function; ensure you
reference the ensurePoolForService call site and the old-pool cleanup block to
add the conditional check and avoid scaling down pools for failed services.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e6f60b and 8a94a3a.

📒 Files selected for processing (2)
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go

Remove ECONNRESET/ECONNABORTED from isProxyConnectionError since those
can fire after a request was partially processed, making retries unsafe
for non-idempotent methods. Keep ECONNREFUSED/EHOSTUNREACH/ENETUNREACH
which indicate the TCP connection was never established.

Also fix TestRapidRedeployWithDisk timeout — the integration test was
using the default 60s PoolReadyTimeout, hitting the 600s package timeout
with 10 redeployments.
@phinze phinze force-pushed the phinze/zero-downtime-rollover branch from e8778de to 0b70d3c Compare March 2, 2026 01:16
@phinze phinze merged commit 97e2fb5 into main Mar 2, 2026
11 checks passed
@phinze phinze deleted the phinze/zero-downtime-rollover branch March 2, 2026 01:27
