
Eliminate 502s during deployment rollovers#637

Merged
phinze merged 7 commits into main from
phinze/zero-downtime-rollover
Mar 2, 2026

Conversation


@phinze phinze commented Mar 1, 2026

I deployed a service and hit a 502 during the rollover. The httpingress was holding a cached lease to the old sandbox's IP, which had already been killed. The new sandbox was running fine, but the ingress didn't know about it yet — the very next request after the 502 would've succeeded, but the unlucky one got an error page.

Two things conspired to cause this: the httpingress would proxy to a cached lease, get a connection error, invalidate the lease, and immediately return 502 without retrying. And over in the launcher, reconcileAppVersion was scaling down the old pool the instant it created the new one, without waiting for the new sandbox to actually boot.

The fix is two complementary changes. First, the httpingress now retries once when a cached lease fails with a connection error — it invalidates the stale lease and loops back, which acquires a fresh lease to the new sandbox. The proxyToLease method gains a writeErrorResponse flag so the first attempt can suppress the 502 and give the retry a chance. DoRequest (the internal/non-proxy path) gets the same treatment.

Second, the launcher now waits for new pools to have at least one ready instance before scaling down old version pools. ensurePoolForService returns the new pool ID so reconcileAppVersion can poll ReadyInstances before calling cleanupOldVersionPools. If it times out (60s default), it logs a warning and proceeds — the httpingress retry covers any remaining gap.

Together these mean you should never see a 502 during a normal rollover: the launcher won't kill the old sandbox until the new one is ready, and even if something slips through the cracks, the ingress retries transparently.

phinze added 2 commits March 1, 2026 12:43
During a deployment rollover, the httpingress can hold a cached lease
pointing at a sandbox IP that's already been killed. The old behavior
was to proxy the request, get EHOSTUNREACH, invalidate the lease, and
return 502 — the very next request would succeed, but the unlucky
first one got a user-visible error.

Now proxyToLease takes a writeErrorResponse flag so the caller can
suppress the 502 on the first attempt. serveAuthenticatedRequest
runs a retry loop (maxRetries=1): if a cached lease fails with a
connection error, it invalidates and tries again — the second
iteration finds no cache, acquires a fresh lease to the new sandbox,
and the request goes through. DoRequest gets the same treatment for
internal (non-HTTP-proxy) requests.

forwardToLease is removed since its logic folded naturally into the
retry loop.

The other half of the 502-during-deploy problem: reconcileAppVersion
was calling cleanupOldVersionPools (which sets DesiredInstances=0)
right after ensurePoolForService, without waiting for the new sandbox
to actually boot. This created a window where the old sandbox was
dead and the new one wasn't serving yet.

ensurePoolForService now returns the pool ID when it creates a new
pool (empty ID when reusing an existing one). reconcileAppVersion
collects those IDs and calls waitForPoolReady on each — polling the
entity store every 2s until ReadyInstances > 0 or a configurable
timeout (default 60s) expires. On timeout it logs a warning and
proceeds; the httpingress retry from the previous commit covers any
remaining gap.

PoolReadyTimeout is exposed on the Launcher struct so tests can use
a short value (100ms) instead of waiting a full minute.
@phinze phinze requested a review from a team as a code owner March 1, 2026 18:43

coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

  • Launcher: added a PoolReadyTimeout field, changed ensurePoolForService to return a created pool ID, added waitForPoolReady, updated reconciliation to collect new pool IDs and poll them before cleanup, and added newTestLauncher and tests for pool-wait behavior.
  • HTTP ingress: added invalidateAppLeases and retry-on-stale-lease logic, changed proxyToLease to accept writeErrorResponse, added executeInternalRequest and buildInternalResponse, and added tests for retry, timeout, and lease-invalidation behaviors.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 231-235: The current handler treats any error from
l.waitForPoolReady as a harmless timeout; change it to only proceed with cleanup
when the error is a genuine timeout (errors.Is(err, context.DeadlineExceeded) or
your package's Timeout error), otherwise surface/return/log the error and stop
cleanup; update the block around the call to l.waitForPoolReady (referencing
l.waitForPoolReady, poolID, l.PoolReadyTimeout) to: call waitForPoolReady, if
err == nil continue, else if errors.Is(err, context.DeadlineExceeded) log the
non-blocking timeout warning and proceed, else return or log an error and abort
cleanup so failures like EAC.Get or context.Canceled are not treated as benign;
apply the same change to the other identical handling at the second location
(lines referenced as 862-864).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d65de7f and ac5e7b0.

📒 Files selected for processing (5)
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go
  • servers/httpingress/httpingress.go
  • servers/httpingress/httpingress_test.go
  • servers/httpingress/lease_test.go


@evanphx evanphx left a comment


I think I hit this same thing this weekend. Requests right after a deploy hung, and I had to wait a bit for one to go through.

phinze added 2 commits March 1, 2026 14:53
Remove redundant comments that restate what the code does, drop the
"Fix 1" PR-narrative reference from waitForPoolReady, and collapse
identical if/else branches in DoRequest's cached lease error path.

- Distinguish context cancellation from timeout in waitForPoolReady
  caller — cancelled contexts now stop reconciliation instead of being
  swallowed
- Invalidate all cached leases for an app on stale connection error,
  ensuring retry acquires a fresh lease even with multiple cached leases
@phinze phinze force-pushed the phinze/zero-downtime-rollover branch from ac5e7b0 to e4c6d65 Compare March 1, 2026 20:59

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
servers/httpingress/httpingress.go (2)

961-966: Consider using invalidateAppLeases for consistency with HTTP path.

In serveAuthenticatedRequest (line 565), a stale cached lease triggers invalidateAppLeases to clear all cached leases for the app. Here, only the single lease is invalidated via invalidateLease.

During rollover, multiple cached leases may point to the same dead sandbox. While this path immediately falls through to acquire a fresh lease, subsequent DoRequest calls could still pick up other stale cached leases.

For consistency and to fully clear stale state on rollover detection:

♻️ Suggested change
		if err != nil && isProxyConnectionError(err) {
			// Stale cached lease — invalidate and fall through to acquire fresh
-			h.invalidateLease(context.Background(), appId, curLease)
+			h.invalidateAppLeases(context.Background(), appId)
			h.Log.Warn("stale lease on internal request, retrying with fresh lease",
				"stale_url", curLease.Lease.URL, "app", appId)
			curLease = nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/httpingress.go` around lines 961 - 966, Replace the
single-lease invalidation with the app-wide invalidation to match the logic in
serveAuthenticatedRequest: when isProxyConnectionError(err) is detected in the
rollover path, call invalidateAppLeases(ctx, appId) instead of
invalidateLease(ctx, appId, curLease) so all cached leases for the app are
cleared (keep the same warning log and set curLease = nil afterwards); ensure
you use the same context semantics as surrounding code and reference
invalidateAppLeases, invalidateLease, serveAuthenticatedRequest, and DoRequest
when making the change.

1113-1115: Minor: Internal requests use http.Client.Timeout instead of configured transport.

The executeInternalRequest creates a new http.Client with h.config.RequestTimeout rather than using h.transport (which has ResponseHeaderTimeout and idleTimeoutConn wrapping). This is simpler but loses the per-read idle timeout that protects against slow-loris patterns.

For internal requests between trusted components, this is likely acceptable. If internal requests later need streaming support or finer-grained timeout control, consider switching to h.transport.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@servers/httpingress/httpingress.go` around lines 1113 - 1115, The internal
request client in executeInternalRequest currently builds a new http.Client with
only Timeout (client := &http.Client{ Timeout: h.config.RequestTimeout }) which
bypasses h.transport and its ResponseHeaderTimeout/idleTimeoutConn; change
executeInternalRequest to use the existing h.transport by creating the client as
&http.Client{ Transport: h.transport, Timeout: h.config.RequestTimeout } (or
otherwise reuse h.transport directly) so internal requests retain the
transport-level read/idle timeouts and any connection wrapping used elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@servers/httpingress/httpingress.go`:
- Around line 961-966: Replace the single-lease invalidation with the app-wide
invalidation to match the logic in serveAuthenticatedRequest: when
isProxyConnectionError(err) is detected in the rollover path, call
invalidateAppLeases(ctx, appId) instead of invalidateLease(ctx, appId, curLease)
so all cached leases for the app are cleared (keep the same warning log and set
curLease = nil afterwards); ensure you use the same context semantics as
surrounding code and reference invalidateAppLeases, invalidateLease,
serveAuthenticatedRequest, and DoRequest when making the change.
- Around line 1113-1115: The internal request client in executeInternalRequest
currently builds a new http.Client with only Timeout (client := &http.Client{
Timeout: h.config.RequestTimeout }) which bypasses h.transport and its
ResponseHeaderTimeout/idleTimeoutConn; change executeInternalRequest to use the
existing h.transport by creating the client as &http.Client{ Transport:
h.transport, Timeout: h.config.RequestTimeout } (or otherwise reuse h.transport
directly) so internal requests retain the transport-level read/idle timeouts and
any connection wrapping used elsewhere.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ac5e7b0 and e4c6d65.

📒 Files selected for processing (5)
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go
  • servers/httpingress/httpingress.go
  • servers/httpingress/httpingress_test.go
  • servers/httpingress/lease_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/deployment/launcher_test.go

Consistent with the authenticated request path: when a stale lease is
detected on an internal request, invalidate all cached leases for the
app rather than just the one that failed.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@servers/httpingress/httpingress.go`:
- Around line 541-570: The retry-on-proxy-connection-error path around the lease
loop (useLease, proxyToLease, invalidateAppLeases) currently retries for all
HTTP methods; change it to only retry when the incoming request is idempotent
(GET, HEAD, OPTIONS, TRACE) or when the request body is replayable (req.GetBody
!= nil) to avoid duplicate non-idempotent writes; before calling continue on
isProxyConnectionError(err) check req.Method and whether req.GetBody is non-nil
(or another body-replay capability) and only invalidateAppLeases/continue when
replay-safe, otherwise treat the proxy error as fatal and return the error
response.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e4c6d65 and 6e6f60b.

📒 Files selected for processing (1)
  • servers/httpingress/httpingress.go

Only proceed with cleanup on timeout (DeadlineExceeded). Other errors
like entity store failures now propagate to the caller instead of being
silently treated as timeouts.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 848-853: The loop in waitForPoolReady uses the original ctx for
l.EAC.Get so a blocking Get can ignore the manual deadline; wrap each call (or
the whole loop) with a cancellable context that enforces the timeout—e.g.,
compute remaining time against the earlier deadline and call
context.WithDeadline or context.WithTimeout to create childCtx, pass childCtx to
l.EAC.Get (instead of ctx), defer cancel the child context, and translate
context.DeadlineExceeded / canceled into a PoolReadyTimeout error; reference
l.EAC.Get, ctx, deadline, timeout and waitForPoolReady when making the change.
- Around line 214-220: The loop that calls ensurePoolForService currently logs
errors and continues but later runs the old-pool cleanup which can scale down
pools for services that failed reconciliation; modify the logic in the
reconciliation flow (the loop calling ensurePoolForService in launcher.go) so
that when ensurePoolForService returns an error you do NOT schedule or perform
cleanup for that specific service—either collect failed service IDs/names in a
failedServices set and skip their old-pool cleanup, or abort the cleanup phase
entirely by returning the error from the reconciliation function; ensure you
reference the ensurePoolForService call site and the old-pool cleanup block to
add the conditional check and avoid scaling down pools for failed services.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6e6f60b and 8a94a3a.

📒 Files selected for processing (2)
  • controllers/deployment/launcher.go
  • controllers/deployment/launcher_test.go

Remove ECONNRESET/ECONNABORTED from isProxyConnectionError since those
can fire after a request was partially processed, making retries unsafe
for non-idempotent methods. Keep ECONNREFUSED/EHOSTUNREACH/ENETUNREACH
which indicate the TCP connection was never established.

Also fix TestRapidRedeployWithDisk timeout — the integration test was
using the default 60s PoolReadyTimeout, hitting the 600s package timeout
with 10 redeployments.
@phinze phinze force-pushed the phinze/zero-downtime-rollover branch from e8778de to 0b70d3c Compare March 2, 2026 01:16
@phinze phinze merged commit 97e2fb5 into main Mar 2, 2026
11 checks passed
@phinze phinze deleted the phinze/zero-downtime-rollover branch March 2, 2026 01:27
