Eliminate 502s during deployment rollovers #637
Conversation
During a deployment rollover, the httpingress can hold a cached lease pointing at a sandbox IP that's already been killed. The old behavior was to proxy the request, get EHOSTUNREACH, invalidate the lease, and return 502 — the very next request would succeed, but the unlucky first one got a user-visible error. Now proxyToLease takes a writeErrorResponse flag so the caller can suppress the 502 on the first attempt. serveAuthenticatedRequest runs a retry loop (maxRetries=1): if a cached lease fails with a connection error, it invalidates and tries again — the second iteration finds no cache, acquires a fresh lease to the new sandbox, and the request goes through. DoRequest gets the same treatment for internal (non-HTTP-proxy) requests. forwardToLease is removed since its logic folded naturally into the retry loop.
The other half of the 502-during-deploy problem: reconcileAppVersion was calling cleanupOldVersionPools (which sets DesiredInstances=0) right after ensurePoolForService, without waiting for the new sandbox to actually boot. This created a window where the old sandbox was dead and the new one wasn't serving yet. ensurePoolForService now returns the pool ID when it creates a new pool (empty ID when reusing an existing one). reconcileAppVersion collects those IDs and calls waitForPoolReady on each — polling the entity store every 2s until ReadyInstances > 0 or a configurable timeout (default 60s) expires. On timeout it logs a warning and proceeds; the httpingress retry from the previous commit covers any remaining gap. PoolReadyTimeout is exposed on the Launcher struct so tests can use a short value (100ms) instead of waiting a full minute.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 231-235: The current handler treats any error from
l.waitForPoolReady as a harmless timeout; change it to only proceed with cleanup
when the error is a genuine timeout (errors.Is(err, context.DeadlineExceeded) or
your package's Timeout error), otherwise surface/return/log the error and stop
cleanup; update the block around the call to l.waitForPoolReady (referencing
l.waitForPoolReady, poolID, l.PoolReadyTimeout) to: call waitForPoolReady, if
err == nil continue, else if errors.Is(err, context.DeadlineExceeded) log the
non-blocking timeout warning and proceed, else return or log an error and abort
cleanup so failures like EAC.Get or context.Canceled are not treated as benign;
apply the same change to the other identical handling at the second location
(lines referenced as 862-864).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- controllers/deployment/launcher.go
- controllers/deployment/launcher_test.go
- servers/httpingress/httpingress.go
- servers/httpingress/httpingress_test.go
- servers/httpingress/lease_test.go
evanphx left a comment
I think I hit this same thing this weekend. Requests right after a deploy hung and I had to wait a bit for one to go through.
Remove redundant comments that restate what the code does, drop the "Fix 1" PR-narrative reference from waitForPoolReady, and collapse identical if/else branches in DoRequest's cached lease error path.
- Distinguish context cancellation from timeout in waitForPoolReady caller — cancelled contexts now stop reconciliation instead of being swallowed
- Invalidate all cached leases for an app on stale connection error, ensuring retry acquires a fresh lease even with multiple cached leases
Force-pushed ac5e7b0 to e4c6d65
🧹 Nitpick comments (2)
servers/httpingress/httpingress.go (2)
961-966: Consider using `invalidateAppLeases` for consistency with the HTTP path.

In `serveAuthenticatedRequest` (line 565), a stale cached lease triggers `invalidateAppLeases` to clear all cached leases for the app. Here, only the single lease is invalidated via `invalidateLease`. During rollover, multiple cached leases may point to the same dead sandbox. While this path immediately falls through to acquire a fresh lease, subsequent `DoRequest` calls could still pick up other stale cached leases. For consistency and to fully clear stale state on rollover detection:
♻️ Suggested change

```diff
 if err != nil && isProxyConnectionError(err) {
 	// Stale cached lease — invalidate and fall through to acquire fresh
-	h.invalidateLease(context.Background(), appId, curLease)
+	h.invalidateAppLeases(context.Background(), appId)
 	h.Log.Warn("stale lease on internal request, retrying with fresh lease",
 		"stale_url", curLease.Lease.URL, "app", appId)
 	curLease = nil
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@servers/httpingress/httpingress.go` around lines 961 - 966, Replace the single-lease invalidation with the app-wide invalidation to match the logic in serveAuthenticatedRequest: when isProxyConnectionError(err) is detected in the rollover path, call invalidateAppLeases(ctx, appId) instead of invalidateLease(ctx, appId, curLease) so all cached leases for the app are cleared (keep the same warning log and set curLease = nil afterwards); ensure you use the same context semantics as surrounding code and reference invalidateAppLeases, invalidateLease, serveAuthenticatedRequest, and DoRequest when making the change.
1113-1115: Minor: Internal requests use `http.Client.Timeout` instead of the configured transport.

`executeInternalRequest` creates a new `http.Client` with `h.config.RequestTimeout` rather than using `h.transport` (which has `ResponseHeaderTimeout` and `idleTimeoutConn` wrapping). This is simpler but loses the per-read idle timeout that protects against slow-loris patterns. For internal requests between trusted components, this is likely acceptable. If internal requests later need streaming support or finer-grained timeout control, consider switching to `h.transport`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@servers/httpingress/httpingress.go` around lines 1113 - 1115, The internal request client in executeInternalRequest currently builds a new http.Client with only Timeout (client := &http.Client{ Timeout: h.config.RequestTimeout }) which bypasses h.transport and its ResponseHeaderTimeout/idleTimeoutConn; change executeInternalRequest to use the existing h.transport by creating the client as &http.Client{ Transport: h.transport, Timeout: h.config.RequestTimeout } (or otherwise reuse h.transport directly) so internal requests retain the transport-level read/idle timeouts and any connection wrapping used elsewhere.
📒 Files selected for processing (5)
- controllers/deployment/launcher.go
- controllers/deployment/launcher_test.go
- servers/httpingress/httpingress.go
- servers/httpingress/httpingress_test.go
- servers/httpingress/lease_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- controllers/deployment/launcher_test.go
Consistent with the authenticated request path: when a stale lease is detected on an internal request, invalidate all cached leases for the app rather than just the one that failed.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@servers/httpingress/httpingress.go`:
- Around line 541-570: The retry-on-proxy-connection-error path around the lease
loop (useLease, proxyToLease, invalidateAppLeases) currently retries for all
HTTP methods; change it to only retry when the incoming request is idempotent
(GET, HEAD, OPTIONS, TRACE) or when the request body is replayable (req.GetBody
!= nil) to avoid duplicate non-idempotent writes; before calling continue on
isProxyConnectionError(err) check req.Method and whether req.GetBody is non-nil
(or another body-replay capability) and only invalidateAppLeases/continue when
replay-safe, otherwise treat the proxy error as fatal and return the error
response.
📒 Files selected for processing (1)
- servers/httpingress/httpingress.go
Only proceed with cleanup on timeout (DeadlineExceeded). Other errors like entity store failures now propagate to the caller instead of being silently treated as timeouts.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/deployment/launcher.go`:
- Around line 848-853: The loop in waitForPoolReady uses the original ctx for
l.EAC.Get so a blocking Get can ignore the manual deadline; wrap each call (or
the whole loop) with a cancellable context that enforces the timeout—e.g.,
compute remaining time against the earlier deadline and call
context.WithDeadline or context.WithTimeout to create childCtx, pass childCtx to
l.EAC.Get (instead of ctx), defer cancel the child context, and translate
context.DeadlineExceeded / canceled into a PoolReadyTimeout error; reference
l.EAC.Get, ctx, deadline, timeout and waitForPoolReady when making the change.
- Around line 214-220: The loop that calls ensurePoolForService currently logs
errors and continues but later runs the old-pool cleanup which can scale down
pools for services that failed reconciliation; modify the logic in the
reconciliation flow (the loop calling ensurePoolForService in launcher.go) so
that when ensurePoolForService returns an error you do NOT schedule or perform
cleanup for that specific service—either collect failed service IDs/names in a
failedServices set and skip their old-pool cleanup, or abort the cleanup phase
entirely by returning the error from the reconciliation function; ensure you
reference the ensurePoolForService call site and the old-pool cleanup block to
add the conditional check and avoid scaling down pools for failed services.
📒 Files selected for processing (2)
- controllers/deployment/launcher.go
- controllers/deployment/launcher_test.go
Remove ECONNRESET/ECONNABORTED from isProxyConnectionError since those can fire after a request was partially processed, making retries unsafe for non-idempotent methods. Keep ECONNREFUSED/EHOSTUNREACH/ENETUNREACH which indicate the TCP connection was never established. Also fix TestRapidRedeployWithDisk timeout — the integration test was using the default 60s PoolReadyTimeout, hitting the 600s package timeout with 10 redeployments.
Force-pushed e8778de to 0b70d3c
I deployed a service and hit a 502 during the rollover. The httpingress was holding a cached lease to the old sandbox's IP, which had already been killed. The new sandbox was running fine, but the ingress didn't know about it yet — the very next request after the 502 would've succeeded, but the unlucky one got an error page.
Two things conspired to cause this: the httpingress would proxy to a cached lease, get a connection error, invalidate the lease, and immediately return 502 without retrying. And over in the launcher, `reconcileAppVersion` was scaling down the old pool the instant it created the new one, without waiting for the new sandbox to actually boot.

The fix is two complementary changes. First, the httpingress now retries once when a cached lease fails with a connection error — it invalidates the stale lease and loops back, which acquires a fresh lease to the new sandbox. The `proxyToLease` method gains a `writeErrorResponse` flag so the first attempt can suppress the 502 and give the retry a chance. `DoRequest` (the internal/non-proxy path) gets the same treatment.

Second, the launcher now waits for new pools to have at least one ready instance before scaling down old version pools. `ensurePoolForService` returns the new pool ID so `reconcileAppVersion` can poll `ReadyInstances` before calling `cleanupOldVersionPools`. If it times out (60s default), it logs a warning and proceeds — the httpingress retry covers any remaining gap.

Together these mean you should never see a 502 during a normal rollover: the launcher won't kill the old sandbox until the new one is ready, and even if something slips through the cracks, the ingress retries transparently.