fix: KEEP-569 Detect upstream WebSocket silence on proxied subscriptions (#35)

suisuss wants to merge 2 commits
Conversation
Long-lived WebSocket subscriptions (eth_subscribe newHeads) could stop delivering messages while the proxied connection stayed open, leaving clients with a valid socket and no signal to reconnect. The previous proxy used a transparent byte-pipe with no read deadline and let gorilla auto-pong inbound pings, so a half-open upstream TCP or a wedged provider subscription dispatcher never surfaced as an error on either leg.

This change:

- Forwards WS ping/pong frames end-to-end in both directions instead of auto-ponging at the proxy. A client ping reaches the upstream RPC node and the upstream's pong reaches the client, with payload bytes preserved for any client-side correlation.
- Sets a 90s read deadline on the upstream conn that resets on every received frame (data, ping, pong). If upstream stops sending anything for that long, ReadMessage returns a timeout, both legs are torn down, and the client receives a close so its existing reconnect logic can run.
- Marks the endpoint unhealthy on idle-timeout teardown, since the read deadline only fires on the backend conn, so the timeout is unambiguously an upstream failure.

Tests added for the upstream-silence teardown path and for end-to-end ping/pong forwarding (asserts payload round-trips, proving the proxy is not auto-replying locally).
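The control-frame forwarding described above can be sketched with a stub in place of *websocket.Conn. The `leg` type and its method shapes below are illustrative, mirroring gorilla/websocket's SetPingHandler / WriteControl pattern, not code from the PR:

```go
package main

import "fmt"

// Illustrative stand-in for one websocket leg. The method shapes mirror
// gorilla/websocket's SetPingHandler / WriteControl pattern; the leg type
// itself is hypothetical, not code from the PR.
type leg struct {
	name        string
	pingHandler func(appData string) error
}

func (l *leg) SetPingHandler(h func(appData string) error) { l.pingHandler = h }

func (l *leg) WriteControl(payload string) error {
	fmt.Printf("%s <- ping %q\n", l.name, payload)
	return nil
}

// forwardPings wires src's inbound pings to dst instead of letting the proxy
// auto-pong locally, preserving the payload for client-side correlation.
func forwardPings(src, dst *leg) {
	src.SetPingHandler(func(appData string) error {
		return dst.WriteControl(appData)
	})
}

func main() {
	client := &leg{name: "client"}
	upstream := &leg{name: "upstream"}
	forwardPings(client, upstream)
	client.pingHandler("corr-42") // a ping arriving on the client leg
	// prints: upstream <- ping "corr-42"
}
```

The same wiring in the opposite direction (SetPongHandler on the upstream leg writing to the client) completes the round trip.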
Walkthrough: Adds WebSocket liveness detection and control-frame forwarding to the proxy path. Introduces configurable timeouts.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/server/server_test.go`:
- Around lines 644-650: websocket.DefaultDialer.Dial returns (*websocket.Conn, *http.Response, error), and the response body must be closed to avoid leaking resources. Update the Dial call site where proxyWSURL is used to capture the second return value (resp), and if resp != nil ensure resp.Body.Close() is called (e.g. defer resp.Body.Close() after a successful dial, or close it immediately on dial error).
- Around lines 548-552: websocket.DefaultDialer.Dial currently discards the *http.Response. Change the assignment to capture it (e.g. client, resp, err := websocket.DefaultDialer.Dial(...)), and if resp != nil defer resp.Body.Close() so the response body is always closed (do this before returning or calling t.Fatalf on err) to prevent leaking resources.
In `@internal/server/server.go`:
- Around lines 1249-1252: The code bumps src's read deadline (via src.SetReadDeadline with wsIdleTimeout) and then performs an unbounded dst.WriteMessage, which can block and cause the next Read to time out immediately. Fix this in defaultProxyWebSocket by setting a separate, shorter write deadline on dst before calling WriteMessage (dst.SetWriteDeadline(time.Now().Add(wsWriteTimeout)), where wsWriteTimeout < wsIdleTimeout) and clearing/resetting it afterwards. Additionally, when handling errors from WriteMessage, treat write-timeout errors (net.Error with Timeout()) as downstream backpressure and avoid marking the upstream as unhealthy; only non-timeout write failures should count as endpoint failures.
📒 Files selected for processing (2): internal/server/server.go, internal/server/server_test.go
Addresses CodeRabbit feedback on PR project-aethermesh#35.

Without a write deadline, dst.WriteMessage on the backend leg could block arbitrarily long when the client was slow draining its socket. Bumping the src read deadline before the write meant the next ReadMessage would time out instantly after a slow write completed, marking a healthy upstream unhealthy.

Three changes in proxyWebSocketCopy / defaultProxyWebSocket:

- Cap each forwarded write with wsWriteTimeout (30s) via SetWriteDeadline, so a slow peer can't stall the proxy.
- Move the src read-deadline bump to AFTER the write succeeds. The semantic becomes "no upstream activity since the last successful end-to-end forward," which removes the false positive when downstream is slow.
- Tag backend-leg read-deadline timeouts with a wsBackendIdleError sentinel. Only this error marks the upstream endpoint unhealthy. Other timeouts (write deadlines, control-frame forwarding failures, client-leg read errors) fall through to the existing isExpectedWSClose path and don't cause endpoint health changes.

Also closed *http.Response bodies returned from websocket.DefaultDialer.Dial in the two new tests (golangci-lint bodyclose).
@sanbotto please review
Problem
Long-lived WebSocket subscriptions through aetherlay (e.g. eth_subscribe for newHeads) could stop delivering messages while the proxied connection stayed open. Clients ended up with a valid socket and no signal to reconnect; block notifications silently stopped flowing.

Root cause: the proxy was a transparent byte-pipe with no liveness detection.

- defaultProxyWebSocket had no read deadline on either conn, so ReadMessage() would block forever on a half-open upstream TCP.
- gorilla auto-pongs inbound pings inside ReadMessage by default. A client ping sent to aetherlay was answered locally by aetherlay rather than reaching the upstream, so the client's ping/pong loop only proved the client↔aetherlay leg was alive; it said nothing about aetherlay↔upstream.

Approach
Two changes, both in internal/server/server.go:

- Control-frame forwarding: SetPingHandler/SetPongHandler on both legs, so a client ping reaches the upstream RPC node and the upstream's pong reaches the client, payload bytes preserved. Aetherlay generates no pings of its own; it relies on whatever ping cadence the client (or upstream) chooses.
- Upstream idle timeout: SetReadDeadline(now + 90s) on the backend conn, reset on every received frame (data, ping, pong). If upstream sends nothing for 90s, ReadMessage returns a timeout, both legs are torn down, and the client receives a close. This catches the case where the client doesn't ping or where forwarded pings are absorbed somewhere upstream.

The teardown path explicitly marks the endpoint unhealthy on a net.Error.Timeout() from the backend leg (timeouts can only originate from the backend conn since the client conn has no deadline, so attribution is unambiguous).

Verification
Tested ping/pong against every reachable upstream WSS in production (Alchemy, dRPC, QuickNode, Tempo, Solana). All replied within 100-355ms with the original payload echoed, confirming forwarding works end-to-end (or at least far past the CDN edge).
Two new tests in internal/server/server_test.go:

- TestProxyWebSocket_UpstreamIdleTimeout: silent upstream → client torn down within wsIdleTimeout (overridden to 300ms for the test) and endpoint marked unhealthy.
- TestProxyWebSocket_PingPongForwarded: asserts a client ping reaches the upstream with the original payload and the upstream's pong reaches the client unchanged.

Full test suite + go vet clean.

Tunables
wsIdleTimeout = 90 * time.Second and wsControlWriteTimeout = 10 * time.Second are package-level vars so tests can override them. Hard-coded for now; can promote to env vars if needed.

Test plan
- eth_subscribe survives a forced upstream stall (block source paused at the provider) and triggers client reconnect within ~90s