fix: KEEP-569 Detect upstream WebSocket silence on proxied subscriptions (#35)

suisuss wants to merge 2 commits
Conversation
Long-lived WebSocket subscriptions (eth_subscribe newHeads) could stop delivering messages while the proxied connection stayed open, leaving clients with a valid socket and no signal to reconnect. The previous proxy used a transparent byte-pipe with no read deadline and let gorilla auto-pong inbound pings, so a half-open upstream TCP or a wedged provider subscription dispatcher never surfaced as an error on either leg.

This change:

- Forwards WS ping/pong frames end-to-end in both directions instead of auto-ponging at the proxy. A client ping reaches the upstream RPC node and the upstream's pong reaches the client, with payload bytes preserved for any client-side correlation.
- Sets a 90s read deadline on the upstream conn that resets on every received frame (data, ping, pong). If upstream stops sending anything for that long, ReadMessage returns a timeout, both legs are torn down, and the client receives a close so its existing reconnect logic can run.
- Marks the endpoint unhealthy on idle-timeout teardown, since the read deadline only fires on the backend conn, so the timeout is unambiguously an upstream failure.

Tests added for the upstream-silence teardown path and for end-to-end ping/pong forwarding (asserts payload round-trips, proving the proxy is not auto-replying locally).
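The control-frame forwarding described above can be sketched with a stub in place of *websocket.Conn. The `leg` type and its method shapes below are illustrative, mirroring gorilla/websocket's SetPingHandler / WriteControl pattern, not code from the PR:

```go
package main

import "fmt"

// Illustrative stand-in for one websocket leg. The method shapes mirror
// gorilla/websocket's SetPingHandler / WriteControl pattern; the leg type
// itself is hypothetical, not code from the PR.
type leg struct {
	name        string
	pingHandler func(appData string) error
}

func (l *leg) SetPingHandler(h func(appData string) error) { l.pingHandler = h }

func (l *leg) WriteControl(payload string) error {
	fmt.Printf("%s <- ping %q\n", l.name, payload)
	return nil
}

// forwardPings wires src's inbound pings to dst instead of letting the proxy
// auto-pong locally, preserving the payload for client-side correlation.
func forwardPings(src, dst *leg) {
	src.SetPingHandler(func(appData string) error {
		return dst.WriteControl(appData)
	})
}

func main() {
	client := &leg{name: "client"}
	upstream := &leg{name: "upstream"}
	forwardPings(client, upstream)
	client.pingHandler("corr-42") // a ping arriving on the client leg
	// prints: upstream <- ping "corr-42"
}
```

The same wiring in the opposite direction (SetPongHandler on the upstream leg writing to the client) completes the round trip.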
Walkthrough: Adds WebSocket liveness detection and control-frame forwarding to the proxy path. Introduces configurable timeouts.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/server/server_test.go`:
- Around lines 644-650: websocket.DefaultDialer.Dial returns (*websocket.Conn, *http.Response, error), and the response body must be closed to avoid leaking resources. Update the Dial call site where proxyWSURL is used to capture the second return value (resp), and if resp != nil ensure resp.Body.Close() is called (e.g. defer resp.Body.Close() after a successful dial, or close it immediately on dial error).
- Around lines 548-552: websocket.DefaultDialer.Dial currently discards the *http.Response. Change the assignment to capture it (e.g. client, resp, err := websocket.DefaultDialer.Dial(...)), and if resp != nil defer resp.Body.Close() so the response body is always closed (do this before returning or calling t.Fatalf on err) to prevent leaking resources.
In `@internal/server/server.go`:
- Around lines 1249-1252: The code bumps src's read deadline (via src.SetReadDeadline with wsIdleTimeout) and then performs an unbounded dst.WriteMessage, which can block and cause the next Read to time out immediately. Fix this in defaultProxyWebSocket by setting a separate, shorter write deadline on dst before calling WriteMessage (dst.SetWriteDeadline(time.Now().Add(wsWriteTimeout)), where wsWriteTimeout < wsIdleTimeout) and clearing/resetting it afterwards. Additionally, when handling errors from WriteMessage, treat write-timeout errors (net.Error with Timeout()) as downstream backpressure and avoid marking the upstream as unhealthy; only non-timeout write failures should count as endpoint failures.
📒 Files selected for processing (2): internal/server/server.go, internal/server/server_test.go
Addresses CodeRabbit feedback on PR project-aethermesh#35.

Without a write deadline, dst.WriteMessage on the backend leg could block arbitrarily long when the client was slow draining its socket. Bumping the src read deadline before the write meant the next ReadMessage would time out instantly after a slow write completed, marking a healthy upstream unhealthy.

Three changes in proxyWebSocketCopy / defaultProxyWebSocket:

- Cap each forwarded write with wsWriteTimeout (30s) via SetWriteDeadline, so a slow peer can't stall the proxy.
- Move the src read-deadline bump to AFTER the write succeeds. The semantic becomes "no upstream activity since the last successful end-to-end forward," which removes the false positive when downstream is slow.
- Tag backend-leg read-deadline timeouts with a wsBackendIdleError sentinel. Only this error marks the upstream endpoint unhealthy. Other timeouts (write deadlines, control-frame forwarding failures, client-leg read errors) fall through to the existing isExpectedWSClose path and don't cause endpoint health changes.

Also closed *http.Response bodies returned from websocket.DefaultDialer.Dial in the two new tests (golangci-lint bodyclose).
@sanbotto please review
Problem
Long-lived WebSocket subscriptions through aetherlay (e.g. eth_subscribe for newHeads) could stop delivering messages while the proxied connection stayed open. Clients ended up with a valid socket and no signal to reconnect; block notifications silently stopped flowing.

Root cause: the proxy was a transparent byte-pipe with no liveness detection.

- defaultProxyWebSocket had no read deadline on either conn, so ReadMessage() would block forever on a half-open upstream TCP.
- gorilla auto-pongs inbound pings inside ReadMessage by default. A client ping sent to aetherlay was answered locally by aetherlay rather than reaching the upstream, so the client's ping/pong loop only proved the client↔aetherlay leg was alive; it said nothing about aetherlay↔upstream.

Approach
Two changes, both in internal/server/server.go:

- Control-frame forwarding: SetPingHandler/SetPongHandler on both legs, so a client ping reaches the upstream RPC node and the upstream's pong reaches the client, payload bytes preserved. Aetherlay generates no pings of its own; it relies on whatever ping cadence the client (or upstream) chooses.
- Upstream idle timeout: SetReadDeadline(now + 90s) on the backend conn, reset on every received frame (data, ping, pong). If upstream sends nothing for 90s, ReadMessage returns a timeout, both legs are torn down, and the client receives a close. This catches the case where the client doesn't ping or where forwarded pings are absorbed somewhere upstream.

The teardown path explicitly marks the endpoint unhealthy on a net.Error.Timeout() from the backend leg (timeouts can only originate from the backend conn since the client conn has no deadline, so attribution is unambiguous).

Verification
Tested ping/pong against every reachable upstream WSS in production (Alchemy, dRPC, QuickNode, Tempo, Solana). All replied within 100-355ms with the original payload echoed, confirming forwarding works end-to-end (or at least far past the CDN edge).
Two new tests in internal/server/server_test.go:

- TestProxyWebSocket_UpstreamIdleTimeout: silent upstream → client torn down within wsIdleTimeout (overridden to 300ms for the test) and endpoint marked unhealthy.
- TestProxyWebSocket_PingPongForwarded: asserts a client ping reaches the upstream with the original payload and the upstream's pong reaches the client unchanged.

Full test suite + go vet clean.

Tunables
wsIdleTimeout = 90 * time.Second and wsControlWriteTimeout = 10 * time.Second are package-level vars so tests can override them. Hard-coded for now; can promote to env vars if needed.

Test plan
- eth_subscribe survives a forced upstream stall (block source paused at the provider) and triggers client reconnect within ~90s