Feature: Per-connection pending outbound observability and configurable back-pressure cap #218

Merged
hoytech merged 3 commits into hoytech:master from archief2910:feature/213-capping-slow-ws-clients
Apr 28, 2026

Conversation

@archief2910
Contributor

@archief2910 archief2910 commented Apr 22, 2026

Description

Observability (#212) — src/apps/relay/RelayWebsocket.cpp

  • Extends Connection::Stats with pendingOutbound: application bytes passed to WebSocket::send that are not yet fully drained (still in uWS’s queue or an in-flight partial write).
  • In doSend, increments pendingOutbound by payload.size() immediately before send(), passes that size through the send completion callback’s user-data pointer (uintptr_t via void *), and decrements in the callback when uWS reports the message finished (success, immediate failure, or after queued drain).
  • The completion callback returns early when ws == nullptr, matching uWS’s onEnd flush path: that can run after onDisconnection has logged and deleted the Connection, so we must not dereference getUserData() there. The disconnect log therefore captures whatever pendingOutbound was at teardown time; post-disconnect cancellation callbacks do not touch freed memory.
  • Extends the existing disconnect log line (UP/DN bytes and compression) with Pending: … using the same renderSize helper.
  • No change to when or what is sent for the observability path alone; the cap below is optional and off by default.
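
The accounting described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `FakeSocket`, `drainOne`, and `cancelAll` are hypothetical stand-ins that only mimic a send-completion callback carrying an opaque `void *`, and they are not the uWebSockets API. The sketch shows the size travelling through the callback's user-data pointer and the early return when `ws == nullptr`.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string_view>

// Hypothetical stand-in for a WebSocket with a send-completion callback.
// NOT the uWebSockets API; it only models "callback fires when the message
// is finished and receives the void* supplied at send time".
struct FakeSocket {
    using Callback = void (*)(FakeSocket *ws, void *cbData);
    struct Queued { Callback cb; void *cbData; };

    void *userData = nullptr;   // points at the owning Connection
    std::deque<Queued> queue;   // messages accepted but not yet drained

    void send(std::string_view /*payload*/, Callback cb, void *cbData) {
        queue.push_back({cb, cbData});
    }
    // The peer consumed one message: completion is reported with a live socket.
    void drainOne() {
        if (queue.empty()) return;
        Queued q = queue.front();
        queue.pop_front();
        q.cb(this, q.cbData);
    }
    // Teardown flush: remaining callbacks fire with ws == nullptr.
    void cancelAll() {
        while (!queue.empty()) {
            Queued q = queue.front();
            queue.pop_front();
            q.cb(nullptr, q.cbData);
        }
    }
};

struct Connection {
    struct Stats { uint64_t pendingOutbound = 0; } stats;
};

void onSendComplete(FakeSocket *ws, void *cbData) {
    if (ws == nullptr) return;  // Connection may already be freed; do not touch it
    auto *c = static_cast<Connection *>(ws->userData);
    auto size = reinterpret_cast<uintptr_t>(cbData);  // size smuggled through void*
    assert(c->stats.pendingOutbound >= size);
    c->stats.pendingOutbound -= size;
}

void doSend(FakeSocket &ws, Connection &c, std::string_view payload) {
    c.stats.pendingOutbound += payload.size();  // count before handing off
    ws.send(payload, onSendComplete,
            reinterpret_cast<void *>(static_cast<uintptr_t>(payload.size())));
}
```

In this sketch, callbacks fired by `cancelAll()` leave `pendingOutbound` unchanged, which mirrors why the disconnect log reports whatever was pending at teardown time.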

Back-pressure cap — src/apps/relay/RelayWebsocket.cpp, src/apps/relay/golpe.yaml, strfry.conf, src/PrometheusMetrics.h

  • New config relay.maxPendingOutboundBytes (relay__maxPendingOutboundBytes / --set relay__maxPendingOutboundBytes=…), default 0 = unlimited (backward compatible).
  • After each send() in doSend, if the cap is non-zero and pendingOutbound exceeds the threshold, the relay logs a warning and calls websocket->terminate(), then returns immediately (Connection is freed synchronously in onDisconnection; c must not be used after terminate()).
  • terminate() is used instead of close() so we do not enqueue an extra CLOSE frame on an already backlogged outbound path.
  • Adds Prometheus counter strfry_slow_client_terminations_total for operators who scrape /metrics.
  • Does not patch uWebSockets, does not change ReqMonitor flow control, and does not pace upstream producers beyond dropping the slow peer.
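
A rough sketch of the cap check, under stated assumptions: `Conn`, `sendWithCap`, and `slowClientTerminations` are illustrative names, the `terminated` flag stands in for `websocket->terminate()`, and the plain counter stands in for the Prometheus metric. It is not the code in `RelayWebsocket.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical minimal connection state for illustration only.
struct Conn {
    uint64_t pendingOutbound = 0;
    bool terminated = false;  // stands in for websocket->terminate()
};

// Stand-in for the strfry_slow_client_terminations_total counter.
static uint64_t slowClientTerminations = 0;

// A cap of 0 means unlimited, matching the backward-compatible default.
inline bool overCap(uint64_t pending, uint64_t cap) {
    return cap != 0 && pending > cap;
}

// Returns false when the connection was terminated. In the real relay the
// Connection is freed at that point, so the caller must return immediately.
bool sendWithCap(Conn &conn, uint64_t payloadSize, uint64_t cap) {
    assert(!conn.terminated);             // never send on a dead connection
    conn.pendingOutbound += payloadSize;  // counted before the send
    // ... the actual send() would happen here ...
    if (overCap(conn.pendingOutbound, cap)) {
        std::fprintf(stderr, "Slow client: pendingOutbound %llu exceeds cap %llu\n",
                     static_cast<unsigned long long>(conn.pendingOutbound),
                     static_cast<unsigned long long>(cap));
        ++slowClientTerminations;
        conn.terminated = true;
        return false;
    }
    return true;
}
```

Because the check runs after every send, a client that drains nothing is cut off on the first message that pushes its pending total past the cap, not at some later sweep.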

Related issues


Motivation and context

Slow or stalled readers can leave a large amount of application payload buffered inside uWS. Operators previously saw aggregate UP bytes on disconnect but not how much was still queued. pendingOutbound makes that visible per connection. The optional cap bounds memory by terminating peers that fall too far behind, without claiming full TCP buffer accounting.


How has this been tested?

Environment: Local build (WSL/Linux with the existing strfry toolchain per README).

Commands run:

  • make setup-golpe && make -j4 (or project-equivalent full relay build).
  • Manual smoke (observability): ./strfry relay → WebSocket client connect → REQ path that receives outbound frames → clean close; confirm disconnect log includes Pending: (often 0b when the queue has drained before close).
  • Manual smoke (cap): populate a small DB (e.g. strfry import --no-verify with many medium-sized events), start relay with --set relay__maxPendingOutboundBytes=<low value>, connect a client that performs the WebSocket handshake then sends many REQs and does not read (small SO_RCVBUF helps); confirm log line Slow client: pendingOutbound … exceeds relay.maxPendingOutboundBytes and strfry_slow_client_terminations_total on /metrics.

Screenshots: N/A


Types of changes

  • Non-functional change (docs, style, minor refactor)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Default maxPendingOutboundBytes = 0 preserves prior behavior except for the extra Pending: field in the disconnect log.


Checklist

  • My code follows the code style of this project.
  • I have updated the TODO accordingly.
  • All new and existing tests passed

@archief2910
Contributor Author

Hi @hoytech, could you please take a look? I have implemented and tested the changes for sub-issue 2 (closing slow client connections).

@archief2910
Contributor Author

archief2910 commented Apr 28, 2026

Hi @hoytech, could you please take a look at this?

@hoytech
Owner

hoytech commented Apr 28, 2026

Pasting from TG discussion:

I wonder what would happen if a connection does a REQ that results in a "large" response (ie, bigger than maxPendingOutboundBytes) before it has a chance to read anything. Would it get disconnected right away?

@archief2910
Contributor Author

archief2910 commented Apr 28, 2026

We can set maxPendingOutboundBytes to 128 MB. The worst-case single REQ is maxFilterLimit (500) × maxEventSize (64 KB) = 32 MB, so a 128 MB cap would not disconnect a client over one large response, while still protecting against genuinely stalled connections out of the box.
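
The sizing argument can be checked at compile time. This is only a sketch of the arithmetic: the constant names are illustrative, the limits are the figures quoted in this comment rather than values read from a config, KB and MB are assumed to be binary units, and per-message framing overhead is ignored.

```cpp
#include <cstdint>

constexpr uint64_t KiB = 1024;
constexpr uint64_t MiB = 1024 * KiB;

// Figures quoted in the discussion; assumptions, not values read from config.
constexpr uint64_t maxFilterLimit = 500;       // events returned per REQ
constexpr uint64_t maxEventSize   = 64 * KiB;  // bytes per event

// Largest payload a single REQ can enqueue before the client reads anything.
constexpr uint64_t worstCaseSingleReq = maxFilterLimit * maxEventSize;

// Proposed default cap.
constexpr uint64_t proposedCap = 128 * MiB;

static_assert(worstCaseSingleReq == 32768000ULL, "500 x 64 KiB in bytes");
static_assert(worstCaseSingleReq < 32 * MiB, "just under 32 MiB (31.25 MiB)");
static_assert(proposedCap / worstCaseSingleReq == 4,
              "the cap holds four full worst-case responses");
```

With binary units the worst case is 31.25 MiB, so a 32 MB cap sits almost exactly at one worst-case response, while 128 MB leaves room for about four.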

@archief2910
Contributor Author

archief2910 commented Apr 28, 2026

We could also set it more conservatively, at 32–48 MB. @hoytech

@hoytech
Owner

hoytech commented Apr 28, 2026

Thank you!

@hoytech hoytech merged commit 0785c0c into hoytech:master Apr 28, 2026
1 check failed
@archief2910 archief2910 deleted the feature/213-capping-slow-ws-clients branch April 28, 2026 21:22

Development

Successfully merging this pull request may close these issues.

Feature : Cap slow WebSocket clients with relay.maxPendingOutboundBytes

2 participants