Feature: Per-connection pending outbound observability and configurable back-pressure cap (#218)
Merged
hoytech merged 3 commits into hoytech:master on Apr 28, 2026
Conversation
Contributor
Author

Hi @hoytech, could you please take a look? I have implemented and tested the changes for sub-issue 2 (closing slow-client connections).
Contributor
Author

Hi @hoytech, could you please check this once more?
Owner

Pasting from TG discussion: I wonder what would happen if a connection does a REQ that results in a "large" response (i.e., bigger than
Contributor
Author

We can set maxPendingOutboundBytes to 128 MB. The worst case for a single REQ is maxFilterLimit (500) × maxEventSize (64 KB) ≈ 32 MB, so 128 MB protects against genuinely stalled connections out of the box.
Contributor
Author

We could also set it more conservatively, at 32–48 MB. @hoytech
Owner

Thank you!
Description
Observability (#212) — `src/apps/relay/RelayWebsocket.cpp`

- Extends `Connection::Stats` with `pendingOutbound`: application bytes passed to `WebSocket::send` that are not yet fully drained (still in uWS's queue or an in-flight partial write).
- `doSend` increments `pendingOutbound` by `payload.size()` immediately before `send()`, passes that size through the send completion callback's user-data pointer (`uintptr_t` via `void *`), and decrements it in the callback when uWS reports the message finished (success, immediate failure, or after queued drain).
- Guards the case where the callback fires with `ws == nullptr`, matching uWS's `onEnd` flush path: that can run after `onDisconnection` has logged and deleted the `Connection`, so we must not dereference `getUserData()` there. The disconnect log therefore captures whatever `pendingOutbound` was at teardown time; post-disconnect cancellation callbacks do not touch freed memory.
- The disconnect log line includes `Pending: …` using the same `renderSize` helper.

Back-pressure cap — `src/apps/relay/RelayWebsocket.cpp`, `src/apps/relay/golpe.yaml`, `strfry.conf`, `src/PrometheusMetrics.h`

- New setting `relay.maxPendingOutboundBytes` (`relay__maxPendingOutboundBytes` / `--set relay__maxPendingOutboundBytes=…`), default `0` = unlimited (backward compatible).
- Before `send()` in `doSend`, if the cap is non-zero and `pendingOutbound` exceeds the threshold, the relay logs a warning and calls `websocket->terminate()`, then returns immediately (`Connection` is freed synchronously in `onDisconnection`; `c` must not be used after `terminate()`).
- `terminate()` is used instead of `close()` so we do not enqueue an extra CLOSE frame on an already backlogged outbound path.
- New Prometheus counter `strfry_slow_client_terminations_total` for operators who scrape `/metrics`.
- The change does not alter `ReqMonitor` flow control, and does not pace upstream producers beyond dropping the slow peer.
Motivation and context
Slow or stalled readers can leave a large amount of application payload buffered inside uWS. Operators previously saw aggregate UP bytes on disconnect but not how much was still queued.
`pendingOutbound` makes that visible per connection. The optional cap bounds memory by terminating peers that fall too far behind, without claiming full TCP buffer accounting.

How has this been tested?
Environment: Local build (WSL/Linux with the existing strfry toolchain per README).
Commands run:
- `make setup-golpe && make -j4` (or the project-equivalent full relay build).
- `./strfry relay` → WebSocket client connect → REQ path that receives outbound frames → clean close; confirm the disconnect log includes `Pending:` (often `0b` when the queue has drained before close).
- Populate the DB (`strfry import --no-verify` with many medium-sized events), start the relay with `--set relay__maxPendingOutboundBytes=<low value>`, and connect a client that performs the WebSocket handshake, sends many REQs, and does not read (a small `SO_RCVBUF` helps); confirm the log line `Slow client: pendingOutbound … exceeds relay.maxPendingOutboundBytes` and `strfry_slow_client_terminations_total` on `/metrics`.

Screenshots: N/A
Types of changes
Default: `maxPendingOutboundBytes = 0` preserves prior behavior except for the extra `Pending:` field in the disconnect log.
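For operators, enabling the cap would look roughly like the fragment below. This is a sketch: the placement of the key under the `relay` block is an assumption inferred from the `relay__maxPendingOutboundBytes` override name, and the 128 MB value comes from the review discussion; check the shipped `strfry.conf` for the exact syntax.

```conf
relay {
    # 0 = unlimited (the default). Non-zero: terminate a connection once its
    # undrained outbound payload exceeds this many bytes.
    maxPendingOutboundBytes = 134217728  # 128 MB, per the review discussion
}
```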