## Breaking WS API changes

### Subscribe filter

`node` → `nodes`.

### New `batch` message type

Events can now be batched, up to 100 per WS frame:

```json
{"type":"batch","data":[{event1},{event2},...],"timestamp":"2025-..."}
```

Single events are still sent unwrapped as `{"type":"event",...}`. Clients must handle both.
### `--no-database` mode

Minimal `/api/ws` + `/api/health` router for WS-only streaming without a DB. Stats, metrics, and alerts are disabled. Mainly for streaming performance evaluation.

## Optimizations
### WS sender: lagged handling
Previously, a `RecvError::Lagged` from the broadcast channel killed the WS connection; the client had to reconnect and lost all context. Now lagged events are counted and skipped, the connection stays alive, and the client keeps receiving. Debug logs report `lagged_total`, so slow clients are visible without being punished.
### WS sender: opportunistic batching

After receiving one event, the WS handler drains up to 100 more via `try_recv()` (non-blocking) and sends them in a single frame. Measured 1.6M ev/s vs 158K ev/s unbatched. Also helps drain `broadcast_lag` spikes faster (observed up to 524K with batch=10).
### Dedicated ingestion runtimes (default: 8)

With a single tokio runtime, 1024 TCP ingestion tasks starved WS client tasks: the WS sender couldn't get scheduled often enough to drain its broadcast receiver, causing `broadcast_lag` to climb and events to be lost. Ingestion now runs on 8 dedicated single-thread tokio runtimes (`--ingestion-threads`, `SO_REUSEPORT`), freeing the main runtime for WS clients, the API, and the aggregator.
### Aggregator bottleneck fixes

Three bottlenecks were hit at 600K+ events/s with WS subscribers connected:
1. **Lock contention on the `node_channels` HashMap.** `broadcast_event()` took a write lock on the `RwLock<HashMap>` for every event. Replaced with an actor pattern: the aggregator owns the HashMap, and WS handlers send commands via mpsc with oneshot replies. The lock is removed entirely.
2. **Aggregator busy-loop.** `try_recv()` in a tight loop burned CPU when idle. Replaced with `select!` over the batch channel and the command channel.
3. **JSON serialization in the single aggregator.** Building WS envelopes (`RawValue` parse + serde) for every event at 600K/s caused `mpsc_lag` to grow to 272K; the aggregator couldn't keep up. Serialization moved to the 8 ingestion runtimes (~75K ev/s each), so the aggregator now does pure routing. Result: `mpsc_lag` stays at 0.