server: add streaming ingest and e2e coverage #8
run_websocket_server's `ingest_mode` argument was reaching `hl_listen`
(so the file watcher pointed at the right node_*_streaming directories)
but never reaching the OrderBookListener itself, which was constructed
via `OrderBookListener::new(...)` — defaulted to IngestMode::Block.
Net effect for `--ingest-mode stream` deploys:
- File watcher correctly read node_*_streaming.
- process_data → receive_batch → match self.ingest_mode → Block →
receive_block_batch, populating order_status_cache / order_diff_cache.
- First snapshot fetch → init_from_snapshot at height H → while-pop_cache
loop pops a cached batch at height H+N (N >> 1), apply_updates strict
+1 monotonic check fires, "Failed to apply updates to this book" log,
retry=true, order_book_state stays None.
- drain_streaming_blocks (the streaming-mode replay path) is gated
behind retry==false, so it never runs.
- Result: snapshot + refdata multicast keeps flowing (those don't
depend on order_book_state), but DoB / TOB delta emit pipelines
stay silent.
Reproduced live on aws-tyo-hl-mainnet running this branch with
`--ingest-mode stream`: tcpdump showed 17.2 pkt/s on dob-snapshot-port
(10003) and 0.4 pkt/s combined on the two mktdata ports (9001 + 10001).
The fix is to use `new_with_ingest_mode` so the listener's ingest_mode
matches the file watcher's. With this, in stream mode:
- receive_stream_batch is invoked → events go into streaming_state.blocks.
- init_from_snapshot's while-pop_cache loop is a no-op (block-mode
caches stay empty), retry stays false, order_book_state is set,
drain_streaming_blocks runs and replays the streaming events
against the snapshot via apply_stream_diff.
cargo test --workspace: 129 passed, 0 failed.
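The mode mismatch described above can be illustrated with a minimal sketch. `ListenerSketch` and its two `Vec` fields are hypothetical stand-ins for `OrderBookListener` and its block-mode caches / `streaming_state.blocks`; only the dispatch on `ingest_mode` mirrors the commit message.

```rust
/// Ingest mode, mirroring the `--ingest-mode block|stream` flag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IngestMode {
    Block,
    Stream,
}

/// Hypothetical stand-in for OrderBookListener (not the real type).
struct ListenerSketch {
    ingest_mode: IngestMode,
    /// Stand-in for order_status_cache / order_diff_cache.
    block_cache: Vec<u64>,
    /// Stand-in for streaming_state.blocks.
    streaming_blocks: Vec<u64>,
}

impl ListenerSketch {
    /// The mode must be passed explicitly; there is no defaulting constructor.
    fn new_with_ingest_mode(ingest_mode: IngestMode) -> Self {
        Self {
            ingest_mode,
            block_cache: Vec::new(),
            streaming_blocks: Vec::new(),
        }
    }

    /// Routes a batch (represented here only by its height) by ingest mode.
    fn receive_batch(&mut self, height: u64) {
        match self.ingest_mode {
            IngestMode::Block => self.block_cache.push(height),
            IngestMode::Stream => self.streaming_blocks.push(height),
        }
    }
}
```

A listener built with `IngestMode::Block` while the watcher reads streaming files would fill `block_cache` and leave `streaming_blocks` empty, which is the shape of the failure described in this commit.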
Tested this branch on
Fix in 43e467f: use
After 43e467f, init_from_snapshot's streaming-mode replay path
(drain_streaming_blocks) was reachable, but immediately failed with

Failed to apply cached streaming updates after snapshot: Received finalized streaming block 987725382, current height 987725455

because streaming_state.blocks accumulated events while waiting for the
first snapshot fetch (~5-15s of clock drift). Those events have block
heights <= snapshot height, so they are already reflected in the
snapshot, and apply_stream_diff correctly rejects them via its
`block_number < self.height` check.

drain_streaming_blocks bails on the first error, so each snapshot fetch
cycle only clears one stale block. With ~70 stale blocks queued, that
would have taken ~70 minutes of restart-cycles to clear, during which
DoB/TOB delta multicast stays silent.

Fix: in init_from_snapshot, before draining, prune
streaming_state.blocks to drop any block whose height <= snapshot
height. Logs the dropped count for observability.

Reproduced live on aws-tyo-hl-mainnet:
18:57:26 ERROR ... Received finalized streaming block 987725382, current height 987725455
18:58:26 ERROR ... Received finalized streaming block 987725383, current height 987726329
(mktdata: 1 packet / 5s on 9001+10001 throughout)

cargo test --workspace: 129 passed, 0 failed.
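The prune step can be sketched as follows. This assumes the queued blocks live in a `BTreeMap` keyed by block height, which is an assumption for illustration; the real container behind `streaming_state.blocks` is not shown in this PR text, and `prune_stale_blocks` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

/// Drops every queued block whose height is <= `snapshot_height` and
/// returns how many were dropped, so the caller can log the count.
fn prune_stale_blocks<T>(blocks: &mut BTreeMap<u64, T>, snapshot_height: u64) -> usize {
    let before = blocks.len();
    // Keep only blocks strictly newer than the snapshot.
    blocks.retain(|height, _| *height > snapshot_height);
    before - blocks.len()
}
```

Pruning in one pass before draining avoids the one-stale-block-per-snapshot-cycle behavior described above.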
…en ctor

Three follow-on fixes after the listener constructed-with-wrong-mode and
prune-stale-blocks fixes from 43e467f and d49d15b. Surfaced when the
streaming pipeline started actually running on aws-tyo-hl-mainnet.

1. Cross-file finalization race (bug)
hl-node writes streaming statuses, diffs, and fills concurrently;
notify delivers them through one channel but cross-file ordering lags.
drain_streaming_blocks may finalize block N (because we saw N+1's
diffs) when N's statuses then arrive, causing
ensure_stream_block_not_finalized to return Err and tear down the
listener. Soft-fix: drop late events with a warn log in
receive_stream_statuses / receive_stream_diffs. Removes the now-dead
helper. Test renamed to *_is_dropped and asserts Ok().

2. Per-diff L2 snapshot publish (perf)
drain_streaming_blocks called publish_l2_snapshot inside the per-diff
loop. Each call recomputes the L2 snapshot for every active instrument
and spawns a tokio task; on a busy mainnet feed this was the dominant
cost and stream mode could not keep pace with real-time hl-node
throughput. process_data already calls l2_snapshots(true) at the end of
every file-read chunk (same cadence block mode uses), so the per-diff
call was redundant beyond being expensive. Removed.

3. Remove OrderBookListener::new convenience constructor
The convenience constructor defaulted to IngestMode::Block, which
silently created Block-mode listeners in the streaming code path
(43e467f's bug). Removing it forces every call site to specify the mode
and makes that bug-class structurally impossible. Test fixtures updated
to use new_with_ingest_mode explicitly.

New test:
init_from_snapshot_prunes_stale_pre_snapshot_streaming_blocks covers
d49d15b directly (no test asserted that prune behavior before).

The existing wire-parity test
dual_validator_capture_matches_block_and_stream_payloads (line 552)
still passes: block-vs-stream multicast frame byte parity is maintained
through all of these changes.
cargo test --workspace: 130 passed, 0 failed.
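The soft-fix for the cross-file finalization race (item 1) might look roughly like the sketch below. `StreamStateSketch`, `Outcome`, and `receive_event` are hypothetical names; the real code lives in receive_stream_statuses / receive_stream_diffs and is not reproduced here. `eprintln!` stands in for the warn log.

```rust
/// What happened to an incoming streaming event.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Accepted,
    DroppedLate,
}

/// Hypothetical stand-in for the listener's streaming state.
struct StreamStateSketch {
    /// Highest block height already finalized, if any.
    finalized_through: Option<u64>,
    /// Events still waiting to be drained: (block height, payload).
    pending: Vec<(u64, String)>,
}

impl StreamStateSketch {
    /// Accepts an event unless its block was already finalized; a late
    /// event is logged and dropped rather than returned as an error.
    fn receive_event(&mut self, block: u64, payload: &str) -> Result<Outcome, String> {
        if let Some(finalized) = self.finalized_through {
            if block <= finalized {
                eprintln!(
                    "warn: dropping late streaming event for block {} (finalized through {})",
                    block, finalized
                );
                return Ok(Outcome::DroppedLate);
            }
        }
        self.pending.push((block, payload.to_owned()));
        Ok(Outcome::Accepted)
    }
}
```

Returning `Ok` on the late path keeps the listener alive, which matches the renamed `*_is_dropped` test asserting `Ok()`.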
Both apply_updates (block mode) and apply_stream_diff (stream mode) used
to return Err if a Remove or Update diff referenced an oid that wasn't
on our internal book. That error propagated up to hl_listen, tore down
the listener task, and systemd restarted the process. On a busy mainnet
feed in stream mode this triggered every ~30s — the cancel reached us
before the corresponding snapshot reflected the new order.
Block mode has the same strict check; it just hits this case rarely
enough in production not to crash. The two modes need symmetric
behavior — Steve specifically called out parity as the concern — and
the snapshot validator already reconciles the book against hl-node every
60s with surgical per-coin recovery. So both modes can safely log+skip
the failed op:
- apply_updates Update/Remove (state.rs:198-231)
- apply_stream_diff Update/Remove (state.rs:290-310)
The previous batch_boundary close on the failure path is no longer
needed because we don't bail mid-batch — the regular boundary close at
the end of the iteration handles it.
New e2e test missing_order_remove_is_skipped_in_both_modes asserts
both block and stream listeners stay ready after a phantom Remove,
catching regressions in the parity Steve flagged.
Reproduced live on aws-tyo-hl-mainnet:
19:27:01 ERROR ... OrderDiffs processing error: Unable to find order
on the book NodeDataOrderDiff { oid: 415431370878, coin: BTC,
raw_book_diff: Remove }
followed by NRestarts climbing.
cargo test --workspace: 131 passed, 0 failed.
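A minimal sketch of the log+skip behavior for Update/Remove, under simplifying assumptions: the book is modeled as a `HashMap` from oid to size, and `DiffSketch` / `apply_diffs` are hypothetical names rather than the types in state.rs. `eprintln!` stands in for the real log call.

```rust
use std::collections::HashMap;

/// Simplified book diff (hypothetical; not the real NodeDataOrderDiff).
enum DiffSketch {
    New { oid: u64, sz: f64 },
    Update { oid: u64, sz: f64 },
    Remove { oid: u64 },
}

/// Applies diffs to an oid -> size book. A missing oid on Update/Remove
/// is logged and skipped instead of aborting the batch. Returns the
/// number of skipped diffs.
fn apply_diffs(book: &mut HashMap<u64, f64>, diffs: &[DiffSketch]) -> usize {
    let mut skipped = 0;
    for diff in diffs {
        match diff {
            DiffSketch::New { oid, sz } => {
                book.insert(*oid, *sz);
            }
            DiffSketch::Update { oid, sz } => match book.get_mut(oid) {
                Some(resting) => *resting = *sz,
                None => {
                    eprintln!("warn: Update for unknown oid {}, skipping", oid);
                    skipped += 1;
                }
            },
            DiffSketch::Remove { oid } => {
                if book.remove(oid).is_none() {
                    eprintln!("warn: Remove for unknown oid {}, skipping", oid);
                    skipped += 1;
                }
            }
        }
    }
    skipped
}
```

Because the loop never bails mid-batch, later diffs in the same batch still apply, which is why the failure-path boundary close is no longer needed.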
Pushed four follow-on fixes after deploying surfaced each one. All preserve block-vs-stream wire parity (existing
cargo test --workspace: 131 passed, 0 failed.
Update: deployed all fixes to a real HL mainnet node. Stream mode runs but produces ~10 missing-order Update/Remove diffs per second where block mode produces ~0. The soft-tolerance in Root cause is the snapshot/streaming-event boundary alignment — Switched the mainnet host back to block mode (still on this branch — confirms all fixes preserve block-mode behavior). Stream mode root cause still needs investigation before this can be merged. |
The earlier soft-tolerance fix (7caf715) covered Update/Remove for
orders missing from the book in both apply_updates and
apply_stream_diff, but NOT the third strict failure mode in those
functions: a New diff whose opening status isn't available.

Block mode (apply_updates): the order_statuses batch is filtered by
`is_inserted_into_book()` into an order_map. If a New diff's oid isn't
in that map (status didn't end up in the batch, e.g. transient order),
strict path returned `Unable to find order opening status` and tore the
listener down. Reproduced live on aws-tyo-hl-mainnet running this branch
in block mode (~1 occurrence per 4 hours, recovers in ~10s but generates
restart noise).

Stream mode (apply_stream_diff): drain_streaming_blocks normally defers
New diffs without status (BREAKs the inner loop). Defensive fix here
preserves parity with block mode for any pathway that might still call
apply_stream_diff with order_status=None.

Both modes now log+skip and advance height/time so the diff isn't
replayed; snapshot validation reconciles the coin's book within 60s via
apply_recovery, same recovery pattern Update/Remove already use.

New e2e test new_diff_without_opening_status_is_skipped_in_block_mode
asserts the listener stays healthy after a phantom New, mirroring the
existing missing_order_remove_is_skipped_in_both_modes coverage.

Note: soft-tolerance does NOT cause crossed books. OrderBook::add_order
runs match_order on every add, so the internal book is structurally
non-crossing even when state lingers. The actual cost of soft-tolerance
is up to 60s of phantom executions/wrong events on the wire per affected
coin until snapshot validation detects divergence and emits
InstrumentReset.

cargo test --workspace: 132 passed, 0 failed.
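The block-mode path described above (filter statuses into an order_map, then skip New diffs with no opening status while still advancing height) can be sketched like this. `StatusSketch`, `BookSketch`, and `apply_new_diffs` are hypothetical names and the shapes are heavily simplified; `eprintln!` stands in for the real log call.

```rust
use std::collections::HashMap;

/// Simplified order status (hypothetical shape).
struct StatusSketch {
    oid: u64,
    inserted_into_book: bool,
}

/// Simplified book state: current height plus resting oids.
struct BookSketch {
    height: u64,
    resting: Vec<u64>,
}

/// Applies New diffs for one block. A New diff whose opening status is
/// missing from the filtered map is logged and skipped. Height advances
/// either way so the batch is not replayed. Returns the skipped count.
fn apply_new_diffs(
    book: &mut BookSketch,
    height: u64,
    statuses: &[StatusSketch],
    new_oids: &[u64],
) -> usize {
    // Only statuses that were inserted into the book count as opening statuses.
    let order_map: HashMap<u64, &StatusSketch> = statuses
        .iter()
        .filter(|status| status.inserted_into_book)
        .map(|status| (status.oid, status))
        .collect();

    let mut skipped = 0;
    for oid in new_oids {
        if order_map.contains_key(oid) {
            book.resting.push(*oid);
        } else {
            eprintln!("warn: New diff for oid {} has no opening status, skipping", oid);
            skipped += 1;
        }
    }
    book.height = height;
    skipped
}
```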
Trade::from_fills used unwrap() to extract the Ask and Bid fills from a
HashMap<Side, NodeDataFill>, and assert_eq! to check coin/tid agreement
between the pair. The pair-up upstream in coin_to_trades is best-effort:
it pops two adjacent fills from the batch and inserts them keyed by
Side. If both happen to share a Side, the second insert overwrites the
first and the HashMap has only one entry, so unwrap() panics. Same shape
if the pair has mismatched coin or tid.

Failure mode in production: this panic kills the tokio worker running
MulticastPublisher::run, which silently terminates the TOB Quote/Trade
emit pipeline. The publisher process itself stays up (other tokio tasks
unaffected), so systemd doesn't restart, NRestarts stays 0, and the
operator only sees TOB ports go quiet on the wire while DoB keeps
working.

Reproduced live on aws-tyo-hl-mainnet running the
streaming-ingest-parity branch in block mode, after the upstream
soft-tolerance fixes (43e467f, d49d15b, b16d985, 7caf715, 89a113d) let
the publisher stay up long enough to encounter a malformed pair. This is
malbeclabs/hyperliquid#4: same bug, surfaced harder by the absence of
restarts.

Fix:
- Trade::from_fills returns Option<Self>, uses ? on the HashMap removes
for both sides, and log+returns None on coin/tid mismatch (replacing
the assert_eq!s).
- coin_to_trades handles None by logging a warn and continuing. The
trade pair is dropped on the floor; the rest of the batch is still
processed normally.

New unit tests in types::trade_from_fills_tests cover all four cases:
matched pair (ok), missing side (None), coin mismatch (None), tid
mismatch (None).

cargo test --workspace: 136 passed, 0 failed.
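The panic-free shape of `from_fills` can be sketched as below. `FillSketch` and `TradeSketch` are hypothetical, trimmed-down stand-ins for `NodeDataFill` and `Trade` (only `coin` and `tid` are modeled), and `eprintln!` stands in for the real log call.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Side {
    Ask,
    Bid,
}

/// Trimmed-down stand-in for NodeDataFill.
#[derive(Clone, Debug, PartialEq, Eq)]
struct FillSketch {
    coin: String,
    tid: u64,
}

/// Trimmed-down stand-in for Trade.
#[derive(Debug, PartialEq, Eq)]
struct TradeSketch {
    coin: String,
    tid: u64,
}

impl TradeSketch {
    /// Returns None instead of panicking when a side is missing or the
    /// two fills disagree on coin or tid.
    fn from_fills(mut fills: HashMap<Side, FillSketch>) -> Option<Self> {
        let ask = fills.remove(&Side::Ask)?;
        let bid = fills.remove(&Side::Bid)?;
        if ask.coin != bid.coin || ask.tid != bid.tid {
            eprintln!("warn: mismatched fill pair: {:?} vs {:?}", ask, bid);
            return None;
        }
        Some(TradeSketch { coin: ask.coin, tid: ask.tid })
    }
}
```

A caller in the position of coin_to_trades would log and `continue` on `None`, so one malformed pair no longer takes down the task that publishes trades.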
Update on the streaming ingest canary/debugging: I found two streaming-mode ordering issues and pushed a fix in
Validation:
Tests run:
Clippy exits 0, with the same pre-existing warning noise.
Update on the DOB I traced this to instrument metadata resolution rather than the DOB tap itself. The live hl-node files were emitting coins such as Root causes fixed in
Downstream impact before the fix:
Validation:
Live canary:
armcconnell
left a comment
Follow-ups (non-blocking):
Important — observability gap. The soft-tolerance branches in server/src/listeners/order_book/state.rs (apply_updates / apply_stream_diff, ~6 places) and the drain-skip branches in server/src/listeners/order_book/mod.rs:1141-1173 log+skip without bumping a metric. Snapshot validation reconciles every 60s so it's recoverable, but a slow-burn validator/publisher divergence is invisible on dashboards. Suggest orderbook_listener_apply_skipped_total{source, kind} with bounded labels.
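One possible shape for the suggested counter, sketched with only the standard library so it stays self-contained. The label enums and `ApplySkippedCounter` are hypothetical; a real implementation would presumably register `orderbook_listener_apply_skipped_total` with whatever metrics crate the `/metrics` listener already uses. Using enums for `source` and `kind` keeps the label set bounded by construction.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Bounded `source` label values (hypothetical).
#[derive(Clone, Copy)]
enum SkipSource {
    Block = 0,
    Stream = 1,
}

/// Bounded `kind` label values (hypothetical).
#[derive(Clone, Copy)]
enum SkipKind {
    New = 0,
    Update = 1,
    Remove = 2,
}

/// One atomic counter per (source, kind) pair; all start at zero.
#[derive(Default)]
struct ApplySkippedCounter {
    counts: [[AtomicU64; 3]; 2],
}

impl ApplySkippedCounter {
    /// Bumps the counter for one skipped diff.
    fn inc(&self, source: SkipSource, kind: SkipKind) {
        self.counts[source as usize][kind as usize].fetch_add(1, Ordering::Relaxed);
    }

    /// Reads the current value for one label pair.
    fn get(&self, source: SkipSource, kind: SkipKind) -> u64 {
        self.counts[source as usize][kind as usize].load(Ordering::Relaxed)
    }
}
```

Each log+skip branch would call `inc` next to its log line, so a slow-burn divergence shows up as a rising rate rather than only as log volume.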
Minor:
- `server/src/listeners/order_book/progress.rs::now_epoch_ms` uses `SystemTime` for a monotonic report-cadence gate; should be `Instant` to be NTP-jump-resilient.
- `FillPairAccumulator::evict_to_capacity` runs post-batch in `server/src/multicast/publisher.rs:119-134`; a single batch > `MAX_PENDING=100_000` can technically breach the cap mid-batch. Not real at current HL volume but worth doing inline.
- The `block_mode_multicast_e2e` module now holds many stream-mode tests too, so the name is stale. Rename when convenient.
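For the first minor item, a report-cadence gate built on `Instant` could look like the sketch below. `ReportGate` is a hypothetical name and the actual structure of progress.rs is not shown in this thread; the point is only that `Instant` is monotonic, so a wall-clock jump cannot suppress or burst reports.

```rust
use std::time::{Duration, Instant};

/// Allows a report at most once per `interval`, measured on the
/// monotonic clock.
struct ReportGate {
    interval: Duration,
    last_report: Option<Instant>,
}

impl ReportGate {
    fn new(interval: Duration) -> Self {
        Self { interval, last_report: None }
    }

    /// Returns true (and records `now`) if no report has happened yet or
    /// at least `interval` has elapsed since the last one.
    fn should_report(&mut self, now: Instant) -> bool {
        match self.last_report {
            Some(last) if now.duration_since(last) < self.interval => false,
            _ => {
                self.last_report = Some(now);
                true
            }
        }
    }
}
```

Taking `now` as a parameter keeps the gate testable without sleeping; production code would pass `Instant::now()`.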
armcconnell
left a comment
I left a comment with some follow-up items, but I think we should merge this now.
Summary
This PR adds opt-in Hyperliquid streaming ingest while preserving the existing block-by-block ingest path. It keeps block mode as the compatibility baseline, adds live-shape fixture coverage for streaming mode, and expands multicast E2E coverage for TOB and DOB payloads.
Major Changes
- `--ingest-mode block|stream` and `--hl-data-root` configuration.
- `node_raw_book_diffs_streaming`, `node_order_statuses_streaming`, and `node_fills_streaming`.
- `New` diffs to insert resting orders directly instead of passing through local matching, matching validator-decided book state.
- `#...` token-alias fills.
- `/metrics` listener.

E2E And Fixture Coverage
- `edge-multicast-ref` in Docker.

Canary Notes
Validated on `aws-tyo-hl-mainnet` and `tyo-hl-node` during streaming-mode canaries. The canaries helped identify and fix:
- `New` orders being incorrectly matched away locally;
- `#...` token-alias fills.

Latest canary confirmed TOB fill-pair metrics split expected one-sided token-alias fills from true orphan drops, while trade packets continue to publish.

Tests Run
- `cargo test -p server multicast::publisher`
- `cargo test -p server fill_pair_accumulator -- --nocapture`
- `cargo test -p server listeners::order_book::block_mode_multicast_e2e -- --nocapture`
- `cargo test -p server dual_validator_fixture_matches_block_and_stream_goldens -- --nocapture`
- `cargo test -p server`
- `cargo clippy --workspace --all-targets`

Clippy exits successfully with pre-existing warning noise.