feat: add gossip metrics benchmarks and CI integration #1177
quake merged 6 commits into nervosnetwork:develop
Conversation
I prefer to run with production-like gossip intervals first, and the gossip performance benchmark may not need to be in CI, since we will not change this part frequently. It would be good if we could run the benchmark for this part on every release, or we could just track performance changes with a cron job like this: https://sunchengzhu.github.io/fiber-ci-metrics/ @quake what's your opinion on the metrics?
Sure, I’ve switched the benchmarks to production-like gossip intervals. For the perf tests, we can use macros to configure the interval settings. I’ve also removed the CI changes. Gossip benchmark result:
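The macro-based interval configuration mentioned above could look roughly like this. A minimal sketch only: the feature name `perf-test-intervals`, the constant names, and the fast-mode durations are assumptions for illustration, not the actual fiber code (the production-like values 60s/20s are the ones discussed in this PR).

```rust
use std::time::Duration;

// Production-like gossip intervals by default (60s/20s, as discussed in
// this PR); shorter ones when the assumed `perf-test-intervals` feature
// is enabled for the perf tests.
#[cfg(not(feature = "perf-test-intervals"))]
const GOSSIP_NETWORK_MAINTENANCE_INTERVAL: Duration = Duration::from_secs(60);
#[cfg(not(feature = "perf-test-intervals"))]
const GOSSIP_STORE_UPDATE_INTERVAL: Duration = Duration::from_secs(20);

// Hypothetical test-friendly values to keep benchmark runtime practical.
#[cfg(feature = "perf-test-intervals")]
const GOSSIP_NETWORK_MAINTENANCE_INTERVAL: Duration = Duration::from_secs(2);
#[cfg(feature = "perf-test-intervals")]
const GOSSIP_STORE_UPDATE_INTERVAL: Duration = Duration::from_secs(1);

fn main() {
    // Without the feature flag, the production-like values are selected.
    assert_eq!(GOSSIP_NETWORK_MAINTENANCE_INTERVAL, Duration::from_secs(60));
    assert_eq!(GOSSIP_STORE_UPDATE_INTERVAL, Duration::from_secs(20));
}
```

Selecting the intervals at compile time keeps the hot path free of runtime branching, at the cost of needing a separate build for perf runs.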
- add gossip protocol metrics counters/histograms and active-sync/query observability
- add criterion gossip benchmarks for multi-node propagation and sync-recovery
- add tests/perf gossip benchmark flows driven by metrics (steady/burst + baseline/compare)
- integrate benchmark CI with metrics-enabled startup and bootstrap gossip regression gating
@swananan is this PR ready to be merged?
This PR is ready for review. The main thing I’d still like to discuss is the metrics data design, but if that looks good, I’d be happy to merge this PR and continue the optimization work on a separate branch.
@quake I think it's ok to be merged (some trivial comments); you can also take a review.
lgtm, only two trivial comments from my side. |
link #1263, since this PR may affect gossip benchmarks.
Commits merged in from develop:

* fix: push limit to DB layer in list_payments to avoid unbounded memory usage (#1261)
  get_payment_sessions_with_limit previously called collect_by_prefix without a limit, loading all payment session KV pairs into memory before applying .filter_map().take(limit) as iterator adapters. Adds PrefixIterator::new_from() for cursor-based lazy iteration, plus prefix_iter/prefix_iter_from helpers on the FiberStore trait, and rewrites get_payment_sessions_with_limit as a simple iterator chain over the lazy batched PrefixIterator, which fetches only 100 entries at a time and stops as soon as enough results are collected.
* feat: update macOS build configuration for portability (#1237)
  Simplifies the macOS build steps by removing the Homebrew setup and renames targets to keep builds portable.
* feat: add official Docker image support (#1244)
  Publishes Docker images to GHCR and Docker Hub and fixes the Docker image README.
* Bump fiber-rpc-gen to 0.1.22 (#1264)
* network: onion & socks5 support for fiber (#1228)
  Adds onion and SOCKS5 proxy support, uses a nested structure for the onion and proxy config, sends MaintainConnections to the NetworkActor when Tor is reconnected (with a 3-second delay), fixes a race condition in onion service start, adds a timeout check for start_onion_service and a retry in OnionService::start, changes the default onion external port, isolates WASM configuration related to proxy and Tor, and moves proxy/onion configuration into their individual modules.
* Fix fnn-migrate error messages and update README (#1249)
  Updates the fnn-migrate flag from -p to -d in the error message and README, and shows the fiber data directory (without /store) in the fnn-migrate error message.
* chore(deps): bump tokio from 1.50.0 to 1.51.1
* docs: refresh public node and network node documentation (#1266)
  Renames testnet-nodes.md, adds network-nodes.md, updates the README network node links, refines the pubkey-based node docs, and clarifies node2 RPC discovery in the public nodes guide.
* fix: abort funding on insufficient UDT cells (#1195) (#1253)
  When UDT cells are not yet indexed, the funding transaction builder would fail with a generic TxBuilderError that was silently swallowed (empty tx logged and ignored), leaving UDT channels permanently stuck in the negotiating state. Adds a non-temporary FundingError::InsufficientCells variant so schedule_funding_retry aborts the channel instead of retrying, reclassifies AbsentTx as temporary so empty funding results are retried via the existing backoff mechanism, extracts map_tx_builder_error to convert the sentinel UDT message into InsufficientCells before it reaches the retry logic, and moves the FundingError tests to a dedicated tests/error.rs module with coverage for the new error mapping and classification.
* feat: add gossip metrics benchmarks and CI integration (#1177)
  This PR's own commits: add gossip protocol metrics counters/histograms and active-sync/query observability; add criterion gossip benchmarks for multi-node propagation and sync-recovery; add tests/perf gossip benchmark flows driven by metrics (steady/burst + baseline/compare); integrate benchmark CI with metrics-enabled startup and bootstrap gossip regression gating; stabilize gossip benchmark runs and move gossip perf to a dedicated workflow; ignore generated gossip perf benchmark artifacts; tighten gossip duplicate/rejected metrics and perf timing; move perf benchmark artifacts under tests/perf/artifacts; move gossip metrics helpers into dedicated modules.
* chore(deps): bump rand from 0.8.5 to 0.9.3 in /tests/deploy/udt-init
* Trigger gossip sync immediately on peer connection (#1271)
  Sends TickNetworkMaintenance to self when a new peer connects, eliminating the initial 0-60s wait before gossip data sync begins; this is critical for WASM nodes that need peer address data from gossip shortly after connecting to a bootnode (fixes #1269). Also fixes a gossip tick race by gating on control.is_some() and deferring from ReceivedControl, and adds a regression test for immediate gossip sync on peer connect.
* feat: add addr_type parameter to connect_peer RPC for transport type filtering (#1270)
  When connect_peer is called with only a pubkey, the node randomly selects an address from the peer's known addresses; in WASM environments this may select an unsupported non-WSS address. Adds an optional addr_type parameter (tcp/ws/wss) so callers can filter addresses by transport type before random selection. Per review, reuses the existing tentacle::utils::TransportType (converting at the RPC boundary in rpc/peer.rs) instead of introducing a custom MultiAddrTransport, adds a NoMatchingAddress error variant to distinguish "peer not found" from "peer has no addresses matching the requested transport type", and adds an is_addr_reachable() helper so DNS-based WSS addresses like /dns4/example.com/tcp/443/wss pass the private address filter (multiaddr_to_socketaddr() only handles Ip4/Ip6 protocols and was silently dropping them from node announcements); this is applied at announcement creation, graph ingestion, and gossip processing.
* feat(cch): default final TLC expiry deltas to 60 hours (#1258)
  Raises DEFAULT_BTC_FINAL_TLC_EXPIRY_DELTA_BLOCKS to 360 (~10 min/block) and DEFAULT_CKB_FINAL_TLC_EXPIRY_DELTA_SECONDS to 216,000, updates the CCH actor tests that assumed the previous 30h defaults, and replaces expiry magic numbers in tests with named constants.
* Local RPC method not found should not return unauthorized (#1235)
* chore: bump version to v0.8.1 (#1274)
Summary
This PR adds baseline gossip observability and benchmark coverage before protocol optimization (#1087).
It includes 4 parts:
1. New Criterion benchmarks (NetworkNode-based)
Added two new benches:
- gossip_propagation
- gossip_sync_recovery

2. New gossip metrics
Added the following metrics:
- fiber.gossip.received_broadcast_messages_total
- fiber.gossip.applied_broadcast_messages_total
- fiber.gossip.duplicate_broadcast_messages_total
- fiber.gossip.rejected_broadcast_messages_total
- fiber.gossip.received_bytes_total
- fiber.gossip.sent_bytes_total
- fiber.gossip.propagation_received_latency_ms
- fiber.gossip.propagation_applied_latency_ms
- fiber.gossip.propagation_samples_skipped_total
- fiber.gossip.active_sync_completion_ms
- fiber.gossip.active_sync_started_total
- fiber.gossip.active_sync_finished_total
- fiber.gossip.active_sync_timeout_total
- fiber.gossip.missing_dependency_messages_total
- fiber.gossip.dependency_query_requests_total
- fiber.gossip.dependency_query_items_total

Also configured dedicated histogram buckets for the gossip latency metrics, using millisecond boundaries [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 60000] to capture both fast propagation and long-tail sync delays with consistent percentile/coverage analysis.
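As an illustration of how these bucket boundaries behave (the boundary list is from this PR, but the `bucket_index` helper is hypothetical, not the actual metrics code): a sample is counted in the first bucket whose upper bound is at least the observed latency, and anything beyond 60s falls into the implicit +Inf bucket.

```rust
/// Millisecond upper bounds for the gossip latency histograms (from the PR).
const GOSSIP_LATENCY_BUCKETS_MS: [u64; 15] =
    [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 60000];

/// Illustrative helper: index of the bucket a latency sample is counted in.
/// Returns `None` for samples beyond the last bound (the implicit +Inf bucket).
fn bucket_index(latency_ms: u64) -> Option<usize> {
    // partition_point returns how many bounds are strictly below the sample,
    // i.e. the index of the first bound >= latency_ms.
    let idx = GOSSIP_LATENCY_BUCKETS_MS.partition_point(|&b| b < latency_ms);
    (idx < GOSSIP_LATENCY_BUCKETS_MS.len()).then_some(idx)
}

fn main() {
    assert_eq!(bucket_index(1), Some(0));   // fast propagation -> 1ms bucket
    assert_eq!(bucket_index(750), Some(9)); // counted in the 1000ms bucket
    assert_eq!(bucket_index(61_000), None); // long tail beyond 60s -> +Inf
}
```

The roughly logarithmic spacing keeps percentile estimates reasonably accurate at both ends without needing many buckets.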
3. New perf benchmark flow
Extended tests/perf with gossip-benchmark modes: gossip-benchmark takes two metric snapshots (before/after) and computes deltas from the new metrics.
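The snapshot/delta idea can be sketched as follows. The `GossipSnapshot` struct and its field names are hypothetical, loosely mirroring a few of the counters listed above, not the actual tests/perf code:

```rust
/// Hypothetical snapshot of a few of the gossip counters listed above,
/// captured once before and once after a benchmark run.
#[derive(Clone, Copy, Debug, PartialEq)]
struct GossipSnapshot {
    received_broadcast_messages: u64,
    applied_broadcast_messages: u64,
    duplicate_broadcast_messages: u64,
    sent_bytes: u64,
}

impl GossipSnapshot {
    /// Delta between two snapshots: what happened during the benchmark window.
    /// Counters are monotonic, so subtraction is safe barring process restarts.
    fn delta(before: Self, after: Self) -> Self {
        GossipSnapshot {
            received_broadcast_messages: after.received_broadcast_messages
                - before.received_broadcast_messages,
            applied_broadcast_messages: after.applied_broadcast_messages
                - before.applied_broadcast_messages,
            duplicate_broadcast_messages: after.duplicate_broadcast_messages
                - before.duplicate_broadcast_messages,
            sent_bytes: after.sent_bytes - before.sent_bytes,
        }
    }
}

fn main() {
    let before = GossipSnapshot {
        received_broadcast_messages: 100,
        applied_broadcast_messages: 90,
        duplicate_broadcast_messages: 5,
        sent_bytes: 10_000,
    };
    let after = GossipSnapshot {
        received_broadcast_messages: 400,
        applied_broadcast_messages: 350,
        duplicate_broadcast_messages: 30,
        sent_bytes: 55_000,
    };
    let d = GossipSnapshot::delta(before, after);
    assert_eq!(d.received_broadcast_messages, 300);
    assert_eq!(d.applied_broadcast_messages, 260);
    assert_eq!(d.sent_bytes, 45_000);
}
```

Deriving outputs from metric deltas means the benchmark measures exactly what the production counters measure, so CI numbers stay comparable to live observability data.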
How outputs are calculated from metrics:
Outputs include:
Supports baseline/compare report generation.
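A baseline/compare gate of this kind typically reduces to a tolerance check. A minimal sketch, where the function name and the 10% tolerance are assumptions for illustration, not this PR's actual gating rule:

```rust
/// Illustrative regression gate: flag a regression when the current value is
/// worse (higher) than the baseline by more than `tolerance_pct` percent.
fn is_regression(baseline_ms: f64, current_ms: f64, tolerance_pct: f64) -> bool {
    current_ms > baseline_ms * (1.0 + tolerance_pct / 100.0)
}

fn main() {
    // 5% slower: within an assumed 10% tolerance, gate passes.
    assert!(!is_regression(200.0, 210.0, 10.0));
    // 15% slower: outside tolerance, gate fails.
    assert!(is_regression(200.0, 230.0, 10.0));
}
```

Some tolerance is necessary because gossip propagation latency in a multi-node benchmark is inherently noisy; a zero-tolerance gate would fail on run-to-run variance alone.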
4. CI updates to run and compare benchmarks safely
Updated the benchmark CI flow to include the gossip benchmark in a safe bootstrap way.
Notes
One open question is whether we should run these benchmarks with production-like gossip intervals (60s/20s) instead of test-friendly intervals. The current settings were chosen to keep runtime practical and feedback fast, but they may not fully reflect production behavior. I’d like feedback on whether we should switch to production values (or keep both fast-mode and prod-mode runs).

This PR is shared early to discuss and validate the overall direction first (metrics scope, benchmark design, and CI strategy) before moving to deeper optimization changes.