feat(metrics): Prometheus exporter + pipeline instrumentation #50
Merged
Block every .md from being tracked except README.md, and ignore reference HTML under docs/ so local-only architecture diagrams cannot leak into commits.
Add `charon-metrics` crate: a Prometheus text-format HTTP endpoint on
a configurable bind address (default 0.0.0.0:9091, off the Prometheus
server default of 9090 so a local compose stack doesn't collide).
Metric names live as `const &str` so dashboard JSON and alert rules
stay in sync with call sites through a single source of truth.
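A minimal sketch of the single-source-of-truth idea, assuming a `names` module of string constants. The metric strings are the ones listed in the PR summary; the constant identifiers, the `ALL` slice, and `is_valid_metric_name` are hypothetical names chosen for illustration and may not match the crate.

```rust
/// Metric names as constants, so call sites, dashboards, and alert rules
/// can all be checked against one list. Identifiers here are illustrative.
pub mod names {
    pub const SCANNER_BLOCKS_TOTAL: &str = "charon_scanner_blocks_total";
    pub const SCANNER_POSITIONS: &str = "charon_scanner_positions";
    pub const PIPELINE_BLOCK_DURATION_SECONDS: &str =
        "charon_pipeline_block_duration_seconds";
    pub const EXECUTOR_QUEUE_DEPTH: &str = "charon_executor_queue_depth";

    /// Hypothetical helper list, handy for tests that walk every name.
    pub const ALL: &[&str] = &[
        SCANNER_BLOCKS_TOTAL,
        SCANNER_POSITIONS,
        PIPELINE_BLOCK_DURATION_SECONDS,
        EXECUTOR_QUEUE_DEPTH,
    ];
}

/// Checks a name against the Prometheus metric-name grammar:
/// first character in [a-zA-Z_:], the rest in [a-zA-Z0-9_:].
pub fn is_valid_metric_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':')
}
```

A test that iterates `names::ALL` would catch a malformed or mistyped name before it reaches a dashboard.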
Instrument the per-block pipeline in `charon-cli`:
- blocks scanned (counter, per chain)
- position bucket counts (gauge, per chain × {healthy, near_liq, liquidatable})
- block duration (histogram, per chain, seconds)
- simulation outcomes (counter, per chain × {ok, revert, error})
- opportunities queued / dropped by stage (counters)
- per-opportunity profit (histogram, USD cents)
- queue depth (gauge)
- build info (labels: version, git_sha)
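The label values listed above form small closed sets, so one way to keep call sites from drifting is to model them as enums. The sketch below assumes that approach; the type and method names (`PositionBucket`, `SimResult`, `as_str`) are hypothetical, and only the label strings come from this PR.

```rust
/// Values of the position-bucket label on the positions gauge.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PositionBucket {
    Healthy,
    NearLiq,
    Liquidatable,
}

impl PositionBucket {
    /// Label value exactly as it should appear in the exported series.
    pub const fn as_str(self) -> &'static str {
        match self {
            PositionBucket::Healthy => "healthy",
            PositionBucket::NearLiq => "near_liq",
            PositionBucket::Liquidatable => "liquidatable",
        }
    }
}

/// Values of the outcome label on the simulations counter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimResult {
    Ok,
    Revert,
    Error,
}

impl SimResult {
    /// Label value exactly as it should appear in the exported series.
    pub const fn as_str(self) -> &'static str {
        match self {
            SimResult::Ok => "ok",
            SimResult::Revert => "revert",
            SimResult::Error => "error",
        }
    }
}
```

With this shape a typo such as `"near-liq"` fails to compile, rather than silently creating a new series.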
Config surface: new `[metrics]` block in TOML with `enabled` and
`bind`; both fall back to sane defaults so existing configs keep
working without edits.
Closes #20.
obchain added a commit that referenced this pull request on Apr 22, 2026
…et supervision, metrics
- Reconnect backoff now adds 0-25% random jitter before each sleep to
avoid correlated retry storms against a single RPC endpoint when
many listeners disconnect at the same instant.
- BlockListener::publish uses try_send instead of a blocking await on
the mpsc sender; a full channel drops the block with a warn log and
a charon_listener_dropped_events_total counter increment, keeping the
WS drain loop responsive so the transport never buffers past its
server-side limit.
- Track last_seen block per chain. On reconnect, fetch the current head
and backfill every block between last_seen + 1 and head - 1 via
get_block_by_number, emitting ChainEvent::NewBlock { backfill: true }
so downstream consumers see the same heartbeat during disconnect
windows.
- CLI run_listen now spawns listeners into a tokio::task::JoinSet and a
supervise() helper drains join results, logging per-chain task panics
or errors and triggering shutdown when every listener exits. Ctrl-C
also aborts outstanding listeners.
- Per-block log downgraded from info to debug (BSC ~3 s = 28,800
info lines/day otherwise). Add charon_listener_connects_total,
charon_listener_disconnects_total, charon_blocks_received_total,
charon_listener_dropped_events_total counters for PR #50.
- Workspace adds rand and metrics deps; ChainEvent is #[non_exhaustive]
and carries a backfill flag.
Closes #92 #93 #94 #95 #96
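Three of the listener changes above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the commit's code: `with_jitter`, `backfill_range`, `publish`, and `DROPPED_EVENTS` are hypothetical names; the random sample is passed in by the caller instead of drawn from `rand`; and `std::sync::mpsc::sync_channel` stands in for the async mpsc channel the listener actually uses.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{SyncSender, TrySendError};
use std::time::Duration;

/// Adds 0-25% jitter to `base`. `unit` is a uniform sample in [0, 1) that
/// the caller draws (for example with the `rand` crate); out-of-range or
/// NaN inputs are clamped so the multiplication cannot panic.
pub fn with_jitter(base: Duration, unit: f64) -> Duration {
    let unit = if unit.is_nan() { 0.0 } else { unit.clamp(0.0, 1.0) };
    base.mul_f64(1.0 + 0.25 * unit)
}

/// Blocks to backfill after a reconnect: last_seen + 1 through head - 1.
/// The range is empty when nothing was missed.
pub fn backfill_range(last_seen: u64, head: u64) -> Range<u64> {
    last_seen.saturating_add(1)..head
}

/// In-memory stand-in for charon_listener_dropped_events_total.
pub static DROPPED_EVENTS: AtomicU64 = AtomicU64::new(0);

/// Non-blocking publish. Returns false when the event was not delivered;
/// a full channel also bumps the drop counter instead of waiting.
pub fn publish(tx: &SyncSender<u64>, block: u64) -> bool {
    match tx.try_send(block) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            DROPPED_EVENTS.fetch_add(1, Ordering::Relaxed);
            false
        }
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

For example, `backfill_range(10, 14)` yields blocks 11, 12, and 13, while `backfill_range(10, 11)` is empty because the head is the very next block.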
Register explicit bucket boundaries for charon_pipeline_block_duration_seconds and charon_executor_profit_usd_cents via PrometheusBuilder::set_buckets_for_metric. Without matchers, the exporter renders both histograms as Prometheus summaries, producing NaN from histogram_quantile and empty heatmaps in the companion Grafana dashboard.
Block-duration buckets target BSC's 3s block cadence (healthy / warning / alert / overrun). Profit buckets cover the $0.05 dust to $10k+ windfall range observed on Venus liquidations.
Closes #275
Closes #218
Closes #217
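To show what explicit boundaries buy, here is a small standard-library sketch of how a Prometheus histogram counts samples: each boundary is a cumulative `le` bucket, followed by an implicit `+Inf` bucket. `histogram_quantile` works from these cumulative bucket series, which is why a metric rendered as a summary gives it nothing to interpolate. The boundary values are hypothetical placeholders, not the ones this commit registers, and `cumulative_counts` is an illustrative helper.

```rust
/// Hypothetical block-duration boundaries in seconds, loosely grouped
/// around a 3 s block cadence. The real values live in charon-metrics.
pub const BLOCK_DURATION_BUCKETS: [f64; 6] = [0.25, 0.5, 1.0, 2.0, 3.0, 6.0];

/// Returns one cumulative count per boundary plus a trailing +Inf bucket,
/// mirroring the `le` semantics of the Prometheus text format.
pub fn cumulative_counts(bounds: &[f64], samples: &[f64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for &sample in samples {
        for (i, &bound) in bounds.iter().enumerate() {
            // A sample lands in every bucket whose upper bound covers it.
            if sample <= bound {
                counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        counts[bounds.len()] += 1;
    }
    counts
}
```

With samples of 0.4 s, 2.5 s, and 7.0 s, the counts are `[0, 1, 1, 1, 2, 2, 3]`: the 7.0 s overrun shows up only in `+Inf`, which signals that the top boundary is too low for that workload.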
Declare metrics 0.24 and metrics-exporter-prometheus 0.16 in the root
[workspace.dependencies] table so every crate that pulls them in
reuses the same version. charon-metrics switches its direct version
strings to { workspace = true }; the http-listener feature stays
pinned at the workspace level.
Closes #219
process_opportunity pushes entries to the queue in two paths: simulation-gated (BOT_SIGNER_KEY set) and dry-run (signer absent). Both were incrementing charon_executor_opportunities_queued_total with the same label set, making the gate bypass invisible in dashboards.
Thread a `simulated: bool` through `record_opportunity_queued` so the counter splits by whether the eth_call gate actually ran. Help text on the exporter updated to document the label semantics.
This is the observability half of #220; the hard-refusal half (#170) lands separately on the executor branch.
Closes #220
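The effect of the extra label can be illustrated with an in-memory counter keyed by `(chain, simulated)`. This is only a stand-in for the real recorder: `QueuedCounter` and its `get` method are hypothetical, and the actual `record_opportunity_queued` presumably forwards to the metrics facade instead of a map.

```rust
use std::collections::HashMap;

/// In-memory stand-in for charon_executor_opportunities_queued_total,
/// keyed by the (chain, simulated) label pair.
#[derive(Default)]
pub struct QueuedCounter {
    counts: HashMap<(String, bool), u64>,
}

impl QueuedCounter {
    /// Increments the series for this chain and gate state.
    pub fn record_opportunity_queued(&mut self, chain: &str, simulated: bool) {
        *self
            .counts
            .entry((chain.to_string(), simulated))
            .or_insert(0) += 1;
    }

    /// Current value of one series; zero if it was never incremented.
    pub fn get(&self, chain: &str, simulated: bool) -> u64 {
        self.counts
            .get(&(chain.to_string(), simulated))
            .copied()
            .unwrap_or(0)
    }
}
```

Once the two paths land in separate series, a dashboard can alert when the `simulated = false` series grows in an environment where the signer is expected to be present.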
Default bind moves from 0.0.0.0:9091 to 127.0.0.1:9091 so a bare VPS deploy does not expose /metrics to the public internet out of the box.
MetricsConfig gains an Option<String> auth_token paired with a validate() gate: when metrics are enabled and bind is non-loopback, validate() refuses to start unless auth_token is non-empty. Config::load calls the gate after TOML parse so the check fails fast at startup.
The exporter itself does not yet terminate the Bearer check; the token is a shared secret enforced by a reverse proxy (nginx, caddy, Traefik) in front of the listener, so bot + proxy read one source.
Module rustdoc and the default.toml block describe the loopback-or-proxy contract; four unit tests lock in the three validate paths (loopback ok, non-loopback + token ok, non-loopback + empty/None token reject) plus the disabled-shortcut.
Closes #213
Closes #214
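A minimal sketch of the gate described above, using only the standard library. The field names follow the commit message, but the field types (a parsed `SocketAddr` for `bind`) and the `String` error are assumptions for illustration; the real `MetricsConfig` is deserialized from TOML and may differ.

```rust
use std::net::SocketAddr;

/// Illustrative shape of the `[metrics]` config; types are assumptions.
#[derive(Debug, Clone)]
pub struct MetricsConfig {
    pub enabled: bool,
    pub bind: SocketAddr,
    pub auth_token: Option<String>,
}

impl MetricsConfig {
    /// Rejects an enabled, non-loopback bind that has no usable token.
    pub fn validate(&self) -> Result<(), String> {
        // Disabled exporter or loopback bind: nothing to protect.
        if !self.enabled || self.bind.ip().is_loopback() {
            return Ok(());
        }
        match self.auth_token.as_deref() {
            Some(token) if !token.is_empty() => Ok(()),
            _ => Err(format!(
                "metrics bind {} is non-loopback; auth_token must be set",
                self.bind
            )),
        }
    }
}
```

The four paths named in the commit map directly onto this function: loopback passes, non-loopback with a token passes, non-loopback with an empty or missing token is rejected, and a disabled exporter short-circuits.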
Mirror the lint floor established on feat/20-multi-liq-batcher (issue #211) onto feat/22 so the two lineages converge on an identical policy at merge: forbid unsafe_code, deny arithmetic_side_effects, cast_possible_truncation, unwrap_used.
Root Cargo.toml grows [workspace.lints.rust] + [workspace.lints.clippy] and a workspace-level `thiserror = "1"` dependency. charon-metrics opts in via `[lints] workspace = true` and pulls `thiserror` through the workspace declaration.
Repair one violation surfaced by the policy: `probe.local_addr().unwrap()` in the exporter smoke test becomes `.expect("probe socket must expose its bound local_addr")`.
Closes #216
charon-metrics is a library crate. Returning anyhow::Result from
init() forced charon-cli to match on Display strings to tell a
port-collision (retryable) apart from a recorder double-install
(fatal). Every sibling library (charon-core, charon-executor on
feat/20) already uses thiserror — this is the last outlier.
Introduce `MetricsError` (#[non_exhaustive]) with two variants:
`BucketConfig { metric, source }` for `set_buckets_for_metric`
failures — carries the offending metric name so logs pinpoint the
offender — and `InstallFailed { bind, source }` for
`PrometheusBuilder::install` failures. Both variants hold the
underlying `metrics_exporter_prometheus::BuildError` via `#[source]`
so `Error::source()` chains preserve the original diagnosis.
`init` keeps its public signature shape via a crate-level
`Result<T, E = MetricsError>` alias; charon-cli's `?` continues to
work unchanged thanks to `anyhow::Error: From<E: Error>`. One unit
test drives the `BucketConfig` path via an empty bucket slice and
asserts Display + source chain.
Closes #215
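The error shape described above can be approximated without `thiserror` by writing the trait impls by hand. In this sketch `BuildError` is a local stand-in for `metrics_exporter_prometheus::BuildError`, `check_buckets` is a hypothetical function that fails on an empty bucket slice, and `Box<dyn Error>` plays the role that `anyhow::Error` plays in charon-cli.

```rust
use std::error::Error;
use std::fmt;

/// Local stand-in for metrics_exporter_prometheus::BuildError.
#[derive(Debug)]
pub struct BuildError(pub String);

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for BuildError {}

#[derive(Debug)]
#[non_exhaustive]
pub enum MetricsError {
    BucketConfig { metric: String, source: BuildError },
    InstallFailed { bind: String, source: BuildError },
}

impl fmt::Display for MetricsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::BucketConfig { metric, .. } => {
                write!(f, "invalid buckets for metric {}", metric)
            }
            Self::InstallFailed { bind, .. } => {
                write!(f, "failed to install exporter on {}", bind)
            }
        }
    }
}

impl Error for MetricsError {
    /// Exposes the underlying error so source chains stay intact.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::BucketConfig { source, .. } | Self::InstallFailed { source, .. } => {
                Some(source)
            }
        }
    }
}

/// Crate-level alias with a defaulted error type.
pub type Result<T, E = MetricsError> = std::result::Result<T, E>;

/// Hypothetical check that fails the way an empty bucket slice would.
pub fn check_buckets(metric: &str, buckets: &[f64]) -> Result<()> {
    if buckets.is_empty() {
        return Err(MetricsError::BucketConfig {
            metric: metric.to_string(),
            source: BuildError("bucket list is empty".to_string()),
        });
    }
    Ok(())
}

/// Caller side: `?` converts into a boxed error without any matching.
pub fn caller() -> Result<(), Box<dyn Error>> {
    check_buckets("charon_pipeline_block_duration_seconds", &[])?;
    Ok(())
}
```

A caller that needs to distinguish the two failure modes can now match on the variant instead of inspecting the Display string.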
- MetricsConfig gains deny_unknown_fields + non_exhaustive so typos in [metrics] fail at load time instead of silently falling back to the default loopback bind.
- charon_metrics::init is now idempotent; a second call short-circuits before touching PrometheusBuilder so repeated invocations no longer panic inside set_global_recorder.
- New charon_metrics::install returns the ExporterFuture so the CLI can push the HTTP listener into the same JoinSet that supervises block listeners. A panic or clean exit in any supervised task now triggers controlled shutdown instead of leaving the bot running blind to Grafana.
Closes #221
Closes #222
Closes #223
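One common way to get an idempotent `init` is a process-wide flag that the first caller flips. The sketch below assumes that approach; the commit does not say which mechanism `charon_metrics::init` uses, and the boolean return value is a hypothetical simplification of its real signature.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set once the first caller has claimed installation.
static INSTALLED: AtomicBool = AtomicBool::new(false);

/// Returns true if this call performed the install, false if an earlier
/// call already did. The swap is atomic, so exactly one caller wins even
/// under concurrent calls.
pub fn init() -> bool {
    if INSTALLED.swap(true, Ordering::SeqCst) {
        // Short-circuit before touching any builder or global recorder.
        return false;
    }
    // The real crate would build and install the exporter here.
    true
}
```

Note that this simple version marks the install as done before it succeeds; a production variant would need to decide whether a failed install should clear the flag.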
Drive the exporter through a real TCP GET and assert the Prometheus text body carries # HELP, # TYPE, the expected metric names from charon_metrics::names, and a round-tripped label. The test lives in tests/ so it runs in a fresh process; the lib-side unit tests share the global recorder state with each other, which would otherwise mask a broken install path.
Closes #224
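The two halves of such a test, fetching the body over a raw socket and checking it for metadata lines, might look like the sketch below. `scrape` and `has_help_and_type` are hypothetical helpers, not the functions in the crate's `tests/` directory, and the request is a bare HTTP/1.1 GET with `Connection: close` so the read ends when the server hangs up.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};

/// Issues a plain GET /metrics and returns the raw response text
/// (status line, headers, and body).
pub fn scrape(addr: SocketAddr) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    write!(
        stream,
        "GET /metrics HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        addr
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// True when the text carries both a `# HELP` and a `# TYPE` line for
/// exactly this metric name.
pub fn has_help_and_type(body: &str, metric: &str) -> bool {
    let help = format!("# HELP {} ", metric);
    let kind = format!("# TYPE {} ", metric);
    body.lines().any(|line| line.starts_with(&help))
        && body.lines().any(|line| line.starts_with(&kind))
}
```

The trailing space in each prefix keeps `charon_scanner_blocks_total` from matching a longer name that merely starts with the same characters.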
Adds three families of operator-facing Prometheus series and the instrumentation call-sites that feed them:
Mempool (closes #300):
- charon_mempool_pending_oracle_updates (gauge, chain)
- charon_mempool_drained_total (counter, chain)
- charon_mempool_websocket_reconnects_total (counter, chain)
Emitted from PendingCache::insert/drain and MempoolMonitor::run's reconnect branch. PendingCache/MempoolMonitor now carry the chain short-name so every sample labels consistently with the scanner.
Gas (closes #301):
- charon_gas_base_fee_wei (gauge, chain)
- charon_gas_priority_fee_wei (gauge, chain)
- charon_gas_max_fee_wei (gauge, chain)
- charon_gas_ceiling_skips_total (counter, chain, reason)
Emitted from GasOracle::fetch_params. max_fee_wei is intentionally left untouched on the ceiling-skip branch so the gauge's semantic stays "what the last *submitted* tx carried".
RPC latency (closes #302):
- charon_rpc_call_duration_seconds (histogram, method, endpoint_kind)
- charon_rpc_errors_total (counter, method, error_kind)
- charon_rpc_connection_reconnects_total (counter, endpoint_kind)
Adds a time_rpc<F, T>(method, endpoint_kind, fut) -> T ergonomic helper that wraps any future and records its wall-clock duration into the histogram. Adopted at the eth_call simulation gate, the eth_sendRawTransaction submit path, and the pending-tx lookup in the mempool monitor. Block listener and mempool monitor also bump the reconnect counter on every reconnect attempt. Histogram buckets 1ms..30s are justified in rustdoc.
All new constants and helpers carry rustdoc; bucket-registration errors continue to flow through MetricsError::BucketConfig. Unit tests exercise every new helper label combination and assert time_rpc preserves the inner future's output on both branches; the scrape integration test now asserts each new metric name and a representative label pair round-trips through the exporter.
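The timing wrapper can be sketched as an async function that awaits the inner future and reports the elapsed time. This version differs from the commit's `time_rpc<F, T>(method, endpoint_kind, fut) -> T` in one labeled way: it takes a fourth `record` callback as a stand-in for the histogram call, so it needs no metrics crate. `poll_once` is a hypothetical helper that drives an immediately-ready future without an async runtime.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Awaits `fut`, reports its wall-clock duration, and returns its output
/// unchanged. `record` stands in for the histogram observation.
pub async fn time_rpc<F, T>(
    method: &str,
    endpoint_kind: &str,
    fut: F,
    record: impl FnOnce(&str, &str, Duration),
) -> T
where
    F: Future<Output = T>,
{
    let start = Instant::now();
    let out = fut.await;
    record(method, endpoint_kind, start.elapsed());
    out
}

/// Waker that does nothing; enough for futures that finish on first poll.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future once and returns its output if it completed.
pub fn poll_once<F: Future>(fut: F) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}
```

Because the duration is recorded regardless of what the inner future returns, both the `Ok` and `Err` branches of an RPC call contribute to the latency histogram, and the caller still receives the original result.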
…trics
# Conflicts:
# Cargo.lock
# Cargo.toml
# crates/charon-cli/src/main.rs
# crates/charon-core/src/config.rs
# crates/charon-core/src/lib.rs
# crates/charon-executor/src/gas.rs
# crates/charon-executor/src/simulation.rs
# crates/charon-executor/src/submit.rs
# crates/charon-scanner/src/listener.rs
# crates/charon-scanner/src/mempool.rs
Summary
- `charon-metrics`: Prometheus text-format HTTP endpoint (default `0.0.0.0:9091`) with constants-as-source-of-truth for metric names
- `charon-cli` instrumented at every natural choke-point (block drain, bucket classification, simulation gate, queue push, drop stages)
- `[metrics]` block in `config/default.toml` with `enabled` + `bind`, both defaulted so existing configs keep working
Metrics exposed
- `charon_scanner_blocks_total{chain}` counter
- `charon_scanner_positions{chain,bucket}` gauge (bucket = healthy | near_liq | liquidatable)
- `charon_pipeline_block_duration_seconds{chain}` histogram
- `charon_executor_simulations_total{chain,result}` counter (result = ok | revert | error)
- `charon_executor_opportunities_queued_total{chain}` counter
- `charon_executor_opportunities_dropped_total{chain,stage}` counter (stage = router | profit | simulation | build)
- `charon_executor_profit_usd_cents{chain}` histogram
- `charon_executor_queue_depth` gauge
- `charon_build_info{version,git_sha}` gauge
Why `:9091` and not `:9090`
Prometheus server defaults to `:9090`. Keeping the exporter off that port avoids a port collision when the full compose stack runs on a single host, which is the topology the Hetzner demo uses.
Test plan
- `cargo build --workspace` clean
- `cargo clippy --workspace --all-targets -- -D warnings` clean
- `cargo fmt --all --check` clean
- `cargo test -p charon-metrics`: exporter binds HTTP listener, typed helpers panic-free
- `charon listen` against BSC → `curl http://localhost:9091/metrics` returns `# HELP` + `# TYPE` for every metric above (runs in follow-up session with real RPC)
Stacked PR
Base is `feat/21-mempool-monitor` (PR #46). Merging this before its parent will tangle the diff; merge order is #46 → this PR.
Closes #20.