
feat(metrics): Prometheus exporter + pipeline instrumentation #50

Merged
obchain merged 12 commits into main from feat/22-prometheus-metrics
Apr 24, 2026
obchain (Owner) commented Apr 22, 2026

Summary

  • New crate charon-metrics: Prometheus text-format HTTP endpoint (default 0.0.0.0:9091) with constants-as-source-of-truth for metric names
  • Instrument the per-block pipeline in charon-cli at every natural choke-point (block drain, bucket classification, simulation gate, queue push, drop stages)
  • Config surface: new [metrics] block in config/default.toml with enabled + bind, both defaulted so existing configs keep working

Metrics exposed

  • charon_scanner_blocks_total{chain} counter
  • charon_scanner_positions{chain,bucket} gauge (bucket = healthy | near_liq | liquidatable)
  • charon_pipeline_block_duration_seconds{chain} histogram
  • charon_executor_simulations_total{chain,result} counter (result = ok | revert | error)
  • charon_executor_opportunities_queued_total{chain} counter
  • charon_executor_opportunities_dropped_total{chain,stage} counter (stage = router | profit | simulation | build)
  • charon_executor_profit_usd_cents{chain} histogram
  • charon_executor_queue_depth gauge
  • charon_build_info{version,git_sha} gauge
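A minimal sketch of the constants-as-source-of-truth idea, using only the standard library. The metric name strings come from the list above; the constant identifiers, the `Bucket` enum, and the `sample_line` helper are hypothetical and are not taken from the charon-metrics source.

```rust
/// Hypothetical `names` module: every metric name is declared exactly once,
/// so call sites, dashboards, and alert rules can all refer to one constant.
pub mod names {
    pub const SCANNER_BLOCKS_TOTAL: &str = "charon_scanner_blocks_total";
    pub const SCANNER_POSITIONS: &str = "charon_scanner_positions";
    pub const EXECUTOR_QUEUE_DEPTH: &str = "charon_executor_queue_depth";
}

/// Position buckets from the PR description (healthy | near_liq | liquidatable).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Bucket {
    Healthy,
    NearLiq,
    Liquidatable,
}

impl Bucket {
    /// Label value as it would appear in the exported series.
    pub fn as_label(self) -> &'static str {
        match self {
            Bucket::Healthy => "healthy",
            Bucket::NearLiq => "near_liq",
            Bucket::Liquidatable => "liquidatable",
        }
    }
}

/// Illustrative helper: render one sample line in Prometheus text format.
pub fn sample_line(name: &str, labels: &[(&str, &str)], value: f64) -> String {
    let rendered: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect();
    if rendered.is_empty() {
        format!("{name} {value}")
    } else {
        format!("{name}{{{}}} {value}", rendered.join(","))
    }
}
```

Typing the label values as an enum keeps a misspelled bucket from silently creating a new series, which is the same motivation as keeping the names in constants.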

Why :9091 and not :9090

The Prometheus server defaults to :9090. Keeping the exporter off that port avoids a port collision when the full compose stack runs on a single host, which is the topology the Hetzner demo uses.

Test plan

  • cargo build --workspace clean
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo fmt --all --check clean
  • cargo test -p charon-metrics — exporter binds HTTP listener, typed helpers panic-free
  • Full workspace test suite green (60 Rust tests + 1 doctest)
  • Manual smoke: charon listen against BSC → curl http://localhost:9091/metrics returns # HELP + # TYPE for every metric above (runs in follow-up session with real RPC)

Stacked PR

Base is feat/21-mempool-monitor (PR #46). Merging this before its parent will tangle the diff; merge order is #46 → this PR.

Closes #20.

obchain added 2 commits April 22, 2026 16:12
Block every .md from being tracked except README.md, and ignore
reference HTML under docs/ so local-only architecture diagrams
cannot leak into commits.
Add `charon-metrics` crate: a Prometheus text-format HTTP endpoint on
a configurable bind address (default 0.0.0.0:9091, off the Prometheus
server default of 9090 so a local compose stack doesn't collide).
Metric names live as `const &str` so dashboard JSON and alert rules
stay in sync with call sites through a single source of truth.

Instrument the per-block pipeline in `charon-cli`:
- blocks scanned (counter, per chain)
- position bucket counts (gauge, per chain × {healthy, near_liq, liquidatable})
- block duration (histogram, per chain, seconds)
- simulation outcomes (counter, per chain × {ok, revert, error})
- opportunities queued / dropped by stage (counters)
- per-opportunity profit (histogram, USD cents)
- queue depth (gauge)
- build info (labels: version, git_sha)

Config surface: new `[metrics]` block in TOML with `enabled` and
`bind`; both fall back to sane defaults so existing configs keep
working without edits.

Closes #20.
obchain added a commit that referenced this pull request Apr 22, 2026
…et supervision, metrics

- Reconnect backoff now adds 0-25% random jitter before each sleep to
  avoid correlated retry storms against a single RPC endpoint when
  many listeners disconnect at the same instant.
- BlockListener::publish uses try_send instead of a blocking await on
  the mpsc sender; a full channel drops the block with a warn log and
  a charon_listener_dropped_events_total counter increment, keeping the
  WS drain loop responsive so the transport never buffers past its
  server-side limit.
- Track last_seen block per chain. On reconnect, fetch the current head
  and backfill every block between last_seen + 1 and head - 1 via
  get_block_by_number, emitting ChainEvent::NewBlock { backfill: true }
  so downstream consumers see the same heartbeat during disconnect
  windows.
- CLI run_listen now spawns listeners into a tokio::task::JoinSet and a
  supervise() helper drains join results, logging per-chain task panics
  or errors and triggering shutdown when every listener exits. Ctrl-C
  also aborts outstanding listeners.
- Per-block log downgraded from info to debug (BSC ~3 s = 28,800
  info lines/day otherwise). Add charon_listener_connects_total,
  charon_listener_disconnects_total, charon_blocks_received_total,
  charon_listener_dropped_events_total counters for PR #50.
- Workspace adds rand and metrics deps; ChainEvent is #[non_exhaustive]
  and carries a backfill flag.

Closes #92 #93 #94 #95 #96
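The drop-on-full behaviour described for `BlockListener::publish` can be sketched with the standard library's bounded channel. This is an assumption-laden stand-in: the commit uses a tokio mpsc sender, whereas the sketch uses `std::sync::mpsc::sync_channel`, and the `Publish` enum and the `dropped` counter argument are hypothetical substitutes for the warn log and the `charon_listener_dropped_events_total` counter.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Outcome of a non-blocking publish attempt (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Publish {
    Sent,
    Dropped,
    Closed,
}

/// Hand a block number to the consumer without ever blocking the caller.
/// A full channel drops the block and bumps `dropped`, which stands in for
/// the dropped-events counter named in the commit message.
pub fn publish(tx: &SyncSender<u64>, block: u64, dropped: &mut u64) -> Publish {
    match tx.try_send(block) {
        Ok(()) => Publish::Sent,
        Err(TrySendError::Full(_)) => {
            *dropped += 1;
            Publish::Dropped
        }
        Err(TrySendError::Disconnected(_)) => Publish::Closed,
    }
}
```

With a channel of capacity 1, a second publish before the consumer drains returns `Publish::Dropped` immediately, so the drain loop never waits on a slow consumer.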
obchain added 9 commits April 23, 2026 15:45
Register explicit bucket boundaries for
charon_pipeline_block_duration_seconds and
charon_executor_profit_usd_cents via
PrometheusBuilder::set_buckets_for_metric. Without matchers, the
exporter renders both histograms as Prometheus summaries, producing
NaN from histogram_quantile and empty heatmaps in the companion
Grafana dashboard.

Block-duration buckets target BSC's 3s block cadence
(healthy / warning / alert / overrun). Profit buckets cover the
$0.05 dust to $10k+ windfall range observed on Venus liquidations.

Closes #275
Closes #218
Closes #217
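To illustrate why `histogram_quantile` needs explicit boundaries, the sketch below counts samples into cumulative `le` buckets the way a Prometheus histogram does. The boundary values in `BLOCK_DURATION_BOUNDS` are purely illustrative; the commit message does not list the actual boundaries registered in the PR, and `cumulative_counts` is a hypothetical helper.

```rust
/// Illustrative boundaries only (seconds); NOT the values registered in the PR.
pub const BLOCK_DURATION_BOUNDS: [f64; 4] = [1.0, 3.0, 6.0, 12.0];

/// Count samples per cumulative bucket, Prometheus-style: entry `i` holds the
/// number of samples `<= bounds[i]`, and the final entry is the `+Inf` bucket.
/// `bounds` is expected to be sorted in ascending order.
pub fn cumulative_counts(bounds: &[f64], samples: &[f64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for &sample in samples {
        for (i, &upper) in bounds.iter().enumerate() {
            if sample <= upper {
                counts[i] += 1;
            }
        }
        // Every sample lands in the +Inf bucket.
        counts[bounds.len()] += 1;
    }
    counts
}
```

A summary exposes precomputed quantiles instead of these per-boundary counts, which is why `histogram_quantile` has nothing to interpolate over when no buckets are registered.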
Declare metrics 0.24 and metrics-exporter-prometheus 0.16 in the root
[workspace.dependencies] table so every crate that pulls them in
reuses the same version. charon-metrics switches its direct version
strings to { workspace = true }; the http-listener feature stays
pinned at the workspace level.

Closes #219
process_opportunity pushes entries to the queue in two paths:
simulation-gated (BOT_SIGNER_KEY set) and dry-run (signer absent).
Both were incrementing charon_executor_opportunities_queued_total
with the same label set, making the gate bypass invisible in
dashboards.

Thread a `simulated: bool` through `record_opportunity_queued` so
the counter splits by whether the eth_call gate actually ran. Help
text on the exporter updated to document the label semantics.

This is the observability half of #220; the hard-refusal half
(#170) lands separately on the executor branch.

Closes #220
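A sketch of how threading `simulated` through `record_opportunity_queued` splits the series. The function name comes from the commit message; its signature, the `QueuedCounter` type, and the in-memory map standing in for the real metrics recorder are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the queued-opportunities counter,
/// keyed by the (chain, simulated) label pair.
#[derive(Default)]
pub struct QueuedCounter {
    counts: HashMap<(String, bool), u64>,
}

impl QueuedCounter {
    /// Increment the series for this chain, split by whether the
    /// simulation gate actually ran before the push.
    pub fn record_opportunity_queued(&mut self, chain: &str, simulated: bool) {
        *self
            .counts
            .entry((chain.to_owned(), simulated))
            .or_insert(0) += 1;
    }

    /// Current value of one labelled series (0 if never incremented).
    pub fn get(&self, chain: &str, simulated: bool) -> u64 {
        self.counts
            .get(&(chain.to_owned(), simulated))
            .copied()
            .unwrap_or(0)
    }
}
```

With the extra label, a dashboard can show dry-run pushes and simulation-gated pushes as separate series instead of one combined total.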
Default bind moves from 0.0.0.0:9091 to 127.0.0.1:9091 so a bare VPS
deploy does not expose /metrics to the public internet out of the
box. MetricsConfig gains an Option<String> auth_token paired with a
validate() gate: when metrics are enabled and bind is non-loopback,
startup is refused unless auth_token is non-empty. Config::load calls
the gate after TOML parse so the check fails fast at startup.

The exporter itself does not yet terminate the Bearer check — the
token is a shared secret enforced by a reverse proxy (nginx, caddy,
Traefik) in front of the listener, so bot + proxy read one source.
Module rustdoc and the default.toml block describe the loopback-or-
proxy contract; four unit tests lock in the three validate paths
(loopback ok, non-loopback + token ok, non-loopback + empty/None
token reject) plus the disabled-shortcut.

Closes #213
Closes #214
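The loopback-or-token gate can be sketched with `std::net::SocketAddr`. The field names `enabled`, `bind`, and `auth_token` and the method name `validate` follow the commit message; the `bind: String` field type, the `ValidateError` enum, and the bad-address branch are assumptions made for the sketch.

```rust
use std::net::SocketAddr;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ValidateError {
    BadBind(String),
    MissingToken,
}

/// Simplified stand-in for the real config struct.
pub struct MetricsConfig {
    pub enabled: bool,
    pub bind: String,
    pub auth_token: Option<String>,
}

impl MetricsConfig {
    /// Refuse a non-loopback bind unless a non-empty token is configured.
    pub fn validate(&self) -> Result<(), ValidateError> {
        // Disabled shortcut: nothing is exposed, so nothing to check.
        if !self.enabled {
            return Ok(());
        }
        let addr: SocketAddr = self
            .bind
            .parse()
            .map_err(|_| ValidateError::BadBind(self.bind.clone()))?;
        if addr.ip().is_loopback() {
            return Ok(());
        }
        match self.auth_token.as_deref() {
            Some(token) if !token.is_empty() => Ok(()),
            _ => Err(ValidateError::MissingToken),
        }
    }
}
```

The four cases the unit tests are described as covering map directly onto the early returns and the final `match`.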
Mirror the lint floor established on feat/20-multi-liq-batcher
(issue #211) onto feat/22 so the two lineages converge on an
identical policy at merge: forbid unsafe_code, deny
arithmetic_side_effects, cast_possible_truncation, unwrap_used.
Root Cargo.toml grows [workspace.lints.rust] + [workspace.lints.clippy]
and a workspace-level `thiserror = "1"` dependency. charon-metrics
opts in via `[lints] workspace = true` and pulls `thiserror` through
the workspace declaration.

Repair one violation surfaced by the policy:
`probe.local_addr().unwrap()` in the exporter smoke test becomes
`.expect("probe socket must expose its bound local_addr")`.

Closes #216
charon-metrics is a library crate. Returning anyhow::Result from
init() forced charon-cli to match on Display strings to tell a
port-collision (retryable) apart from a recorder double-install
(fatal). Every sibling library (charon-core, charon-executor on
feat/20) already uses thiserror — this is the last outlier.

Introduce `MetricsError` (#[non_exhaustive]) with two variants:
`BucketConfig { metric, source }` for `set_buckets_for_metric`
failures — carries the offending metric name so logs pinpoint the
offender — and `InstallFailed { bind, source }` for
`PrometheusBuilder::install` failures. Both variants hold the
underlying `metrics_exporter_prometheus::BuildError` via `#[source]`
so `Error::source()` chains preserve the original diagnosis.

`init` keeps its public signature shape via a crate-level
`Result<T, E = MetricsError>` alias; charon-cli's `?` continues to
work unchanged because anyhow provides a blanket `From<E>` impl for
any `E: std::error::Error + Send + Sync + 'static`. One unit
test drives the `BucketConfig` path via an empty bucket slice and
asserts Display + source chain.

Closes #215
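A hand-written, std-only approximation of the error type described above, roughly what the thiserror derive would expand to. The variant and field names follow the commit message; `BuildError` here is a local stand-in for `metrics_exporter_prometheus::BuildError`, and the Display wording and the `check_buckets` helper are hypothetical.

```rust
use std::error::Error;
use std::fmt;

/// Local stand-in for the exporter crate's build error.
#[derive(Debug)]
pub struct BuildError(pub String);

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for BuildError {}

#[derive(Debug)]
#[non_exhaustive]
pub enum MetricsError {
    BucketConfig { metric: String, source: BuildError },
    InstallFailed { bind: String, source: BuildError },
}

impl fmt::Display for MetricsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::BucketConfig { metric, .. } => {
                write!(f, "invalid buckets for metric `{metric}`")
            }
            Self::InstallFailed { bind, .. } => {
                write!(f, "failed to install exporter on {bind}")
            }
        }
    }
}

impl Error for MetricsError {
    /// Expose the underlying error so `source()` chains stay intact.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::BucketConfig { source, .. } | Self::InstallFailed { source, .. } => {
                Some(source)
            }
        }
    }
}

/// Crate-level alias with a defaulted error parameter.
pub type Result<T, E = MetricsError> = std::result::Result<T, E>;

/// Hypothetical helper: reject an empty bucket slice for `metric`.
pub fn check_buckets(metric: &str, buckets: &[f64]) -> Result<()> {
    if buckets.is_empty() {
        return Err(MetricsError::BucketConfig {
            metric: metric.to_owned(),
            source: BuildError("bucket list must not be empty".to_owned()),
        });
    }
    Ok(())
}
```

Because the variants are typed, a caller can `match` on `MetricsError::InstallFailed` rather than inspecting Display strings.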
- MetricsConfig gains deny_unknown_fields + non_exhaustive so typos in
  [metrics] fail at load time instead of silently falling back to the
  default loopback bind.
- charon_metrics::init is now idempotent; a second call short-circuits
  before touching PrometheusBuilder so repeated invocations no longer
  panic inside set_global_recorder.
- New charon_metrics::install returns the ExporterFuture so the CLI
  can push the HTTP listener into the same JoinSet that supervises
  block listeners. A panic or clean exit in any supervised task now
  triggers controlled shutdown instead of leaving the bot running
  blind to Grafana.

Closes #221
Closes #222
Closes #223
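One way to get the idempotent behaviour described for `charon_metrics::init` is a process-wide `OnceLock`. This is only a sketch of the pattern under that assumption; the commit message does not say how the real short-circuit is implemented, and the stored `String` stands in for the installed recorder.

```rust
use std::sync::OnceLock;

/// Stand-in for the globally installed recorder: here just the bind address.
static RECORDER: OnceLock<String> = OnceLock::new();

/// Install on first call; later calls short-circuit without side effects.
/// Returns `true` only for the call that actually performed the install.
pub fn init(bind: &str) -> bool {
    let mut installed_now = false;
    RECORDER.get_or_init(|| {
        installed_now = true;
        bind.to_owned()
    });
    installed_now
}

/// Address recorded by the first successful `init`, if any.
pub fn installed_bind() -> Option<&'static str> {
    RECORDER.get().map(String::as_str)
}
```

A second `init` call with a different address leaves the first install in place rather than panicking, which matches the short-circuit behaviour the commit describes.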
Drive the exporter through a real TCP GET and assert the Prometheus
text body carries # HELP, # TYPE, the expected metric names from
charon_metrics::names, and a round-tripped label. The test lives in
tests/ so it runs in a fresh process — the lib-side unit tests share
the global recorder state with each other, which would otherwise
mask a broken install path.

Closes #224
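The assertion half of such a scrape test can be sketched as a pure function over the response body. `missing_metadata` is a hypothetical helper, not part of the PR; it only checks that each expected name has both a `# HELP` and a `# TYPE` line and leaves the TCP request itself out.

```rust
/// Return every expected metric name that lacks a `# HELP` or `# TYPE`
/// line in a Prometheus text-format body.
pub fn missing_metadata<'a>(body: &str, names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| {
            // The trailing space avoids matching a longer name with the same prefix.
            let help = format!("# HELP {name} ");
            let ty = format!("# TYPE {name} ");
            let has_help = body.lines().any(|line| line.starts_with(&help));
            let has_type = body.lines().any(|line| line.starts_with(&ty));
            !(has_help && has_type)
        })
        .collect()
}
```

An integration test could fetch `/metrics` over TCP, pass the body and the list of name constants to this helper, and assert that the returned vector is empty.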
Adds three families of operator-facing Prometheus series and the
instrumentation call-sites that feed them:

Mempool (closes #300):
- charon_mempool_pending_oracle_updates (gauge, chain)
- charon_mempool_drained_total (counter, chain)
- charon_mempool_websocket_reconnects_total (counter, chain)
Emitted from PendingCache::insert/drain and MempoolMonitor::run's
reconnect branch. PendingCache/MempoolMonitor now carry the chain
short-name so every sample labels consistently with the scanner.

Gas (closes #301):
- charon_gas_base_fee_wei (gauge, chain)
- charon_gas_priority_fee_wei (gauge, chain)
- charon_gas_max_fee_wei (gauge, chain)
- charon_gas_ceiling_skips_total (counter, chain, reason)
Emitted from GasOracle::fetch_params. max_fee_wei is intentionally
left untouched on the ceiling-skip branch so the gauge keeps meaning
"what the last *submitted* tx carried".

RPC latency (closes #302):
- charon_rpc_call_duration_seconds (histogram, method, endpoint_kind)
- charon_rpc_errors_total (counter, method, error_kind)
- charon_rpc_connection_reconnects_total (counter, endpoint_kind)
Adds a time_rpc<F, T>(method, endpoint_kind, fut) -> T ergonomic
helper that wraps any future and records its wall-clock duration
into the histogram. Adopted at the eth_call simulation gate, the
eth_sendRawTransaction submit path, and the pending-tx lookup in
the mempool monitor. Block listener and mempool monitor also bump
the reconnect counter on every reconnect attempt. Histogram
buckets 1ms..30s are justified in rustdoc.

All new constants and helpers carry rustdoc; bucket-registration
errors continue to flow through MetricsError::BucketConfig. Unit
tests exercise every new helper label combination and assert
time_rpc preserves the inner future's output on both branches;
the scrape integration test now asserts each new metric name and
a representative label pair round-trips through the exporter.
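A std-only sketch of a `time_rpc`-style wrapper. The name and the `(method, endpoint_kind, fut) -> T` shape follow the commit message; the extra `record` callback is a hypothetical stand-in for the histogram write, and `block_on` is a tiny busy-polling executor included only so the sketch can be exercised without an async runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Await `fut`, pass its wall-clock duration to `record`, and return the
/// future's output unchanged (success and error values alike).
pub async fn time_rpc<F, T>(
    method: &'static str,
    endpoint_kind: &'static str,
    fut: F,
    record: impl FnOnce(&'static str, &'static str, Duration),
) -> T
where
    F: Future<Output = T>,
{
    let started = Instant::now();
    let out = fut.await;
    record(method, endpoint_kind, started.elapsed());
    out
}

/// Waker that does nothing; enough for a polling loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for demonstration: polls until the future is ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

Because the wrapper is generic over `T`, a `Result` passes through untouched, so the caller's `?` handling stays the same whether or not the call is timed.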
@obchain obchain changed the base branch from feat/21-mempool-monitor to main April 24, 2026 13:14
…trics

# Conflicts:
#	Cargo.lock
#	Cargo.toml
#	crates/charon-cli/src/main.rs
#	crates/charon-core/src/config.rs
#	crates/charon-core/src/lib.rs
#	crates/charon-executor/src/gas.rs
#	crates/charon-executor/src/simulation.rs
#	crates/charon-executor/src/submit.rs
#	crates/charon-scanner/src/listener.rs
#	crates/charon-scanner/src/mempool.rs
@obchain obchain merged commit edcb8fe into main Apr 24, 2026
Development

Successfully merging this pull request may close these issues.

[telemetry] Prometheus metrics endpoint (liquidations, profit, scan latency, chain health)