feat(grafana): importable dashboard JSON for Charon metrics#54

Merged
obchain merged 12 commits into main from feat/26-grafana-dashboard
Apr 24, 2026
@obchain obchain commented Apr 22, 2026

Summary

  • deploy/grafana/charon.json: schema-v39 dashboard importable into Grafana 10.4+ or the Grafana Cloud free tier
  • 9 panels covering every metric the charon-metrics exporter actually emits — scanner, pipeline latency, position buckets, queue depth, cumulative profit, simulations, opportunity funnel, per-tx profit heatmap, build info
  • Templating: $datasource (Prometheus picker), $chain and $instance (auto-populated from label_values)
  • README gets a 3-step import section

Panels intentionally omitted (for now)

The issue body lists Mempool, Gas, and RPC-latency panels. The exporter doesn't emit the backing series yet, so the panels are NOT included — a blank panel is worse than a missing one. Follow-up PRs will ship the metrics and the panels together:

  • Mempool (txs/min, impacted positions) — requires wiring into charon-scanner::mempool
  • Gas (base fee, priority fee, tx cost cents) — requires wiring into charon-executor::gas
  • RPC health (latency, error rate) — requires instrumentation in the provider layer

Dashboard UID

The UID is charon-v0, with tags charon, liquidation, and defi. Because the UID is stable, re-importing replaces the existing copy rather than creating a duplicate.

Test plan

  • python3 -m json.tool deploy/grafana/charon.json parses cleanly
  • Schema version 39 (Grafana 10.4+)
  • Cargo workspace sweep green (no code changes that would break anything)
  • Manual smoke: import into Grafana Cloud with a Prometheus data source pointing at a running charon listen (next session — requires a Grafana Cloud account)
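
The parse and schema-version items above could be scripted roughly as follows. This is a minimal sketch, not the project's CI: `check_dashboard` is a hypothetical helper, the expected UID and schema version are taken from this description, and the inline sample stands in for the real file.

```python
import json

EXPECTED_UID = "charon-v0"   # from the "Dashboard UID" section
EXPECTED_SCHEMA = 39         # from the test plan


def check_dashboard(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if doc.get("schemaVersion") != EXPECTED_SCHEMA:
        problems.append(
            f"schemaVersion is {doc.get('schemaVersion')!r}, expected {EXPECTED_SCHEMA}"
        )
    if doc.get("uid") != EXPECTED_UID:
        problems.append(f"uid is {doc.get('uid')!r}, expected {EXPECTED_UID!r}")
    if not doc.get("panels"):
        problems.append("dashboard has no panels")
    return problems


# Stand-in document. Against the repo this would be:
#   doc = json.load(open("deploy/grafana/charon.json"))
sample = json.loads(
    '{"uid": "charon-v0", "schemaVersion": 39, "panels": [{"title": "Build info"}]}'
)
print(check_dashboard(sample) or "ok")
```

A check like this only covers top-level fields; it says nothing about whether the PromQL in each panel is valid.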

Stacked PR

Base is feat/25-foundry-fork-tests (PR #53). Merge order: #46 → #50 → #51 → #52 → #53 → this.

Closes #49.

New `deploy/grafana/charon.json` — one-click-importable dashboard
covering every metric the `charon-metrics` exporter actually emits.

Panels:
- Scanner: blocks / sec per chain
- Pipeline: block-latency p50 and p95 (from the histogram)
- Scanner: live position counts stacked by bucket
  (green/yellow/red for healthy/near_liq/liquidatable)
- Executor: queue depth stat + cumulative profit stat (USD)
- Executor: simulations / min stacked by result (ok/revert/error)
- Executor: opportunities queued vs dropped (per drop stage)
- Executor: per-opportunity profit distribution as a heatmap
- Build info: version + git_sha in a table

Templating:
- `$datasource` — Prometheus data source picker
- `$chain` — auto-populated from `charon_scanner_blocks_total` labels
- `$instance` — auto-populated from `charon_build_info` labels

Aspirational panels from #49 body that are NOT included yet because
the exporter doesn't emit the underlying series:
- Mempool txs / min, impacted positions flagged
- Gas (base fee, priority fee, tx cost in cents)
- RPC latency p50/p95, error rate per endpoint
These will be added as follow-up PRs alongside the metrics that feed
them; shipping the panels blank would only clutter the dashboard.

Dashboard UID `charon-v0` and stable tags mean re-importing replaces
rather than duplicates the dashboard in Grafana.

README gains a three-step import section pointing at the new file.

Closes #49.
This was referenced Apr 22, 2026
obchain added 6 commits April 23, 2026 15:58
The p50/p95 block-duration quantiles already use a [5m] range,
giving ~100 observations at BSC's 3s cadence — enough to keep the
estimate stable between scrapes. Extend the panel description so
the rationale is visible to operators reading the dashboard.

Closes #279
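
The "~100 observations" figure follows from dividing the range window by the block time. A small sketch of that arithmetic, assuming one histogram observation per block (`observations_in_window` is a hypothetical helper, not part of the repo):

```python
def observations_in_window(window_seconds: float, block_time_seconds: float) -> float:
    """Approximate number of block-duration observations inside a range window,
    assuming exactly one observation is recorded per block."""
    return window_seconds / block_time_seconds


# A [5m] range at a 3-second block cadence.
bsc = observations_in_window(5 * 60, 3)
print(bsc)  # 100.0
```

A chain with a slower cadence would put proportionally fewer observations in the same window, so the quantile estimate there would be noisier.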
The cumulative-profit panel previously queried the histogram _sum
accumulator directly and divided by 100. _sum resets on process
restart, so a mid-window redeploy rendered as a sharp step-down
indistinguishable from a real loss.

Switch to increase(..._sum[$__range]) / 100 so the counter-reset
semantics of increase() absorb restarts and the window tracks the
dashboard time picker. Title and description updated to match.

Closes #276
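
The difference between reading `_sum` directly and wrapping it in `increase()` can be illustrated with a simplified model. This sketch only reproduces the counter-reset handling; the real PromQL `increase()` also extrapolates to the window edges, which is omitted here. The sample values are made up, and they are assumed to be in cents, as the division by 100 suggests.

```python
def naive_delta(samples: list[float]) -> float:
    """Last value minus first value, with no awareness of restarts."""
    return samples[-1] - samples[0]


def reset_aware_increase(samples: list[float]) -> float:
    """Sum the growth between consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from
        # zero, so the whole current value counts as new growth.
        total += cur if cur < prev else cur - prev
    return total


# Hypothetical _sum samples in cents; the bot restarts between 1500 and 200.
profit_sum_cents = [1000, 1500, 200, 600]

print(naive_delta(profit_sum_cents) / 100)           # -4.0
print(reset_aware_increase(profit_sum_cents) / 100)  # 11.0
```

The naive reading shows a 4 USD "loss" that never happened, while the reset-aware total reports the 11 USD actually accumulated across the restart.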
The opportunity funnel panel aggregated queued as a single total
while dropped was broken out by stage. The two series rendered on
incomparable axes: one line vs four, no way to read stage-level
loss rate against intake.

Wrap the queued query in label_replace(..., "stage", "queued")
so it joins the dropped series under one stage-partitioned legend.
Colour overrides stay inert — Grafana picks from the classic
palette for the five stage labels.

Closes #280
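
The commit message abbreviates the call; PromQL's `label_replace` takes five arguments (vector, destination label, replacement, source label, regex). The effect on a single series can be modelled as below. This is a simplified sketch that skips capture-group expansion, and the `chain="bsc"` label set is illustrative only.

```python
import re


def label_replace(labels: dict, dst: str, replacement: str, src: str, regex: str) -> dict:
    """Simplified model of PromQL label_replace for one series' label set.

    If the regex fully matches the source label's value (a missing label is
    treated as the empty string), the destination label is set to the
    replacement. Capture-group expansion ($1, $2, ...) is not modelled.
    """
    out = dict(labels)
    if re.fullmatch(regex, labels.get(src, "")) is not None:
        out[dst] = replacement
    return out


# The queued series has no "stage" label. An empty source label and an empty
# regex always match, so the label is attached unconditionally.
queued = label_replace({"chain": "bsc"}, "stage", "queued", "", "")
print(queued)  # {'chain': 'bsc', 'stage': 'queued'}
```

With the label attached, the queued series carries the same `stage` dimension as the dropped series, so both can share one legend.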
Template variables $chain and $instance resolve via label_values,
which returns empty until the bot is scraping. Without an explicit
default, fresh imports rendered every panel as No Data.

Set allValue='.*' on both, mark the current selection as All, and
add a short description on each variable plus the dashboard so
operators see that the All-default exists by design. Panels now
resolve on first import and auto-refine once labels populate.

Closes #282
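
A template variable with the All default described above might look like the following. This is a sketch of the general Grafana variable shape, not a copy of what `charon.json` contains: `label_variable` is a hypothetical helper and the exact fields in the committed file may differ.

```python
import json


def label_variable(name: str, metric: str) -> dict:
    """Build a query-type template variable whose 'All' option expands to '.*'."""
    return {
        "name": name,
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "${datasource}"},
        "query": f"label_values({metric}, {name})",
        "includeAll": True,
        # With no labels scraped yet, 'All' still resolves to a regex that
        # matches everything, so panels are not filtered down to nothing.
        "allValue": ".*",
        "current": {"selected": True, "text": "All", "value": "$__all"},
    }


chain = label_variable("chain", "charon_scanner_blocks_total")
print(json.dumps(chain, indent=2))
```

A panel query that filters with `chain=~"$chain"` then expands to `chain=~".*"` on a fresh import and narrows once real label values are available to select.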
Dashboard JSON is schema v39, which requires Grafana 10.4.x or newer
(or any Grafana Cloud org). Earlier 9.x installs reject the import
or silently drop panels. Call out the version requirement in the
Grafana section so self-hosted operators running a stale Grafana do
not hit a cryptic import error.

Closes #278
The exporter currently binds 0.0.0.0:9091 without auth (tracked in
#213 and #214). Directing operators to point a remote Prometheus at
that URL bakes the exposure into the quickstart, leaking profit
histograms, build SHA, queue depth, and sim results to anyone with
network access — on a Hetzner VPS that is the public internet.

Add a callout above the import steps: bind to 127.0.0.1 and tunnel,
or put an authenticated reverse proxy in front of :9091, before
configuring an external scrape. Step 1 echoes the same guidance so
someone skim-reading the numbered list does not miss it.

Closes #277
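
One way to sanity-check a configured bind address before pointing an external scrape at it is to test whether it is loopback-only. The helper below is a hypothetical illustration, not part of charon, and it handles only IPv4 `host:port` strings.

```python
import ipaddress


def is_loopback_bind(bind: str) -> bool:
    """Return True if an IPv4 'host:port' bind string listens on loopback only."""
    host, _, _port = bind.rpartition(":")
    return ipaddress.ip_address(host).is_loopback


print(is_loopback_bind("0.0.0.0:9091"))    # False
print(is_loopback_bind("127.0.0.1:9091"))  # True
```

A `False` result means the port is reachable on every interface, which is the case the callout warns about.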
obchain added 3 commits April 23, 2026 19:00
exclude git_sha from the build-info table transform and document the
deferred mempool/gas/rpc-latency panels in the dashboard description
(tracked in #300, #301, #302).

five prometheus rules covering bot-down, 1h-zero-liquidations, queue
depth spike, simulation failure rate, and opportunity drop rate. load
via prometheus rule_files or grafana unified alerting.

adds grafana-lint workflow: json.tool parse, dashboard-linter schema
and promql check, promtool check rules for alerts.yaml. replaces the
pr-test-plan claim that json.tool alone validated the dashboard.
@obchain obchain changed the base branch from feat/25-foundry-fork-tests to main April 24, 2026 14:46
obchain added 2 commits April 24, 2026 22:50
- Replace non-existent charon_scanner_positions_by_bucket reference in
  CharonNoLiquidationAttemptsOneHour annotation with canonical
  charon_scanner_positions{bucket="near_liq"|"liquidatable"} series.
- Fix CharonSimulationFailureRateHigh numerator: the sim_result label
  values declared in charon-metrics are "ok" / "revert" / "error" —
  there is no "failure". Use result=~"revert|error" so the ratio is
  non-zero when simulations actually fail.

Names and labels now match crates/charon-metrics/src/lib.rs::names
and ::sim_result on main. No Rust changes.
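
Why the old numerator was always zero can be shown by matching the selectors against the declared label values. This sketch uses Python's `re.fullmatch` to mimic PromQL's fully anchored regex matching; the counts are made up for illustration.

```python
import re

# Label values the commit message says charon-metrics declares for sim results.
SIM_RESULTS = ["ok", "revert", "error"]


def matching(pattern: str) -> list[str]:
    """Declared result values selected by a fully anchored regex matcher."""
    return [v for v in SIM_RESULTS if re.fullmatch(pattern, v)]


def failure_ratio(counts: dict[str, int]) -> float:
    """Share of simulations whose result is revert or error."""
    failed = sum(n for result, n in counts.items() if re.fullmatch("revert|error", result))
    total = sum(counts.values())
    return failed / total if total else 0.0


print(matching("failure"))       # []
print(matching("revert|error"))  # ['revert', 'error']
print(failure_ratio({"ok": 90, "revert": 8, "error": 2}))  # 0.1
```

Because `failure` selects no series, the original ratio could never be non-zero, so the alert could not fire no matter how many simulations failed.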
@obchain obchain merged commit ae4dac5 into main Apr 24, 2026
0 of 2 checks passed
Development

Successfully merging this pull request may close these issues.

[tooling] Anvil BSC mainnet fork script for local end-to-end demo