[grafana] block_duration_seconds quantile window [1m] too narrow — unstable p50/p95 on BSC 3s blocks #279

@obchain

Description

Refs #54

Location

deploy/grafana/charon.json — Panel 2 (pipeline block latency), PromQL expressions

Queries

histogram_quantile(0.50, sum by (le, chain) (rate(charon_pipeline_block_duration_seconds_bucket{...}[1m])))
histogram_quantile(0.95, sum by (le, chain) (rate(charon_pipeline_block_duration_seconds_bucket{...}[1m])))

Problem

BSC produces a block roughly every 3 seconds, so a 1-minute window covers only about 20 blocks. histogram_quantile() estimates a quantile by interpolating within histogram buckets, so it needs enough observations per window to give a stable result. With only ~20 observations, and if the histogram uses default buckets that are coarse in the 0-3s range relevant to BSC, the p50 and p95 estimates will be noisy and can jump noticeably from one scrape to the next.

The examples in the Prometheus documentation use range vectors of 5 minutes or longer with histogram_quantile(). A window of at least 5 minutes is a reasonable floor for low-frequency or narrow-value series such as this one.
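To make the instability argument concrete, here is a minimal Python sketch that mimics bucket-based quantile estimation and compares how much the p95 estimate varies with 20 observations (a 1m window at one observation per 3s block) versus 100 observations (a 5m window). The bucket bounds and the log-normal latency distribution are assumptions chosen for illustration, not values taken from charon, and the interpolation only approximates what histogram_quantile() does.

```python
import bisect
import random
import statistics

# Assumed bucket upper bounds in seconds (hypothetical, not charon's config).
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def histogram_quantile(q, bounds, observations):
    """Estimate quantile q from bucketed observations.

    Observations are counted into `le` buckets, then the quantile is
    linearly interpolated inside the bucket that contains the target rank.
    """
    counts = [0] * (len(bounds) + 1)  # last slot plays the role of +Inf
    for value in observations:
        counts[bisect.bisect_left(bounds, value)] += 1

    total = sum(counts)
    if total == 0:
        return float("nan")

    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        previous = cumulative
        cumulative += count
        if cumulative >= rank and count > 0:
            if i == len(bounds):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return bounds[-1]
            lower = bounds[i - 1] if i > 0 else 0.0
            return lower + (bounds[i] - lower) * (rank - previous) / count
    return bounds[-1]


def spread_of_estimates(n_obs, q=0.95, trials=500, seed=0):
    """Standard deviation of the quantile estimate across simulated windows."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Hypothetical latency distribution with a median of about 0.2 s.
        window = [rng.lognormvariate(-1.6, 0.8) for _ in range(n_obs)]
        estimates.append(histogram_quantile(q, BOUNDS, window))
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for label, seconds in (("1m", 60), ("5m", 300)):
        n_obs = seconds // 3  # one observation per 3-second block
        print(label, n_obs, "observations, p95 spread:", spread_of_estimates(n_obs))
```

Under these assumptions the spread for the 20-observation window comes out larger than for the 100-observation window, which is the effect the panel would show as spikes.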

Impact

Panel 2 will show erratic, spiky latency graphs that do not reflect actual pipeline performance. Operators may misinterpret normal BSC block timing variance as pipeline issues.

Suggested Fix

Change [1m] to [5m] in both histogram_quantile expressions in Panel 2. Consider adding a panel description explaining the BSC 3s block time context.
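One way to apply the change without hand-editing the dashboard is sketched below. It assumes the usual Grafana dashboard JSON layout with a top-level "panels" list whose entries carry "targets" with an "expr" string, and it assumes that "Panel 2" corresponds to a panel with "id": 2. Neither assumption has been checked against deploy/grafana/charon.json. The function name and parameters are hypothetical.

```python
import copy
import json


def widen_quantile_windows(dashboard, old="[1m]", new="[5m]", panel_id=None):
    """Return a copy of a dashboard dict with wider histogram_quantile windows.

    Only targets whose expression contains histogram_quantile( are touched.
    If panel_id is given, only the panel with that "id" is changed.
    Nested row panels are not handled in this sketch.
    """
    updated = copy.deepcopy(dashboard)
    for panel in updated.get("panels", []):
        if panel_id is not None and panel.get("id") != panel_id:
            continue
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if "histogram_quantile(" in expr and old in expr:
                target["expr"] = expr.replace(old, new)
    return updated


if __name__ == "__main__":
    # Stand-in dashboard; the {...} selector is kept elided as in the issue.
    sample = {
        "panels": [
            {
                "id": 2,
                "targets": [
                    {"expr": "histogram_quantile(0.95, sum by (le, chain) "
                             "(rate(charon_pipeline_block_duration_seconds_bucket{...}[1m])))"},
                ],
            }
        ]
    }
    print(json.dumps(widen_quantile_windows(sample, panel_id=2), indent=2))

    # Against the real file (path from the issue), the flow would be roughly:
    #   with open("deploy/grafana/charon.json") as f:
    #       dashboard = json.load(f)
    #   dashboard = widen_quantile_windows(dashboard, panel_id=2)
    #   with open("deploy/grafana/charon.json", "w") as f:
    #       json.dump(dashboard, f, indent=2)
```

A wider window smooths the graph but also makes it react more slowly to real latency changes, so the panel description is a good place to note the 5m averaging.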

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working)
    layer:devops (CI / deploy / infra / telemetry)
    priority:p1-core (Core MVP scope)
    status:ready (Scoped and ready to pick up)
    type:feature (New capability or deliverable)

    Projects

    No projects

    Milestone

    No milestone
