
tsdb(wal): st-per-sample for histograms initial code and benchmarks #18221

Merged: ywwg merged 1 commit into main from owilliams/nh-st on May 6, 2026

Member

@ywwg ywwg commented Mar 3, 2026

Implements ST for Histograms and Float Histograms (and their custom bucket cousins) in WAL. New tests, new benchmarks. The first draft was AI-generated and it was bad, so I rewrote it myself. I had AI write the test instead, which found a bug (yay!) that I fixed myself.

Benchmarks:
encode-hist-benchstat.txt
decode-hist-benchstat.txt
encode-fhist-benchstat.txt
decode-fhist-benchstat.txt

Part of #17790

[CHANGE] Adds Start Time value to all WAL Histogram samples in memory, and therefore may increase memory usage.

Member Author

ywwg commented Mar 3, 2026

@copilot please review


Copilot AI commented Mar 3, 2026

@ywwg I've opened a new pull request, #18222, to work on those changes. Once the pull request is ready, I'll request review from you.

Comment thread tsdb/record/record.go
@ywwg ywwg marked this pull request as ready for review April 14, 2026 17:41
@ywwg ywwg force-pushed the owilliams/nh-st branch from ebb6db4 to ef5e137 Compare April 14, 2026 17:44
@ywwg ywwg changed the title ST for Histograms tsdb(wal): st-per-sample for histograms initial code and benchmarks Apr 16, 2026
Member

@krajorama krajorama left a comment

First pass.

Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record_test.go Outdated
Comment thread tsdb/record/record_test.go Outdated
Comment thread tsdb/record/record.go Outdated
Member

@krajorama krajorama left a comment

Few more comments. Seems pretty straightforward though, good job!

Comment thread tsdb/record/record.go
Comment thread tsdb/record/record.go
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Member Author

ywwg commented Apr 21, 2026

Addressed notes. I have a commented-out block for the ST marker code just in case you want it. (No strong preference)

Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go
Comment thread tsdb/record/record_test.go Outdated
Comment thread tsdb/record/record_test.go
@ywwg ywwg requested review from carrieedwards and krajorama April 22, 2026 14:29
Comment thread tsdb/record/record.go Outdated
Member

@krajorama krajorama left a comment

Looking good, few minor comments.

Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go Outdated
Member Author

ywwg commented Apr 24, 2026

notes addressed

Member

@krajorama krajorama left a comment

Almost there I think, couple of comments.
And for the changelog entry: it only talks about the in-memory format, but we're also changing the on-disk format.
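
As a generic illustration of what carrying an extra timestamp in an on-disk record can look like, here is a minimal sketch that writes a sample's reference, `T`, and `ST` as varints, storing `ST` as a delta from `T`. This is not the encoding used in `tsdb/record` and not taken from this PR. The `sample` type, `encodeSample`, and `decodeSample` are hypothetical names, and the histogram payload is omitted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// sample is a hypothetical, simplified record entry (histogram payload omitted).
type sample struct {
	Ref   uint64
	ST, T int64
}

var errMalformed = errors.New("short or malformed buffer")

// encodeSample appends Ref, T, and ST (as the delta T-ST) to buf as varints.
// Small deltas encode in fewer bytes than a full 64-bit timestamp.
func encodeSample(buf []byte, s sample) []byte {
	buf = binary.AppendUvarint(buf, s.Ref)
	buf = binary.AppendVarint(buf, s.T)
	buf = binary.AppendVarint(buf, s.T-s.ST)
	return buf
}

// decodeSample reverses encodeSample and returns the number of bytes consumed.
func decodeSample(buf []byte) (sample, int, error) {
	ref, n := binary.Uvarint(buf)
	if n <= 0 {
		return sample{}, 0, errMalformed
	}
	off := n
	t, n := binary.Varint(buf[off:])
	if n <= 0 {
		return sample{}, 0, errMalformed
	}
	off += n
	delta, n := binary.Varint(buf[off:])
	if n <= 0 {
		return sample{}, 0, errMalformed
	}
	off += n
	return sample{Ref: ref, ST: t - delta, T: t}, off, nil
}

func main() {
	in := sample{Ref: 42, ST: 1_700_000_000_000, T: 1_700_000_060_000}
	buf := encodeSample(nil, in)
	out, n, err := decodeSample(buf)
	fmt.Println(out == in, n == len(buf), err)
}
```

Any change of this kind alters the bytes written to disk as well as the in-memory struct, which is why a changelog entry would need to mention both.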

Comment thread tsdb/record/record.go Outdated
Comment thread tsdb/record/record.go
Comment thread tsdb/wlog/checkpoint.go Outdated
Comment thread tsdb/wlog/checkpoint.go Outdated
Member Author

ywwg commented May 4, 2026

Addressed notes

Member Author

ywwg commented May 4, 2026

I can't figure out why we're getting an undefined symbol error.

Edit: it's because CI merges the PR with current main and tests the result, rather than testing the PR as-is.

Comment thread tsdb/wlog/checkpoint_test.go Outdated
@ywwg ywwg force-pushed the owilliams/nh-st branch from 1b67ed6 to d7c9dba Compare May 5, 2026 14:57
Implements ST for Histograms and Float Histograms (and their custom bucket cousins) in WAL. New tests, new benchmarks.

Part of #17790

```release-notes
[CHANGE] Adds Start Time value to all WAL Histogram samples in memory, and therefore may increase memory usage.
```

Signed-off-by: Owen Williams <owen.williams@grafana.com>
@ywwg ywwg force-pushed the owilliams/nh-st branch from d7c9dba to 1189fd5 Compare May 5, 2026 14:58
Member Author

ywwg commented May 5, 2026

flaky test

@ywwg ywwg requested review from carrieedwards and krajorama May 5, 2026 20:44
Contributor

@carrieedwards carrieedwards left a comment

LGTM!

@krajorama
Member

> flaky test

Restarted.

I asked Claude to look at TestHeadCompactionWhileScraping failing and it's related to Bartek's issue brought to the dev summit about making main() reusable. I've voted and commented on that.

Member

@krajorama krajorama left a comment

tiny comment otherwise LGTM

Comment thread tsdb/record/record.go
Comment thread tsdb/record/record.go
@ywwg ywwg merged commit da1f89e into main May 6, 2026
80 of 83 checks passed
@ywwg ywwg deleted the owilliams/nh-st branch May 6, 2026 18:33
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Delta Temporality May 6, 2026
Comment thread tsdb/record/record.go
T int64
H *histogram.Histogram
Ref chunks.HeadSeriesRef
ST, T int64
Member

cc

eleboucher pushed a commit to eleboucher/homelab that referenced this pull request May 28, 2026
…➔ v3.12.0) (#730)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [quay.io/prometheus/prometheus](https://github.com/prometheus/prometheus) | minor | `v3.11.3` → `v3.12.0` |

---

### Release Notes

<details>
<summary>prometheus/prometheus (quay.io/prometheus/prometheus)</summary>

### [`v3.12.0`](https://github.com/prometheus/prometheus/releases/tag/v3.12.0): 3.12.0 / 2026-05-28

[Compare Source](prometheus/prometheus@v3.11.3...v3.12.0)

- \[SECURITY] Remote-write: Reject snappy-compressed requests whose declared decoded length exceeds 32MB. Thanks to [@&#8203;hibrian827](https://github.com/hibrian827) for reporting it. [#&#8203;18642](prometheus/prometheus#18642)
- \[SECURITY] STACKIT SD: Fix secrets being exposed in plaintext via `/-/config` endpoint. Thanks to [@&#8203;August829](https://github.com/August829) and [@&#8203;Phaxma](https://github.com/Phaxma) for reporting. GHSA-39j6-789q-qxvh [#&#8203;18649](prometheus/prometheus#18649)
- \[CHANGE] TSDB/Agent: Adds Start Timestamp field to all WAL Histogram samples in memory; used when the `st-storage` flag is enabled. [#&#8203;18221](prometheus/prometheus#18221)
- \[FEATURE] API: Add `/api/v1/status/self_metrics` endpoint returning the current state of the Prometheus server's own metrics about itself as JSON. [#&#8203;18411](prometheus/prometheus#18411)
- \[FEATURE] Discovery: Add DigitalOcean Managed Databases service discovery [#&#8203;18287](prometheus/prometheus#18287)
- \[FEATURE] Prometheus: Add support for the aix/ppc64 compilation target [#&#8203;18321](prometheus/prometheus#18321)
- \[FEATURE] Discovery: Add Outscale VM service discovery (`outscale_sd_configs`) for discovering scrape targets from the Outscale Cloud API. [#&#8203;18139](prometheus/prometheus#18139)
- \[FEATURE] PromQL: Emit a warning when `sort`, `sort_by_label` or `sort_by_label_desc` is used within range (matrix) queries, as these functions do not have effect in that context. [#&#8203;18498](prometheus/prometheus#18498)
- \[FEATURE] PromQL: Add `start()`, `end()`, `range()`, and `step()` experimental functions [#&#8203;17877](prometheus/prometheus#17877)
- \[FEATURE] PromQL: Update `resets()` function to consider start timestamp resets. Hidden behind `use-start-timestamps` feature flag. [#&#8203;18627](prometheus/prometheus#18627)
- \[FEATURE] Prometheus: Promote auto-reload-config as stable [#&#8203;18620](prometheus/prometheus#18620)
- \[FEATURE] TSDB/Agent: Add `CheckpointFromInMemorySeries` option to `agent.DB` that enables checkpoint based on in-memory series. [#&#8203;17948](prometheus/prometheus#17948)
- \[FEATURE] UI: Add a web interface for deleting time series and cleaning tombstones, accessible from the Status menu. [#&#8203;18390](prometheus/prometheus#18390)
- \[FEATURE] PromQL: Use start timestamps for `rate()`, `irate()`, and `increase()` calculations, behind the feature flag `use-start-timestamps`. Doesn't work together with the extended range selectors `anchored` and `smoothed`. [#&#8203;18344](prometheus/prometheus#18344)
- \[FEATURE] Scrape: Added a feature flag `st-synthesis` which synthesizes unknown STs for scraped cumulative metrics. Useful when Remote Writing 2.0 with delta or Otel-based backends. [#&#8203;18279](prometheus/prometheus#18279)
- \[FEATURE] promqltest: support `@st` annotation in `load` blocks to specify per-sample start timestamps. [#&#8203;18360](prometheus/prometheus#18360)
- \[ENHANCEMENT] API: reject concurrent fgprof profiles. [#&#8203;18651](prometheus/prometheus#18651)
- \[ENHANCEMENT] AWS SD: Add optional `external_id` field to ECS/MSK/RDS/Elasticache. [#&#8203;18579](prometheus/prometheus#18579)
- \[ENHANCEMENT] AWS SD: Add optional `external_id` field. [#&#8203;17171](prometheus/prometheus#17171)
- \[ENHANCEMENT] Discovery: Propagate SD target updates faster by introducing dynamic backoff interval instead of static 5s interval for throttling. [#&#8203;18187](prometheus/prometheus#18187)
- \[ENHANCEMENT] Promtool: Add `--header` flag to `query instant` command, matching existing `query range` behaviour. [#&#8203;18418](prometheus/prometheus#18418)
- \[ENHANCEMENT] AWS SD: Allows EC2 service discovery to discover IPv6 addresses to communicate with target endpoints. The private IPv4 address remains the default when both IPv4 and IPv6 addresses are present. [#&#8203;16088](prometheus/prometheus#16088)
- \[PERF] TSDB: Make head chunk lookup in range queries constant time instead of quadratic time [#&#8203;18302](prometheus/prometheus#18302)
- \[PERF] TSDB: Skip entire stripes in mmapHeadChunks when no series need mmapping, reducing CPU utilization significantly at production-relevant scales. [#&#8203;18541](prometheus/prometheus#18541)
- \[PERF] TSDB: Skip clean series during periodic head chunk mmap using cached head chunk count [#&#8203;18272](prometheus/prometheus#18272)
- \[PERF] PromQL: Address FloatHistogram.KahanAdd performance regression on Go 1.26. [#&#8203;18568](prometheus/prometheus#18568)
- \[BUGFIX] PromQL: Fix `info()` function incorrectly handling negated `__name__` matchers [#&#8203;17932](prometheus/prometheus#17932)
- \[BUGFIX] API: Return duration expressions in `/parse_ast`. [#&#8203;18624](prometheus/prometheus#18624)
- \[BUGFIX] API: correctly document formats accepted for duration query request parameters (step, timeout and lookback delta) in OpenAPI spec [#&#8203;18305](prometheus/prometheus#18305)
- \[BUGFIX] Scrape: AppenderV2 now tracks staleness even when OOO/duplicate series errors happen similar to AppenderV1 [#&#8203;18567](prometheus/prometheus#18567)
- \[BUGFIX] Config: Validate remote\_write queue\_config fields at load time to prevent runtime panic and silent misconfiguration. [#&#8203;18209](prometheus/prometheus#18209)
- \[BUGFIX] Discovery/Consul: Add `health_filter` for Health API filtering, fixing breakage when using Catalog-only fields like `ServiceTags` in `filter`. [#&#8203;18479](prometheus/prometheus#18479) [#&#8203;18499](prometheus/prometheus#18499)
- \[BUGFIX] OTLP: limit decompressed body size for gzip-encoded OTLP write requests. [#&#8203;18408](prometheus/prometheus#18408)
- \[BUGFIX] PromQL: Fix `smoothed` rate/increase returning zero instead of no result when all data falls strictly after the query range. [#&#8203;18523](prometheus/prometheus#18523)
- \[BUGFIX] PromQL: Fix metric name not being dropped when last\_over\_time or first\_over\_time is applied to subqueries containing name-dropping functions like abs(). [#&#8203;18409](prometheus/prometheus#18409)
- \[BUGFIX] PromQL: Fix missing warning when mixing exponential and custom-bucket histograms in stats queries. [#&#8203;18660](prometheus/prometheus#18660)
- \[BUGFIX] PromQL: Fix parsing of `range()` keyword in duration expressions such as `foo[5m+range()]`. [#&#8203;18623](prometheus/prometheus#18623)
- \[BUGFIX] PromQL: Fix smoothed vector selector returning no results in binary operations when the `@` modifier is used. [#&#8203;18531](prometheus/prometheus#18531)
- \[BUGFIX] PromQL: Reject NaN, infinite, and out-of-range duration expressions instead of silently producing an out-of-range time.Duration. [#&#8203;18639](prometheus/prometheus#18639)
- \[BUGFIX] Scrape: Fix panic when scraping malformed native histograms. [#&#8203;18414](prometheus/prometheus#18414)
- \[BUGFIX] Scrape: fix panic when scraping a target exposing a summary with no quantiles via the protobuf format. [#&#8203;18382](prometheus/prometheus#18382)
- \[BUGFIX] Scrape: fix scrape failure log file occasionally not applied after a configuration reload. [#&#8203;18421](prometheus/prometheus#18421)
- \[BUGFIX] TSDB: Allow retention percentage with new data path. [#&#8203;18628](prometheus/prometheus#18628)
- \[BUGFIX] TSDB: Preserve decimal precision in percentage-based retention [#&#8203;18374](prometheus/prometheus#18374)
- \[BUGFIX] TSDB: fix prometheus\_tsdb\_head\_chunks going negative after WAL replay [#&#8203;18401](prometheus/prometheus#18401)
- \[BUGFIX] TSDB: panic with native histograms during query of overlapping chunks. [#&#8203;18692](prometheus/prometheus#18692)
- \[BUGFIX] Tracing: fix startup failure for insecure OTLP HTTP tracing [#&#8203;18469](prometheus/prometheus#18469)
- \[BUGFIX] UI: Escape label values offered by PromQL autocomplete. [#&#8203;18658](prometheus/prometheus#18658)
- \[BUGFIX] UI: Improve Y-axis tick label precision for graph values over small ranges. [#&#8203;18682](prometheus/prometheus#18682)
- \[BUGFIX] `prometheus_sd_refresh*` and `prometheus_sd_discovered_targets` metrics for specific scrape jobs are deleted when the scrape job is removed. [#&#8203;17614](prometheus/prometheus#17614)
- \[BUGFIX] Remote: fixed validation for received RW2 requests when parsing metadata unit symbols. This fixes a case when request would cause (recovered) handler panic. [#&#8203;18641](prometheus/prometheus#18641)
- \[BUGFIX] TSDB/Agent: fix race in agent appender where concurrent appends for the same label set could produce duplicate in-memory series and duplicate WAL records. [#&#8203;18292](prometheus/prometheus#18292)
- \[BUGFIX] Config: Update `--enable-feature` flag description and sort feature names. [#&#8203;18487](prometheus/prometheus#18487)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://git.erwanleboucher.dev/eleboucher/homelab/pulls/730
Labels

None yet

Projects

Status: Done

5 participants