tsdb: Skip clean series during periodic head chunk mmap#18272
The periodic mmapHeadChunks cycle previously scanned every series across all stripes, acquiring a per-series lock on each, even though typically >99% have nothing to mmap.

Add a dirtyForMmap flag (sync/atomic.Bool, 4 bytes) to memSeries. When a new head chunk is cut during append, the flag is set atomically. The new mmapDirtyHeadChunks function iterates series with a lock-free Load() to skip clean ones; only dirty series acquire the series mutex. The existing full-scan mmapHeadChunks is kept for Head.Close().

sync/atomic.Bool (4 bytes, align 4) is used instead of go.uber.org/atomic.Bool (8 bytes due to nocmp) to fit in existing struct padding without growing memSeries.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
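The padding argument in this commit message can be checked with unsafe.Sizeof. The following is a minimal sketch, not the real memSeries layout: withoutFlag and withFlag are hypothetical structs that merely end in padding, and the "size unchanged" result assumes a 64-bit platform.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// withoutFlag is a hypothetical struct (not memSeries) whose last field
// leaves trailing padding on 64-bit platforms.
type withoutFlag struct {
	ref  uint64
	kind uint8
}

// withFlag adds a sync/atomic.Bool, which has size 4 and alignment 4.
type withFlag struct {
	ref   uint64
	kind  uint8
	dirty atomic.Bool
}

// sizes reports the struct size before and after adding the flag,
// and the size of the flag itself.
func sizes() (before, after, flag uintptr) {
	var b atomic.Bool
	return unsafe.Sizeof(withoutFlag{}), unsafe.Sizeof(withFlag{}), unsafe.Sizeof(b)
}

func main() {
	before, after, flag := sizes()
	fmt.Printf("atomic.Bool: %d bytes; struct before: %d, after: %d\n", flag, before, after)
}
```

On a 64-bit platform the flag lands in the bytes that were previously padding, so both structs report the same size. A type with 8-byte alignment would not fit there and would grow the struct.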
Also use "ready" term instead of "dirty". Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Cache the head chunk count in an atomic Uint32 instead of a boolean flag. This enables future optimization of the O(n) memChunk.len() traversal and detection of pathologically long head chunk chains. Reset headChunkCount on all paths that remove head chunks: snapshot restore, WAL replay, truncateChunksBefore, and mmapHeadChunks. In mmapHeadChunks, set the count after mmapChunks based on actual state to handle the race where truncation clears headChunks between the lock-free fast-path check and acquiring the series lock. Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Fix pre-existing comment errors in code blocks touched by recent changes: headChunks.next → headChunks.prev (the field is prev), mmapHeadChunks() → mmapChunks() (the method called on each series), remove redundant line with grammar error, and clarify when headChunkCount is updated. Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Address reviewer feedback:

- Move headChunkCount update into cutNewHeadChunk and the histogram counter-reset paths, where head chunks are actually created. This is simpler than updating in observeChunkCreated and naturally avoids inflating the count for OOO chunk creation (krajorama, bboreham).
- Move headChunkCount reset into mmapChunks so it is next to where headChunks is manipulated (krajorama).
- Move >= 2 comment from field definition to the check site (bboreham).
- Set headChunkCount on snapshot restore path.
- Add test verifying OOO inserts do not inflate headChunkCount.

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
- Count snapshot-restored chunks instead of hardcoding 1 (krajorama).
- Use uint32 for truncation loop counter to avoid cast (krajorama).

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
- Remove headChunkCount stores from no-op paths in mmapChunks, to avoid masking bugs by silently correcting stale counts (bboreham).
- Drop observeChunkCreated extraction, revert to inline metrics since the function no longer carries headChunkCount logic (bboreham).
- Note that chunk counts are bounded by the 3-byte HeadChunkRef field and cannot overflow uint32 (bboreham).
- Merge duplicated serialisation comments in mmapHeadChunks doc (bboreham).
- Reword "unlink these from s.headChunks" to "Remove the tail of the list" (bboreham).

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
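The overflow note above comes down to simple arithmetic. As a small sketch, taking only the "3-byte field" figure from the commit message (the constant names here are illustrative, not identifiers from the Prometheus code base):

```go
package main

import (
	"fmt"
	"math"
)

// chunkIDBits is the width of a 3-byte field, as stated in the commit message.
const chunkIDBits = 3 * 8

// maxChunkID is the largest value that fits in a 3-byte field.
const maxChunkID = 1<<chunkIDBits - 1

// fitsInUint32 reports whether n can be stored in a uint32 without overflow.
func fitsInUint32(n uint64) bool {
	return n <= math.MaxUint32
}

func main() {
	fmt.Println(maxChunkID, fitsInUint32(maxChunkID)) // prints "16777215 true"
}
```

A count limited to 2^24 - 1 leaves eight spare bits in a uint32, so the counter cannot wrap under this bound.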
…8272)

tsdb: Skip clean series during periodic head chunk mmap

The periodic mmapHeadChunks cycle previously acquired a per-series lock on every series, even though typically >99% have nothing to mmap. This was identified as a CPU bottleneck in Grafana Mimir.

Add a headChunkCount field (sync/atomic.Uint32) to memSeries that tracks the number of head chunks. It is incremented in cutNewHeadChunk and the histogram new-chunk paths, and reset by mmapChunks and truncateChunksBefore. mmapHeadChunks uses a lock-free Load to skip series with fewer than 2 head chunks, avoiding the per-series lock for clean series.

sync/atomic.Uint32 (4 bytes) is used instead of go.uber.org/atomic (8 bytes) to fit in existing struct padding without growing memSeries. Chunk counts are bounded by the 3-byte field in HeadChunkRef, so cannot overflow uint32.

Also fix pre-existing comment inaccuracies in the touched code: headChunks.next -> headChunks.prev, mmapHeadChunks() -> mmapChunks() in the doc comment, and a grammar error.

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Signed-off-by: Raphael Bizos <r.bizos@criteo.com>
The periodic `mmapHeadChunks` cycle previously scanned every series across all stripes, acquiring a per-series lock on each, even though typically >99% have nothing to mmap. Profiling identified this as an ingester CPU bottleneck in Grafana Mimir.

This PR adds a `headChunkCount` field (`sync/atomic.Uint32`, 4 bytes) to `memSeries` that tracks the number of head chunks. It is incremented in `cutNewHeadChunk` and the histogram new-chunk paths, and reset by `mmapChunks` and `truncateChunksBefore`. The `mmapHeadChunks` function uses a lock-free `Load()` to skip series with fewer than 2 head chunks; only series with chunks ready for mmapping acquire the series mutex.

`sync/atomic.Uint32` (4 bytes, align 4) is used instead of `go.uber.org/atomic` (8 bytes due to `nocmp`) to fit in existing struct padding without growing `memSeries`. Chunk counts are bounded by the 3-byte field in `HeadChunkRef`, so cannot overflow `uint32`.

Should address the same problem as #14752, but with a different approach.
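The fast path described above can be sketched roughly as follows. This is a simplified illustration and not the code from the PR: the series type is a hypothetical stand-in for memSeries, the head chunk list is reduced to an integer, and the return values exist only to make the skipping visible.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// series is a simplified, hypothetical stand-in for memSeries.
type series struct {
	sync.Mutex
	headChunkCount atomic.Uint32 // cached count, readable without the lock
	headChunks     int           // stands in for the head chunk list; guarded by the mutex
}

// cutNewHeadChunk adds a head chunk and bumps the cached count.
func (s *series) cutNewHeadChunk() {
	s.Lock()
	defer s.Unlock()
	s.headChunks++
	s.headChunkCount.Add(1)
}

// mmapChunks moves all but the newest head chunk out of the head and
// returns how many were moved. The caller must hold the series lock.
func (s *series) mmapChunks() int {
	if s.headChunks < 2 {
		return 0 // no-op path: the cached count is left untouched
	}
	moved := s.headChunks - 1
	s.headChunks = 1
	s.headChunkCount.Store(1)
	return moved
}

// mmapHeadChunks locks only series whose cached count says there is work.
// It returns how many series were locked and how many chunks were moved.
func mmapHeadChunks(all []*series) (locked, mmapped int) {
	for _, s := range all {
		if s.headChunkCount.Load() < 2 {
			continue // clean series: skipped without taking the mutex
		}
		s.Lock()
		locked++
		mmapped += s.mmapChunks()
		s.Unlock()
	}
	return locked, mmapped
}

func main() {
	all := make([]*series, 100)
	for i := range all {
		all[i] = &series{}
		all[i].cutNewHeadChunk()
	}
	all[7].cutNewHeadChunk() // only this series has a chunk ready to mmap

	locked, mmapped := mmapHeadChunks(all)
	fmt.Println(locked, mmapped) // prints "1 1"
}
```

In this sketch 99 of the 100 series are skipped after a single atomic load. That is the case the description targets, where nearly all series are clean on any given cycle.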
prombench (PS: @bwplotka pointed out afterwards that prombench results are currently considered unreliable within a ±10% range):
Benchmark results (Intel Xeon Platinum 8280, 16 CPUs)
At production-relevant scales (100k series, 1–10% ready), the optimized path is 36–54% faster with a 14% geomean improvement across all cases. When 100% of series are ready it's within noise of the full scan. Memory and allocations are essentially flat.
Which issue(s) does the PR fix:
Does this PR introduce a user-facing change?