disagg: Add O11y on object store usage summary of each tiflash store (#10764) #10768

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:release-nextgen-20251011 from ti-chi-bot:cherry-pick-10764-to-release-nextgen-20251011
Mar 25, 2026

@ti-chi-bot
Member

This is an automated cherry-pick of #10764

What problem does this PR solve?

Issue Number: ref #10763

Problem Summary:

  • We need continuous and low-noise observability for remote object-store usage per TiFlash store.
  • Existing S3GC visibility is insufficient to quickly compare expected valid data size vs real object-store footprint.
  • We also need this summary collection to be configurable and safe for production overhead.

What is changed and how it works?

disagg: add configurable owner-only S3 storage summary and per-store usage metrics
  • Add periodic S3 storage summary in S3GCManagerService:

  • Add graceful shutdown behavior during summary scan:

    • getStoreStorageSummary checks shutdown_called in listPrefix loop and exits early.
  • Add per-store summary metrics:

    • New gauge family:
      • tiflash_storage_s3_store_summary_bytes{store_id, type}
      • type in {data_file_bytes, dt_file_bytes}
    • Metrics are updated after each getStoreStorageSummary completion.
  • Wire configuration from settings to S3GC config:
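
The shutdown check described in the list above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the TiFlash C++ code: the key layout, the `classify` helper, and the choice to discard partial totals when shutdown interrupts the scan are all hypothetical. Only the names `getStoreStorageSummary`, `shutdown_called`, and the two type labels come from the PR description.

```python
import threading
from typing import Dict, Iterable, Optional, Tuple


def classify(key: str) -> str:
    """Hypothetical mapping from an object key to a summary type.

    The real key layout is not shown in the PR text; "/dtfile/" is a
    made-up marker used only for this illustration.
    """
    return "dt_file_bytes" if "/dtfile/" in key else "data_file_bytes"


def get_store_storage_summary(
    objects: Iterable[Tuple[str, int]],
    shutdown_called: threading.Event,
) -> Optional[Dict[str, int]]:
    """Sum object sizes per type, checking the shutdown flag on every item.

    Returns None when the scan is interrupted (an assumption: the sketch
    treats an interrupted scan as "no result" so partial totals are never
    reported as a full summary).
    """
    summary = {"data_file_bytes": 0, "dt_file_bytes": 0}
    for key, size in objects:
        if shutdown_called.is_set():
            return None  # exit early instead of finishing a long listing
        summary[classify(key)] += size
    return summary


# Example listing of (key, size) pairs standing in for a listPrefix result.
shutdown = threading.Event()
listing = [("s1/data/a", 100), ("s1/dtfile/b", 250), ("s1/data/c", 50)]
completed = get_store_storage_summary(listing, shutdown)

# Once shutdown is requested, the same scan stops at the first iteration.
shutdown.set()
interrupted = get_store_storage_summary(listing, shutdown)
```

Checking the flag inside the loop rather than once before it is what keeps a long listing from delaying service shutdown.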

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
# Run chbenchmark workload and check the metrics of `prelock_keys` and OSS usage
tiup bench ch --host 10.2.12.81 -P 8081 --warehouses 8000 run -D chbenchmark8k -T 50 -t 0 --time 30m --ignore-error --queries q1
# Before the fix, from 23:29 to 00:00, the number of prelock_keys in memory would accumulate and increase with the write load; after the fix, from 02:00 to 02:30, there was no longer any persistent residue of prelock_keys in memory.
# You can also check the newly added Grafana panel "Remote Store Summary (Disagg arch)"
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Add configurable owner-only periodic S3 storage summary and per-store summary metrics (`tiflash_storage_s3_store_summary_bytes`) for TiFlash disaggregated S3GC observability.
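
To make the shape of the gauge family in the release note concrete, here is a minimal Python model of one gauge per `(store_id, type)` pair, rendered in the Prometheus text exposition format. The metric name and the two `type` values come from the PR description. The `StoreSummaryGauges` class and its methods are hypothetical and do not mirror TiFlash's metrics code.

```python
from typing import Dict, List, Tuple

METRIC_NAME = "tiflash_storage_s3_store_summary_bytes"
TYPES = ("data_file_bytes", "dt_file_bytes")


class StoreSummaryGauges:
    """Hypothetical in-memory stand-in for a labeled gauge family."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[int, str], int] = {}

    def update(self, store_id: int, summary: Dict[str, int]) -> None:
        # A gauge is overwritten, not accumulated: each completed summary
        # replaces the previous value for that store and type.
        for type_label in TYPES:
            self._values[(store_id, type_label)] = summary.get(type_label, 0)

    def render(self) -> List[str]:
        # One exposition line per (store_id, type) label pair.
        return [
            f'{METRIC_NAME}{{store_id="{store_id}",type="{type_label}"}} {value}'
            for (store_id, type_label), value in sorted(self._values.items())
        ]


gauges = StoreSummaryGauges()
gauges.update(2, {"data_file_bytes": 10, "dt_file_bytes": 20})
gauges.update(1, {"data_file_bytes": 150, "dt_file_bytes": 250})
lines = gauges.render()
```

With these sample values (invented for the illustration), the first rendered line is `tiflash_storage_s3_store_summary_bytes{store_id="1",type="data_file_bytes"} 150`.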

Summary by CodeRabbit

  • New Features

    • Added per-store S3 storage summary metrics for data file and DTFile sizes (`data_file_bytes`, `dt_file_bytes`).
    • Added configurable periodic S3 summary collection (default 24h; ≤0 disables).
    • New Grafana panels: Remote Store Summary, Local Lock Manager status, and Local Lock Manager QPS.
  • Documentation

    • Updated dashboard layout and panel organization for improved metric visibility.
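
The interval semantics summarized above (default 24h, a value ≤0 disables collection, and only the owner runs the summary) can be sketched as follows. The function names and the use of seconds as the unit are assumptions for illustration. The actual setting name and unit are not given in this PR text.

```python
from typing import Optional

DEFAULT_SUMMARY_INTERVAL_SECONDS = 24 * 60 * 60  # the documented 24h default


def effective_summary_interval(configured_seconds: Optional[int]) -> Optional[int]:
    """Resolve the configured interval; None means the summary is disabled."""
    if configured_seconds is None:
        return DEFAULT_SUMMARY_INTERVAL_SECONDS  # setting not provided
    if configured_seconds <= 0:
        return None  # zero or negative disables the periodic summary
    return configured_seconds


def should_run_summary(
    is_owner: bool,
    interval_seconds: Optional[int],
    now: float,
    last_run: Optional[float],
) -> bool:
    """Owner-only gate: run when enabled and the interval has elapsed."""
    if not is_owner or interval_seconds is None:
        return False
    return last_run is None or now - last_run >= interval_seconds


interval = effective_summary_interval(None)
```

Keeping the "disabled" case as a distinct `None` value, rather than a very large interval, makes it explicit in the scheduling check that no scan should ever be issued.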

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added the do-not-merge/hold, release-note, size/XXL, and type/cherry-pick-for-release-nextgen-20251011 labels Mar 24, 2026
@ti-chi-bot
Member Author

@JaySon-Huang This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 24, 2026

@ti-chi-bot: If you want to know how to resolve it, please read the guide in TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@coderabbitai

coderabbitai bot commented Mar 24, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (3)
  • release-8.5
  • release-7.5
  • release-8.1

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 46fa9777-84c9-4b91-8601-2291a9fafb89

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang
Contributor

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold label Mar 25, 2026
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm label Mar 25, 2026
@ti-chi-bot ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Mar 25, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 25, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-25 01:09:42 UTC: ☑️ agreed by JaySon-Huang.
  • 2026-03-25 01:28:54 UTC: ☑️ agreed by JinheLin.

@ti-chi-bot ti-chi-bot bot added the approved label Mar 25, 2026
@ti-chi-bot ti-chi-bot bot added the cherry-pick-approved label Mar 25, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 25, 2026

@kolafish: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, JinheLin, kolafish, yudongusa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot merged commit c1faf95 into pingcap:release-nextgen-20251011 Mar 25, 2026
5 checks passed
@ti-chi-bot ti-chi-bot bot deleted the cherry-pick-10764-to-release-nextgen-20251011 branch March 25, 2026 03:40

Labels

approved, cherry-pick-approved, lgtm, release-note, size/XXL, type/cherry-pick-for-release-nextgen-20251011

5 participants