WIP reduce getGCState freq by CalvinNeo · Pull Request #10817 · pingcap/tiflash

CalvinNeo · 2026-04-24T10:11:10Z

What problem does this PR solve?

Issue Number: close #10818

Problem Summary:

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Summary by CodeRabbit

Release Notes

Performance
- Enhanced GC safepoint retrieval with configurable caching strategies to optimize query execution performance.
Tests
- Added comprehensive test coverage for safepoint caching behaviors.

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>

ti-chi-bot · 2026-04-24T10:11:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign solotzg for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-24T10:11:27Z

📝 Walkthrough

Walkthrough

This PR introduces a configurable fetch strategy for GC safepoint retrieval in TiKV storage. A new GCSafepointFetchStrategy enum allows callers to choose between cache-only reads or cache-with-PD-fallback behavior. The method signature is updated, test coverage is added, and StorageDeltaMerge is modified to use the cache-only strategy.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GC Safepoint Fetch Strategy API `dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` | Adds `GCSafepointFetchStrategy` enum with `CacheOnly` and `UpdateCacheIfNeeded` variants. Updates `PDClientHelper::getGCSafePointWithRetry` signature to accept `fetch_strategy` parameter, enabling conditional cache-only behavior or cache-with-PD-fallback logic. |
| Test Coverage `dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` | Introduces `CountingPDClient` stub to track GC state calls. Adds two unit tests verifying `CacheOnly` strategy does not trigger PD fetches on cache miss and respects stale cached values without advancing the safe point. |
| Production Usage `dbms/src/Storages/StorageDeltaMerge.cpp` | Updates `checkStartTs` to explicitly request GC safe point with `GCSafepointFetchStrategy::CacheOnly` instead of relying on default behavior. |
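
To make the two strategies concrete, here is a minimal sketch of how a cache-only read differs from a read that may fall back to PD. It is not the code from this PR: `SafePointCache`, the body of `CountingPDClient`, and the type aliases are hypothetical stand-ins, and cache expiry (the `safe_point_update_interval_seconds` behavior) is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins; the names follow the PR summary, the bodies are illustrative only.
using KeyspaceID = uint32_t;
using Timestamp = uint64_t;

enum class GCSafepointFetchStrategy
{
    CacheOnly,           // never contact PD; return the cached value, or 0 on a miss
    UpdateCacheIfNeeded, // fetch from PD and fill the cache when there is no entry
};

// Fake PD client that counts how often it is asked for the GC state.
struct CountingPDClient
{
    Timestamp safe_point = 0;
    int get_gc_state_calls = 0;

    Timestamp getGCState(KeyspaceID /*keyspace_id*/)
    {
        ++get_gc_state_calls;
        return safe_point;
    }
};

// Simplified per-keyspace cache without expiry.
class SafePointCache
{
public:
    Timestamp get(CountingPDClient & pd, KeyspaceID keyspace_id, GCSafepointFetchStrategy strategy)
    {
        auto it = cache.find(keyspace_id);
        if (it != cache.end())
            return it->second; // cache hit: PD is not contacted under either strategy
        if (strategy == GCSafepointFetchStrategy::CacheOnly)
            return 0; // cache miss on the read path: report 0 and leave PD alone
        Timestamp fetched = pd.getGCState(keyspace_id);
        cache[keyspace_id] = fetched;
        return fetched;
    }

private:
    std::unordered_map<KeyspaceID, Timestamp> cache;
};
```

In this sketch a `CacheOnly` lookup never increments `get_gc_state_calls`, which is the property the new unit tests are described as checking with their own counting stub.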

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

pingcap/tiflash#10807: Also modifies PDClientHelper::getGCSafePointWithRetry signature and behavior, adding backoff configuration and metrics alongside cache management improvements.

Suggested labels

severity/minor, size/L

Suggested reviewers

JinheLin
JaySon-Huang
kolafish

Poem

🐰 A cache strategy hops into place,
No frantic PD chases—just saved space!
Cache-only queries skip the distant call,
While updates keep the fresh data for all,
TiFlash now moves with a quicker grace!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description follows the template structure but lacks critical content: no problem summary, empty commit message, unverified checkbox for unit tests (which are present), and no substantive explanation of changes or rationale. | Fill in the Problem Summary section, add a descriptive commit message, check the Unit test checkbox, and explain why this change reduces getGCState calls and its impact. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.34% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title 'WIP reduce getGCState freq' is abbreviated, partially describes a real change (reducing GC state calls), but lacks clarity and specificity about the implementation approach. | Expand the title to clearly describe what was changed, e.g., 'Add configurable GC safepoint fetch strategy to reduce PD calls' or 'Implement cache-only strategy for GC safepoint fetches'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dbms/src/Storages/StorageDeltaMerge.cpp (1)
719-745: ⚠️ Potential issue | 🟠 Major

checkStartTs becomes a no-op on cache miss — this is intentional but needs explicit documentation.

checkStartTs is the safety net that rejects queries whose start_ts is below the GC safepoint. With GCSafepointFetchStrategy::CacheOnly, getGCSafePointWithRetry returns 0 on cache miss (confirmed in PDTiKVClient.h:200-208), making the comparison start_ts < 0 always false — the check is silently skipped.

The design is intentional: background/non-query callers (SchemaSyncService with ignore_cache=true, PrehandleSnapshot, DeltaMergeStore_InternalBg) populate the cache via PD, while query paths consume only the cached value to avoid per-query PD traffic. This is validated by explicit test coverage (CacheOnlyReadPathDoesNotFetchFromPD).

However, a startup window exists: if a query arrives before any background path has populated the cache for a given keyspace (fresh TiFlash process or a new keyspace that none of the background tasks have touched yet), the safety check is bypassed entirely. This trade-off between startup safety and steady-state performance should be:

- Explicitly documented in the commit message or PR description; the current log message suggests the default behavior is preserved, which is misleading.
- Confirmed acceptable by the authors: either the startup window is guaranteed short in practice (e.g., schema sync always runs before first query), or the risk is acceptable by design.

Consider adding a note in the code or PR explaining why this trade-off is acceptable at TiFlash startup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/StorageDeltaMerge.cpp` around lines 719 - 745, The
checkStartTs function uses PDClientHelper::getGCSafePointWithRetry(...,
GCSafepointFetchStrategy::CacheOnly) (called from checkStartTs) which returns 0
on cache miss, effectively making the start_ts < safe_point check a no-op during
cold starts; add an explicit in-code comment above checkStartTs (or adjacent to
the PDClientHelper call) stating that CacheOnly yields 0 on miss, that
background callers (SchemaSyncService, PrehandleSnapshot,
DeltaMergeStore_InternalBg) are responsible for populating the cache, and
document the startup window trade-off and why it is acceptable (or link to the
PR/issue) so reviewers/readers are not misled by the current behavior/logging.
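
The no-op described in this comment follows directly from unsigned comparison: no `start_ts` is less than 0. The sketch below shows that, plus one hypothetical way to keep a cache miss distinguishable from a real safepoint. `StartTsCheck`, `isBelowSafePoint`, and `classifyStartTs` are illustrative names, not functions from TiFlash.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using Timestamp = uint64_t;

// Behavior as described in the review: a cache miss is encoded as 0,
// so the comparison below can never reject a query in that case.
inline bool isBelowSafePoint(Timestamp start_ts, Timestamp cached_safe_point_or_zero)
{
    return start_ts < cached_safe_point_or_zero;
}

enum class StartTsCheck
{
    Passed,
    RejectedBelowSafePoint,
    SkippedNoCachedSafePoint,
};

// Hypothetical alternative: carry the miss as std::nullopt so a skipped check
// can be logged or counted instead of silently passing.
inline StartTsCheck classifyStartTs(Timestamp start_ts, std::optional<Timestamp> cached_safe_point)
{
    if (!cached_safe_point)
        return StartTsCheck::SkippedNoCachedSafePoint;
    return start_ts < *cached_safe_point ? StartTsCheck::RejectedBelowSafePoint : StartTsCheck::Passed;
}
```

Whether an explicit "skipped" outcome is worth the extra plumbing is a design choice for the authors; the sketch only shows that the cold-start case can be made visible.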

🧹 Nitpick comments (3)

dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)
1268-1268: Wall-clock sleep adds mild flakiness.

sleep_for(2s) combined with safe_point_update_interval_seconds=1 is fine in practice, but makes this test timing-dependent and slow. If you ever want to make it deterministic, the cache uses steady_clock inside getGCSafepointIfValid, so a custom clock injection or exposing an "expire now" hook would eliminate the sleep. Not required for this PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1268, The test
uses a wall-clock sleep (std::this_thread::sleep_for(std::chrono::seconds(2)))
which makes it timing-dependent and slow; replace this with a deterministic
approach by injecting a test clock or adding an "expire now" hook so the
cache/GC safepoint logic can be advanced without sleeping — target the code
paths around getGCSafepointIfValid and the safe_point_update_interval_seconds
behavior (use a steady_clock-injectable implementation or call an exposed method
to force expiry) so the test can trigger the same cache refresh immediately and
remove the sleep.
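
As one possible shape for the clock-injection idea, the sketch below has a cached value whose expiry is driven by a caller-supplied clock. `ExpiringSafePoint` and its members are hypothetical; the real cache behind `getGCSafepointIfValid` may be organized differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

using Timestamp = uint64_t;
using SteadyTime = std::chrono::steady_clock::time_point;

// Hypothetical cache entry with an injectable clock.
class ExpiringSafePoint
{
public:
    using NowFn = std::function<SteadyTime()>;

    // By default the real steady_clock is used; tests can pass a fake.
    explicit ExpiringSafePoint(
        std::chrono::seconds ttl_,
        NowFn now_ = [] { return std::chrono::steady_clock::now(); })
        : ttl(ttl_)
        , now(std::move(now_))
    {}

    void update(Timestamp safe_point)
    {
        value = safe_point;
        updated_at = now();
    }

    // Returns the cached value only while it is younger than the TTL.
    std::optional<Timestamp> getIfValid() const
    {
        if (!value || now() - updated_at >= ttl)
            return std::nullopt;
        return value;
    }

private:
    std::chrono::seconds ttl;
    NowFn now;
    std::optional<Timestamp> value;
    SteadyTime updated_at{};
};
```

A test can then capture a `SteadyTime` variable by reference in the injected lambda and advance it by two seconds, which expires the entry immediately instead of waiting on `sleep_for`.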
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
198-225: Consider whether the CacheOnly path should emit tiflash_gc_safepoint_backoff_count{type=success}.

In the CacheOnly branch no PD request is ever made, so calling observe_backoff_count(true) (with backoff_count == 0) inflates the type_success histogram with samples that do not correspond to an actual PD fetch. Previously every observation in that metric reflected a real PD interaction; after this change the vast majority of observations will come from cache-only lookups on the hot read path and the metric will mostly report zeros. If the intent of the metric is to track PD-fetch backoff, consider skipping the observation on the CacheOnly cache-hit/miss path (and also skipping it on the existing fast-path cache hit).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 198 - 225,
The CacheOnly branch and the fast-path cached-hit branch currently call
observe_backoff_count(true) even though no PD request occurs; remove those calls
so the tiflash_gc_safepoint_backoff_count{type=success} metric is only observed
when an actual PD fetch/backoff happens. Specifically, in PDTiKVClient.h inside
the code paths handling GCSafepointFetchStrategy::CacheOnly and the
getGCSafepointIfValid(cache-hit) branch, eliminate the
observe_backoff_count(true) calls and ensure observe_backoff_count is only
invoked in the code path(s) that perform a real PD fetch (the fallback PD-fetch
logic). Use the existing symbols ks_gc_sp_map.getGCSafepoint,
ks_gc_sp_map.getGCSafepointIfValid, and observe_backoff_count to locate and
adjust the calls.
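
The suggestion amounts to recording a backoff sample only on the path that actually talks to PD. Below is a minimal sketch under that assumption; `BackoffHistogram` and `getSafePoint` are hypothetical stand-ins for the real metric and helper, and the `fetch_from_pd` callback is assumed to report how many retries it needed.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Timestamp = uint64_t;

// Hypothetical stand-in for the success histogram of the backoff metric.
struct BackoffHistogram
{
    std::vector<int> samples;
    void observe(int backoff_count) { samples.push_back(backoff_count); }
};

// fetch_from_pd(int & backoff_count) performs the PD request and reports its retries.
template <typename FetchFromPD>
Timestamp getSafePoint(
    std::optional<Timestamp> cached,
    bool cache_only,
    BackoffHistogram & success_hist,
    FetchFromPD && fetch_from_pd)
{
    if (cached)
        return *cached; // cache hit: no PD request, so nothing is observed
    if (cache_only)
        return 0; // cache-only miss: still no PD request, nothing is observed
    int backoff_count = 0;
    Timestamp fetched = fetch_from_pd(backoff_count);
    success_hist.observe(backoff_count); // only real PD fetches produce a sample
    return fetched;
}
```

With this shape every sample in the histogram corresponds to one PD interaction, so the hot read path cannot fill the metric with zero-valued samples.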
dbms/src/Storages/StorageDeltaMerge.cpp (1)
915-915: checkStartTs is now invoked three times per read; with CacheOnly this is effectively identical to a single call.

Previously each checkStartTs call could trigger a PD fetch when the cache was stale, so invoking it pre-read / post-read / post-snapshot gave each call an independent chance to pick up a freshly advanced safepoint. With CacheOnly all three calls read the same cached value (unless a background path races in between), so the "ensure after read" invariant the comments describe no longer adds meaningful coverage. Worth a short note in the PR or code comments that the post-read checks are kept as a defense-in-depth against a future background refresh between the two calls, rather than active safety.

Also applies to: 1011-1011, 1057-1057
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/StorageDeltaMerge.cpp` at line 915, Multiple calls to
checkStartTs(mvcc_query_info.start_ts, context, query_info.req_id, keyspace_id)
now read the same cached safepoint under CacheOnly, so the post-read invocations
no longer increase coverage; update the code comment near the checkStartTs calls
(the pre-read/post-read/post-snapshot invocations) to state explicitly that with
CacheOnly the checks observe the same cached value and that the extra calls are
retained as defense-in-depth only (to catch a rare background refresh/race),
rather than for active additional safety, so future readers understand why we
keep the redundant calls.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 08e9eb71-838f-4216-a4d9-5768a598805d

📥 Commits

Reviewing files that changed from the base of the PR and between 0dc254b and c3f1d3c.

📒 Files selected for processing (3)

dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp
dbms/src/Storages/StorageDeltaMerge.cpp

ti-chi-bot · 2026-04-24T10:22:09Z

@CalvinNeo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-unit-test | `c3f1d3c` | link | true | `/test pull-unit-test` |
| pull-integration-next-gen | `c3f1d3c` | link | true | `/test pull-integration-next-gen` |
| pull-integration-test | `c3f1d3c` | link | true | `/test pull-integration-test` |
| pull-unit-next-gen | `c3f1d3c` | link | true | `/test pull-unit-next-gen` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

demo

c3f1d3c

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>

ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 24, 2026

ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 24, 2026

ti-chi-bot Bot added do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels Apr 24, 2026

coderabbitai Bot reviewed Apr 24, 2026

CalvinNeo wants to merge 1 commit into pingcap:master from CalvinNeo:fix-gc-safepoint
