WIP reduce getGCState freq #10817

Open
CalvinNeo wants to merge 1 commit into pingcap:master from CalvinNeo:fix-gc-safepoint

Conversation

@CalvinNeo
Member

@CalvinNeo CalvinNeo commented Apr 24, 2026

What problem does this PR solve?

Issue Number: close #10818

Problem Summary:

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

Release Notes

  • Performance

    • Enhanced GC safepoint retrieval with configurable caching strategies to optimize query execution performance.
  • Tests

    • Added comprehensive test coverage for safepoint caching behaviors.

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 24, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign solotzg for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Walkthrough

This PR introduces a configurable fetch strategy for GC safepoint retrieval in TiKV storage. A new GCSafepointFetchStrategy enum allows callers to choose between cache-only reads or cache-with-PD-fallback behavior. The method signature is updated, test coverage is added, and StorageDeltaMerge is modified to use the cache-only strategy.
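
The two strategies described above can be sketched roughly as follows. This is a minimal illustration, not the code in PDTiKVClient.h: `SafepointCache`, `KeyspaceID`, `Timestamp`, and the `fetch_from_pd` callback are hypothetical stand-ins, and the sketch ignores cache expiry and retries entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins; the real types in TiFlash may differ.
using KeyspaceID = std::uint32_t;
using Timestamp = std::uint64_t;

enum class GCSafepointFetchStrategy
{
    CacheOnly,           // never contact PD; a cache miss yields 0
    UpdateCacheIfNeeded, // fall back to PD on a miss and store the result
};

class SafepointCache
{
public:
    // `fetch_from_pd` models the PD round trip (an assumption of this sketch).
    explicit SafepointCache(std::function<Timestamp(KeyspaceID)> fetch_from_pd)
        : fetch_from_pd_(std::move(fetch_from_pd))
    {
        assert(static_cast<bool>(fetch_from_pd_));
    }

    Timestamp get(KeyspaceID keyspace_id, GCSafepointFetchStrategy strategy)
    {
        if (auto it = cache_.find(keyspace_id); it != cache_.end())
            return it->second; // cache hit: both strategies behave the same
        if (strategy == GCSafepointFetchStrategy::CacheOnly)
            return 0; // cache miss: report "unknown" without calling PD
        const Timestamp fresh = fetch_from_pd_(keyspace_id);
        cache_[keyspace_id] = fresh;
        return fresh;
    }

private:
    std::function<Timestamp(KeyspaceID)> fetch_from_pd_;
    std::unordered_map<KeyspaceID, Timestamp> cache_;
};
```

In this sketch a `CacheOnly` caller never triggers `fetch_from_pd`, so it only sees a non-zero value after some `UpdateCacheIfNeeded` caller has populated the entry for that keyspace.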

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| GC Safepoint Fetch Strategy API: dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h | Adds GCSafepointFetchStrategy enum with CacheOnly and UpdateCacheIfNeeded variants. Updates PDClientHelper::getGCSafePointWithRetry signature to accept fetch_strategy parameter, enabling conditional cache-only behavior or cache-with-PD-fallback logic. |
| Test Coverage: dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp | Introduces CountingPDClient stub to track GC state calls. Adds two unit tests verifying CacheOnly strategy does not trigger PD fetches on cache miss and respects stale cached values without advancing the safe point. |
| Production Usage: dbms/src/Storages/StorageDeltaMerge.cpp | Updates checkStartTs to explicitly request GC safe point with GCSafepointFetchStrategy::CacheOnly instead of relying on default behavior. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • pingcap/tiflash#10807: Also modifies PDClientHelper::getGCSafePointWithRetry signature and behavior, adding backoff configuration and metrics alongside cache management improvements.

Suggested labels

severity/minor, size/L

Suggested reviewers

  • JinheLin
  • JaySon-Huang
  • kolafish

Poem

🐰 A cache strategy hops into place,
No frantic PD chases—just saved space!
Cache-only queries skip the distant call,
While updates keep the fresh data for all,
TiFlash now moves with a quicker grace!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description follows the template structure but lacks critical content: no problem summary, empty commit message, unverified checkbox for unit tests (which are present), and no substantive explanation of changes or rationale. | Fill in the Problem Summary section, add a descriptive commit message, check the Unit test checkbox, and explain why this change reduces getGCState calls and its impact. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.34% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title 'WIP reduce getGCState freq' is abbreviated, partially describes a real change (reducing GC state calls), but lacks clarity and specificity about the implementation approach. | Expand the title to clearly describe what was changed, e.g., 'Add configurable GC safepoint fetch strategy to reduce PD calls' or 'Implement cache-only strategy for GC safepoint fetches'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDeltaMerge.cpp (1)

719-745: ⚠️ Potential issue | 🟠 Major

checkStartTs becomes a no-op on cache miss — this is intentional but needs explicit documentation.

checkStartTs is the safety net that rejects queries whose start_ts is below the GC safepoint. With GCSafepointFetchStrategy::CacheOnly, getGCSafePointWithRetry returns 0 on cache miss (confirmed in PDTiKVClient.h:200-208), making the comparison start_ts < 0 always false — the check is silently skipped.

The design is intentional: background/non-query callers (SchemaSyncService with ignore_cache=true, PrehandleSnapshot, DeltaMergeStore_InternalBg) populate the cache via PD, while query paths consume only the cached value to avoid per-query PD traffic. This is validated by explicit test coverage (CacheOnlyReadPathDoesNotFetchFromPD).

However, a startup window exists: if a query arrives before any background path has populated the cache for a given keyspace (fresh TiFlash process or a new keyspace that none of the background tasks have touched yet), the safety check is bypassed entirely. This trade-off between startup safety and steady-state performance should be:

  1. Explicitly documented in the commit message or PR description — the current log message suggests the default behavior is preserved, which is misleading.
  2. Confirmed acceptable by the authors — either the startup window is guaranteed short in practice (e.g., schema sync always runs before first query), or the risk is acceptable by design.

Consider adding a note in the code or PR explaining why this trade-off is acceptable at TiFlash startup.
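
To make the concern concrete, here is a small sketch of why a zero safepoint disables the comparison, plus one possible way to surface the cache-miss case. The names `isBelowSafepoint`, `checkStartTsExplicit`, and `StartTsCheck` are hypothetical and do not appear in TiFlash. Treating 0 as "unknown" also assumes that a real safepoint is never 0, which would need to be confirmed against the actual code.

```cpp
#include <cstdint>
#include <type_traits>

using Timestamp = std::uint64_t;
static_assert(std::is_unsigned_v<Timestamp>, "the argument below relies on an unsigned timestamp");

// Mirrors the comparison discussed above: with safe_point == 0 and an
// unsigned type, `start_ts < safe_point` can never be true.
inline bool isBelowSafepoint(Timestamp start_ts, Timestamp safe_point)
{
    return start_ts < safe_point;
}

// Hypothetical three-way result that distinguishes "passed" from "not checked".
enum class StartTsCheck
{
    Ok,
    BelowSafepoint,
    SafepointUnknown,
};

// Assumption: 0 is only ever returned on a cache miss, never as a real safepoint.
inline StartTsCheck checkStartTsExplicit(Timestamp start_ts, Timestamp cached_safe_point)
{
    if (cached_safe_point == 0)
        return StartTsCheck::SafepointUnknown;
    return start_ts < cached_safe_point ? StartTsCheck::BelowSafepoint : StartTsCheck::Ok;
}
```

A caller could log or count the `SafepointUnknown` outcome, which would make the size of the startup window observable without adding PD traffic on the query path.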

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/StorageDeltaMerge.cpp` around lines 719 - 745, The
checkStartTs function uses PDClientHelper::getGCSafePointWithRetry(...,
GCSafepointFetchStrategy::CacheOnly) (called from checkStartTs) which returns 0
on cache miss, effectively making the start_ts < safe_point check a no-op during
cold starts; add an explicit in-code comment above checkStartTs (or adjacent to
the PDClientHelper call) stating that CacheOnly yields 0 on miss, that
background callers (SchemaSyncService, PrehandleSnapshot,
DeltaMergeStore_InternalBg) are responsible for populating the cache, and
document the startup window trade-off and why it is acceptable (or link to the
PR/issue) so reviewers/readers are not misled by the current behavior/logging.
🧹 Nitpick comments (3)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)

1268-1268: Wall-clock sleep adds mild flakiness.

sleep_for(2s) combined with safe_point_update_interval_seconds=1 is fine in practice, but makes this test timing-dependent and slow. If you ever want to make it deterministic, the cache uses steady_clock inside getGCSafepointIfValid, so a custom clock injection or exposing an "expire now" hook would eliminate the sleep. Not required for this PR.
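
One way the clock-injection idea might look is sketched below. `ExpiringSafepoint` and `NowFn` are hypothetical names; the real cache and its `getGCSafepointIfValid` method are structured differently. The sketch only shows how a test can advance time without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

using Timestamp = std::uint64_t;
using TimePoint = std::chrono::steady_clock::time_point;
using NowFn = std::function<TimePoint()>;

// Hypothetical single-entry cache whose notion of "now" can be replaced in tests.
class ExpiringSafepoint
{
public:
    explicit ExpiringSafepoint(
        std::chrono::seconds ttl,
        NowFn now = [] { return std::chrono::steady_clock::now(); })
        : ttl_(ttl)
        , now_(std::move(now))
    {
        assert(ttl_.count() > 0);
    }

    void set(Timestamp safe_point)
    {
        value_ = safe_point;
        updated_at_ = now_();
    }

    // Returns the cached value only while it is younger than the TTL.
    std::optional<Timestamp> getIfValid() const
    {
        if (!value_.has_value() || now_() - updated_at_ >= ttl_)
            return std::nullopt;
        return value_;
    }

private:
    std::chrono::seconds ttl_;
    NowFn now_;
    std::optional<Timestamp> value_;
    TimePoint updated_at_{};
};
```

A test would pass a lambda that returns a variable it controls, store a value, bump the variable past the TTL, and then assert that `getIfValid()` reports expiry, all without a wall-clock sleep.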

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1268, The test
uses a wall-clock sleep (std::this_thread::sleep_for(std::chrono::seconds(2)))
which makes it timing-dependent and slow; replace this with a deterministic
approach by injecting a test clock or adding an "expire now" hook so the
cache/GC safepoint logic can be advanced without sleeping — target the code
paths around getGCSafepointIfValid and the safe_point_update_interval_seconds
behavior (use a steady_clock-injectable implementation or call an exposed method
to force expiry) so the test can trigger the same cache refresh immediately and
remove the sleep.
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)

198-225: Consider whether the CacheOnly path should emit tiflash_gc_safepoint_backoff_count{type=success}.

In the CacheOnly branch no PD request is ever made, so calling observe_backoff_count(true) (with backoff_count == 0) inflates the type_success histogram with samples that do not correspond to an actual PD fetch. Previously every observation in that metric reflected a real PD interaction; after this change the vast majority of observations will come from cache-only lookups on the hot read path and the metric will mostly report zeros. If the intent of the metric is to track PD-fetch backoff, consider skipping the observation on the CacheOnly cache-hit/miss path (and also skipping it on the existing fast-path cache hit).
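
The suggestion amounts to recording a sample only on the path that actually talks to PD. A rough sketch follows, with `BackoffHistogram` and `getSafepoint` as hypothetical stand-ins for the real metric and helper; the retry loop is omitted, so `backoff_count` is always 0 here.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using Timestamp = std::uint64_t;

// Hypothetical stand-in for the success series of the backoff metric.
struct BackoffHistogram
{
    std::vector<unsigned> samples;
    void observe(unsigned backoff_count) { samples.push_back(backoff_count); }
};

// Cache hits return early and record nothing; only a real PD fetch is observed.
inline Timestamp getSafepoint(
    std::optional<Timestamp> & cached,
    const std::function<Timestamp()> & fetch_from_pd,
    BackoffHistogram & success_histogram)
{
    if (cached.has_value())
        return *cached;
    const unsigned backoff_count = 0; // a real retry loop would count backoffs here
    const Timestamp fresh = fetch_from_pd();
    success_histogram.observe(backoff_count);
    cached = fresh;
    return fresh;
}
```

With this shape, the number of samples in the histogram equals the number of PD fetches, so the metric keeps its original meaning regardless of how often the hot read path hits the cache.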

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 198 - 225,
The CacheOnly branch and the fast-path cached-hit branch currently call
observe_backoff_count(true) even though no PD request occurs; remove those calls
so the tiflash_gc_safepoint_backoff_count{type=success} metric is only observed
when an actual PD fetch/backoff happens. Specifically, in PDTiKVClient.h inside
the code paths handling GCSafepointFetchStrategy::CacheOnly and the
getGCSafepointIfValid(cache-hit) branch, eliminate the
observe_backoff_count(true) calls and ensure observe_backoff_count is only
invoked in the code path(s) that perform a real PD fetch (the fallback PD-fetch
logic). Use the existing symbols ks_gc_sp_map.getGCSafepoint,
ks_gc_sp_map.getGCSafepointIfValid, and observe_backoff_count to locate and
adjust the calls.
dbms/src/Storages/StorageDeltaMerge.cpp (1)

915-915: checkStartTs is now invoked three times per read; with CacheOnly this is effectively identical to a single call.

Previously each checkStartTs call could trigger a PD fetch when the cache was stale, so invoking it pre-read / post-read / post-snapshot gave each call an independent chance to pick up a freshly advanced safepoint. With CacheOnly all three calls read the same cached value (unless a background path races in between), so the "ensure after read" invariant the comments describe no longer adds meaningful coverage. Worth a short note in the PR or code comments that the post-read checks are kept as a defense-in-depth against a future background refresh between the two calls, rather than active safety.

Also applies to: 1011-1011, 1057-1057

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/StorageDeltaMerge.cpp` at line 915, Multiple calls to
checkStartTs(mvcc_query_info.start_ts, context, query_info.req_id, keyspace_id)
now read the same cached safepoint under CacheOnly, so the post-read invocations
no longer increase coverage; update the code comment near the checkStartTs calls
(the pre-read/post-read/post-snapshot invocations) to state explicitly that with
CacheOnly the checks observe the same cached value and that the extra calls are
retained as defense-in-depth only (to catch a rare background refresh/race),
rather than for active additional safety, so future readers understand why we
keep the redundant calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 08e9eb71-838f-4216-a4d9-5768a598805d

📥 Commits

Reviewing files that changed from the base of the PR and between 0dc254b and c3f1d3c.

📒 Files selected for processing (3)
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
  • dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp
  • dbms/src/Storages/StorageDeltaMerge.cpp

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 24, 2026

@CalvinNeo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test | c3f1d3c | link | true | /test pull-unit-test |
| pull-integration-next-gen | c3f1d3c | link | true | /test pull-integration-next-gen |
| pull-integration-test | c3f1d3c | link | true | /test pull-integration-test |
| pull-unit-next-gen | c3f1d3c | link | true | /test pull-unit-next-gen |


Labels

  • do-not-merge/needs-triage-completed
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Too frequent getGCState

1 participant