Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThis PR introduces a configurable fetch strategy for GC safepoint retrieval in TiKV storage. A new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/StorageDeltaMerge.cpp (1)
719-745:⚠️ Potential issue | 🟠 Major
checkStartTsbecomes a no-op on cache miss — this is intentional but needs explicit documentation.
checkStartTsis the safety net that rejects queries whosestart_tsis below the GC safepoint. WithGCSafepointFetchStrategy::CacheOnly,getGCSafePointWithRetryreturns0on cache miss (confirmed in PDTiKVClient.h:200-208), making the comparisonstart_ts < 0always false — the check is silently skipped.The design is intentional: background/non-query callers (SchemaSyncService with
ignore_cache=true, PrehandleSnapshot, DeltaMergeStore_InternalBg) populate the cache via PD, while query paths consume only the cached value to avoid per-query PD traffic. This is validated by explicit test coverage (CacheOnlyReadPathDoesNotFetchFromPD).However, a startup window exists: if a query arrives before any background path has populated the cache for a given keyspace (fresh TiFlash process or a new keyspace that none of the background tasks have touched yet), the safety check is bypassed entirely. This trade-off between startup safety and steady-state performance should be:
- Explicitly documented in the commit message or PR description — the current log message suggests the default behavior is preserved, which is misleading.
- Confirmed acceptable by the authors — either the startup window is guaranteed short in practice (e.g., schema sync always runs before first query), or the risk is acceptable by design.
Consider adding a note in the code or PR explaining why this trade-off is acceptable at TiFlash startup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/StorageDeltaMerge.cpp` around lines 719 - 745, The checkStartTs function uses PDClientHelper::getGCSafePointWithRetry(..., GCSafepointFetchStrategy::CacheOnly) (called from checkStartTs) which returns 0 on cache miss, effectively making the start_ts < safe_point check a no-op during cold starts; add an explicit in-code comment above checkStartTs (or adjacent to the PDClientHelper call) stating that CacheOnly yields 0 on miss, that background callers (SchemaSyncService, PrehandleSnapshot, DeltaMergeStore_InternalBg) are responsible for populating the cache, and document the startup window trade-off and why it is acceptable (or link to the PR/issue) so reviewers/readers are not misled by the current behavior/logging.
🧹 Nitpick comments (3)
dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp (1)
1268-1268: Wall-clock sleep adds mild flakiness.
sleep_for(2s)combined withsafe_point_update_interval_seconds=1is fine in practice, but makes this test timing-dependent and slow. If you ever want to make it deterministic, the cache usessteady_clockinsidegetGCSafepointIfValid, so a custom clock injection or exposing an "expire now" hook would eliminate the sleep. Not required for this PR.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp` at line 1268, The test uses a wall-clock sleep (std::this_thread::sleep_for(std::chrono::seconds(2))) which makes it timing-dependent and slow; replace this with a deterministic approach by injecting a test clock or adding an "expire now" hook so the cache/GC safepoint logic can be advanced without sleeping — target the code paths around getGCSafepointIfValid and the safe_point_update_interval_seconds behavior (use a steady_clock-injectable implementation or call an exposed method to force expiry) so the test can trigger the same cache refresh immediately and remove the sleep.dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
198-225: Consider whether the CacheOnly path should emittiflash_gc_safepoint_backoff_count{type=success}.In the
CacheOnlybranch no PD request is ever made, so callingobserve_backoff_count(true)(withbackoff_count == 0) inflates thetype_successhistogram with samples that do not correspond to an actual PD fetch. Previously every observation in that metric reflected a real PD interaction; after this change the vast majority of observations will come from cache-only lookups on the hot read path and the metric will mostly report zeros. If the intent of the metric is to track PD-fetch backoff, consider skipping the observation on theCacheOnlycache-hit/miss path (and also skipping it on the existing fast-path cache hit).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 198 - 225, The CacheOnly branch and the fast-path cached-hit branch currently call observe_backoff_count(true) even though no PD request occurs; remove those calls so the tiflash_gc_safepoint_backoff_count{type=success} metric is only observed when an actual PD fetch/backoff happens. Specifically, in PDTiKVClient.h inside the code paths handling GCSafepointFetchStrategy::CacheOnly and the getGCSafepointIfValid(cache-hit) branch, eliminate the observe_backoff_count(true) calls and ensure observe_backoff_count is only invoked in the code path(s) that perform a real PD fetch (the fallback PD-fetch logic). Use the existing symbols ks_gc_sp_map.getGCSafepoint, ks_gc_sp_map.getGCSafepointIfValid, and observe_backoff_count to locate and adjust the calls.dbms/src/Storages/StorageDeltaMerge.cpp (1)
915-915:checkStartTsis now invoked three times per read; with CacheOnly this is effectively identical to a single call.Previously each
checkStartTscall could trigger a PD fetch when the cache was stale, so invoking it pre-read / post-read / post-snapshot gave each call an independent chance to pick up a freshly advanced safepoint. WithCacheOnlyall three calls read the same cached value (unless a background path races in between), so the "ensure after read" invariant the comments describe no longer adds meaningful coverage. Worth a short note in the PR or code comments that the post-read checks are kept as a defense-in-depth against a future background refresh between the two calls, rather than active safety.Also applies to: 1011-1011, 1057-1057
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/StorageDeltaMerge.cpp` at line 915, Multiple calls to checkStartTs(mvcc_query_info.start_ts, context, query_info.req_id, keyspace_id) now read the same cached safepoint under CacheOnly, so the post-read invocations no longer increase coverage; update the code comment near the checkStartTs calls (the pre-read/post-read/post-snapshot invocations) to state explicitly that with CacheOnly the checks observe the same cached value and that the extra calls are retained as defense-in-depth only (to catch a rare background refresh/race), rather than for active additional safety, so future readers understand why we keep the redundant calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@dbms/src/Storages/StorageDeltaMerge.cpp`:
- Around line 719-745: The checkStartTs function uses
PDClientHelper::getGCSafePointWithRetry(...,
GCSafepointFetchStrategy::CacheOnly) (called from checkStartTs) which returns 0
on cache miss, effectively making the start_ts < safe_point check a no-op during
cold starts; add an explicit in-code comment above checkStartTs (or adjacent to
the PDClientHelper call) stating that CacheOnly yields 0 on miss, that
background callers (SchemaSyncService, PrehandleSnapshot,
DeltaMergeStore_InternalBg) are responsible for populating the cache, and
document the startup window trade-off and why it is acceptable (or link to the
PR/issue) so reviewers/readers are not misled by the current behavior/logging.
---
Nitpick comments:
In `@dbms/src/Storages/KVStore/tests/gtest_new_kvstore.cpp`:
- Line 1268: The test uses a wall-clock sleep
(std::this_thread::sleep_for(std::chrono::seconds(2))) which makes it
timing-dependent and slow; replace this with a deterministic approach by
injecting a test clock or adding an "expire now" hook so the cache/GC safepoint
logic can be advanced without sleeping — target the code paths around
getGCSafepointIfValid and the safe_point_update_interval_seconds behavior (use a
steady_clock-injectable implementation or call an exposed method to force
expiry) so the test can trigger the same cache refresh immediately and remove
the sleep.
In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h`:
- Around line 198-225: The CacheOnly branch and the fast-path cached-hit branch
currently call observe_backoff_count(true) even though no PD request occurs;
remove those calls so the tiflash_gc_safepoint_backoff_count{type=success}
metric is only observed when an actual PD fetch/backoff happens. Specifically,
in PDTiKVClient.h inside the code paths handling
GCSafepointFetchStrategy::CacheOnly and the getGCSafepointIfValid(cache-hit)
branch, eliminate the observe_backoff_count(true) calls and ensure
observe_backoff_count is only invoked in the code path(s) that perform a real PD
fetch (the fallback PD-fetch logic). Use the existing symbols
ks_gc_sp_map.getGCSafepoint, ks_gc_sp_map.getGCSafepointIfValid, and
observe_backoff_count to locate and adjust the calls.
In `@dbms/src/Storages/StorageDeltaMerge.cpp`:
- Line 915: Multiple calls to checkStartTs(mvcc_query_info.start_ts, context,
query_info.req_id, keyspace_id) now read the same cached safepoint under
CacheOnly, so the post-read invocations no longer increase coverage; update the
code comment near the checkStartTs calls (the pre-read/post-read/post-snapshot
invocations) to state explicitly that with CacheOnly the checks observe the same
cached value and that the extra calls are retained as defense-in-depth only (to
catch a rare background refresh/race), rather than for active additional safety,
so future readers understand why we keep the redundant calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 08e9eb71-838f-4216-a4d9-5768a598805d
📒 Files selected for processing (3)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.hdbms/src/Storages/KVStore/tests/gtest_new_kvstore.cppdbms/src/Storages/StorageDeltaMerge.cpp
|
@CalvinNeo: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #10818
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
Release Notes
Performance
Tests