metrics: improve getGCSafePointWithRetry metrics #10807
ti-chi-bot[bot] merged 7 commits into pingcap:master
Conversation
📝 Walkthrough

Added Prometheus metrics for GC safepoint requests and backoff; instrumented `PDClientHelper::getGCSafePointWithRetry` to emit labeled counters/histograms; added a configurable max-backoff setting; updated multiple call sites to pass the new setting; bumped several contrib submodules.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TiFlash as TiFlash
    participant PDClient as PD Client
    participant Backoff as Backoff
    participant Metrics as Prometheus Metrics
    TiFlash->>PDClient: getGCState()
    Note over TiFlash,PDClient: each retry attempt
    PDClient-->>TiFlash: response / error
    alt PD response OK
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="get_gc_state")
        alt gc_safe_point == 0
            TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="zero_gc_safe_point")
        else valid safe point
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="success")
        end
    else PD response error
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="pd_response_error")
    end
    alt exception during request
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="request_exception")
    end
    alt backoff needed
        TiFlash->>Backoff: bo.backoff()
        Backoff-->>TiFlash: returns / throws
        alt backoff throws
            TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="backoff_error")
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="failure")
        else backoff succeeds
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="success")
        end
    end
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@contrib/abseil-cpp`:
- Line 1: The submodule bump in contrib/abseil-cpp is unrelated to the
getGCSafePointWithRetry metrics change and must be verified or separated:
confirm that the new abseil commit (the SHA in contrib/abseil-cpp) is reachable
from an upstream Abseil branch the project tracks (not a fork/dangling ref) by
fetching upstream refs and validating ancestry, then either move the submodule
update into its own PR with justification (ABI/behavior/build impacts and
release note) or update this PR’s description/release notes to explain why the
bump is required and that downstream consumers build cleanly against it; ensure
you reference and test usages in code paths like getGCSafePointWithRetry to
validate no behavioral or build regressions.
In `@contrib/GmSSL`:
- Line 1: Add a one-paragraph explanation to contrib/GmSSL documenting why the
GmSSL submodule (commit 75155a4c3752ae11b35384e166122751285410a7) was added:
include a brief summary of the upstream GmSSL project, the cryptographic
features it provides, and exactly how and where TiFlash will use it (which
components, build integration, and runtime expectations), plus any compatibility
or licensing notes and a pointer to the integration point (e.g., the submodule
and any init/build hook that consumes it) so reviewers understand the rationale
and impact.
📒 Files selected for processing (6)
- contrib/GmSSL
- contrib/abseil-cpp
- contrib/arm-optimized-routines
- contrib/client-c
- dbms/src/Common/TiFlashMetrics.h
- dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
```diff
@@ -1 +1 @@
-Subproject commit 75155a4c3752ae11b35384e166122751285410a7
+Subproject commit 34698ddc6f79c955b41096b480389a29b0105cac
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== Submodule pointer change =="
git diff --submodule=log HEAD^ HEAD -- contrib/GmSSL || true
echo
echo "== Current submodule status =="
git submodule status contrib/GmSSL || true
echo
echo "== Commit metadata for new gitlink =="
git ls-tree HEAD contrib/GmSSL
echo
echo "== If submodule is initialized, show current checked-out commit =="
if [ -d contrib/GmSSL/.git ] || [ -f contrib/GmSSL/.git ]; then
    git -C contrib/GmSSL rev-parse HEAD
    git -C contrib/GmSSL log --oneline -n 20
else
    echo "Submodule not initialized in this environment."
fi
```

Repository: pingcap/tiflash
Length of output: 445
🏁 Script executed:
```shell
git log --oneline -n 10
```

Repository: pingcap/tiflash
Length of output: 69
🏁 Script executed:
```shell
git diff HEAD -- contrib/GmSSL | head -50
```

Repository: pingcap/tiflash
Length of output: 41
🏁 Script executed:
```shell
git show contrib/GmSSL
```

Repository: pingcap/tiflash
Length of output: 456
🏁 Script executed:
```shell
# Try to get any parent or merge commit info
git log --oneline --all -n 20
```

Repository: pingcap/tiflash
Length of output: 1460
Document the rationale for adding the GmSSL dependency.
This PR adds GmSSL as a new submodule (commit 75155a4c3752ae11b35384e166122751285410a7), but includes no explanation of why it's needed or how it integrates with TiFlash. Include a summary of the upstream project and its intended use before merge.
/retest

/retest

[LGTM Timeline notifier] Timeline:
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
207-275: ⚠️ Potential issue | 🔴 Critical

Rethrow from the inner catch is caught by the outer catch — `type_backoff_error` and the failure histogram are double-counted, and `bo.backoff` is invoked twice.

The inner `try/catch` at lines 225–237 sits inside the outer `try` at line 208. When `bo.backoff` on lines 228–230 throws, the inner catch:

- increments `tiflash_gc_safepoint_request_count{type=backoff_error}`,
- observes `tiflash_gc_safepoint_backoff_count{type=failure}`,
- `throw;` re-raises the exception.

Per standard C++ semantics, a `throw;` from a catch block that is itself nested inside an outer `try` propagates to that outer `try`'s catch. So control transfers to the outer `catch (pingcap::Exception &)` at line 260, which then:

- skips `type_request_exception` (thanks to `has_pd_response_error`), but
- executes `++backoff_count` and calls `bo.backoff` again on an already-exhausted `Backoffer` (lines 266–267), which throws immediately,
- the inner catch at lines 269–274 increments `type_backoff_error` and observes `type_failure` a second time with a different `backoff_count`, then rethrows.

Net effect on the PD-response-error-then-backoff-exhausted path:

- `type_backoff_error` counter: +2 per real failure.
- `tiflash_gc_safepoint_backoff_count{type=failure}` histogram: two observations (N and N+1) per real failure.
- `bo.backoff` is invoked twice unnecessarily.

`has_pd_response_error` only gates `type_request_exception`; it does not gate the outer backoff call or the outer `type_backoff_error` increment, which is where the duplication comes from.

🔧 One possible fix: do not re-run the backoff in the outer catch when the PD-response-error branch already performed (and failed) it

```diff
 try
 {
     // Fetch the gc safepoint from PD.
     // - When deployed with classic cluster, the gc safepoint is cluster-based, keyspace_id=NullspaceID.
     // - When deployed with next-gen cluster, the gc safepoint is keyspace-based.
     GET_METRIC(tiflash_gc_safepoint_request_count, type_get_gc_state).Increment();
     auto gc_state = pd_client->getGCState(keyspace_id);
     if (unlikely(gc_state.header().error().type() != pdpb::ErrorType::OK))
     {
-        has_pd_response_error = true;
         GET_METRIC(tiflash_gc_safepoint_request_count, type_pd_response_error).Increment();
         LOG_WARNING(
             Logger::get(),
             "getGCSafePointWithRetry keyspace={} message={} resp={}",
             keyspace_id,
             gc_state.header().error().message(),
             gc_state.ShortDebugString());
-        try
-        {
-            ++backoff_count;
-            bo.backoff(
-                pingcap::kv::boPDRPC,
-                pingcap::Exception(gc_state.header().error().message(), pingcap::ErrorCodes::InternalError));
-        }
-        catch (pingcap::Exception &)
-        {
-            GET_METRIC(tiflash_gc_safepoint_request_count, type_backoff_error).Increment();
-            observe_backoff_count(false);
-            throw;
-        }
+        has_pd_response_error = true;
+        // Throw so the outer catch handles backoff + metrics exactly once.
+        throw pingcap::Exception(
+            gc_state.header().error().message(),
+            pingcap::ErrorCodes::InternalError);
-        continue; // retry
     }
```

This funnels all backoff accounting through the outer catch, where the existing `has_pd_response_error` flag already correctly suppresses `type_request_exception`. Alternative: keep the inner try/catch but replace `throw;` with `continue;`/`break;` semantics and don't also run the outer backoff for the same iteration.
has_pd_response_errorflag already correctly suppressestype_request_exception. Alternative: keep the inner try/catch but replacethrow;withcontinue;/break;semantics and don't also run the outer backoff for the same iteration.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 207 - 275, The inner catch inside the PD-response-error branch (the catch that currently does GET_METRIC(..., type_backoff_error), observe_backoff_count(false) and then throw;) should not rethrow because that causes the outer catch to run the backoff and metrics again; instead, after recording the backoff error and observing failure, stop propagating the exception for this iteration (use continue to retry the loop) so bo.backoff is not invoked twice. Update the catch in the block surrounding pd_client->getGCState(keyspace_id) / bo.backoff(pingcap::kv::boPDRPC, pingcap::Exception(...)) to replace the rethrow with a loop continue (keeping the GET_METRIC and observe_backoff_count(false) calls), so backoff_count, bo.backoff, and type_backoff_error are only handled once per failure.
🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
180-186: Optional: fold backoff counting into the lambda to prevent drift between the two call sites.

`++backoff_count` is manually duplicated at lines 227 and 266, which is easy to get out of sync with the histogram observation. Consider letting the lambda own both the counter and the histogram observation, e.g.:

```diff
-    UInt64 backoff_count = 0;
-    auto observe_backoff_count = [&](bool success) {
-        if (success)
-            GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success).Observe(backoff_count);
-        else
-            GET_METRIC(tiflash_gc_safepoint_backoff_count, type_failure).Observe(backoff_count);
-    };
+    UInt64 backoff_count = 0;
+    auto observe_success = [&] {
+        GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success).Observe(backoff_count);
+    };
+    auto observe_failure = [&] {
+        GET_METRIC(tiflash_gc_safepoint_backoff_count, type_failure).Observe(backoff_count);
+    };
```

Purely a readability/maintenance improvement; defer if you prefer the current shape.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 180 - 186, The backoff counter increment is duplicated outside the lambda (backoff_count with ++ at the call sites) and can drift from the histogram observation; modify the observe_backoff_count lambda (currently capturing backoff_count) so it owns the counter and increments it internally before calling GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success/type_failure). Then remove the external ++backoff_count calls at the sites that call observe_backoff_count(false/true) so callers just call observe_backoff_count(success) and the lambda both increments the counter and records the histogram.
📒 Files selected for processing (2)
- dbms/src/Common/TiFlashMetrics.h
- dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
🚧 Files skipped from review as they are similar to previous changes (1)
- dbms/src/Common/TiFlashMetrics.h
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
187-198: ⚠️ Potential issue | 🟡 Minor

Keep cache hits out of the backoff histogram.

On a cache hit, no PD request or backoff happens, but line 197 records a successful `backoff_count=0`. In high-frequency paths this can swamp the histogram and hide the retry distribution. Either skip this observation or add a separate cache-hit metric.

📊 Proposed adjustment

```diff
 if (ks_gc_info.has_value())
 {
     // Still valid, return the cached gc safepoint
-    observe_backoff_count(true);
     return ks_gc_info->gc_safepoint;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 187 - 198, The code records a backoff observation on cache hits by calling observe_backoff_count(true) inside the ks_gc_sp_map.getGCSafepointIfValid cache-hit branch; remove or change this so cache hits do not contribute to the backoff histogram: either delete the observe_backoff_count(true) call from the cache-hit return path in PDTiKVClient (where ks_gc_sp_map.getGCSafepointIfValid(...) returns a value) so only real PD request paths call observe_backoff_count, or replace it with a distinct cache-hit metric (e.g., observe_gc_safepoint_cache_hit) and increment that instead; ensure observe_backoff_count remains invoked only in code paths that actually perform PD requests/retries.
📒 Files selected for processing (7)
- dbms/src/Debug/dbgFuncSchema.cpp
- dbms/src/Interpreters/Settings.h
- dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp
- dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp
- dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
- dbms/src/Storages/StorageDeltaMerge.cpp
- dbms/src/TiDB/Schema/SchemaSyncService.cpp
```diff
 M(SettingUInt64, mpp_task_running_timeout, DEFAULT_MPP_TASK_RUNNING_TIMEOUT, "mpp task max time that running without any progress.") \
 M(SettingUInt64, mpp_task_waiting_timeout, DEFAULT_MPP_TASK_WAITING_TIMEOUT, "mpp task max time that waiting first data block from source input stream.") \
 M(SettingInt64, safe_point_update_interval_seconds, 1, "The interval in seconds to update safe point from PD.") \
+M(SettingInt64, safe_point_get_max_backoff_ms, 120000, "Max total backoff sleep time in milliseconds when fetching GC safe point from PD.") \
```
Reject negative max-backoff values instead of accepting them silently.
`SettingInt64` has no bounds validation, so `safe_point_get_max_backoff_ms = -1` is accepted and later clamped to 0 in `PDClientHelper::getGCSafePointWithRetry`, effectively disabling retries for transient PD errors. Consider validating this setting or switching the setting/API to a non-negative type with an explicit upper bound.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dbms/src/Interpreters/Settings.h` at line 58, The setting
safe_point_get_max_backoff_ms (declared with SettingInt64) can be set negative
and later clamped to 0 in PDClientHelper::getGCSafePointWithRetry, which
silently disables retries; change the setting to validate and reject negative
values (or switch to a non-negative type) by updating the SettingInt64
registration for safe_point_get_max_backoff_ms to enforce a lower bound of 0
(and optional sensible upper bound) or replace it with an unsigned/explicit
non-negative setting type, and ensure any setter/path that reads this setting
rejects/throws on negative input so PDClientHelper can assume a non-negative
value.
/retest

/test pull-unit-test

/retest

/merge
@CalvinNeo: We have migrated to builtin 👉 Please use

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/approve

@kolafish: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details: In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: CalvinNeo, JaySon-Huang, JinheLin, kolafish. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@CalvinNeo: The following test failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@wuhuizuo: You cannot manually add or delete the reviewing state labels, only I and the trusted members have permission to do so.

Details: In response to adding label named approved. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
close #10808

### What problem does this PR solve?

Issue Number: close #10808

Problem Summary:

### What is changed and how it works?

```commit-message
```

### Check List

Tests <!-- At least one of them must be included. -->

- [ ] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No code

Side effects

- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility

Documentation

- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility

### Release note <!-- bugfix or new feature needs a release note -->

```release-note
None
```

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>