
metrics: improve getGCSafePointWithRetry metrics#10807

Merged
ti-chi-bot[bot] merged 7 commits into pingcap:master from CalvinNeo:add-metrics-tici
Apr 22, 2026

Conversation

@CalvinNeo (Member) commented Apr 22, 2026

What problem does this PR solve?

Issue Number: close #10808

Problem Summary:

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • Chores

    • Updated client-C submodule to a newer revision.
  • New Features

    • Added Prometheus metrics to track GC safepoint activity (request types, PD response errors, request exceptions, zero safe-point events, and backoff outcomes).
    • GC safe-point retrieval now honors a new configurable maximum backoff timeout (setting added and applied to relevant safe-point fetch paths).

f
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@ti-chi-bot Bot added labels on Apr 22, 2026: do-not-merge/needs-linked-issue; release-note-none (denotes a PR that doesn't merit a release note); size/S (denotes a PR that changes 10-29 lines, ignoring generated files).

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Added Prometheus metrics for GC safepoint requests and backoff; instrumented PDClientHelper::getGCSafePointWithRetry to emit labeled counters/histograms and added a configurable max backoff setting; updated multiple call sites to pass the new setting; bumped contrib/client-c submodule reference.

Changes

- Submodule Update (`contrib/client-c`): Bumped submodule commit reference; no visible source/API changes in this diff.
- Metrics Definition (`dbms/src/Common/TiFlashMetrics.h`): Added tiflash_gc_safepoint_request_count (Counter) and tiflash_gc_safepoint_backoff_count (Histogram) families with labeled variants and registered them in the metrics list.
- PD safe-point helper (`dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h`): Extended getGCSafePointWithRetry signature to accept safe_point_get_max_backoff_ms; instrumented the retry loop to increment labeled counters (get_gc_state, zero_gc_safe_point, pd_response_error, request_exception, backoff_error) and observe the backoff histogram (success/failure) around backoff attempts.
- Settings (`dbms/src/Interpreters/Settings.h`): Added safe_point_get_max_backoff_ms (SettingInt64) with default 120000.
- Call-site updates, passing the new setting (`dbms/src/Debug/dbgFuncSchema.cpp`, `dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp`, `dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp`, `dbms/src/Storages/StorageDeltaMerge.cpp`, `dbms/src/TiDB/Schema/SchemaSyncService.cpp`): Updated calls to PDClientHelper::getGCSafePointWithRetry(...) to pass the new safe_point_get_max_backoff_ms setting (and related explicit parameters where applicable).

Sequence Diagram(s)

sequenceDiagram
    participant TiFlash as TiFlash
    participant PDClient as PD Client
    participant Backoff as Backoff
    participant Metrics as Prometheus Metrics

    TiFlash->>PDClient: getGCState()
    Note over TiFlash,PDClient: each retry attempt
    PDClient-->>TiFlash: response / error
    alt PD response OK
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="get_gc_state")
        alt gc_safe_point == 0
            TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="zero_gc_safe_point")
        else valid safe point
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="success")
        end
    else PD response error
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="pd_response_error")
    end
    alt exception during request
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="request_exception")
    end
    alt backoff needed
        TiFlash->>Backoff: bo.backoff()
        Backoff-->>TiFlash: returns / throws
        alt backoff throws
            TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="backoff_error")
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="failure")
        else backoff succeeds
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="success")
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

approved

Suggested reviewers

  • JaySon-Huang
  • JinheLin

Poem

🐰 I hopped through retries, metrics in paw,

Counting safepoints, backoffs, and flaw.
A ping to PD, a patient wait—
I tally each try, both small and great.
🥕📈

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

- Description check (⚠️ Warning): The PR description is largely incomplete; the problem summary is empty, the commit-message section is blank, and no tests are marked. Only the linked issue number (#10808) is provided. Resolution: add details to the Problem Summary field, provide a clear commit message explaining the changes, and explicitly mark which tests (Unit, Integration, Manual, or No code) were performed.
- Docstring Coverage (⚠️ Warning): Docstring coverage is 37.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
- Linked Issues check (❓ Inconclusive): The linked issue #10808 contains only template placeholders, with no actual problem description, reproduction steps, expected behavior, or observed behavior provided by the reporter. Resolution: clarify in the PR description or issue what specific problem or requirement is being addressed, as the linked issue lacks sufficient detail to validate the implementation.

✅ Passed checks (2 passed)

- Title check (✅ Passed): The title 'metrics: improve getGCSafePointWithRetry metrics' clearly and specifically summarizes the main change: enhancement of metrics for the getGCSafePointWithRetry function.
- Out of Scope Changes check (✅ Passed): All code changes are directly related to metrics instrumentation for getGCSafePointWithRetry; no unrelated modifications or functionality changes outside the stated objective are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


a
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@contrib/abseil-cpp`:
- Line 1: The submodule bump in contrib/abseil-cpp is unrelated to the
getGCSafePointWithRetry metrics change and must be verified or separated:
confirm that the new abseil commit (the SHA in contrib/abseil-cpp) is reachable
from an upstream Abseil branch the project tracks (not a fork/dangling ref) by
fetching upstream refs and validating ancestry, then either move the submodule
update into its own PR with justification (ABI/behavior/build impacts and
release note) or update this PR’s description/release notes to explain why the
bump is required and that downstream consumers build cleanly against it; ensure
you reference and test usages in code paths like getGCSafePointWithRetry to
validate no behavioral or build regressions.

In `@contrib/GmSSL`:
- Line 1: Add a one-paragraph explanation to contrib/GmSSL documenting why the
GmSSL submodule (commit 75155a4c3752ae11b35384e166122751285410a7) was added:
include a brief summary of the upstream GmSSL project, the cryptographic
features it provides, and exactly how and where TiFlash will use it (which
components, build integration, and runtime expectations), plus any compatibility
or licensing notes and a pointer to the integration point (e.g., the submodule
and any init/build hook that consumes it) so reviewers understand the rationale
and impact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0092703e-6f5a-4107-a3ea-991c0cf7b770

📥 Commits

Reviewing files that changed from the base of the PR and between 45331bf and ec2acdf.

📒 Files selected for processing (6)
  • contrib/GmSSL
  • contrib/abseil-cpp
  • contrib/arm-optimized-routines
  • contrib/client-c
  • dbms/src/Common/TiFlashMetrics.h
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h

Comment thread contrib/abseil-cpp Outdated
Comment thread contrib/GmSSL Outdated

@@ -1 +1 @@
-Subproject commit 75155a4c3752ae11b35384e166122751285410a7
+Subproject commit 34698ddc6f79c955b41096b480389a29b0105cac

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Submodule pointer change =="
git diff --submodule=log HEAD^ HEAD -- contrib/GmSSL || true

echo
echo "== Current submodule status =="
git submodule status contrib/GmSSL || true

echo
echo "== Commit metadata for new gitlink =="
git ls-tree HEAD contrib/GmSSL

echo
echo "== If submodule is initialized, show current checked-out commit =="
if [ -d contrib/GmSSL/.git ] || [ -f contrib/GmSSL/.git ]; then
  git -C contrib/GmSSL rev-parse HEAD
  git -C contrib/GmSSL log --oneline -n 20
else
  echo "Submodule not initialized in this environment."
fi

Repository: pingcap/tiflash

Length of output: 445


🏁 Script executed:

git log --oneline -n 10

Repository: pingcap/tiflash

Length of output: 69


🏁 Script executed:

git diff HEAD -- contrib/GmSSL | head -50

Repository: pingcap/tiflash

Length of output: 41


🏁 Script executed:

git show contrib/GmSSL

Repository: pingcap/tiflash

Length of output: 456


🏁 Script executed:

# Try to get any parent or merge commit info
git log --oneline --all -n 20

Repository: pingcap/tiflash

Length of output: 1460


Document the rationale for adding the GmSSL dependency.

This PR adds GmSSL as a new submodule (commit 75155a4c3752ae11b35384e166122751285410a7), but includes no explanation of why it's needed or how it integrates with TiFlash. Include a summary of the upstream project and its intended use before merge.


a
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@CalvinNeo
Member Author

/retest

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 22, 2026
@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Apr 22, 2026
@CalvinNeo
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 22, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-22 06:50:19.69754576 +0000 UTC m=+2148624.902905817: ☑️ agreed by JinheLin.
  • 2026-04-22 07:18:17.002855852 +0000 UTC m=+2150302.208215908: ☑️ agreed by JaySon-Huang.

d
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)

207-275: ⚠️ Potential issue | 🔴 Critical

Rethrow from inner catch is caught by the outer catch — type_backoff_error and failure histogram are double-counted, and bo.backoff is invoked twice.

The inner try/catch at lines 225–237 sits inside the outer try at line 208. When bo.backoff on lines 228–230 throws, the inner catch:

  1. increments tiflash_gc_safepoint_request_count{type=backoff_error},
  2. observes tiflash_gc_safepoint_backoff_count{type=failure},
  3. throw; re-raises the exception.

Per standard C++ semantics, a throw; from a catch block that is itself nested inside an outer try propagates to that outer try's catch. So control transfers to the outer catch (pingcap::Exception &) at line 260, which then:

  • skips type_request_exception (thanks to has_pd_response_error), but
  • executes ++backoff_count and calls bo.backoff again on an already-exhausted Backoffer (lines 266–267), which throws immediately,
  • the inner catch at lines 269–274 increments type_backoff_error and observes type_failure a second time with a different backoff_count, then rethrows.

Net effect on the PD-response-error-then-backoff-exhausted path:

  • type_backoff_error counter: +2 per real failure.
  • tiflash_gc_safepoint_backoff_count{type=failure} histogram: two observations (N and N+1) per real failure.
  • bo.backoff is invoked twice unnecessarily.

has_pd_response_error only gates type_request_exception; it does not gate the outer backoff call or the outer type_backoff_error increment, which is where the duplication comes from.

🔧 One possible fix: do not re-run the backoff in the outer catch when the PD-response-error branch already performed (and failed) it
             try
             {
                 // Fetch the gc safepoint from PD.
                 // - When deployed with classic cluster, the gc safepoint is cluster-based, keyspace_id=NullspaceID.
                 // - When deployed with next-gen cluster, the gc safepoint is keyspace-based.
                 GET_METRIC(tiflash_gc_safepoint_request_count, type_get_gc_state).Increment();
                 auto gc_state = pd_client->getGCState(keyspace_id);
                 if (unlikely(gc_state.header().error().type() != pdpb::ErrorType::OK))
                 {
-                    has_pd_response_error = true;
                     GET_METRIC(tiflash_gc_safepoint_request_count, type_pd_response_error).Increment();
                     LOG_WARNING(
                         Logger::get(),
                         "getGCSafePointWithRetry keyspace={} message={} resp={}",
                         keyspace_id,
                         gc_state.header().error().message(),
                         gc_state.ShortDebugString());
-                    try
-                    {
-                        ++backoff_count;
-                        bo.backoff(
-                            pingcap::kv::boPDRPC,
-                            pingcap::Exception(gc_state.header().error().message(), pingcap::ErrorCodes::InternalError));
-                    }
-                    catch (pingcap::Exception &)
-                    {
-                        GET_METRIC(tiflash_gc_safepoint_request_count, type_backoff_error).Increment();
-                        observe_backoff_count(false);
-                        throw;
-                    }
+                    has_pd_response_error = true;
+                    // Throw so the outer catch handles backoff + metrics exactly once.
+                    throw pingcap::Exception(
+                        gc_state.header().error().message(),
+                        pingcap::ErrorCodes::InternalError);
-                    continue; // retry
                 }

This funnels all backoff accounting through the outer catch, where the existing has_pd_response_error flag already correctly suppresses type_request_exception. Alternative: keep the inner try/catch but replace throw; with continue;/break; semantics and don't also run the outer backoff for the same iteration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 207 - 275,
The inner catch inside the PD-response-error branch (the catch that currently
does GET_METRIC(..., type_backoff_error), observe_backoff_count(false) and then
throw;) should not rethrow because that causes the outer catch to run the
backoff and metrics again; instead, after recording the backoff error and
observing failure, stop propagating the exception for this iteration (use
continue to retry the loop) so bo.backoff is not invoked twice. Update the catch
in the block surrounding pd_client->getGCState(keyspace_id) /
bo.backoff(pingcap::kv::boPDRPC, pingcap::Exception(...)) to replace the rethrow
with a loop continue (keeping the GET_METRIC and observe_backoff_count(false)
calls), so backoff_count, bo.backoff, and type_backoff_error are only handled
once per failure.
🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)

180-186: Optional: fold backoff counting into the lambda to prevent drift between the two call sites.

++backoff_count is manually duplicated at lines 227 and 266, which is easy to get out of sync with the histogram observation. Consider letting the lambda own both the counter and the histogram observation, e.g.:

-        UInt64 backoff_count = 0;
-        auto observe_backoff_count = [&](bool success) {
-            if (success)
-                GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success).Observe(backoff_count);
-            else
-                GET_METRIC(tiflash_gc_safepoint_backoff_count, type_failure).Observe(backoff_count);
-        };
+        UInt64 backoff_count = 0;
+        auto observe_success = [&] {
+            GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success).Observe(backoff_count);
+        };
+        auto observe_failure = [&] {
+            GET_METRIC(tiflash_gc_safepoint_backoff_count, type_failure).Observe(backoff_count);
+        };

Purely a readability/maintenance improvement; defer if you prefer the current shape.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 180 - 186,
The backoff counter increment is duplicated outside the lambda (backoff_count
with ++ at the call sites) and can drift from the histogram observation; modify
the observe_backoff_count lambda (currently capturing backoff_count) so it owns
the counter and increments it internally before calling
GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success/type_failure). Then
remove the external ++backoff_count calls at the sites that call
observe_backoff_count(false/true) so callers just call
observe_backoff_count(success) and the lambda both increments the counter and
records the histogram.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8a1ae2e3-a383-4118-be73-b0fcb1c533e8

📥 Commits

Reviewing files that changed from the base of the PR and between 4247c2c and 2faa6cf.

📒 Files selected for processing (2)
  • dbms/src/Common/TiFlashMetrics.h
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • dbms/src/Common/TiFlashMetrics.h

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed approved size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 22, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)

187-198: ⚠️ Potential issue | 🟡 Minor

Keep cache hits out of the backoff histogram.

On a cache hit, no PD request or backoff happens, but Line 197 records a successful backoff_count=0. In high-frequency paths this can swamp the histogram and hide the retry distribution. Either skip this observation or add a separate cache-hit metric.

📊 Proposed adjustment
             if (ks_gc_info.has_value())
             {
                 // Still valid, return the cached gc safepoint
-                observe_backoff_count(true);
                 return ks_gc_info->gc_safepoint;
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 187 - 198,
The code records a backoff observation on cache hits by calling
observe_backoff_count(true) inside the ks_gc_sp_map.getGCSafepointIfValid
cache-hit branch; remove or change this so cache hits do not contribute to the
backoff histogram: either delete the observe_backoff_count(true) call from the
cache-hit return path in PDTiKVClient (where
ks_gc_sp_map.getGCSafepointIfValid(...) returns a value) so only real PD request
paths call observe_backoff_count, or replace it with a distinct cache-hit metric
(e.g., observe_gc_safepoint_cache_hit) and increment that instead; ensure
observe_backoff_count remains invoked only in code paths that actually perform
PD requests/retries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: be2cdec2-35f9-428d-aa49-b27c21a7562c

📥 Commits

Reviewing files that changed from the base of the PR and between 2faa6cf and 15f2783.

📒 Files selected for processing (7)
  • dbms/src/Debug/dbgFuncSchema.cpp
  • dbms/src/Interpreters/Settings.h
  • dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp
  • dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp
  • dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
  • dbms/src/Storages/StorageDeltaMerge.cpp
  • dbms/src/TiDB/Schema/SchemaSyncService.cpp

M(SettingUInt64, mpp_task_running_timeout, DEFAULT_MPP_TASK_RUNNING_TIMEOUT, "mpp task max time that running without any progress.") \
M(SettingUInt64, mpp_task_waiting_timeout, DEFAULT_MPP_TASK_WAITING_TIMEOUT, "mpp task max time that waiting first data block from source input stream.") \
M(SettingInt64, safe_point_update_interval_seconds, 1, "The interval in seconds to update safe point from PD.") \
M(SettingInt64, safe_point_get_max_backoff_ms, 120000, "Max total backoff sleep time in milliseconds when fetching GC safe point from PD.") \


⚠️ Potential issue | 🟡 Minor

Reject negative max-backoff values instead of accepting them silently.

SettingInt64 has no bounds validation, so safe_point_get_max_backoff_ms = -1 is accepted and later clamped to 0 in PDClientHelper::getGCSafePointWithRetry, effectively disabling retries for transient PD errors. Consider validating this setting or switching the setting/API to a non-negative type with an explicit upper bound.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dbms/src/Interpreters/Settings.h` at line 58, The setting
safe_point_get_max_backoff_ms (declared with SettingInt64) can be set negative
and later clamped to 0 in PDClientHelper::getGCSafePointWithRetry, which
silently disables retries; change the setting to validate and reject negative
values (or switch to a non-negative type) by updating the SettingInt64
registration for safe_point_get_max_backoff_ms to enforce a lower bound of 0
(and optional sensible upper bound) or replace it with an unsigned/explicit
non-negative setting type, and ensure any setter/path that reads this setting
rejects/throws on negative input so PDClientHelper can assume a non-negative
value.

@wuhuizuo
Contributor

/retest

@wuhuizuo
Contributor

/test pull-unit-test

@wuhuizuo
Contributor

/retest

a
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@CalvinNeo
Member Author

/merge

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

@CalvinNeo: We have migrated to builtin LGTM and approve plugins for reviewing.

👉 Please use /approve when you want to approve this pull request.

The changes announcement: LGTM plugin changes

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@CalvinNeo
Member Author

/approve

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

@kolafish: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CalvinNeo, JaySon-Huang, JinheLin, kolafish
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

@CalvinNeo: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

- Test: pull-sanitizer-tsan | Commit: 606c2e6 | Required: false | Rerun command: /test pull-sanitizer-tsan


Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 22, 2026

@wuhuizuo: You cannot manually add or delete the reviewing state labels; only I and the trusted members have permission to do so.

Details

In response to adding label named approved.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot Bot merged commit c6a40dd into pingcap:master Apr 22, 2026
9 of 10 checks passed
CalvinNeo added a commit that referenced this pull request Apr 23, 2026
close #10808

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

### What is changed and how it works?

```commit-message

```

### Check List

Tests <!-- At least one of them must be included. -->

- [ ] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No code

Side effects

- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility

Documentation

- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility

### Release note

<!-- bugfix or new feature needs a release note -->

```release-note
None
```

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
@coderabbitai coderabbitai Bot mentioned this pull request Apr 24, 2026
12 tasks

Labels

approved; lgtm; release-note-none (denotes a PR that doesn't merit a release note); severity/minor; size/L (denotes a PR that changes 100-499 lines, ignoring generated files)


Development

Successfully merging this pull request may close these issues.

Could getGCSafePointWithRetry call too frequent?

6 participants