metrics: improve getGCSafePointWithRetry metrics #10807
ti-chi-bot[bot] merged 7 commits into pingcap:master
Conversation
📝 Walkthrough

Added Prometheus metrics for GC safepoint requests and backoff; instrumented `PDClientHelper::getGCSafePointWithRetry` to emit labeled counters/histograms; added a configurable max-backoff setting; updated multiple call sites to pass the new setting; bumped several contrib submodules.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant TiFlash as TiFlash
    participant PDClient as PD Client
    participant Backoff as Backoff
    participant Metrics as Prometheus Metrics
    TiFlash->>PDClient: getGCState()
    Note over TiFlash,PDClient: each retry attempt
    PDClient-->>TiFlash: response / error
    alt PD response OK
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="get_gc_state")
        alt gc_safe_point == 0
            TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="zero_gc_safe_point")
        else valid safe point
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="success")
        end
    else PD response error
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="pd_response_error")
    end
    alt exception during request
        TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="request_exception")
    end
    alt backoff needed
        TiFlash->>Backoff: bo.backoff()
        Backoff-->>TiFlash: returns / throws
        alt backoff throws
            TiFlash->>Metrics: increment tiflash_gc_safepoint_request_count(type="backoff_error")
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="failure")
        else backoff succeeds
            TiFlash->>Metrics: observe tiflash_gc_safepoint_backoff_count(type="success")
        end
    end
```
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@contrib/abseil-cpp`:
- Line 1: The submodule bump in contrib/abseil-cpp is unrelated to the
getGCSafePointWithRetry metrics change and must be verified or separated:
confirm that the new abseil commit (the SHA in contrib/abseil-cpp) is reachable
from an upstream Abseil branch the project tracks (not a fork/dangling ref) by
fetching upstream refs and validating ancestry, then either move the submodule
update into its own PR with justification (ABI/behavior/build impacts and
release note) or update this PR’s description/release notes to explain why the
bump is required and that downstream consumers build cleanly against it; ensure
you reference and test usages in code paths like getGCSafePointWithRetry to
validate no behavioral or build regressions.
In `@contrib/GmSSL`:
- Line 1: Add a one-paragraph explanation to contrib/GmSSL documenting why the
GmSSL submodule (commit 75155a4c3752ae11b35384e166122751285410a7) was added:
include a brief summary of the upstream GmSSL project, the cryptographic
features it provides, and exactly how and where TiFlash will use it (which
components, build integration, and runtime expectations), plus any compatibility
or licensing notes and a pointer to the integration point (e.g., the submodule
and any init/build hook that consumes it) so reviewers understand the rationale
and impact.
📒 Files selected for processing (6)
- contrib/GmSSL
- contrib/abseil-cpp
- contrib/arm-optimized-routines
- contrib/client-c
- dbms/src/Common/TiFlashMetrics.h
- dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
```diff
@@ -1 +1 @@
-Subproject commit 75155a4c3752ae11b35384e166122751285410a7
+Subproject commit 34698ddc6f79c955b41096b480389a29b0105cac
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== Submodule pointer change =="
git diff --submodule=log HEAD^ HEAD -- contrib/GmSSL || true
echo
echo "== Current submodule status =="
git submodule status contrib/GmSSL || true
echo
echo "== Commit metadata for new gitlink =="
git ls-tree HEAD contrib/GmSSL
echo
echo "== If submodule is initialized, show current checked-out commit =="
if [ -d contrib/GmSSL/.git ] || [ -f contrib/GmSSL/.git ]; then
    git -C contrib/GmSSL rev-parse HEAD
    git -C contrib/GmSSL log --oneline -n 20
else
    echo "Submodule not initialized in this environment."
fi
```

Repository: pingcap/tiflash
Length of output: 445
🏁 Script executed:
```shell
git log --oneline -n 10
```

Repository: pingcap/tiflash
Length of output: 69
🏁 Script executed:
```shell
git diff HEAD -- contrib/GmSSL | head -50
```

Repository: pingcap/tiflash
Length of output: 41
🏁 Script executed:
```shell
git show contrib/GmSSL
```

Repository: pingcap/tiflash
Length of output: 456
🏁 Script executed:
```shell
# Try to get any parent or merge commit info
git log --oneline --all -n 20
```

Repository: pingcap/tiflash
Length of output: 1460
Document the rationale for adding the GmSSL dependency.
This PR adds GmSSL as a new submodule (commit 75155a4c3752ae11b35384e166122751285410a7), but includes no explanation of why it's needed or how it integrates with TiFlash. Include a summary of the upstream project and its intended use before merge.
/retest

/retest

[LGTM Timeline notifier] Timeline:
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
207-275: ⚠️ Potential issue | 🔴 Critical

Rethrow from the inner catch is caught by the outer catch — `type_backoff_error` and the failure histogram are double-counted, and `bo.backoff` is invoked twice.

The inner `try/catch` at lines 225–237 sits inside the outer `try` at line 208. When `bo.backoff` on lines 228–230 throws, the inner catch:

- increments `tiflash_gc_safepoint_request_count{type=backoff_error}`,
- observes `tiflash_gc_safepoint_backoff_count{type=failure}`,
- `throw;` re-raises the exception.

Per standard C++ semantics, a `throw;` from a catch block that is itself nested inside an outer `try` propagates to that outer `try`'s catch. So control transfers to the outer `catch (pingcap::Exception &)` at line 260, which then:

- skips `type_request_exception` (thanks to `has_pd_response_error`), but
- executes `++backoff_count` and calls `bo.backoff` again on an already-exhausted `Backoffer` (lines 266–267), which throws immediately,
- the inner catch at lines 269–274 increments `type_backoff_error` and observes `type_failure` a second time with a different `backoff_count`, then rethrows.

Net effect on the PD-response-error-then-backoff-exhausted path:

- `type_backoff_error` counter: +2 per real failure.
- `tiflash_gc_safepoint_backoff_count{type=failure}` histogram: two observations (N and N+1) per real failure.
- `bo.backoff` is invoked twice unnecessarily.

`has_pd_response_error` only gates `type_request_exception`; it does not gate the outer backoff call or the outer `type_backoff_error` increment, which is where the duplication comes from.

🔧 One possible fix: do not re-run the backoff in the outer catch when the PD-response-error branch already performed (and failed) it

```diff
 try
 {
     // Fetch the gc safepoint from PD.
     // - When deployed with classic cluster, the gc safepoint is cluster-based, keyspace_id=NullspaceID.
     // - When deployed with next-gen cluster, the gc safepoint is keyspace-based.
     GET_METRIC(tiflash_gc_safepoint_request_count, type_get_gc_state).Increment();
     auto gc_state = pd_client->getGCState(keyspace_id);
     if (unlikely(gc_state.header().error().type() != pdpb::ErrorType::OK))
     {
-        has_pd_response_error = true;
         GET_METRIC(tiflash_gc_safepoint_request_count, type_pd_response_error).Increment();
         LOG_WARNING(
             Logger::get(),
             "getGCSafePointWithRetry keyspace={} message={} resp={}",
             keyspace_id,
             gc_state.header().error().message(),
             gc_state.ShortDebugString());
-        try
-        {
-            ++backoff_count;
-            bo.backoff(
-                pingcap::kv::boPDRPC,
-                pingcap::Exception(gc_state.header().error().message(), pingcap::ErrorCodes::InternalError));
-        }
-        catch (pingcap::Exception &)
-        {
-            GET_METRIC(tiflash_gc_safepoint_request_count, type_backoff_error).Increment();
-            observe_backoff_count(false);
-            throw;
-        }
+        has_pd_response_error = true;
+        // Throw so the outer catch handles backoff + metrics exactly once.
+        throw pingcap::Exception(
+            gc_state.header().error().message(),
+            pingcap::ErrorCodes::InternalError);
-        continue; // retry
     }
```

This funnels all backoff accounting through the outer catch, where the existing `has_pd_response_error` flag already correctly suppresses `type_request_exception`. Alternative: keep the inner try/catch but replace `throw;` with `continue;`/`break;` semantics and don't also run the outer backoff for the same iteration.
has_pd_response_errorflag already correctly suppressestype_request_exception. Alternative: keep the inner try/catch but replacethrow;withcontinue;/break;semantics and don't also run the outer backoff for the same iteration.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 207 - 275, The inner catch inside the PD-response-error branch (the catch that currently does GET_METRIC(..., type_backoff_error), observe_backoff_count(false) and then throw;) should not rethrow because that causes the outer catch to run the backoff and metrics again; instead, after recording the backoff error and observing failure, stop propagating the exception for this iteration (use continue to retry the loop) so bo.backoff is not invoked twice. Update the catch in the block surrounding pd_client->getGCState(keyspace_id) / bo.backoff(pingcap::kv::boPDRPC, pingcap::Exception(...)) to replace the rethrow with a loop continue (keeping the GET_METRIC and observe_backoff_count(false) calls), so backoff_count, bo.backoff, and type_backoff_error are only handled once per failure.
🧹 Nitpick comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
180-186: Optional: fold backoff counting into the lambda to prevent drift between the two call sites.

`++backoff_count` is manually duplicated at lines 227 and 266, which is easy to get out of sync with the histogram observation. Consider letting the lambda own both the counter and the histogram observation, e.g.:

```diff
-    UInt64 backoff_count = 0;
-    auto observe_backoff_count = [&](bool success) {
-        if (success)
-            GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success).Observe(backoff_count);
-        else
-            GET_METRIC(tiflash_gc_safepoint_backoff_count, type_failure).Observe(backoff_count);
-    };
+    UInt64 backoff_count = 0;
+    auto observe_success = [&] {
+        GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success).Observe(backoff_count);
+    };
+    auto observe_failure = [&] {
+        GET_METRIC(tiflash_gc_safepoint_backoff_count, type_failure).Observe(backoff_count);
+    };
```

Purely a readability/maintenance improvement; defer if you prefer the current shape.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 180 - 186, The backoff counter increment is duplicated outside the lambda (backoff_count with ++ at the call sites) and can drift from the histogram observation; modify the observe_backoff_count lambda (currently capturing backoff_count) so it owns the counter and increments it internally before calling GET_METRIC(tiflash_gc_safepoint_backoff_count, type_success/type_failure). Then remove the external ++backoff_count calls at the sites that call observe_backoff_count(false/true) so callers just call observe_backoff_count(success) and the lambda both increments the counter and records the histogram.
📒 Files selected for processing (2)
- dbms/src/Common/TiFlashMetrics.h
- dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
🚧 Files skipped from review as they are similar to previous changes (1)
- dbms/src/Common/TiFlashMetrics.h
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h (1)
187-198: ⚠️ Potential issue | 🟡 Minor

Keep cache hits out of the backoff histogram.

On a cache hit, no PD request or backoff happens, but line 197 records a successful `backoff_count=0`. In high-frequency paths this can swamp the histogram and hide the retry distribution. Either skip this observation or add a separate cache-hit metric.

📊 Proposed adjustment

```diff
 if (ks_gc_info.has_value())
 {
     // Still valid, return the cached gc safepoint
-    observe_backoff_count(true);
     return ks_gc_info->gc_safepoint;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h` around lines 187 - 198, The code records a backoff observation on cache hits by calling observe_backoff_count(true) inside the ks_gc_sp_map.getGCSafepointIfValid cache-hit branch; remove or change this so cache hits do not contribute to the backoff histogram: either delete the observe_backoff_count(true) call from the cache-hit return path in PDTiKVClient (where ks_gc_sp_map.getGCSafepointIfValid(...) returns a value) so only real PD request paths call observe_backoff_count, or replace it with a distinct cache-hit metric (e.g., observe_gc_safepoint_cache_hit) and increment that instead; ensure observe_backoff_count remains invoked only in code paths that actually perform PD requests/retries.
📒 Files selected for processing (7)
- dbms/src/Debug/dbgFuncSchema.cpp
- dbms/src/Interpreters/Settings.h
- dbms/src/Storages/DeltaMerge/DeltaMergeStore_InternalBg.cpp
- dbms/src/Storages/KVStore/MultiRaft/PrehandleSnapshot.cpp
- dbms/src/Storages/KVStore/TiKVHelpers/PDTiKVClient.h
- dbms/src/Storages/StorageDeltaMerge.cpp
- dbms/src/TiDB/Schema/SchemaSyncService.cpp
```diff
 M(SettingUInt64, mpp_task_running_timeout, DEFAULT_MPP_TASK_RUNNING_TIMEOUT, "mpp task max time that running without any progress.") \
 M(SettingUInt64, mpp_task_waiting_timeout, DEFAULT_MPP_TASK_WAITING_TIMEOUT, "mpp task max time that waiting first data block from source input stream.") \
 M(SettingInt64, safe_point_update_interval_seconds, 1, "The interval in seconds to update safe point from PD.") \
+M(SettingInt64, safe_point_get_max_backoff_ms, 120000, "Max total backoff sleep time in milliseconds when fetching GC safe point from PD.") \
```
Reject negative max-backoff values instead of accepting them silently.
`SettingInt64` has no bounds validation, so `safe_point_get_max_backoff_ms = -1` is accepted and later clamped to 0 in `PDClientHelper::getGCSafePointWithRetry`, effectively disabling retries for transient PD errors. Consider validating this setting or switching the setting/API to a non-negative type with an explicit upper bound.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dbms/src/Interpreters/Settings.h` at line 58, The setting
safe_point_get_max_backoff_ms (declared with SettingInt64) can be set negative
and later clamped to 0 in PDClientHelper::getGCSafePointWithRetry, which
silently disables retries; change the setting to validate and reject negative
values (or switch to a non-negative type) by updating the SettingInt64
registration for safe_point_get_max_backoff_ms to enforce a lower bound of 0
(and optional sensible upper bound) or replace it with an unsigned/explicit
non-negative setting type, and ensure any setter/path that reads this setting
rejects/throws on negative input so PDClientHelper can assume a non-negative
value.
/retest

/test pull-unit-test

/retest

/merge
@CalvinNeo: We have migrated to builtin 👉 Please use

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/approve

@kolafish: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details: In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: CalvinNeo, JaySon-Huang, JinheLin, kolafish. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@CalvinNeo: The following test failed, say

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@wuhuizuo: You cannot manually add or delete the reviewing state labels, only I and the trusted members have permission to do so.

Details: In response to adding label named approved. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
close #10808

### What problem does this PR solve?

Issue Number: close #10808

Problem Summary:

### What is changed and how it works?

```commit-message
```

### Check List

Tests <!-- At least one of them must be included. -->

- [ ] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No code

Side effects

- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility

Documentation

- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility

### Release note <!-- bugfix or new feature needs a release note -->

```release-note
None
```

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>