metrics: add changefeed error info info by wk989898 · Pull Request #4499 · pingcap/ticdc

wk989898 · 2026-03-16T04:30:34Z

What problem does this PR solve?

Issue Number: close #4498

What is changed and how it works?

New Error Metrics: Introduced a new Prometheus metric, ChangefeedErrorInfoGauge, to capture detailed error and warning information for changefeeds, including keyspace, changefeed ID, state, error code, and a normalized error message.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

New Features
- Added "Changefeed Error Details" dashboard panel in Grafana showing error state, code, and detailed message information for failed or warning changefeeds.
- Introduced error metric tracking system for changefeeds, enabling improved monitoring and observability of changefeed health and failure states.

Signed-off-by: wk989898 <nhsmwk@gmail.com>

gemini-code-assist · 2026-03-16T04:30:54Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the monitoring capabilities for TiCDC changefeeds by introducing a dedicated Prometheus metric for detailed error and warning information. This enhancement provides operators with immediate visibility into the specific issues affecting changefeeds, allowing for more efficient diagnosis and resolution of problems. The integration with Grafana dashboards ensures that this critical information is readily accessible and easily interpretable.

Highlights

New Error Metrics: Introduced a new Prometheus metric, ChangefeedErrorInfoGauge, to capture detailed error and warning information for changefeeds, including keyspace, changefeed ID, state, error code, and a normalized error message.
Error Metric Management: Implemented logic within the coordinator to actively collect, normalize (truncating long messages), and manage the lifecycle of these changefeed error metrics, ensuring they are updated and cleared as changefeed states change.
Grafana Dashboard Integration: Integrated the new error metrics into existing Grafana dashboards (ticdc_new_arch.json, ticdc_new_arch_next_gen.json, ticdc_new_arch_with_keyspace_name.json) by adding a "Changefeed Error Details" panel for enhanced visualization and troubleshooting.
Unit Tests: Added comprehensive unit tests to validate the correct collection, updating, and clearing behavior of the changefeed error metrics.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

coordinator/controller.go
- Added strings import for message normalization.
- Defined changefeedErrorMetricMsgLimit constant for message truncation.
- Introduced changefeedErrorMetricLabels struct to define labels for error metrics.
- Implemented normalizeChangefeedErrorMetricMessage to clean and truncate error messages.
- Added getChangefeedErrorMetricLabels to extract relevant error information from changefeed data.
- Modified collectMetrics to initialize and periodically call syncChangefeedErrorMetrics.
- Created syncChangefeedErrorMetrics function to manage the lifecycle of ChangefeedErrorInfoGauge.
coordinator/controller_test.go
- Added fmt, strings, kerneltype, metrics, and testutil imports for testing.
- Implemented getChangefeedErrorMetricLabelName helper for metric label consistency in tests.
- Added TestSyncChangefeedErrorMetrics to verify initial metric collection for changefeed errors.
- Added TestSyncChangefeedErrorMetricsUpdateAndClear to test metric updates and clearing when changefeed states change.
metrics/grafana/ticdc_new_arch.json
- Added a new Grafana table panel titled "Changefeed Error Details" to display ticdc_owner_changefeed_error_info.
- Updated several existing panel IDs to ensure uniqueness.
metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- Added a new Grafana table panel titled "Changefeed Error Details" for the next-gen architecture, displaying ticdc_owner_changefeed_error_info.
- Updated several existing panel IDs to ensure uniqueness.
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
- Added a new Grafana table panel titled "Changefeed Error Details" for the keyspace-aware architecture, displaying ticdc_owner_changefeed_error_info.
- Updated several existing panel IDs to ensure uniqueness.
pkg/metrics/changefeed.go
- Defined and registered ChangefeedErrorInfoGauge as a new Prometheus GaugeVec to track detailed error information for changefeeds.

Activity

No specific activity (comments, reviews, progress) was provided in the context for this pull request.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-03-16T04:30:55Z

Warning

Rate limit exceeded

@wk989898 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 53 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 43421a5c-2b17-4a83-9f61-2e4f9e9b6814

📥 Commits

Reviewing files that changed from the base of the PR and between 1452837 and a9c2b29.

📒 Files selected for processing (3)

metrics/grafana/ticdc_new_arch.json
metrics/nextgengrafana/ticdc_new_arch_next_gen.json
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json

📝 Walkthrough

Walkthrough

This PR introduces a new Prometheus metric (ChangefeedErrorInfoGauge) to expose changefeed error details when changefeeds are in warning or failed states. The controller implements metric collection with in-memory label caching to avoid redundant gauge updates. Grafana dashboards are updated with a new "Changefeed Error Details" panel to visualize this information.

Changes

Cohort / File(s)	Summary
Metric Definition & Collection `coordinator/controller.go`, `pkg/metrics/changefeed.go`, `coordinator/helper.go`	Introduces `ChangefeedErrorInfoGauge` metric with labels `[keyspace, changefeed, state, code, message]`, implements cache-based metric collection in `collectMetrics` to track emitted labels and avoid redundant updates, and adds helper functions to extract error labels from changefeed info when state is `Failed` or `Warning`.
Grafana Dashboards `metrics/grafana/ticdc_new_arch.json`, `metrics/nextgengrafana/ticdc_new_arch_next_gen.json`, `metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json`	Adds new "Changefeed Error Details" table panel (id `62002`) to Summary row querying `ticdc_owner_changefeed_error_info` with transformations to expose error columns; renumbers multiple existing panel IDs across dashboard.

Sequence Diagram

sequenceDiagram
    actor Tick as Metrics Tick
    participant Controller as Coordinator Controller
    participant DB as Changefeed DB
    participant Helper as Label Helper
    participant Gauge as ErrorInfoGauge
    participant Dashboard as Grafana Dashboard
    
    Tick->>Controller: collectMetrics()
    Controller->>DB: Foreach(changefeed)
    loop For Each Changefeed
        DB->>Helper: getChangefeedErrorMetricLabels(cf)
        Helper->>Helper: Check state == Failed/Warning
        Helper-->>DB: newLabels, hasError
        DB->>Controller: Evaluate cache
        alt Labels Changed or Error State Changed
            Controller->>Gauge: Delete old labels (if cached)
            alt hasError == true
                Controller->>Gauge: Set(newLabels, 1)
                Controller->>Controller: Update cache
            else hasError == false
                Controller->>Controller: Remove from cache
            end
        else Labels Unchanged & Error State Unchanged
            Controller->>Controller: Skip (early return)
        end
    end
    Controller->>Controller: Cleanup removed changefeeds
    Controller->>Gauge: Delete(removed changefeed labels)
    Gauge->>Dashboard: Expose metrics
    Dashboard->>Dashboard: Display Error Details

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

metrics: fix ticdc grafana dashboard layout conflicts #4507: Modifies the same Grafana dashboard JSONs (panel ID renumbering and new panel additions), suggesting coordinated dashboard updates.

Suggested reviewers

hongyunyan
asddongmen
tenfyzhong

Poem

🐰 When changefeeds stumble and warn with despair,
A gauge now captures each error with care,
No logs need hunting, the metrics now glow—
In Grafana's tableau, the problems we know! 📊✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title contains a repetition ('info info') that suggests a typo or editing error, making it ambiguous whether this accurately describes the change.	Clarify the title by removing the duplicate word—e.g., 'metrics: add changefeed error info' or 'metrics: add changefeed error info metric' to better reflect the actual change.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The PR description includes the required issue reference (close `#4498`), explains the changes (new metric ChangefeedErrorInfoGauge), lists test types, and provides an image demonstrating the feature in action.
Linked Issues check	✅ Passed	All code changes implement the objective from `#4498`: a new Prometheus metric now exposes changefeed error details (state, code, message) to improve diagnosis without requiring log queries.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the linked issue `#4498`: the metric definition, collector logic, Grafana panels, and helper functions all support exposing changefeed error information as specified.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a new metric, ticdc_owner_changefeed_error_info, to provide detailed information about changefeed errors. The implementation includes logic to create, update, and delete these metrics as the changefeed state changes. The changes are well-tested and include updates to the Grafana dashboards to visualize the new metric. My main feedback is a suggestion to refactor the metric synchronization logic for better readability and maintainability.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

coordinator/controller_test.go (2)
49-145: Add a warning-state metric case for full objective coverage.

These tests cover failed → update → clear paths well, but the issue objective also requires warning-state visibility. Please add a warning-state assertion to lock behavior for both supported states.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller_test.go` around lines 49 - 145, Add assertions for the
"warning" state to achieve full objective coverage: in
TestSyncChangefeedErrorMetrics create a changefeed with StateWarning (instead of
or in addition to StateFailed) using changefeed.NewChangefeed and a
RunningError, call controller.syncChangefeedErrorMetrics, and assert the
exported metric contains state="warning" (use
normalizeChangefeedErrorMetricMessage and getChangefeedErrorMetricLabelName for
message/label formatting); similarly in
TestSyncChangefeedErrorMetricsUpdateAndClear add a transition path that sets cf
info to StateWarning and verifies the metric shows the warning state, then
update/clear as done for failed to confirm metrics update and removal via
controller.syncChangefeedErrorMetrics.
42-47: Avoid duplicating keyspace-label literals in test helper.

Line 42-47 re-encodes "keyspace_name" / "namespace" in test code. This can drift from production label selection logic over time. Prefer reusing a shared constant/helper from metrics package (if exposed) to keep one source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller_test.go` around lines 42 - 47, Replace the hard-coded
literals in getChangefeedErrorMetricLabelName (and its kerneltype.IsNextGen
branch) with the exported constant or helper from the metrics package so the
test uses the same source-of-truth as production; import the metrics package and
return the appropriate exported symbol (e.g., metrics.KeyspaceLabelName or the
relevant metrics.Changefeed label constant/helper—use the actual exported name
available) instead of "keyspace_name" / "namespace".
metrics/nextgengrafana/ticdc_new_arch_next_gen.json (1)
1197-1219: Add explicit state filter to dashboard query as defensive refactoring.

The metric ticdc_owner_changefeed_error_info only emits when changefeed state is "failed" or "warning" (see getChangefeedErrorMetricLabels in coordinator/controller.go), so the current query is functionally correct. However, adding an explicit state=~"warning|failed" selector makes the panel intent clear and guards against future metric behavior changes.
Proposed query refinement
- "expr": "max by (keyspace_name, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
+ "expr": "max by (keyspace_name, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", state=~\"warning|failed\"})",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json` around lines 1197 -
1219, The panel's Prometheus target should explicitly restrict states to warning
or failed to make intent clear and guard against future metric changes: update
the target expression that references ticdc_owner_changefeed_error_info (the
"expr" in the targets for panel id 62002) to include a state selector like
state=~"warning|failed"; this aligns the query with
getChangefeedErrorMetricLabels' behavior and keeps filtering alongside existing
label matchers (k8s_cluster, sharedpool_id, keyspace_name, changefeed).
coordinator/controller.go (1)
270-300: Consider metric cardinality implications of the message label.

The sync logic is correct, but the message label could lead to high Prometheus metric cardinality if changefeeds cycle through different error messages over time (e.g., containing timestamps, connection addresses, or other dynamic content). Each unique message creates a new time series.

This may be acceptable given:

The truncation to 256 chars limits variation

Typically few changefeeds are in error state simultaneously

The diagnostic value outweighs the cost per PR objectives

Consider documenting this trade-off or monitoring cardinality in production.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 270 - 300, The
syncChangefeedErrorMetrics function currently exposes the dynamic "message"
label (used with metrics.ChangefeedErrorInfoGauge), which can increase
Prometheus cardinality; add a brief in-code comment above
syncChangefeedErrorMetrics documenting this trade-off (mention truncation to 256
chars and the potential for high cardinality) and how to monitor it, and also
emit a simple telemetry metric reflecting the current unique error-message
cardinality (e.g., update or add a gauge like
metrics.ChangefeedErrorMessageCardinality.Set(float64(len(cache))) inside
syncChangefeedErrorMetrics) so production can track the number of distinct
messages over time.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@metrics/grafana/ticdc_new_arch.json`:
- Around line 1218-1221: The panel's "expr" currently returns series with zero
values (stale labels); update the expression that uses the metric
ticdc_owner_changefeed_error_info (the "expr" field for refId "A") to exclude
zero-valued series by applying a > 0 filter after the metric or after the
aggregation (e.g., only keep series where the value is greater than 0) so
recovered changefeeds no longer show as current errors.

In `@pkg/metrics/changefeed.go`:
- Around line 76-83: The metric ChangefeedErrorInfoGauge currently includes raw
error text in the "message" label (constructed via
normalizeChangefeedErrorMetricMessage), which can leak secrets; update the code
that sets labels for ChangefeedErrorInfoGauge to sanitize the message using the
existing redaction helper (e.g., util.RedactValue()) before calling
normalizeChangefeedErrorMetricMessage or instead emit a stable error fingerprint
label (e.g., "error_fingerprint") and move full text only to logs/events; ensure
the change references the ChangefeedErrorInfoGauge and
normalizeChangefeedErrorMetricMessage symbols so all callers use the sanitized
value.

---

Nitpick comments:
In `@coordinator/controller_test.go`:
- Around line 49-145: Add assertions for the "warning" state to achieve full
objective coverage: in TestSyncChangefeedErrorMetrics create a changefeed with
StateWarning (instead of or in addition to StateFailed) using
changefeed.NewChangefeed and a RunningError, call
controller.syncChangefeedErrorMetrics, and assert the exported metric contains
state="warning" (use normalizeChangefeedErrorMetricMessage and
getChangefeedErrorMetricLabelName for message/label formatting); similarly in
TestSyncChangefeedErrorMetricsUpdateAndClear add a transition path that sets cf
info to StateWarning and verifies the metric shows the warning state, then
update/clear as done for failed to confirm metrics update and removal via
controller.syncChangefeedErrorMetrics.
- Around line 42-47: Replace the hard-coded literals in
getChangefeedErrorMetricLabelName (and its kerneltype.IsNextGen branch) with the
exported constant or helper from the metrics package so the test uses the same
source-of-truth as production; import the metrics package and return the
appropriate exported symbol (e.g., metrics.KeyspaceLabelName or the relevant
metrics.Changefeed label constant/helper—use the actual exported name available)
instead of "keyspace_name" / "namespace".

In `@coordinator/controller.go`:
- Around line 270-300: The syncChangefeedErrorMetrics function currently exposes
the dynamic "message" label (used with metrics.ChangefeedErrorInfoGauge), which
can increase Prometheus cardinality; add a brief in-code comment above
syncChangefeedErrorMetrics documenting this trade-off (mention truncation to 256
chars and the potential for high cardinality) and how to monitor it, and also
emit a simple telemetry metric reflecting the current unique error-message
cardinality (e.g., update or add a gauge like
metrics.ChangefeedErrorMessageCardinality.Set(float64(len(cache))) inside
syncChangefeedErrorMetrics) so production can track the number of distinct
messages over time.

In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json`:
- Around line 1197-1219: The panel's Prometheus target should explicitly
restrict states to warning or failed to make intent clear and guard against
future metric changes: update the target expression that references
ticdc_owner_changefeed_error_info (the "expr" in the targets for panel id 62002)
to include a state selector like state=~"warning|failed"; this aligns the query
with getChangefeedErrorMetricLabels' behavior and keeps filtering alongside
existing label matchers (k8s_cluster, sharedpool_id, keyspace_name, changefeed).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 46141cd1-8df7-4696-9e1d-3f7e1f1a1990

📥 Commits

Reviewing files that changed from the base of the PR and between 1154cb3 and 3db01dc.

📒 Files selected for processing (6)

coordinator/controller.go
coordinator/controller_test.go
metrics/grafana/ticdc_new_arch.json
metrics/nextgengrafana/ticdc_new_arch_next_gen.json
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
pkg/metrics/changefeed.go

coderabbitai · 2026-03-16T04:37:36Z

+              "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
+              "format": "time_series",
+              "instant": true,
+              "refId": "A"


⚠️ Potential issue | 🟠 Major

Filter out inactive error series in the panel query.

At Line [1218], the query does not exclude zero-valued series. If the gauge is reset to 0 for recovered changefeeds, stale labels can still appear in this table as “current” errors.

Suggested query fix

- "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})", + "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"}) > 0",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@metrics/grafana/ticdc_new_arch.json` around lines 1218 - 1221, The panel's "expr" currently returns series with zero values (stale labels); update the expression that uses the metric ticdc_owner_changefeed_error_info (the "expr" field for refId "A") to exclude zero-valued series by applying a > 0 filter after the metric or after the aggregation (e.g., only keep series where the value is greater than 0) so recovered changefeeds no longer show as current errors.

coderabbitai · 2026-03-16T04:37:36Z

+	ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
+		prometheus.GaugeOpts{
+			Namespace: "ticdc",
+			Subsystem: "owner",
+			Name:      "changefeed_error_info",
+			Help:      "The current warning or failed reason of changefeeds",
+		}, []string{getKeyspaceLabel(), "changefeed", "state", "code", "message"})
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== Metric definition and usage ===" rg -n --type go -C2 'ChangefeedErrorInfoGauge|changefeed_error_info|labelValues' echo echo "=== Normalization/redaction implementation for error message ===" rg -n --type go -A40 -B5 'func normalizeChangefeedErrorMetricMessage' echo echo "=== Potential redaction/sanitization helpers in related code ===" rg -n --type go -C2 'redact|sanitize|mask|password|passwd|token|secret|SinkURI|UserPassword'

Repository: pingcap/ticdc

Length of output: 50370

🏁 Script executed:

rg -n --type go 'changefeedErrorMetricMsgLimit' -B2 -A2

Repository: pingcap/ticdc

Length of output: 816

🏁 Script executed:

rg -n --type go 'getChangefeedErrorMetricLabels' -B3 -A8

Repository: pingcap/ticdc

Length of output: 1500

🏁 Script executed:

# Look at test expectations to understand what kind of messages are expected rg -n --type go -A5 'expectedMessage.*normalizeChangefeedErrorMetricMessage'

Repository: pingcap/ticdc

Length of output: 756

Apply redaction to error messages in ChangefeedErrorInfoGauge labels.

Line 82 includes raw error text as a Prometheus label. While normalizeChangefeedErrorMetricMessage (line 67–73) bounds length to 256 characters, it performs no redaction or masking. Error messages may contain connection strings, credentials, or API keys, which will leak into metrics storage. Use the existing redaction infrastructure (e.g., util.RedactValue()) to sanitize the message before label assignment, or use a stable error fingerprint label and keep full text in logs/events only.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/metrics/changefeed.go` around lines 76 - 83, The metric ChangefeedErrorInfoGauge currently includes raw error text in the "message" label (constructed via normalizeChangefeedErrorMetricMessage), which can leak secrets; update the code that sets labels for ChangefeedErrorInfoGauge to sanitize the message using the existing redaction helper (e.g., util.RedactValue()) before calling normalizeChangefeedErrorMetricMessage or instead emit a stable error fingerprint label (e.g., "error_fingerprint") and move full text only to logs/events; ensure the change references the ChangefeedErrorInfoGauge and normalizeChangefeedErrorMetricMessage symbols so all callers use the sanitized value.

Signed-off-by: wk989898 <nhsmwk@gmail.com>

coderabbitai

🧹 Nitpick comments (1)

metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json (1)
7460-7460: Prefer stable panel IDs for existing panels.

Renumbering existing panel IDs increases dashboard churn and can break deep links/bookmarks that reference viewPanel. If there’s no collision, keep existing IDs and allocate new IDs only for newly added panels.

Also applies to: 7556-7556, 7662-7662, 8245-8245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json` at line 7460,
Several panel "id" fields were renumbered (e.g., "id": 62004 and the other
affected ids at 7556, 7662, 8245); restore the original stable panel IDs for
existing panels instead of assigning new numbers. Locate the JSON objects with
"id" keys (match the shown values like 62004, 7556, 7662, 8245), revert them to
their previous IDs if no collision exists, and if you must add new panels
generate new unique IDs that do not overlap existing ones (use a consistent
high-range allocation or an ID registry) so deep links/bookmarks (viewPanel)
remain stable.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json`:
- Line 7460: Several panel "id" fields were renumbered (e.g., "id": 62004 and
the other affected ids at 7556, 7662, 8245); restore the original stable panel
IDs for existing panels instead of assigning new numbers. Locate the JSON
objects with "id" keys (match the shown values like 62004, 7556, 7662, 8245),
revert them to their previous IDs if no collision exists, and if you must add
new panels generate new unique IDs that do not overlap existing ones (use a
consistent high-range allocation or an ID registry) so deep links/bookmarks
(viewPanel) remain stable.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c11186d7-4a21-4529-b41b-88fdf9b1df85

📥 Commits

Reviewing files that changed from the base of the PR and between 3db01dc and fe3d264.

📒 Files selected for processing (3)

metrics/grafana/ticdc_new_arch.json
metrics/nextgengrafana/ticdc_new_arch_next_gen.json
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json

Signed-off-by: wk989898 <nhsmwk@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@coordinator/controller.go`:
- Around line 79-86: The code currently picks runningErr by preferring
info.Error over info.Warning regardless of state; change this to select the
field based on info.State (use info.Error when info.State == config.StateFailed
and use info.Warning when info.State == config.StateWarning) so a warning-state
changefeed cannot emit failed-state details; update the assignment of runningErr
(the logic around info.State, runningErr, info.Error, info.Warning and the check
against config.StateFailed/config.StateWarning) accordingly so the chosen error
matches the current state.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b7d7cc5a-b736-4e30-94cb-311796d3067a

📥 Commits

Reviewing files that changed from the base of the PR and between fe3d264 and cf5e671.

📒 Files selected for processing (1)

coordinator/controller.go

Signed-off-by: wk989898 <nhsmwk@gmail.com>

ti-chi-bot · 2026-03-19T09:10:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [3AceShowHand,lidezhu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-03-19T09:10:47Z

[LGTM Timeline notifier]

Timeline:

2026-03-19 08:38:34.789517381 +0000 UTC m=+432641.877174918: ☑️ agreed by 3AceShowHand.
2026-03-19 09:10:46.39585448 +0000 UTC m=+434573.483512017: ☑️ agreed by lidezhu.

Signed-off-by: wk989898 <nhsmwk@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@coordinator/helper.go`:
- Around line 76-79: The current logic sets runningErr := info.Error then falls
back to info.Warning, which can surface the wrong message when info.State
indicates a warning; update the selection to be state-first: inspect info.State
(e.g., "error", "warning") and choose info.Error only when State == "error",
choose info.Warning only when State == "warning", and finally fallback to
whichever non-nil field remains (or nil) — update the code around runningErr,
using the info.State, info.Error, and info.Warning symbols to implement this
deterministic selection and ensure nil checks when selecting the fallback.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da5f2ea9-5efb-4af2-b04e-10fd6e849790

📥 Commits

Reviewing files that changed from the base of the PR and between cf5e671 and 1452837.

📒 Files selected for processing (5)

coordinator/controller.go
coordinator/helper.go
metrics/grafana/ticdc_new_arch.json
metrics/nextgengrafana/ticdc_new_arch_next_gen.json
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json

🚧 Files skipped from review as they are similar to previous changes (2)

coordinator/controller.go
metrics/grafana/ticdc_new_arch.json

coderabbitai · 2026-03-20T04:18:13Z

+	runningErr := info.Error
+	if runningErr == nil {
+		runningErr = info.Warning
+	}


⚠️ Potential issue | 🟠 Major

Use state-first error selection to avoid mismatched warning/failed details.

At Line 76, info.Error is always preferred even when info.State is warning. If both fields are present, this can publish a stale/incorrect code/message for the current state.

🔧 Proposed fix

- runningErr := info.Error - if runningErr == nil { - runningErr = info.Warning - } + var runningErr *config.RunningError + switch info.State { + case config.StateFailed: + runningErr = info.Error + if runningErr == nil { + runningErr = info.Warning + } + case config.StateWarning: + runningErr = info.Warning + if runningErr == nil { + runningErr = info.Error + } + } if runningErr == nil { return changefeedErrorMetricLabels{}, false }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

runningErr := info.Error

if runningErr == nil {

runningErr = info.Warning

}

var runningErr *config.RunningError

switch info.State {

case config.StateFailed:

runningErr = info.Error

if runningErr == nil {

runningErr = info.Warning

}

case config.StateWarning:

runningErr = info.Warning

if runningErr == nil {

runningErr = info.Error

}

}

if runningErr == nil {

return changefeedErrorMetricLabels{}, false

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@coordinator/helper.go` around lines 76 - 79, The current logic sets runningErr := info.Error then falls back to info.Warning, which can surface the wrong message when info.State indicates a warning; update the selection to be state-first: inspect info.State (e.g., "error", "warning") and choose info.Error only when State == "error", choose info.Warning only when State == "warning", and finally fallback to whichever non-nil field remains (or nil) — update the code around runningErr, using the info.State, info.Error, and info.Warning symbols to implement this deterministic selection and ensure nil checks when selecting the fallback.

close #4498

init

3db01dc

Signed-off-by: wk989898 <nhsmwk@gmail.com>

ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 16, 2026

gemini-code-assist bot reviewed Mar 16, 2026

View reviewed changes

Comment thread coordinator/controller.go Outdated

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

update

fe3d264

Signed-off-by: wk989898 <nhsmwk@gmail.com>

ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 16, 2026

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

update

cf5e671

Signed-off-by: wk989898 <nhsmwk@gmail.com>

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

Comment thread coordinator/controller.go Outdated

chore

2720f67

Signed-off-by: wk989898 <nhsmwk@gmail.com>

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 16, 2026

3AceShowHand approved these changes Mar 19, 2026

View reviewed changes

ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 19, 2026

lidezhu approved these changes Mar 19, 2026

View reviewed changes

ti-chi-bot bot added the lgtm label Mar 19, 2026

ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 19, 2026

wk989898 added 2 commits March 20, 2026 04:12

Merge branch 'master' into metric-0316

1452837

Signed-off-by: wk989898 <nhsmwk@gmail.com>

update

a9c2b29

Signed-off-by: wk989898 <nhsmwk@gmail.com>

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

ti-chi-bot bot merged commit e8de3a6 into pingcap:master Mar 20, 2026
26 checks passed

tenfyzhong pushed a commit that referenced this pull request Mar 25, 2026

metrics: add changefeed error info info (#4499)

5adb894

close #4498

wk989898 deleted the metric-0316 branch April 13, 2026 04:02

Conversation

wk989898 commented Mar 16, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Summary by CodeRabbit

Uh oh!

gemini-code-assist bot commented Mar 16, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ti-chi-bot bot commented Mar 19, 2026

Uh oh!

ti-chi-bot bot commented Mar 19, 2026

[LGTM Timeline notifier]

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wk989898 commented Mar 16, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 16, 2026 •

edited

Loading