Skip to content

metrics: add changefeed error info info#4499

Merged
ti-chi-bot[bot] merged 6 commits intopingcap:masterfrom
wk989898:metric-0316
Mar 20, 2026
Merged

metrics: add changefeed error info info#4499
ti-chi-bot[bot] merged 6 commits intopingcap:masterfrom
wk989898:metric-0316

Conversation

@wk989898
Copy link
Copy Markdown
Collaborator

@wk989898 wk989898 commented Mar 16, 2026

What problem does this PR solve?

Issue Number: close #4498

What is changed and how it works?

New Error Metrics: Introduced a new Prometheus metric, ChangefeedErrorInfoGauge, to capture detailed error and warning information for changefeeds, including keyspace, changefeed ID, state, error code, and a normalized error message.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
Screenshot 2026-03-16 at 15 39 45

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • New Features
    • Added "Changefeed Error Details" dashboard panel in Grafana showing error state, code, and detailed message information for failed or warning changefeeds.
    • Introduced error metric tracking system for changefeeds, enabling improved monitoring and observability of changefeed health and failure states.

Signed-off-by: wk989898 <nhsmwk@gmail.com>
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 16, 2026
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the monitoring capabilities for TiCDC changefeeds by introducing a dedicated Prometheus metric for detailed error and warning information. This enhancement provides operators with immediate visibility into the specific issues affecting changefeeds, allowing for more efficient diagnosis and resolution of problems. The integration with Grafana dashboards ensures that this critical information is readily accessible and easily interpretable.

Highlights

  • New Error Metrics: Introduced a new Prometheus metric, ChangefeedErrorInfoGauge, to capture detailed error and warning information for changefeeds, including keyspace, changefeed ID, state, error code, and a normalized error message.
  • Error Metric Management: Implemented logic within the coordinator to actively collect, normalize (truncating long messages), and manage the lifecycle of these changefeed error metrics, ensuring they are updated and cleared as changefeed states change.
  • Grafana Dashboard Integration: Integrated the new error metrics into existing Grafana dashboards (ticdc_new_arch.json, ticdc_new_arch_next_gen.json, ticdc_new_arch_with_keyspace_name.json) by adding a "Changefeed Error Details" panel for enhanced visualization and troubleshooting.
  • Unit Tests: Added comprehensive unit tests to validate the correct collection, updating, and clearing behavior of the changefeed error metrics.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • coordinator/controller.go
    • Added strings import for message normalization.
    • Defined changefeedErrorMetricMsgLimit constant for message truncation.
    • Introduced changefeedErrorMetricLabels struct to define labels for error metrics.
    • Implemented normalizeChangefeedErrorMetricMessage to clean and truncate error messages.
    • Added getChangefeedErrorMetricLabels to extract relevant error information from changefeed data.
    • Modified collectMetrics to initialize and periodically call syncChangefeedErrorMetrics.
    • Created syncChangefeedErrorMetrics function to manage the lifecycle of ChangefeedErrorInfoGauge.
  • coordinator/controller_test.go
    • Added fmt, strings, kerneltype, metrics, and testutil imports for testing.
    • Implemented getChangefeedErrorMetricLabelName helper for metric label consistency in tests.
    • Added TestSyncChangefeedErrorMetrics to verify initial metric collection for changefeed errors.
    • Added TestSyncChangefeedErrorMetricsUpdateAndClear to test metric updates and clearing when changefeed states change.
  • metrics/grafana/ticdc_new_arch.json
    • Added a new Grafana table panel titled "Changefeed Error Details" to display ticdc_owner_changefeed_error_info.
    • Updated several existing panel IDs to ensure uniqueness.
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
    • Added a new Grafana table panel titled "Changefeed Error Details" for the next-gen architecture, displaying ticdc_owner_changefeed_error_info.
    • Updated several existing panel IDs to ensure uniqueness.
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
    • Added a new Grafana table panel titled "Changefeed Error Details" for the keyspace-aware architecture, displaying ticdc_owner_changefeed_error_info.
    • Updated several existing panel IDs to ensure uniqueness.
  • pkg/metrics/changefeed.go
    • Defined and registered ChangefeedErrorInfoGauge as a new Prometheus GaugeVec to track detailed error information for changefeeds.
Activity
  • No specific activity (comments, reviews, progress) was provided in the context for this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 16, 2026

Warning

Rate limit exceeded

@wk989898 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 53 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 43421a5c-2b17-4a83-9f61-2e4f9e9b6814

📥 Commits

Reviewing files that changed from the base of the PR and between 1452837 and a9c2b29.

📒 Files selected for processing (3)
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
📝 Walkthrough

Walkthrough

This PR introduces a new Prometheus metric (ChangefeedErrorInfoGauge) to expose changefeed error details when changefeeds are in warning or failed states. The controller implements metric collection with in-memory label caching to avoid redundant gauge updates. Grafana dashboards are updated with a new "Changefeed Error Details" panel to visualize this information.

Changes

Cohort / File(s) Summary
Metric Definition & Collection
coordinator/controller.go, pkg/metrics/changefeed.go, coordinator/helper.go
Introduces ChangefeedErrorInfoGauge metric with labels [keyspace, changefeed, state, code, message], implements cache-based metric collection in collectMetrics to track emitted labels and avoid redundant updates, and adds helper functions to extract error labels from changefeed info when state is Failed or Warning.
Grafana Dashboards
metrics/grafana/ticdc_new_arch.json, metrics/nextgengrafana/ticdc_new_arch_next_gen.json, metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
Adds new "Changefeed Error Details" table panel (id 62002) to Summary row querying ticdc_owner_changefeed_error_info with transformations to expose error columns; renumbers multiple existing panel IDs across dashboard.

Sequence Diagram

sequenceDiagram
    actor Tick as Metrics Tick
    participant Controller as Coordinator Controller
    participant DB as Changefeed DB
    participant Helper as Label Helper
    participant Gauge as ErrorInfoGauge
    participant Dashboard as Grafana Dashboard
    
    Tick->>Controller: collectMetrics()
    Controller->>DB: Foreach(changefeed)
    loop For Each Changefeed
        DB->>Helper: getChangefeedErrorMetricLabels(cf)
        Helper->>Helper: Check state == Failed/Warning
        Helper-->>DB: newLabels, hasError
        DB->>Controller: Evaluate cache
        alt Labels Changed or Error State Changed
            Controller->>Gauge: Delete old labels (if cached)
            alt hasError == true
                Controller->>Gauge: Set(newLabels, 1)
                Controller->>Controller: Update cache
            else hasError == false
                Controller->>Controller: Remove from cache
            end
        else Labels Unchanged & Error State Unchanged
            Controller->>Controller: Skip (early return)
        end
    end
    Controller->>Controller: Cleanup removed changefeeds
    Controller->>Gauge: Delete(removed changefeed labels)
    Gauge->>Dashboard: Expose metrics
    Dashboard->>Dashboard: Display Error Details
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • hongyunyan
  • asddongmen
  • tenfyzhong

Poem

🐰 When changefeeds stumble and warn with despair,
A gauge now captures each error with care,
No logs need hunting, the metrics now glow—
In Grafana's tableau, the problems we know! 📊✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title contains a repetition ('info info') that suggests a typo or editing error, making it ambiguous whether this accurately describes the change. Clarify the title by removing the duplicate word—e.g., 'metrics: add changefeed error info' or 'metrics: add changefeed error info metric' to better reflect the actual change.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The PR description includes the required issue reference (close #4498), explains the changes (new metric ChangefeedErrorInfoGauge), lists test types, and provides an image demonstrating the feature in action.
Linked Issues check ✅ Passed All code changes implement the objective from #4498: a new Prometheus metric now exposes changefeed error details (state, code, message) to improve diagnosis without requiring log queries.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the linked issue #4498: the metric definition, collector logic, Grafana panels, and helper functions all support exposing changefeed error information as specified.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new metric, ticdc_owner_changefeed_error_info, to provide detailed information about changefeed errors. The implementation includes logic to create, update, and delete these metrics as the changefeed state changes. The changes are well-tested and include updates to the Grafana dashboards to visualize the new metric. My main feedback is a suggestion to refactor the metric synchronization logic for better readability and maintainability.

Comment thread coordinator/controller.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
coordinator/controller_test.go (2)

49-145: Add a warning-state metric case for full objective coverage.

These tests cover failed → update → clear paths well, but the issue objective also requires warning-state visibility. Please add a warning-state assertion to lock behavior for both supported states.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller_test.go` around lines 49 - 145, Add assertions for the
"warning" state to achieve full objective coverage: in
TestSyncChangefeedErrorMetrics create a changefeed with StateWarning (instead of
or in addition to StateFailed) using changefeed.NewChangefeed and a
RunningError, call controller.syncChangefeedErrorMetrics, and assert the
exported metric contains state="warning" (use
normalizeChangefeedErrorMetricMessage and getChangefeedErrorMetricLabelName for
message/label formatting); similarly in
TestSyncChangefeedErrorMetricsUpdateAndClear add a transition path that sets cf
info to StateWarning and verifies the metric shows the warning state, then
update/clear as done for failed to confirm metrics update and removal via
controller.syncChangefeedErrorMetrics.

42-47: Avoid duplicating keyspace-label literals in test helper.

Line 42-47 re-encodes "keyspace_name" / "namespace" in test code. This can drift from production label selection logic over time. Prefer reusing a shared constant/helper from metrics package (if exposed) to keep one source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller_test.go` around lines 42 - 47, Replace the hard-coded
literals in getChangefeedErrorMetricLabelName (and its kerneltype.IsNextGen
branch) with the exported constant or helper from the metrics package so the
test uses the same source-of-truth as production; import the metrics package and
return the appropriate exported symbol (e.g., metrics.KeyspaceLabelName or the
relevant metrics.Changefeed label constant/helper—use the actual exported name
available) instead of "keyspace_name" / "namespace".
metrics/nextgengrafana/ticdc_new_arch_next_gen.json (1)

1197-1219: Add explicit state filter to dashboard query as defensive refactoring.

The metric ticdc_owner_changefeed_error_info only emits when changefeed state is "failed" or "warning" (see getChangefeedErrorMetricLabels in coordinator/controller.go), so the current query is functionally correct. However, adding an explicit state=~"warning|failed" selector makes the panel intent clear and guards against future metric behavior changes.

Proposed query refinement
- "expr": "max by (keyspace_name, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
+ "expr": "max by (keyspace_name, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", state=~\"warning|failed\"})",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json` around lines 1197 -
1219, The panel's Prometheus target should explicitly restrict states to warning
or failed to make intent clear and guard against future metric changes: update
the target expression that references ticdc_owner_changefeed_error_info (the
"expr" in the targets for panel id 62002) to include a state selector like
state=~"warning|failed"; this aligns the query with
getChangefeedErrorMetricLabels' behavior and keeps filtering alongside existing
label matchers (k8s_cluster, sharedpool_id, keyspace_name, changefeed).
coordinator/controller.go (1)

270-300: Consider metric cardinality implications of the message label.

The sync logic is correct, but the message label could lead to high Prometheus metric cardinality if changefeeds cycle through different error messages over time (e.g., containing timestamps, connection addresses, or other dynamic content). Each unique message creates a new time series.

This may be acceptable given:

  1. The truncation to 256 chars limits variation
  2. Typically few changefeeds are in error state simultaneously
  3. The diagnostic value outweighs the cost per PR objectives

Consider documenting this trade-off or monitoring cardinality in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/controller.go` around lines 270 - 300, The
syncChangefeedErrorMetrics function currently exposes the dynamic "message"
label (used with metrics.ChangefeedErrorInfoGauge), which can increase
Prometheus cardinality; add a brief in-code comment above
syncChangefeedErrorMetrics documenting this trade-off (mention truncation to 256
chars and the potential for high cardinality) and how to monitor it, and also
emit a simple telemetry metric reflecting the current unique error-message
cardinality (e.g., update or add a gauge like
metrics.ChangefeedErrorMessageCardinality.Set(float64(len(cache))) inside
syncChangefeedErrorMetrics) so production can track the number of distinct
messages over time.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@metrics/grafana/ticdc_new_arch.json`:
- Around line 1218-1221: The panel's "expr" currently returns series with zero
values (stale labels); update the expression that uses the metric
ticdc_owner_changefeed_error_info (the "expr" field for refId "A") to exclude
zero-valued series by applying a > 0 filter after the metric or after the
aggregation (e.g., only keep series where the value is greater than 0) so
recovered changefeeds no longer show as current errors.

In `@pkg/metrics/changefeed.go`:
- Around line 76-83: The metric ChangefeedErrorInfoGauge currently includes raw
error text in the "message" label (constructed via
normalizeChangefeedErrorMetricMessage), which can leak secrets; update the code
that sets labels for ChangefeedErrorInfoGauge to sanitize the message using the
existing redaction helper (e.g., util.RedactValue()) before calling
normalizeChangefeedErrorMetricMessage or instead emit a stable error fingerprint
label (e.g., "error_fingerprint") and move full text only to logs/events; ensure
the change references the ChangefeedErrorInfoGauge and
normalizeChangefeedErrorMetricMessage symbols so all callers use the sanitized
value.

---

Nitpick comments:
In `@coordinator/controller_test.go`:
- Around line 49-145: Add assertions for the "warning" state to achieve full
objective coverage: in TestSyncChangefeedErrorMetrics create a changefeed with
StateWarning (instead of or in addition to StateFailed) using
changefeed.NewChangefeed and a RunningError, call
controller.syncChangefeedErrorMetrics, and assert the exported metric contains
state="warning" (use normalizeChangefeedErrorMetricMessage and
getChangefeedErrorMetricLabelName for message/label formatting); similarly in
TestSyncChangefeedErrorMetricsUpdateAndClear add a transition path that sets cf
info to StateWarning and verifies the metric shows the warning state, then
update/clear as done for failed to confirm metrics update and removal via
controller.syncChangefeedErrorMetrics.
- Around line 42-47: Replace the hard-coded literals in
getChangefeedErrorMetricLabelName (and its kerneltype.IsNextGen branch) with the
exported constant or helper from the metrics package so the test uses the same
source-of-truth as production; import the metrics package and return the
appropriate exported symbol (e.g., metrics.KeyspaceLabelName or the relevant
metrics.Changefeed label constant/helper—use the actual exported name available)
instead of "keyspace_name" / "namespace".

In `@coordinator/controller.go`:
- Around line 270-300: The syncChangefeedErrorMetrics function currently exposes
the dynamic "message" label (used with metrics.ChangefeedErrorInfoGauge), which
can increase Prometheus cardinality; add a brief in-code comment above
syncChangefeedErrorMetrics documenting this trade-off (mention truncation to 256
chars and the potential for high cardinality) and how to monitor it, and also
emit a simple telemetry metric reflecting the current unique error-message
cardinality (e.g., update or add a gauge like
metrics.ChangefeedErrorMessageCardinality.Set(float64(len(cache))) inside
syncChangefeedErrorMetrics) so production can track the number of distinct
messages over time.

In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json`:
- Around line 1197-1219: The panel's Prometheus target should explicitly
restrict states to warning or failed to make intent clear and guard against
future metric changes: update the target expression that references
ticdc_owner_changefeed_error_info (the "expr" in the targets for panel id 62002)
to include a state selector like state=~"warning|failed"; this aligns the query
with getChangefeedErrorMetricLabels' behavior and keeps filtering alongside
existing label matchers (k8s_cluster, sharedpool_id, keyspace_name, changefeed).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 46141cd1-8df7-4696-9e1d-3f7e1f1a1990

📥 Commits

Reviewing files that changed from the base of the PR and between 1154cb3 and 3db01dc.

📒 Files selected for processing (6)
  • coordinator/controller.go
  • coordinator/controller_test.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
  • pkg/metrics/changefeed.go

Comment on lines +1218 to +1221
"expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
"format": "time_series",
"instant": true,
"refId": "A"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Filter out inactive error series in the panel query.

At Line [1218], the query does not exclude zero-valued series. If the gauge is reset to 0 for recovered changefeeds, stale labels can still appear in this table as “current” errors.

Suggested query fix
- "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
+ "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"}) > 0",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metrics/grafana/ticdc_new_arch.json` around lines 1218 - 1221, The panel's
"expr" currently returns series with zero values (stale labels); update the
expression that uses the metric ticdc_owner_changefeed_error_info (the "expr"
field for refId "A") to exclude zero-valued series by applying a > 0 filter
after the metric or after the aggregation (e.g., only keep series where the
value is greater than 0) so recovered changefeeds no longer show as current
errors.

Comment thread pkg/metrics/changefeed.go
Comment on lines +76 to +83
ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Namespace: "ticdc",
Subsystem: "owner",
Name: "changefeed_error_info",
Help: "The current warning or failed reason of changefeeds",
}, []string{getKeyspaceLabel(), "changefeed", "state", "code", "message"})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Metric definition and usage ==="
rg -n --type go -C2 'ChangefeedErrorInfoGauge|changefeed_error_info|labelValues\(\)'

echo
echo "=== Normalization/redaction implementation for error message ==="
rg -n --type go -A40 -B5 'func normalizeChangefeedErrorMetricMessage'

echo
echo "=== Potential redaction/sanitization helpers in related code ==="
rg -n --type go -C2 'redact|sanitize|mask|password|passwd|token|secret|SinkURI|UserPassword'

Repository: pingcap/ticdc

Length of output: 50370


🏁 Script executed:

rg -n --type go 'changefeedErrorMetricMsgLimit' -B2 -A2

Repository: pingcap/ticdc

Length of output: 816


🏁 Script executed:

rg -n --type go 'getChangefeedErrorMetricLabels' -B3 -A8

Repository: pingcap/ticdc

Length of output: 1500


🏁 Script executed:

# Look at test expectations to understand what kind of messages are expected
rg -n --type go -A5 'expectedMessage.*normalizeChangefeedErrorMetricMessage'

Repository: pingcap/ticdc

Length of output: 756


Apply redaction to error messages in ChangefeedErrorInfoGauge labels.

Line 82 includes raw error text as a Prometheus label. While normalizeChangefeedErrorMetricMessage (line 67–73) bounds length to 256 characters, it performs no redaction or masking. Error messages may contain connection strings, credentials, or API keys, which will leak into metrics storage. Use the existing redaction infrastructure (e.g., util.RedactValue()) to sanitize the message before label assignment, or use a stable error fingerprint label and keep full text in logs/events only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/metrics/changefeed.go` around lines 76 - 83, The metric
ChangefeedErrorInfoGauge currently includes raw error text in the "message"
label (constructed via normalizeChangefeedErrorMetricMessage), which can leak
secrets; update the code that sets labels for ChangefeedErrorInfoGauge to
sanitize the message using the existing redaction helper (e.g.,
util.RedactValue()) before calling normalizeChangefeedErrorMetricMessage or
instead emit a stable error fingerprint label (e.g., "error_fingerprint") and
move full text only to logs/events; ensure the change references the
ChangefeedErrorInfoGauge and normalizeChangefeedErrorMetricMessage symbols so
all callers use the sanitized value.

Signed-off-by: wk989898 <nhsmwk@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 16, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json (1)

7460-7460: Prefer stable panel IDs for existing panels.

Renumbering existing panel IDs increases dashboard churn and can break deep links/bookmarks that reference viewPanel. If there’s no collision, keep existing IDs and allocate new IDs only for newly added panels.

Also applies to: 7556-7556, 7662-7662, 8245-8245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json` at line 7460,
Several panel "id" fields were renumbered (e.g., "id": 62004 and the other
affected ids at 7556, 7662, 8245); restore the original stable panel IDs for
existing panels instead of assigning new numbers. Locate the JSON objects with
"id" keys (match the shown values like 62004, 7556, 7662, 8245), revert them to
their previous IDs if no collision exists, and if you must add new panels
generate new unique IDs that do not overlap existing ones (use a consistent
high-range allocation or an ID registry) so deep links/bookmarks (viewPanel)
remain stable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json`:
- Line 7460: Several panel "id" fields were renumbered (e.g., "id": 62004 and
the other affected ids at 7556, 7662, 8245); restore the original stable panel
IDs for existing panels instead of assigning new numbers. Locate the JSON
objects with "id" keys (match the shown values like 62004, 7556, 7662, 8245),
revert them to their previous IDs if no collision exists, and if you must add
new panels generate new unique IDs that do not overlap existing ones (use a
consistent high-range allocation or an ID registry) so deep links/bookmarks
(viewPanel) remain stable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c11186d7-4a21-4529-b41b-88fdf9b1df85

📥 Commits

Reviewing files that changed from the base of the PR and between 3db01dc and fe3d264.

📒 Files selected for processing (3)
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json

Signed-off-by: wk989898 <nhsmwk@gmail.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@coordinator/controller.go`:
- Around line 79-86: The code currently picks runningErr by preferring
info.Error over info.Warning regardless of state; change this to select the
field based on info.State (use info.Error when info.State == config.StateFailed
and use info.Warning when info.State == config.StateWarning) so a warning-state
changefeed cannot emit failed-state details; update the assignment of runningErr
(the logic around info.State, runningErr, info.Error, info.Warning and the check
against config.StateFailed/config.StateWarning) accordingly so the chosen error
matches the current state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b7d7cc5a-b736-4e30-94cb-311796d3067a

📥 Commits

Reviewing files that changed from the base of the PR and between fe3d264 and cf5e671.

📒 Files selected for processing (1)
  • coordinator/controller.go

Comment thread coordinator/controller.go Outdated
Signed-off-by: wk989898 <nhsmwk@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 16, 2026
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 19, 2026
@ti-chi-bot ti-chi-bot bot added the lgtm label Mar 19, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,lidezhu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 19, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Mar 19, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-19 08:38:34.789517381 +0000 UTC m=+432641.877174918: ☑️ agreed by 3AceShowHand.
  • 2026-03-19 09:10:46.39585448 +0000 UTC m=+434573.483512017: ☑️ agreed by lidezhu.

Signed-off-by: wk989898 <nhsmwk@gmail.com>
Signed-off-by: wk989898 <nhsmwk@gmail.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@coordinator/helper.go`:
- Around line 76-79: The current logic sets runningErr := info.Error then falls
back to info.Warning, which can surface the wrong message when info.State
indicates a warning; update the selection to be state-first: inspect info.State
(e.g., "error", "warning") and choose info.Error only when State == "error",
choose info.Warning only when State == "warning", and finally fallback to
whichever non-nil field remains (or nil) — update the code around runningErr,
using the info.State, info.Error, and info.Warning symbols to implement this
deterministic selection and ensure nil checks when selecting the fallback.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da5f2ea9-5efb-4af2-b04e-10fd6e849790

📥 Commits

Reviewing files that changed from the base of the PR and between cf5e671 and 1452837.

📒 Files selected for processing (5)
  • coordinator/controller.go
  • coordinator/helper.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • coordinator/controller.go
  • metrics/grafana/ticdc_new_arch.json

Comment thread coordinator/helper.go
Comment on lines +76 to +79
runningErr := info.Error
if runningErr == nil {
runningErr = info.Warning
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use state-first error selection to avoid mismatched warning/failed details.

At Line 76, info.Error is always preferred even when info.State is warning. If both fields are present, this can publish a stale/incorrect code/message for the current state.

🔧 Proposed fix
-	runningErr := info.Error
-	if runningErr == nil {
-		runningErr = info.Warning
-	}
+	var runningErr *config.RunningError
+	switch info.State {
+	case config.StateFailed:
+		runningErr = info.Error
+		if runningErr == nil {
+			runningErr = info.Warning
+		}
+	case config.StateWarning:
+		runningErr = info.Warning
+		if runningErr == nil {
+			runningErr = info.Error
+		}
+	}
 	if runningErr == nil {
 		return changefeedErrorMetricLabels{}, false
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
runningErr := info.Error
if runningErr == nil {
runningErr = info.Warning
}
var runningErr *config.RunningError
switch info.State {
case config.StateFailed:
runningErr = info.Error
if runningErr == nil {
runningErr = info.Warning
}
case config.StateWarning:
runningErr = info.Warning
if runningErr == nil {
runningErr = info.Error
}
}
if runningErr == nil {
return changefeedErrorMetricLabels{}, false
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/helper.go` around lines 76 - 79, The current logic sets
runningErr := info.Error then falls back to info.Warning, which can surface the
wrong message when info.State indicates a warning; update the selection to be
state-first: inspect info.State (e.g., "error", "warning") and choose info.Error
only when State == "error", choose info.Warning only when State == "warning",
and finally fallback to whichever non-nil field remains (or nil) — update the
code around runningErr, using the info.State, info.Error, and info.Warning
symbols to implement this deterministic selection and ensure nil checks when
selecting the fallback.

@ti-chi-bot ti-chi-bot bot merged commit e8de3a6 into pingcap:master Mar 20, 2026
26 checks passed
tenfyzhong pushed a commit that referenced this pull request Mar 25, 2026
@wk989898 wk989898 deleted the metric-0316 branch April 13, 2026 04:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add a metric to display the error message when the changefeed state is warning/failed

3 participants