metrics: add changefeed error info #4499
Summary of Changes
This pull request improves monitoring for TiCDC changefeeds by introducing a dedicated Prometheus metric that carries detailed error and warning information. Operators get direct visibility into the specific issue affecting each changefeed, which makes diagnosis and resolution faster. The metric is also integrated into the Grafana dashboards so the information is easy to find and read.
📝 Walkthrough
This PR introduces a new Prometheus metric (
Changes
Sequence Diagram
sequenceDiagram
actor Tick as Metrics Tick
participant Controller as Coordinator Controller
participant DB as Changefeed DB
participant Helper as Label Helper
participant Gauge as ErrorInfoGauge
participant Dashboard as Grafana Dashboard
Tick->>Controller: collectMetrics()
Controller->>DB: Foreach(changefeed)
loop For Each Changefeed
DB->>Helper: getChangefeedErrorMetricLabels(cf)
Helper->>Helper: Check state == Failed/Warning
Helper-->>DB: newLabels, hasError
DB->>Controller: Evaluate cache
alt Labels Changed or Error State Changed
Controller->>Gauge: Delete old labels (if cached)
alt hasError == true
Controller->>Gauge: Set(newLabels, 1)
Controller->>Controller: Update cache
else hasError == false
Controller->>Controller: Remove from cache
end
else Labels Unchanged & Error State Unchanged
Controller->>Controller: Skip (early return)
end
end
Controller->>Controller: Cleanup removed changefeeds
Controller->>Gauge: Delete(removed changefeed labels)
Gauge->>Dashboard: Expose metrics
Dashboard->>Dashboard: Display Error Details
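The reconciliation loop in the diagram can be sketched in Go roughly as follows. This is a minimal, self-contained illustration and not the PR's code: the `errorLabels`, `gauge`, `fakeGauge`, `changefeed`, `labelsFor`, and `syncErrorMetrics` names are all hypothetical, and the in-memory `fakeGauge` stands in for the Prometheus gauge so the sketch runs without external dependencies.

```go
package main

import "fmt"

// errorLabels is a hypothetical stand-in for the label set attached to the
// error-info gauge (changefeed, state, code, message).
type errorLabels struct {
	Changefeed, State, Code, Message string
}

// gauge is a hypothetical minimal interface; the real code uses a Prometheus gauge vector.
type gauge interface {
	Set(l errorLabels, v float64)
	Delete(l errorLabels)
}

// fakeGauge keeps live series in memory so the sketch runs without Prometheus.
type fakeGauge struct{ series map[errorLabels]float64 }

func (g *fakeGauge) Set(l errorLabels, v float64) { g.series[l] = v }
func (g *fakeGauge) Delete(l errorLabels)         { delete(g.series, l) }

// changefeed is a hypothetical snapshot of one changefeed's status.
type changefeed struct {
	ID, State, Code, Message string
}

// labelsFor mirrors the "Label Helper" step: only failed/warning states yield labels.
func labelsFor(cf changefeed) (errorLabels, bool) {
	if cf.State != "failed" && cf.State != "warning" {
		return errorLabels{}, false
	}
	return errorLabels{cf.ID, cf.State, cf.Code, cf.Message}, true
}

// syncErrorMetrics reconciles the gauge with the current changefeeds. The cache
// remembers which label set was last exported for each changefeed ID.
func syncErrorMetrics(g gauge, cache map[string]errorLabels, cfs []changefeed) {
	seen := make(map[string]struct{}, len(cfs))
	for _, cf := range cfs {
		seen[cf.ID] = struct{}{}
		newLabels, hasError := labelsFor(cf)
		old, cached := cache[cf.ID]
		if cached == hasError && (!hasError || old == newLabels) {
			continue // labels and error state unchanged: skip
		}
		if cached {
			g.Delete(old) // drop the stale series before exporting a new one
			delete(cache, cf.ID)
		}
		if hasError {
			g.Set(newLabels, 1)
			cache[cf.ID] = newLabels
		}
	}
	// Clean up series that belong to changefeeds that no longer exist.
	for id, old := range cache {
		if _, ok := seen[id]; !ok {
			g.Delete(old)
			delete(cache, id)
		}
	}
}

func main() {
	g := &fakeGauge{series: map[errorLabels]float64{}}
	cache := map[string]errorLabels{}
	syncErrorMetrics(g, cache, []changefeed{{"cf-1", "failed", "E1", "boom"}})
	fmt.Println("series after failure:", len(g.series))
	syncErrorMetrics(g, cache, []changefeed{{"cf-1", "normal", "", ""}})
	fmt.Println("series after recovery:", len(g.series))
}
```

Because stale label sets are deleted rather than set to zero in this sketch, a recovered changefeed leaves no series behind for a dashboard query to pick up.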
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Code Review
This pull request introduces a new metric, ticdc_owner_changefeed_error_info, to provide detailed information about changefeed errors. The implementation includes logic to create, update, and delete these metrics as the changefeed state changes. The changes are well-tested and include updates to the Grafana dashboards to visualize the new metric. My main feedback is a suggestion to refactor the metric synchronization logic for better readability and maintainability.
Actionable comments posted: 2
🧹 Nitpick comments (4)
coordinator/controller_test.go (2)
49-145: Add a warning-state metric case for full objective coverage.
These tests cover the failed → update → clear paths well, but the issue objective also requires warning-state visibility. Please add a warning-state assertion to lock in behavior for both supported states.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/controller_test.go` around lines 49 - 145, Add assertions for the "warning" state to achieve full objective coverage: in TestSyncChangefeedErrorMetrics create a changefeed with StateWarning (instead of or in addition to StateFailed) using changefeed.NewChangefeed and a RunningError, call controller.syncChangefeedErrorMetrics, and assert the exported metric contains state="warning" (use normalizeChangefeedErrorMetricMessage and getChangefeedErrorMetricLabelName for message/label formatting); similarly in TestSyncChangefeedErrorMetricsUpdateAndClear add a transition path that sets cf info to StateWarning and verifies the metric shows the warning state, then update/clear as done for failed to confirm metrics update and removal via controller.syncChangefeedErrorMetrics.
42-47: Avoid duplicating keyspace-label literals in test helper.
Lines 42-47 re-encode "keyspace_name"/"namespace" in test code. This can drift from production label selection logic over time. Prefer reusing a shared constant/helper from the metrics package (if exposed) to keep one source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/controller_test.go` around lines 42 - 47, Replace the hard-coded literals in getChangefeedErrorMetricLabelName (and its kerneltype.IsNextGen branch) with the exported constant or helper from the metrics package so the test uses the same source-of-truth as production; import the metrics package and return the appropriate exported symbol (e.g., metrics.KeyspaceLabelName or the relevant metrics.Changefeed label constant/helper, using the actual exported name available) instead of "keyspace_name" / "namespace".
metrics/nextgengrafana/ticdc_new_arch_next_gen.json (1)
1197-1219: Add explicit state filter to dashboard query as defensive refactoring.
The metric ticdc_owner_changefeed_error_info only emits when changefeed state is "failed" or "warning" (see getChangefeedErrorMetricLabels in coordinator/controller.go), so the current query is functionally correct. However, adding an explicit state=~"warning|failed" selector makes the panel intent clear and guards against future metric behavior changes.
Proposed query refinement
- "expr": "max by (keyspace_name, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
+ "expr": "max by (keyspace_name, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", state=~\"warning|failed\"})",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json` around lines 1197 - 1219, The panel's Prometheus target should explicitly restrict states to warning or failed to make intent clear and guard against future metric changes: update the target expression that references ticdc_owner_changefeed_error_info (the "expr" in the targets for panel id 62002) to include a state selector like state=~"warning|failed"; this aligns the query with getChangefeedErrorMetricLabels' behavior and keeps filtering alongside existing label matchers (k8s_cluster, sharedpool_id, keyspace_name, changefeed).
coordinator/controller.go (1)
270-300: Consider metric cardinality implications of the message label.
The sync logic is correct, but the message label could lead to high Prometheus metric cardinality if changefeeds cycle through different error messages over time (e.g., containing timestamps, connection addresses, or other dynamic content). Each unique message creates a new time series.
This may be acceptable given:
- The truncation to 256 chars limits variation
- Typically few changefeeds are in error state simultaneously
- The diagnostic value outweighs the cost per PR objectives
Consider documenting this trade-off or monitoring cardinality in production.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@coordinator/controller.go` around lines 270 - 300, The syncChangefeedErrorMetrics function currently exposes the dynamic "message" label (used with metrics.ChangefeedErrorInfoGauge), which can increase Prometheus cardinality; add a brief in-code comment above syncChangefeedErrorMetrics documenting this trade-off (mention truncation to 256 chars and the potential for high cardinality) and how to monitor it, and also emit a simple telemetry metric reflecting the current unique error-message cardinality (e.g., update or add a gauge like metrics.ChangefeedErrorMessageCardinality.Set(float64(len(cache))) inside syncChangefeedErrorMetrics) so production can track the number of distinct messages over time.
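To make the truncation point concrete, here is a minimal sketch of how a message could be bounded before it is used as a label value. It is not the PR's normalizeChangefeedErrorMetricMessage: the `boundMessage` helper and `messageLimit` constant are hypothetical, the 256 figure is taken from the review comment above, and the whitespace collapsing is an added assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// messageLimit is an assumed cap, based on the 256-character figure in the review.
const messageLimit = 256

// boundMessage is a hypothetical helper. It collapses runs of whitespace
// (including newlines) into single spaces and caps the result at messageLimit
// runes, so multi-byte characters are never cut in half.
func boundMessage(msg string) string {
	msg = strings.Join(strings.Fields(msg), " ")
	runes := []rune(msg)
	if len(runes) > messageLimit {
		return string(runes[:messageLimit])
	}
	return msg
}

func main() {
	long := strings.Repeat("x", 300)
	fmt.Println(len(boundMessage(long)))
	fmt.Println(boundMessage("dial tcp\n  10.0.0.1:3306: timeout"))
}
```

Bounding the length caps the size of each label value but not the number of distinct values, so messages that embed timestamps or addresses can still produce many series.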
📒 Files selected for processing (6)
- coordinator/controller.go
- coordinator/controller_test.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
- pkg/metrics/changefeed.go
"expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
"format": "time_series",
"instant": true,
"refId": "A"
Filter out inactive error series in the panel query.
At Line [1218], the query does not exclude zero-valued series. If the gauge is reset to 0 for recovered changefeeds, stale labels can still appear in this table as “current” errors.
Suggested query fix
- "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
+ "expr": "max by (namespace, changefeed, state, code, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"}) > 0",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@metrics/grafana/ticdc_new_arch.json` around lines 1218 - 1221, The panel's
"expr" currently returns series with zero values (stale labels); update the
expression that uses the metric ticdc_owner_changefeed_error_info (the "expr"
field for refId "A") to exclude zero-valued series by applying a > 0 filter
after the metric or after the aggregation (e.g., only keep series where the
value is greater than 0) so recovered changefeeds no longer show as current
errors.
ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: "ticdc",
        Subsystem: "owner",
        Name:      "changefeed_error_info",
        Help:      "The current warning or failed reason of changefeeds",
    }, []string{getKeyspaceLabel(), "changefeed", "state", "code", "message"})
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Metric definition and usage ==="
rg -n --type go -C2 'ChangefeedErrorInfoGauge|changefeed_error_info|labelValues\(\)'
echo
echo "=== Normalization/redaction implementation for error message ==="
rg -n --type go -A40 -B5 'func normalizeChangefeedErrorMetricMessage'
echo
echo "=== Potential redaction/sanitization helpers in related code ==="
rg -n --type go -C2 'redact|sanitize|mask|password|passwd|token|secret|SinkURI|UserPassword'
Repository: pingcap/ticdc
Length of output: 50370
🏁 Script executed:
rg -n --type go 'changefeedErrorMetricMsgLimit' -B2 -A2
Repository: pingcap/ticdc
Length of output: 816
🏁 Script executed:
rg -n --type go 'getChangefeedErrorMetricLabels' -B3 -A8
Repository: pingcap/ticdc
Length of output: 1500
🏁 Script executed:
# Look at test expectations to understand what kind of messages are expected
rg -n --type go -A5 'expectedMessage.*normalizeChangefeedErrorMetricMessage'
Repository: pingcap/ticdc
Length of output: 756
Apply redaction to error messages in ChangefeedErrorInfoGauge labels.
Line 82 includes raw error text as a Prometheus label. While normalizeChangefeedErrorMetricMessage (line 67–73) bounds length to 256 characters, it performs no redaction or masking. Error messages may contain connection strings, credentials, or API keys, which will leak into metrics storage. Use the existing redaction infrastructure (e.g., util.RedactValue()) to sanitize the message before label assignment, or use a stable error fingerprint label and keep full text in logs/events only.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/metrics/changefeed.go` around lines 76 - 83, The metric
ChangefeedErrorInfoGauge currently includes raw error text in the "message"
label (constructed via normalizeChangefeedErrorMetricMessage), which can leak
secrets; update the code that sets labels for ChangefeedErrorInfoGauge to
sanitize the message using the existing redaction helper (e.g.,
util.RedactValue()) before calling normalizeChangefeedErrorMetricMessage or
instead emit a stable error fingerprint label (e.g., "error_fingerprint") and
move full text only to logs/events; ensure the change references the
ChangefeedErrorInfoGauge and normalizeChangefeedErrorMetricMessage symbols so
all callers use the sanitized value.
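The two options named in this comment can be sketched with the standard library alone. Everything below is hypothetical and for illustration only: `maskCredentials` and `fingerprint` are not TiCDC functions, the sketch deliberately does not call util.RedactValue() because its signature is not shown in this thread, and the regular expression only covers the simple `scheme://user:password@host` shape.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// userinfoPattern matches the "user:password@" part of a URI embedded in free
// text. It is an illustrative pattern, not an exhaustive credential detector.
var userinfoPattern = regexp.MustCompile(`(://)[^/\s:@]+:[^/\s@]+@`)

// maskCredentials is a hypothetical helper that replaces URI credentials with
// a fixed placeholder before the message is used as a label value.
func maskCredentials(msg string) string {
	return userinfoPattern.ReplaceAllString(msg, "${1}***:***@")
}

// fingerprint is a hypothetical helper that returns a short, stable identifier
// for a message, suitable as a label when the full text stays in logs only.
func fingerprint(msg string) string {
	sum := sha256.Sum256([]byte(msg))
	return hex.EncodeToString(sum[:8]) // first 8 bytes, 16 hex characters
}

func main() {
	msg := "dial mysql://root:secret@10.0.0.1:3306/ failed"
	fmt.Println(maskCredentials(msg))
	fmt.Println(fingerprint(msg))
}
```

Masking keeps the message readable in the dashboard but depends on the pattern catching every secret shape. The fingerprint leaks nothing but requires a lookup in logs to recover the text.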
🧹 Nitpick comments (1)
metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json (1)
7460-7460: Prefer stable panel IDs for existing panels.
Renumbering existing panel IDs increases dashboard churn and can break deep links/bookmarks that reference viewPanel. If there’s no collision, keep existing IDs and allocate new IDs only for newly added panels.
Also applies to: 7556-7556, 7662-7662, 8245-8245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json` at line 7460, Several panel "id" fields were renumbered (e.g., "id": 62004 and the other affected ids at 7556, 7662, 8245); restore the original stable panel IDs for existing panels instead of assigning new numbers. Locate the JSON objects with "id" keys (match the shown values like 62004, 7556, 7662, 8245), revert them to their previous IDs if no collision exists, and if you must add new panels generate new unique IDs that do not overlap existing ones (use a consistent high-range allocation or an ID registry) so deep links/bookmarks (viewPanel) remain stable.
📒 Files selected for processing (3)
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@coordinator/controller.go`:
- Around line 79-86: The code currently picks runningErr by preferring
info.Error over info.Warning regardless of state; change this to select the
field based on info.State (use info.Error when info.State == config.StateFailed
and use info.Warning when info.State == config.StateWarning) so a warning-state
changefeed cannot emit failed-state details; update the assignment of runningErr
(the logic around info.State, runningErr, info.Error, info.Warning and the check
against config.StateFailed/config.StateWarning) accordingly so the chosen error
matches the current state.
📒 Files selected for processing (1)
coordinator/controller.go
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 3AceShowHand, lidezhu
Signed-off-by: wk989898 <nhsmwk@gmail.com>
Actionable comments posted: 1
📒 Files selected for processing (5)
- coordinator/controller.go
- coordinator/helper.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json
🚧 Files skipped from review as they are similar to previous changes (2)
- coordinator/controller.go
- metrics/grafana/ticdc_new_arch.json
runningErr := info.Error
if runningErr == nil {
    runningErr = info.Warning
}
Use state-first error selection to avoid mismatched warning/failed details.
At Line 76, info.Error is always preferred even when info.State is warning. If both fields are present, this can publish a stale/incorrect code/message for the current state.
🔧 Proposed fix
- runningErr := info.Error
- if runningErr == nil {
-     runningErr = info.Warning
- }
+ var runningErr *config.RunningError
+ switch info.State {
+ case config.StateFailed:
+     runningErr = info.Error
+     if runningErr == nil {
+         runningErr = info.Warning
+     }
+ case config.StateWarning:
+     runningErr = info.Warning
+     if runningErr == nil {
+         runningErr = info.Error
+     }
+ }
  if runningErr == nil {
      return changefeedErrorMetricLabels{}, false
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@coordinator/helper.go` around lines 76 - 79, The current logic sets
runningErr := info.Error then falls back to info.Warning, which can surface the
wrong message when info.State indicates a warning; update the selection to be
state-first: inspect info.State (e.g., "error", "warning") and choose info.Error
only when State == "error", choose info.Warning only when State == "warning",
and finally fallback to whichever non-nil field remains (or nil) — update the
code around runningErr, using the info.State, info.Error, and info.Warning
symbols to implement this deterministic selection and ensure nil checks when
selecting the fallback.
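The effect of state-first selection is easiest to see when both fields are populated. The following is a self-contained sketch with hypothetical stand-in types (`runningError`, `feedInfo`) and plain string states instead of the config constants; it is meant to illustrate the behavior, not to replace the proposed fix.

```go
package main

import "fmt"

// runningError and feedInfo are hypothetical stand-ins for the TiCDC types
// referenced in the review; only the fields needed for the sketch are included.
type runningError struct{ Code, Message string }

type feedInfo struct {
	State   string // "failed", "warning", or any other state
	Error   *runningError
	Warning *runningError
}

// pickRunningError prefers the field that matches the current state and falls
// back to the other field only when the preferred one is nil. States other
// than failed/warning yield nil, so no metric would be emitted for them.
func pickRunningError(info feedInfo) *runningError {
	switch info.State {
	case "failed":
		if info.Error != nil {
			return info.Error
		}
		return info.Warning
	case "warning":
		if info.Warning != nil {
			return info.Warning
		}
		return info.Error
	}
	return nil
}

func main() {
	// Both fields are set: a stale Error from an earlier failure and the
	// Warning that describes the current state.
	info := feedInfo{
		State:   "warning",
		Error:   &runningError{Code: "E-old", Message: "earlier failure"},
		Warning: &runningError{Code: "W-now", Message: "current warning"},
	}
	fmt.Println(pickRunningError(info).Code)
}
```

With the original error-first logic, the same input would have surfaced the stale failure details while the state label said warning.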
What problem does this PR solve?
Issue Number: close #4498
What is changed and how it works?
New Error Metrics: Introduced a new Prometheus metric, ChangefeedErrorInfoGauge, to capture detailed error and warning information for changefeeds, including keyspace, changefeed ID, state, error code, and a normalized error message.
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note
Summary by CodeRabbit