metrics: show error time in changefeed error details panel (#5086) #5104

Open: ti-chi-bot wants to merge 1 commit into pingcap:release-8.5 from ti-chi-bot:cherry-pick-5086-to-release-8.5

Conversation

@ti-chi-bot
Member

This is an automated cherry-pick of #5086

What problem does this PR solve?

Issue Number: close #5085

What is changed and how it works?

  • Expose the current changefeed error occurrence time through the ticdc_owner_changefeed_error_info metric as an error_time label.
  • Render error_time in the Changefeed Error Details Grafana panel for both standard and next-generation dashboards.
  • Format non-zero error times as UTC RFC3339 strings and leave missing historical timestamps blank.
  • Add unit coverage for the new metric-label formatting behavior.

Check List

Tests

  • Unit test
  • Manual test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. This extends the current error-info metric with one stable label for the active error occurrence time. Existing selectors remain valid, while consumers that depend on the exact label set should account for the added error_time label.

Do you need to update user documentation, design documentation or monitoring documentation?

No separate documentation update is needed. The affected Grafana dashboard definitions are updated in this PR.

Release note

None

Summary by CodeRabbit

Release Notes

  • New Features

    • Changefeed error metrics now capture the time errors occurred (UTC format) for improved troubleshooting and error tracking.
    • Updated monitoring dashboards to display error timestamps alongside error details.
  • Tests

    • Added unit tests for error timestamp handling and metric label validation.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR. labels May 20, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 20, 2026

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must first receive LGTM and approval from the reviewers.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. AFTER it has lgtm and approved labels, please wait for the cherry-pick merging approval from triage owners.

@ti-chi-bot
Member Author

@wlwilliamx This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot

ti-chi-bot Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lidezhu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot Bot commented May 20, 2026

@ti-chi-bot: If you want to know how to resolve the conflicts, please read the guide in the TiDB Dev Guide.


@coderabbitai
Contributor

coderabbitai Bot commented May 20, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new Prometheus metric and Grafana dashboard panel to display detailed changefeed error information, including error codes, messages, and occurrence times. However, the submission is currently broken as it contains Git merge conflict markers across multiple files, including Go source and JSON dashboard definitions. Additionally, the constant changefeedErrorMetricMsgLimit is referenced in coordinator/helper.go but is not defined, which will prevent the code from compiling.

Comment thread coordinator/helper.go
Comment on lines +17 to +23
<<<<<<< HEAD
=======
	"strings"
	"time"

	"github.com/pingcap/ticdc/pkg/config"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))

critical

The code contains Git merge conflict markers. These must be resolved before the code can be compiled.

	"strings"
	"time"

	"github.com/pingcap/ticdc/pkg/config"
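
Leftover markers like the ones flagged above could be caught before review with a simple scan. The following is a minimal sketch, not part of this PR: `conflictMarkerLines` is a hypothetical helper that reports the 1-based line numbers starting with one of Git's default conflict markers.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// conflictMarkerLines returns the 1-based numbers of the lines in src that
// look like Git conflict markers ("<<<<<<< ", "=======", ">>>>>>> ").
// Hypothetical helper for illustration only; it is not part of the PR.
func conflictMarkerLines(src string) []int {
	var hits []int
	scanner := bufio.NewScanner(strings.NewReader(src))
	lineNo := 0
	for scanner.Scan() {
		lineNo++
		line := scanner.Text()
		if strings.HasPrefix(line, "<<<<<<< ") ||
			line == "=======" ||
			strings.HasPrefix(line, ">>>>>>> ") {
			hits = append(hits, lineNo)
		}
	}
	return hits
}

func main() {
	// A shortened version of the import hunk quoted in this thread.
	src := "import (\n<<<<<<< HEAD\n=======\n\t\"strings\"\n\t\"time\"\n>>>>>>> 7b68b7051\n)\n"
	fmt.Println(conflictMarkerLines(src)) // [2 3 6]
}
```

An empty result means no marker lines were found. The sketch only matches markers at the start of a line, which is where Git writes them by default.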

Comment thread coordinator/helper.go
Comment on lines +49 to +126
<<<<<<< HEAD
=======

type changefeedErrorMetricLabels struct {
	keyspace   string
	changefeed string
	state      string
	errorTime  string
	code       string
	message    string
}

func isUnchangedRuntimeState(info *config.ChangeFeedInfo, state config.FeedState, err *config.RunningError) bool {
	if info == nil {
		return true
	}
	if info.State != state {
		return false
	}
	return sameRunningErrorSignature(info.Error, err)
}

func sameRunningErrorSignature(lhs *config.RunningError, rhs *config.RunningError) bool {
	if lhs == nil || rhs == nil {
		return lhs == rhs
	}
	return lhs.Addr == rhs.Addr &&
		lhs.Code == rhs.Code &&
		lhs.Message == rhs.Message
}

func (l changefeedErrorMetricLabels) labelValues() []string {
	return []string{l.keyspace, l.changefeed, l.state, l.errorTime, l.code, l.message}
}

func normalizeChangefeedErrorMetricMessage(message string) string {
	message = strings.Join(strings.Fields(message), " ")
	if len(message) <= changefeedErrorMetricMsgLimit {
		return message
	}
	return message[:changefeedErrorMetricMsgLimit-3] + "..."
}

func normalizeChangefeedErrorMetricTime(errorTime time.Time) string {
	// Keep the label stable across nodes with different local time zones while remaining
	// directly readable in Grafana's table view.
	if errorTime.IsZero() {
		return ""
	}
	return errorTime.UTC().Format(time.RFC3339)
}

func getChangefeedErrorMetricLabels(info *config.ChangeFeedInfo) (changefeedErrorMetricLabels, bool) {
	if info == nil {
		return changefeedErrorMetricLabels{}, false
	}
	if info.State != config.StateFailed && info.State != config.StateWarning {
		return changefeedErrorMetricLabels{}, false
	}

	runningErr := info.Error
	if runningErr == nil {
		runningErr = info.Warning
	}
	if runningErr == nil {
		return changefeedErrorMetricLabels{}, false
	}

	return changefeedErrorMetricLabels{
		keyspace:   info.ChangefeedID.Keyspace(),
		changefeed: info.ChangefeedID.Name(),
		state:      string(info.State),
		errorTime:  normalizeChangefeedErrorMetricTime(runningErr.Time),
		code:       runningErr.Code,
		message:    normalizeChangefeedErrorMetricMessage(runningErr.Message),
	}, true
}
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))

critical

The code contains Git merge conflict markers. Additionally, the constant changefeedErrorMetricMsgLimit is used in normalizeChangefeedErrorMetricMessage but is not defined in this file. Based on the dashboard description, it should be set to 256.

const changefeedErrorMetricMsgLimit = 256

type changefeedErrorMetricLabels struct {
	keyspace   string
	changefeed string
	state      string
	errorTime  string
	code       string
	message    string
}

func isUnchangedRuntimeState(info *config.ChangeFeedInfo, state config.FeedState, err *config.RunningError) bool {
	if info == nil {
		return true
	}
	if info.State != state {
		return false
	}
	return sameRunningErrorSignature(info.Error, err)
}

func sameRunningErrorSignature(lhs *config.RunningError, rhs *config.RunningError) bool {
	if lhs == nil || rhs == nil {
		return lhs == rhs
	}
	return lhs.Addr == rhs.Addr &&
		lhs.Code == rhs.Code &&
		lhs.Message == rhs.Message
}

func (l changefeedErrorMetricLabels) labelValues() []string {
	return []string{l.keyspace, l.changefeed, l.state, l.errorTime, l.code, l.message}
}

func normalizeChangefeedErrorMetricMessage(message string) string {
	message = strings.Join(strings.Fields(message), " ")
	if len(message) <= changefeedErrorMetricMsgLimit {
		return message
	}
	return message[:changefeedErrorMetricMsgLimit-3] + "..."
}

func normalizeChangefeedErrorMetricTime(errorTime time.Time) string {
	// Keep the label stable across nodes with different local time zones while remaining
	// directly readable in Grafana's table view.
	if errorTime.IsZero() {
		return ""
	}
	return errorTime.UTC().Format(time.RFC3339)
}

func getChangefeedErrorMetricLabels(info *config.ChangeFeedInfo) (changefeedErrorMetricLabels, bool) {
	if info == nil {
		return changefeedErrorMetricLabels{}, false
	}
	if info.State != config.StateFailed && info.State != config.StateWarning {
		return changefeedErrorMetricLabels{}, false
	}

	runningErr := info.Error
	if runningErr == nil {
		runningErr = info.Warning
	}
	if runningErr == nil {
		return changefeedErrorMetricLabels{}, false
	}

	return changefeedErrorMetricLabels{
		keyspace:   info.ChangefeedID.Keyspace(),
		changefeed: info.ChangefeedID.Name(),
		state:      string(info.State),
		errorTime:  normalizeChangefeedErrorMetricTime(runningErr.Time),
		code:       runningErr.Code,
		message:    normalizeChangefeedErrorMetricMessage(runningErr.Message),
	}, true
}

Comment thread pkg/metrics/changefeed.go
Comment on lines +76 to +88
<<<<<<< HEAD
=======
	// ChangefeedErrorInfoGauge records the current warning or failed reason and its occurrence time
	// for each changefeed.
	ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "ticdc",
			Subsystem: "owner",
			Name:      "changefeed_error_info",
			Help:      "The current warning or failed reason and occurrence time of changefeeds",
		}, []string{getKeyspaceLabel(), "changefeed", "state", "error_time", "code", "message"})

>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))

critical

The code contains Git merge conflict markers.

	// ChangefeedErrorInfoGauge records the current warning or failed reason and its occurrence time
	// for each changefeed.
	ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "ticdc",
			Subsystem: "owner",
			Name:      "changefeed_error_info",
			Help:      "The current warning or failed reason and occurrence time of changefeeds",
		}, []string{getKeyspaceLabel(), "changefeed", "state", "error_time", "code", "message"})

Comment on lines +4594 to +4733
<<<<<<< HEAD
=======
},
{
"datasource": "${DS_TEST-CLUSTER}",
"description": "Current warning or failed reason of each changefeed. The metric message is normalized to a single line and truncated to 256 characters.",
"fieldConfig": {
"defaults": {
"custom": {
"align": null,
"filterable": false
},
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "namespace"
},
"properties": [
{
"id": "custom.width",
"value": 120
}
]
},
{
"matcher": {
"id": "byName",
"options": "changefeed"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
},
{
"matcher": {
"id": "byName",
"options": "state"
},
"properties": [
{
"id": "custom.width",
"value": 100
}
]
},
{
"matcher": {
"id": "byName",
"options": "code"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
},
{
"matcher": {
"id": "byName",
"options": "error_time"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 26
},
"id": 62010,
"options": {
"showHeader": true,
"sortBy": []
},
"pluginVersion": "7.5.17",
"targets": [
{
"expr": "max by (namespace, changefeed, state, code, error_time, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
"format": "time_series",
"instant": true,
"refId": "A"
}
],
"title": "Changefeed Error Details",
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Metric": true,
"Time": true,
"Value": true,
"__name__": true
},
"indexByName": {
"namespace": 0,
"changefeed": 1,
"state": 2,
"error_time": 3,
"code": 4,
"message": 5
},
"renameByName": {}
}
}
],
"type": "table"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))

critical

The JSON file contains Git merge conflict markers, which makes it invalid.
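
The invalidity is easy to confirm with the standard library. This is a small sketch, assuming nothing about the repository's tooling: `isValidDashboardJSON` is a hypothetical wrapper around `json.Valid`, and the inputs are shortened stand-ins for the dashboard file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isValidDashboardJSON reports whether data is syntactically valid JSON.
// Hypothetical wrapper for illustration; it does not check Grafana's schema.
func isValidDashboardJSON(data []byte) bool {
	return json.Valid(data)
}

func main() {
	clean := []byte(`{"title": "Changefeed Error Details", "type": "table"}`)
	conflicted := []byte("{\n<<<<<<< HEAD\n=======\n\"type\": \"table\"\n>>>>>>> 7b68b7051\n}")

	fmt.Println(isValidDashboardJSON(clean))      // true
	fmt.Println(isValidDashboardJSON(conflicted)) // false
}
```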

Comment on lines +4594 to +4733
<<<<<<< HEAD
=======
},
{
"datasource": "${DS_TEST-CLUSTER}",
"description": "Current warning or failed reason of each changefeed. The metric message is normalized to a single line and truncated to 256 characters.",
"fieldConfig": {
"defaults": {
"custom": {
"align": null,
"filterable": false
},
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "keyspace_name"
},
"properties": [
{
"id": "custom.width",
"value": 120
}
]
},
{
"matcher": {
"id": "byName",
"options": "changefeed"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
},
{
"matcher": {
"id": "byName",
"options": "state"
},
"properties": [
{
"id": "custom.width",
"value": 100
}
]
},
{
"matcher": {
"id": "byName",
"options": "code"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
},
{
"matcher": {
"id": "byName",
"options": "error_time"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 26
},
"id": 62010,
"options": {
"showHeader": true,
"sortBy": []
},
"pluginVersion": "7.5.17",
"targets": [
{
"expr": "max by (keyspace_name, changefeed, state, code, error_time, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
"format": "time_series",
"instant": true,
"refId": "A"
}
],
"title": "Changefeed Error Details",
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Metric": true,
"Time": true,
"Value": true,
"__name__": true
},
"indexByName": {
"keyspace_name": 0,
"changefeed": 1,
"state": 2,
"error_time": 3,
"code": 4,
"message": 5
},
"renameByName": {}
}
}
],
"type": "table"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))

critical

The JSON file contains Git merge conflict markers, which makes it invalid.

Comment on lines +2337 to +2476
<<<<<<< HEAD
=======
},
{
"datasource": "${DS_TEST-CLUSTER}",
"description": "Current warning or failed reason of each changefeed. The metric message is normalized to a single line and truncated to 256 characters.",
"fieldConfig": {
"defaults": {
"custom": {
"align": null,
"filterable": false
},
"links": [],
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "keyspace_name"
},
"properties": [
{
"id": "custom.width",
"value": 120
}
]
},
{
"matcher": {
"id": "byName",
"options": "changefeed"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
},
{
"matcher": {
"id": "byName",
"options": "state"
},
"properties": [
{
"id": "custom.width",
"value": 100
}
]
},
{
"matcher": {
"id": "byName",
"options": "code"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
},
{
"matcher": {
"id": "byName",
"options": "error_time"
},
"properties": [
{
"id": "custom.width",
"value": 180
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 26
},
"id": 62010,
"options": {
"showHeader": true,
"sortBy": []
},
"pluginVersion": "7.5.17",
"targets": [
{
"expr": "max by (keyspace_name, changefeed, state, code, error_time, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
"format": "time_series",
"instant": true,
"refId": "A"
}
],
"title": "Changefeed Error Details",
"transformations": [
{
"id": "labelsToFields",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Metric": true,
"Time": true,
"Value": true,
"__name__": true
},
"indexByName": {
"keyspace_name": 0,
"changefeed": 1,
"state": 2,
"error_time": 3,
"code": 4,
"message": 5
},
"renameByName": {}
}
}
],
"type": "table"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))

critical

The JSON file contains Git merge conflict markers, which makes it invalid.

Labels

do-not-merge/cherry-pick-not-approved do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR.
