metrics: show error time in changefeed error details panel (#5086) #5104
ti-chi-bot wants to merge 1 commit into
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
@wlwilliamx This PR has conflicts, so I have put it on hold.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@ti-chi-bot: If you want to know how to resolve it, please read the guide in TiDB Dev Guide.
Code Review
This pull request introduces a new Prometheus metric and Grafana dashboard panel to display detailed changefeed error information, including error codes, messages, and occurrence times. However, the submission is currently broken as it contains Git merge conflict markers across multiple files, including Go source and JSON dashboard definitions. Additionally, the constant changefeedErrorMetricMsgLimit is referenced in coordinator/helper.go but is not defined, which will prevent the code from compiling.
<<<<<<< HEAD
=======
	"strings"
	"time"

	"github.com/pingcap/ticdc/pkg/config"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))
<<<<<<< HEAD
=======

type changefeedErrorMetricLabels struct {
	keyspace   string
	changefeed string
	state      string
	errorTime  string
	code       string
	message    string
}

func isUnchangedRuntimeState(info *config.ChangeFeedInfo, state config.FeedState, err *config.RunningError) bool {
	if info == nil {
		return true
	}
	if info.State != state {
		return false
	}
	return sameRunningErrorSignature(info.Error, err)
}

func sameRunningErrorSignature(lhs *config.RunningError, rhs *config.RunningError) bool {
	if lhs == nil || rhs == nil {
		return lhs == rhs
	}
	return lhs.Addr == rhs.Addr &&
		lhs.Code == rhs.Code &&
		lhs.Message == rhs.Message
}

func (l changefeedErrorMetricLabels) labelValues() []string {
	return []string{l.keyspace, l.changefeed, l.state, l.errorTime, l.code, l.message}
}

func normalizeChangefeedErrorMetricMessage(message string) string {
	message = strings.Join(strings.Fields(message), " ")
	if len(message) <= changefeedErrorMetricMsgLimit {
		return message
	}
	return message[:changefeedErrorMetricMsgLimit-3] + "..."
}

func normalizeChangefeedErrorMetricTime(errorTime time.Time) string {
	// Keep the label stable across nodes with different local time zones while remaining
	// directly readable in Grafana's table view.
	if errorTime.IsZero() {
		return ""
	}
	return errorTime.UTC().Format(time.RFC3339)
}

func getChangefeedErrorMetricLabels(info *config.ChangeFeedInfo) (changefeedErrorMetricLabels, bool) {
	if info == nil {
		return changefeedErrorMetricLabels{}, false
	}
	if info.State != config.StateFailed && info.State != config.StateWarning {
		return changefeedErrorMetricLabels{}, false
	}

	runningErr := info.Error
	if runningErr == nil {
		runningErr = info.Warning
	}
	if runningErr == nil {
		return changefeedErrorMetricLabels{}, false
	}

	return changefeedErrorMetricLabels{
		keyspace:   info.ChangefeedID.Keyspace(),
		changefeed: info.ChangefeedID.Name(),
		state:      string(info.State),
		errorTime:  normalizeChangefeedErrorMetricTime(runningErr.Time),
		code:       runningErr.Code,
		message:    normalizeChangefeedErrorMetricMessage(runningErr.Message),
	}, true
}
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))
The code contains Git merge conflict markers. Additionally, the constant changefeedErrorMetricMsgLimit is used in normalizeChangefeedErrorMetricMessage but is not defined in this file. Based on the dashboard description, it should be set to 256.
const changefeedErrorMetricMsgLimit = 256

type changefeedErrorMetricLabels struct {
	keyspace   string
	changefeed string
	state      string
	errorTime  string
	code       string
	message    string
}

func isUnchangedRuntimeState(info *config.ChangeFeedInfo, state config.FeedState, err *config.RunningError) bool {
	if info == nil {
		return true
	}
	if info.State != state {
		return false
	}
	return sameRunningErrorSignature(info.Error, err)
}

func sameRunningErrorSignature(lhs *config.RunningError, rhs *config.RunningError) bool {
	if lhs == nil || rhs == nil {
		return lhs == rhs
	}
	return lhs.Addr == rhs.Addr &&
		lhs.Code == rhs.Code &&
		lhs.Message == rhs.Message
}

func (l changefeedErrorMetricLabels) labelValues() []string {
	return []string{l.keyspace, l.changefeed, l.state, l.errorTime, l.code, l.message}
}

func normalizeChangefeedErrorMetricMessage(message string) string {
	message = strings.Join(strings.Fields(message), " ")
	if len(message) <= changefeedErrorMetricMsgLimit {
		return message
	}
	return message[:changefeedErrorMetricMsgLimit-3] + "..."
}

func normalizeChangefeedErrorMetricTime(errorTime time.Time) string {
	// Keep the label stable across nodes with different local time zones while remaining
	// directly readable in Grafana's table view.
	if errorTime.IsZero() {
		return ""
	}
	return errorTime.UTC().Format(time.RFC3339)
}

func getChangefeedErrorMetricLabels(info *config.ChangeFeedInfo) (changefeedErrorMetricLabels, bool) {
	if info == nil {
		return changefeedErrorMetricLabels{}, false
	}
	if info.State != config.StateFailed && info.State != config.StateWarning {
		return changefeedErrorMetricLabels{}, false
	}
	runningErr := info.Error
	if runningErr == nil {
		runningErr = info.Warning
	}
	if runningErr == nil {
		return changefeedErrorMetricLabels{}, false
	}
	return changefeedErrorMetricLabels{
		keyspace:   info.ChangefeedID.Keyspace(),
		changefeed: info.ChangefeedID.Name(),
		state:      string(info.State),
		errorTime:  normalizeChangefeedErrorMetricTime(runningErr.Time),
		code:       runningErr.Code,
		message:    normalizeChangefeedErrorMetricMessage(runningErr.Message),
	}, true
}
<<<<<<< HEAD
=======
	// ChangefeedErrorInfoGauge records the current warning or failed reason and its occurrence time
	// for each changefeed.
	ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "ticdc",
			Subsystem: "owner",
			Name:      "changefeed_error_info",
			Help:      "The current warning or failed reason and occurrence time of changefeeds",
		}, []string{getKeyspaceLabel(), "changefeed", "state", "error_time", "code", "message"})

>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))
The code contains Git merge conflict markers.
	// ChangefeedErrorInfoGauge records the current warning or failed reason and its occurrence time
	// for each changefeed.
	ChangefeedErrorInfoGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "ticdc",
			Subsystem: "owner",
			Name:      "changefeed_error_info",
			Help:      "The current warning or failed reason and occurrence time of changefeeds",
		}, []string{getKeyspaceLabel(), "changefeed", "state", "error_time", "code", "message"})
<<<<<<< HEAD
=======
},
{
  "datasource": "${DS_TEST-CLUSTER}",
  "description": "Current warning or failed reason of each changefeed. The metric message is normalized to a single line and truncated to 256 characters.",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "align": null,
        "filterable": false
      },
      "links": [],
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      }
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "namespace"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 120
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "changefeed"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "state"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 100
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "code"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "error_time"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 8,
    "w": 24,
    "x": 0,
    "y": 26
  },
  "id": 62010,
  "options": {
    "showHeader": true,
    "sortBy": []
  },
  "pluginVersion": "7.5.17",
  "targets": [
    {
      "expr": "max by (namespace, changefeed, state, code, error_time, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"})",
      "format": "time_series",
      "instant": true,
      "refId": "A"
    }
  ],
  "title": "Changefeed Error Details",
  "transformations": [
    {
      "id": "labelsToFields",
      "options": {}
    },
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Metric": true,
          "Time": true,
          "Value": true,
          "__name__": true
        },
        "indexByName": {
          "namespace": 0,
          "changefeed": 1,
          "state": 2,
          "error_time": 3,
          "code": 4,
          "message": 5
        },
        "renameByName": {}
      }
    }
  ],
  "type": "table"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))
<<<<<<< HEAD
=======
},
{
  "datasource": "${DS_TEST-CLUSTER}",
  "description": "Current warning or failed reason of each changefeed. The metric message is normalized to a single line and truncated to 256 characters.",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "align": null,
        "filterable": false
      },
      "links": [],
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      }
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "keyspace_name"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 120
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "changefeed"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "state"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 100
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "code"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "error_time"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 8,
    "w": 24,
    "x": 0,
    "y": 26
  },
  "id": 62010,
  "options": {
    "showHeader": true,
    "sortBy": []
  },
  "pluginVersion": "7.5.17",
  "targets": [
    {
      "expr": "max by (keyspace_name, changefeed, state, code, error_time, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
      "format": "time_series",
      "instant": true,
      "refId": "A"
    }
  ],
  "title": "Changefeed Error Details",
  "transformations": [
    {
      "id": "labelsToFields",
      "options": {}
    },
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Metric": true,
          "Time": true,
          "Value": true,
          "__name__": true
        },
        "indexByName": {
          "keyspace_name": 0,
          "changefeed": 1,
          "state": 2,
          "error_time": 3,
          "code": 4,
          "message": 5
        },
        "renameByName": {}
      }
    }
  ],
  "type": "table"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))
<<<<<<< HEAD
=======
},
{
  "datasource": "${DS_TEST-CLUSTER}",
  "description": "Current warning or failed reason of each changefeed. The metric message is normalized to a single line and truncated to 256 characters.",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "align": null,
        "filterable": false
      },
      "links": [],
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      }
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "keyspace_name"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 120
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "changefeed"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "state"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 100
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "code"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      },
      {
        "matcher": {
          "id": "byName",
          "options": "error_time"
        },
        "properties": [
          {
            "id": "custom.width",
            "value": 180
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 8,
    "w": 24,
    "x": 0,
    "y": 26
  },
  "id": 62010,
  "options": {
    "showHeader": true,
    "sortBy": []
  },
  "pluginVersion": "7.5.17",
  "targets": [
    {
      "expr": "max by (keyspace_name, changefeed, state, code, error_time, message) (ticdc_owner_changefeed_error_info{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"})",
      "format": "time_series",
      "instant": true,
      "refId": "A"
    }
  ],
  "title": "Changefeed Error Details",
  "transformations": [
    {
      "id": "labelsToFields",
      "options": {}
    },
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Metric": true,
          "Time": true,
          "Value": true,
          "__name__": true
        },
        "indexByName": {
          "keyspace_name": 0,
          "changefeed": 1,
          "state": 2,
          "error_time": 3,
          "code": 4,
          "message": 5
        },
        "renameByName": {}
      }
    }
  ],
  "type": "table"
>>>>>>> 7b68b7051 (metrics: show error time in changefeed error details panel (#5086))
This is an automated cherry-pick of #5086
What problem does this PR solve?
Issue Number: close #5085
What is changed and how it works?
The error occurrence time is exposed on the ticdc_owner_changefeed_error_info metric as an error_time label.
error_time is shown in the Changefeed Error Details Grafana panel for both standard and next-generation dashboards.
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No expected performance regression. This extends the current error-info metric with one stable label for the active error occurrence time. Existing selectors remain valid, while consumers that depend on the exact label set should account for the added error_time label.
Do you need to update user documentation, design documentation or monitoring documentation?
No separate documentation update is needed. The affected Grafana dashboard definitions are updated in this PR.
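To make the compatibility answer above more concrete, here is a minimal sketch of the difference between a consumer that selects series by a subset of labels and one that expects an exact set of label names. It is a plain-Go model, not Prometheus code: matchesSelector and hasExactLabelNames are hypothetical helpers, the label values are invented, and keyspace_name is used for the keyspace label only because the next-generation dashboard query uses that name.

```go
package main

import "fmt"

// matchesSelector models equality matchers only: every label named in the
// selector must have the same value on the series. Extra labels on the series
// are ignored.
func matchesSelector(series, selector map[string]string) bool {
	for name, value := range selector {
		if series[name] != value {
			return false
		}
	}
	return true
}

// hasExactLabelNames models a consumer that expects exactly the given,
// distinct label names and nothing else.
func hasExactLabelNames(series map[string]string, names []string) bool {
	if len(series) != len(names) {
		return false
	}
	for _, name := range names {
		if _, ok := series[name]; !ok {
			return false
		}
	}
	return true
}

func main() {
	// A series as it might look after this change (values are invented).
	series := map[string]string{
		"keyspace_name": "default",
		"changefeed":    "cf-1",
		"state":         "warning",
		"error_time":    "2025-01-02T03:04:05Z",
		"code":          "CDC:ErrExample",
		"message":       "sink unavailable",
	}

	// A selector written before error_time existed still matches.
	selector := map[string]string{"changefeed": "cf-1", "state": "warning"}
	fmt.Println(matchesSelector(series, selector)) // prints true

	// A consumer expecting the previous label names no longer sees an exact match.
	before := []string{"keyspace_name", "changefeed", "state", "code", "message"}
	fmt.Println(hasExactLabelNames(series, before)) // prints false
}
```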
Release note