metric(dm): add resumable_err label to unit error count metric #7852
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
/run-all-tests
/run-all-tests
```diff
@@ -24,11 +24,11 @@ groups:
           summary: DM remain storage of relay log
       - alert: DM_relay_process_exits_with_error
-        expr: changes(dm_relay_exit_with_error_count[1m]) > 0
+        expr: changes(dm_relay_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(instance, job) increase(dm_relay_exit_with_error_count{resumable_err="true"}[1m]) > 3
```
Alert if we keep auto-resuming (which will take at least 25s = 5+5+5+10 with the default check interval). Since Backoff decreases the backoff slowly (over about 1-2 minutes), we might alert only the first time if some resumable error happens twice in a short period.
Change the expr to increase(dm_relay_exit_with_error_count{resumable_err="true"}[2m]) > 3, in case the resumable error is a connection timeout, which takes 30s. @lance6716 ptal
```diff
@@ -3165,10 +3175,12 @@ func (s *Syncer) Resume(ctx context.Context, pr chan pb.ProcessResult) {
 	var err error
 	defer func() {
 		if err != nil {
+			processError := unit.NewProcessError(err)
+			s.handleExitErrMetric(processError)
```
The syncer will prepare the DB before Process; on resume, that error should also be counted.
/run-all-tests
1 similar comment
/run-all-tests
/run-dm-integration-test
…ow into jujiajia/diff-resumable-err
/run-all-tests
/run-verify
```diff
       labels:
         env: ENV_LABELS_ENV
         level: critical
-      expr: changes(dm_syncer_exit_with_error_count[1m]) > 0
+      expr: changes(dm_syncer_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(source_id, task) increase(dm_syncer_exit_with_error_count{resumable_err="true"}[2m]) > 3
```
Change the expr to increase(xxxxxxx[2m]) > 3, in case the resumable error is a connection timeout, which takes 30s. @lance6716 ptal
I'm not sure the pause and resume actions run quickly enough for 3 resumes to happen within 2 minutes. There's another rule, "DM_binlog_file_gap_between_master_syncer"; will the service use it as well?
Also, our documentation should be updated: https://docs.pingcap.com/zh/tidb/stable/dm-handle-alerts
If we meet a resumable error for the first time, or for the first time after a long-running state, pause/resume will happen every 5 seconds with the default check interval. Which type of pause/resume is even slower than a connection timeout?
Apart from restarting the binlog stream, the pause/resume action includes rolling back checkpoints, which is very slow in old versions.
@okJiang ptal
@buchuitoudegou ptal
/run-all-tests
/run-engine-integration-test
2 similar comments
/run-engine-integration-test
/run-engine-integration-test
/merge
This pull request has been accepted and is ready to merge. Commit hash: 161b80d
@D3Hunter: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests
If the CI test fails, just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/run-all-tests
2 similar comments
/run-all-tests |
/run-all-tests |
What problem does this PR solve?
Issue Number: ref #7115, #7376
What is changed and how it works?
- Add a `resumable_err` label to all `XXX_exit_with_error_count` metrics.
- Change the alert expression to `changes(dm_syncer_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(source_id, task) increase(dm_syncer_exit_with_error_count{resumable_err="true"}[1m]) > 3`; the alerts for the other `XXX_exit_with_error_count` metrics are changed in the same way.
- In the monitoring panel, the left graph is the un-resumable error count and the right is the resumable one.
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note