metric(dm): add resumable_err label to unit error count metric #7852
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
/run-all-tests
/run-all-tests
```diff
@@ -24,11 +24,11 @@ groups:
           summary: DM remain storage of relay log
       - alert: DM_relay_process_exits_with_error
-        expr: changes(dm_relay_exit_with_error_count[1m]) > 0
+        expr: changes(dm_relay_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(instance, job) increase(dm_relay_exit_with_error_count{resumable_err="true"}[1m]) > 3
```
Alert if we keep auto-resuming (which will take at least 25s = 5+5+5+10 with the default check interval). Since Backoff decreases the backoff slowly (over about 1-2 minutes), we might alert only the first time if some resumable error happens twice in a short period.
Change the expr to increase(dm_relay_exit_with_error_count{resumable_err="true"}[2m]) > 3, in case the resumable error is a connection timeout, which takes 30s. @lance6716 ptal
```diff
@@ -3165,10 +3175,12 @@ func (s *Syncer) Resume(ctx context.Context, pr chan pb.ProcessResult) {
 	var err error
 	defer func() {
 		if err != nil {
+			processError := unit.NewProcessError(err)
+			s.handleExitErrMetric(processError)
```
The syncer will prepare the DB before Process; on resume, that error should also be counted.
/run-all-tests
1 similar comment
/run-all-tests
/run-dm-integration-test
…ow into jujiajia/diff-resumable-err
/run-all-tests
/run-verify
```diff
       labels:
         env: ENV_LABELS_ENV
         level: critical
-      expr: changes(dm_syncer_exit_with_error_count[1m]) > 0
+      expr: changes(dm_syncer_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(source_id, task) increase(dm_syncer_exit_with_error_count{resumable_err="true"}[2m]) > 3
```
Change the expr to increase(xxxxxxx[2m]) > 3, in case the resumable error is a connection timeout, which takes 30s. @lance6716 ptal
I'm not sure the pause and resume actions run quickly enough for 3 resumes to happen within 2 minutes. There's another rule, "DM_binlog_file_gap_between_master_syncer"; will the service use it as well?
Also, our documentation should be updated: https://docs.pingcap.com/zh/tidb/stable/dm-handle-alerts
If we meet a resumable error for the first time, or for the first time after a long-running state, pause/resume will happen every 5 seconds with the default check interval. Which type of pause/resume is even slower than a connection timeout?
Apart from restarting the binlog stream, the pause/resume action includes rolling back checkpoints, which is very slow in old versions.
@okJiang ptal
@buchuitoudegou ptal
/run-all-tests
/run-engine-integration-test
2 similar comments
/run-engine-integration-test
/run-engine-integration-test
/merge
This pull request has been accepted and is ready to merge. Commit hash: 161b80d
@D3Hunter: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests
If the CI test fails, just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/run-all-tests
2 similar comments
/run-all-tests |
/run-all-tests |
What problem does this PR solve?
Issue Number: ref #7115, #7376
What is changed and how it works?
- Add a `resumable_err` label to all `XXX_exit_with_error_count` metrics.
- Change the alert expression to `changes(dm_syncer_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(source_id, task) increase(dm_syncer_exit_with_error_count{resumable_err="true"}[1m]) > 3`; the alerts for the other `XXX_exit_with_error_count` metrics are changed in the same way.
- In the monitoring panel, the left graph is the un-resumable error count and the right is the resumable one.
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note