metric(dm): add resumable_err label to unit error count metric #7852

Merged
merged 14 commits into from
Dec 16, 2022

Conversation

D3Hunter
Contributor

@D3Hunter D3Hunter commented Dec 8, 2022

What problem does this PR solve?

Issue Number: ref #7115, #7376

What is changed and how it works?

  • We need to alert customers on dump/load/sync errors, but some errors are resumed automatically and we don't want to alert too frequently, so I added a resumable_err label to every XXX_exit_with_error_count metric.
  • Alerts use changes(dm_syncer_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(source_id, task) increase(dm_syncer_exit_with_error_count{resumable_err="true"}[1m]) > 3; the other XXX_exit_with_error_count metrics follow the same pattern.
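
The intent of the two-branch expression can be sketched in Go as follows. This is only an approximation of the rule's logic, not how Prometheus evaluates it: the type and function names are illustrative, and the real increase() extrapolates over the window, so its value need not be an integer.

```go
package main

import "fmt"

// windowCounts holds how many exit errors of each kind were seen inside the
// alert window. The type and its field names are illustrative only.
type windowCounts struct {
	nonResumable int // growth of the counter with resumable_err="false"
	resumable    int // growth of the counter with resumable_err="true"
}

// shouldAlert approximates the rule: any non-resumable error fires at once,
// while resumable errors fire only after more than `threshold` occurrences.
func shouldAlert(c windowCounts, threshold int) bool {
	return c.nonResumable > 0 || c.resumable > threshold
}

func main() {
	fmt.Println(shouldAlert(windowCounts{nonResumable: 1}, 3)) // true
	fmt.Println(shouldAlert(windowCounts{resumable: 3}, 3))    // false
	fmt.Println(shouldAlert(windowCounts{resumable: 4}, 3))    // true
}
```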

Screenshot (image not preserved): the left panel shows a non-resumable error, the right panel a resumable one.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    • The alert expression was tested manually; see the screenshot above.
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

adjust alert expression of `DM_XXX_process_exits_with_error` alerts to reduce false alarms

@D3Hunter D3Hunter added the area/dm Issues or PRs related to DM. label Dec 8, 2022
@ti-chi-bot
Member

ti-chi-bot commented Dec 8, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • lance6716
  • okJiang

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Dec 8, 2022
@D3Hunter
Contributor Author

D3Hunter commented Dec 8, 2022

/run-all-tests

@ti-chi-bot ti-chi-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 8, 2022
@D3Hunter
Contributor Author

D3Hunter commented Dec 8, 2022

/run-all-tests

@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2022
@@ -24,11 +24,11 @@ groups:
summary: DM remain storage of relay log

- alert: DM_relay_process_exits_with_error
-    expr: changes(dm_relay_exit_with_error_count[1m]) > 0
+    expr: changes(dm_relay_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(instance, job) increase(dm_relay_exit_with_error_count{resumable_err="true"}[1m]) > 3
Contributor Author

Alert only if auto-resume keeps happening: reaching the threshold takes at least 25s (5+5+5+10 with the default check interval).

Because Backoff decreases the backoff slowly (over about 1-2 minutes), if a resumable error happens twice in a short period we might alert only on the first occurrence.

Contributor Author

Changed the expr to increase(dm_relay_exit_with_error_count{resumable_err="true"}[2m]) > 3 in case the resumable error is a connection timeout, which takes 30s. @lance6716 PTAL

@@ -3165,10 +3175,12 @@ func (s *Syncer) Resume(ctx context.Context, pr chan pb.ProcessResult) {
var err error
defer func() {
if err != nil {
processError := unit.NewProcessError(err)
s.handleExitErrMetric(processError)
Contributor Author

On resume, the syncer prepares the DB before Process, so an error at that stage should be counted too.

@D3Hunter D3Hunter marked this pull request as ready for review December 8, 2022 15:41
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 8, 2022
@D3Hunter
Contributor Author

D3Hunter commented Dec 8, 2022

/run-all-tests

1 similar comment
@D3Hunter
Contributor Author

D3Hunter commented Dec 8, 2022

/run-all-tests

@D3Hunter
Contributor Author

D3Hunter commented Dec 9, 2022

/run-dm-integration-test

@D3Hunter
Contributor Author

D3Hunter commented Dec 9, 2022

/run-all-tests

@D3Hunter
Contributor Author

D3Hunter commented Dec 9, 2022

/run-verify

dm/syncer/syncer.go Outdated Show resolved Hide resolved
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 14, 2022
labels:
env: ENV_LABELS_ENV
level: critical
-    expr: changes(dm_syncer_exit_with_error_count[1m]) > 0
+    expr: changes(dm_syncer_exit_with_error_count{resumable_err="false"}[1m]) > 0 or on(source_id, task) increase(dm_syncer_exit_with_error_count{resumable_err="true"}[2m]) > 3
Contributor Author

Changed the expr to increase(xxxxxxx[2m]) > 3 in case the resumable error is a connection timeout, which takes 30s. @lance6716 PTAL

Contributor

I'm not sure the pause and resume actions run quickly enough for 3 resumes to happen within 2 minutes. There's another rule, "DM_binlog_file_gap_between_master_syncer"; will the service use it as well?

Also, our documentation should be updated: https://docs.pingcap.com/zh/tidb/stable/dm-handle-alerts

Contributor Author

When we hit a resumable error for the first time, or for the first time after a long period of running normally, pause/resume happens every 5 seconds with the default check interval. Which type of pause/resume is even slower than a connection timeout?

Contributor

Apart from starting the binlog stream, the pause/resume action includes rolling back checkpoints, which is very slow in old versions.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 15, 2022
@D3Hunter D3Hunter requested review from maxshuang and okJiang and removed request for maxshuang December 15, 2022 03:12
@D3Hunter
Contributor Author

@okJiang ptal

@D3Hunter
Contributor Author

@buchuitoudegou ptal

@D3Hunter
Contributor Author

/run-all-tests

@D3Hunter
Contributor Author

/run-engine-integration-test

2 similar comments
@D3Hunter
Contributor Author

/run-engine-integration-test

@D3Hunter
Contributor Author

/run-engine-integration-test

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Dec 16, 2022
@D3Hunter
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 161b80d

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 16, 2022
@ti-chi-bot
Member

@D3Hunter: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

This triggers some heavy tests that do not always run when a PR is updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@D3Hunter
Contributor Author

/run-all-tests

2 similar comments
@D3Hunter
Contributor Author

/run-all-tests

@D3Hunter
Contributor Author

/run-all-tests

@ti-chi-bot ti-chi-bot merged commit df521ec into master Dec 16, 2022
@D3Hunter D3Hunter deleted the jujiajia/diff-resumable-err branch December 16, 2022 09:15
Labels
area/dm Issues or PRs related to DM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
4 participants