
ddl: limit the number of DDL job retries #7474

Closed
wants to merge 2 commits into from

Conversation

@ciscoxll (Contributor) commented Aug 23, 2018

What problem does this PR solve?

This PR solves the problem of infinite retry when a DDL job error occurs.

What is changed and how it works?

  • If the current DDL job fails, it is retried indefinitely, so we need to limit the number of retries.
  • Prevent failing jobs from blocking the entire DDL queue.
  • Add some logging and metrics.
  • Set the DDL job error retry limit with set @@global.tidb_ddl_error_retry_limit = 10000.
  • Fix issue #7517
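The retry cap described above can be sketched as follows (a hypothetical, stripped-down model: the type names and `errorCountLimit` are illustrative, and the real PR wires this check into TiDB's DDL worker loop):

```go
package main

import "fmt"

// Hypothetical, stripped-down model of the retry cap described above;
// the real PR wires this check into TiDB's DDL worker loop.
const errorCountLimit = 512

type jobState int

const (
	stateRunning jobState = iota
	stateCancelling
)

type job struct {
	ErrorCount int64
	State      jobState
}

// recordError counts a failed execution attempt. Once the count exceeds
// the limit, the job is marked for cancellation instead of retrying forever.
func (j *job) recordError() {
	j.ErrorCount++
	if j.ErrorCount > errorCountLimit {
		j.State = stateCancelling
	}
}

func main() {
	j := &job{}
	for i := 0; i < errorCountLimit+1; i++ {
		j.recordError()
	}
	fmt.Println(j.State == stateCancelling) // prints true
}
```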

Check List

Tests

  • Unit test

PTAL @tiancaiamao @winkyao @zimulala .



@@ -521,6 +522,54 @@ func (s *testDDLSuite) TestBuildJobDependence(c *C) {
})
}

+func (s *testDDLSuite) TestErrorCountlimit(c *C) {

@zimulala (Member) commented Aug 24, 2018

Please put it in the fail test file.

@@ -460,6 +463,7 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
job.State = model.JobStateCancelled
job.Error = errCancelledDDLJob
job.ErrorCount++
+metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())

@winkyao (Member) commented Aug 25, 2018

Is this metric valuable?

@@ -51,6 +51,8 @@ const (
waitDependencyJobInterval = 200 * time.Millisecond
// noneDependencyJob means a job has no dependency-job.
noneDependencyJob = 0
+// errorCountlimit limits the number of retries.
+errorCountlimit = 10000

@winkyao (Member) commented Aug 25, 2018

10000 is too large; maybe 512 is enough.

@winkyao (Member) commented Aug 25, 2018

Could we add a config item or system variable to set it?

} else {
-log.Infof("[ddl-%s] the DDL job is normal to cancel because %v", w, errors.ErrorStack(err))
+log.Infof("[ddl-%s] the DDL job is normal to cancel because %v job query %s", w, errors.ErrorStack(err), job.Query)

@winkyao (Member) commented Aug 25, 2018

Use ", job query" (add a comma).

@@ -519,13 +523,18 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
if err != nil {
// If job is not cancelled, we should log this error.
if job.State != model.JobStateCancelled {
-log.Errorf("[ddl-%s] run DDL job err %v", w, errors.ErrorStack(err))
+log.Errorf("[ddl-%s] run DDL job err %v job query %s ", w, errors.ErrorStack(err), job.Query)

@winkyao (Member) commented Aug 25, 2018

Use ", job query" (add a comma).

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > errorCountlimit {
job.State = model.JobStateCancelling
metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())

@crazycs520 (Contributor) commented Aug 27, 2018

Is startTime the time the job started executing, or the start time of the last execution attempt?

@@ -51,6 +51,8 @@ const (
waitDependencyJobInterval = 200 * time.Millisecond
// noneDependencyJob means a job has no dependency-job.
noneDependencyJob = 0
+// errorCountlimit limits the number of retries.
+errorCountlimit = 512

@winkyao (Member) commented Aug 27, 2018

Make this configurable.

@ciscoxll (Author, Contributor) commented Aug 27, 2018

@winkyao PTAL.

@ciscoxll ciscoxll force-pushed the ciscoxll:retry-limit branch from fe7d452 to 47a28c4 Aug 27, 2018

// SetDDLErrorRetryLimit sets ddlErrorRetryLimit count.
func SetDDLErrorRetryLimit(cnt int32) {
if cnt < minDDLErrorRetryLimit {
cnt = ddlErrorRetryLimit

@winkyao (Member) commented Aug 27, 2018

Why not set it to minDDLErrorRetryLimit?
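A sketch of the setter corrected along the reviewer's suggestion, clamping invalid input to the minimum rather than falling back to the old value. This assumes the limit is held in an atomic int32; the concrete value of minDDLErrorRetryLimit here is invented for illustration and is not taken from the PR:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative values; the PR keeps the real limit in the `variable` package.
const minDDLErrorRetryLimit int32 = 1

var ddlErrorRetryLimit int32 = 512

// SetDDLErrorRetryLimit clamps invalid values to the minimum, per the
// review comment, instead of silently falling back to the previous value.
func SetDDLErrorRetryLimit(cnt int32) {
	if cnt < minDDLErrorRetryLimit {
		cnt = minDDLErrorRetryLimit
	}
	atomic.StoreInt32(&ddlErrorRetryLimit, cnt)
}

// GetDDLErrorRetryLimit reads the limit atomically, since DDL workers and
// variable updates from sessions can run concurrently.
func GetDDLErrorRetryLimit() int32 {
	return atomic.LoadInt32(&ddlErrorRetryLimit)
}

func main() {
	SetDDLErrorRetryLimit(-1)
	fmt.Println(GetDDLErrorRetryLimit()) // prints 1 (clamped)
}
```

Storing through sync/atomic avoids a data race between DDL workers reading the limit and sessions changing it, which a plain package-level variable would not.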

@@ -223,4 +223,8 @@ func (s *testVarsutilSuite) TestVarsutil(c *C) {
c.Assert(err, IsNil)
c.Assert(val, Equals, "1")
c.Assert(v.EnableTablePartition, IsTrue)

c.Assert(GetDDLErrorRetryLimit(), Equals, int32(DefTiDBDDLErrorRetryLimit))

@winkyao (Member) commented Aug 27, 2018

Add more test cases, such as setting -1 for DefTiDBDDLErrorRetryLimit. Please make your unit tests cover more cases.

@ciscoxll (Author, Contributor) commented Aug 27, 2018

@winkyao Done.

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) {
startTime := time.Now()

@crazycs520 (Contributor) commented Aug 27, 2018

Why is startTime not job.StartTS?

@@ -519,13 +520,19 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
if err != nil {
// If job is not cancelled, we should log this error.
if job.State != model.JobStateCancelled {
-log.Errorf("[ddl-%s] run DDL job err %v", w, errors.ErrorStack(err))
+log.Errorf("[ddl-%s] run DDL job err %v, job query %s ", w, errors.ErrorStack(err), job.Query)

@zimulala (Member) commented Aug 27, 2018

Please add metrics (WorkerCancelDDLJob) here.

@ciscoxll (Author, Contributor) commented Aug 27, 2018

@zimulala Done.

@@ -214,6 +217,7 @@ const (
DefTiDBDDLReorgWorkerCount = 16
DefTiDBHashAggPartialConcurrency = 4
DefTiDBHashAggFinalConcurrency = 4
+DefTiDBDDLErrorRetryLimit = 512

@zimulala (Member) commented Aug 27, 2018

Will this value be too small?

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) {

@zimulala (Member) commented Aug 28, 2018

Do we need to add a log here?

@@ -293,6 +293,9 @@ type SessionVars struct {
EnableStreaming bool

writeStmtBufs WriteStmtBufs

+// DDLErrorRetryLimit limits the number of error retries that occur in a DDL job.
+DDLErrorRetryLimit int64

@zimulala (Member) commented Aug 28, 2018

Is this variable useful?

@ciscoxll ciscoxll force-pushed the ciscoxll:retry-limit branch 5 times, most recently from b6d83fa to 1f19c1b Aug 29, 2018

@ciscoxll ciscoxll force-pushed the ciscoxll:retry-limit branch from fb9c447 to 71acda4 Aug 30, 2018

@ciscoxll (Author, Contributor) commented Sep 3, 2018

/run-all-tests

@ciscoxll (Author, Contributor) commented Sep 4, 2018

1 similar comment

@ciscoxll (Author, Contributor) commented Sep 5, 2018

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
job.State = model.JobStateCancelling

@zimulala (Member) commented Sep 5, 2018

Is there a problem with the following scenario?
If an “Add column” operation has already updated the schema to “delete only”, and then we cancel this job.

@ciscoxll (Author, Contributor) commented Sep 6, 2018

@zimulala I wrote a test that cancels this job, and there was no error.

@zimulala (Member) commented Sep 7, 2018

@ciscoxll
If a “drop column” operation has updated the column to “delete only”, and the job then hits many errors and is canceled successfully, we return an error to the client, but the column's state remains "delete only".
I think this will be a problem.

@ciscoxll (Author, Contributor) commented Sep 7, 2018

@zimulala I will write a rollback in another PR.

@zimulala (Member) commented Feb 13, 2019

Handled in #9295.

@ciscoxll (Author, Contributor) commented Sep 6, 2018

@winkyao PTAL.

if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
job.State = model.JobStateCancelling
metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, job.Type.String(), metrics.RetLabel(err)).Observe(time.Since(model.TSConvert2Time(job.StartTS)).Seconds())

@crazycs520 (Contributor) commented Sep 13, 2018

Is this a repeat of line #529?
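For context on the model.TSConvert2Time call in this hunk: a TiDB TSO timestamp packs physical milliseconds since the Unix epoch in the high bits and an 18-bit logical counter in the low bits, so the conversion is roughly the following (a simplified sketch, not the actual TiDB helper):

```go
package main

import (
	"fmt"
	"time"
)

// tsConvert2Time sketches what model.TSConvert2Time does: a TSO timestamp
// stores physical milliseconds since the Unix epoch in the high bits and an
// 18-bit logical counter in the low bits. (Simplified; the real helper
// lives in TiDB's model package.)
func tsConvert2Time(ts uint64) time.Time {
	physicalMillis := int64(ts >> 18) // drop the logical counter
	return time.Unix(physicalMillis/1000, (physicalMillis%1000)*int64(time.Millisecond))
}

func main() {
	// Pack a known wall-clock time into TSO form and round-trip it.
	now := time.Now()
	ts := uint64(now.UnixNano()/int64(time.Millisecond)) << 18
	fmt.Println(tsConvert2Time(ts).Unix() == now.Unix()) // prints true
}
```

This is why measuring from job.StartTS (as the metric line does) gives the elapsed time since the job was created, not since the last execution attempt, which is what the review thread is probing.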

@ciscoxll ciscoxll added the status/DNM label Sep 28, 2018

@winkyao winkyao closed this Nov 23, 2018
