New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddl: fix a bug that cancel add index ddl job when workers are not run, which cause tidb ddl hang up #8171

Merged
merged 25 commits into from Nov 28, 2018

Conversation

@winkyao
Member

winkyao commented Nov 5, 2018

What problem does this PR solve?

If we cancel add index ddl job when the job.Type == model.ActionAddIndex && job.SchemaState == model.StateWriteReorganization && job.SnapshotVer == 0, the TiDB ddl will hang up like:

MySQL [(none)]> admin show ddl jobs;                                                                                                                                                   [11/973]
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+                                           
| JOBS                                                                                                                                                                                                                                | STATE     |                                           
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+                                           
| ID:230, Type:drop index, State:running, SchemaState:none, SchemaID:31, TableID:48, RowCount:0, ArgLen:0, start time: 2018-10-25 17:24:31.123 +0800 CST, Err:[ddl:101]invalid table state public, ErrCount:344905, SnapshotVersion:0 | running   |                                           
| ID:231, Type:drop index, State:none, SchemaState:none, SchemaID:31, TableID:51, RowCount:0, ArgLen:0, start time: 2018-10-25 17:27:10.173 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0                                       | none      |                                           
| ID:232, Type:drop index, State:none, SchemaState:none, SchemaID:31, TableID:54, RowCount:0, ArgLen:0, start time: 2018-10-25 17:27:11.273 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0                                       | none      |                                           

What is changed and how it works?

// runDDLJob runs a DDL job. It returns the current schema version in this transaction and the error.
func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64, err error) {
   log.Infof("[ddl-%s] run DDL job %s", w, job)
   timeStart := time.Now()
   defer func() {
      metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerRunDDLJob, job.Type.String(), metrics.RetLabel(err)).Observe(time.Since(timeStart).Seconds())
   }()
   if job.IsFinished() {
      return
   }
   // The cause of this job state is that the job is cancelled by client.
   if job.IsCancelling() {
      // If the value of SnapshotVer isn't zero, it means the work is backfilling the indexes.
      if job.Type == model.ActionAddIndex && job.SchemaState == model.StateWriteReorganization && job.SnapshotVer != 0 {
         log.Infof("[ddl-%s] run the cancelling DDL job %s", w, job)
         w.reorgCtx.notifyReorgCancel()
      } else {
         job.State = model.JobStateCancelled
         job.Error = errCancelledDDLJob
         job.ErrorCount++
         return
      }
   }

if job.Type == model.ActionAddIndex && job.SchemaState == model.StateWriteReorganization && job.SnapshotVer == 0, we will set job.State to model.JobStateCancelled, unfortunately, the adding index state is model.StateWriteReorganization, and it will not be rollbacked. So when we execute drop index, we will got error:

tidb-2018-10-25T19-14-23.472.log:2018/10/25 17:44:44.356 ddl_worker.go:329: [info] [ddl] run DDL job ID:230, Type:drop index, State:running, SchemaState:none, SchemaID:31, TableID:48, RowCount:0, ArgLen:0, start time: 2018-10-25 17:24:31.123 +0800 CST, Err:[ddl:101]invalid table state write reorganization, ErrCount:172588, SnapshotVersion:0

it is returned by:

func onDropIndex(t *meta.Meta, job *model.Job) (ver int64, _ error) {
	...

	originalState := indexInfo.State
	switch indexInfo.State {
	case model.StatePublic:
                ...
	case model.StateWriteOnly:
		...
	case model.StateDeleteOnly:
		...
	case model.StateDeleteReorganization:
		...
	default:
		err = ErrInvalidTableState.GenWithStack("invalid table state %v", tblInfo.State)
	}

Check List

Tests

  • Unit test

Related changes

  • Need to cherry-pick to the release branch

This change is Reviewable

winkyao added some commits Nov 5, 2018

@crazycs520

This comment has been minimized.

Contributor

crazycs520 commented Nov 6, 2018

/run-all-tests

Show resolved Hide resolved ddl/index.go Outdated

winkyao added some commits Nov 12, 2018

@winkyao

This comment has been minimized.

Member

winkyao commented Nov 12, 2018

@zimulala

I think the operation of add column or other operations also has this problem. Could we fix these operations together?

defer func() { ddl.ReorgWaitTimeout = oldReorgWaitTimeout }()
hook := &ddl.TestDDLCallback{}
hook.OnJobRunBeforeExported = func(job *model.Job) {
log.Infof("hook.OnJobRunBeforeExported, job: %v", job)

This comment has been minimized.

@zimulala

zimulala Nov 13, 2018

Member

Remove this debug log?

Show resolved Hide resolved ddl/db_test.go
@winkyao

This comment has been minimized.

Member

winkyao commented Nov 14, 2018

@zimulala I will fix the issue of adding columns or other operations in another PR.

@winkyao winkyao added the status/DNM label Nov 15, 2018

@winkyao

This comment has been minimized.

Member

winkyao commented Nov 15, 2018

I'll add more test cases of canceling add column later.

winkyao added some commits Nov 20, 2018

@winkyao winkyao removed the status/DNM label Nov 21, 2018

@crazycs520

This comment has been minimized.

Contributor

crazycs520 commented Nov 26, 2018

@winkyao

This comment has been minimized.

Member

winkyao commented Nov 27, 2018

@crazycs520 crazycs520 added status/LGT2 and removed status/LGT1 labels Nov 27, 2018

@winkyao

This comment has been minimized.

Member

winkyao commented Nov 28, 2018

/run-all-tests

job.State = model.JobStateRollingback
col := &model.ColumnInfo{}
pos := &ast.ColumnPosition{}
offset := 0

This comment has been minimized.

@tiancaiamao

tiancaiamao Nov 28, 2018

Contributor

Should offet be math.MaxUint64 ?

This comment has been minimized.

@winkyao

winkyao Nov 28, 2018

Member

I don't think it is necessary. The initial offset value is the same with onAddColumn.

This comment has been minimized.

@winkyao

winkyao Nov 28, 2018

Member

Actually, this param is useless but just decode it.

@tiancaiamao

This comment has been minimized.

Contributor

tiancaiamao commented Nov 28, 2018

LGTM

@tiancaiamao tiancaiamao added status/LGT3 and removed status/LGT2 labels Nov 28, 2018

@winkyao winkyao merged commit d301c16 into pingcap:master Nov 28, 2018

4 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-ci-tidb/build Jenkins job succeeded.
Details
license/cla Contributor License Agreement is signed.
Details

@winkyao winkyao deleted the winkyao:fix_error branch Nov 28, 2018

winkyao added a commit to winkyao/tidb that referenced this pull request Nov 28, 2018

zz-jason added a commit that referenced this pull request Nov 28, 2018

winkyao added a commit to winkyao/tidb that referenced this pull request Nov 29, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment