Add index ends in "rollback done" after injecting a network partition between the DDL owner and one TiKV node #50417

Closed
Lily2025 opened this issue Jan 15, 2024 · 2 comments · Fixed by #50429
Labels: affects-7.6, component/ddl, severity/major, type/bug

Lily2025 commented Jan 15, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Set tidb_enable_dist_task='on' and enable global sort.
2. Run add index.
3. Inject a network partition between the DDL owner and one TiKV node, keep it in place for 3 minutes, then recover.

2. What did you expect to see? (Required)

After the fault recovers, the add index job should succeed.

3. What did you see instead (Required)

After the fault recovers, the add index job ends in the state "rollback done".

the status of ddl job is not synced or done or running or queueing (now: 2024-01-13 19:21:28, jobId: 491, job type: add index /* ingest cloud */, state: rollback done)
operatorLogs:
[2024-01-13 19:03:00] ###### start adding index
alter table sbtest1 add index index_test_1705143780092 (c)
[2024-01-13 19:03:00] ###### wait for ddl job finish
[2024-01-13 19:21:28] ###### wait for ddl job to finish failed
select job_id, job_type, state from information_schema.ddl_jobs where query = 'alter table sbtest1 add index index_test_1705143780092 (c)'
jobId: 491, job type: add index /* ingest cloud */, state: rollback done
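
The operator check quoted above accepts four job states and reports anything else as a failure. A minimal sketch of that check, assuming the harness receives (job_id, job_type, state) rows from the information_schema.ddl_jobs query. The names OK_STATES, job_state_ok and check_job are hypothetical, and the real test tool's code is not shown in this report:

```python
# States the operator log message lists as acceptable for the DDL job.
# (Hypothetical name; taken from the wording of the failure message.)
OK_STATES = {"synced", "done", "running", "queueing"}


def job_state_ok(state: str) -> bool:
    """Return True if the DDL job state is one the check accepts."""
    return state.strip().lower() in OK_STATES


def check_job(row):
    """Return a failure description for an unacceptable row, else None.

    `row` is assumed to be a (job_id, job_type, state) tuple as selected
    from information_schema.ddl_jobs.
    """
    job_id, job_type, state = row
    if job_state_ok(state):
        return None
    return f"jobId: {job_id}, job type: {job_type}, state: {state}"
```

With the row from this report, check_job((491, "add index /* ingest cloud */", "rollback done")) returns a description rather than None, which matches the failure the operator logged.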

4. What is your TiDB version? (Required)

./tidb-server -V
Release Version: v7.6.0
Edition: Community
Git Commit Hash: 2df8bd1
Git Branch: heads/refs/tags/v7.6.0
UTC Build Time: 2024-01-12 02:51:35
GoVersion: go1.21.5
Race Enabled: false
Check Table Before Drop: false
Store: unistore
2024-01-13T18:46:17.819+0800

tidb logs:
[2024/01/13 19:21:09.480 +08:00] [INFO] [handle.go:186] ["task not resumable"] [taskKey=ddl/backfill/491]
[2024/01/13 19:21:09.782 +08:00] [ERROR] [handle.go:92] ["task reverted"] [task-id=210029] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:09.783 +08:00] [WARN] [index.go:2203] ["cannot get task"] [category=ddl] [task_key=ddl/backfill/491] [error="task not found"]
[2024/01/13 19:21:09.784 +08:00] [WARN] [reorg.go:232] ["run reorg job done"] [category=ddl] ["handled rows"=35232986] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:09.790 +08:00] [WARN] [ddl_worker.go:1118] ["run DDL job error"] [worker="worker 2, tp add index"] [category=ddl] [jobID=491] [conn=710937990] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:09.797 +08:00] [INFO] [ddl_worker.go:982] ["run DDL job failed, sleeps a while then retries it."] [worker="worker 2, tp add index"] [category=ddl] [jobID=491] [conn=710937990] [waitTime=1s] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:10.797 +08:00] [INFO] [ddl_worker.go:1367] ["schema version doesn't change"] [category=ddl]
[2024/01/13 19:21:10.807 +08:00] [INFO] [ddl_worker.go:1156] ["run DDL job"] [worker="worker 3, tp add index"] [category=ddl] [jobID=491] [conn=710937990] [category=ddl] [job="ID:491, Type:add index, State:running, SchemaState:write reorganization, SchemaID:104, TableID:245, RowCount:35232986, ArgLen:0, start time: 2024-01-13 19:03:00.09 +0800 CST, Err:[tikv:9005]Region is unavailable, ErrCount:510, SnapshotVersion:446993211166556275, LocalMode: false, UniqueWarnings:0"]
[2024/01/13 19:21:10.808 +08:00] [INFO] [index.go:888] ["index backfill state running"] [category=ddl] ["job ID"=491] [table=sbtest1] ["ingest mode"=true] [index=index_test_1705143780092]
[2024/01/13 19:21:10.813 +08:00] [INFO] [handle.go:186] ["task not resumable"] [taskKey=ddl/backfill/491]
[2024/01/13 19:21:11.115 +08:00] [ERROR] [handle.go:92] ["task reverted"] [task-id=210029] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.116 +08:00] [WARN] [index.go:2203] ["cannot get task"] [category=ddl] [task_key=ddl/backfill/491] [error="task not found"]
[2024/01/13 19:21:11.116 +08:00] [WARN] [reorg.go:232] ["run reorg job done"] [category=ddl] ["handled rows"=35232986] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.122 +08:00] [WARN] [ddl_worker.go:1118] ["run DDL job error"] [worker="worker 3, tp add index"] [category=ddl] [jobID=491] [conn=710937990] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.129 +08:00] [INFO] [ddl_worker.go:1367] ["schema version doesn't change"] [category=ddl]
[2024/01/13 19:21:11.139 +08:00] [INFO] [ddl_worker.go:1156] ["run DDL job"] [worker="worker 2, tp add index"] [category=ddl] [jobID=491] [conn=710937990] [category=ddl] [job="ID:491, Type:add index, State:running, SchemaState:write reorganization, SchemaID:104, TableID:245, RowCount:35232986, ArgLen:0, start time: 2024-01-13 19:03:00.09 +0800 CST, Err:[tikv:9005]Region is unavailable, ErrCount:511, SnapshotVersion:446993211166556275, LocalMode: false, UniqueWarnings:0"]
[2024/01/13 19:21:11.140 +08:00] [INFO] [index.go:888] ["index backfill state running"] [category=ddl] ["job ID"=491] [table=sbtest1] ["ingest mode"=true] [index=index_test_1705143780092]
[2024/01/13 19:21:11.146 +08:00] [INFO] [handle.go:186] ["task not resumable"] [taskKey=ddl/backfill/491]
[2024/01/13 19:21:11.448 +08:00] [ERROR] [handle.go:92] ["task reverted"] [task-id=210029] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.450 +08:00] [WARN] [index.go:2203] ["cannot get task"] [category=ddl] [task_key=ddl/backfill/491] [error="task not found"]
[2024/01/13 19:21:11.450 +08:00] [WARN] [reorg.go:232] ["run reorg job done"] [category=ddl] ["handled rows"=35232986] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.461 +08:00] [WARN] [index.go:1087] ["run add index job failed, convert job to rollback"] [category=ddl] [job="ID:491, Type:add index, State:running, SchemaState:write reorganization, SchemaID:104, TableID:245, RowCount:35232986, ArgLen:6, start time: 2024-01-13 19:03:00.09 +0800 CST, Err:[tikv:9005]Region is unavailable, ErrCount:511, SnapshotVersion:446993211166556275, LocalMode: false, UniqueWarnings:0"] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.501 +08:00] [WARN] [ddl_worker.go:1118] ["run DDL job error"] [worker="worker 2, tp add index"] [category=ddl] [jobID=491] [conn=710937990] [error="[tikv:9005]Region is unavailable"]
[2024/01/13 19:21:11.511 +08:00] [INFO] [domain.go:272] ["diff load InfoSchema success"] [currentSchemaVersion=436] [neededSchemaVersion=437] ["start time"=1.220424ms] [gotSchemaVersion=437] [phyTblIDs="[245]"] [actionTypes="[7]"] [diffTypes="["add index"]"]
[2024/01/13 19:21:11.546 +08:00] [INFO] [domain.go:873] ["mdl gets lock, update to owner"] [jobID=491] [version=437]
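
When reading the excerpt above, two things are worth pulling out: the ErrCount value carried in each job description, and the moment the job was converted to rollback. A small sketch for extracting both from log lines in this format. The function name summarize is hypothetical, and the patterns are assumptions based only on the lines shown here:

```python
import re

# Patterns assumed from the log excerpt in this report.
ERR_COUNT_RE = re.compile(r"ErrCount:(\d+)")
TIMESTAMP_RE = re.compile(r"^\[([^\]]+)\]")
ROLLBACK_MARKER = "convert job to rollback"


def summarize(lines):
    """Collect the highest ErrCount seen and the first rollback timestamp."""
    counts = []
    rollback_at = None
    for line in lines:
        m = ERR_COUNT_RE.search(line)
        if m:
            counts.append(int(m.group(1)))
        if rollback_at is None and ROLLBACK_MARKER in line:
            ts = TIMESTAMP_RE.match(line)
            rollback_at = ts.group(1) if ts else None
    return {
        "max_err_count": max(counts, default=None),
        "rollback_at": rollback_at,
    }
```

Running this over the attached tidb logs, for example summarize(open(path)) with path pointing at an extracted log file, would give a quick view of how many retries the job accumulated before it was rolled back.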

tidb-0.tar.gz
tidb-1.tar.gz

Lily2025 added the type/bug label Jan 15, 2024

Lily2025 (Author) commented

/severity major
/assign ywqzzy

ywqzzy (Contributor) commented Jan 15, 2024

The SubtaskExecutor error handling missed a corner case. We need to add fallback logic to handle this error.
