disttask scheduler takes a step as successful while it actually failed #49950

Closed
D3Hunter opened this issue Jan 2, 2024 · 2 comments · Fixed by #49971
Labels
affects-7.5, component/ddl, severity/major, type/bug

Comments

D3Hunter (Contributor) commented Jan 2, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

We check whether a task step failed or succeeded using two SQL statements. If the task fails between those two checks, we treat the step as successful and switch the task to the next step; if it is already the last step, the whole task is incorrectly marked as successful.

func (s *BaseScheduler) onRunning() error {
    logutil.Logger(s.logCtx).Debug("on running state",
        zap.Stringer("state", s.Task.State),
        zap.Int64("step", int64(s.Task.Step)))
    // query 1: collect errors reported by subtasks of the current step.
    subTaskErrs, err := s.taskMgr.CollectSubTaskError(s.ctx, s.Task.ID)
    if err != nil {
        logutil.Logger(s.logCtx).Warn("collect subtask error failed", zap.Error(err))
        return err
    }
    if len(subTaskErrs) > 0 {
        logutil.Logger(s.logCtx).Warn("subtasks encounter errors")
        return s.onErrHandlingStage(subTaskErrs)
    }
    // check current step finishes.
    // query 2: count pending/running subtasks; a subtask that failed between
    // query 1 and query 2 is counted by neither, so the step looks successful.
    cnt, err := s.taskMgr.GetSubtaskInStatesCnt(s.ctx, s.Task.ID, proto.SubtaskStatePending, proto.SubtaskStateRunning)
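
To make the window concrete, here is a minimal, self-contained sketch (not TiDB code; the store, state names, and the re-check at the end are invented for illustration) of how two separate queries can misreport a failed step as successful, plus one illustrative way to close the window by re-checking errors after observing zero pending/running subtasks:

// racecheck.go simulates the subtask table with an in-memory store and shows
// the check-then-check window between "any errors?" and "any pending/running?".
package main

import (
    "fmt"
    "sync"
)

type state string

const (
    statePending state = "pending"
    stateRunning state = "running"
    stateFailed  state = "failed"
)

type subtask struct {
    id  int
    st  state
    err error
}

// store mimics the task table accessed via two separate SQL statements.
type store struct {
    mu       sync.Mutex
    subtasks []*subtask
}

// collectErrors stands in for "query 1" (CollectSubTaskError).
func (s *store) collectErrors() []error {
    s.mu.Lock()
    defer s.mu.Unlock()
    var errs []error
    for _, t := range s.subtasks {
        if t.err != nil {
            errs = append(errs, t.err)
        }
    }
    return errs
}

// countPendingOrRunning stands in for "query 2" (GetSubtaskInStatesCnt).
func (s *store) countPendingOrRunning() int {
    s.mu.Lock()
    defer s.mu.Unlock()
    cnt := 0
    for _, t := range s.subtasks {
        if t.st == statePending || t.st == stateRunning {
            cnt++
        }
    }
    return cnt
}

func main() {
    s := &store{subtasks: []*subtask{{id: 1, st: stateRunning}}}

    // Query 1: no errors yet, the subtask is still running.
    errs := s.collectErrors()

    // The subtask fails in between the two queries.
    s.mu.Lock()
    s.subtasks[0].st = stateFailed
    s.subtasks[0].err = fmt.Errorf("subtask 1 failed")
    s.mu.Unlock()

    // Query 2: zero pending/running subtasks, so with the buggy logic the
    // step looks finished and clean, and the scheduler would advance the
    // task or mark it succeeded.
    if len(errs) == 0 && s.countPendingOrRunning() == 0 {
        fmt.Println("buggy conclusion: step succeeded")
    }

    // One illustrative way to close the window: after observing zero
    // pending/running subtasks, collect errors again before declaring success.
    if s.countPendingOrRunning() == 0 {
        if errs := s.collectErrors(); len(errs) > 0 {
            fmt.Println("re-check catches it: step actually failed:", errs[0])
        }
    }
}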

This might cause some real-tikv test cases to fail, for example this run, where the mocked ToNormalMode is called more times than the test expects:
https://do.pingcap.net/jenkins/blue/rest/organizations/jenkins/pipelines/pingcap/pipelines/tidb/pipelines/ghpr_check2/runs/381/nodes/107/steps/598/log/?start=0

[2023/12/29 11:06:09.434 +00:00] [INFO] [client.go:639] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
    scheduler.go:387: Unexpected call to *mocklocal.MockTiKVModeSwitcher.ToNormalMode([context.Background.WithValue(type util.RequestSourceKeyType, val <not Stringer>).WithCancel.WithValue(type metric.ctxKeyType, val <not Stringer>)]) at pkg/disttask/importinto/scheduler.go:387 because: 
        expected call at tests/realtikvtest/importintotest/import_into_test.go:812 has already been called the max number of times
        expected call at tests/realtikvtest/importintotest/import_into_test.go:845 has already been called the max number of times
        expected call at tests/realtikvtest/importintotest/import_into_test.go:880 has already been called the max number of times

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

hawkingrei (Member) commented:

It is the root cause for this error.

Error Trace:	pkg/disttask/framework/planner/planner_test.go:60
        	Error:      	Received unexpected error:
        	            	no managed nodes
        	            	github.com/pingcap/tidb/pkg/disttask/framework/storage.(*TaskManager).getCPUCountOfManagedNodes
        	            		pkg/disttask/framework/storage/task_table.go:1324
        	            	github.com/pingcap/tidb/pkg/disttask/framework/storage.(*TaskManager).CreateTaskWithSession
        	            		pkg/disttask/framework/storage/task_table.go:225
        	            	github.com/pingcap/tidb/pkg/disttask/framework/planner.(*Planner).Run
        	            		pkg/disttask/framework/planner/planner.go:39
        	            	pkg/disttask/framework/planner/planner_test_test.TestPlanner
        	            		pkg/disttask/framework/planner/planner_test.go:59
        	            	testing.tRunner
        	            		GOROOT/src/testing/testing.go:1595
        	            	runtime.goexit
        	            		src/runtime/asm_amd64.s:1650
        	Test:       	TestPlanner

D3Hunter (Contributor, Author) commented Jan 2, 2024

> It is the root cause for this error. (full error trace quoted above)

The "no managed nodes" error is caused by submitting the task before waiting for the node to register. Both problems are fixed together, though.
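
For the test-side part ("wait for the node to register before submitting"), a hedged sketch of the polling direction; the nodeCounter interface and helper name are hypothetical, and the actual change lives in #49971:

// waitnodes.go: poll until at least one managed node is registered before
// creating the task, so CreateTaskWithSession does not hit "no managed nodes".
package waitnodes

import (
    "context"
    "fmt"
    "time"
)

// nodeCounter is a hypothetical abstraction over whatever the task manager
// exposes for listing managed nodes.
type nodeCounter interface {
    ManagedNodeCount(ctx context.Context) (int, error)
}

// waitForManagedNodes blocks until at least one node is registered or the
// context is cancelled.
func waitForManagedNodes(ctx context.Context, c nodeCounter) error {
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()
    for {
        cnt, err := c.ManagedNodeCount(ctx)
        if err != nil {
            return err
        }
        if cnt > 0 {
            return nil
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("no managed nodes registered: %w", ctx.Err())
        case <-ticker.C:
        }
    }
}

A test could call waitForManagedNodes(ctx, mgr) after starting the node and before running the planner or submitting the import task.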
