
analyze worker didn't exit successfully when panic during analyze table #51993

Closed
seiya-annie opened this issue Mar 21, 2024 · 4 comments · Fixed by #52107
@seiya-annie

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

import a table with 500+ partitions
execute "analyze table xxx"

2. What did you expect to see? (Required)

analyze table can be executed successfully

3. What did you see instead (Required)

The analyze jobs get stuck:

2024-03-21 16:32:36 (UTC+08:00)TiDB tc-tidb-0.tc-tidb-peer.sql-mvindex-40t-testing-tps-7500536-1-617.svc:4000[servermemorylimit.go:107] ["global memory controller failed to kill the top-consumer in 23648s"] [conn=2376119218] ["sql digest"=2a9574a60b4ac9685e9997742a93a6240d7219402333a804218f45a0de5be2ee] ["sql text"="analyze table db0.table0"] ["sql memory usage"=0]
2024-03-21 09:58:27 (UTC+08:00)TiDB tc-tidb-0.tc-tidb-peer.sql-mvindex-40t-testing-tps-7500536-1-617.svc:4000[analyze_col_v2.go:715] ["analyze worker panicked"] [recover="[executor:8176]Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=2376119218]"] [stack="github.com/pingcap/tidb/pkg/executor.(*AnalyzeColumnsExecV2).subBuildWorker.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/analyze_col_v2.go:715\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:914\ngithub.com/pingcap/tidb/pkg/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/util/memory/tracker.go:459\ngithub.com/pingcap/tidb/pkg/statistics.BuildHistAndTopN.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/statistics/builder.go:250\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:914\ngithub.com/pingcap/tidb/pkg/util/memory.(*Tracker).Consume\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/util/memory/tracker.go:459\ngithub.com/pingcap/tidb/pkg/statistics.buildHist.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/statistics/builder.go:167\ngithub.com/pingcap/tidb/pkg/statistics.buildHist\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/statistics/builder.go:206\ngithub.com/pingcap/tidb/pkg/statistics.BuildHistAndTopN\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/statistics/builder.go:444\ngithub.com/pingcap/tidb/pkg/executor.(*AnalyzeColumnsExecV2).subBuildWorker\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/analyze_col_v2.go:854\ngithub.com/pingcap/tidb/pkg/executor.(*AnalyzeColumnsExecV2).buildSamplingStats.func6\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/analyze_col_v2.go:402\ngithub.com/pingcap/tidb/pkg/executor.(*notifyErrorWaitGroupWrapper).Run.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/analyze_utils.go:133"]

4. What is your TiDB version? (Required)

master

seiya-annie added the type/bug (The issue is confirmed as a bug.) label on Mar 21, 2024
@seiya-annie
Author

(screenshots attached: 20240321-180923, 20240321-180934)

@seiya-annie
Author

(attachment: oom_record.tar.gz)

@Rustin170506
Member

The analyze statement failed to exit normally after being killed by the global memory controller and has been stuck for more than 10 hours. After investigating, I believe the root cause is improper use of a channel.

The symptoms and the analysis, step by step:

  1. No progress is visible from analyze jobs or show analyze status: many partitions show processed_rows of 0, and the query no longer appears in show processlist.
  2. From the client run by QA, the analyze statement has not returned and is still waiting for its result.
  3. From the log, we can see:
[2024/03/22 16:00:05.895 +08:00] [WARN] [servermemorylimit.go:107] ["global memory controller failed to kill the top-consumer in 108098s"] [conn=2376119218] ["sql digest"=2a9574a60b4ac9685e9997742a93a6240d7219402333a804218f45a0de5be2ee] ["sql text"="analyze table db0.table0"] ["sql memory usage"=0]
  4. Looking further, the logs do contain an entry showing the analyze failing and the worker exiting, and after that everything hangs:
[2024/03/21 09:58:27.333 +08:00] [ERROR] [analyze.go:488] ["analyze failed"] [conn=2376119218] [session_alias=] [error="[executor:8176]Your query has been cancelled due to exceeding the allowed memory limit for the tidb-server instance and this query is currently using the most memory. Please try narrowing your query scope or increase the tidb_server_memory_limit and try again.[conn=2376119218]"]
  5. Checking the goroutine dump, one goroutine has been blocked on a channel send for a very long time:
goroutine 25362811 [chan send, 1751 minutes]:
github.com/pingcap/tidb/pkg/executor.(*AnalyzeExec).analyzeWorker.func1()
	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/pkg/executor/analyze.go:522 +0x39a
panic({0x571a1e0?, 0xc00fca0090?})
  6. Checking the code, the blocked send is in analyzeWorker's panic-recovery function:
func (e *AnalyzeExec) analyzeWorker(taskCh <-chan *analyzeTask, resultsCh chan<- *statistics.AnalyzeResults) {
	var task *analyzeTask
	defer func() {
		if r := recover(); r != nil {
			logutil.BgLogger().Error("analyze worker panicked", zap.Any("recover", r), zap.Stack("stack"))
			metrics.PanicCounter.WithLabelValues(metrics.LabelAnalyze).Inc()
			resultsCh <- &statistics.AnalyzeResults{
				Err: getAnalyzePanicErr(r),
				Job: task.job,
			}
		}
	}()
  7. Reading the code further, the only consumer of this channel is handleResultsError.
  8. handleResultsError processes results concurrently through handleResultsErrorWithConcurrency, with concurrency controlled by the tidb_analyze_partition_concurrency parameter, and a for loop keeps receiving results from resultsCh.
  9. But in handleResultsErrorWithConcurrency, if a kill signal is received while handling an error, it returns immediately:
	var err error
	for panicCnt < statsConcurrency {
		if err := e.Ctx().GetSessionVars().SQLKiller.HandleSignal(); err != nil {
			close(saveResultsCh)
			return err // the only consumer of resultsCh exits here without draining it
		}
		results, ok := <-resultsCh
		if !ok {
			break
		}
  10. So the current problem is: after the kill signal is received, the only consumer exits, and because resultsCh is a channel with a buffer of only 1, the worker can no longer send its result from the panic-recovery function in step 6, so the worker cannot exit after the panic.
  11. Ultimately the wait group never sees all goroutines exit, so the entire analyze statement cannot terminate; it stays stuck on this goroutine forever. (A minimal sketch of this channel pattern follows below.)

@aytrack
Contributor

aytrack commented Mar 26, 2024

I encountered another panic, invalid memory address or nil pointer dereference, after the global memory controller killed the analyze job; see issue #52071.
