Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After executed the system call, exitsyscall hung is happened, high cpu #40114

Open
yanqingliu opened this issue Jul 8, 2020 · 19 comments
Open

After executed the system call, exitsyscall hung is happened, high cpu #40114

yanqingliu opened this issue Jul 8, 2020 · 19 comments

Comments

@yanqingliu
Copy link

@yanqingliu yanqingliu commented Jul 8, 2020

background:

  1. Surface phenomenon: process hung, CPU high

  2. Problematic Code: multiple system calls,os.Chmod

  3. go version: 1.11.6

  4. bt: os.Chmod--->syscall.Syscall---->runtime.exitsyscall---->runtime.casgstatus

See this below code: !atomic.Cas(&gp.atomicstatus, oldval, newval) exec result has always been true,cannot exit for loop, reason: gp.atomicstatus not equal to any fo these states:_Gsyscall, _Grunning, _Gwaiting, Why ??
the below code in go source code, the file is proc.go

for i := 0; !atomic.Cas(&gp.atomicstatus, oldval, newval); i++ {
......
......
}

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 8, 2020

Thank you for raising this issue. Can you please provide some way for someone who does not have access to your codebase or environment to reproduce the problem. Ideally if you can provide a short, self contained, program that demonstrates the issue that will lead to the fastest diagnosis. Thank you

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 8, 2020

Also, you said you are using go 1.11. This release is several years old and no longer supported. Please upgrade to Go 1.15 and confirm is the issue still exists.

@yanqingliu
Copy link
Author

@yanqingliu yanqingliu commented Jul 8, 2020

@davecheney thankyou for your anwser,but do you understand the problem i described? i'm not sure i make this question clear?
we plan to upgrade to 1.15 later, but we want to make clear this question.

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 8, 2020

@yanqingliu i'm sorry I do not understand the problem as you have described it.

@yanqingliu
Copy link
Author

@yanqingliu yanqingliu commented Jul 8, 2020

@davecheney ok,now i repeat describe this problem again :
When we run some project ,the process is suspended and the CPU is high.

The business scenario: System commands are frequently invoked,eg: os.Chmod

go version: 1.11.6

The stack is: os.Chmod--->syscall.Syscall---->runtime.exitsyscall---->runtime.casgstatus

Please look at the below code of for loop:
for i := 0; !atomic.Cas(&gp.atomicstatus, oldval, newval); i++ {
......
......
}
!atomic.Cas(&gp.atomicstatus, oldval, newval) exec result has always return true,cannot exit for this loop, the reason: gp.atomicstatus not equal to any fo these states:_Gsyscall, _Grunning, _Gwaiting, Why ??

the above code in go source code, the file is proc.go

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 8, 2020

Thank you for reiterating.

Can we step back a little bit. You said your program frequently makes syscalls. Can you show the code you are using? You suggested that there is a problem when the code appears to enter an infinite loop. How did you determine this? Can you show a complete stacktrace that you may have used to determine that the code you highlighted was the root cause.

Thank you for your patience. The more information you can provide to explain you’re issue the more beneficial it is to resolving the issue.

@yanqingliu
Copy link
Author

@yanqingliu yanqingliu commented Jul 8, 2020

thankyou so much.

  1. code we are using:
    When we decompressed a tar package. During the decompression process, it will be called recursively os.Chmod and other system commands, such as os.Stat

  2. determine this problem ?
    use dlv,break

  3. stacktrace:
    (dlv) bt

0 0xxxxxx in runtime.casgstatus
at .go/src/runtime/proc.go:858
1 0xxxxxx in runtime.exitsyscall
at .go/src/runtime/proc.go:3030
2 0xxxxxx in syscall.Syscall
at .go/src/syscall/asm_linux_amd64.s:39
3 0xxxxxx in syscall.fchmodat
at .go/src/syscall/zsyscall_linux_amd64.go:33
4 0xxxxxx in syscall.Fchmodat
at .go/src/syscall/syscall_linux.go:120
5 0xxxxxx in syscall.Chmod
at .go/src/syscall/syscall_linux.go:27
6 0xxxxxx in os.chmod
at .go/src/os/file_posix.go:49
7 0xxxxxx in os.Chmod
... ...

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 8, 2020

Thank you for the explanation. You've stated that your program untars a file. This operation involves making several calls to os.Chmod. You've indicated that at some point in this process your program stops responding? Is that correct?

If so, can you please send SIGQUIT to the process, or hit cntl-\ if the program is running on the console. This will cause the complete stack trace to be recorded. Please copy and paste the entire stack trace into this issue. Thank you.

@yanqingliu
Copy link
Author

@yanqingliu yanqingliu commented Jul 8, 2020

Thank you for your patience.
The stack is: os.Chmod--->syscall.Syscall---->runtime.exitsyscall---->runtime.casgstatus

Please look at the below code of for loop:
for i := 0; !atomic.Cas(&gp.atomicstatus, oldval, newval); i++ {
......
......
}
!atomic.Cas(&gp.atomicstatus, oldval, newval) exec result has always return true,cannot exit for this loop, the reason: gp.atomicstatus not equal to any fo these states:_Gsyscall, _Grunning, _Gwaiting
Why ??

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 8, 2020

Thank you for your reply. You've stated

The stack is: os.Chmod--->syscall.Syscall---->runtime.exitsyscall---->runtime.casgstatus

Please look at the below code of for loop:
for i := 0; !atomic.Cas(&gp.atomicstatus, oldval, newval); i++ {
......
......
}
!atomic.Cas(&gp.atomicstatus, oldval, newval) exec result has always return true,cannot exit for this loop, the reason:

Several times now. I am not able to understand the issue you are describing. Can you please try to answer the questions I asked earlier.

Alternatively, For asking questions, see:

@daive123-cloud
Copy link

@daive123-cloud daive123-cloud commented Jul 9, 2020

@davecheney Look at the description above:
Frequent system calls lead to abnormal goroutine scheduling, Causes the program to hang

Detailed description:
a system call is finished, when exec exitsystem function cannot exit because of the state of the current goroutine is illegal. Unable to recover, It's been recovering, so cause the program hang up, CPU high.

@yanqingliu
Copy link
Author

@yanqingliu yanqingliu commented Jul 9, 2020

Thank you very much for your help.

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 9, 2020

@daive123-cloud thank you for clarifying. Can you please address the other questions I raised here: #40114 (comment)

@daive123-cloud
Copy link

@daive123-cloud daive123-cloud commented Jul 9, 2020

In response to your question:

  1. yes
  2. The problem scenario cannot be recurrence, Here's the stack that went wrong

0 0xxxxxx in runtime.casgstatus
at .go/src/runtime/proc.go:858
1 0xxxxxx in runtime.exitsyscall
at .go/src/runtime/proc.go:3030
2 0xxxxxx in syscall.Syscall
at .go/src/syscall/asm_linux_amd64.s:39
3 0xxxxxx in syscall.fchmodat
at .go/src/syscall/zsyscall_linux_amd64.go:33
4 0xxxxxx in syscall.Fchmodat
at .go/src/syscall/syscall_linux.go:120
5 0xxxxxx in syscall.Chmod
at .go/src/syscall/syscall_linux.go:27
6 0xxxxxx in os.chmod
at .go/src/os/file_posix.go:49
7 0xxxxxx in os.Chmod
... ...

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 9, 2020

@daive123-cloud thank you for providing the delve stack trace. To investigate further it would be useful to have the complete stack dump from the Go process

Can you please send SIGQUIT to the process, or hit cntl-\ if the program is running on the console. This will cause the complete stack trace to be recorded. Please copy and paste the entire stack trace into this issue. Thank you.

@bobotu
Copy link

@bobotu bobotu commented Sep 21, 2020

@davecheney I encountered a similar situation in Go 1.13, a goroutine seems stuck in casgstatus.

Here is some information you may need
gstack.txt contains the thread stacks dump by gstack.
goroutines.txt contains the goroutines stacks dump by dlv
TiDB v3.0.5 is the GitHub repo at that commit, you may need this when exploring stack trace.

What happened

The whole process is hung, kill -SIGQUIT has no effect, and the built-in HTTP server cannot respond even a naive 404 not found.

From dlv's dump, it seems like the runtime is in mark termination, and from gstack we can know this STW operation may be blocked by something

Thread 49 (Thread 0x7f5310dfa700 (LWP 196058)):
#0  runtime.futex () at runtime/sys_linux_amd64.s:536
#1  0x0000000000c29ae4 in runtime.futexsleep (addr=0x3449608 <runtime.sched+296>, val=0, ns=100000) at runtime/os_linux.go:50
#2  0x0000000000c07ede in runtime.notetsleep_internal (n=0x3449608 <runtime.sched+296>, ns=100000, ~r2=<optimized out>) at runtime/lock_futex.go:193
#3  0x0000000000c07fb1 in runtime.notetsleep (n=0x3449608 <runtime.sched+296>, ns=100000, ~r2=<optimized out>) at runtime/lock_futex.go:216
#4  0x0000000000c31ffc in runtime.forEachP (fn={void (runtime.p *)} 0x7f5310df9e70) at runtime/proc.go:1391
#5  0x0000000000c5788e in runtime.gcMarkDone.func1 () at runtime/mgc.go:1429
#6  0x0000000000c5aee6 in runtime.systemstack () at runtime/asm_amd64.s:370
#7  0x0000000000c318a0 in ?? () at runtime/proc.go:1080
#8  0x0000000000000000 in ?? ()

All threads are blocked on #1 0x0000000000c29a66 in runtime.futexsleep at runtime/os_linux.go:44 except this one

Thread 23 (Thread 0x7f52b35fe700 (LWP 196087)):
#0  runtime.casgstatus (gp=0xc001a5d200, oldval=1, newval=2) at runtime/proc.go:813
#1  0x0000000000c338ac in runtime.execute (gp=0xc001a5d200, inheritTime=false) at runtime/proc.go:2140
#2  0x0000000000c34a79 in runtime.schedule () at runtime/proc.go:2565
#3  0x0000000000c34efd in runtime.park_m (gp=0xc00049d380) at runtime/proc.go:2610
#4  0x0000000000c5ae5b in runtime.mcall () at runtime/asm_amd64.s:318
#5  0x0000000000000000 in ?? ()

This goroutine is waiting for states to change to _Grunnable? I guess the scheduler's states is corrupted, so this loop will never end.

@davecheney
Copy link
Contributor

@davecheney davecheney commented Sep 21, 2020

@bobotu Go 1.13 is no longer supported. Would you please

  1. Test with Go 1.15.2
  2. If the issue is present there, please open a new issue and complete the issue template. Please do not comment here. If there is a duplicate, we'll take care of it.
@bobotu
Copy link

@bobotu bobotu commented Sep 21, 2020

This problem appears in our customer's production environment and cannot be tested for Go 1.15 in the short term.

In addition, this problem has not been reproduced so far, and we have never encountered such a problem with other customers before (Go 1.13 and the same source code).

We will upgrade the Go toolchain to the latest version for the next release, but we just want to know the root cause of this problem, and has it been resolved?

@davecheney So should I open a new issue with the information I have posted above?

@davecheney
Copy link
Contributor

@davecheney davecheney commented Sep 21, 2020

@davecheney So should I open a new issue with the information I have posted above?

Yes, but please understand that Go 1.13 is unsupported and you will be asked to confirm the problem continues to occur in Go 1.15.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.