Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: stuck at stopTheWorldWithSema during k8s operations #25207

Closed
guangyouyu opened this issue May 1, 2018 · 10 comments

Comments

Projects
None yet
4 participants
@guangyouyu
Copy link

commented May 1, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8.3 linux/amd64

Does this issue reproduce with the latest release?

I do not know how to reproduce it. But it occurs triple on different hosts.

What operating system and processor architecture are you using (go env)?

Linux VM-2-19-ubuntu 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
KVM virtual machine , NUMA

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/ubuntu/go"
GOROOT="/opt/go"
GOTOOLDIR="/opt/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build765569602=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

Run Kubelet(Kubernetes agent) ,Create and Delete Container Repeatly

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

What did you expect to see?

The process runs normally.

What did you see instead?

The process is still there but does not respond anymore(and no more logging). When I use curl http://ip:port/debug/pprof/goroutine?debug=1 to check the stack, but it halts. There is nothing useful in stderr or dmesg.(I have core dump but it too big to upload)

i tried gdb it and got following result:

#0  runtime.futex () at /opt/go/src/runtime/sys_linux_amd64.s:426
#1  0x000000000042acd7 in runtime.futexsleep (addr=0x543e7e8 <runtime.sched+200>, val=0, ns=100000) at /opt/go/src/runtime/os_linux.go:62
#2  0x0000000000410930 in runtime.notetsleep_internal (n=0x543e7e8 <runtime.sched+200>, ns=100000, ~r2=false) at /opt/go/src/runtime/lock_futex.go:174
#3  0x00000000004109f6 in runtime.notetsleep (n=0x543e7e8 <runtime.sched+200>, ns=100000, ~r2=false) at /opt/go/src/runtime/lock_futex.go:194
#4  0x0000000000431704 in runtime.stopTheWorldWithSema () at /opt/go/src/runtime/proc.go:1029
#5  0x00000000004599d9 in runtime.systemstack () at /opt/go/src/runtime/asm_amd64.s:327
#6  0x00000000004319f0 in ?? () at /opt/go/src/runtime/proc.go:1067
#7  0x000000c4218e81e0 in ?? ()
#8  0x00007ff1d67fbb7f in ?? ()
#9  0x000000c4218e81a0 in ?? ()
#10 0x00007ff16dffaef8 in ?? ()
#11 0x0000000000431a54 in runtime.mstart () at /opt/go/src/runtime/proc.go:1149
#12 0x0000000002c13e43 in crosscall_amd64 () at /opt/go/src/runtime/cgo/gcc_amd64.S:35
#13 0x00007ff1d67fbc00 in ?? ()
#14 0x00007ff16dffb9c0 in ?? ()
#15 0x00007ff1d67fbb7f in ?? ()
#16 0x0000000000000000 in ?? ()

Thread 6 (LWP 4622):
---Type <return> to continue, or q <return> to quit---
#0  runtime.futex () at /opt/go/src/runtime/sys_linux_amd64.s:426
#1  0x000000000042ac62 in runtime.futexsleep (addr=0x545f640 <runtime.sig>, val=0, ns=-1) at /opt/go/src/runtime/os_linux.go:45
#2  0x00000000004108b6 in runtime.notetsleep_internal (n=0x545f640 <runtime.sig>, ns=-1, ~r2=false) at /opt/go/src/runtime/lock_futex.go:161
#3  0x0000000000410a8a in runtime.notetsleepg (n=0x545f640 <runtime.sig>, ns=-1, ~r2=224) at /opt/go/src/runtime/lock_futex.go:206
#4  0x0000000000443344 in os/signal.signal_recv (~r0=4410104) at /opt/go/src/runtime/sigqueue.go:116
#5  0x0000000002aaa802 in os/signal.loop () at /opt/go/src/os/signal/signal_unix.go:22
#6  0x000000000045c531 in runtime.goexit () at /opt/go/src/runtime/asm_amd64.s:2197
#7  0x72702d6e74622e7d in ?? ()
#8  0x6f683a7972616d69 in ?? ()
#9  0x6e74622e2c726576 in ?? ()
#10 0x7972616d6972702d in ?? ()
#11 0x000000c422728800 in ?? ()
#12 0x6d6972702d6e7462 in ?? ()
#13 0x697463613a797261 in ?? ()
#14 0x2d6e74622e2c6576 in ?? ()
#15 0x2e7972616d697270 in ?? ()
#16 0x2e2c657669746361 in ?? ()
#17 0x72642e206e65706f in ?? ()
#18 0x742d6e776f64706f in ?? ()
#19 0x74622e656c67676f in ?? ()
#20 0x72616d6972702d6e in ?? ()
#21 0x3a726f6c6f637b79 in ?? ()
#22 0x6361623b66666623 in ?? ()
#23 0x2d646e756f72676b in ?? ()
#24 0x33233a726f6c6f63 in ?? ()
#25 0x6f623b3162363732 in ?? ()
#26 0x6c6f632d72656472 in ?? ()
#27 0x65353832233a726f in ?? ()
#28 0x2d6e74622e7d6538 in ?? ()
#29 0x3a7972616d697270 in ?? ()
#30 0x2e2c657669746361 in ?? ()
#31 0x6d6972702d6e7462 in ?? ()
#32 0x697463612e797261 in ?? ()
#33 0x6e65706f2e2c6576 in ?? ()
#34 0x6f64706f72642e20 in ?? ()
#35 0x6c67676f742d6e77 in ?? ()
#36 0x72702d6e74622e65 in ?? ()
#37 0x61627b7972616d69 in ?? ()
#38 0x646e756f72676b63 in ?? ()
#39 0x6e3a6567616d692d in ?? ()
#40 0x6e74622e7d656e6f in ?? ()
#41 0x7972616d6972702d in ?? ()
#42 0x656c62617369642e in ?? ()
#43 0x702d6e74622e2c64 in ?? ()
#44 0x645b7972616d6972 in ?? ()
#45 0x5d64656c62617369 in ?? ()
#46 0x6573646c6569662c in ?? ()
#47 0x6c62617369645b74 in ?? ()
#48 0x6e74622e205d6465 in ?? ()
#49 0x7972616d6972702d in ?? ()
#50 0x72702d6e74622e2c in ?? ()
#51 0x69642e7972616d69 in ?? ()
#52 0x683a64656c626173 in ?? ()
#53 0x74622e2c7265766f in ?? ()
#54 0x72616d6972702d6e in ?? ()
#55 0x6c62617369645b79 in ?? ()
#56 0x65766f683a5d6465 in ?? ()
#57 0x73646c6569662c72 in ?? ()
#58 0x62617369645b7465 in ?? ()
#59 0x74622e205d64656c in ?? ()
#60 0x72616d6972702d6e in ?? ()
#61 0x2c7265766f683a79 in ?? ()
#62 0x6972702d6e74622e in ?? ()
#63 0x7369642e7972616d in ?? ()
#64 0x6f663a64656c6261 in ?? ()
#65 0x6e74622e2c737563 in ?? ()
#66 0x7972616d6972702d in ?? ()
#67 0x656c62617369645b in ?? ()
#68 0x7375636f663a5d64 in ?? ()
#69 0x6573646c6569662c in ?? ()
#70 0x6c62617369645b74 in ?? ()
#71 0x6e74622e205d6465 in ?? ()
#72 0x7972616d6972702d in ?? ()
#73 0x2e2c7375636f663a in ?? ()
#74 0x6d6972702d6e7462 in ?? ()
#75 0x617369642e797261 in ?? ()
#76 0x7463613a64656c62 in ?? ()
#77 0x6e74622e2c657669 in ?? ()
#78 0x7972616d6972702d in ?? ()
#79 0x656c62617369645b in ?? ()
#80 0x76697463613a5d64 in ?? ()
#81 0x73646c6569662c65 in ?? ()
#82 0x62617369645b7465 in ?? ()
#83 0x0000000000412d4f in runtime.writebarrierptr_prewrite1 (dst=0xc4209b9ac0, src=4271597) at /opt/go/src/runtime/mbarrier.go:189
#84 0x000000c420e7d068 in ?? ()
#85 0x000000c4209b9ac0 in ?? ()
#86 0x0000000000412ded in runtime.writebarrierptr (dst=0x5, src=3) at /opt/go/src/runtime/mbarrier.go:211
#87 0x0000000000412d4f in runtime.writebarrierptr_prewrite1 (dst=0xc4218c9650, src=48) at /opt/go/src/runtime/mbarrier.go:189
#88 0x0000000000411e24 in runtime.mallocgc (size=48, typ=0x3564540, needzero=true, ~r3=0xc420d81000) at /opt/go/src/runtime/malloc.go:723
#89 0x0000000000412418 in runtime.newobject (typ=0x3564540, ~r1=0xc4209b9f60) at /opt/go/src/runtime/malloc.go:820
#90 0x000000000040a90c in runtime.makemap (t=0x34a04e0, hint=0, h=0x0, bucket=0x0, ~r4=0x300000002) at /opt/go/src/runtime/hashmap.go:286
#91 0x00000000006b1bba in net/http.Header.clone (h=0xc4218c94a0, ~r0=0x8) at /opt/go/src/net/http/header.go:63
#92 0x000000000067bdd4 in net/http.(*Client).makeHeadersCopier (c#4=0xc4205ebc50, ireq=0xc420482700,
    ~r1={void (struct net/http.Request *)} 0xc4209b9df8) at /opt/go/src/net/http/client.go:622
#93 0x000000000067ad91 in net/http.(*Client).Do (c=0xc4205ebc50, req=0xc420482700, ~r1=0x0, ~r2=...) at /opt/go/src/net/http/client.go:500
#94 0x0000000001c0dbdf in k8s.io/kubernetes/vendor/github.com/docker/engine-api/client/transport/cancellable.Do.func1 (client=..., req=0xc420482700,
    result=0xc420d38660)
    at /Users/yu/work/docker/k8s-dev/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/docker/engine-api/client/transport/cancellable/cancellable.go:49
#95 0x000000000045c531 in runtime.goexit () at /opt/go/src/runtime/asm_amd64.s:2197
#96 0x00007ff1df4a8a38 in ?? ()
#97 0x000000c4205a4120 in ?? ()
#98 0x000000c420482700 in ?? ()
#99 0x000000c420d38660 in ?? ()
#100 0x0000000000000000 in ?? ()```



@agnivade

This comment has been minimized.

Copy link
Member

commented May 2, 2018

Please mention the complete repro steps, including how to download and install the package and the commands used to create, delete containers.

Also, can you try with latest Go 1.10.2 ?

Thanks.

@agnivade agnivade added this to the Go1.11 milestone May 2, 2018

@agnivade agnivade changed the title gc stuck on stopTheWorldWithSema with strange backtrace runtime: stuck at stopTheWorldWithSema during k8s operations May 2, 2018

@guangyouyu

This comment has been minimized.

Copy link
Author

commented May 2, 2018

@agnivade
Here is my script, create, delete Pods repeatly(30000 Pods each time) in a k8s cluster with 800 *(32U64G) Nodes.
I can't change go version this cluster, but I will create a 100 Nodes(1U2G) cluster for reproducing the error, and then try with latest Go 1.10.2.

I found the backtrace very strange, maybe any other thread broken the stack?
could you give me any suggestions or inspiration?

#77 0x6e74622e2c657669 in ?? ()
#78 0x7972616d6972702d in ?? ()
#79 0x656c62617369645b in ?? ()
#80 0x76697463613a5d64 in ?? ()
#81 0x73646c6569662c65 in ?? ()
#82 0x62617369645b7465 in ?? ()
#83 0x0000000000412d4f in runtime.writebarrierptr_prewrite1 (dst=0xc4209b9ac0, src=4271597) at /opt/go/src/runtime/mbarrier.go:189
#84 0x000000c420e7d068 in ?? ()
#85 0x000000c4209b9ac0 in ?? ()
#86 0x0000000000412ded in runtime.writebarrierptr (dst=0x5, src=3) at /opt/go/src/runtime/mbarrier.go:211
#87 0x0000000000412d4f in runtime.writebarrierptr_prewrite1 (dst=0xc4218c9650, src=48) at /opt/go/src/runtime/mbarrier.go:189

Thanks for your help!

@agnivade

This comment has been minimized.

Copy link
Member

commented May 2, 2018

Ok, so you are using a python client to interact with a kubernetes cluster ?

Please correct me if I am wrong, but this seems more like a kubernetes bug rather than a Go bug. You may get more help filing an issue on the kubernetes repo.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented May 2, 2018

The stack trace in #25207 (comment) is not a meaningful trace. Frames 77 to 82 are clearly corrupt, so we can trust anything that it reports. We can't draw any conclusions from the writebarrierptr report; there is no reason to think that function was called with those arguments.

@guangyouyu

This comment has been minimized.

Copy link
Author

commented May 3, 2018

@agnivade
yes,I am using a python client to interact with a kubernetes cluster.
Could you tell me why are you think this seems more like a kubernetes bug?
Do you tend to think of the stack memory modified by k8s?

@agnivade

This comment has been minimized.

Copy link
Member

commented May 3, 2018

Could you tell me why are you think this seems more like a kubernetes bug?

I don't have much experience with kubernetes, but since you are using a python client, I believe your stack trace is coming from the kube-agent binary ? Therefore, the issue does not lie in any code you have written.

Hence, following that logic, it is an issue with the software and not the programming language it was written in. (Something similar to if you see an issue with nginx, it is likely an issue with nginx and not with C compiler). That does not say it cannot be an issue with Go. It can, but this needs detailed debugging by the kubernetes team or by people having intimate knowledge about it. That's why I suggested to file an issue on the kubernetes repo.

@guangyouyu

This comment has been minimized.

Copy link
Author

commented May 3, 2018

@agnivade
OK, thanks for your suggestion, and I will open an issue on the kubernetes repo too.

Now I open issue on Go repo rather than kubernetes, because I thank kubernetes should not modify the process stack, and the corrupt stack does not cause the program to crash.

@guangyouyu

This comment has been minimized.

Copy link
Author

commented May 3, 2018

@ianlancetaylor
Thanks for your analysis.
I found the address in frames are illegal and not mapped, may frames be modify by any other(k8s bug) or go runtime bug?
Could you give me any suggestion for further debug?

@agnivade

This comment has been minimized.

Copy link
Member

commented Jun 8, 2018

HI @guangyouyu - Did you get any updates from the kubernetes team on this ? Have you tried with 1.10 ?

@gopherbot

This comment has been minimized.

Copy link

commented Jul 18, 2018

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this Jul 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.