Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: 1.5rc1 testing: stuck for long time if there are many TCP connections #12051

Closed
yangzhe1991 opened this issue Aug 6, 2015 · 17 comments

Comments

Projects
None yet
6 participants
@yangzhe1991
Copy link

commented Aug 6, 2015

Hi,
I use 1.5rc to test my program, which will hold millions of tcp connections and push messages to client.

I test the program by make a fake client(also in go) and build 1 million tcp to the server. Server bind 100 ports and client connect 10000 tcp per port so totally 1,000,000. Server and fake client are on same server and both have runtime.GOMAXPROCS(runtime.NumCPU()).

If I build them on go 1.4.2(cross-platform compile from my MacBook to linux/amd64 server), when I push message to all tcp connections almost concurrently, the CPU is high but it works as expected that all client tcp can receive messages within less than 20 seconds.

However if I build them with same code on go 1.5rc1(also cross-platform compile), when I push message, both server and client are stuck and system load is high for several minutes. Two process' status will sometimes change to 'D' which means "uninterruptible sleep" although CPU may be 0.0% at all. If I kill client, the server will become normal.

Why 1.4.2 works well but 1.5rc1 not? Does 1.5 has TCP performance issue or goroutine scheduling issue?

Thanks

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Aug 6, 2015

Can you please try one or more of the following and report your results on this issue.

  • run your application with GODEBUG=gctrace=1 and capture the output, this will give an idea how much of your programs time is spent I garbage collection.
  • when the processes are in in D state, kill the programs with SIGQUIT and post the output
  • profile your application using debug/pprof and post the cpu svg data to this issue.

Thanks

Dave

On 6 Aug 2015, at 23:25, Phil Yang notifications@github.com wrote:

Hi,
I use 1.5rc to test my program, which will hold millions of tcp connections and push messages to client.

I test the program by make a fake client(also in go) and build 1 million tcp to the server. Server bind 100 ports and client connect 10000 tcp per port so totally 1,000,000. Server and fake client are on same server and both have runtime.GOMAXPROCS(runtime.NumCPU()).

If I build them on go 1.4.2(cross-platform compile from my MacBook to linux/amd64 server), when I push message to all tcp connections almost concurrently, the CPU is high but it works as expected that all client tcp can receive messages within less than 20 seconds.

However if I build them with same code on go 1.5rc1(also cross-platform compile), when I push message, both server and client are stuck and system load is high for several minutes. Two process' status will sometimes change to 'D' which means "uninterruptible sleep" although CPU may be 0.0% at all. If I kill client, the server will because normal.

Why 1.4.2 works well but 1.5rc1 not? Does 1.5 has TCP performance issue or goroutine scheduling issue?

Thanks


Reply to this email directly or view it on GitHub.

@ianlancetaylor ianlancetaylor added this to the Go1.5Maybe milestone Aug 6, 2015

@ianlancetaylor ianlancetaylor changed the title 1.5rc1 testing: stuck for long time if there are many TCP connections net: 1.5rc1 testing: stuck for long time if there are many TCP connections Aug 6, 2015

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 6, 2015

Is your code available for us to try to recreate this problem?

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 6, 2015

@davecheney I can do first and second now, there is the output:
The client build 1 million tcp one by one and block on Conn.Read. 22:55:28 is the time I push messages.

...
2015/08/06 22:55:20 [INFO] msgCount 0, hbCount 1699564
2015/08/06 22:55:21 [INFO] msgCount 0, hbCount 1700971
scvg7: 23 MB released
scvg7: inuse: 5236, idle: 425, sys: 5662, released: 418, consumed: 5244 (MB)
2015/08/06 22:55:22 [INFO] msgCount 0, hbCount 1702406
2015/08/06 22:55:23 [INFO] msgCount 0, hbCount 1703818
2015/08/06 22:55:24 [INFO] msgCount 0, hbCount 1705231
2015/08/06 22:55:25 [INFO] msgCount 0, hbCount 1706620
2015/08/06 22:55:26 [INFO] msgCount 0, hbCount 1708008
GC forced
2015/08/06 22:55:27 [INFO] msgCount 0, hbCount 1709443
2015/08/06 22:55:28 [INFO] msgCount 10766, hbCount 1710117
2015/08/06 22:55:29 [INFO] msgCount 26833, hbCount 1710170
2015/08/06 22:55:30 [INFO] msgCount 44748, hbCount 1710293
2015/08/06 22:55:31 [INFO] msgCount 78382, hbCount 1710385
2015/08/06 22:55:33 [INFO] msgCount 105793, hbCount 1710688
scvg8: inuse: 7467, idle: 0, sys: 7468, released: 0, consumed: 7468 (MB)

Then there is no output and process stuck. I kill -3 pid to it, but I can not make sure the time I kill is when the process is 'D'. It outputs(
I omit 1 million goroutines that is blocked on Conn.Read)

SIGQUIT: quit
PC=0x45d3a1 m=0

goroutine 0 [idle]:
runtime.futex(0xcb9c70, 0x0, 0x0, 0x0, 0x0, 0xcb9060, 0x0, 0x0, 0x40e514, 0xcb9c70, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:288 +0x21
runtime.futexsleep(0xcb9c70, 0x0, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os1_linux.go:39 +0x53
runtime.notesleep(0xcb9c70)
        /usr/local/go/src/runtime/lock_futex.go:142 +0xa4
runtime.stopm()
        /usr/local/go/src/runtime/proc1.go:1128 +0x112
runtime.exitsyscall0(0xc8a7da9980)
        /usr/local/go/src/runtime/proc1.go:2122 +0x16c
runtime.mcall(0x7fffa0fb0ec0)
        /usr/local/go/src/runtime/asm_amd64.s:204 +0x5b

goroutine 37 [syscall, 23 minutes]:
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
        /usr/local/go/src/os/signal/signal_unix.go:28 +0x37


rax    0xca
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0xcb9c70
rsi    0x0
rbp    0x1
rsp    0x7fffa0fb0dd0
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x1276b1
r13    0xecd556c9b
r14    0xecd556cd7
r15    0xcb8da0
rip    0x45d3a1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0
@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 6, 2015

On server:

the last log before push is

scvg7: 440 MB released
scvg7: inuse: 5649, idle: 1625, sys: 7274, released: 440, consumed: 6834 (MB)

Then I push messages and stuck, I kill -3 pid. I still can not make sure the time I kill is when the process is 'D'.

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0xf1 pc=0x45e0eb]

runtime stack:
runtime.throw(0xb2e3c0, 0x2a)
        /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0x7fff8f22e438 sp=0x7fff8f22e420
runtime.sigpanic()
        /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a fp=0x7fff8f22e488 sp=0x7fff8f22e438
runtime.munmap(0x7fb30e2ba000, 0x338f90, 0x41a90f, 0x7fb30e2ba000, 0x338f90, 0xd90560, 0x7fb30e2ba000, 0x671f2, 0x73548, 0x41a12c, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:266 +0x1b fp=0x7fff8f22e490 sp=0x7fff8f22e488
runtime.sysFree(0x7fb30e2ba000, 0x338f90, 0xd90560)
        /usr/local/go/src/runtime/mem_linux.go:103 +0x34 fp=0x7fff8f22e4a8 sp=0x7fff8f22e490
runtime.gcCopySpans()
        /usr/local/go/src/runtime/mgc.go:1599 +0xdf fp=0x7fff8f22e4e0 sp=0x7fff8f22e4a8
runtime.gcMark(0x1f5d183bed958)
        /usr/local/go/src/runtime/mgc.go:1414 +0x5c fp=0x7fff8f22e548 sp=0x7fff8f22e4e0
runtime.gc.func3()
        /usr/local/go/src/runtime/mgc.go:1088 +0x23 fp=0x7fff8f22e558 sp=0x7fff8f22e548
runtime.systemstack(0xd67180)
        /usr/local/go/src/runtime/asm_amd64.s:262 +0x79 fp=0x7fff8f22e560 sp=0x7fff8f22e558
runtime.mstart()
        /usr/local/go/src/runtime/proc1.go:674 fp=0x7fff8f22e568 sp=0x7fff8f22e560

goroutine 213 [garbage collection]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:216 fp=0xc820390c98 sp=0xc820390c90
runtime.gc(0x0)
        /usr/local/go/src/runtime/mgc.go:1096 +0x3d0 fp=0xc820390f90 sp=0xc820390c98
runtime.backgroundgc()
        /usr/local/go/src/runtime/mgc.go:896 +0x3d fp=0xc820390fc0 sp=0xc820390f90
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc820390fc8 sp=0xc820390fc0
created by runtime.startGC
        /usr/local/go/src/runtime/mgc.go:869 +0x186

goroutine 2 [force gc (idle), 4 minutes]:
runtime.gopark(0xb77ff0, 0xd668a0, 0xa8bc80, 0xf, 0x14, 0x1)
        /usr/local/go/src/runtime/proc.go:185 +0x163 fp=0xc820038758 sp=0xc820038730
runtime.goparkunlock(0xd668a0, 0xa8bc80, 0xf, 0xc820000114, 0x1)
        /usr/local/go/src/runtime/proc.go:191 +0x54 fp=0xc820038790 sp=0xc820038758
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:152 +0xb8 fp=0xc8200387c0 sp=0xc820038790
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc8200387c8 sp=0xc8200387c0
created by runtime.init.4
        /usr/local/go/src/runtime/proc.go:141 +0x2b

goroutine 3 [GC sweep wait, 2 minutes]:
runtime.gopark(0xb77ff0, 0xd66aa0, 0xa83b90, 0xd, 0x450014, 0x1)
        /usr/local/go/src/runtime/proc.go:185 +0x163 fp=0xc820038f48 sp=0xc820038f20
runtime.goparkunlock(0xd66aa0, 0xa83b90, 0xd, 0x14, 0x1)
        /usr/local/go/src/runtime/proc.go:191 +0x54 fp=0xc820038f80 sp=0xc820038f48
runtime.bgsweep(0xc820060000)
        /usr/local/go/src/runtime/mgcsweep.go:67 +0x14d fp=0xc820038fb8 sp=0xc820038f80
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc820038fc0 sp=0xc820038fb8
created by runtime.gcenable
        /usr/local/go/src/runtime/mgc.go:205 +0x53

goroutine 4 [finalizer wait, 22 minutes]:
runtime.gopark(0xb77ff0, 0xd8e340, 0xa8b830, 0xe, 0x14, 0x1)
        /usr/local/go/src/runtime/proc.go:185 +0x163 fp=0xc820039718 sp=0xc8200396f0
runtime.goparkunlock(0xd8e340, 0xa8b830, 0xe, 0x14, 0x1)
        /usr/local/go/src/runtime/proc.go:191 +0x54 fp=0xc820039750 sp=0xc820039718
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:154 +0xaa fp=0xc8200397c0 sp=0xc820039750
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1 fp=0xc8200397c8 sp=0xc8200397c0
created by runtime.createfing
        /usr/local/go/src/runtime/mfinal.go:135 +0x60


goroutine 0 [idle]:
runtime.startpanic_m()
        /usr/local/go/src/runtime/panic1.go:67 +0x141 fp=0xc820009ac0 sp=0xc820009a98
runtime.systemstack(0xb78068)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab fp=0xc820009ac8 sp=0xc820009ac0
runtime.startpanic()
        /usr/local/go/src/runtime/panic.go:505 +0x14 fp=0xc820009ad8 sp=0xc820009ac8
runtime.sighandler(0xc800000003, 0xc820009cf0, 0xc820009bc0, 0xd67140)
        /usr/local/go/src/runtime/signal_amd64x.go:152 +0x31d fp=0xc820009b70 sp=0xc820009ad8
runtime.sigtrampgo(0x3, 0xc820009cf0, 0xc820009bc0)
        /usr/local/go/src/runtime/signal_linux.go:94 +0x95 fp=0xc820009ba0 sp=0xc820009b70
runtime.sigtramp(0x0, 0x0, 0xc820002000, 0x0, 0x7fa0, 0xd42299, 0x7fff8f22e0f4, 0xc13333, 0x202, 0xd42299, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:234 +0x1b fp=0xc820009bc0 sp=0xc820009ba0
runtime.sigreturn(0x0, 0xc820002000, 0x0, 0x7fa0, 0xd42299, 0x7fff8f22e0f4, 0xc13333, 0x202, 0xd42299, 0xb72e01, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:238 fp=0xc820009bc8 sp=0xc820009bc0
@rsc

This comment has been minimized.

Copy link
Contributor

commented Aug 6, 2015

@yangzhe1991 If you make it get stuck many times, and you kill -3 it each time it is stuck, is that first stack always munmap called from sysFree called from gcCopySpans, or does it move around? Thanks.

@aclements

@rsc

This comment has been minimized.

Copy link
Contributor

commented Aug 6, 2015

Also, did you remove all your own goroutines from the stack trace? I see nothing but system goroutines. Maybe the problem is that your program is deadlocked. Even if you can't show us the stack traces, please look at what your goroutines are doing and make sure they're not just staring at each other. Thanks.

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

The client code is very simple, each goroutine dial tcp to server and block at Read like this:

goroutine 147 [runnable]:
net.runtime_pollWait(0x7faba45ac600, 0x72, 0xc8200101d0)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820061090, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820061090, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc820061030, 0xc820227560, 0x4, 0x5, 0x0, 0x7faba45ec050, 0xc8200101d0)
        /usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(*conn).Read(0xc82003a120, 0xc820227560, 0x4, 0x5, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:132 +0xe4
...(my own code)

or

goroutine 85 [IO wait, 11 minutes]:
net.runtime_pollWait(0x7faba45acfc0, 0x72, 0xc8200101d0)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc82016a760, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc82016a760, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc82016a700, 0xc8201d8e50, 0x4, 0x5, 0x0, 0x7faba45ec050, 0xc8200101d0)
        /usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(*conn).Read(0xc82015c050, 0xc8201d8e50, 0x4, 0x5, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:132 +0xe4
...(my own code)

The only difference is IO wait and runnable.

And I have another 2 goroutines that I didn't post, one is

goroutine 1 [sleep, 11 minutes]:
time.Sleep(0x34630b8a000)
        /usr/local/go/src/runtime/time.go:59 +0xf9
main.main()
main.go:42 +0x599

to sleep forever in the end of main()

and another is to output log per second

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

@rsc I think the goroutine "munmap called from sysFree called from gcCopySpans" is only if there is a fatal fatal error: unexpected signal during runtime execution. But it is not common. The common output is start by SIGQUIT: quit and there is no munmap.

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

And if I only build 500,000 tcp connections, it will work well. So I doubt it is a performance issue.

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2015

Is your code available for us to try to recreate this problem?

Phil, to echo Ian's request, the simplest path to a solution is if you can
share code and instructions to reproduce your results.

Thanks

Dave

On Fri, Aug 7, 2015 at 3:20 PM, Phil Yang notifications@github.com wrote:

And if I only build 500,000 tcp connections, it will work well. So I doubt
it is a performance issue.


Reply to this email directly or view it on GitHub
#12051 (comment).

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

I simplify the code on https://gist.github.com/yangzhe1991/c43f66aba140d651ffde
It has no encoding/decoding code so the CPU is lower than my previous code. The testing server can handle 1 million tcp connections but it can not handle 1.5 million.

Then I use go 1.4.2 to build them, the server can not handle 1.5 million, either :(

So I think 1.5's performance is a little lower than 1.4? And is it expected that it will stuck for many minutes?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2015

Thanks for the test case. Your code has a number of race conditions. The one on the global variable "list" in server.go looks particularly serious. One goroutine is appending values to list while a different goroutine is iterating over it. Your first step in analyzing this problem has to be to remove all these race conditions. You should see them if you build and run your programs with the -race option. See http://blog.golang.org/race-detector for background on the race detector.

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

Hi
The code is only for a small test and the list will be read only after there is no append, and actually this test code's list has bug on concurrence while my previously test code has no bug.

And I find another issue that in the client code(which I can not post,so sorry) when GC triggered, there are a small stuck for two or three seconds that it will not build any connection at all. Log:

2015/08/07 20:46:07 [INFO] msgCount 0, hbCount 80265
2015/08/07 20:46:08 [INFO] msgCount 0, hbCount 87883
2015/08/07 20:46:09 [INFO] msgCount 0, hbCount 95491
2015/08/07 20:46:10 [INFO] msgCount 0, hbCount 103607
2015/08/07 20:46:11 [INFO] msgCount 0, hbCount 110262
2015/08/07 20:46:12 [INFO] msgCount 0, hbCount 116382
2015/08/07 20:46:13 [INFO] msgCount 0, hbCount 116382
2015/08/07 20:46:14 [INFO] msgCount 0, hbCount 116382
gc 12 @17.525s 7%: 6.0+3624+0.020+0.21+575 ms clock, 90+3624+0+253/0.014/2306+8630 ms cpu, 873->912->676 MB, 896 MB goal, 16 P
2015/08/07 20:46:15 [INFO] msgCount 0, hbCount 118007
2015/08/07 20:46:16 [INFO] msgCount 0, hbCount 142900
2015/08/07 20:46:17 [INFO] msgCount 0, hbCount 151018

and

2015/08/07 20:46:33 [INFO] msgCount 0, hbCount 275793
2015/08/07 20:46:34 [INFO] msgCount 0, hbCount 281952
2015/08/07 20:46:35 [INFO] msgCount 0, hbCount 286955
2015/08/07 20:46:36 [INFO] msgCount 0, hbCount 286955
gc 13 @40.110s 7%: 18+3859+0.059+0.38+1395 ms clock, 302+3859+0+260/0.012/2620+22327 ms cpu, 2015->1289->749 MB, 1275 MB goal, 16 P
2015/08/07 20:46:39 [INFO] msgCount 0, hbCount 307066
2015/08/07 20:46:40 [INFO] msgCount 0, hbCount 314777
2015/08/07 20:46:41 [INFO] msgCount 0, hbCount 322841
2015/08/07 20:46:42 [INFO] msgCount 0, hbCount 330693
2015/08/07 20:46:43 [INFO] msgCount 0, hbCount 338397

and

2015/08/07 20:46:57 [INFO] msgCount 0, hbCount 446845
2015/08/07 20:46:58 [INFO] msgCount 0, hbCount 454695
2015/08/07 20:46:59 [INFO] msgCount 0, hbCount 461162
2015/08/07 20:47:00 [INFO] msgCount 0, hbCount 468078
2015/08/07 20:47:01 [INFO] msgCount 0, hbCount 470947
2015/08/07 20:47:02 [INFO] msgCount 0, hbCount 470947
2015/08/07 20:47:03 [INFO] msgCount 0, hbCount 470947
gc 14 @65.701s 6%: 22+4324+0.11+2015+08/07 20:47:05 [INFO] msgCount 0, hbCount 470947
1354 ms clock, 359+4324+0+679/0.022/2736+21668 ms cpu, 1373->1431->835 MB, 1408 MB goal, 16 P
2015/08/07 20:47:05 [INFO] msgCount 0, hbCount 490606
2015/08/07 20:47:06 [INFO] msgCount 0, hbCount 498044
2015/08/07 20:47:07 [INFO] msgCount 0, hbCount 505883
2015/08/07 20:47:08 [INFO] msgCount 0, hbCount 513485
2015/08/07 20:47:09 [INFO] msgCount 0, hbCount 521286
2015/08/07 20:47:10 [INFO] msgCount 0, hbCount 528959

and more

You can see the hbCount is not increased 2-3 seconds before gc log output. I have no idea about the meaning of gc logs' timing like 6.0+3624+0.020+0.21+575 ms clock, but I think now that the program can output INFO log, it is not STW period but the program is not working as expected.

And moreover, if I turn off client's GC, the pushing can work well just like 1.4.2, won't stuck any more

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

And when I kill -3 clientpid, it prints stacks but interrupts by a panic, then the process don't quit at all with about 100% CPU. And the log will output scvgs periodically. The last part of log is

found next stack barrier at 0xc85cdc7f50; expected [*0xc9d265bf60=0x4016ab]
fatal error: missed stack barrier
panic during panic

goroutine 0 [idle]:
runtime.startpanic_m()
        /usr/local/go/src/runtime/panic1.go:67 +0x141 fp=0xc820009870 sp=0xc820009848
runtime.systemstack(0xaed0a8)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab fp=0xc820009878 sp=0xc820009870
runtime.startpanic()
        /usr/local/go/src/runtime/panic.go:505 +0x14 fp=0xc820009888 sp=0xc820009878
runtime.throw(0xa6c460, 0x14)
        /usr/local/go/src/runtime/panic.go:526 +0x83 fp=0xc8200098a0 sp=0xc820009888
runtime.gentraceback(0x42c653, 0xc85cdc7ac8, 0x0, 0xc8257c0480, 0x0, 0x0, 0x64, 0x0, 0x0, 0x0, ...)
        /usr/local/go/src/runtime/traceback.go:259 +0x1053 fp=0xc8200099d0 sp=0xc8200098a0
runtime.traceback1(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc8257c0480, 0x0)
        /usr/local/go/src/runtime/traceback.go:550 +0xc8 fp=0xc820009a30 sp=0xc8200099d0
runtime.traceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc8257c0480)
        /usr/local/go/src/runtime/traceback.go:527 +0x48 fp=0xc820009a60 sp=0xc820009a30
runtime.tracebackothers(0xcb93c0)
        /usr/local/go/src/runtime/traceback.go:682 +0x2b1 fp=0xc820009ad8 sp=0xc820009a60
runtime.sighandler(0xc800000003, 0xc820009cf0, 0xc820009bc0, 0xcb93c0)
        /usr/local/go/src/runtime/signal_amd64x.go:177 +0x6e0 fp=0xc820009b70 sp=0xc820009ad8
runtime.sigtrampgo(0x3, 0xc820009cf0, 0xc820009bc0)
        /usr/local/go/src/runtime/signal_linux.go:94 +0x95 fp=0xc820009ba0 sp=0xc820009b70
runtime.sigtramp(0x0, 0x0, 0xc820002000, 0x0, 0x7fa0, 0x0, 0x0, 0x0, 0x286, 0x13d0cd, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:234 +0x1b fp=0xc820009bc0 sp=0xc820009ba0
runtime.sigreturn(0x0, 0xc820002000, 0x0, 0x7fa0, 0x0, 0x0, 0x0, 0x286, 0x13d0cd, 0xecd56a064, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:238 fp=0xc820009bc8 sp=0xc820009bc0
scvg2: 0 MB released
scvg2: inuse: 8362, idle: 0, sys: 8362, released: 0, consumed: 8362 (MB)
scvg3: inuse: 8785, idle: 0, sys: 8785, released: 0, consumed: 8785 (MB)
scvg4: inuse: 8785, idle: 0, sys: 8785, released: 0, consumed: 8785 (MB)
scvg5: 0 MB released
scvg5: inuse: 8785, idle: 0, sys: 8785, released: 0, consumed: 8785 (MB)

There should be about 1 million goroutines, but the log stop when it has printed only 29309 goroutines

@yangzhe1991

This comment has been minimized.

Copy link
Author

commented Aug 7, 2015

Another try, output is not same for every time. After GC forced output, the process stuck.

2015/08/07 21:23:22 [INFO] msgCount 434032, hbCount 1000000
2015/08/07 21:23:25 [INFO] msgCount 435337, hbCount 1000000
2015/08/07 21:23:26 [INFO] msgCount 436639, hbCount 1000000
GC forced
SIGQUIT: quit
PC=0x42c627 m=0

goroutine 49551 [running]:
runtime.gopark(0xaecff0, 0x7f2fae828678, 0x9ffd98, 0x7, 0x59f51b, 0x5)
        /usr/local/go/src/runtime/proc.go:186 +0x167 fp=0xc82e6e4b00 sp=0xc82e6e4af8
runtime.netpollblock(0x7f2fae828650, 0xffffff0000000072, 0x0)
        /usr/local/go/src/runtime/netpoll.go:338 +0x15e fp=0xc82e6e4b48 sp=0xc82e6e4b00
net.runtime_pollWait(0x7f2fae828650, 0x72, 0xc8200800a0)
        /usr/local/go/src/runtime/netpoll.go:157 +0x60 fp=0xc82e6e4b68 sp=0xc82e6e4b48
net.(*pollDesc).Wait(0xc830d50920, 0x72, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a fp=0xc82e6e4b88 sp=0xc82e6e4b68
net.(*pollDesc).WaitRead(0xc830d50920, 0x0, 0x0)
        /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36 fp=0xc82e6e4bb0 sp=0xc82e6e4b88
net.(*netFD).Read(0xc830d508c0, 0xc8363e5b60, 0x4, 0x5, 0x0, 0x7f2faf4c4028, 0xc8200800a0)
        /usr/local/go/src/net/fd_unix.go:232 +0x23a fp=0xc82e6e4c80 sp=0xc82e6e4bb0

the last part:

panic: fatal error: malloc during signal
panic during panic

goroutine 0 [idle]:
runtime.startpanic_m()
        /usr/local/go/src/runtime/panic1.go:67 +0x141
runtime.systemstack(0xaed0a8)
        /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.startpanic()
        /usr/local/go/src/runtime/panic.go:505 +0x14
runtime.throw(0xa66d80, 0x14)
        /usr/local/go/src/runtime/panic.go:526 +0x83
runtime.mallocgc(0x21, 0x0, 0x3, 0x0)
        /usr/local/go/src/runtime/malloc.go:516 +0x18e
runtime.rawstring(0x21, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/string.go:264 +0x70
runtime.rawstringtmp(0x0, 0x21, 0x0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/string.go:107 +0xb7
runtime.concatstrings(0x0, 0xc8200096d0, 0x2, 0x2, 0x0, 0x0)
        /usr/local/go/src/runtime/string.go:48 +0x1c2
runtime.concatstring2(0x0, 0xa18ae0, 0xf, 0xa663c0, 0x12, 0x0, 0x0)
        /usr/local/go/src/runtime/string.go:58 +0x67
runtime.(*errorString).Error(0xc8200100c0, 0x0, 0x0)
        <autogenerated>:2 +0xde
runtime.printany(0x925480, 0xc8200100c0)
        /usr/local/go/src/runtime/error.go:75 +0x1b1
runtime.gopanic(0x925480, 0xc8200100c0)
        /usr/local/go/src/runtime/panic.go:352 +0x7a
runtime.panicindex()
        /usr/local/go/src/runtime/panic.go:12 +0x49
runtime.gentraceback(0x42c623, 0xc8219e1ad8, 0x0, 0xc8205ec600, 0x0, 0x0, 0x64, 0x0, 0x0, 0x0, ...)
        /usr/local/go/src/runtime/traceback.go:255 +0x1206
runtime.traceback1(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc8205ec600, 0x0)
        /usr/local/go/src/runtime/traceback.go:550 +0xc8
runtime.traceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc8205ec600)
        /usr/local/go/src/runtime/traceback.go:527 +0x48
runtime.tracebackothers(0xc830b4be00)
        /usr/local/go/src/runtime/traceback.go:682 +0x2b1
runtime.sighandler(0xc800000003, 0xc820009cf0, 0xc820009bc0, 0xc830b4be00)
        /usr/local/go/src/runtime/signal_amd64x.go:177 +0x6e0
runtime.sigtrampgo(0x3, 0xc820009cf0, 0xc820009bc0)
        /usr/local/go/src/runtime/signal_linux.go:94 +0x95
runtime.sigtramp(0x0, 0x0, 0xc820002000, 0x0, 0x7fa0, 0xcb9b80, 0x0, 0x0, 0x246, 0xfcdb4, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:234 +0x1b
runtime.sigreturn(0x0, 0xc820002000, 0x0, 0x7fa0, 0xcb9b80, 0x0, 0x0, 0x246, 0xfcdb4, 0xecd56a867, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:238

goroutine 49551 [running]:
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2015

To make progress on this we really need working, non-racy, code that demonstrates this problem. Thanks.

@rsc rsc modified the milestones: Unplanned, Go1.5Maybe Aug 11, 2015

@gopherbot

This comment has been minimized.

Copy link

commented Jan 1, 2018

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this Jan 1, 2018

@golang golang locked and limited conversation to collaborators Jan 1, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.