net: goroutine permanently stuck in waitWrite() and waitRead() on tcp conn #27752

Open
absolute8511 opened this Issue Sep 19, 2018 · 13 comments


absolute8511 commented Sep 19, 2018

I am using Go 1.9.2 on Linux (2.6.32-696.23.1.el6.x86_64 #1 SMP Tue Mar 13 22:44:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux). While stopping a TCP server, we saw a goroutine stuck in waitWrite on a connection for a very long time, as shown below:

goroutine 156550879 [IO wait, 17408 minutes]:
internal/poll.runtime_pollWait(0x7f70428535a0, 0x77, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420e12018, 0x77, 0xffffffffffffff00, 0xdf3ce0, 0xdef460)
        /Users/han/.gvm/gos/go1.9.2/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitWrite(0xc420e12018, 0xc422c15b00, 0x498, 0x498)
        /Users/han/.gvm/gos/go1.9.2/src/internal/poll/fd_poll_runtime.go:94 +0x3d
internal/poll.(*FD).Write(0xc420e12000, 0xc422c12000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/internal/poll/fd_unix.go:227 +0x244
net.(*netFD).Write(0xc420e12000, 0xc422c12000, 0x4000, 0x4000, 0x2e0f, 0x2e0f, 0x53b543)
        /Users/han/.gvm/gos/go1.9.2/src/net/fd_unix.go:220 +0x52
net.(*conn).Write(0xc420250040, 0xc422c12000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/net/net.go:188 +0x6d
bufio.(*Writer).Flush(0xc4208322c0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/bufio/bufio.go:567 +0x7e
bufio.(*Writer).WriteByte(0xc4208322c0, 0x3a, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/bufio/bufio.go:622 +0x8c

lsof shows no remaining connections on this port. Is this a problem with how Go handles a write on a closed connection?

I saw a similar issue, #23604, but that one involves unixgram. In my case the listener is created with l, err = net.Listen("tcp", laddr).

Member

bcmills commented Sep 19, 2018

While stopping a TCP server

Stopping it how? More detail would be helpful.

Close on a net.Listener unblocks “[a]ny blocked Accept operations”,¹ but doesn't unblock operations on the individual connections.

¹https://tip.golang.org/pkg/net/#Listener
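
The distinction above can be illustrated with a minimal sketch. It assumes a server that records each accepted connection and closes them explicitly at shutdown; `connTracker`, `closeAll`, and `shutdownDemo` are hypothetical names and are not taken from the reporter's code. The 100 ms wait is an arbitrary value used only to observe that the handler is still blocked.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// connTracker is a hypothetical helper (not from the report) that remembers
// accepted connections so that shutdown can close them explicitly.
type connTracker struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

func newConnTracker() *connTracker {
	return &connTracker{conns: make(map[net.Conn]struct{})}
}

func (t *connTracker) add(c net.Conn) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.conns[c] = struct{}{}
}

// closeAll closes every tracked connection. Per the net.Conn documentation,
// Close unblocks any Read or Write blocked on that connection.
func (t *connTracker) closeAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for c := range t.conns {
		c.Close()
		delete(t.conns, c)
	}
}

// shutdownDemo returns, in order: whether the handler's Read was still blocked
// after the listener was closed, the error Read returned once the connection
// itself was closed, and any setup error.
func shutdownDemo() (bool, error, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, nil, err
	}
	tracker := newConnTracker()
	accepted := make(chan struct{})
	result := make(chan error, 1)
	go func() {
		c, err := l.Accept()
		if err != nil {
			result <- err
			return
		}
		tracker.add(c)
		close(accepted)
		_, err = c.Read(make([]byte, 1)) // blocks: the client never writes
		result <- err
	}()

	client, err := net.Dial("tcp", l.Addr().String())
	if err != nil {
		l.Close()
		return false, nil, err
	}
	defer client.Close()

	select {
	case <-accepted:
	case acceptErr := <-result: // Accept failed before any Read started
		l.Close()
		return false, nil, acceptErr
	}

	l.Close() // unblocks Accept only, not the handler's Read
	select {
	case readErr := <-result: // Read returned early, which is not expected
		return false, readErr, nil
	case <-time.After(100 * time.Millisecond):
	}

	tracker.closeAll() // closing the connection is what unblocks Read
	return true, <-result, nil
}

func main() {
	stillBlocked, readErr, err := shutdownDemo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("Read still blocked after Listener.Close:", stillBlocked)
	fmt.Println("Read error after Conn.Close:", readErr)
}
```

Whether forcibly closing connections is appropriate depends on the server; the reporter's design instead waits for each connection to finish its current request.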

Member

bcmills commented Sep 19, 2018

absolute8511 commented Sep 19, 2018

Yes, I know that closing the Listener does not unblock operations on connections. But this goroutine has been in waitWrite for a very long time (goroutine 156550879 [IO wait, 17408 minutes]:), which suggests something is wrong. The connection itself appears to be closed already: lsof -i TCP shows no related connections.

We count all accepted connections. When stopping the server, we close the Listener and then wait for each connection to finish its current request and exit before handling the next one, until the connection counter reaches 0.

This worked fine until recently, when shutdown blocked for too long. I captured a stack trace to see what was blocking and found the waitWrite problem.

@bcmills bcmills added this to the Go1.12 milestone Sep 19, 2018

Contributor

ianlancetaylor commented Sep 25, 2018

This code has been improved in more recent releases. Is it possible for you to try Go 1.11?

If it still fails we will need some instructions on how to reproduce the problem.

absolute8511 commented Sep 26, 2018

I can give it a try. However, it rarely happens, so I am not sure it can be reproduced.

absolute8511 commented Nov 2, 2018

I switched to go1.10.4, and after some days another hang occurred, as shown below:

SIGQUIT: quit

goroutine 1401441 [IO wait, 12921 minutes]:
internal/poll.runtime_pollWait(0x7f248bd21ca0, 0x72, 0xc437f91448)
        /Users/han/.gvm/gos/go1.10.4/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc423a84118, 0x72, 0xffffffffffffff00, 0xb02940, 0xde24b0)
        /Users/han/.gvm/gos/go1.10.4/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc423a84118, 0xc42445f400, 0x1e, 0x4a1)
        /Users/han/.gvm/gos/go1.10.4/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc423a84100, 0xc42445f400, 0x1e, 0x4a1, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.10.4/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc423a84100, 0xc42445f400, 0x1e, 0x4a1, 0x0, 0xe28a20, 0x0)
        /Users/han/.gvm/gos/go1.10.4/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc4308752f8, 0xc42445f400, 0x1e, 0x4a1, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.10.4/src/net/net.go:176 +0x6a

This connection had actually been closed already. The OS is Linux 2.6.32-696.13.2.el6.x86_64 #1 SMP Thu Oct 5 21:22:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

Contributor

ianlancetaylor commented Nov 3, 2018

Thanks, but I'm going to make the same comment:

This code has been improved in more recent releases. Is it possible for you to try Go 1.11?

If it still fails we will need some instructions on how to reproduce the problem.

absolute8511 commented Nov 4, 2018

We cannot use Go 1.11 at the moment.

I thought the Go 1.10 release was still supported until 1.12 is released. So will this be fixed if it happens in 1.10 but not in 1.11?

@absolute8511 absolute8511 changed the title from net: goroutine permanently stuck in waitWrite() on tcp conn to net: goroutine permanently stuck in waitWrite() and waitRead() on tcp conn Nov 4, 2018

Member

agnivade commented Nov 4, 2018

Hi @absolute8511 - please have a look at our backport policy https://github.com/golang/go/wiki/MinorReleases.

That said, if you can try out 1.11, we will know whether it has been fixed in 1.11, which will give us better visibility into the bug and help us decide whether to backport a fix to 1.10. Thank you.

In any case, you haven't provided any sample code along with instructions that reproduce the issue. Without that, it is very hard to debug this from our side. Could you help with that?

absolute8511 commented Nov 5, 2018

Basically, my code is a proxy: it accepts a connection from a client, receives data from the client, sends that data to the backend server, and waits for the server's reply.

	connCh := make(chan net.Conn, connChannelLength)
	for {
		conn, err := l.Accept()
		if err != nil {
			return err
		}
		connCh <- conn
	}

Another goroutine handles the accepted connections:

	for {
		select {
		case conn, ok := <-connCh:
			if ok {
				self.wg.Add(1)
				go func() {
					defer func() {
						self.wg.Done()
					}()
					// Loop reading data from the client connection.
					// Process the data, send it to the backend server, and send the server's reply back to the client.
					// On any error, close the connection, break out of the loop, and return.
				}()
			} else {
				return
			}
		}
	}
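
The handler loop above is described only in comments, so its actual read and write calls are unknown. One common way to bound how long such a loop can wait on a silent peer is a per-operation deadline. The following is a minimal sketch under that assumption, not the reporter's code: `readWithDeadline`, `isTimeout`, and `silentPeerTimesOut` are hypothetical names, and the 50 ms timeout is arbitrary. It would not explain the hang in this report; it only limits how long a single Read may block.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// readWithDeadline is a hypothetical helper: it arms a read deadline on conn
// and then performs a single Read, so the call cannot block longer than d.
func readWithDeadline(conn net.Conn, buf []byte, d time.Duration) (int, error) {
	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return 0, err
	}
	return conn.Read(buf)
}

// isTimeout reports whether err is a network error caused by a timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// silentPeerTimesOut connects a client that never writes and reports whether
// the server-side Read ended with a timeout error.
func silentPeerTimesOut() (bool, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer l.Close()

	client, err := net.Dial("tcp", l.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()

	server, err := l.Accept()
	if err != nil {
		return false, err
	}
	defer server.Close()

	_, readErr := readWithDeadline(server, make([]byte, 16), 50*time.Millisecond)
	return isTimeout(readErr), nil
}

func main() {
	timedOut, err := silentPeerTimesOut()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("Read returned a timeout error:", timedOut)
}
```

A write-side equivalent would use SetWriteDeadline (or SetDeadline for both directions) before each Write.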

Contributor

ianlancetaylor commented Nov 5, 2018

Unfortunately we are unlikely to be able to solve this without a complete reproduction case.

If you want to try to solve this yourself, try to recreate the problem while running the program under strace -f, and see when the connection was closed.

absolute8511 commented Nov 6, 2018

I saw no connections on the port when the hang happened. If strace -f shows the connection being closed but the goroutine still hangs in the Go runtime, what would you suggest I do next to identify the problem?

Contributor

ianlancetaylor commented Nov 6, 2018

Looking at lsof doesn't tell us the ordering of the calls to epoll and close, so I think that strace -f is still the first step. I don't have any suggestions for the next step without seeing the first step. I don't see how epoll could fail to report a closed descriptor.
