
net: goroutine permanently stuck in waitWrite() and waitRead() on tcp conn #27752

Closed
absolute8511 opened this issue Sep 19, 2018 · 25 comments

Comments

@absolute8511 commented Sep 19, 2018

I am using Go 1.9.2 and the OS is 2.6.32-696.23.1.el6.x86_64 #1 SMP Tue Mar 13 22:44:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux. While stopping a TCP server, we see a connection stuck in waitWrite for far too long, as below:

goroutine 156550879 [IO wait, 17408 minutes]:
internal/poll.runtime_pollWait(0x7f70428535a0, 0x77, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420e12018, 0x77, 0xffffffffffffff00, 0xdf3ce0, 0xdef460)
        /Users/han/.gvm/gos/go1.9.2/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitWrite(0xc420e12018, 0xc422c15b00, 0x498, 0x498)
        /Users/han/.gvm/gos/go1.9.2/src/internal/poll/fd_poll_runtime.go:94 +0x3d
internal/poll.(*FD).Write(0xc420e12000, 0xc422c12000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/internal/poll/fd_unix.go:227 +0x244
net.(*netFD).Write(0xc420e12000, 0xc422c12000, 0x4000, 0x4000, 0x2e0f, 0x2e0f, 0x53b543)
        /Users/han/.gvm/gos/go1.9.2/src/net/fd_unix.go:220 +0x52
net.(*conn).Write(0xc420250040, 0xc422c12000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/net/net.go:188 +0x6d
bufio.(*Writer).Flush(0xc4208322c0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/bufio/bufio.go:567 +0x7e
bufio.(*Writer).WriteByte(0xc4208322c0, 0x3a, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.9.2/src/bufio/bufio.go:622 +0x8c

We used lsof and saw no remaining connection on this port. Is this an issue with how Go handles writes on a closed connection?

I saw a similar issue, #23604, but that one involves unixgram sockets, whereas in my case I am using l, err = net.Listen("tcp", laddr).

@bcmills (Member) commented Sep 19, 2018

While stopping a TCP server

Stopping it how? More detail would be helpful.

Close on a net.Listener unblocks “[a]ny blocked Accept operations”,¹ but doesn't unblock operations on the individual connections.

¹https://tip.golang.org/pkg/net/#Listener

@absolute8511 (Author) commented Sep 19, 2018

Yeah, I know closing the Listener will not unblock operations on connections, but this connection showing waitWrite for goroutine 156550879 [IO wait, 17408 minutes] has been waiting so long that something seems wrong. And the connection is already closed, because there are no related connections in the output of lsof -i TCP.

We count all accepted connections, and when stopping the server by closing the Listener we wait for every connection to finish its current request and quit before the next request, until the connection counter reaches 0.

Everything worked fine until recently, when we saw a block lasting too long, so I got the stack trace to see what was blocking and found the waitWrite problem.

@bcmills added this to the Go1.12 milestone Sep 19, 2018
@ianlancetaylor (Contributor) commented Sep 25, 2018

This code has been improved in more recent releases. Is it possible for you to try Go 1.11?

If it still fails we will need some instructions on how to reproduce the problem.

@absolute8511 (Author) commented Sep 26, 2018

I can give it a try; however, it rarely happens, so I'm not sure it can be reproduced.

@absolute8511 (Author) commented Nov 2, 2018

I used Go 1.10.4, and after a few days another hang happened, as below:

SIGQUIT: quit

goroutine 1401441 [IO wait, 12921 minutes]:
internal/poll.runtime_pollWait(0x7f248bd21ca0, 0x72, 0xc437f91448)
        /Users/han/.gvm/gos/go1.10.4/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc423a84118, 0x72, 0xffffffffffffff00, 0xb02940, 0xde24b0)
        /Users/han/.gvm/gos/go1.10.4/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc423a84118, 0xc42445f400, 0x1e, 0x4a1)
        /Users/han/.gvm/gos/go1.10.4/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc423a84100, 0xc42445f400, 0x1e, 0x4a1, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.10.4/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc423a84100, 0xc42445f400, 0x1e, 0x4a1, 0x0, 0xe28a20, 0x0)
        /Users/han/.gvm/gos/go1.10.4/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc4308752f8, 0xc42445f400, 0x1e, 0x4a1, 0x0, 0x0, 0x0)
        /Users/han/.gvm/gos/go1.10.4/src/net/net.go:176 +0x6a

and this connection has actually been closed. The OS is Linux 2.6.32-696.13.2.el6.x86_64 #1 SMP Thu Oct 5 21:22:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

@ianlancetaylor (Contributor) commented Nov 3, 2018

Thanks, but I'm going to make the same comment:

This code has been improved in more recent releases. Is it possible for you to try Go 1.11?

If it still fails we will need some instructions on how to reproduce the problem.

@absolute8511 (Author) commented Nov 4, 2018

Currently, we cannot use Go 1.11.

I thought the Go 1.10 release is still supported until 1.12 is released. So will it be fixed if it happens in 1.10 but not in 1.11?

@absolute8511 changed the title from "net: goroutine permanently stuck in waitWrite() on tcp conn" to "net: goroutine permanently stuck in waitWrite() and waitRead() on tcp conn" Nov 4, 2018
@agnivade (Contributor) commented Nov 4, 2018

Hi @absolute8511 - please have a look at our backport policy https://github.com/golang/go/wiki/MinorReleases.

That said, if you can try out 1.11, we will know whether it has been fixed there, which will give us better visibility into the bug and inform a decision on whether to backport a fix to 1.10. Thank you.

In any case, you haven't provided any sample code along with instructions that reproduce the issue. Without that, it is very hard for us to debug this issue. Could you help with that?

@absolute8511 (Author) commented Nov 5, 2018

Basically, my code is a proxy that accepts connections from clients, receives data from each client, sends the data to the backend server, and waits for the server's reply.

	connCh := make(chan net.Conn, connChannelLength)
	for {
		if conn, err := l.Accept(); err != nil {
			return err
		} else {
			connCh <- conn
		}
	}

Another goroutine handles the accepted connections:

	for {
		select {
		case conn, ok := <-connCh:
			if ok {
				self.wg.Add(1)
				go func() {
					defer self.wg.Done()
					// Loop to read data from the client connection,
					// process the data and send it to the backend server,
					// and send the server's reply back to the client.
					// On any error, close the connection, break the loop, and return.
				}()
			} else {
				return
			}
		}
	}

@ianlancetaylor (Contributor) commented Nov 5, 2018

Unfortunately we are unlikely to be able to solve this without a complete reproduction case.

If you want to try to solve this yourself, try to recreate the problem while running the program under strace -f, and see when the connection was closed.
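
A hypothetical sketch of that diagnostic step (the PID 12345 and fd 9 below are placeholders, not values from this issue):

```shell
# Attach to the running server, follow all threads (-f), timestamp each
# syscall (-tt), and log only the calls relevant to socket lifetime.
strace -f -tt -e trace=close,shutdown,epoll_ctl,epoll_wait \
    -o /tmp/server.strace -p 12345

# Later, check whether the stuck connection's file descriptor (say fd 9)
# was closed before the poller stopped watching it.
grep -nE 'close\(9\)|EPOLL_CTL_DEL, 9,' /tmp/server.strace
```

Comparing the timestamps of the close(2) call against the surrounding epoll activity would show whether the descriptor was closed while a goroutine was still parked waiting on it.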

@absolute8511 (Author) commented Nov 6, 2018

I saw no connections on the port when the hang happened. Can you suggest what I should do next to identify the problem if strace -f shows the connection being closed but the goroutine still hangs in the Go runtime?

@ianlancetaylor (Contributor) commented Nov 6, 2018

Looking at lsof doesn't tell us the ordering of the calls to epoll and close, so I think strace -f is still the first step. I don't have any suggestions for the next step without seeing the results of the first, and I don't see how epoll could fail to report a closed descriptor.

@ianlancetaylor modified the milestones: Go1.12, Go1.13, Unplanned Dec 18, 2018
@gzcaimufu commented May 10, 2019

Hi @ianlancetaylor, recently I have encountered the same problem. Have you found the cause?

internal/poll.runtime_pollWait(0x7f24b8167900, 0x77, 0x0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42186ca98, 0x77, 0xffffffffffffff00, 0x1e95e20, 0x1e8e8a0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitWrite(0xc42186ca98, 0xc42450f100, 0xd, 0x2ad)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:94 +0x3d
internal/poll.(*FD).Write(0xc42186ca80, 0xc42450f000, 0x160, 0x400, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:227 +0x244
net.(*netFD).Write(0xc42186ca80, 0xc42450f000, 0x160, 0x400, 0xc42250adf8, 0x3, 0xa023cd6d80)
	/usr/local/go/src/net/fd_unix.go:220 +0x52
net.(*conn).Write(0xc4225d0158, 0xc42450f000, 0x160, 0x400, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:188 +0x6d
crypto/tls.(*Conn).write(0xc42051ea80, 0xc42450f000, 0x160, 0x400, 0xc421e421e0, 0xc42250ae01, 0x61ee64)
	/usr/local/go/src/crypto/tls/conn.go:832 +0x1a7
crypto/tls.(*Conn).writeRecordLocked(0xc42051ea80, 0x169fe17, 0xc420a79000, 0x143, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:909 +0x360
crypto/tls.(*Conn).Write(0xc42051ea80, 0xc420a79000, 0x143, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:1068 +0x1bb
net/http.http2stickyErrWriter.Write(0x7f24bda32be8, 0xc42051ea80, 0xc4200a1368, 0xc420a79000, 0x143, 0x1000, 0x1532d60, 0x0, 0x80000000000)
	/usr/local/go/src/net/http/h2_bundle.go:6803 +0x7a
net/http.(*http2stickyErrWriter).Write(0xc4225bc880, 0xc420a79000, 0x143, 0x1000, 0x143, 0x412447, 0xc423818a40)
	<autogenerated>:1 +0x74
bufio.(*Writer).Flush(0xc422991340, 0x16b, 0xc420374c80)
	/usr/local/go/src/bufio/bufio.go:567 +0x7e
net/http.(*http2ClientConn).writeHeaders(0xc4200a11e0, 0x10000016b, 0xc420374dba, 0x0, 0x133, 0x0, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:7436 +0x1ec
net/http.(*http2ClientConn).RoundTrip(0xc4200a11e0, 0xc421f3f800, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:7289 +0x46b

@ianlancetaylor (Contributor) commented May 10, 2019

@gzcaimufu Unfortunately I have not found the reason. Nothing has changed since I said "Unfortunately we are unlikely to be able to solve this without a complete reproduction case."

@gzcaimufu commented May 11, 2019

OK. Thanks for your reply.

@gzcaimufu commented May 15, 2019

@ianlancetaylor @bradfitz Hi, I see the comment below in h2_bundle.go. I am not sure whether it is related to the goroutine getting stuck above. Could you please have a look?

func (cc *http2ClientConn) writeHeaders(streamID uint32, endStream bool, hdrs []byte) error {
	....
	// TODO(bradfitz): this Flush could potentially block (as
	// could the WriteHeaders call(s) above), which means they
	// wouldn't respond to Request.Cancel being readable. That's
	// rare, but this should probably be in a goroutine.
	cc.bw.Flush()
	return cc.werr
}

@ianlancetaylor (Contributor) commented May 15, 2019

@gzcaimufu Thanks, but that comment is not related to this problem. That is a different level, and has to do with how a blocking TCP write will respond to request cancelation.

@gzcaimufu commented May 15, 2019

@ianlancetaylor Thanks a lot! For a goroutine stuck in internal/poll.runtime_pollWait (/usr/local/go/src/runtime/netpoll.go:173), can I set a timeout for that call? I just don't want it to leave the goroutine stuck. I find that even when I set a timeout on the http.Client, poll.runtime_pollWait still blocks for a long time. Please let me know if you have suggestions. Many thanks!

@ianlancetaylor (Contributor) commented May 15, 2019

@gzcaimufu I'm sorry, I would like to help, but it's very hard to give advice without knowing what the problem is.

Yes, you can set a timeout for network connections in general; see the SetDeadline method in https://golang.org/pkg/net/#Conn. But setting a timeout on http.Client should be sufficient and I don't know why it wouldn't be.

@gzcaimufu commented May 15, 2019

@ianlancetaylor OK, Thanks a lot!

@soyum2222 commented Aug 18, 2019

@absolute8511 I have the same problem. Have you solved it? I also noticed you are from China, so I wonder whether this is related to a NAT device. I found something interesting: if the TCP connection is idle for a long time, the server blocks in the read or write function, and after a few minutes the server closes the TCP conn. That phenomenon is consistent with NAT behavior. If so, I think using a heartbeat to keep the TCP connection active can solve this problem.

@absolute8511 (Author) commented Aug 19, 2019

I worked around this by setting a timeout on the Conn. What I am curious about is why the goroutine blocks in the Go runtime after the OS has closed the underlying TCP conn, not how to keep the TCP connection active.

@soyum2222 commented Aug 22, 2019

I think I solved this problem. The reason is as I said above; it is not a Go bug.

@agnivade (Contributor) commented Oct 14, 2019

@absolute8511 - It seems you have worked around this issue. I will go ahead and close it, because there is not enough information and there are no repro steps for us to proceed. Please feel free to report back with complete repro steps using Go 1.13 and output from strace -f. Thanks.

@agnivade closed this Oct 14, 2019