
net: Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery #41549

Closed
leventov opened this issue Sep 22, 2020 · 9 comments

@leventov leventov commented Sep 22, 2020

What version of Go are you using (go version)?

go1.15

What operating system and processor architecture are you using (go env)?

Raspberry Pi Compute Module 3+, Linux 4.19.88 #1 SMP Fri Jul 17 09:42:11 UTC 2020 armv7l GNU/Linux.

Additionally, the process runs within a Docker container.

What did you do?

The internet connection dropped and then recovered.

Afterwards, tls.Conn.Read() remained stuck in runtime_pollWait:

1 @ 0x48608 0x412e8 0x75ffc 0xde6a8 0xdf670 0xdf655 0x1d1ab0 0x1e24e8 0x21a5ec 0x10f724 0x21a834 0x2180a4 0x21d074 0x21d07d 0xda238 0x4cd528 0x4cd4fd 0x4e54e8 0x7ad2c
#	0x75ffb		internal/poll.runtime_pollWait+0x43				runtime/netpoll.go:220
#	0xde6a7		internal/poll.(*pollDesc).wait+0x2f				internal/poll/fd_poll_runtime.go:87
#	0xdf66f		internal/poll.(*pollDesc).waitRead+0x17b			internal/poll/fd_poll_runtime.go:92
#	0xdf654		internal/poll.(*FD).Read+0x160					internal/poll/fd_unix.go:159
#	0x1d1aaf	net.(*netFD).Read+0x37						net/fd_posix.go:55
#	0x1e24e7	net.(*conn).Read+0x63						net/net.go:182
#	0x21a5eb	crypto/tls.(*atLeastReader).Read+0x77				crypto/tls/conn.go:779
#	0x10f723	bytes.(*Buffer).ReadFrom+0xa3					bytes/buffer.go:204
#	0x21a833	crypto/tls.(*Conn).readFromUntil+0xc3				crypto/tls/conn.go:801
#	0x2180a3	crypto/tls.(*Conn).readRecordOrCCS+0xfb				crypto/tls/conn.go:608
#	0x21d073	crypto/tls.(*Conn).readRecord+0x14f				crypto/tls/conn.go:576
#	0x21d07c	crypto/tls.(*Conn).Read+0x158					crypto/tls/conn.go:1252
#	0xda237		io.ReadAtLeast+0x6b						io/io.go:314
#	0x4cd527	io.ReadFull+0x67						io/io.go:333
#	0x4cd4fc	github.com/eclipse/paho.mqtt.golang/packets.ReadPacket+0x3c	github.com/eclipse/paho.mqtt.golang@v1.2.0/packets/packets.go:105
#	0x4e54e7	github.com/eclipse/paho%2emqtt%2egolang.incoming+0xe7		github.com/eclipse/paho.mqtt.golang@v1.2.0/net.go:132

Might be related to #27752

@davecheney davecheney (Contributor) commented Sep 22, 2020

This is expected if a timeout has not been set on the connection. Has a timeout been set before calling Read?

@leventov leventov (Author) commented Sep 22, 2020

So there should probably be a SetReadDeadline() call before this line?
https://github.com/eclipse/paho.mqtt.golang/blob/ba85050a1f239f4e954dc95920213db51f937df1/net.go#L119

Still, I would expect a read call (even one without a timeout) either to fail with an "internet disconnected" error when the connection is lost, or to become unstuck once the connection recovers, rather than remaining stuck indefinitely.

@davecheney davecheney (Contributor) commented Sep 22, 2020

Yup, if it’s important, it needs a timeout.

> Still, I would expect that a read call (even untimed) would error with "internet disconnected" on internet disconnection, or would unstuck again when the internet connection has recovered, but not just stuck

If the operating system has not signalled that the TCP connection has been closed or reset, there's not much the runtime can do from user space.

@leventov leventov (Author) commented Sep 22, 2020

So you think it is a kernel/Docker problem that the socket isn't closed when the internet connection drops, or nobody's problem at all?

The runtime could probably detect the internet disconnection event and fail all outstanding Reads.

@davecheney davecheney (Contributor) commented Sep 22, 2020

> The runtime could probably detect the internet disconnection event and fail all outstanding Reads.

The network fd is handled by epoll (on Linux), and if no event is received from the kernel, there's nothing the runtime can do.

@networkimprov networkimprov commented Sep 23, 2020

See also #31490 re TCP keepalive problems.

TCP keepalive is on by default for both client and server net.Conns.

@cagedmantis cagedmantis changed the title tls.Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery tls: Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery Sep 28, 2020
@cagedmantis cagedmantis changed the title tls: Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery crypto/tls: Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery Sep 28, 2020
@cagedmantis cagedmantis added this to the Backlog milestone Sep 28, 2020
@cagedmantis cagedmantis (Contributor) commented Sep 28, 2020

@FiloSottile FiloSottile changed the title crypto/tls: Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery net: Conn.Read() stuck in runtime_pollWait after internet connection loss and recovery Oct 5, 2020
@FiloSottile FiloSottile (Member) commented Oct 5, 2020

Doesn't look like a crypto/tls specific issue, please tag me back in if I'm wrong.

@ianlancetaylor ianlancetaylor (Contributor) commented Oct 5, 2020

I don't think there is anything we can change in the Go standard library here, so I'm going to close the issue.

Please comment if you disagree.
