persistent "connection refused" errors after upgrade #1026

Closed
tamird opened this issue Dec 19, 2016 · 22 comments

@tamird
Contributor

tamird commented Dec 19, 2016

After merging cockroachdb/cockroach#9697, which bumped our version of grpc from 79b7c34 to 777daa1, we began to see the following error on one of our servers:

E161219 19:33:29.500130 457 storage/queue.go:597  [n2,replicaGC,s4,r6243/3:/Table/51/1/76{29630…-30792…},@c420d70900] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused"

Note that most of the error is produced by our code, but the final fragment comes from grpc:

rpc error: code = 13 desc = connection error: desc = "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused"

This error first appeared after one of our other servers was restarted. During the restart, some "connection refused" errors are expected, but grpc's internal reconnection logic should have cleared them once the other server became available again.

The very last fragment of this error comes from net, not grpc:

write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused

and this is the strangest part: this appears to be a net.OpError wrapping an os.SyscallError, but note that the supposed syscall was write and the reported error was ECONNREFUSED, which is not an error that write should ever return, as far as I can tell.
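
For reference, a minimal diagnostic sketch of that error shape, assuming one could get hold of the underlying error before grpc flattens it into the string above:

package main

import (
    "fmt"
    "net"
    "os"
    "syscall"
)

// describe checks whether err has the shape reported above: a *net.OpError
// for a "write" op wrapping an *os.SyscallError carrying ECONNREFUSED.
func describe(err error) {
    if opErr, ok := err.(*net.OpError); ok {
        if sysErr, ok := opErr.Err.(*os.SyscallError); ok {
            fmt.Printf("op=%q syscall=%q errno=%v (ECONNREFUSED? %v)\n",
                opErr.Op, sysErr.Syscall, sysErr.Err,
                sysErr.Err == syscall.ECONNREFUSED)
            return
        }
    }
    fmt.Println("some other error:", err)
}

func main() {
    // Hand-built example with the same shape as the error in this report.
    describe(&net.OpError{
        Op:  "write",
        Net: "tcp",
        Err: os.NewSyscallError("write", syscall.ECONNREFUSED),
    })
}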

cc @bradfitz @petermattis

@bradfitz
Contributor

I don't know what version of Go you're talking about here. If you find a problem with the net package in Go 1.8 compared to Go 1.7, please file a bug at https://github.com/golang/go/issues/new. Or if you find something underdocumented, likewise file a bug.

I don't know what's changed in gRPC lately to know what's biting you. I don't regularly work on gRPC.

@tamird
Contributor Author

tamird commented Dec 19, 2016 via email

@bdarnell
Contributor

note that the supposed syscall was write and the reported error was ECONNREFUSED, which is not an error that write should ever return, as far as I can tell.

Writing to a socket can return ECONNREFUSED (on Linux, but not BSD) if the connect was in non-blocking mode and you haven't consumed the error with getsockopt. (Or maybe this is a quirk of the Python socket wrappers? I haven't tried to reproduce this in Go.)

>>> import socket
>>> s=socket.socket()
>>> s.setblocking(False)
>>> s.connect(('localhost', 54321))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 115] Operation now in progress
>>> s.send(b'asdf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.error: [Errno 111] Connection refused

We've seen issues caused by write() returning unexpected errors before, in golang/go#14548.
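
For completeness, a rough Go translation of the Python session above (a sketch only: Linux-specific, using the raw syscall package because net.Dial does not return until the connect has completed, and assuming nothing is listening on port 54321):

package main

import (
    "fmt"
    "syscall"
    "time"
)

func main() {
    fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
    if err != nil {
        panic(err)
    }
    defer syscall.Close(fd)

    if err := syscall.SetNonblock(fd, true); err != nil {
        panic(err)
    }

    // Non-blocking connect to a port nothing is listening on.
    addr := &syscall.SockaddrInet4{Port: 54321, Addr: [4]byte{127, 0, 0, 1}}
    err = syscall.Connect(fd, addr)
    fmt.Println("connect:", err) // expected: "operation now in progress" (EINPROGRESS)

    // Give the loopback RST a moment to arrive (the interactive Python session
    // above has the same effect), then write without calling getsockopt first.
    time.Sleep(100 * time.Millisecond)
    _, err = syscall.Write(fd, []byte("asdf"))
    fmt.Println("write:", err) // on Linux this can be "connection refused" (ECONNREFUSED)
}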

@F21

F21 commented Jan 12, 2017

I am also seeing something similar, but in my case I have only started using grpc since 708a7f9. I am currently using Go 1.8beta2.

In my case, I have a grpc server running as an HTTP handler for an HTTP server using ServeHTTP(). I also have mutual TLS authentication enabled and am using a self-signed certificate for the HTTP server.

The setup is being used in an end-to-end test.
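
For reference, a minimal sketch of this kind of setup (certificate paths, CA pool, and service registration below are placeholders, not the actual test code):

package main

import (
    "crypto/tls"
    "crypto/x509"
    "io/ioutil"
    "log"
    "net/http"

    "google.golang.org/grpc"
)

func main() {
    // CA pool for verifying client certificates (mutual TLS); ca.pem is a placeholder.
    caCert, err := ioutil.ReadFile("ca.pem")
    if err != nil {
        log.Fatal(err)
    }
    caPool := x509.NewCertPool()
    caPool.AppendCertsFromPEM(caCert)

    grpcServer := grpc.NewServer()
    // Service registration omitted, e.g. pb.RegisterFooServer(grpcServer, &fooServer{}).

    httpServer := &http.Server{
        Addr:    ":443",
        Handler: grpcServer, // *grpc.Server implements http.Handler via ServeHTTP
        TLSConfig: &tls.Config{
            ClientAuth: tls.RequireAndVerifyClientCert,
            ClientCAs:  caPool,
        },
    }
    // ListenAndServeTLS enables HTTP/2 automatically; server.crt/server.key
    // stand in for the self-signed certificate and key.
    log.Fatal(httpServer.ListenAndServeTLS("server.crt", "server.key"))
}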

This is what is being logged:

2017/01/12 09:18:13 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/01/12 09:18:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:14 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:16 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:22 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:29 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:41 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}

If I start the server and use time.Sleep() to wait before dialing the server, I still get these errors, which is really strange. According to the timestamps, they happen on the first connection. Somehow, the connection does eventually succeed and the test passes.

@bradfitz
Contributor

Yeah, @F21, I confirm that TestServerCredsDispatch loops forever with that failure log message when using Go 1.8rc1.

@bradfitz
Contributor

Wait, I also see this on Go 1.7.

@bradfitz
Contributor

@MakMukhi, are you the owner of this?

@bradfitz
Contributor

This is still happening (I'm at 5095579):

$ go test -v -run=TestServerCredsDispatch google.golang.org/grpc/test
=== RUN   TestServerCredsDispatch
2017/01/26 17:28:34 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:35 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:37 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:40 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:44 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:51 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:29:01 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
....

@tamird
Contributor Author

tamird commented Jan 26, 2017

@bradfitz note that your issue is not the one I reported here; perhaps it deserves its own issue.

@bradfitz
Contributor

If anybody knew what the issue was, it'd probably be fixed. :)

@tamird
Contributor Author

tamird commented Jan 26, 2017

To clarify, the issue I reported here produces returned errors of the form "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused" (note the write).

The issue you're reporting here is different, and results in "ambient" error logging of the form "transport: dial tcp [::]:60765: connect: network is unreachable" (note the dial).

@bradfitz
Contributor

bradfitz commented Jan 26, 2017

I'll trust you if you think they're different. I haven't looked. I forked my bug report off into #1058.

@MakMukhi
Contributor

@tamird Can you send some more logs from around the error? For instance, were any other errors seen (http2Client.notifyError etc.)?
Some context about the connecting client would be useful too. For instance, what backoff strategy was used? I see in cockroachdb/cockroach#12713 that the reconnect happened every second, which seems odd.
Also, if I understand it right: you have a client that is connected to a server, the server goes away (abruptly?), and the client keeps trying to reconnect to it. The server later comes back up, but the client still won't connect?
A shot in the dark: when the server comes back up, does its IP address change, and if so, does the balancer's name resolver get updated to map to the right address?

@bdarnell
Contributor

This is happening right now on one of our test clusters. The IPs have not changed; one server crashed, was down for a time, and was restarted. The logs are full of this:

W170127 02:50:04.374599 96914562 storage/intent_resolver.go:352  [n1,s1,r64/7:/Table/51/1/14{12136…-29906…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.798731 96914733 storage/intent_resolver.go:352  [n1,s1,r422/1:/Table/51/1/40{39115…-57292…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.848520 96914767 storage/intent_resolver.go:352  [n1,s1,r423/3:/Table/51/1/86{26570…-44684…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.908616 96914878 storage/intent_resolver.go:352  [n1,s1,r428/3:/Table/51/1/83{72888…-90988…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.996322 96915001 storage/intent_resolver.go:352  [n1,s1,r428/3:/Table/51/1/83{72888…-90988…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
W170127 02:50:05.004075 96914928 storage/intent_resolver.go:352  [n1,s1,r423/3:/Table/51/1/86{26570…-44684…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:05.042729 96915124 storage/intent_resolver.go:352  [n1,s1,r45/5:/Table/51/1/13{41272…-58971…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"

Note that the error is write: connection refused, so we're getting ECONNREFUSED from a Write call (which isn't really supposed to happen, but I can't tell whether it's explicitly forbidden). And the local port number (33844) is constant, so it's not trying to reconnect over and over. Instead, what appears to be happening is that because the "connection refused" error is surfacing in an unexpected context, grpc continues to retry on the failed connection instead of closing it and opening a new one.

@MakMukhi
Contributor

MakMukhi commented Jan 28, 2017 via email

@sailat

sailat commented Feb 16, 2017

@tamird @bdarnell are you still seeing this issue?

@bdarnell
Contributor

Yes, it still happens. Here's a recent log snippet:

cockroach@cockroach-blue-0007.crdb.io: I170219 02:03:25.681933 693 vendor/google.golang.org/grpc/clientconn.go:866  grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"
cockroach@cockroach-blue-0007.crdb.io: E170219 02:07:48.257687 12830805 storage/replica_command.go:1851  [replica consistency checker,n7,s7,r130/59:/Table/51/1/756{11808…-23002…},@c4216af500] could not CollectChecksum from replica {6 6 61}: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"
cockroach@cockroach-blue-0007.crdb.io: E170219 02:11:12.643325 13043929 storage/replica_command.go:1851  [replica consistency checker,n7,s7,r2104/25:/Table/51/1/775{32005…-43245…},@c4212ed180] could not CollectChecksum from replica {6 6 26}: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"
cockroach@cockroach-blue-0007.crdb.io: E170219 02:20:16.612271 13671648 storage/replica_command.go:1851  [replica consistency checker,n7,s7,r558/46:/Table/51/1/474{18512…-29719…},@c421240000] could not CollectChecksum from replica {6 6 38}: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"

We haven't managed to reduce this to a simple reproduction, but this is on a test cluster where we periodically restart random nodes from cron (when this happens, the process is killed with -9, left down for 5 minutes, then restarted).

@bradfitz
Contributor

@bdarnell, sorry for the threadjack, but does Cockroach use streaming RPCs? I ask because I also ran into some gRPC problems, and while debugging I decided to delete gRPC-go's http2 stack and retrofit gRPC-go atop Go's native http2 support. The retrofit worked for me, but I didn't need streaming RPCs, so I haven't implemented them yet. Similarly, h2c support is easy but not yet done, because I didn't need it. (I just needed to do simple RPCs out to Google services over HTTPS.)

@bdarnell
Contributor

Yes, we use streaming RPCs extensively (and we need h2c support as well)

@bradfitz
Contributor

Okay. If you're interested, you can follow along at bradfitz#3 and bradfitz#4

@bdarnell
Contributor

I think there are two sides to this problem, one in Go's net package and one in GRPC.

On the Go side, I hypothesized in golang/go#14548 that the net poller was subject to spurious wakeups, which could result in Dial returning with nil error before the connection was completed. On Darwin, premature writes to a socket that is not yet connected would fail, so a Darwin-specific fix was put in place to ensure that the connection was complete before Dial returns. On Linux, the same spurious wakeup scenario is not a problem for connections that succeed, but it does mean that connection-related errors like ECONNREFUSED can be observed from system calls like write, where they are not otherwise possible. I'll file a Go issue for further investigation about how this should be resolved, but for now let's look at how GRPC handles this error.

In GRPC, this shifts the handling of this error from dial, where every error is a temporary connectionError (unless FailOnNonTempDialError is set, which is not the case for us), to creds.ClientHandshake, which may return a permanent error according to the result of isTemporary. (ECONNREFUSED is not temporary). Elsewhere in grpc/transport, errors from Write always turn into temporary connectionErrors, regardless of what isTemporary says.

I propose that if creds.ClientHandshake returns a net.Error and FailOnNonTempDialError is false, the connectionError should be considered temporary (if there are other more precise ways to distinguish the kinds of certificate errors that indicate a fatal misconfiguration, that would be nice, but it's not clear to me where exactly to draw the line). We could even go so far as to say that the TLS handshake is logically a part of the dial process and only return permanent errors for certificate problems if FailOnNonTempDialError is true, although that may be taking things too far.
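
As a rough sketch of the proposed classification (illustrative only; the real decision would live inside grpc's transport, and the helper below is hypothetical):

package main

import (
    "fmt"
    "net"
    "os"
    "syscall"
)

// isTemporaryHandshakeErr sketches the proposed rule: a network-level error
// (net.Error) returned from creds.ClientHandshake stays temporary unless
// FailOnNonTempDialError is set, so the reconnect loop keeps retrying with
// backoff; other errors (e.g. bad certificates) remain permanent.
func isTemporaryHandshakeErr(err error, failOnNonTempDialError bool) bool {
    if _, ok := err.(net.Error); ok && !failOnNonTempDialError {
        return true
    }
    return false
}

func main() {
    // The kind of error reported in this issue: ECONNREFUSED surfacing from write.
    writeRefused := &net.OpError{
        Op:  "write",
        Net: "tcp",
        Err: os.NewSyscallError("write", syscall.ECONNREFUSED),
    }
    fmt.Println(isTemporaryHandshakeErr(writeRefused, false)) // true: keep retrying

    // A certificate problem, by contrast, stays permanent.
    certErr := fmt.Errorf("x509: certificate signed by unknown authority")
    fmt.Println(isTemporaryHandshakeErr(certErr, false)) // false
}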

Finally in Cockroach, we should probably be detecting connections in the "shutdown" state and A) make loud noises about them and B) retry them with an appropriate backoff.
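
For illustration, client-side detection along those lines could look roughly like this (a sketch using the connectivity API grpc-go exposes; redialWithBackoff is a hypothetical helper, and the API available at the time may have differed):

package rpcwatch

import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/connectivity"
)

// watchConn polls a *grpc.ClientConn's state and redials when the connection
// appears wedged, logging loudly when it does so.
func watchConn(ctx context.Context, conn *grpc.ClientConn, target string,
    redialWithBackoff func(context.Context, string) *grpc.ClientConn) *grpc.ClientConn {
    for {
        state := conn.GetState()
        if state == connectivity.TransientFailure || state == connectivity.Shutdown {
            log.Printf("grpc connection to %s wedged in state %v; redialing", target, state)
            conn.Close()
            conn = redialWithBackoff(ctx, target) // hypothetical helper with exponential backoff
            continue
        }
        // Block until the state changes (or ctx is done).
        if !conn.WaitForStateChange(ctx, state) {
            return conn
        }
    }
}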

bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 27, 2017
Works around grpc/grpc-go#1026, which results in connections getting permanently wedged by "connection refused" errors being returned at the wrong time.
@tamird
Contributor Author

tamird commented Oct 22, 2017

I believe golang/go@bf0f692 fixed this for Go 1.9.

lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018