persistent "connection refused" errors after upgrade #1026
I don't know what version of Go you're talking about here. If you find a problem with the net package in Go 1.8 compared to Go 1.7, please file a bug at https://github.com/golang/go/issues/new . Or if you find something underdocumented, likewise file a bug. I don't know what's changed in gRPC lately to know what's biting you. I don't regularly work on gRPC. |
Sorry - this was observed with Go 1.7.3 and not associated with a change in the Go version used.
|
Writing to a socket can return
We've seen issues caused by |
I am also seeing something similar, but in my case I have only started using grpc since 708a7f9. I am currently using Go 1.8beta2. In my case, I have a grpc server running as an HTTP handler for an HTTP server using
The setup is being used in an end-to-end test. This is what is being logged:
If I start the server and use |
Yeah, @F21, I confirm that TestServerCredsDispatch loops forever with that failure log message when using Go 1.8rc1. |
Wait, I also see this on Go 1.7. |
@MakMukhi, are you the owner of this? |
This is still happening (I'm at 5095579):
|
@bradfitz note that your issue is not the one I reported here; perhaps it deserves its own issue. |
If anybody knew what the issue was, it'd probably be fixed. :) |
To clarify, the issue I reported here produces returned errors of the form: "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused" (emphasis mine). The issue you're reporting here is different, and results in "ambient" error logging of the form: "transport: dial tcp [::]:60765: connect: network is unreachable" (emphasis again mine). |
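The distinction drawn above — a "write" failure on an established connection versus a "dial" failure — is recorded in the `Op` field of the `net.OpError` the net package returns. A minimal sketch (the errors here are constructed by hand for illustration; at runtime the net package produces them itself):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// Hand-constructed stand-ins for the two error shapes discussed above.
var writeErr error = &net.OpError{Op: "write", Net: "tcp",
	Err: os.NewSyscallError("write", syscall.ECONNREFUSED)}
var dialErr error = &net.OpError{Op: "dial", Net: "tcp",
	Err: os.NewSyscallError("connect", syscall.ENETUNREACH)}

// opOf extracts the operation ("write", "dial", ...) that tells the two
// failure modes apart.
func opOf(err error) string {
	var op *net.OpError
	if errors.As(err, &op) {
		return op.Op
	}
	return ""
}

func main() {
	fmt.Println(opOf(writeErr)) // write
	fmt.Println(opOf(dialErr))  // dial
}
```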
I'll trust you if you think they're different. I haven't looked. I forked my bug report off into #1058. |
@tamird Can you send some more logs around the error? For instance, if any other errors were seen (http2client.notifyError etc.) |
This is happening right now on one of our test clusters. The IPs have not changed; one server crashed, was down for a time, and was restarted. The logs are full of this:
W170127 02:50:04.374599 96914562 storage/intent_resolver.go:352 [n1,s1,r64/7:/Table/51/1/14{12136…-29906…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.798731 96914733 storage/intent_resolver.go:352 [n1,s1,r422/1:/Table/51/1/40{39115…-57292…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.848520 96914767 storage/intent_resolver.go:352 [n1,s1,r423/3:/Table/51/1/86{26570…-44684…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.908616 96914878 storage/intent_resolver.go:352 [n1,s1,r428/3:/Table/51/1/83{72888…-90988…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:04.996322 96915001 storage/intent_resolver.go:352 [n1,s1,r428/3:/Table/51/1/83{72888…-90988…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
W170127 02:50:05.004075 96914928 storage/intent_resolver.go:352 [n1,s1,r423/3:/Table/51/1/86{26570…-44684…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
W170127 02:50:05.042729 96915124 storage/intent_resolver.go:352 [n1,s1,r45/5:/Table/51/1/13{41272…-58971…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused"
Note that the error is write: connection refused, so we're getting ECONNREFUSED from a Write call (which isn't really supposed to happen, but I can't tell whether it's explicitly forbidden). And the local port number (33844) is constant, so it's not trying to reconnect over and over. Instead, what appears to be happening is that because the "connection refused" error is happening in an unexpected context, grpc continues to retry on the failed connection instead of closing it so it can reopen a new one. |
I haven't had the chance to look at this today, but I'm going to statically analyze the code more on Monday to see which code path might be causing this issue. Sorry about the delay. It would be quite helpful, though, if you can reproduce this error and provide the scenario.
Best,
Mak |
Yes, it still happens. Here's a recent log snippet:
We haven't managed to reduce this to a simple reproduction, but this is on a test cluster where we periodically restart random nodes from cron (when this happens, the process is killed with |
@bdarnell, sorry for the threadjack, but does Cockroach use streaming RPCs? I ask because I also ran into some gRPC problems but while debugging I just decided to delete gRPC-go's http2 stack and retrofit gRPC-go atop Go's native http2 support. The retrofit worked for me, but I didn't need streaming RPCs so I didn't implement them yet. Similarly, h2c support is easy but not yet done, because I didn't need it. (I just needed to do simple RPCs out to Google services over https) |
Yes, we use streaming RPCs extensively (and we need h2c support as well). |
Okay. If you're interested, you can follow along at bradfitz#3 and bradfitz#4 |
I think there are two sides to this problem, one in Go's
On the Go side, I hypothesized in golang/go#14548 that the net poller was subject to spurious wakeups, which could result in
In GRPC, this shifts the handling of this error from
I propose that if
Finally in Cockroach, we should probably be detecting connections in the "shutdown" state and A) make loud noises about them and B) retry them with an appropriate backoff. |
Works around grpc/grpc-go#1026, which results in connections getting permanently wedged by "connection refused" errors being returned at the wrong time.
I believe golang/go@bf0f692 fixed this for Go 1.9. |
After merging cockroachdb/cockroach#9697, which bumped our version of grpc from 79b7c34 to 777daa1, we began to see the following error on one of our servers:
Note that most of the error is produced by our code, but the final fragment comes from grpc:
This error first appeared after one of our other servers was restarted - during this time, some connection refused errors are expected, but the grpc internal reconnection logic should have cleared them after the other server became available.
The very last fragment of this error comes from net, not grpc:
And this is the strangest part - this appears to be a net.OpError wrapping an os.SyscallError, but note that the supposed syscall was write and the reported error was ECONNREFUSED, which is not an error that write should ever return, as far as I can tell.
cc @bradfitz @petermattis