"io.grpc.StatusRuntimeException: UNKNOWN: channel closed" on network disconnect #10120

cocreature opened this issue May 2, 2023 · 8 comments

@cocreature

What version of gRPC-Java are you using?

1.44.0

What is your environment?

Ubuntu 22.04

What did you expect to see?

An UNAVAILABLE status code or something similar

What did you see instead?

We saw calls fail with the following exception and stack trace:

io.grpc.StatusRuntimeException: UNKNOWN: channel closed
	at io.grpc.Status.asRuntimeException(Status.java:535)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:479)
	at io.opentelemetry.instrumentation.grpc.v1_6.TracingClientInterceptor$TracingClientCall$TracingClientCallListener.onClose(TracingClientInterceptor.java:161)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at com.daml.executors.QueueAwareExecutorService$TrackingRunnable.run(QueueAwareExecutorService.scala:98)
	at com.daml.metrics.InstrumentedExecutorServiceMetrics$InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorServiceMetrics.scala:202)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.nio.channels.ClosedChannelException: null
	at io.grpc.netty.Utils.statusFromThrowable(Utils.java:271)
	at io.grpc.netty.NettyClientHandler.onConnectionError(NettyClientHandler.java:500)
	at io.netty.handler.codec.http2.Http2ConnectionHandler.onError(Http2ConnectionHandler.java:641)
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders0(DefaultHttp2ConnectionEncoder.java:251)
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:167)
	at io.netty.handler.codec.http2.DecoratingHttp2FrameWriter.writeHeaders(DecoratingHttp2FrameWriter.java:53)
	at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:170)
	at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:158)
	at io.grpc.netty.NettyClientHandler.createStreamTraced(NettyClientHandler.java:609)
	at io.grpc.netty.NettyClientHandler.createStream(NettyClientHandler.java:592)
	at io.grpc.netty.NettyClientHandler.write(NettyClientHandler.java:326)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at io.netty.channel.DefaultChannelPipeline.write(DefaultChannelPipeline.java:1015)
	at io.netty.channel.AbstractChannel.write(AbstractChannel.java:301)
	at io.grpc.netty.WriteQueue$AbstractQueuedCommand.run(WriteQueue.java:213)
	at io.grpc.netty.WriteQueue.flush(WriteQueue.java:128)
	at io.grpc.netty.WriteQueue.access$000(WriteQueue.java:34)
	at io.grpc.netty.WriteQueue$1.run(WriteQueue.java:46)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.channel.StacklessClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)

Interestingly, it does look like the channel recovered from this after the connection was established again.

Steps to reproduce the bug

In our test setup, we kill the connection with toxiproxy and then see this failure, but only relatively rarely. Unfortunately, I don't have a reliable reproduction (nor one that I can make public).

Is that expected? Given that it recovers, should we just retry on UNKNOWN: channel closed like we do on UNAVAILABLE?
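
For reference, this is roughly the application-side check the question boils down to (a minimal sketch; the isRetryable helper is hypothetical, not our actual code):

    // Sketch of an application-side retry decision (hypothetical helper).
    // Today we retry on UNAVAILABLE; the question is whether
    // UNKNOWN ("channel closed") should be treated the same way.
    import io.grpc.Status;

    static boolean isRetryable(Throwable t) {
      Status.Code code = Status.fromThrowable(t).getCode();
      return code == Status.Code.UNAVAILABLE
          || code == Status.Code.UNKNOWN; // would be needed for "channel closed"
    }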

@YifeiZhuang
Contributor

YifeiZhuang commented May 2, 2023

The exception looks like expected behaviour. The error needs to be handled in your StreamObserver/call listener; a retry sounds fair. A retry policy in gRPC may help to do that automatically: see the retry example.
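
For instance, something along these lines (a minimal sketch modeled on the retry example; the service name and target address are placeholders):

    // Enable gRPC's automatic retries via a default service config
    // (sketch modeled on the gRPC-Java retry example; names are placeholders).
    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import java.util.List;
    import java.util.Map;

    Map<String, Object> serviceConfig = Map.of(
        "methodConfig", List.of(Map.of(
            "name", List.of(Map.of("service", "helloworld.Greeter")),
            "retryPolicy", Map.of(
                "maxAttempts", 4.0,
                "initialBackoff", "0.5s",
                "maxBackoff", "30s",
                "backoffMultiplier", 2.0,
                "retryableStatusCodes", List.of("UNAVAILABLE")))));

    ManagedChannel channel = ManagedChannelBuilder.forTarget("localhost:50051")
        .usePlaintext()
        .defaultServiceConfig(serviceConfig)
        .enableRetry()
        .build();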

@ejona86
Member

ejona86 commented May 2, 2023

I agree that disconnects are generally UNAVAILABLE. The problem in this case is "you shouldn't have seen that error." It could have been normal, or it could have been an internal error.

9bdb8f0 is working, and we have some stacktrace information to go on. But it is rather strange that we learn the connection is closed only when trying to write headers.

Are you using plain-text (no TLS)?

@cocreature
Author

cocreature commented May 3, 2023

Yes, this is plaintext.

@ejona86 ejona86 added the bug label May 3, 2023
@ejona86 ejona86 added this to the Next milestone May 3, 2023
@ejona86 ejona86 added the netty label May 3, 2023
@ejona86
Member

ejona86 commented May 3, 2023

I think there's enough here to make educated guesses about ways that might improve it. But it will be troublesome without a reproduction. It is timing-based, so any reproduction will be probabilistic.

@cocreature, could you share a toxiproxy configuration that kills the connection? I'm not already familiar with it, but it seems great for trying to trigger this. Are you using reset_peer?

@cocreature
Author

Let me describe our toxiproxy config as best I can:
We're using toxiproxy-server version 2.5.0 and the toxiproxy Java client.

We're creating the proxy within our tests through client.createProxy(name, listenAddress, upstreamAddress) where client is created through new ToxiproxyClient().

Our client (which is throwing the error) connects to the listen address of the proxy. The actual server is behind the upstream address.

To break the connection we use proxy.disable().

That triggers the down behaviour described at https://github.com/shopify/toxiproxy#down.

We don't use reset_peer.
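
Putting it together, the setup is roughly this (a minimal sketch assuming the toxiproxy-java client from eu.rekawek.toxiproxy; names, addresses, and ports are placeholders):

    // Sketch of our toxiproxy setup (names and addresses are placeholders).
    import eu.rekawek.toxiproxy.Proxy;
    import eu.rekawek.toxiproxy.ToxiproxyClient;

    ToxiproxyClient client = new ToxiproxyClient(); // toxiproxy-server on localhost:8474
    Proxy proxy = client.createProxy("grpc", "localhost:6666", "localhost:50051");
    // The gRPC client connects to localhost:6666; the real server is on localhost:50051.

    proxy.disable(); // break the connection (toxiproxy "down")
    // ... the UNKNOWN: channel closed failure shows up here, rarely ...
    proxy.enable();  // restore connectivity; the channel recovers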

@ejona86
Member

ejona86 commented May 5, 2023

Looks like down just closes the connection, without setting linger to 0 (as used by reset).
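
To illustrate the distinction in plain java.net terms (a sketch, not toxiproxy code; the address is a placeholder):

    import java.net.Socket;

    Socket s = new Socket("localhost", 50051); // placeholder address

    // What "down" does: orderly close, the peer sees FIN / end-of-stream.
    s.close();

    // What "reset" does: with SO_LINGER set to 0, close() sends a TCP RST,
    // and the peer sees "Connection reset by peer".
    // s.setSoLinger(true, 0);
    // s.close();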

@marx-freedom

Getting the same exception.
Unfortunately, we only have a trimmed stack trace in our logs.

io.grpc.StatusRuntimeException: UNKNOWN: channel closed
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:275)
Caused by:
java.nio.channels.ClosedChannelException
        at io.grpc.netty.Utils.statusFromThrowable(Utils.java:275)
Caused by:
io.netty.channel.StacklessClosedChannelException
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
Caused by:
io.netty.channel.unix.Errors$NativeIoException: sendAddress(..) failed: Connection reset by peer

@ejona86
Member

ejona86 commented Aug 10, 2023

@marx-freedom, your issue seems unrelated to this. File a separate issue if you need to discuss it. In your case gRPC did know why the connection was closed: "Connection reset by peer."
