Occasional NullPointerException in io.grpc.internal.RetriableStream.drain(RetriableStream.java:279) when using hedgingPolicy #9185

Closed
srmrz opened this issue May 18, 2022 · 10 comments · Fixed by #10007

srmrz commented May 18, 2022

What version of gRPC-Java are you using?

Version 1.42.2

What is your environment?

Linux

What did you expect to see?

I'm using the hedging retry policy via configuration and occasionally see a NullPointerException pop up.

Here's a snippet of the Kotlin code that configures the hedging policy with an 85ms hedging delay:

.defaultServiceConfig(
    mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf(
                        "service" to "my.org.Service",
                        "method" to "MyMethod"
                    )
                ),
                "waitForReady" to true,
                "hedgingPolicy" to mapOf(
                    "maxAttempts" to 2.
                    "hedgingDelay" to "0.085s",
                    "nonFatalStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
)

What did you see instead?

java.lang.NullPointerException: null
    at io.grpc.internal.RetriableStream.drain(RetriableStream.java:279)
    at io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
    at io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
    at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
    at io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
    at io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:834)

Steps to reproduce the bug

Enable the hedging policy shown above.

@ejona86 ejona86 added the bug label May 18, 2022
@ejona86 ejona86 added this to the Next milestone May 18, 2022

ejona86 commented May 18, 2022

Interesting. I think this is a regression from cbda32a. It seems state.buffer == null, which means passthrough == true, which means some other stream was committed and this stream should be cancelled. But it hadn't processed enough of the buffer yet to start the stream, which means it has trouble cancelling it.
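
A simplified sketch of that kind of race (illustrative Kotlin only, not the actual RetriableStream code; the names here are stand-ins for the internals described above):

// Illustrative model only: a drain loop replays buffered operations onto a real
// stream while another hedged attempt can be committed concurrently, nulling the
// buffer ("passthrough"). Using a stale non-null reference after that point is
// the shape of the NPE described above.
class HedgedAttempt {
    private val lock = Any()
    private var buffer: MutableList<Runnable>? = mutableListOf()

    // Called when some other hedge wins: this attempt should just be cancelled.
    fun commitOtherAttempt() {
        synchronized(lock) { buffer = null }
    }

    fun drain() {
        var index = 0
        while (true) {
            var entry: Runnable? = null
            synchronized(lock) {
                // Re-check under the lock: if another attempt was committed,
                // stop draining instead of dereferencing a null buffer.
                val buf = buffer ?: return
                if (index < buf.size) entry = buf[index++]
            }
            (entry ?: return).run() // replay a buffered call (e.g. start, sendMessage)
        }
    }
}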

ejona86 commented May 18, 2022

This is a race and it seems it could only be avoided by avoiding hedging. Retries would still be safe.

srmrz commented May 18, 2022

I see. Thanks @ejona86 for triaging this so quickly.

Hedging is working really well for us to reduce tail latency. I'm hoping we can continue using it and help resolve the race condition.

ejona86 commented May 18, 2022

Actually... what was the impact on the RPC in this case? It looks like this might just be logged, but the RPC could still complete successfully, which means things may mostly work if you just ignore the log.

There is a risk of a memory leak developing, but since it didn't end up starting the stream, I think gRPC wouldn't leak any memory when using the Netty transport and a connection is available. If there is no connection available, it might leak.

ejona86 commented May 18, 2022

This is a race and it seems it could only be avoided by avoiding hedging

This was in the context of a workaround for the problem.
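
For reference, a minimal sketch of what that retry-based workaround could look like for the same method as above (the backoff values here are illustrative, not taken from this thread):

.defaultServiceConfig(
    mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf(
                        "service" to "my.org.Service",
                        "method" to "MyMethod"
                    )
                ),
                "waitForReady" to true,
                // retryPolicy instead of hedgingPolicy: only one attempt drains
                // at a time, so the concurrent-drain race isn't hit.
                "retryPolicy" to mapOf(
                    "maxAttempts" to 2.0,
                    "initialBackoff" to "0.085s",  // illustrative
                    "maxBackoff" to "1s",          // illustrative
                    "backoffMultiplier" to 2.0,    // illustrative
                    "retryableStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
)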

srmrz commented May 18, 2022

Actually... what was the impact on the RPC in this case? It looks like this might just be logged, but the RPC could still complete successfully, which means things may mostly work if you just ignore the log.

From what I can tell, all requests are succeeding. This is just logged on the side.

That said, I did see a single instance of this message:

LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
	io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
	io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:123)
	io.grpc.netty.shaded.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
	io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:226)
	io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
	io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
	io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
	io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
	io.grpc.internal.RetriableStream$1SendMessageEntry.runWith(RetriableStream.java:549)
	io.grpc.internal.RetriableStream.drain(RetriableStream.java:310)
	io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
	io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
	io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
	io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
	io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base/java.lang.Thread.run(Thread.java:834)

ejona86 commented May 18, 2022

Since writing a message has to be done after start(), it feels like that buffer leak should be unrelated. But that stack trace is in the same hedging draining code, which does make it suspiciously related.

srmrz commented May 19, 2022

which does make it suspiciously related.

Yeah!

I do see several TODOs in the class ("TODO(b/145386688)", from d3c77f2) that may be related to this race condition 🤷?

Example:

// TODO(b/145386688): This access should be guarded by
// 'HedgingRunnable.this.scheduledHedgingRef.lock'; instead found:
// 'RetriableStream.this.lock'

// TODO(b/145386688): This access should be guarded by 'this.scheduledRetry.lock'; instead
// found: 'this.lock'

// TODO(b/145386688): This access should be guarded by 'this.scheduledHedging.lock'; instead
// found: 'this.lock'

// TODO(b/145386688): This access should be guarded by 'this.scheduledHedging.lock'; instead
// found: 'this.lock'

ejona86 commented May 19, 2022

I don't think those guarded-by comments are that important. All those locks are the same instance, just aliases. So it's more that the code is written in a way that can't be verified by tooling.
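
A generic illustration of that point (made-up names, not grpc-java code): the two fields below refer to the same lock object, so guarding the field with either one is equivalent at runtime, but a static checker only compares the guard expressions.

class Example {
    private val lock = Any()
    private val scheduledHedgingLock = lock // alias: the very same instance

    private var scheduledHedging: Runnable? = null // conceptually guarded by scheduledHedgingLock

    fun cancelHedging() {
        synchronized(lock) {        // same monitor as scheduledHedgingLock at runtime
            scheduledHedging = null // safe, but a @GuardedBy-style checker compares
        }                           // lock expressions, not object identity, so it
    }                               // flags this access
}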

ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 3, 2023
The problem was one hedge was committed before another had drained
start(). This was not testable because HedgingRunnable checks whether
scheduledHedgingRef is cancelled, which is racy, but there's no way to
deterministically trigger either race.

The same problem couldn't be triggered with retries because only one
attempt will be draining at a time. Retries with cancellation also
couldn't trigger it, for the surprising reason that the noop stream used
in cancel() wasn't considered drained.

This commit marks the noop stream as drained with cancel(), which allows
memory to be garbage collected sooner and exposes the race for tests.
That then showed the stream as hanging, because inFlightSubStreams
wasn't being decremented.

Fixes grpc#9185
@ejona86 ejona86 self-assigned this Apr 3, 2023

ejona86 commented Apr 3, 2023

I believe this issue became more serious in 1.53. There was a change in 1.53 that fixed some lifecycle issues, but when this is triggered I think it would cause the RPC to hang and never receive its final status. So if you are seeing this issue, you may need to avoid 1.53 and 1.54 until a patch is out.
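
For anyone who needs to stay off the affected releases in the meantime, a sketch of pinning the gRPC dependency in Gradle (Kotlin DSL); 1.52.1 is only an example of a pre-1.53 version, and the actual patched releases should be confirmed against the release notes:

dependencies {
    // Example only: pin below 1.53 until a release containing the fix is available.
    implementation("io.grpc:grpc-netty-shaded:1.52.1")
    implementation("io.grpc:grpc-protobuf:1.52.1")
    implementation("io.grpc:grpc-stub:1.52.1")
}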

ejona86 added a commit that referenced this issue Apr 11, 2023
@ejona86 ejona86 modified the milestones: Next, 1.55 Apr 13, 2023
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 13, 2023
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 13, 2023
ejona86 added a commit that referenced this issue Apr 13, 2023
ejona86 added a commit that referenced this issue Apr 13, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2023