Occasional NullPointerException in io.grpc.internal.RetriableStream.drain(RetriableStream.java:279) when using hedgingPolicy #9185

Closed
srmrz opened this issue May 18, 2022 · 10 comments · Fixed by #10007

srmrz commented May 18, 2022

What version of gRPC-Java are you using?

Version 1.42.2

What is your environment?

Linux

What did you expect to see?

I'm using the hedging retry policy via configuration and occasionally see a NullPointerException pop up.

Here's a snippet of the Kotlin code that configures the hedging policy with an 85ms hedging delay:

.defaultServiceConfig(
    mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf(
                        "service" to "my.org.Service",
                        "method" to "MyMethod"
                    )
                ),
                "waitForReady" to true,
                "hedgingPolicy" to mapOf(
                    "maxAttempts" to 2.
                    "hedgingDelay" to "0.085s",
                    "nonFatalStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
)

What did you see instead?

java.lang.NullPointerException: null
    at io.grpc.internal.RetriableStream.drain(RetriableStream.java:279)
    at io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
    at io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
    at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
    at io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
    at io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:834)

Steps to reproduce the bug

Enable the hedging policy shown above.

@ejona86 ejona86 added the bug label May 18, 2022
@ejona86 ejona86 added this to the Next milestone May 18, 2022

ejona86 commented May 18, 2022

Interesting. I think this is a regression from cbda32a. It seems state.buffer == null, which means passthrough == true, which means some other stream was committed and this stream should be cancelled. But it hadn't processed enough of the buffer yet to start the stream, which means it has trouble cancelling it.
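
A simplified sketch of that kind of race (illustrative Kotlin only, not the actual RetriableStream code; the names here are stand-ins for the internals described above):

// Illustrative model only: a drain loop replays buffered operations onto a real
// stream while another hedged attempt can be committed concurrently, nulling the
// buffer ("passthrough"). Using a stale non-null reference after that point is
// the shape of the NPE described above.
class HedgedAttempt {
    private val lock = Any()
    private var buffer: MutableList<Runnable>? = mutableListOf()

    // Called when some other hedge wins: this attempt should just be cancelled.
    fun commitOtherAttempt() {
        synchronized(lock) { buffer = null }
    }

    fun drain() {
        var index = 0
        while (true) {
            var entry: Runnable? = null
            synchronized(lock) {
                // Re-check under the lock: if another attempt was committed,
                // stop draining instead of dereferencing a null buffer.
                val buf = buffer ?: return
                if (index < buf.size) entry = buf[index++]
            }
            (entry ?: return).run() // replay a buffered call (e.g. start, sendMessage)
        }
    }
}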

ejona86 commented May 18, 2022

This is a race and it seems it could only be avoided by avoiding hedging. Retries would still be safe.

srmrz commented May 18, 2022

I see. Thanks @ejona86 for triaging this so quickly.

Hedging is working really well for us to reduce tail latency. I'm hoping we can continue using it and help resolve the race condition.

ejona86 commented May 18, 2022

Actually... what was the impact on the RPC in this case? It looks like this might just be logged, but the RPC could still complete successfully, which means things may mostly work if you just ignore the log.

There is a risk of a memory leak developing, but since it didn't end up starting the stream, I think gRPC wouldn't leak any memory when using the Netty transport and a connection is available. If there is no connection available, it might leak.

ejona86 commented May 18, 2022

This is a race and it seems it could only be avoided by avoiding hedging

This was in the context of a workaround for the problem.
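
For reference, a minimal sketch of what that retry-based workaround could look like for the same method as above (the backoff values here are illustrative, not taken from this thread):

.defaultServiceConfig(
    mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf(
                        "service" to "my.org.Service",
                        "method" to "MyMethod"
                    )
                ),
                "waitForReady" to true,
                // retryPolicy instead of hedgingPolicy: only one attempt drains
                // at a time, so the concurrent-drain race isn't hit.
                "retryPolicy" to mapOf(
                    "maxAttempts" to 2.0,
                    "initialBackoff" to "0.085s",  // illustrative
                    "maxBackoff" to "1s",          // illustrative
                    "backoffMultiplier" to 2.0,    // illustrative
                    "retryableStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
)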

srmrz commented May 18, 2022

Actually... what was the impact on the RPC in this case? It looks like this might just be logged, but the RPC could still complete successfully, which means things may mostly work if you just ignore the log.

From what I can tell, all requests are succeeding. This is just logged on the side.

That said, I did see a single instance of this message:

LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
	io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
	io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:123)
	io.grpc.netty.shaded.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
	io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:226)
	io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
	io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
	io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
	io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
	io.grpc.internal.RetriableStream$1SendMessageEntry.runWith(RetriableStream.java:549)
	io.grpc.internal.RetriableStream.drain(RetriableStream.java:310)
	io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
	io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
	io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
	io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
	io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base/java.lang.Thread.run(Thread.java:834)

ejona86 commented May 18, 2022

Since writing a message has to be done after start(), it feels like that buffer leak should be unrelated. But that stack trace is in the same hedging draining code, which does make it suspiciously related.

srmrz commented May 19, 2022

which does make it suspiciously related.

Yeah!

I do see several TODOs in the class ("TODO(b/145386688)", from d3c77f2) that may be related to this race condition 🤷?

Example:

// TODO(b/145386688): This access should be guarded by
// 'HedgingRunnable.this.scheduledHedgingRef.lock'; instead found:
// 'RetriableStream.this.lock'

// TODO(b/145386688): This access should be guarded by 'this.scheduledRetry.lock'; instead
// found: 'this.lock'

// TODO(b/145386688): This access should be guarded by 'this.scheduledHedging.lock'; instead
// found: 'this.lock'

// TODO(b/145386688): This access should be guarded by 'this.scheduledHedging.lock'; instead
// found: 'this.lock'

ejona86 commented May 19, 2022

I don't think those guarded-by comments are that important. All those locks are the same instance, just aliases. So it's more that the code is written in a way that can't be verified by tooling.
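
A generic illustration of that point (made-up names, not grpc-java code): the two fields below refer to the same lock object, so guarding the field with either one is equivalent at runtime, but a static checker only compares the guard expressions.

class Example {
    private val lock = Any()
    private val scheduledHedgingLock = lock // alias: the very same instance

    private var scheduledHedging: Runnable? = null // conceptually guarded by scheduledHedgingLock

    fun cancelHedging() {
        synchronized(lock) {        // same monitor as scheduledHedgingLock at runtime
            scheduledHedging = null // safe, but a @GuardedBy-style checker compares
        }                           // lock expressions, not object identity, so it
    }                               // flags this access
}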

ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 3, 2023
The problem was one hedge was committed before another had drained
start(). This was not testable because HedgingRunnable checks whether
scheduledHedgingRef is cancelled, which is racy, but there's no way to
deterministically trigger either race.

The same problem couldn't be triggered with retries because only one
attempt will be draining at a time. Retries with cancellation also
couldn't trigger it, for the surprising reason that the noop stream used
in cancel() wasn't considered drained.

This commit marks the noop stream as drained with cancel(), which allows
memory to be garbage collected sooner and exposes the race for tests.
That then showed the stream as hanging, because inFlightSubStreams
wasn't being decremented.

Fixes grpc#9185
@ejona86 ejona86 self-assigned this Apr 3, 2023

ejona86 commented Apr 3, 2023

I believe this issue became more serious in 1.53. There was a change in 1.53 that fixed some lifecycle issues, but when this is triggered I think it would cause the RPC to hang and never receive its final status. So if you are seeing this issue, you may need to avoid 1.53 and 1.54 until a patch is out.
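
For anyone who needs to stay off the affected releases in the meantime, a sketch of pinning the gRPC dependency in Gradle (Kotlin DSL); 1.52.1 is only an example of a pre-1.53 version, and the actual patched releases should be confirmed against the release notes:

dependencies {
    // Example only: pin below 1.53 until a release containing the fix is available.
    implementation("io.grpc:grpc-netty-shaded:1.52.1")
    implementation("io.grpc:grpc-protobuf:1.52.1")
    implementation("io.grpc:grpc-stub:1.52.1")
}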

ejona86 added a commit that referenced this issue Apr 11, 2023
@ejona86 ejona86 modified the milestones: Next, 1.55 Apr 13, 2023
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 13, 2023
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 13, 2023
ejona86 added a commit that referenced this issue Apr 13, 2023
ejona86 added a commit that referenced this issue Apr 13, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2023