
client: fix potential panic during RPC retries #5323

Merged
merged 6 commits into grpc:master May 4, 2022

Conversation

dfawley
Member

@dfawley dfawley commented Apr 28, 2022

Fixes #5315

Before this change, there was a very narrow window where a stream could fail to be created on a retry attempt, and subsequent stream operations could panic when this occurred. This change is a fairly substantial restructuring of the retry code in order to guarantee that for all clientStreams returned to the user, cs.attempt.s is always non-nil. New attempts that don't yet have a stream are created in cs.newAttempt until they pick a transport and create a stream, at which time they are promoted to cs.attempt.

This also allows us to run balancer pick results through the retry code, meaning UNAVAILABLE errors can be retried by the retry policy, as intended. It also handles drop results that should explicitly avoid the retry code.
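As a rough illustration of the new invariant, here is a minimal sketch using hypothetical, heavily simplified stand-ins for the internal types (csAttempt, clientStream, and withRetry below are not the real grpc-go implementations):

package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical, heavily simplified stand-ins for grpc-go's internals.
type transportStream struct{ id int }

type csAttempt struct {
	s *transportStream // nil until the attempt has created its stream
}

func (a *csAttempt) newStream() error {
	// The real code picks a transport and calls NewStream here; either step
	// can fail, in which case a.s stays nil and the attempt is never exposed.
	a.s = &transportStream{id: 1}
	return nil
}

type clientStream struct {
	mu      sync.Mutex
	attempt *csAttempt // invariant: attempt != nil implies attempt.s != nil
}

var errRetryable = errors.New("retryable failure")

// withRetry runs op on fresh attempts until it succeeds or fails
// permanently. The attempt is promoted to cs.attempt only after op (which
// creates the stream) succeeds, so readers never observe a nil stream.
func (cs *clientStream) withRetry(op func(*csAttempt) error) error {
	cs.mu.Lock()
	defer cs.mu.Unlock()
	for {
		a := &csAttempt{}
		if err := op(a); err != nil {
			if errors.Is(err, errRetryable) {
				continue // retry with a brand-new attempt
			}
			return err
		}
		cs.attempt = a // promote: transport picked and stream created
		return nil
	}
}

func main() {
	cs := &clientStream{}
	err := cs.withRetry(func(a *csAttempt) error { return a.newStream() })
	fmt.Println(err == nil, cs.attempt.s != nil) // true true
}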

RELEASE NOTES:

  • client: fix potential panic during RPC retries
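For context, a sketch of how a caller configures the retry policy referenced above (the target address, service name, and backoff values here are placeholders, not part of this PR's diff):

package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Retry UNAVAILABLE up to 4 attempts; with this fix, UNAVAILABLE errors
// produced by balancer picks also flow through this retry path.
const serviceConfig = `{
  "methodConfig": [{
    "name": [{"service": "example.Echo"}],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`

func main() {
	conn, err := grpc.Dial(
		"localhost:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// RPCs on conn that fail with UNAVAILABLE are retried per the policy.
}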

@dfawley dfawley requested a review from menghanl April 28, 2022 21:07
@dfawley dfawley added this to the 1.47 Release milestone Apr 28, 2022
stream.go Outdated
cs.finish(err)
return nil, err
}

op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
Contributor

Comment what this op does? That it creates the transport and stream, and only replaces the attempt if both are successful.

Member Author

Done

stream.go Outdated
op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
// clientStream) or by the retry code while locked when replaying the
// operation, it is safe to access cs.attempt directly.
Contributor

Move this comment right above cs.attempt = a?

Member Author

Done

return err
}
- if lastErr = cs.replayBufferLocked(); lastErr == nil {
+ if lastErr = cs.replayBufferLocked(attempt); lastErr == nil {
Contributor

Comment that the first op will set the attempt if it gets a transport and a stream.

Member Author

Done

andreimatei added a commit to andreimatei/cockroach that referenced this pull request May 2, 2022
Before this patch, our client-side tracing interceptor for streaming RPC
calls was exposed to gRPC bugs currently being fixed in
github.com/grpc/grpc-go/pull/5323. This had to do with calls to
clientStream.Context() panicking with an NPE under certain races with
RPCs failing. We've recently gotten two crashes seemingly because of
this. It's unclear why this hasn't shown up before, as nothing seems new
(either on our side or on the grpc side). In 22.2 we do use more
streaming RPCs than before (for example for span configs), so maybe that
does it.

This patch eliminates the problem by removing the
problematic call into ClientStream.Context(). The background is that
our interceptor needs to watch for ctx cancelation and consider the RPC
done at that point. But, there was no reason for that call; we can more
simply use the RPC caller's ctx for the purposes of figuring out if the
caller cancels the RPC. In fact, calling ClientStream.Context() is bad
for other reasons, beyond exposing us to the bug:
1) ClientStream.Context() pins the RPC attempt to a lower-level
connection, and inhibits gRPC's ability to sometimes transparently
retry failed calls. In fact, there's a comment on ClientStream.Context()
that tells you not to call it before using the stream through other
methods like Recv(), which imply that the RPC is already "pinned" and
transparent retries are no longer possible anyway. We were breaking
this.
2) One of the grpc-go maintainers suggested that, due to the bugs
referenced above, this call could actually sometimes give us "the
wrong context", although how wrong exactly is not clear to me (i.e.
could we have gotten a ctx that doesn't inherit from the caller's ctx?
Or a ctx that's canceled independently from the caller?)

This patch also removes a paranoid catch-all finalizer in the
interceptor that assured that the RPC client span is always eventually
closed (at a random GC time), regardless of what the caller does - i.e.
even if the caller forgets about interacting with the response stream or
canceling the ctx.  This kind of paranoia is not needed. The code in
question was copied from grpc-opentracing[1], which quoted a
StackOverflow answer[2] about whether or not a client is allowed to
simply walk away from a streaming call. As a result of conversations
triggered by this patch [3], that SO answer was updated to reflect the fact
that it is not, in fact, OK for a client to do so, as it will leak gRPC
resources. The client's contract is specified in [4] (although arguably
that place is not the easiest to find by a casual gRPC user). In any
case, this patch gets rid of the finalizer. This could in theory result
in leaked spans if our own code is buggy in the way that the paranoia
prevented, but all our TestServers check that spans don't leak like that
so we are pretty protected. FWIW, a newer re-implementation of the
OpenTracing interceptor[5] doesn't have the finalizer (although it also
doesn't listen for ctx cancellation, so I think it's buggy), and neither
does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.
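
A minimal sketch of the approach this commit message describes, assuming a hypothetical startSpan hook (this is not CockroachDB's actual interceptor): completion is derived from the caller's ctx and from RecvMsg, and ClientStream.Context() is never called.

package tracing

import (
	"context"
	"sync"

	"google.golang.org/grpc"
)

// tracedStream finishes the span when the stream ends (RecvMsg returns
// io.EOF or an RPC error) without ever calling ClientStream.Context().
type tracedStream struct {
	grpc.ClientStream
	once   sync.Once
	finish func()
}

func (ts *tracedStream) RecvMsg(m interface{}) error {
	err := ts.ClientStream.RecvMsg(m)
	if err != nil { // io.EOF or an RPC error: the stream is done
		ts.once.Do(ts.finish)
	}
	return err
}

// NewTracingStreamInterceptor closes the span either when the caller's ctx
// is canceled or when the stream ends, relying only on the caller's ctx for
// cancelation. startSpan is a hypothetical tracing hook.
func NewTracingStreamInterceptor(startSpan func(ctx context.Context, method string) (finish func())) grpc.StreamClientInterceptor {
	return func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
		finish := startSpan(ctx, method)
		cs, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			finish()
			return nil, err
		}
		ts := &tracedStream{ClientStream: cs, finish: finish}
		go func() {
			// Callers must read to completion or cancel their ctx; treat
			// ctx cancelation as the RPC being done.
			<-ctx.Done()
			ts.once.Do(ts.finish)
		}()
		return ts, nil
	}
}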
andreimatei added a commit to andreimatei/cockroach that referenced this pull request May 3, 2022
(same commit message as above)
andreimatei added a commit to andreimatei/cockroach that referenced this pull request May 3, 2022
(same commit message as above)
Member Author

@dfawley dfawley left a comment

Thanks for the review. All comments addressed.

return err
}
- if lastErr = cs.replayBufferLocked(); lastErr == nil {
+ if lastErr = cs.replayBufferLocked(attempt); lastErr == nil {
Member Author

Done

stream.go Outdated
cs.finish(err)
return nil, err
}

op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
Member Author

Done

stream.go Outdated
op := func(a *csAttempt) error { return a.newStream() }
// Because this operation is always called either here (while creating the
// clientStream) or by the retry code while locked when replaying the
// operation, it is safe to access cs.attempt directly.
Member Author

Done

@hexfusion
Contributor

Just curious if this could get backported to 1.46.x, or is the change too large?

craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request May 3, 2022
80476: security: update the TLS cipher suite list r=bdarnell a=knz

Fixes #80483 

This does not really change the list, it merely explains more clearly
how it was built.

Release note: None

80635: release: use preflight tool for RedHat images r=celiala a=rail

RedHat Connect has introduced a new certification workflow, which
requires running some tests locally and submitting the results back to
RedHat.

This patch runs the `preflight` tool to address the new workflow
changes.

Fixes #80633

Release note: None

80878: util/tracing: fix crash in StreamClientInterceptor r=andreimatei a=andreimatei

(same commit message as above)

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@dfawley
Member Author

dfawley commented May 3, 2022

Just curious if this could get backported to 1.46.x, or is the change too large?

Yes, I believe this should be backported.

@menghanl menghanl assigned dfawley and unassigned menghanl May 3, 2022
andreimatei added a commit to andreimatei/cockroach that referenced this pull request May 3, 2022
(same commit message as above)
@hexfusion
Contributor

Quick question: in the case where the server returns RST_STREAM, could that trigger this panic? I'm having a hard time isolating the retry logic in this scenario (> maxStreams).

@dfawley
Member Author

dfawley commented May 3, 2022

RST_STREAM(REFUSED_STREAM) at the start of the RPC will trigger a transparent retry. If the RPC is canceled before another transport stream can be created, or if the ClientConn is closed in that window, then you could hit this panic.

@dfawley dfawley merged commit 799605c into grpc:master May 4, 2022
@dfawley dfawley deleted the retrycommit branch May 4, 2022 17:06
dfawley added a commit to dfawley/grpc-go that referenced this pull request May 13, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2022

Successfully merging this pull request may close these issues.

panic: csAttempt.s nil pointer in CloseSend