Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement transparent retries for gRFC A6 #1597

Merged
merged 4 commits into from
Nov 6, 2017

Conversation

dfawley
Copy link
Member

@dfawley dfawley commented Oct 19, 2017

Also fixes #1532

@dfawley dfawley requested a review from menghanl October 19, 2017 23:44
@dfawley dfawley assigned menghanl and dfawley and unassigned dfawley Oct 19, 2017
// Check to make sure the context has expired. This will prevent us from
// looping forever if an error occurs for wait-for-ready RPCs where no data
// is sent on the wire.
select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no reason to use a select here, and doing so can result in worse performance than the simpler alternative:

if err := ctx.Err(); err != nil {
  return toRPCErr(err)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out this is not true. I made some benchmarks, and found the following results:

BenchmarkCancelContextErrNoErr-12            	100000000	        52.2 ns/op
BenchmarkCancelContextErrGotErr-12           	100000000	        51.9 ns/op
BenchmarkCancelContextChannelNoErr-12        	200000000	        19.9 ns/op
BenchmarkCancelContextChannelGotErr-12       	100000000	        36.1 ns/op

Using the channel is faster as of today. I believe the problem is the defer in context.Err():

https://github.com/golang/net/blob/cd69bc3fc700721b709c3a59e16e24c67b58f6ff/context/pre_go17.go#L162

These numbers line up well with our benchmarks for mutexes vs. channels and defer vs no-defer:

BenchmarkMutexWithDefer-12                	20000000	        51.2 ns/op
BenchmarkMutex-12                         	100000000	        14.5 ns/op
BenchmarkSelectClosed-12                  	50000000	        20.3 ns/op
BenchmarkSelectOpen-12                    	300000000	         5.74 ns/op

Note how the select is faster than the mutex with a defer, but slower than the mutex with an inline Unlock().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite surprising, thanks for the details. Your comment points to pre_go17.go but I assume these benchmarks were run on 1.9?

Also interesting that selecting on the open channel is faster than the mutex with inline Unlock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, there's something fishy here - how can BenchmarkCancelContextChannelGotErr ever be faster than BenchmarkCancelContextErrGotErr? The former does strictly more work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite surprising, thanks for the details. Your comment points to pre_go17.go but I
assume these benchmarks were run on 1.9?

Yes -- presumably the std context package is implemented identically (based on the results that makes sense).

The former does strictly more work.

Good catch. Actually, it does not do strictly more work, because I forgot to call Err() in this case. Adding it results in:

BenchmarkCancelContextErrGotErr-12        	100000000	        52.4 ns/op
BenchmarkCancelContextChannelGotErr-12    	50000000	        94.8 ns/op
BenchmarkTimerContextErrGotErr-12         	100000000	        52.5 ns/op
BenchmarkTimerContextChannelGotErr-12     	50000000	        86.7 ns/op

I would still lean toward the existing approach, as it optimizes the non-error case. But maybe the Go team should think about removing that defer...

Copy link
Contributor

@tamird tamird Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Sajmani -- looks like that defer has been there since inception; perhaps it's worth replacing with the an inline Unlock as @dfawley has shown.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I filed golang/go#22401 for this.

// Check to make sure the context has expired. This will prevent us from
// looping forever if an error occurs for wait-for-ready RPCs where no data
// is sent on the wire.
select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto:

if err := ctx.Err(); err != nil {
  return toRPCErr(err)
}

Copy link
Contributor

@menghanl menghanl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes LGTM.
Some possible goaway cleanup in transport.

@@ -689,6 +672,12 @@ var (
// ErrStreamDrain indicates that the stream is rejected by the server because
// the server stops accepting new RPCs.
ErrStreamDrain = streamErrorf(codes.Unavailable, "the server stops accepting new RPCs")
// StatusStreamRefused indicates that the server sent a RST_STREAM with the
// REFUSED_STREAM code, meaning the stream was entirely unprocessed.
StatusStreamRefused = status.New(codes.Unavailable, "the server refused the stream")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be exported?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted entirely (it was unused).

// The stream was unprocessed by the server.
stream.mu.Lock()
stream.unprocessed = true
stream.finish(StatusGoAway)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this finish()?
finish closes s.done, which may result in race in wait().

See following comment on ErrStreamDrain.
If we do a cleanup on stream goaway, this finish() may become necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave both the finish and the closing of s.done for now; we will clean things up in the future.

Unexported statusGoAway.

@@ -689,6 +672,12 @@ var (
// ErrStreamDrain indicates that the stream is rejected by the server because
// the server stops accepting new RPCs.
ErrStreamDrain = streamErrorf(codes.Unavailable, "the server stops accepting new RPCs")
Copy link
Contributor

@menghanl menghanl Oct 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this anymore? Seems it's only returned by transport but never check.

IIRC, the purpose of this error is exactly what Unprocessed() is for.

And if we no longer need this error, we can also (maybe) remove s.goaway channel.
But should probably make this a TODO for another cleanup PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unexported, added TODO. We have a lot more cleaning up we can do.

@dfawley
Copy link
Member Author

dfawley commented Nov 3, 2017

Ping: rebased & resolved conflicts.

for ; state == connectivity.Ready && cc.WaitForStateChange(ctx, state); state = cc.GetState() {
}
if state == connectivity.Ready {
t.Fatalf("Want connection state to be Ready, got %v", state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want state to be not Ready?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if state == connectivity.Ready {
t.Fatalf("Want connection state to be Ready, got %v", state)
}
cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context will be leaked if t.Fafalf before this.
Move this before the if check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

t.Fatalf("TestService/EmptyCall(_, _) = _, %v, want _, <nil>", err)
}
cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not cancel on t.Fatalf?
defer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred.

Copy link
Member Author

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, PTAL.

for ; state == connectivity.Ready && cc.WaitForStateChange(ctx, state); state = cc.GetState() {
}
if state == connectivity.Ready {
t.Fatalf("Want connection state to be Ready, got %v", state)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if state == connectivity.Ready {
t.Fatalf("Want connection state to be Ready, got %v", state)
}
cancel()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

t.Fatalf("TestService/EmptyCall(_, _) = _, %v, want _, <nil>", err)
}
cancel()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred.

Copy link
Contributor

@menghanl menghanl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@dfawley dfawley merged commit 8ff8683 into grpc:master Nov 6, 2017
@menghanl menghanl added this to the 1.8 Release milestone Nov 7, 2017
@menghanl menghanl added the Type: Feature New features or improvements in behavior label Nov 7, 2017
@dfawley dfawley deleted the transparent_retries2 branch April 30, 2018 16:12
@lock lock bot locked as resolved and limited conversation to collaborators Oct 27, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Wait for Ready" RPCs retry even if data was written to the wire
3 participants