-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats: Add RPC event for blocking for a picker update #6422
Conversation
stats/stats.go
Outdated
@@ -59,6 +59,26 @@ func (s *Begin) IsClient() bool { return s.Client } | |||
|
|||
func (s *Begin) isRPCStats() {} | |||
|
|||
// ResolverResolved represents an event that the resolved finished resolving if | |||
// the ClientConn blocks on resolver resolution while performing a RPC. | |||
type ResolverResolved struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InitialResolverResult
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
stats/stats.go
Outdated
|
||
// PickerPicked represents an event that the picker finished picking if the ClientConn | ||
// blocks on picker picking while performing a RPC. | ||
type PickerPicked struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PickerUpdated
? (This one seems tricky to find something succinct for.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah needs to encapsulate all: https://github.com/grpc/grpc-go/blob/master/picker_wrapper.go#L87. PickedUpdated sounds fine to me.
stats/stats.go
Outdated
@@ -59,6 +59,27 @@ func (s *Begin) IsClient() bool { return s.Client } | |||
|
|||
func (s *Begin) isRPCStats() {} | |||
|
|||
// InitialResolverResult represents an event that the resolved finished |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved->resolver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched.
stats/stats.go
Outdated
@@ -59,6 +59,27 @@ func (s *Begin) IsClient() bool { return s.Client } | |||
|
|||
func (s *Begin) isRPCStats() {} | |||
|
|||
// InitialResolverResult represents an event that the resolved finished | |||
// resolving if the ClientConn blocks on resolver resolution while performing a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"...if the RPC occurred before the initial resolver result."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My string is still technically correct, but yours falls under the subset of my language and is more specific :). Switched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an external docstring, so we should find something that is as clear and concise as possible.
"blocks on resolver resolution" is confusing to read and technically the ClientConn
doesn't block -- the RPC does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I agree. Will push once I finish tests.
stats/stats.go
Outdated
// PickerUpdated represents an event that the picker finished picking if the | ||
// ClientConn blocks on picker picking while performing a RPC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"PickerUpdated indicates that the LB Policy's picker was updated after the previous pick attempt returned ErrNoSubConnAvailable." Or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the only case that this gets triggered (see list of 4 in https://github.com/grpc/grpc-go/blob/master/picker_wrapper.go#L87). Any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's true. I was trying to keep it simple. The alternative would be something more vague like "PickerUpdated indicates that the LB policy provided a new picker while the RPC was waiting for one."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do the more vague one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.. "picker finished picking if the CC blocks on picker picking" makes no sense to me 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update to something like PickerUpdated indicates that the LB policy provided a new picker while the RPC was waiting for one.
picker_wrapper.go
Outdated
for _, sh := range pw.statsHandlers { | ||
sh.HandleRPC(ctx, &stats.PickerUpdated{}) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: nix newline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah whoops. Deleted newline.
picker_wrapper.go
Outdated
@@ -125,6 +130,10 @@ func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer. | |||
return nil, balancer.PickResult{}, status.Error(codes.Canceled, errStr) | |||
} | |||
case <-ch: | |||
for _, sh := range pw.statsHandlers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a comment here or where statsHandler
field is defined about what this is for and how this is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment on field: statsHandlers []stats.Handler // to record blocking picker calls
stats/stats.go
Outdated
|
||
// IsClient indicates if the stats information is from client side. Only Client | ||
// Side interfaces with a resolver, thus always returns true. | ||
func (irr *InitialResolverResult) IsClient() bool { return true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit for all these methods. Since the receiver is not used, we don't even have to name it. So, this can simply be:
func (*InitialResolverResult) IsClient() bool { return true }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmmm ok I was actually wondering that too. Deleted, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo testing
test/retry_test.go
Outdated
@@ -579,7 +581,7 @@ func (s) TestRetryStats(t *testing.T) { | |||
// There is a race between receiving the payload (triggered by the | |||
// application / gRPC library) and receiving the trailer (triggered at the | |||
// transport layer). Adjust the received stats accordingly if necessary. | |||
const tIdx, pIdx = 13, 14 | |||
const tIdx, pIdx = 15, 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Existing tests should probably filter out PickerUpdated
events entirely, unless we know they are deterministic in those tests. And we should have a new test with a custom LB policy that confirms those events work as intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
stats/stats.go
Outdated
// PickerUpdated represents an event that the picker finished picking if the | ||
// ClientConn blocks on picker picking while performing a RPC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.. "picker finished picking if the CC blocks on picker picking" makes no sense to me 😛
test/end2end_test.go
Outdated
func (h *statsHandlerRecordEvents) HandleRPC(_ context.Context, s stats.RPCStats) { | ||
h.mu.Lock() | ||
h.s = append(h.s, s) | ||
h.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: defer
as a matter of course is probably a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
test/end2end_test.go
Outdated
} | ||
|
||
func (bp *blockingPicker) Pick(pi balancer.PickInfo) (balancer.PickResult, error) { | ||
defer bp.pickDone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW defer
doesn't actually do anything different than just calling it directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that guarantee the first pick hits err sub conn available before the client conn processes the new picker? And thus deterministically induce this event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched anyway. But this worries me. I guess the client conn has already called into the picker so the new one won't overwrite that call into component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not saying you must change it, but noting that the functionally it is identical. defer x(); return <no function call>
is the same as x(); return <no function call>
. If you did:
defer x()
return y()
...it would be the same as:
blah := y()
x()
return blah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I had a feeling. So does the fact that another picker comes before the ErrNoSubConnAvailable logically cause any problems? Does that make it deterministic or is it guaranteed to finish the processing from the picker callout?
test/end2end_test.go
Outdated
} | ||
] | ||
}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks solid on local. Idk why it's showing up like this on Github diff. Let me mess with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I messed with it and I think it looks better now. Let me know what you think.
test/end2end_test.go
Outdated
if pickerUpdatedCount == 0 { | ||
t.Fatalf("sh.pickerUpdated count: %v, want: >0", pickerUpdatedCount) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be able to be deterministic and more specific... == 1
or == 2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it was determinisitc at 2 but switched to this to make sure. Switched back to 2.
test/end2end_test.go
Outdated
// Shouldn't happen, defensive programming. Registered from import of | ||
// roundrobin package. | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
panic here instead of returning nil
and maybe invoking some other panic later. And below.
Actually this one isn't needed as builder.Build
will just panic if builder
is nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Why does this matter philosophically though? Test crashes closer to root cause of the crash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because if it fails then you want it to fail as close to where the error is as possible. If you cause another failure somewhere else it's harder to debug.
test/end2end_test.go
Outdated
} | ||
func (*statsHandlerRecordEvents) HandleConn(context.Context, stats.ConnStats) {} | ||
|
||
type blockingPicker struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this called blocking
? Is that because of it queues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Induces a block. I'll call it triggerRPCBlockPicker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also switched the balancer type names accordingly.
test/end2end_test.go
Outdated
} | ||
|
||
// TestPickerBlockingStatsCall tests the emission of a stats handler call that | ||
// represents the picker had to block due to ErrNoSubConnAvailable being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terminology nit: the picker doesn't block. I guess the blockingPicker
blocks. But more accurately, the RPC blocks (or in C++/Java: is queued) waiting for a new picker from the LB policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched language accordingly :).
stats/stats.go
Outdated
// PickerUpdated represents an event that the picker finished picking if the | ||
// ClientConn blocks on picker picking while performing a RPC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update to something like PickerUpdated indicates that the LB policy provided a new picker while the RPC was waiting for one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the test is still flaky...
test/end2end_test.go
Outdated
} | ||
|
||
func (bpb *triggerRPCBlockBalancer) UpdateClientConnState(s balancer.ClientConnState) error { | ||
bpb.blockingPickerDone = grpcsync.NewEvent() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this per-updateCCS instead of created in Build?
test/end2end_test.go
Outdated
bpb.blockingPickerDone.Fire() | ||
bpb.stateMu.Lock() | ||
cs := bpb.childState | ||
bpb.stateMu.Unlock() | ||
bpb.ClientConn.UpdateState(cs) | ||
}, | ||
}, | ||
}) | ||
return err | ||
} | ||
|
||
func (bpb *triggerRPCBlockBalancer) UpdateState(state balancer.State) { | ||
bpb.stateMu.Lock() | ||
bpb.childState = state | ||
bpb.stateMu.Unlock() | ||
if bpb.blockingPickerDone.HasFired() { // guard first one to get a picker sending ErrNoSubConnAvailable first | ||
bpb.ClientConn.UpdateState(state) // after the first rr picker update, cc will trigger more, so actually forward these |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dance seems racy.
Maybe it should be:
func UpdateState() {
Lock(); defer Unlock()
if HasFired() { UpdateState() } else { childState = state }
}
pickDone: func() {
Lock(); defer Unlock()
Fire(); UpdateState(childState)
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, maybe we should not update the picker at all (except the initial one at Build or the first UpdateCCS call) until the state is READY. Otherwise we might get an unpredictable number of updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I messed around with it to what it is now.
picker_wrapper.go
Outdated
// a new one, or a picker errored with ErrNoSubConnAvailable or errored | ||
// with failfast set to false, which will trigger a continue to the next | ||
// iteration. In that second case the only way it gets to this codeblock | ||
// is to receive a new picker either in UpdateState or the select above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing because a new picker is always received by UpdateState
, not by the select
. Maybe this sentence is: "In the latter case, we only reach this when there is a new picker."
And isn't this true in either of the two cases? Basically, we only reach this point when: 1. there is a picker immediately on the first iteration, or 2. we have a new picker available. That's why we grab the picker on the following lines. The if ch != nil
filters out the first case where we never needed to block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess the language I had is technically incorrect, but I was going for it either UpdateState() before the next iteration or it UpdateState() while blocking on the <-ch if the new one didn't come before the for. Yeah, the first case (either it's there, or it's not and we block) gets there, with it being there already not triggering conditional, and not being there triggering conditional. However, I think that it is more easily implied by conditional, but let me reword the comment and see if you like it more technically.
test/end2end_test.go
Outdated
builder := balancer.Get(roundrobin.Name) | ||
rr := builder.Build(b, bOpts) | ||
if rr == nil { | ||
// Shouldn't happen, defensive programming. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: generally, commenting a panic with "this shouldn't happen" isn't helpful. Obviously it shouldn't happen or else we wouldn't have made it panic. You can say why it shouldn't happen if it's interesting, or just leave the comment out if it's obvious or if the panic explains it (which is better than having a comment explaining it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, whoops, this was left over from when this wasn't an explicit panic. Deleted comment.
test/end2end_test.go
Outdated
err := bpb.Balancer.UpdateClientConnState(balancer.ClientConnState{ | ||
ResolverState: s.ResolverState, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass s
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to eat the config and send down nil, but I guess this gets nil as the config. Switched.
test/end2end_test.go
Outdated
bpb.blockingPickerDone.Fire() | ||
bpb.stateMu.Lock() | ||
defer bpb.stateMu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the lock be taken before firing the event here? Otherwise you might race between this and a call from the balancer, and do the UpdateState
twice. It doesn't lead to a correctness issue, but it does seem wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Luckily it didn't lead to non determinism of the assertion of the test, because it triggers another for iteration from the continue from ErrNoSubConnAvailable and an extra UpdateState doesn't affect picker count. However, for the intention of this testing struct, it is incorrect in that I desired to send one UpdateState. Good catch, switched.
serviceconfig.LoadBalancingConfig | ||
} | ||
|
||
type triggerRPCBlockBalancer struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: // triggerRPCBlockBalancer uses a child RR balancer, but blocks all UpdateState calls from it until the first Pick call. That first pick returns ErrNoSubConnAvailable to make the RPC block and trigger the appropriate stats handler callout.
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment, alongside a little more. Thanks for suggestion.
test/end2end_test.go
Outdated
sc := internal.ParseServiceConfig.(func(string) *serviceconfig.ParseResult)(lbCfgJSON) | ||
mr := manual.NewBuilderWithScheme("pickerupdatedbalancer") | ||
defer mr.Close() | ||
print("ss.Address when setting: ", ss.Address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove or change to t.Log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, sorry, deleted.
This PR adds stats events representing blocking resolver resolve and blocking picker pick. The OpenCensus stats handler uses these callouts to add trace annotations, which provide a timestamp for debugging purposes.
Fixes #6418.
RELEASE NOTES: