Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: Add RPC event for blocking for a picker update #6422

Merged
merged 17 commits into from
Jul 18, 2023

Conversation

zasweq
Copy link
Contributor

@zasweq zasweq commented Jun 29, 2023

This PR adds stats events representing blocking resolver resolve and blocking picker pick. The OpenCensus stats handler uses these callouts to add trace annotations, which provide a timestamp for debugging purposes.

Fixes #6418.

RELEASE NOTES:

  • stats: Add an RPC event for blocking caused by the LB policy's picker

@zasweq zasweq requested a review from dfawley June 29, 2023 19:47
@zasweq zasweq added Type: Feature New features or improvements in behavior Type: Internal Cleanup Refactors, etc and removed Type: Internal Cleanup Refactors, etc labels Jun 29, 2023
@zasweq zasweq added this to the 1.57 Release milestone Jun 29, 2023
@zasweq zasweq changed the title stats: Add events for blocking resolver resolve and blocking picker pick stats: Add RPC events for blocking caused by the name resolver and the LB policy's picker Jun 29, 2023
stats/stats.go Outdated
@@ -59,6 +59,26 @@ func (s *Begin) IsClient() bool { return s.Client }

func (s *Begin) isRPCStats() {}

// ResolverResolved represents an event that the resolved finished resolving if
// the ClientConn blocks on resolver resolution while performing a RPC.
type ResolverResolved struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InitialResolverResult?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

stats/stats.go Outdated

// PickerPicked represents an event that the picker finished picking if the ClientConn
// blocks on picker picking while performing a RPC.
type PickerPicked struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PickerUpdated? (This one seems tricky to find something succinct for.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah needs to encapsulate all: https://github.com/grpc/grpc-go/blob/master/picker_wrapper.go#L87. PickedUpdated sounds fine to me.

stats/stats.go Outdated
@@ -59,6 +59,27 @@ func (s *Begin) IsClient() bool { return s.Client }

func (s *Begin) isRPCStats() {}

// InitialResolverResult represents an event that the resolved finished
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved->resolver

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched.

stats/stats.go Outdated
@@ -59,6 +59,27 @@ func (s *Begin) IsClient() bool { return s.Client }

func (s *Begin) isRPCStats() {}

// InitialResolverResult represents an event that the resolved finished
// resolving if the ClientConn blocks on resolver resolution while performing a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"...if the RPC occurred before the initial resolver result."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My string is still technically correct, but yours falls under the subset of my language and is more specific :). Switched.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an external docstring, so we should find something that is as clear and concise as possible.

"blocks on resolver resolution" is confusing to read and technically the ClientConn doesn't block -- the RPC does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I agree. Will push once I finish tests.

stats/stats.go Outdated
Comment on lines 73 to 74
// PickerUpdated represents an event that the picker finished picking if the
// ClientConn blocks on picker picking while performing a RPC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"PickerUpdated indicates that the LB Policy's picker was updated after the previous pick attempt returned ErrNoSubConnAvailable." Or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the only case that this gets triggered (see list of 4 in https://github.com/grpc/grpc-go/blob/master/picker_wrapper.go#L87). Any other ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's true. I was trying to keep it simple. The alternative would be something more vague like "PickerUpdated indicates that the LB policy provided a new picker while the RPC was waiting for one."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do the more vague one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.. "picker finished picking if the CC blocks on picker picking" makes no sense to me 😛

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update to something like PickerUpdated indicates that the LB policy provided a new picker while the RPC was waiting for one.

for _, sh := range pw.statsHandlers {
sh.HandleRPC(ctx, &stats.PickerUpdated{})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: nix newline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah whoops. Deleted newline.

@@ -125,6 +130,10 @@ func (pw *pickerWrapper) pick(ctx context.Context, failfast bool, info balancer.
return nil, balancer.PickResult{}, status.Error(codes.Canceled, errStr)
}
case <-ch:
for _, sh := range pw.statsHandlers {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a comment here or where statsHandler field is defined about what this is for and how this is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment on field: statsHandlers []stats.Handler // to record blocking picker calls

stats/stats.go Outdated

// IsClient indicates if the stats information is from client side. Only Client
// Side interfaces with a resolver, thus always returns true.
func (irr *InitialResolverResult) IsClient() bool { return true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit for all these methods. Since the receiver is not used, we don't even have to name it. So, this can simply be:

func (*InitialResolverResult) IsClient() bool { return true }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmmm ok I was actually wondering that too. Deleted, thanks.

@dfawley dfawley assigned zasweq and unassigned dfawley Jul 5, 2023
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo testing

@@ -579,7 +581,7 @@ func (s) TestRetryStats(t *testing.T) {
// There is a race between receiving the payload (triggered by the
// application / gRPC library) and receiving the trailer (triggered at the
// transport layer). Adjust the received stats accordingly if necessary.
const tIdx, pIdx = 13, 14
const tIdx, pIdx = 15, 16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing tests should probably filter out PickerUpdated events entirely, unless we know they are deterministic in those tests. And we should have a new test with a custom LB policy that confirms those events work as intended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

stats/stats.go Outdated
Comment on lines 73 to 74
// PickerUpdated represents an event that the picker finished picking if the
// ClientConn blocks on picker picking while performing a RPC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.. "picker finished picking if the CC blocks on picker picking" makes no sense to me 😛

@zasweq zasweq changed the title stats: Add RPC events for blocking caused by the name resolver and the LB policy's picker stats: Add RPC events for blocking caused by the LB policy's picker Jul 6, 2023
@dfawley dfawley changed the title stats: Add RPC events for blocking caused by the LB policy's picker stats: Add RPC event for blocking caused by the LB policy's picker Jul 6, 2023
func (h *statsHandlerRecordEvents) HandleRPC(_ context.Context, s stats.RPCStats) {
h.mu.Lock()
h.s = append(h.s, s)
h.mu.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: defer as a matter of course is probably a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

func (bp *blockingPicker) Pick(pi balancer.PickInfo) (balancer.PickResult, error) {
defer bp.pickDone()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW defer doesn't actually do anything different than just calling it directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that guarantee the first pick hits err sub conn available before the client conn processes the new picker? And thus deterministically induce this event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched anyway. But this worries me. I guess the client conn has already called into the picker so the new one won't overwrite that call into component.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying you must change it, but noting that the functionally it is identical. defer x(); return <no function call> is the same as x(); return <no function call>. If you did:

defer x()
return y()

...it would be the same as:

blah := y()
x()
return blah

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I had a feeling. So does the fact that another picker comes before the ErrNoSubConnAvailable logically cause any problems? Does that make it deterministic or is it guaranteed to finish the processing from the picker callout?

Comment on lines 6495 to 6497
}
]
}`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks solid on local. Idk why it's showing up like this on Github diff. Let me mess with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I messed with it and I think it looks better now. Let me know what you think.

Comment on lines 6528 to 6530
if pickerUpdatedCount == 0 {
t.Fatalf("sh.pickerUpdated count: %v, want: >0", pickerUpdatedCount)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be able to be deterministic and more specific... == 1 or == 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it was determinisitc at 2 but switched to this to make sure. Switched back to 2.

Comment on lines 6410 to 6412
// Shouldn't happen, defensive programming. Registered from import of
// roundrobin package.
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panic here instead of returning nil and maybe invoking some other panic later. And below.

Actually this one isn't needed as builder.Build will just panic if builder is nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Why does this matter philosophically though? Test crashes closer to root cause of the crash?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if it fails then you want it to fail as close to where the error is as possible. If you cause another failure somewhere else it's harder to debug.

}
func (*statsHandlerRecordEvents) HandleConn(context.Context, stats.ConnStats) {}

type blockingPicker struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called blocking? Is that because of it queues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Induces a block. I'll call it triggerRPCBlockPicker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also switched the balancer type names accordingly.

}

// TestPickerBlockingStatsCall tests the emission of a stats handler call that
// represents the picker had to block due to ErrNoSubConnAvailable being
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminology nit: the picker doesn't block. I guess the blockingPicker blocks. But more accurately, the RPC blocks (or in C++/Java: is queued) waiting for a new picker from the LB policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched language accordingly :).

@zasweq zasweq assigned dfawley and unassigned zasweq Jul 6, 2023
stats/stats.go Outdated
Comment on lines 73 to 74
// PickerUpdated represents an event that the picker finished picking if the
// ClientConn blocks on picker picking while performing a RPC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update to something like PickerUpdated indicates that the LB policy provided a new picker while the RPC was waiting for one.

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the test is still flaky...

}

func (bpb *triggerRPCBlockBalancer) UpdateClientConnState(s balancer.ClientConnState) error {
bpb.blockingPickerDone = grpcsync.NewEvent()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this per-updateCCS instead of created in Build?

Comment on lines 6450 to 6466
bpb.blockingPickerDone.Fire()
bpb.stateMu.Lock()
cs := bpb.childState
bpb.stateMu.Unlock()
bpb.ClientConn.UpdateState(cs)
},
},
})
return err
}

func (bpb *triggerRPCBlockBalancer) UpdateState(state balancer.State) {
bpb.stateMu.Lock()
bpb.childState = state
bpb.stateMu.Unlock()
if bpb.blockingPickerDone.HasFired() { // guard first one to get a picker sending ErrNoSubConnAvailable first
bpb.ClientConn.UpdateState(state) // after the first rr picker update, cc will trigger more, so actually forward these
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dance seems racy.

Maybe it should be:

func UpdateState() {
  Lock(); defer Unlock()
  if HasFired() { UpdateState() } else { childState = state }
}

pickDone: func() {
  Lock(); defer Unlock()
  Fire(); UpdateState(childState)
}

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe we should not update the picker at all (except the initial one at Build or the first UpdateCCS call) until the state is READY. Otherwise we might get an unpredictable number of updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I messed around with it to what it is now.

@zasweq zasweq assigned dfawley and unassigned dfawley Jul 7, 2023
@easwars easwars assigned zasweq and unassigned dfawley Jul 11, 2023
@easwars easwars modified the milestones: 1.57 Release, 1.58 Release Jul 14, 2023
@zasweq zasweq assigned dfawley and unassigned zasweq Jul 18, 2023
// a new one, or a picker errored with ErrNoSubConnAvailable or errored
// with failfast set to false, which will trigger a continue to the next
// iteration. In that second case the only way it gets to this codeblock
// is to receive a new picker either in UpdateState or the select above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing because a new picker is always received by UpdateState, not by the select. Maybe this sentence is: "In the latter case, we only reach this when there is a new picker."

And isn't this true in either of the two cases? Basically, we only reach this point when: 1. there is a picker immediately on the first iteration, or 2. we have a new picker available. That's why we grab the picker on the following lines. The if ch != nil filters out the first case where we never needed to block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess the language I had is technically incorrect, but I was going for it either UpdateState() before the next iteration or it UpdateState() while blocking on the <-ch if the new one didn't come before the for. Yeah, the first case (either it's there, or it's not and we block) gets there, with it being there already not triggering conditional, and not being there triggering conditional. However, I think that it is more easily implied by conditional, but let me reword the comment and see if you like it more technically.

builder := balancer.Get(roundrobin.Name)
rr := builder.Build(b, bOpts)
if rr == nil {
// Shouldn't happen, defensive programming.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: generally, commenting a panic with "this shouldn't happen" isn't helpful. Obviously it shouldn't happen or else we wouldn't have made it panic. You can say why it shouldn't happen if it's interesting, or just leave the comment out if it's obvious or if the panic explains it (which is better than having a comment explaining it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, whoops, this was left over from when this wasn't an explicit panic. Deleted comment.

Comment on lines 6443 to 6445
err := bpb.Balancer.UpdateClientConnState(balancer.ClientConnState{
ResolverState: s.ResolverState,
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pass s directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to eat the config and send down nil, but I guess this gets nil as the config. Switched.

Comment on lines 6450 to 6452
bpb.blockingPickerDone.Fire()
bpb.stateMu.Lock()
defer bpb.stateMu.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the lock be taken before firing the event here? Otherwise you might race between this and a call from the balancer, and do the UpdateState twice. It doesn't lead to a correctness issue, but it does seem wrong.

Copy link
Contributor Author

@zasweq zasweq Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Luckily it didn't lead to non determinism of the assertion of the test, because it triggers another for iteration from the continue from ErrNoSubConnAvailable and an extra UpdateState doesn't affect picker count. However, for the intention of this testing struct, it is incorrect in that I desired to send one UpdateState. Good catch, switched.

serviceconfig.LoadBalancingConfig
}

type triggerRPCBlockBalancer struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: // triggerRPCBlockBalancer uses a child RR balancer, but blocks all UpdateState calls from it until the first Pick call. That first pick returns ErrNoSubConnAvailable to make the RPC block and trigger the appropriate stats handler callout. or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment, alongside a little more. Thanks for suggestion.

sc := internal.ParseServiceConfig.(func(string) *serviceconfig.ParseResult)(lbCfgJSON)
mr := manual.NewBuilderWithScheme("pickerupdatedbalancer")
defer mr.Close()
print("ss.Address when setting: ", ss.Address)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove or change to t.Log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, sorry, deleted.

@dfawley dfawley assigned zasweq and unassigned dfawley Jul 18, 2023
@arvindbr8 arvindbr8 assigned dfawley and unassigned zasweq Jul 18, 2023
@dfawley dfawley assigned zasweq and unassigned dfawley Jul 18, 2023
@zasweq zasweq changed the title stats: Add RPC event for blocking caused by the LB policy's picker stats: Add RPC event for blocking for a picker update Jul 18, 2023
@zasweq zasweq merged commit 7aab9c0 into grpc:master Jul 18, 2023
1 check passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trace events for DirectPath debugging
3 participants