Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

orca: server side custom metrics implementation #5531

Merged
merged 27 commits into from
Sep 1, 2022

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Jul 21, 2022

Implementation of the server API as defined in https://github.com/grpc/proposal/blob/master/A51-custom-backend-metrics.md.

Fixes #5474.

RELEASE NOTES: n/a

@easwars easwars requested a review from dfawley July 21, 2022 22:05
@easwars easwars added the Type: Feature New features or improvements in behavior label Jul 21, 2022
@easwars easwars added this to the 1.49 Release milestone Jul 21, 2022
orca/orca.go Outdated Show resolved Hide resolved
orca/orca.go Outdated Show resolved Hide resolved
orca/orca.go Outdated Show resolved Hide resolved
orca/recorder.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned easwars and unassigned dfawley Jul 26, 2022
@easwars easwars assigned dfawley and unassigned easwars Jul 28, 2022
orca/service_impl.go Outdated Show resolved Hide resolved
return &serviceImpl{minReportingInterval: opts.MinReportingInterval}
// Register creates a new ORCA service implementation configured using the
// provided options and registers the same on the provided service registrar.
func Register(s grpc.ServiceRegistrar, opts ServerOptions) (*Server, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete and instead document how to use NewServer / have an example in a _test.go file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Got rid of the NewService() function. In the process, also figured out that I was not actually checking if the interval received in the StreamCoreMetrics RPC was less than the configured minimum. Added that as well.

Comment on lines 91 to 92
interval := req.GetReportInterval()
if err := interval.CheckValid(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this extra check?

From what I can tell, there's some definition of "normal" in a time.Duration, but the library deals with it in a pretty sane manner. Why not just use what it figures out?

The AsDuration method performs the conversion on a best-effort basis. Durations with denormal values (e.g., nanoseconds beyond -99999999 and +99999999, inclusive; or seconds and nanoseconds with opposite signs) are normalized during the conversion to a time.Duration. To manually check for invalid Duration per the documented limitations in duration.proto, additionally call the CheckValid method:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We receive this value on the wire, and I believe this value could have invalid duration value because durationpb.New() does not do any kind of validation. https://github.com/protocolbuffers/protobuf-go/blob/v1.28.1/types/known/durationpb/duration.pb.go#L167

The duration.AsDuration() implementation https://github.com/protocolbuffers/protobuf-go/blob/v1.28.1/types/known/durationpb/duration.pb.go#L176 returns minInt or maxInt in the case of an overflow.

If we use, duration.CheckValid(), we will be able to return a meaningful error to the user and still continue to use the configured minimum.

Why do you see this extra check as problematic? Do you just see it as a wasted effort?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's problematic in my mind because it enforces an unnecessary constraint on the system. Namely, if you have >99999999 in the ns field, it's converted to a pretty reasonable number by the library, but then this check considers it to be invalid anyway.

We already reject it if it's too small. How about:

dur := interval.AsDuration()
if err := interval.CheckValid(); err != nil && dur > 30*time.Minute {
	// Probably not what the user intended.
	logger.Warning(...)
	return s.minReportingInterval
}

Also, do we really want the minimum interval in this case? Why is the default the same as the minimum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub is so messed up. None of the review comments show up along with the source code and I'm not able to reply to them all at once. I have taken care of all comments except this one.

I don't quite understand what you are suggesting here though. Why should we check for dur > 30*time.Minute?

There are multiple minimums:

  • There is an absolute minimum, which is 30s, and this is what we call the default.
  • There is a minimum enforced by the user when they create the service. This is ServiceOptions.MinReportingInterval, and this needs to be greater than or eq to the absolute minimum of 30s. If it is less than the absolute minimum, we end up using the absolute minimum for this.
  • Then, the incoming RPC can specify an interval, and this needs to be greater than the configured service minimum. If not, we use the service minimum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified it to match Java's implementation, where the only things that are checked are:

  • whether it is specified or not
  • whether it is negative
  • whether it is less than the configured minimum.

orca/service_impl.go Outdated Show resolved Hide resolved
orca/service_impl.go Outdated Show resolved Hide resolved
orca/service_impl.go Outdated Show resolved Hide resolved
orca/service_impl.go Outdated Show resolved Hide resolved
orca/service_impl.go Outdated Show resolved Hide resolved
orca/recorder.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned easwars and unassigned dfawley Jul 28, 2022
@easwars easwars assigned dfawley and unassigned easwars Aug 2, 2022
@easwars easwars added P2 and removed P2 labels Aug 3, 2022
@dfawley dfawley removed this from the 1.49 Release milestone Aug 12, 2022
@easwars easwars assigned dfawley and unassigned easwars Aug 24, 2022
orca/call_metric_recorder.go Outdated Show resolved Hide resolved
orca/call_metric_recorder.go Outdated Show resolved Hide resolved
orca/service.go Outdated Show resolved Hide resolved
orca/service.go Show resolved Hide resolved
orca/service.go Outdated Show resolved Hide resolved
orca/service.go Outdated
// utilization metric.
func (s *Service) DeleteCPUUtilization() {
s.mu.Lock()
s.cpu = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no difference between zero CPU utilization and deleted (presumably "not reporting")? If not, do we need this API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpu, and memory are special because they have dedicated fields in the orca load report proto. I initially had the delete API for them, just to be in sync with Java. But I think it does not make much sense to have these delete APIs for two reasons:

  1. they can always explicitly set it to 0.
  2. actual applications which are setting cpu and memory will hardly have a reason to delete them.

We can always add it if there is demand for it.

The other issue of not reporting when they are not set is not something that I'm doing right now. Maybe it will save a few bytes if done. Do you think it is worth doing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other issue of not reporting when they are not set is not something that I'm doing right now. Maybe it will save a few bytes if done. Do you think it is worth doing?

I'm unsure of the semantic differences between "unset" and "set to zero". If the protocol supports both (optional field vs. always present) then do we need to model that difference in our API as well? Did Java?

@dfawley dfawley assigned easwars and unassigned dfawley Aug 25, 2022
@easwars easwars assigned dfawley and unassigned easwars Aug 26, 2022
orca/orca.go Outdated
// It is safe to access the underlying metric recorder inside the wrapper at
// this point, as the user's RPC handler is done executing, and therefore
// there will be no more calls to CallMetricRecorderFromContext().
rw, ok := ctxWithRecorder.Value(callMetricRecorderCtxKey{}).(*recorderWrapper)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little awkward that we have to find this in the context considering we just stored it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ... it is. But is there another option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to prescribe one, and I also didn't want to suggest it since you just changed the function signature the other way, but....:

rw := newRecorderWrapper()
ctxWithRecorder := contextWithRecorder(ctx, rw)
...
// use rw directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

orca/service.go Outdated
@@ -90,16 +108,9 @@ func Register(s grpc.ServiceRegistrar, opts ServiceOptions) (*Service, error) {
return nil, fmt.Errorf("concrete type of provided grpc.ServiceRegistrar is %T, only supported type is %T", s, &grpc.Server{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we start with the API taking a *grpc.Server and change it in the future to a grpc.ServiceRegistrar when we can support it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@dfawley dfawley assigned easwars and unassigned dfawley Aug 30, 2022
@easwars easwars assigned dfawley and unassigned easwars Aug 31, 2022
@easwars
Copy link
Contributor Author

easwars commented Aug 31, 2022

Based on our offline discussion, did not add the delete API back. So, things stay as-is for now. PTAL. Thanks.

@easwars
Copy link
Contributor Author

easwars commented Aug 31, 2022

Also changed the interceptor implementation to not pull the recorder out of the context, but use it directly.

@dfawley dfawley assigned easwars and unassigned dfawley Sep 1, 2022
@easwars easwars merged commit aee9f0e into grpc:master Sep 1, 2022
@easwars easwars deleted the orca_master branch September 1, 2022 22:58
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support server API for injecting per-RPC custom backend metrics
2 participants