Allow reporting gRPC codes in "status_code" label of request duration metrics #424

Merged: duricanikolic merged 9 commits into main from yuri/status-code on Nov 4, 2023

@duricanikolic duricanikolic commented Nov 1, 2023

What this PR does:
Before this PR, grpc_instrumentation.go recognized only the errors built by the httpgrpc package and used their error codes in the status_code label of the request duration metrics. All other gRPC errors were treated as unknown and labeled as error. This PR fixes that behaviour and allows all gRPC status codes to be used as the value of the status_code label.

The types middleware.InstrumentationLabel and middleware.InstrumentationOption have been introduced, together with a special value of the latter, middleware.ReportGRPCStatusOption. This value enables reporting of gRPC status codes in the status_code label (instead of the simplified "error", "cancel" or "success" values). middleware.ReportGRPCStatusOption can be passed as an optional argument to the following functions:

  • middleware.UnaryServerInstrumentInterceptor
  • middleware.StreamServerInstrumentInterceptor
  • middleware.UnaryClientInstrumentInterceptor
  • middleware.StreamClientInstrumentInterceptor
  • middleware.Instrument.

To guarantee backwards compatibility, reporting of gRPC codes as labels is disabled by default and can be enabled as follows:

  • on the server side, by setting the new experimental CLI flag -server.report-grpc-codes-in-instrumentation-label to true, or by setting server.Config.ReportGRPCCodesInInstrumentationLabel to true.
  • on the client side, by passing middleware.ReportGRPCStatusOption to middleware.Instrument.
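
An opt-in setting of this kind is commonly expressed in Go with the functional-option pattern. The following is a minimal, self-contained sketch of that pattern only; `instrumentationLabel`, `instrumentationOption`, `reportGRPCStatusOption` and `applyOptions` are hypothetical stand-ins chosen for illustration and are not taken from the dskit code.

```go
package main

import "fmt"

// instrumentationLabel is a hypothetical stand-in for the label configuration.
type instrumentationLabel struct {
	reportGRPCStatus bool
}

// instrumentationOption is a hypothetical option type that mutates the configuration.
type instrumentationOption func(*instrumentationLabel)

// reportGRPCStatusOption is an assumed option value that turns on detailed status codes.
var reportGRPCStatusOption instrumentationOption = func(l *instrumentationLabel) {
	l.reportGRPCStatus = true
}

// applyOptions starts from the backwards-compatible default (reporting disabled)
// and applies every option in order.
func applyOptions(opts ...instrumentationOption) instrumentationLabel {
	var l instrumentationLabel
	for _, opt := range opts {
		opt(&l)
	}
	return l
}

func main() {
	fmt.Println(applyOptions().reportGRPCStatus)
	fmt.Println(applyOptions(reportGRPCStatusOption).reportGRPCStatus)
}
```

Because the options are variadic, callers that pass nothing keep the previous behaviour, which is what makes the change backwards compatible in this sketch.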

Which issue(s) this PR fixes:

Part of grafana/mimir#6008.

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
@duricanikolic duricanikolic self-assigned this Nov 1, 2023
middleware/grpc_instrumentation_test.go Outdated Show resolved Hide resolved
middleware/grpc_instrumentation.go Outdated Show resolved Hide resolved
// This implementation differs from status.FromError() because the
// latter checks only if the given error can be cast to status.Status,
// and doesn't check other errors in the given error's tree.
func ErrorToStatus(err error) (*status.Status, bool) {
Contributor Author

The current httpgrpc.statusFromError() function has been renamed to ErrorToStatus(), exported, and moved here.
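
The difference described in the code comment above (checking only the outermost error versus searching the whole error tree) can be illustrated with the standard library alone. This is a sketch under stated assumptions: `statusError`, `codeFromDirectCast`, `codeFromTree` and `wrapWithContext` are hypothetical names, and a plain integer stands in for a real status code.

```go
package main

import (
	"errors"
	"fmt"
)

// statusError is a hypothetical error type carrying a numeric status code.
type statusError struct {
	code int
	msg  string
}

func (e *statusError) Error() string { return e.msg }

// wrapWithContext wraps err so that it is no longer the outermost error.
func wrapWithContext(msg string, err error) error {
	return fmt.Errorf("%s: %w", msg, err)
}

// codeFromDirectCast inspects only the outermost error.
func codeFromDirectCast(err error) (int, bool) {
	se, ok := err.(*statusError)
	if !ok {
		return 0, false
	}
	return se.code, true
}

// codeFromTree also searches wrapped errors, using errors.As.
func codeFromTree(err error) (int, bool) {
	var se *statusError
	if !errors.As(err, &se) {
		return 0, false
	}
	return se.code, true
}

func main() {
	wrapped := wrapWithContext("query failed", &statusError{code: 8, msg: "resource exhausted"})

	_, ok := codeFromDirectCast(wrapped)
	fmt.Println("direct cast found a code:", ok)

	code, ok := codeFromTree(wrapped)
	fmt.Println("tree search found a code:", ok, code)
}
```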

grpcstatus "google.golang.org/grpc/status"
)

func TestErrorToStatus(t *testing.T) {
Contributor Author

The current httpgrpc.TestStatusFromError() has been moved here.

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
)

func TestAppendMessageSizeToOutgoingContext(t *testing.T) {
	ctx := context.Background()

	req := &httpgrpc.HTTPRequest{
Contributor Author

This has been replaced with fakeSizer to avoid a circular dependency. The parameter of AppendMessageSizeToOutgoingContext should be any implementation of the Sizer interface, and httpgrpc.HTTPRequest was just one of them.
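
As a rough illustration of why a test double removes the dependency: if the function under test needs only a size, the test can supply any type with the right method. The sketch below assumes that the Sizer interface exposes a single `Size() int` method; `messageSize` is a hypothetical consumer and not the real AppendMessageSizeToOutgoingContext.

```go
package main

import "fmt"

// Sizer is assumed here to expose a single Size method.
type Sizer interface {
	Size() int
}

// fakeSizer is a test double that reports a fixed size,
// so the test does not need to import a concrete request type.
type fakeSizer struct {
	size int
}

func (f fakeSizer) Size() int { return f.size }

// messageSize is a hypothetical consumer that depends only on the interface.
func messageSize(s Sizer) int {
	return s.Size()
}

func main() {
	fmt.Println(messageSize(fakeSizer{size: 42}))
}
```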

)

// IsCanceled checks whether an error comes from an operation being canceled
func IsCanceled(err error) bool {
	if errors.Is(err, context.Canceled) {
		return true
	}
-	s, ok := status.FromError(err)
+	s, ok := ErrorToStatus(err)
Member

Doing a full conversion to gogo's status.Status seems unnecessary here if all we care about is the code from grpc.Status. Perhaps we can have an ErrorToStatusCode function too?

Contributor Author

It is true that we need only the status code here, but to get it we have to call either gogostatus.FromError() or grpcstatus.FromError() (the current implementation) anyway.
My idea here was to unify the way we extract statuses: in most places it is done with gogostatus, and in a couple of places with grpcstatus.

The plan is to get rid of the gogo-related things anyway, but only once we find a valid alternative. If we now start mixing gogo and grpc usages, it will be more difficult to do the change one day.

Member

> If we now start mixing gogo and grpc usages, it will be more difficult to do the change one day.

I don't see how my proposal would do that?

Contributor Author

> I don't see how my proposal would do that?

I was referring to the current implementation: it added a grpcstatus usage to dskit. And my motivation with the proposed change was to get rid of it. I meant that.

Member

> I was referring to the current implementation: it added a grpcstatus usage to dskit. And my motivation with the proposed change was to get rid of it. I meant that.

We need to check for the GRPCStatus() *grpcstatus.Status method. Is this what you're referring to? I don't see a way around it.

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
grpcutil/status.go Outdated Show resolved Hide resolved
grpcutil/status_test.go Outdated Show resolved Hide resolved
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
…LabelOption structs

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Member

@pstibrany pstibrany left a comment

This PR is getting bigger and bigger :/

server/server.go Outdated Show resolved Hide resolved
Comment on lines 265 to 277
statusCode := grpcutil.ErrorToStatusCode(err)

if statusCode == codes.Canceled {
	return statusCode
}

if isHTTPStatusCode(statusCode) {
	return statusCode
}
if i.acceptGRPCStatuses {
	return statusCode
}
return codes.Unknown
Member

I think this function is doing too much -- it should just return grpcutil.ErrorToStatusCode(err) for non-nil and not-canceled errors. (It also doesn't need to be a method of InstrumentationLabel.)

Contributor Author

This function is the only place where we actually take into account whether i.acceptGRPCStatuses is set to true or not. If we move it out from here, we need to put it in the statusCodeToString() function. I can try to see how it would look like, but it will for sure complicate the test.

Member

> I can try to see how it would look like, but it will for sure complicate the test.

I'm not sure if test is too complicated or not, but I think this function looks good now. Thank you for giving it a try.

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Member

@pstibrany pstibrany left a comment

Great work 👏

		instrumentationOptions     []InstrumentationOption
		expectedAcceptGRPCStatuses bool
	}{
		"Applying no InstrimentationOption sets acceptGRPCStatus to false": {
Member

nit: typo in InstrimentationOption

func TestApplyInstrumentationLabelOptions(t *testing.T) {
	testCases := map[string]struct {
		instrumentationOptions     []InstrumentationOption
		expectedAcceptGRPCStatuses bool
Member

it's now called "reportGRPCStatus".

CHANGELOG.md Outdated
@@ -155,6 +155,13 @@
* [ENHANCEMENT] Memcached: allow to configure write and read buffer size (in bytes). #414
* [ENHANCEMENT] Server: Add `-server.http-read-header-timeout` option to specify timeout for reading HTTP request header. It defaults to 0, in which case reading of headers can take up to `-server.http-read-timeout`, leaving no time for reading body, if there's any. #423
* [ENHANCEMENT] Make httpgrpc.Server produce non-loggable errors when a header with key `httpgrpc.DoNotLogErrorHeaderKey` and any value is present in the HTTP response. #421
* [ENHANCEMENT] Server: Add `-server.report-grpc-codes-in-instrumentation-label` CLI flag to specify whether gRPC status codes should be used in instrumentation labels. It defaults to false, meaning that gRPC status codes are represented with `error` value. #424
* [ENHANCEMENT] Instrumentation: `middleware.InstrumentationOption` struct, and a special value `middleware.ReportGRPCStatusOption` have been added to allow both server and clients to configure gRPC status code usages in instrumentation labels. This can be optionally used in the following functions: #424
Member

Suggested change
- * [ENHANCEMENT] Instrumentation: `middleware.InstrumentationOption` struct, and a special value `middleware.ReportGRPCStatusOption` have been added to allow both server and clients to configure gRPC status code usages in instrumentation labels. This can be optionally used in the following functions: #424
+ * [ENHANCEMENT] Instrumentation: Added `middleware.ReportGRPCStatusOption` that can be passed to the following functions to enable reporting of gRPC status codes in "status_code" label (instead of simplified "error", "cancel" or "success" values): #424

CHANGELOG.md Outdated
@@ -155,6 +155,13 @@
* [ENHANCEMENT] Memcached: allow to configure write and read buffer size (in bytes). #414
* [ENHANCEMENT] Server: Add `-server.http-read-header-timeout` option to specify timeout for reading HTTP request header. It defaults to 0, in which case reading of headers can take up to `-server.http-read-timeout`, leaving no time for reading body, if there's any. #423
* [ENHANCEMENT] Make httpgrpc.Server produce non-loggable errors when a header with key `httpgrpc.DoNotLogErrorHeaderKey` and any value is present in the HTTP response. #421
* [ENHANCEMENT] Server: Add `-server.report-grpc-codes-in-instrumentation-label` CLI flag to specify whether gRPC status codes should be used in instrumentation labels. It defaults to false, meaning that gRPC status codes are represented with `error` value. #424
Member

Suggested change
- * [ENHANCEMENT] Server: Add `-server.report-grpc-codes-in-instrumentation-label` CLI flag to specify whether gRPC status codes should be used in instrumentation labels. It defaults to false, meaning that gRPC status codes are represented with `error` value. #424
+ * [ENHANCEMENT] Server: Add `-server.report-grpc-codes-in-instrumentation-label` CLI flag to specify whether gRPC status codes should be used in `status_code` label of request duration metric. It defaults to false, meaning that gRPC status codes are represented with `error` value. #424

Comment on lines 232 to 237
// errorToStatusCode extracts a status code from the given error, and does the following:
//
// - If the error is nil, codes.OK is returned.
// - If the error corresponds to context.Canceled, codes.Canceled is returned.
// - Otherwise, the actual status code of the error is returned.
func errorToStatusCode(err error) codes.Code {
Member

nit: This description of the implementation seems unnecessary. Same for the next function. (I'd only comment public functions, or functions with tricky behaviour. But this is pretty straightforward.)

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
@duricanikolic duricanikolic changed the title grpc_instrumentation: recognize both HTTP and gRPC status codes Allow reporting of gRPC status codes in the "status_code" label of the request duration metrics Nov 4, 2023
@duricanikolic duricanikolic changed the title Allow reporting of gRPC status codes in the "status_code" label of the request duration metrics Allow reporting gRPC codes in the "status_code" label of the request duration metrics Nov 4, 2023
@duricanikolic duricanikolic changed the title Allow reporting gRPC codes in the "status_code" label of the request duration metrics Allow reporting gRPC codes in "status_code" label of request duration metrics Nov 4, 2023
@duricanikolic duricanikolic merged commit b3823cb into main Nov 4, 2023
3 checks passed
@duricanikolic duricanikolic deleted the yuri/status-code branch November 4, 2023 11:20