Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GRPC trace API #199

Merged
merged 11 commits into from
Mar 3, 2022
Merged

GRPC trace API #199

merged 11 commits into from
Mar 3, 2022

Conversation

esnible
Copy link
Contributor

@esnible esnible commented Dec 23, 2021

Signed-off-by: Ed Snible snible@us.ibm.com

This PR creates an Open Telemetry gRPC endpoint for the method /opentelemetry.proto.collector.trace.v1.TraceService/Export. This proof-of-concept does not (yet) contain RBAC security or tenancy. This version uses https://github.com/mwitkow/grpc-proxy to proxy the gRPC without deserializing it.

Previously I created pull 196 which exposed the OTLP Trace HTTP endpoint. That was straightforward using the Chi-router.

The final implementation needs RBAC and tenancy. I haven't yet found anything that is the equivalent to the chi.Router for gRPC. Maybe https://github.com/grpc-ecosystem/go-grpc-middleware ?

It is too early to review this code, beyond the architectural ideas and the choice of libraries, parameter names, etc.

@esnible esnible changed the title proof-of-concept GRPC trace API proof-of-concept Dec 23, 2021
@esnible esnible changed the title GRPC trace API proof-of-concept [WORK IN PROGRESS] GRPC trace API proof-of-concept Dec 23, 2021
go.mod Show resolved Hide resolved
@esnible esnible changed the title [WORK IN PROGRESS] GRPC trace API proof-of-concept GRPC trace API proof-of-concept Jan 12, 2022
@esnible esnible marked this pull request as draft January 12, 2022 14:07
@matej-g matej-g requested a review from a team January 12, 2022 14:08
@matej-g
Copy link
Contributor

matej-g commented Jan 12, 2022

Thanks for this @esnible 👍 I've pinged people to take a look at this, I'll be back with my review as well.

@esnible
Copy link
Contributor Author

esnible commented Jan 24, 2022

@matej-g Any feedback?

@matej-g
Copy link
Contributor

matej-g commented Jan 25, 2022

Hey @esnible,
Although there seem to be still a lot of comments and to-do's, I believe this is in essence how we can proceed, i.e. I imagine we want to have a proxy to forward requests and before that we'll need interceptors for every authn / authz (OIDC, OPA etc.). I'd personally say that we can develop this PR further and continue iterate / give feedback? I'm happy to also chat about this on the tomorrow's community call.

I believe @squat could also give us qualified feedback here, as he was initially having thoughts about how to implement the gRPC interceptors (#158).

@bwplotka
Copy link
Member

Thanks for this and make sure to ping us on the #observatorium CNCF channel or through private means next - that got blocked too long 🙈

I would love if we can add gRPC API if Otel gRPC is more efficient and popular. That should fit perfectly in our space.

In terms of https://github.com/mwitkow/grpc-proxy feel free to use it. I have somehow written permissions and looks like many wants to jump in and help: mwitkow/grpc-proxy#52. Definitely something running in production in many places. (:

In terms of interceptors please use v2 branch and if anything is missing, feel free to put issue on https://github.com/grpc-ecosystem/go-grpc-middleware/

For authorization - this interceptor might be a good start: https://github.com/grpc-ecosystem/go-grpc-middleware/blob/v2/interceptors/auth/auth.go We might want to convert some middleware to interceptors: https://github.com/observatorium/api/blob/main/main.go#L552

Overall all decisions here makes sense to me! 💪🏽

@esnible esnible marked this pull request as ready for review February 3, 2022 20:55
@esnible esnible changed the title GRPC trace API proof-of-concept GRPC trace API Feb 3, 2022
@esnible
Copy link
Contributor Author

esnible commented Feb 4, 2022

Note: This code requires the 2018 version of mwitkow/grpc-proxy. The repo was untouched; I thought it was orphaned, but just a week ago it got an update that breaks this. I have opened an issue mwitkow/grpc-proxy#54

@esnible esnible force-pushed the grpc-traces-api branch 2 times, most recently from 7f9346b to 3d5880f Compare February 4, 2022 20:04
@matej-g
Copy link
Contributor

matej-g commented Feb 7, 2022

Note: This code requires the 2018 version of mwitkow/grpc-proxy. The repo was untouched; I thought it was orphaned, but just a week ago it got an update that breaks this. I have opened an issue mwitkow/grpc-proxy#54

What seems to be the issue exactly with using latest version? On a quick look, the changes seem to be more on the surface to modernize the codebase, rather than changing how the library works.

My thinking is, if we're already introducing this dependency and it has been just updated, I would not make these changes to require to use the old version.

Copy link
Contributor

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @esnible! I think the direction is looking good.

It would also be great to have E2E tests for tracing API, but I'm happy to have them after the fact (as long as we don't forget about it 🙂).

api/traces/v1/api.go Outdated Show resolved Hide resolved
api/traces/v1/api.go Outdated Show resolved Hide resolved
if fullMethodName == traceRoute {
var err error

if conn == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know what happens if DialContext below errors out? Is conn guaranteed to be nil in such case? I'm thinking about a case when we dial returns error, but because conn is not nil, we would not try to connect again on the second method call.

Also, is there any downside to dialing and obtaining conn outside of the director method? I would find that to be a bit simpler and if there's problem with dialing to collector, we can return error before we even enter director logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matej-g I am now explicitly setting conn to nil on failure.

I haven't tested this code yet for performance or reliability. It passes my simple test: Observatorium-api forwards a trace. Even if the OTel collector is restarted, and the Observatorium-api forwards the next trace. More testing is needed for correct keep-alives and for the OTel collector restarting while Observatorium-api is streaming traces.

main.go Outdated

//nolint:lll
func gRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pm *authentication.ProviderManager, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) {
director := tracesv1.NewHandler(cfg.traces.writeEndpoint,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether in our case a simple NewProxy would not suffice instead of introducing director. The way I see it, we don't need to necessarily intercept the method, as this is something that collector would refuse if e.g. a user tries to use unimplemented method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR only handles incoming traces. However, Jaeger supports both HTTP and gRPC query APIs. There will be different RBAC security for forwarding traces and queries.

main.go Outdated Show resolved Hide resolved
main.go Show resolved Hide resolved
authentication/oidc.go Show resolved Hide resolved
token = authorization[1]
}

idToken, err := a.verifier.Verify(oidc.ClientContext(ctx, a.client), token)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like from this line below, the code is very similar with how HTTP middleware handles this. Do you think we could extract those parts and reuse a single method for both HTTP and gRPC middleware?

authentication/mtls.go Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
@esnible
Copy link
Contributor Author

esnible commented Feb 9, 2022

@matej-g I have spent two days trying to get the latest version to proxy OpenTelemetry protobufs through gRPC. Sending data either with grpcurl or with the Open Telemetry gRPC trace exporter gives similar problems. I discuss the problem at mwitkow/grpc-proxy#55

I am extremely concerned about requiring a back-level mwitkow/grpc-proxy. I am looking for a work-around.

What seems to be the issue exactly with using latest version? On a quick look, the changes seem to be more on the surface to modernize the codebase, rather than changing how the library works.

My thinking is, if we're already introducing this dependency and it has been just updated, I would not make these changes to require to use the old version.

@matej-g
Copy link
Contributor

matej-g commented Feb 9, 2022

@matej-g I have spent two days trying to get the latest version to proxy OpenTelemetry protobufs through gRPC. Sending data either with grpcurl or with the Open Telemetry gRPC trace exporter gives similar problems. I discuss the problem at mwitkow/grpc-proxy#55

Understood, thanks for putting in the effort @esnible 👍, then let's see if get an answer from upstream and in the meantime we can perhaps settle the remaining questions.

// because the codec we need to register is also deprecated. A better fix, if Google removes
// the deprecated type, is https://github.com/mwitkow/grpc-proxy/pull/48
grpc.WithCodec(grpcproxy.Codec()), // nolint: staticcheck
grpc.WithInsecure(), // nolint: staticcheck
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
grpc.WithInsecure(), // nolint: staticcheck
grpc.WithTransportCredentials(insecure.NewCredentials())

secondary, but so the linter would give us a green light.

@esnible
Copy link
Contributor Author

esnible commented Feb 9, 2022

@matej-g I would also like to have tests. I see tests in the test/e2e directory. I did not understand them. Can you point me to some docs on test writing?

I have some ideas on how to test the GRPC piece end-to-end with just 2 Open Telemetry agents -- no actual Jaeger or span storage. Would this be desirable? Or should the E2E tests install the full Jaeger, possibly including ElasticSearch, and be as close as possible to a valid deployment?

It would also be great to have E2E tests for tracing API, but I'm happy to have them after the fact (as long as we don't forget about it 🙂).

@esnible
Copy link
Contributor Author

esnible commented Feb 17, 2022

@matej-g I discovered a bug where the Observatorium-api gets an interrupt and restarts every few minutes. I am looking into it; please review but don't merge yet. I also notice that it is checking the tenancy header before checking the authorization header, letting unauthenticated users guess tenant names.

@matej-g
Copy link
Contributor

matej-g commented Feb 28, 2022

@matej-g I would also like to have tests. I see tests in the test/e2e directory. I did not understand them. Can you point me to some docs on test writing?

We are using the https://github.com/efficientgo/e2e framework, you'll find everything documented there.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Overall, functionally makes sense. Let us know if you need help in setting e2e tests. Code looks fine too, just make sure to not handle error twice. If you see something like

foo, err := bar()
if err != nil {
  // produce error/warning/debug log line
  return err
}

That is generally quite bad. The reason is that the passed error is most likely handled in some way - this some way is usually to log it on server side, put metric, return error to user. If every function would add log line when returning error we have lots of wasted information and effort.

HTTP middleware are tricky as usually indeed the returned error is handled by returning HTTP/gRPC response to client/user. But in order to log this I would rather implement logging middleware/interceptor than spending 100 code lines to add log line to every error handling. WDYT?

api/traces/v1/api.go Show resolved Hide resolved
}
}

// NewHandler creates the new traces v1 handler.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NewHandler creates the new traces v1 handler.
// NewOTelConnection creates new GRPC connection to Otel handler.

}

// HandlerOption modifies the handler's configuration.
type HandlerOption func(h *handlerConfiguration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really handler? This looks like NewOtelConnection looks like client rather?


const TraceRoute = "/opentelemetry.proto.collector.trace.v1.TraceService/Export"

type handlerConfiguration struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type handlerConfiguration struct {
type connOptions struct {

o(c)
}

level.Info(c.logger).Log("msg", "gRPC dialing OTel collector", "endpoint", write)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this collector client? I would not presume what's the end target here. End target for this connection might be anything that supports Otel gRPC, no?

main.go Outdated
Comment on lines 1060 to 1072
connOtel, err := tracesv1.NewOTelConnection(cfg.traces.writeEndpoint,
tracesv1.WithLogger(logger),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
connOtel, err := tracesv1.NewOTelConnection(cfg.traces.writeEndpoint,
tracesv1.WithLogger(logger),
)
connOtel, err := tracesv1.NewOTelConnection(
cfg.traces.writeEndpoint,
tracesv1.WithLogger(logger),
)

main.go Outdated
}

//nolint:lll
func gRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pmis authentication.GRPCMiddlewareFunc, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func gRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pmis authentication.GRPCMiddlewareFunc, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) {
func newOtelGRPCServer(cfg *config, tenantHeader string, tenantIDs map[string]string, pmis authentication.GRPCMiddlewareFunc, authorizers map[string]rbac.Authorizer, logger log.Logger) (*grpc.Server, error) {

main.go Outdated

// Currently we only proxy TraceService/Export to OTel collectors.
// In the future we will pass queries to Jaeger, and possibly other
// gRPC methods for logs and metrics to different connections
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// gRPC methods for logs and metrics to different connections
// gRPC methods for logs and metrics to different connections.

main.go Outdated
// the deprecated type, is https://github.com/mwitkow/grpc-proxy/pull/48
grpc.CustomCodec(grpcproxy.Codec()), // nolint: staticcheck
grpc.UnknownServiceHandler(grpcproxy.TransparentHandler(director)),
// Authorization (tenant)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's visible what interceptors are, we could remove this comment.

main.go Outdated
authentication.WithGRPCTenantInterceptors(logger, pmis),
auth.StreamServerInterceptor(
authorization.WithGRPCAuthorizers(authorizers, gRPCRBAC, logger)),
// Log when a trace cannot be forwarded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. why?

Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Signed-off-by: Ed Snible <snible@us.ibm.com>
Copy link
Contributor

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments @esnible, this is looking nicely 👍, I have few more comments but I would not like to stall this PR any further. I'll be happy to approve once you get a chance to respond to the latest review.

// service supporting opentelemetry.proto.collector.trace.v1.TraceService
level.Info(c.logger).Log("msg", "gRPC dialing OTel", "endpoint", write)

ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for the 1 minute time out here? Could we do with shorter time out?

authentication/grpc.go Outdated Show resolved Hide resolved
authentication/grpc.go Show resolved Hide resolved
authentication/grpc.go Outdated Show resolved Hide resolved
authentication/grpc.go Show resolved Hide resolved
// it lets the caller see messages such as
// "failed to authenticate: oidc: token is expired (Token Expiry: 2022-02-03 14:05:36 -0500 EST)"
// which are super-helpful in troubleshooting.
return ctx, fmt.Sprintf("%s: %v", "failed to authenticate", err),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the HTTP behavior might be by design, see discussion here #235 (comment), so worth considering if the same should be apply here?

main.go Show resolved Hide resolved
Signed-off-by: Ed Snible <snible@us.ibm.com>
Copy link
Contributor

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you @esnible! This is look good to me now!

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks definitely good enough 👍🏽

api/traces/v1/api.go Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants