Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pubsub: native OpenTelemetry support #4665

Open
ianonavy opened this issue Aug 24, 2021 · 27 comments
Open

pubsub: native OpenTelemetry support #4665

ianonavy opened this issue Aug 24, 2021 · 27 comments
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@ianonavy
Copy link

cc @aabmass

Is your feature request related to a problem? Please describe.
As developer using PubSub via the Golang client lib, I would like to propagate OpenTelemetry span contexts via messages so that I can better understand the entire lifecycle of a message.

Describe the solution you'd like
I would expect a working solution to look similar to the NodeJS implementation.

Maybe something like this?

client, _ := pubsub.NewClientWithConfig(ctx, PROJECT_ID, pubsub.ClientConfig{
    OpenTelemetryEnabled: true,
})

Then I would expect the result to look something like this when configured with a Cloud Trace exporter:

Screenshot of trace waterfall showing spans across publisher and consumer

Image is from this article by the former intern who implemented the NodeJS and Python versions of this request.

Describe alternatives you've considered
OpenTelemetry-OpenCensus Bridge would be great if PubSub were already instrumented to propagate OpenCensus spans through messages like this, but unfortunately the behavior I describe is orthogonal to the existing issue to port the existing tracing from OpenCensus to OpenTelemetry.

@ianonavy ianonavy added the triage me I really want to be triaged. label Aug 24, 2021
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the Pub/Sub API. label Aug 24, 2021
@hongalex hongalex added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Aug 24, 2021
@hongalex
Copy link
Member

Hiya,

Thanks for filing this issue. OT tracing support is definitely something that's on our roadmap and we're actively planning to support this. At the moment, it's mostly blocked on standardization across Pub/Sub client library languages and waiting on the Go implementation for 1.0. I'll post here for updates.

Also appreciate the detail info. If you have specific asks or features you want, I'd be happy to relay them to the rest of the client library maintainers while we work on this.

@ianonavy
Copy link
Author

Hi again. otel-go released 1.0 yesterday! 🎉

Could you please shed some insight on what

standardization across Pub/Sub client library languages

means?

I talked to @aabmass and @weyert on the CNCF Slack about the googclient_OpenTelemetrySpanContext format from the NodeJS (and WIP Python) implementations, and I understand there was some internal discussion about the merits of that format over say something based on traceparent. Is that what you mean by "standardization across [...] languages"? If so, I am happy to opine from a user perspective as well. 😄

@hongalex
Copy link
Member

Hey! Yeah we (myself and the Node library maintainer) are in contact with @aabmass. Yeah, that's one of the few things that we're trying to standardize though we're not 100% sure how we're going to fix that for Node since it's already released. We're discussing the merits of making a breaking change/major version bump or just supporting both attributes.

Among the other things is what parts of the publishing/subscribing we want to trace (and what the end result should look like). We kind of have an idea about this from the Node implementation, but it seems like it might have issues. I'll start hacking away at draft PR in Go and ask for feedback there if you're willing to help with that!

@weyert
Copy link

weyert commented Oct 7, 2021

Did you had any luck with this?

@hongalex
Copy link
Member

Sorry for the lack of an update. Yeah, I've had success with a basic span that encompasses send/receive/consume. I'm trying to make sure the spans look like what you get out of other messaging systems, following semconv, and should have a draft ready by end of this week.

@weyert
Copy link

weyert commented Oct 19, 2021

@hongalex That's great! If that's the case I wil try to match semconv usage it in the node-library. Still not sure how to deal with message handling in node

@hongalex
Copy link
Member

Could you clarify what the issue with message handling is? I did notice that there is some issue with context propagation in Node (contexts are marshalled with json instead of a propagator that follows W3C tracestate/traceparent conventions).

One thing that I'm still missing right now in Go is linking spans when doing batch receiving](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.1.0/specification/trace/semantic_conventions/messaging.md#batch-receiving). After this, the PR should be ready for feedback.

@weyert
Copy link

weyert commented Oct 20, 2021

Could you clarify what the issue with message handling is? I did notice that there is some issue with context propagation in Node (contexts are marshalled with json instead of a propagator that follows W3C tracestate/traceparent conventions).

Yeah, I meant that there is a ticket in the node project were is said the context propagation is not working as expected so that's something I would like to solve. See: googleapis/nodejs-pubsub#1389

Alongside that I want to move towards the same approach you are using with the W3C tracestate/traceparent conventions. For the node-library I am to going to worry about batch receiving yet.

@hongalex
Copy link
Member

I finally opened #5034, please take a look/try it out for yourself. It's very much a work in progress still (need to figure out how to pass messaging.destination to the subscribe side, benchmarking message size calculations).

@weyert
Copy link

weyert commented Oct 28, 2021

Thanks, I will check it out :)

@Mistic92
Copy link

Mistic92 commented Sep 1, 2022

Hi, what is the current status of that feature?

@jjmengze
Copy link

jjmengze commented Oct 13, 2022

Ho folks , is there any news? Very excited to google support otel

@quzhi1
Copy link

quzhi1 commented Dec 10, 2022

We really need this feature. Right now we don't have a very elegant way of passing trace info from publisher to subscriber.

@billinghamj
Copy link

What is the situ here, given #5034 (comment)? If we want to keep a consistent trace across all messages, is GCP going to prevent that?

@jjmengze
Copy link

jjmengze commented Apr 4, 2023

is there any updated?

@quzhi1
Copy link

quzhi1 commented Apr 4, 2023

I wrote a blog about how to enable tracing in Google PubSub. Basically you need to wrap tracing information inside message attribute. This is the link to my blog: https://www.nylas.com/blog/distributed-tracing-with-opentelemetry/. Search for "Tracing PubSub" and you will find my helper functions.

@zwolsman
Copy link

zwolsman commented Sep 1, 2023

Is there any update as of when this will be picked up again/available? 😄

@ghost
Copy link

ghost commented Oct 20, 2023

Any update here? Seems this requirement is strong.

@mhv666
Copy link

mhv666 commented Feb 13, 2024

Same question here, any plans to implement it? 🙇🏽 Thanks

@hongalex
Copy link
Member

hongalex commented Mar 13, 2024

Thanks for the patience on this. We're still working out a few kinks, but if you would be interested in trying out this, it is available to try out as a beta release. This can be done by running go get cloud.google.com/go/pubsub@v1.37.0-beta.otel.trace.

Official documentation is still pending, but you should be able to see traces once you do the following

  1. Instantiate an exporter + tracer provider (Cloud Trace exporter example)
  2. On Pub/Sub client instantiation, set the client config EnableOpenTelemetryTracing to true.
  3. Publish and subscribe to messages as normal using this new version of the client library.

Each message is given its own trace, from publish to subscribe. Batch RPC spans (publish, ack, modack) will live in their own traces, but durations can be viewed on the message trace via span events.

image

While we are close to launching this feature, we welcome any feedback.

Edit: here are basic publish and subscribe examples.

@billinghamj
Copy link

@hongalex Could you clarify how the new support connects with this? #5034 (comment)

Has your thinking changed? Are there best practices or suggested options you have in mind?

@hongalex
Copy link
Member

The main change is that we are now relying more heavily on links. The design at the time of the previous comment duplicated batch RPC spans to every single message trace. So if you make a publish RPC call with 10 messages in it, then each message trace will contain a copy of the publish RPC span. Now, that span exists in its own trace, and is linked to each message trace it is affiliated with. In addition, in each message trace, we added span events to say how long the RPC took.

There were some span name changes and attribute changes that were made with updated messaging semantic conventions in mind. This is still an evolving standard so we reserve the right to make breaking changes to things like spans, span names, attribute names, and links.

@jameshartig
Copy link

One issue I noticed is that the parent's context is not being used in the ack spans so you send up missing some of the ack spans if you are sampling. We use trace.ParentBased as a sampler so if the parent is sampled all of the children are sampled but the problem is that when startSpan is called for the ack spans it uses a new context rather than using the context from the parent span.

@jameshartig
Copy link

I left a few comments on the PR, we ran this in production for a while but quickly noticed memory leaks, I believe from the activeSpans map.

@hongalex
Copy link
Member

One issue I noticed is that the parent's context is not being used in the ack spans so you send up missing some of the ack spans if you are sampling

This is unfortunately intentional. The short story is that batched spans (like ack) have multiple parents, so it cannot solely rely on parent/child relationships and why I gave it a new context. Instead, batched spans live in their own trace and are linked to the main message trace. I plan to add a new option to link back from message trace -> batched ops traces.

I left a few comments on the PR, we ran this in production for a while but quickly noticed memory leaks, I believe from the activeSpans map.

Thanks for the comments. Fixed the memory leak issues (and others) and cut another release under cloud.google.com/go/pubsub@v1.37.0-beta.otel.trace2

@jameshartig
Copy link

@hongalex thanks for the quick turnaround. I can confirm we aren't seeing the memory leak in the latest release.

imjasonh pushed a commit to chainguard-dev/terraform-infra-common that referenced this issue Jun 1, 2024
Two changes:
- use a pubsub client library version that enables trace support - this
allows a header `googclient-traceparent` to be progapated to push
subscribers. We can also populate this ourselves (by setting a message
attribute), but the library also adds a couple nice spans inside to the
pubsub client library
- on handlers, overwrite `traceparent` by `googclient-traceparent` if
available

This PubSub client library is at 1.37, slightly older than our current
version (1.38). According to
googleapis/google-cloud-go#4665 (comment)
this will be launching ~soonish so we won't have to use an older version
much longer. If it takes very long, we can DIY.

<img width="1482" alt="image"
src="https://github.com/chainguard-dev/terraform-infra-common/assets/138266/179f91bb-20f1-4a95-b64a-ae8157300ebe">

---------

Signed-off-by: Nghia Tran <tcnghia@gmail.com>
@tcnghia
Copy link

tcnghia commented Jun 4, 2024

@hongalex do you have an estimated timeline for the launch of this feature?

Also, do you plan to cut regular releases following the main versions (1.38.0 now)?

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the Pub/Sub API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests