Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add W3C Trace Context propagation #949

Merged
merged 6 commits into from
Mar 29, 2021
Merged

Conversation

duxet
Copy link
Contributor

@duxet duxet commented Feb 23, 2021

New propagation implementation, based on W3C Trace Context specification. This PR solves #802.

@SimunKaracic
Copy link
Contributor

Great job as far as I can tell, but you're gonna have to give me some time to study the specification.
Your tests don't compile on github though, can you please remove the trailing , on line 31 and try again?

@pnerg
Copy link
Contributor

pnerg commented Mar 23, 2021

Any plans on moving this forward?
I'm really in need for w3c context propagation.
Something missing, something I can help with?

@ivantopo
Copy link
Contributor

I want to get this thing merged this week, it seems pretty solid but I wanted to double check before merging. I had a chat with @jatcwang last week precisely about moving OTel-related support forward and I'm really grateful you are pushing for it @pnerg 🙏. (jumped out of the work bubble just to answer this message, will be back at it later today)

@pnerg
Copy link
Contributor

pnerg commented Mar 23, 2021

Sounds super if we could get it into the baseline this week.
I've got a beta-version of an OpenTelemetry trace exporter which I've tested with success against the OpenTelemetry Collector

The true test however will be to run a kamon instrumented app alongside with non-kamon apps, propagating the same w3c context AND reporting to the same OTLP endpoint hoping the total trace would be stitched properly
I have an environment coming up with such setup and I'd need the w3c propagation.

@SimunKaracic
Copy link
Contributor

SimunKaracic commented Mar 23, 2021

I've been reading the spec https://www.w3.org/TR/trace-context/
And it looks like we propagate traceparent correctly, but tracestate is just forwarded along, not mutated (or parsed) according to the spec.
If you think we can go along without that, I guess this can be merged, but we'll have to do some more work later.

P.S. rebasing onto master should fix the failing tests

@pnerg
Copy link
Contributor

pnerg commented Mar 23, 2021

To be honest I've not really understood the use of the tracestate header.
This vendor specific in any arbitrary/opaque format is somewhat beyond me how to use it.

I've noticed that the OTLP spec contains a field trace_state which according to the docs relates to the w3c tracestate header. But the spec just defines it as a string so I assume one just crams the entire header as-is.
ATM I've left out the tracestate from my Otel exporter (as there's no such info in the context)

The most important fields imo are the identifiers (trace, span, parent) as is using them one can stitch the trace.
I'd be fine with having an update later on which deals with tracestate in a more "proper" way.

@SimunKaracic
Copy link
Contributor

If I'm reading the spec correctly, we're not breaking anything by not touching tracestate.
So, to me this looks fine and well tested, but I don't have any actual services using w3trace propagation to test this with.
@pnerg, please, when we release this, try it out and raise any issues that come up, we'll get on them asap.

I've raised an issue for handling tracestate #963.
If @ivantopo (or anyone else) doesn't raise any issues by the end of day tomorrow, we'll merge and do a small release so you can try it out.

How does that sound @pnerg ?

@pnerg
Copy link
Contributor

pnerg commented Mar 23, 2021

excellent, I'll take it into use along with the OTel exporter and we'll see where it goes from there.

@duxet
Copy link
Contributor Author

duxet commented Mar 23, 2021

And it looks like we propagate traceparent correctly, but tracestate is just forwarded along, not mutated (or parsed) according to the spec.

Yeah, that's done on purpose. I think this behavior should be compliant with the specification as we don't really need to store any Kamon specific value in tracestate header:

The vendor MAY validate the tracestate header
The vendor MAY modify the tracestate header
The vendor sets the traceparent and tracestate header for the outgoing request

If we would like to pass the Context, then probably it would be better to use baggage header for that: https://www.w3.org/TR/baggage/

@ivantopo
Copy link
Contributor

I just took a second look at the spec and this PR, and I think it will work just fine 🎉

There is only one tiny thing that we should consider: Kamon uses 8-byte identifiers by default, but W3C uses 16-byte identifiers by default. I saw that this PR will handle it gracefully:

  • Always use 16 bytes when decoding a traceparent
  • Always fill up to 16 bytes when encoding a traceparent

But it will be a bit odd that all traces starting on a Kamon application will have an ID like 00000000000000000123456789ABCDEF by default. Do you think we should log a warning about it when this propagation is used but the identifier scheme is set to single?

parentId = Identifier("2222", Array[Byte](2, 2, 2, 2)),
trace = Trace(
id = Identifier("1234", Array[Byte](1, 2, 3, 4)),
samplingDecision = SamplingDecision.Sample
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a test for the Unknown sampling decision?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't mind if it's added, but it can be merged without it

@pnerg
Copy link
Contributor

pnerg commented Mar 24, 2021

@ivantopo when we implement support for OTel trace exporter the trace_id HAS to be 16 bytes.

  // A unique identifier for a trace. All spans from the same trace share
  // the same `trace_id`. The ID is a 16-byte array. An ID with all zeroes
  // is considered invalid.
  //
  // This field is semantically required. Receiver should generate new
  // random trace_id if empty or invalid trace_id was received.
  //
  // This field is required.
  bytes trace_id = 1;

Failing to comply will result in failure when exporting the trace. At least the reference collector I've tested with returns an error code.

I'm more of a brutal failure person than the fix gracefully...:)
IMO we should log error and fail to start if single/8-byte identifier is used AND using either of

  • w3c context propagation
  • OTel exporter

I've seen too many cases where logs are ignored and/or possibly just drown in an avalanche of other logs.
So under the hood start to pad the traceid is IMO not the correct way. If you want 8-byte identifiers then you can't use w3c context nor OTel exporters....period
Better to be clear in the documentation and also fail to start in case you don't read docs.

@ivantopo
Copy link
Contributor

I'm trying to balance the two sides we have seen over the years:

  • Folks saying that is better to make it break and force users to fix the problems right away (like @pnerg here)
  • Folks saying that the monitoring tool should never break the app, no matter what

I find myself in the middle of these two groups 😅

In this case, this is what I think we should do:

  • Log a warning when the w3c trace context reader is created if Kamon is configured to use single length trace identifiers. It should be only a warning because everything would work just fine, besides the "oddness" of having leading zeroes if a trace starts in Kamon.
  • The OTLP reporter can either warn as well, or crash, whatever @pnerg finds more appropriate.

I'm treating these two cases differently because using w3c tracecontext doesn't necessarily mean using the OTLP reporter, and the application will work fine in those cases anyways. I think we should be lining up a bigger release soon and change the default propagation to w3c tracecontext, along with the default identifier scheme.

@duxet are you ok with adding the warning?

@pnerg
Copy link
Contributor

pnerg commented Mar 26, 2021

I'm fine with any decision you make, I'll adapt the OTel exporter to behave in a similar fashion.
I'd rather see the exporter has the same behaviour as the core/propagation. Don't need two variants.

As for releasing this feature, I'd be happy if we could come up with a minor release including the w3c context propagation.
Then you can line up a major release where you change the default propagation and identifier scheme.

@duxet
Copy link
Contributor Author

duxet commented Mar 29, 2021

@duxet are you ok with adding the warning?

Sure, I've updated the PR.

@duxet
Copy link
Contributor Author

duxet commented Mar 29, 2021

So under the hood start to pad the traceid is IMO not the correct way. If you want 8-byte identifiers then you can't use w3c context nor OTel exporters....period

Well, it's not a hard requirement but still a recommendation to use the padding in such case:

https://w3c.github.io/trace-context/#interoperating-with-existing-systems-which-use-shorter-identifiers

When a system creates an outbound message and needs to generate a fully compliant 16 bytes trace-id from a shorter identifier, it SHOULD left pad the original identifier with zeroes. For example, the identifier 53ce929d0e0e4736, SHOULD be converted to trace-id value 000000000000000053ce929d0e0e4736.

@pnerg
Copy link
Contributor

pnerg commented Mar 29, 2021

@duxet , I've already surrendered on the id padding...:)
Added padding in the OTel exporter in case the TraceId is 8-bytes

@ivantopo ivantopo merged commit e4244bd into kamon-io:master Mar 29, 2021
@ivantopo
Copy link
Contributor

Thanks a lot @duxet and @pnerg!

We will be pushing out a release later today with this 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants