Proposal to separate context propagation from observability #42
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,195 @@ | ||||||
# Proposal: Separate Layer for Context Propagation | ||||||
|
||||||
Status: `proposed` | ||||||
|
||||||
Design OpenTelemetry as a set of separate applications which operate on a shared context propagation mechanism. | ||||||
|
||||||
|
||||||
## Motivation | ||||||
|
||||||
Based on prior art, we know that fusing the observability system and the context propagation system together creates issues. Observability systems have special rules for propagating information, such as sampling, and may have different requirements from other systems which require non-local information to be sent downstream. | ||||||
* Separation of concerns | ||||||
* Remove the Tracer dependency from context propagation mechanisms. | ||||||
* Separate distributed context into Baggage and Correlations | ||||||
* Extensibility | ||||||
* Allow users to create new applications for context propagation. | ||||||
* For example: A/B testing, encrypted or authenticated data, and new, experimental forms of observability. | ||||||
|
||||||
## Explanation | ||||||
|
||||||
# OpenTelemetry Layered Architecture | ||||||
|
||||||
![drawing](img/context_propagation_explanation.png) | ||||||
|
||||||
OpenTelemetry is a distributed program, which requires non-local, transaction-level context in order to execute correctly. Transaction-level context can also be used to build other distributed programs, such as security, versioning, and network switching programs. | ||||||
|
||||||
To allow for this extensibility, OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications - such as the observability and baggage systems provided by OpenTelemetry - simultaneously share the same underlying context propagation system in order to execute their programs. | ||||||
|
||||||
|
||||||
# Application Layer | ||||||
|
||||||
## Observability API | ||||||
|
||||||
OpenTelemetry currently contains two observability systems - Tracing and Metrics – and may be extended over time. These separate systems are bound into a unified Observability API through sharing labels – a mechanism for correlating independent observations – and through sharing propagators. | ||||||
|
||||||
**Observe( context, labels…, observations...) context** | ||||||
|
||||||
The general form for all observability APIs is a function which takes a Context, label keys, and observations as input, and returns an updated Context. | ||||||
Review comment: What is an observation in this context? A single metric measurement comes to mind.

Review comment: This describes
In my understanding, Observe is a stand-in for e.g. starting, ending Spans, but also any metric recording.

Review comment: I guess it's still not something that immediately clicks for me. Some clarification on that, if it's not around in some other document, would be great.

Review comment: Thanks for the feedback, @toumorokoshi. @Oberon00 is correct, this definition means to say the details of every function in the observability system are not relevant, only that in addition to "doing what they do," they:
These are the only two ways that the various observability APIs are tied together. I will try to make this more clear in the proposal; please let me know if you agree in principle.
||||||
|
||||||
**Correlate( context, label, value, hoplimit) context** | ||||||
To set the label values used by all observations in the current transaction, the Observability API provides a function which takes a context, a label key, a value, and a hoplimit, and returns an updated context. If the hoplimit is set to NO_PROPAGATION, the label will only be available to observability functions in the same process. If the hoplimit is set to UNLIMITED_PROPAGATION, it will be available to all downstream services. | ||||||
|
||||||
**GetPropagator( type) inject, extract** | ||||||
|
||||||
To register with the propagation system, the Observability API provides a set of propagation functions for every propagation type. | ||||||
|
||||||
|
||||||
## Baggage API | ||||||
|
||||||
In addition to observability, OpenTelemetry provides a simple mechanism for propagating arbitrary data, called Baggage. This allows new distributed applications to be implemented without having to create new propagators. | ||||||
|
||||||
To manage the state of a distributed application, the Baggage API provides a set of functions which read, write, and remove data. | ||||||
|
||||||
**SetBaggage(context, key, value) context** | ||||||
To record the distributed state of an application, the Baggage API provides a function which takes a context, a key, and a value as input, and returns an updated context which contains the new value. | ||||||
Review comment: Would it also be beneficial to have a
Although I see the explicit methods below for achieving the same - having the hoplimit also available when setting the baggage value enables an implicit decision to be defined by the baggage creator.

Review comment: I've been in general skeptical about the TTL notion, especially "single-hop" whose semantics are completely vague given that requests may pass through proxies that the caller would not know about. There is always the underlying in-memory Context object used as a storage for all kinds of contexts. I can always add a value to that context explicitly, and retrieve it explicitly if my application needs it in-process. I don't think there's a need to provide any additional functionality here (although it's worth calling out that such capability must exist in the Context, whereas it was missing from OpenTracing Java, for example).

Review comment: Yeah, I agree that single hop TTL is questionable... @objectiser I agree that if we come up with a TTL scheme beyond just
||||||
|
||||||
**GetBaggage( context, key) value** | ||||||
|
||||||
To access the distributed state of an application, the Baggage API provides a function which takes a context and a key as input, and returns a value. | ||||||
|
||||||
**RemoveBaggage( context, key) context** | ||||||
To delete distributed state from an application, the Baggage API provides a function which takes a context and a key as input, and returns an updated context which no longer contains the value for that key. | ||||||
|
||||||
**CleanBaggage( context) context** | ||||||
|
||||||
To avoid sending baggage to an untrusted downstream process, the Baggage API provides a function which removes all baggage from a context. | ||||||
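The four baggage functions described above can be sketched as follows. This is a non-normative illustration that assumes the context is a plain mapping and that baggage lives under one hypothetical key, `_BAGGAGE_KEY`; every function returns a new context rather than mutating its input.

```python
# Hypothetical key under which the Baggage API stores its data in the context.
_BAGGAGE_KEY = "baggage"


def set_baggage(context, key, value):
    """Return a new context whose baggage contains key -> value."""
    baggage = dict(context.get(_BAGGAGE_KEY, {}))
    baggage[key] = value
    return {**context, _BAGGAGE_KEY: baggage}


def get_baggage(context, key):
    """Return the baggage value for key, or None if it is not set."""
    return context.get(_BAGGAGE_KEY, {}).get(key)


def remove_baggage(context, key):
    """Return a new context whose baggage no longer contains key."""
    baggage = {k: v for k, v in context.get(_BAGGAGE_KEY, {}).items() if k != key}
    return {**context, _BAGGAGE_KEY: baggage}


def clean_baggage(context):
    """Return a new context with all baggage removed; other keys are kept."""
    return {k: v for k, v in context.items() if k != _BAGGAGE_KEY}
```

Calling `clean_baggage` before injecting into a request bound for an untrusted service leaves the rest of the context, such as observability data, untouched.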
|
||||||
**GetPropagator( type) inject, extract** | ||||||
|
||||||
To register with the propagation system, the Baggage API provides a set of propagation functions for every propagation type. | ||||||
|
||||||
|
||||||
## Additional APIs | ||||||
|
||||||
Because the application and context propagation layers are separated, it is possible to create new distributed applications which do not depend on either the Observability or Baggage APIs. | ||||||
|
||||||
**GetPropagator(type) inject, extract** | ||||||
Review comment: With the text that follows, I expected a

Review comment: So, the propagation layer provides the RegisterPropagator function, and applications provide the Propagators to be registered. Is there a better way to explain this in the GetPropagator description? Right now it says "To register with the propagation system, the [BLANK] API provides a set of propagation functions for every propagation type."

Review comment: I've removed the Registry concept, hopefully this is clearer.
||||||
To register with the propagation system, additional APIs provide a set of propagation functions for every propagation type. | ||||||
Review comment: Was this meant?
Suggested change
||||||
|
||||||
|
||||||
# Context Propagation Layer | ||||||
|
||||||
## Context API | ||||||
|
||||||
Distributed applications access data in-process using a shared context object. Each distributed application sets a single key in the context, containing all of the data for that system. | ||||||
|
||||||
**SetValue( context, key, value) context** | ||||||
To record the local state of an application, the Context API provides a function which takes a context, a key, and a value as input, and returns an updated context which contains the new value. | ||||||
|
||||||
**GetValue( context, key) value** | ||||||
To access the local state of an application, the Context API provides a function which takes a context and a key as input, and returns a value. | ||||||
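As a non-normative illustration, `SetValue` and `GetValue` could be written against an immutable context type like the one below. The `Context` class and the key names are hypothetical; the point is only that each distributed application keeps all of its data under its own single key, and that writes produce a new context.

```python
class Context:
    """Immutable key-value container; writes produce a new Context."""

    def __init__(self, values=None):
        self._values = dict(values or {})


def set_value(context, key, value):
    """Return a new Context containing key -> value."""
    return Context({**context._values, key: value})


def get_value(context, key):
    """Return the value stored under key, or None if it is absent."""
    return context._values.get(key)


# Each application owns exactly one key (hypothetical names).
root = Context()
ctx = set_value(root, "observability", {"trace_id": "abc"})
ctx = set_value(ctx, "baggage", {"experiment": "B"})
```

Because `set_value` never mutates its input, a caller further up the stack that still holds `root` does not see the values added later.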
|
||||||
**Optional: Automated Context Management** | ||||||
|
||||||
When possible, context should automatically be associated with program execution. Note that some languages do not provide any facility for setting and getting a current context. In these cases, the user is responsible for managing the current context. | ||||||
Review comment: Might be worth expanding on what Automated Context Management is like in practice. I would imagine a ThreadLocal is an example, since context is not available outside of the thread it was set or modified in?

Review comment: 👍 In general, this OTEP is very abstract, some (non-normative) examples would be helpful.

Review comment: Yeah, I struggle with this part. IMHO, explaining what OpenTelemetry looks like without automation is the goal of the spec, as it describes what instructions actually need to be executed in order to implement this system. Some implementations may be able to leverage the runtime to execute these instructions without the end user having to write some or all of the code... but I'm not sure what the best way to express that would be. We can list some example runtimes, like java and thread locals, to give readers a hint, but it would be nice if there was a useful way to describe what we mean while still using the same syntax and concepts described in the spec.

Review comment: This section is confusing two different layers from the Context Propagation Layers design doc: the Context and In-Process Propagation, hence the problem. In languages like Go the In-Process Propagation layer is not relevant, because the context is passed explicitly. This could apply to other frameworks when people don't want to rely on thread-locals.
That has nothing to do with the Context type itself that simply stores values.

Review comment: I don't think the section is confusing them. It just defines a point where they interact. The context type used here is not necessarily the same the in-process layer uses, it could translate between them.
||||||
|
||||||
**Optional: SetCurrentContext( context)** | ||||||
To associate a context with program execution, the Context API provides a function which takes a Context. | ||||||
|
||||||
**Optional: GetCurrentContext() context** | ||||||
To access the context associated with program execution, the Context API provides a function which takes no arguments and returns a Context. | ||||||
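In a runtime with a suitable facility, the two optional functions can be thin wrappers around it. The sketch below assumes Python's standard `contextvars` module as that facility and a plain mapping as the context; the variable name and the choice to return an empty context when none has been set are assumptions, not something the proposal specifies.

```python
import contextvars

# Holds the context bound to the current thread or async task.
_CURRENT = contextvars.ContextVar("current_context", default=None)


def set_current_context(context):
    """Associate a context with the currently executing code."""
    _CURRENT.set(context)


def get_current_context():
    """Return the associated context, or an empty one if none was set."""
    current = _CURRENT.get()
    return {} if current is None else current
```

In languages without such a facility, these functions are simply absent and the caller passes the context explicitly as a function argument.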
|
||||||
|
||||||
## Propagation API | ||||||
|
||||||
Distributed applications send data to downstream processes via propagators, functions which read and write application context into RPC requests. Each distributed application creates a set of propagators for every type of supported medium - currently HTTP and Binary. | ||||||
|
||||||
**Inject( context, request)** | ||||||
To send the data for all distributed applications downstream to the next process, the Propagation API provides a function which takes a context and a request, and mutates the request to include the encoded context. The canonical representation of a request is as a map. | ||||||
|
||||||
**Extract( context, request) context** | ||||||
To receive data injected by prior upstream processes, the Propagation API provides a function which takes a context and a request, and returns an updated context. | ||||||
Review comment: What does Extract update the context with? Is it SpanContext or Span?

Review comment: You are right, I also think this is an important detail that needs to be resolved. I have another suggestion: The context always contains just a spancontext. Any SpanContext has an optional reference to the span it refers to, which is null in case of a remote span (cf open-telemetry/opentelemetry-specification#216 (comment))

Review comment: I made a similar suggestion when we were working on the initial Java API in the spring. I think it has the benefits not just in the context of this discussion, but also as an alternative mechanism for implementing the "span". For example, in a streaming implementation the span does not even need to exist if all write operations simply work off the SpanContext. In other cases, spans may be kept by the tracer in a private dictionary, which makes memory management easier than if user code has an explicit reference to the span.

Review comment: I don't understand how this would reduce references to Spans. I thought the user would be able to get this reference anytime from the SpanContext. But even without such an API, startSpan etc would of course still return references to spans.

Review comment: StartSpan would return a new span context.
Span is just an accumulator of data, it is not actually needed in the API. We could move all Span methods into Tracer directly and make them accept span context as the first argument, that would be functionally equivalent.

Review comment: Ah, I understand! There is one functional difference though: if you allowed something like

Review comment: I remember that discussion last spring, and I think the desire was that

Review comment: @Oberon00 indeed, you can extract a parent span context and set attributes on it. I think it's actually a feature of this approach, not the downside :-) . Although it would only work in the streaming-like implementations, not the current approach where "finished span is not writeable". But then, even today you can do
@carlosalberto there are many ways of implementing SDK when the API only exposes Span Context. As @Oberon00 said, the simple implementation is still to have the Span at the SDK level which is linked from Span Context, this way you can share clock offsets.
||||||
|
||||||
**RegisterPropagator( type, inject, extract)** | ||||||
In order for the application layer to function correctly, propagation choices must be synchronized between all processes in the distributed system, and multiple applications must be able to inject and extract their context into the same request. To meet these requirements, the Propagation API provides a function which registers a set of propagators, which will all be executed in order when future calls to inject and extract are made. A canonical propagator consists of an inject and an extract function. | ||||||
|
||||||
OpenTelemetry currently contains two types of Propagators: | ||||||
|
||||||
* **HTTP** - context is written into and read from a map of HTTP headers. | ||||||
* **Binary** - context is serialized into and deserialized from a stream of bytes. | ||||||
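The registration behavior described above can be sketched as a small registry that runs every registered propagator in order. This is an assumption-laden illustration: the module-level `_REGISTRY`, the string propagation types, and passing the type as the first argument to `inject` and `extract` are choices made for the sketch, not signatures from the proposal.

```python
# Maps a propagation type (e.g. "http") to its registered (inject, extract) pairs.
_REGISTRY = {}


def register_propagator(type_, inject, extract):
    """Append a propagator pair; pairs run in registration order."""
    _REGISTRY.setdefault(type_, []).append((inject, extract))


def inject(type_, context, request):
    """Let every registered application write its data into the request."""
    for inject_fn, _ in _REGISTRY.get(type_, []):
        inject_fn(context, request)


def extract(type_, context, request):
    """Thread the context through every registered extract function."""
    for _, extract_fn in _REGISTRY.get(type_, []):
        context = extract_fn(context, request)
    return context


# Two toy applications sharing one request (header names are hypothetical).
register_propagator(
    "http",
    lambda ctx, req: req.update({"x-app-a": ctx.get("a", "")}),
    lambda ctx, req: {**ctx, "a": req.get("x-app-a", "")},
)
register_propagator(
    "http",
    lambda ctx, req: req.update({"x-app-b": ctx.get("b", "")}),
    lambda ctx, req: {**ctx, "b": req.get("x-app-b", "")},
)
```

A single `inject("http", context, request)` call then carries the data of both applications, which is the cooperation point between otherwise independent systems.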
|
||||||
# Internal details | ||||||
|
||||||
![drawing](img/context_propagation_details.png) | ||||||
|
||||||
## Context details | ||||||
OpenTelemetry currently implements three types of context. | ||||||
|
||||||
**Span Context -** Observability data used by the tracing system. The readable attributes are defined to match those found in the W3C **traceparent** header. Span Context is used as labels for metrics and traces. This can quickly add overhead when propagated in-band. But, because this data is write-only, how this information is transmitted remains undefined. | ||||||
|
||||||
**Correlation Context -** Transaction-level observability data, which can be applied as labels to spans and metrics. | ||||||
|
||||||
|
||||||
**Baggage Context -** Transaction-level application data, meant to be shared with downstream components. | ||||||
|
||||||
Note that when possible, OpenTelemetry API calls are given access to the entire context object, and not a specific context type. | ||||||
|
||||||
|
||||||
|
||||||
## Context Management and in-process propagation | ||||||
|
||||||
In order for Context to function, it must always remain bound to the execution of code it represents. By default, this means that the programmer must pass a Context down the call stack as a function parameter. However, many languages provide automated context management facilities, such as thread locals. OpenTelemetry should leverage these facilities when available, in order to provide automatic context management. | ||||||
|
||||||
|
||||||
## Pre-existing Context implementations | ||||||
|
||||||
In some languages, a single, widely used Context implementation exists. In other languages, there may be too many implementations, or none at all. In the cases where there is not an extremely clear pre-existing option available, OpenTelemetry should provide its own Context implementation. | ||||||
|
||||||
While the above explanation represents the default OpenTelemetry approach to context propagation, it is important to note that some languages may already contain a form of context propagation. For example, Go has the context.Context object, and widespread conventions for how to pass it. | ||||||
|
||||||
Span data is used as labels for metrics and traces. This can quickly add overhead when propagated in-band. But, because this data is write-only, how this information is transmitted remains undefined. | ||||||
|
||||||
|
||||||
## Default Propagators | ||||||
|
||||||
When available, OpenTelemetry defaults to propagating via HTTP header definitions which have been standardized by the W3C. | ||||||
|
||||||
|
||||||
# Trade-offs and mitigations | ||||||
|
||||||
## Why separate Baggage from Correlations? | ||||||
|
||||||
Since Baggage Context and Correlation Context appear very similar, why have two? | ||||||
|
||||||
First and foremost, the intended uses for Baggage and Correlations are completely different. Secondly, the propagation requirements diverge significantly. | ||||||
|
||||||
Correlation values are solely to be used as labels for metrics and traces. By making Correlation Context data write-only, how and when it is transmitted remains undefined. This leaves the door open to optimizations, such as propagating some data out-of-band, and situations where sampling decisions may remove the need to propagate correlation context any further. | ||||||
|
||||||
Baggage values, on the other hand, are explicitly added in order to be accessed downstream by other application code. Therefore, Baggage Context must be readable, and reliably propagated in-band in order to accomplish this goal. | ||||||
|
||||||
There may be cases where a key-value pair is propagated both as a TagMap entry for observability and as Baggage for application-specific use. A/B testing is one example of such a use case. This creates potential duplication, both at the call site where the pair is created and at propagation. | ||||||
|
||||||
Removing this duplication is not worth the semantic confusion of a dual-purpose context type. However, because all observability functions take the complete context as input, it may still be possible to use baggage values as labels. | ||||||
|
||||||
|
||||||
## What about complex propagation behavior? | ||||||
|
||||||
Some OpenTelemetry proposals have called for more complex propagation behavior. One example is falling back to extracting B3 headers if Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities. | ||||||
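To illustrate how such a fallback can hide behind a single extract function, the sketch below tries the W3C `traceparent` header first and the single-header B3 format second. It assumes lowercase header keys in a plain map and a hypothetical `span_context` context key. The parsing is deliberately simplified, ignoring flags, sampling state, and the multi-header B3 variant, so it should be read as a sketch, not a compliant parser.

```python
import re

# version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)
# traceid-spanid, optionally followed by sampling state and parent span id.
_B3_SINGLE = re.compile(r"^([0-9a-f]{16}|[0-9a-f]{32})-([0-9a-f]{16})(?:-.*)?$")

_SPAN_CONTEXT_KEY = "span_context"  # hypothetical context key


def _extract_trace_context(context, request):
    match = _TRACEPARENT.match(request.get("traceparent", ""))
    if match is None:
        return None
    _, trace_id, span_id, _ = match.groups()
    return {**context, _SPAN_CONTEXT_KEY: (trace_id, span_id)}


def _extract_b3(context, request):
    match = _B3_SINGLE.match(request.get("b3", ""))
    if match is None:
        return None
    trace_id, span_id = match.groups()
    # 64-bit B3 trace ids are left-padded with zeros to 128 bits.
    return {**context, _SPAN_CONTEXT_KEY: (trace_id.rjust(32, "0"), span_id)}


def extract_with_fallback(context, request):
    """Looks like one ordinary extract function from the outside."""
    for extractor in (_extract_trace_context, _extract_b3):
        updated = extractor(context, request)
        if updated is not None:
            return updated
    return context
```

From the point of view of the Propagation API, `extract_with_fallback` is a single propagator; the ordering between the two formats is internal to the observability application.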
Review comment: So what happens when multiple propagators are registered for the same type? I think chained propagators may be cumbersome to provide on top of an API that allows only one propagator, since it requires cooperation between everyone who wants to add a propagator to the chain.

Review comment: I have seen code that does this, by using a stack-alike

Review comment: So, here is the nuance: there is basic chaining provided at the API level by RegisterPropagator. All propagators are run for every type, in the order in which they are registered. This is where the cooperation happens between independent applications. More complex propagation behavior usually ends up being specific to each application. So, the OTel SDK can provide the kind of fallback W3C -> B3 chaining for observability, described here. That sort of behavior is presented as a single, complex propagator, from the point of view of the Propagation API, since the fallback behavior is internal to a single application. Does that make sense?
||||||
|
||||||
|
||||||
## Did you add a context parameter to every API call because Go has infected your brain? | ||||||
Review comment: Curious, is this just for fun?

Review comment: Haha, well I wanted to address what I perceive to be a common question, along the lines of "why is this context parameter everywhere? Is it because this is a golang project? Is it required that every language must pass context this way?" But if the humor gets in the way of learning, I can change it.

Review comment: I like it. You can also stress that it's not just a Go thing - in the old versions of Node there was no CLS (or it was extremely inefficient) and explicitly passing the context was how the propagation was achieved.

Review comment: Good call, I will emphasize that this issue exists in multiple languages.
||||||
|
||||||
No. The concept of an explicit context is fundamental to a model where independent distributed applications share the same context propagation layer. How this context appears or is expressed is language specific, but it must be present in some form. | ||||||
|
||||||
|
||||||
# Prior art and alternatives | ||||||
Review comment: ahem... Context Propagation Layers https://docs.google.com/document/d/1UxrEYOaQlF_E4gtiPoFmcZ4YKKe1GxohvCvQDuwvD1I/edit
||||||
|
||||||
Prior art: | ||||||
* OpenTelemetry distributed context | ||||||
* OpenCensus propagators | ||||||
* OpenTracing spans | ||||||
* gRPC context | ||||||
|
||||||
# Open questions | ||||||
|
||||||
Related work on HTTP propagators has not been completed yet. | ||||||
|
||||||
* [W3C Trace-Context](https://www.w3.org/TR/trace-context/) candidate is not yet accepted | ||||||
* Work on [W3C Correlation-Context](https://w3c.github.io/correlation-context/) has begun, but was halted to focus on Trace-Context. | ||||||
* No work has begun on a theoretical W3C Baggage-Context. | ||||||
|
||||||
Given that we must ship with working propagators, and the W3C specifications are not yet complete, how should we move forwards with implementing context propagation? | ||||||
|
||||||
# Future possibilities | ||||||
|
||||||
Cleanly splitting OpenTelemetry into an Application and Context Propagation layer may allow us to move the Context Propagation layer into its own, stand-alone project. This may facilitate adoption, by allowing us to share Context Propagation with gRPC and other projects. | ||||||
|
Review comment: I find this pic very confusing, sorry. There was an ascii art in one of the earlier tickets.