Proposal to separate context propagation from observability #42

Closed
wants to merge 36 commits into from
Changes from 3 commits
36 commits
dff8df9
Proposal to separate context propagation from observability
tedsuo Sep 8, 2019
5ad7d1c
cleanup description for Extract
tedsuo Sep 10, 2019
1dc3c7b
commas
tedsuo Sep 10, 2019
58248e6
Update text/0000-separate-context-propagation.md
tedsuo Sep 10, 2019
68cb0ba
RFC proposal: A layered approach to data formats
tedsuo Aug 13, 2019
3dc6a76
whitespace
tedsuo Aug 22, 2019
459435e
Capitalization
tedsuo Aug 22, 2019
c9c64f4
whitespace
tedsuo Aug 22, 2019
c3c7c24
CleanBaggage -> ClearBaggage
tedsuo Sep 10, 2019
4588096
move function descriptions to new line
tedsuo Sep 10, 2019
2d80dae
Add Optional subheader
tedsuo Sep 10, 2019
7a73210
cleanup rough edits
tedsuo Sep 10, 2019
0d8e41b
clean up advice on pre-existing context implementations
tedsuo Sep 10, 2019
aad5605
Better context descriptions
tedsuo Sep 10, 2019
4a930eb
remove data format file
tedsuo Sep 11, 2019
e1ef61f
remove git diff message
tedsuo Sep 11, 2019
f949435
improved code sytnax
tedsuo Sep 11, 2019
1cb155e
stop stuttering
tedsuo Sep 11, 2019
7b9e861
Update text/0000-separate-context-propagation.md
tedsuo Sep 11, 2019
07eb397
spacing
tedsuo Sep 11, 2019
0ebeb6c
Refine propagation
tedsuo Sep 25, 2019
147d6b0
Add RFC ID number from PR
tedsuo Oct 1, 2019
72d4651
remove RFC status line
tedsuo Oct 1, 2019
1472197
slight calrification for GetHTTPExtractor
tedsuo Oct 1, 2019
18a37d4
add global propagators
tedsuo Oct 1, 2019
7ea1834
Clean up motivation
tedsuo Oct 15, 2019
7317747
Clean up explanbation intro
tedsuo Oct 15, 2019
43ba8fd
Clarify context types
tedsuo Oct 15, 2019
d7d6f1c
Fix ChainHTTPInjector and ChainHTTPExtractor
tedsuo Oct 15, 2019
3a817a2
typo
tedsuo Oct 15, 2019
3381e0f
Reference Trace-Context, not just traceparent
tedsuo Oct 15, 2019
c15a107
Bagge context cleanup
tedsuo Oct 15, 2019
310e8d5
stronger language around context access
tedsuo Oct 15, 2019
f59fc27
Update text/0042-separate-context-propagation.md
tedsuo Oct 15, 2019
153b9aa
clean up tradeoffs
tedsuo Oct 15, 2019
f70855a
Update text/0042-separate-context-propagation.md
tedsuo Oct 15, 2019
195 changes: 195 additions & 0 deletions text/0000-separate-context-propagation.md
@@ -0,0 +1,195 @@
# Proposal: Separate Layer for Context Propagation

Status: `proposed`

Design OpenTelemetry as a set of separate applications which operate on a shared context propagation mechanism.


## Motivation

Based on prior art, we know that fusing the observability system and the context propagation system together creates issues. Observability systems have special rules for propagating information, such as sampling, and may have different requirements from other systems which require non-local information to be sent downstream.
* Separation of concerns
  * Remove the Tracer dependency from context propagation mechanisms.
  * Separate distributed context into Baggage and Correlations.
* Extensibility
  * Allow users to create new applications for context propagation.
  * For example: A/B testing, encrypted or authenticated data, and new, experimental forms of observability.

## Explanation

# OpenTelemetry Layered Architecture

![drawing](img/context_propagation_explanation.png)
Member

I find this pic very confusing, sorry. There was ASCII art in one of the earlier tickets.


OpenTelemetry is a distributed program, which requires non-local, transaction-level context in order to execute correctly. Transaction-level context can also be used to build other distributed programs, such as security, versioning, and network switching programs.

To allow for this extensibility, OpenTelemetry is separated into an **application layer** and a **context propagation layer**. In this architecture, multiple distributed applications, such as the observability and baggage systems provided by OpenTelemetry, simultaneously share the same underlying context propagation system in order to execute their programs.


# Application Layer

## Observability API

OpenTelemetry currently contains two observability systems, Tracing and Metrics, and may be extended over time. These separate systems are bound into a unified Observability API through sharing labels (a mechanism for correlating independent observations) and through sharing propagators.

**Observe(context, labels..., observations...) context**
The general form for all observability APIs is a function which takes a Context, label keys, and observations as input, and returns an updated Context.
Member

what is an observation in this context? A single metric measurement comes to mind.

Member

@Oberon00 Sep 11, 2019

This describes

> The general form for all observability APIs

In my understanding, Observe is a stand-in for e.g. starting, ending Spans, but also any metric recording.

Member

I guess it's still not something that immediately clicks for me. Some clarification on that, if it's not around in some other document, would be great.

Contributor Author

Thanks for the feedback, @toumorokoshi. @Oberon00 is correct, this definition means to say the details of every function in the observability system are not relevant, only that in addition to "doing what they do," they:

  • always have access to the entire context
  • accept labels

These are the only two ways that the various observability APIs are tied together. I will try to make this more clear in the proposal; please let me know if you agree in principle.


**Correlate(context, label, value, hoplimit) context**
To set the label values used by all observations in the current transaction, the Observability API provides a function which takes a context, a label key, a value, and a hoplimit, and returns an updated context. If the hoplimit is set to NO_PROPAGATION, the label will only be available to observability functions in the same process. If the hoplimit is set to UNLIMITED_PROPAGATION, it will be available to all downstream services.

**GetPropagator(type) inject, extract**
To register with the propagation system, the Observability API provides a set of propagation functions for every propagation type.


## Baggage API

In addition to observability, OpenTelemetry provides a simple mechanism for propagating arbitrary data, called Baggage. This allows new distributed applications to be implemented without having to create new propagators.

To manage the state of a distributed application, the Baggage API provides a set of functions which read, write, and remove data.

**SetBaggage(context, key, value) context**
To record the distributed state of an application, the Baggage API provides a function which takes a context, a key, and a value as input, and returns an updated context which contains the new value.
@objectiser Sep 11, 2019

Would it also be beneficial to have a hoplimit on baggage items? For example, once supported, we may want a certain baggage value only to be propagated to the downstream service and no further - or only available within the service (NO_PROPAGATION) - but it is not to be treated as a label.

Although I see the explicit methods below for achieving the same - having the hoplimit also available when setting the baggage value enables an implicit decision to be defined by the baggage creator.

Member

I've been in general skeptical about the TTL notion, especially "single-hop" whose semantics are completely vague given that requests may pass through proxies that the caller would not know about.

There is always the underlying in-memory Context object used as a storage for all kinds of contexts. I can always add a value to that context explicitly, and retrieve it explicitly if my application needs it in-process. I don't think there's a need to provide any additional functionality here (although it's worth calling out that such capability must exist in the Context, whereas it was missing from OpenTracing Java, for example).

Contributor Author

Yeah, I agree that single hop TTL is questionable...

@objectiser I agree that if we come up with a TTL scheme beyond just NO_PROPAGATION and UNLIMITED_PROPAGATION we should add it to baggage. I only left it off because NO_PROPAGATION is equivalent to just using the context API directly, and that did not leave any options other than UNLIMITED_PROPAGATION. So it wasn't clear to me why anyone would need baggage for that purpose.


**GetBaggage(context, key) value**
To access the distributed state of an application, the Baggage API provides a function which takes a context and a key as input, and returns a value.

**RemoveBaggage(context, key) context**
To delete distributed state from an application, the Baggage API provides a function which takes a context and a key as input, and returns an updated context which no longer contains that key.

**CleanBaggage(context) context**
To avoid sending baggage to an untrusted downstream process, the Baggage API provides a function which removes all baggage from a context.

**GetPropagator(type) inject, extract**
To register with the propagation system, the Baggage API provides a set of propagation functions for every propagation type.


## Additional APIs

Because the application and context propagation layers are separated, it is possible to create new distributed applications which do not depend on either the Observability or Baggage APIs.

**GetPropagator(type) inject, extract**
Member

With the text that follows, I expected a RegisterPropagator API here. Somehow this should reference the actual registration function described later.

Contributor Author

So, the propagation layer provides the RegisterPropagator function, and applications provide the Propagators to be registered.

Is there a better way to explain this in the GetPropagator description? Right now it says "To register with the propagation system, the [BLANK] API provides a set of propagation functions for every propagation type."

Contributor Author

I've removed the Registry concept, hopefully this is clearer.

To register with the propagation system, additional APIs provide a set of propagation functions for every propagation type.
Member

Was this meant?

Suggested change
To register with the propagation system, additional APIs provide a set of propagation functions for every propagation type.
To register with the propagation system, additional APIs allow to retrieve and register a set of propagation functions for every propagation type.



# Context Propagation Layer

## Context API

Distributed applications access data in-process using a shared context object. Each distributed application sets a single key in the context, containing all of the data for that system.

**SetValue(context, key, value) context**
To record the local state of an application, the Context API provides a function which takes a context, a key, and a value as input, and returns an updated context which contains the new value.

**GetValue(context, key) value**
To access the local state of an application, the Context API provides a function which takes a context and a key as input, and returns a value.
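
For a language with no pre-existing context implementation, `SetValue` and `GetValue` could be sketched as an immutable linked list, shown here in Go purely for illustration. The `Context` struct, the `abTestKey` key, and the `abTestState` value are hypothetical; they show one application keeping all of its data under a single key, as described above.

```go
package main

import "fmt"

// Context is an immutable linked list of key/value pairs.
// A nil *Context is the empty context.
type Context struct {
	parent *Context
	key    interface{}
	value  interface{}
}

// SetValue returns a new Context that shadows any earlier value for key.
// The input context is never modified.
func SetValue(ctx *Context, key, value interface{}) *Context {
	return &Context{parent: ctx, key: key, value: value}
}

// GetValue walks from the newest entry to the oldest and returns the
// first value stored under key, or nil if there is none.
// Keys must be comparable; empty struct types work well.
func GetValue(ctx *Context, key interface{}) interface{} {
	for c := ctx; c != nil; c = c.parent {
		if c.key == key {
			return c.value
		}
	}
	return nil
}

// abTestKey is the single key owned by a hypothetical A/B testing application.
type abTestKey struct{}

// abTestState holds all of that application's data.
type abTestState struct {
	Experiment string
	Variant    string
}

func main() {
	var root *Context // empty context
	ctx := SetValue(root, abTestKey{}, abTestState{Experiment: "checkout-button", Variant: "B"})

	state, ok := GetValue(ctx, abTestKey{}).(abTestState)
	fmt.Println(state.Variant, ok)

	// The empty context is unaffected by the SetValue call above.
	fmt.Println(GetValue(root, abTestKey{}) == nil)
}
```

Using an unexported struct type as the key means no other application can read or overwrite the entry by accident.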

**Optional: Automated Context Management**
When possible, context should automatically be associated with program execution. Note that some languages do not provide any facility for setting and getting a current context. In these cases, the user is responsible for managing the current context.
Member

Might be worth expanding on what Automated Context Management looks like in practice. I would imagine a ThreadLocal is an example, since context is not available outside of the thread it was set or modified in?

Member

👍 In general, this OTEP is very abstract, some (non-normative) examples would be helpful.

Contributor Author

Yeah, I struggle with this part. IMHO, explaining what OpenTelemetry looks like without automation is the goal of the spec, as it describes what instructions actually need to be executed in order to implement this system.

Some implementations may be able to leverage the runtime to execute these instructions without the end user having to write some or all of the code... but I'm not sure what the best way to express that would be. We can list some example runtimes, like java and thread locals, to give readers a hint, but it would be nice if there was a useful way to describe what we mean while still using the same syntax and concepts described in the spec.

Member

This section is confusing two different layers from the Context Propagation Layers design doc: the Context and In-Process Propagation, hence the problem. In languages like Go the In-Process Propagation layer is not relevant, because the context is passed explicitly. This could apply to other frameworks when people don't want to rely on thread-locals. That has nothing to do with the Context type itself that simply stores values.

Member

I don't think the section is confusing them. It just defines a point where they interact. The context type used here is not necessarily the same one the in-process layer uses; it could translate between them.


**Optional: SetCurrentContext(context)**
To associate a context with program execution, the Context API provides a function which takes a Context.

**Optional: GetCurrentContext() context**
To access the context associated with program execution, the Context API provides a function which takes no arguments and returns a Context.


## Propagation API

Distributed applications send data to downstream processes via propagators, functions which read and write application context into RPC requests. Each distributed application creates a set of propagators for every type of supported medium - currently HTTP and Binary.

**Inject(context, request)**
To send the data for all distributed applications downstream to the next process, the Propagation API provides a function which takes a context and a request, and mutates the request to include the encoded context. The canonical representation of a request is as a map.

**Extract(context, request) context**
To receive data injected by prior upstream processes, the Propagation API provides a function which takes a context and a request, and returns an updated context.

What does Extract update the context with? Is it SpanContext or Span?
If it is SpanContext then it would require another api to set the current SpanContext in addition to Span.
If it is Span then Extract also has to start a span, which I don't think is appropriate.
So, it would be better for this API to return SpanContext. Unless we introduce a concept of Inactive Span which simply contains remote SpanContext.

Member

You are right, I also think this is an important detail that needs to be resolved.

I have another suggestion: The context always contains just a spancontext. Any SpanContext has an optional reference to the span it refers to, which is null in case of a remote span (cf open-telemetry/opentelemetry-specification#216 (comment))

Member

I made a similar suggestion when we were working on the initial Java API in the spring. I think it has benefits not just in the context of this discussion, but also as an alternative mechanism for implementing the "span". For example, in a streaming implementation the span does not even need to exist if all write operations simply work off the SpanContext. In other cases, spans may be kept by the tracer in a private dictionary, which makes memory management easier than if user code has an explicit reference to the span.

Member

I don't understand how this would reduce references to Spans. I thought the user would be able to get this reference anytime from the SpanContext. But even without such an API, startSpan etc. would of course still return references to spans.

Member

StartSpan would return a new span context. Span is just an accumulator of data, it is not actually needed in the API. We could move all Span methods into Tracer directly and make them accept span context as the first argument, that would be functionally equivalent.

Member

Ah, I understand! There is one functional difference though: if you allowed something like Tracer.setSpanAttribute(context, "x", "y"), you implicitly allow setting attributes on spans that were not created in this process. This is prevented by the current Span/SpanContext separation and I think that's a good thing.

Contributor

I remember that discussion from last spring, and I think the desire was that Spans created in the same process could share the monotonic clock (which is an optimization, more than a requirement).

Member

@Oberon00 indeed, you can extract a parent span context and set attributes on it. I think it's actually a feature of this approach, not the downside :-) . Although it would only work in the streaming-like implementations, not the current approach where "finished span is not writeable". But then, even today you can do span.end(); span.setTag(), which is meaningless.

@carlosalberto there are many ways of implementing SDK when the API only exposes Span Context. As @Oberon00 said, the simple implementation is still to have the Span at the SDK level which is linked from Span Context, this way you can share clock offsets.


**RegisterPropagator(type, inject, extract)**
In order for the application layer to function correctly, propagation choices must be synchronized between all processes in the distributed system, and multiple applications must be able to inject their context into, and extract it from, the same request. To meet these requirements, the Propagation API provides a function which registers a set of propagators, which will all be executed in order when future calls to inject and extract are made. A canonical propagator consists of an inject and an extract function.

OpenTelemetry currently contains two types of Propagators:

* **HTTP** - context is written into and read from a map of HTTP headers.
* **Binary** - context is serialized into and deserialized from a stream of bytes.

# Internal details

![drawing](img/context_propagation_details.png)

## Context details
OpenTelemetry currently implements three types of context.

**Span Context -** Observability data used by the tracing system. The readable attributes are defined to match those found in the W3C **traceparent** header. Span Context is used as labels for metrics and traces. This can quickly add overhead when propagated in-band. But, because this data is write-only, how this information is transmitted remains undefined.

**Correlation Context -** Transaction-level observability data, which can be applied as labels to spans and metrics.

**Baggage Context -** Transaction-level application data, meant to be shared with downstream components.

Note that when possible, OpenTelemetry API calls are given access to the entire context object, and not a specific context type.


## Context Management and in-process propagation

In order for Context to function, it must always remain bound to the execution of code it represents. By default, this means that the programmer must pass a Context down the call stack as a function parameter. However, many languages provide automated context management facilities, such as thread locals. OpenTelemetry should leverage these facilities when available, in order to provide automatic context management.

## Pre-existing Context implementations

In some languages, a single, widely used Context implementation exists. In other languages, there may be too many implementations, or none at all. In the cases where there is not an extremely clear pre-existing option available, OpenTelemetry should provide its own Context implementation.

While the above explanation represents the default OpenTelemetry approach to context propagation, it is important to note that some languages may already contain a form of context propagation. For example, Go has the `context.Context` object, and widespread conventions for how to pass it.

Span data is used as labels for metrics and traces. This can quickly add overhead when propagated in-band. But, because this data is write-only, how this information is transmitted remains undefined.

## Default Propagators

When available, OpenTelemetry defaults to propagating via HTTP header definitions which have been standardized by the W3C.


# Trade-offs and mitigations

## Why separate Baggage from Correlations?

Since Baggage Context and Correlation Context appear very similar, why have two?

First and foremost, the intended uses for Baggage and Correlations are completely different. Secondly, the propagation requirements diverge significantly.

Correlation values are solely to be used as labels for metrics and traces. By making Correlation Context data write-only, how and when it is transmitted remains undefined. This leaves the door open to optimizations, such as propagating some data out-of-band, and to situations where sampling decisions remove the need to propagate correlation context any further.

Baggage values, on the other hand, are explicitly added in order to be accessed downstream by other application code. Therefore, Baggage Context must be readable, and reliably propagated in-band in order to accomplish this goal.

There may be cases where a key-value pair is propagated as a TagMap for observability and as Baggage for application-specific use. A/B testing is one example of such a use case. There is potential duplication here, both at the call site where the pair is created and at propagation.

Solving this issue is not worth the semantic confusion of a dual-purpose mechanism. However, because all observability functions take the complete context as input, it may still be possible to use baggage values as labels.


## What about complex propagation behavior?

Some OpenTelemetry proposals have called for more complex propagation behavior. For example, having a fallback to extracting B3 headers if Trace-Context headers are not found. Chained propagators and other complex behavior can be modeled as implementation details behind the Propagator interface. Therefore, the propagation system itself does not need to provide chained propagators or other additional facilities.
Member

> the propagation system itself does not need to provide chained propagators

So what happens when multiple propagators are registered for the same type? I think chained propagators may be cumbersome to provide on top of an API that allows only one propagator, since it requires cooperation between everyone who wants to add a propagator to the chain.

Contributor

I have seen code that does this, by using a stack-like Propagator that tries to inject/extract from a list of propagators (order is important): for extraction, it returns with the first successful attempt, and for injection it simply tries to inject all formats. And yes, we should have a test case for this scenario, so we know it works fine.

Contributor Author

So, here is the nuance: there is basic chaining provided at the API level by RegisterPropagator. All propagators are run for every type, in the order in which they are registered. This is where the cooperation happens between independent applications.

More complex propagation behavior usually ends up being specific to each application. So, the OTel SDK can provide the kind of fallback W3C -> B3 chaining for observability, described here. That sort of behavior is presented as a single, complex propagator, from the point of view of the Propagation API, since the fallback behavior is internal to a single application. Does that make sense?



## Did you add a context parameter to every API call because Go has infected your brain?
Member

Curious, is this just for fun?

Contributor Author

Haha, well I wanted to address what I perceive to be a common question, along the lines of "why is this context parameter everywhere? Is it because this is a golang project? Is it required that every language must pass context this way?"

But if the humor gets in the way of learning, I can change it.

Member

I like it. You can also stress that it's not just a Go thing - in the old versions of Node there was no CLS (or it was extremely inefficient) and explicitly passing the context was how the propagation was achieved.

Contributor Author

Good call, I will emphasize that this issue exists in multiple languages.


No. The concept of an explicit context is fundamental to a model where independent distributed applications share the same context propagation layer. How this context appears or is expressed is language specific, but it must be present in some form.


# Prior art and alternatives

Prior art:
* OpenTelemetry distributed context
* OpenCensus propagators
* OpenTracing spans
* gRPC context

# Open questions

Related work on HTTP propagators has not been completed yet.

* [W3C Trace-Context](https://www.w3.org/TR/trace-context/) candidate is not yet accepted
* Work on [W3C Correlation-Context](https://w3c.github.io/correlation-context/) has begun, but was halted to focus on Trace-Context.
* No work has begun on a theoretical W3C Baggage-Context.

Given that we must ship with working propagators, and the W3C specifications are not yet complete, how should we move forwards with implementing context propagation?

# Future possibilities

Cleanly splitting OpenTelemetry into an Application and Context Propagation layer may allow us to move the Context Propagation layer into its own, stand-alone project. This may facilitate adoption, by allowing us to share Context Propagation with gRPC and other projects.
Binary file added text/img/context_propagation_details.png
Binary file added text/img/context_propagation_explanation.png