Add Getter's Keys operation. #825

carlosalberto · 2020-08-18T12:58:42Z

Fixes #713
Addresses part of #433

Changes

Getter now has an additional Keys()operation in order to get all the keys in a given carrier. It will help support variable key names formats, such as the Jaeger one (as it has variable uberctx- entries).

The change is relatively trivial, as it's a simple addition. Most existing Propagators won't need to update its behavior, and only a few ones (such as the OT or the Jaeger one) can use the additional Keys() operation to do additional work.

One thing to pay attention to is the existing HttpTextPropagator.fields() operation, which is kept but a note is added, mentioning it only includes predefined fields.

Wondering if we should use either fields or keys for both mentioned operations, and have a more homogeneous experience.

Related PR shown as prototype: open-telemetry/opentelemetry-java#1549

cc @tylerbenson @yurishkuro @bogdandrutu

richardstartin · 2020-08-18T13:49:55Z

Context extraction isn't exactly compute intensive but a lot of context carriers don't support random access. So if you want to implement this specification for e.g. Kafka client you will need to pass over the headers once to materialise the keys, then you have a quadratic pass extracting the values. The headers are almost always going to be quite small, but it would just be a lot lighter-weight to provide an API based on iterators over key-value pairs e.g. Iterable<Map.Entry<String, String>> so you can do it all in one pass. Happily, this supports most context carriers which do support random access too.

specification/context/api-propagators.md

SergeyKanzhelev · 2020-08-18T18:02:18Z

Context extraction isn't exactly compute intensive but a lot of context carriers don't support random access. So if you want to implement this specification for e.g. Kafka client you will need to pass over the headers once to materialise the keys, then you have a quadratic pass extracting the values. The headers are almost always going to be quite small, but it would just be a lot lighter-weight to provide an API based on iterators over key-value pairs e.g. Iterable<Map.Entry<String, String>> so you can do it all in one pass. Happily, this supports most context carriers which do support random access too.

Good point. Couple points here - context extraction is on the hot path of any telemetry system and context is extracted even when telemetry is sampled out. So even small optimization here are important. On some systems well known headers are stored in a dedicated variables for faster access. So completely switching to iterator approach may hurt performance.

I want to understand the jaeger- propagator case better, @carlosalberto please add a link to the spec. Understanding the use case will help to make sure we actually need this new iterator.

SergeyKanzhelev

comments and please add more information about the use case to spec why this thing is needed

richardstartin · 2020-08-18T18:12:52Z

Couple points here - context extraction is on the hot path of any telemetry system and context is extracted even when telemetry is sampled out. So even small optimization here are important.

Couldn't agree more.

On some systems well known headers are stored in a dedicated variables for faster access

That's fine if the carrier supports random access; many don't even if they present an API like get(key) - it often turns out it's actually a linear search. Moreover, how can case insensitivity (e.g. W3C trace context) be supported without a worst case linear time access, or preprocessing (e.g. building a case-insensitive trie) that no context carriers do?

SergeyKanzhelev · 2020-08-18T18:18:04Z

Couple points here - context extraction is on the hot path of any telemetry system and context is extracted even when telemetry is sampled out. So even small optimization here are important.

Couldn't agree more.

On some systems well known headers are stored in a dedicated variables for faster access

That's fine if the carrier supports random access; many don't even if they present an API like get(key) - it often turns out it's actually a linear search. Moreover, how can case insensitivity (e.g. W3C trace context) be supported without a worst case linear time access, or preprocessing (e.g. building a case-insensitive trie) that no context carriers do?

@richardstartin with the introduction of a new accessor for the fields, unless we will have technology-specific propagators, propagators would use one accessor or another. In order to address the issue you are bringing up, we actually need to have separate propagators, depending on the underlying technology to access fields in the most efficient way. Are you advocating for this? If so, this PR may not be enough to solve this issue.

richardstartin · 2020-08-18T18:51:36Z

In order to address the issue you are bringing up, we actually need to have separate propagators, depending on the underlying technology to access fields in the most efficient way.

You have to do this anyway. When you get in to the gritty details, hardly any context carriers are nice generic types and they need to be handled on a case by case basis. I presume the challenge here is to find an abstraction which models the most context carriers in the least bad way.

Concretely, what I'm advocating for is that the context extractor should have an interface like:

   Iterable<Pair<String, String>> contextEntries()

as opposed to

   Iterable<String> keys()

   String get(String key);

Just to reiterate, since some codecs mandate case-insensitivity, the extractor implementation can't be better than linear time, but what's being proposed here is quadratic time unless the underlying context carrier supports random access, and many implementations don't.

SergeyKanzhelev · 2020-08-18T19:02:49Z

In order to address the issue you are bringing up, we actually need to have separate propagators, depending on the underlying technology to access fields in the most efficient way.

You have to do this anyway. When you get in to the gritty details, hardly any context carriers are nice generic types and they need to be handled on a case by case basis. I presume the challenge here is to find an abstraction which models the most context carriers in the least bad way.

Concretely, what I'm advocating for is that the context extractor should have an interface like:
   Iterable<Pair<String, String>> contextEntries()
as opposed to
   Iterable<String> keys()

   String get(String key);
Just to reiterate, since some codecs mandate case-insensitivity, the extractor implementation can't be better than linear time, but what's being proposed here is quadratic time unless the underlying context carrier supports random access, and many implementations don't.

I don't quite agree that we are looking for an "abstraction which models the most context carriers in the least bad way". Telemetry may not be the only think that reads header. And if platform is optimized to read headers once and store well-known values in a predefined variables, getter approach doesn't induce any additional iteration. As we are advocating to integrate telemetry into software, this pattern may become dominant over time. System that only support iterations may implement a proxy layer that iterates once and stores variables for getter to get them. Which almost as good as a single iteration inside the propagator.

If we switch everything to iterators, we are loosing this optimization.

So I'm looking to understand the scenario when propagators have to iterate over key/values.

richardstartin · 2020-08-19T05:40:50Z

@SergeyKanzhelev sorry I think I may have jumped ahead because I am already familiar with the problem this proposal aims to solve and also the implementation challenges it presents.

Motivation

Why might Getter<C>.keys(C carrier) be required? Why isn't Getter.get(key) enough? Two reasons:

case insensitive codecs For example in W3C trace context it is not enough to get traceparent, TRACEPARENT, TraceParent and TrAcEpArEnT must all be supported by specification. "Vendors MUST expect the header name in any case (upper, lower, mixed), and SHOULD send the header name in lowercase". Most context carriers require the use to iterate to extract these keys, and some support case insensitive key comparisons efficiently.
prefix based context keys e.g. existing open tracing ot-baggage-, haystack baggage prefix, uberctx- (mentioned above).

Get by key doesn't support either of these extraction use cases, hence the need for Getter.keys(C carrier)

How would context extraction look with proposal?

A specific implementation would iterate over the keys, match them against known values, and get the values the implementation is interested in. With prefix based baggage, this might look like what's below.

MyContext extract(C carrier, Getter<C> getter) {
     MyContext extracted = new MyContext(); 
     for (String key : getter.keys(carrier) {
        if ("trace-id".equalsIgnoreCase(key)) {
           extracted.traceId = getter.get(carrier, key);
        } else if ("parent-id".equalsIgnoreCase(key) {
           extracted.parentId = getter.get(carrier, key);         
        } else if (key.toLowerCase().startsWith("baggage-") {
          extracted.baggage.put(key.substring("baggage-".length(), getter.get(carrier, key);
        } 
    }
}

Shortcomings of this proposal

Unless Getter.get is constant time, the loop above is quadratic in the number of headers in the context carrier. I hope a code example will help to understand the consequences of this proposal.

Let's look at the API of one possible context carrier, Kafka client Headers, and implement the Getter for it.

public interface Headers extends Iterable<Header> {
   // irrelevant methods removed
    
    /**
     * Returns just one (the very last) header for the given key, if present.
     * 
     * @param key to get the last header for.
     * @return this last header matching the given key, returns none if not present.
     */
    Header lastHeader(String key);

}

For reference, here's the actual implementation of RecordHeaders.lastHeader(key), which is a linear search:

    @Override
    public Header lastHeader(String key) {
        checkKey(key);
        for (int i = headers.size() - 1; i >= 0; i--) {
            Header header = headers.get(i);
            if (header.key().equals(key)) {
                return header;
            }
        }
        return null;
    }

The only plausible way to implement Getter as proposed here is as follows:

public class KafkaHeadersGetter implements Getter<Headers> {

  @Override
  public Iterable<String> keys(final Headers headers) {
    final List<String> keys = new ArrayList<>();
    for (final Header header : headers) {
      keys.add(header.key());
    }
    return keys;
  }

  @Override
  public String get(final Headers headers, final String key) {
    final Header header = headers.lastHeader(key);
    return new String(header.value(), StandardCharsets.UTF_8);
  }

This makes the loop above quadratic in the number of headers. For some carriers, e.g. RabbitMQ just uses a Map, it's not such a big deal.

Alternative for supporting case insensitivity/prefix matches

Given that there are prefix based and case insensitive codecs out there, I would suggest one of the least bad options, assuming a limited complexity budget, is actually iterating over key value pairs. I will illustrate with the Kafka example

Getter<C> becomes:

interface Getter<C> {

    Iterable<Pair<String, String>> contextValues(C carrier);

}

The loop above for extracting the values above becomes:

MyContext extract(C carrier, Getter<C> getter) {
     MyContext extracted = new MyContext(); 
     for (Pair<String, String pair : getter.contextValues(carrier) {
        if ("trace-id".equalsIgnoreCase(pair.getKey())) {
           extracted.traceId = pair.getValue();
        } else if ("parent-id".equalsIgnoreCase(pair.getKey()) {
           extracted.parentId = pair.getValue();
        } else if (pair.getKey().toLowerCase().startsWith("baggage-") {
          extracted.baggage.put(pair.getKey().substring("baggage-".length(), pair.getValue());
        } 
    }
}

Which is now guaranteed to be linear in the headers in the carrier.

The Kafka Headers example becomes:

public class KafkaHeadersGetter implements Getter<Headers> {

  @Override
  public Iterable<Pair<String, String>> contextValues(final Headers headers) {
    final List<Pair<String, String>> contextValues = new ArrayList<>();
    for (final Header header : headers) {
      contextValues.add(Pair.of(header.key(), new String(header.value(), StandardCharsets.UTF_8));
    }
    return contextValues;
  }

This is a straw-man example, and could be implemented more efficiently with an iterator, and more efficient still would be to push key matching down to where the byte[] is converted to the String:

public class KafkaHeadersGetter implements Getter<Headers> {

   private final Predicate<String> keyPredicate;

  @Override
  public Iterable<Pair<String, String>> contextValues(final Headers headers) {
      return () -> new HeadersIterator(keyPredicate, headers.iterator());
  }

  private static final class HeadersIterator implements Iterator<Pair<String, String>> {
        private final Predicate<String> keyPredicate;
        private final Iterator<Header> headers;

        private Header next;

       public (Predicate<String> keyPredicate, Iterator<Header> headers) {
          this.keyPredicate = keyPredicate;
          this.headers = headers;
       }

       @Override
       public Pair<String, String> next() {
            if (null == next) {
                throw new NoSuchElementException();
            }
            return Pair.of(next.key(), new String(next.value(), StandardCharsets.UTF_8);
       }

       @Override
       public boolean hasNext() {
          this.next = null;
          while (headers.hasNext() {
            Header next = headers.next();
            if (keyPredicate.test(next.key()) {
                this.next = next;
                return true;
            }
          }
          return false;
      }
  }

In summary, I think it's important to consider what constraints a specification might impose on the implementations, and the edge case consequences of this proposal are fairly limiting.

anuraaga · 2020-08-19T11:17:58Z

@richardstartin Thanks for this detailed example! A couple things came to mind

The same point probably applies to general multi propagators. If a server, for interop, supports w3c, b3, etc, then each .get is linear. b3-based headers actually extract four headers by themselves.
Maybe this can be considered an instrumentation issue instead of SDK. If a library is known to have slow header lookup, instrumentation could copy into a map. This has the advantage over an iterator API that conversion only happens once and all registered propagators would benefit, IIUC the proposed to use an iterator would still require iterating over all keys for each propagator. This is especially unfortunate since multi propagator is usually used knowing that only one key will actually be present, just not sure which one.

I think the vast majority of libraries support random access to headers so lean towards dealing with it in instrumentation but interested in hearing any thoughts!

carlosalberto · 2020-08-19T12:42:48Z

If a library is known to have slow header lookup, instrumentation could copy into a map. This has the advantage over an iterator API that conversion only happens once and all registered propagators would benefit, IIUC the proposed to use an iterator would still require iterating over all keys for each propagator.

Agreed.

Besides, even though a full Iterator pattern would help Kafka, it would not exactly improve many other libraries that expose the headers as a collection that can be accessed through a Get operation (servlets, Armeria, OkHttp in Java; Django, Flask or Pyramid under Python, for example).

Alternatively, we could offer instead of a Keys() iterator an Items() iterator. But observe that such iterator will be mostly used by propagators with variable key names (Jaeger, OT), as the rest will always use Get instead.

carlosalberto · 2020-08-19T12:43:06Z

Ping @yurishkuro @tylerbenson ;)

richardstartin · 2020-08-19T12:43:16Z

@anuraaga

I think the vast majority of libraries support random access to headers so lean towards dealing with it in instrumentation but interested in hearing any thoughts!

Sadly not - Kafka's just one example.

Here's Akka HTTP, a linear search:

  def getHeader(headerName: String): Optional[jm.HttpHeader] = {
    val lowerCased = headerName.toRootLowerCase
    Util.convertOption(headers.find(_.is(lowerCased))) // Upcast because of invariance
  }

Supports single pass iteration over all the pairs though.

Grizzly Request, another linear search:

    public DataChunk getValue(String name) {
        for (int i = 0; i < count; i++) {
            if (headers[i].getName().equalsIgnoreCase(name)) {
                return headers[i].getValue();
            }
        }
        return null;
    }

Supports single pass iteration via MimeHeaders though.

But I think I can let this go at this point :)

carlosalberto · 2020-08-19T19:26:29Z

@SergeyKanzhelev feedback applied. Please re-review.

SergeyKanzhelev · 2020-08-20T17:51:54Z

But I think I can let this go at this point :)

@richardstartin in these technologies, do we need an iterator over key/value pairs or Keys is enough?

specification/context/api-propagators.md

SergeyKanzhelev

LGTM, but I'm seeking for additional feedback on whether it is the right abstraction and whether we need iterator over Keys or key-value pairs.

richardstartin · 2020-08-20T18:23:14Z

@richardstartin in these technologies, do we need an iterator over key/value pairs or Keys is enough?

I've stated my case - this abstraction leads to accidentally quadratic loops on the critical path, albeit over small inputs.

Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>

carlosalberto · 2020-08-21T14:34:30Z

@tylerbenson @bogdandrutu @yurishkuro Any preference regarding Keys() vs Keys/Values()?

yurishkuro · 2020-08-21T15:51:50Z

Any preference regarding Keys() vs Keys/Values()?

I don't have a strong preference, but there may be performance considerations where one is more efficient than the other depending on the language. I wouldn't mind leaving it up to the language to define which form is more efficient and idiomatic. If we don't want that, then I would go with Keys(), because given the actual use case (e.g. with Jaeger baggage), only keys need to be inspected, and values will be needed for potentially smaller number of keys.

carlosalberto · 2020-08-21T15:58:42Z

@yurishkuro Sounds great. So will do a follow-up to mention that the language implementations are free to return the keys or keys/values in the iterator.

carlosalberto · 2020-08-22T15:31:07Z

@open-telemetry/specs-approvers Please review/approve this PR ;)

SergeyKanzhelev · 2020-08-24T19:56:02Z

This PR is coming from the need to support the "uber-" keys. If we cannot find enough approvers for the PR - it may be indicative of the fact that nobody needs this functionality in OpenTelemetry 1.0.

@yurishkuro what would be blocked without this support? Maybe making the downside of not having this merged would convince more people to review and approve.

yurishkuro · 2020-08-24T20:11:32Z

@yurishkuro what would be blocked without this support? Maybe making the downside of not having this merged would convince more people to review and approve.

Without iterator on keys the existing Jaeger context propagation format cannot be implemented.

Add of Getter's Keys operation.

c3fecad

carlosalberto added area:api Cross language API specification issue spec:context Related to the specification/context directory spec:trace Related to the specification/trace directory labels Aug 18, 2020

carlosalberto requested review from a team as code owners August 18, 2020 12:58

carlosalberto requested a review from SergeyKanzhelev August 18, 2020 12:58

carlosalberto mentioned this pull request Aug 18, 2020

Add Getter.Keys() with Jaeger Baggage support. open-telemetry/opentelemetry-java#1549

Merged

bogdandrutu assigned SergeyKanzhelev Aug 18, 2020

SergeyKanzhelev reviewed Aug 18, 2020

View reviewed changes