Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representation of multi-valued attributes #341

Closed
fbogsany opened this issue Nov 2, 2019 · 21 comments · Fixed by #368
Closed

Representation of multi-valued attributes #341

fbogsany opened this issue Nov 2, 2019 · 21 comments · Fixed by #368
Assignees

Comments

@fbogsany
Copy link
Contributor

fbogsany commented Nov 2, 2019

Memcached operations such as get can specify multiple keys. If we want to add an attribute keys with a list of the keys included in the operation, we can do so with an arbitrary separator (e.g. span.set_attribute('keys', keys.join('|'))), however this requires coordination between instrumentation and analysis backends. Can we either define semantic conventions for representation of multi-valued attributes, or allow for multiple attributes on a span with the same key?

cc @tigrannajaryan

@Oberon00
Copy link
Member

Oberon00 commented Nov 2, 2019

Other use case for multi-valued attributes: HTTP headers (would require a multimap), list of HTTP proxies used.

@jmacd
Copy link
Contributor

jmacd commented Nov 3, 2019

See also #76, a call for structured attribute values.

@tigrannajaryan
Copy link
Member

This has been discussed a while ago and we didn't have a specific use-case to demonstrate the need. I think this shows the need clearly.

The relevant part of the spec [1] that needs to be changed is:

Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.

Note that this also requires a support in exporting protocols. Proposed OTLP [2] can support this, so can Jaeger Protobuf [3]. OpenCensus proto cannot support this directly because attributes are a map there [4], so we will need to come up with a reasonable way to translate to formats that require attribute keys to be unique.

[1] - https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#set-attributes
[2] - open-telemetry/oteps#59
[3] - https://github.com/jaegertracing/jaeger/blob/4c7be1a10438590738779b54a151a0055b387132/model/proto/model.proto#L126
[4] - https://github.com/census-instrumentation/opencensus-proto/blob/5cec5ea58c3efa81fa808f2bd38ce182da9ee731/src/opencensus/proto/trace/v1/trace.proto#L158

@jmacd
Copy link
Contributor

jmacd commented Nov 4, 2019

Why not support list-valued attributes? I would prefer not to change the specification that states:

Setting an attribute with the same key as an existing attribute SHOULD overwrite the existing attribute's value.

@tigrannajaryan
Copy link
Member

Why not support list-valued attributes?

List of what thought? Is it a list of primitives only? Can it be a list of lists (and recursively so)? That makes it equivalent to #76

This is also difficult to support in exporters. How do we translate a list-valued attribute to Jaeger tags for example? Do we flatten the list? If most of exporters are going to flatten the list then why don't we start with a flat list in the first place?

@jmacd
Copy link
Contributor

jmacd commented Nov 4, 2019

I suggest this because it would allow future exporters to enrich their data structure and support any JSON-type value. In a structured logging system, it will be very natural to have structured key-values, recursively so, and I'd like to maintain consistency between span event key-values and span attributes.

@jmacd
Copy link
Contributor

jmacd commented Nov 4, 2019

Also to answer your question, I would format these values as JSON and set them as strings.

@tigrannajaryan
Copy link
Member

I suggest this because it would allow future exporters to enrich their data structure and support any JSON-type value. In a structured logging system, it will be very natural to have structured key-values, recursively so, and I'd like to maintain consistency between span event key-values and span attributes.

I am not against structured values but I think that's a separate topic. Even with structured values I would still suggest to support multi-values (essentially creating a tree of multimaps).

Let me keep this debate going though since I am not convinced yet one way or another. I want to see more arguments in favor of the approaches.

One thing I don't like about list-valued attributes proposal is that it results in multi- and single-value attributes to have a different data structure, e.g. given the usecase in this issue I can have one of the following 2 structures depending on whether I am getting a single or multiple keys from Memchached:

{
  "keys": "my-first-key"
}
{
  "keys": ["my-first-key", "my-second-key"]
}

If I now want to search for all Spans representing Memcached get of "my-first-key" I need to have a complicated (possibly recursive) tree search.

If it was a multi-valued list my search approach would most likely be the same regardless of how many values of the same key I have. Assuming I am storing the data in a RDBMS the flat version is more natural and likely faster unless my RDBMS supports JSON indexing (and even if it does my search is likely still more complicated than the flat version).

@jmacd
Copy link
Contributor

jmacd commented Nov 5, 2019

For the record, the LightStep tracer supports JSON-valued inputs. The OpenTracing APIs support this to some extent, so for example the Golang Span.SetTag API takes an interface, and if you pass a non-scalar value the tracer will use json.Marshal to encode a string value.

To me, multimap semantics raise all sorts of new complicated questions about the API. Should we support a separate API to overwrite? Should we support deletion by key:value?

I see your concerns about RDBMS representation as out of place at the API level. If you received a JSON-encoded list, you could certainly choose to index it by flattening the list. If you receive a JSON-encoded dictionary, you could recursively expand it and index flat key:values. These are questions about the schema of the data.

That leaves only one real question, to me. It's convenient to say there are multi-map semantics, because it lets a user call "set" multiple times without remembering any state on their own. If we only support unique map semantics, then the user is required to remember all their values and assemble a list on their own. This would be my preference because it simplifies the API considerably. Set has overwrite semantics, and if you want to set multiple times with a list, you can. But it doesn't require any new API surface to replace an already-set value.

@tigrannajaryan
Copy link
Member

I see your concerns about RDBMS representation as out of place at the API level.

This is true.

I think I agree with what you are saying. As soon as we are able to support multiple values unambiguously and there is a clear convention on how it must be done I am OK with using array values for that. If we have a convention the backend is free to flatten or store however it wants.

@fbogsany thoughts on this? Does it work for you?

@fbogsany
Copy link
Contributor Author

fbogsany commented Nov 7, 2019

👍 I'm fine with a clear convention.

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Nov 28, 2019

Here is a proposal to support multi-value attributes. Please comment.

API Changes

In Tracing API specification, Set Attributes section replace this:

(Required) The attribute value, which must be either a string, a boolean value, or a numeric type.

By this:

- (Required) The attribute value, which is either:
  - A primitive type: string, boolean or numeric.
  - An array of primitive type values. The array MUST be homogeneous, i.e. it MUST NOT contain values of different types.

Protocol Support

OTLP

Add support for array types to AttributeKeyValue message:

  1. Add ARRAY=4 to ValueType enum.
  2. Add new field repeated AttributeKeyValue array_value=7 to contain child elements of the array.
  3. Require that if ValueType==ARRAY then key field of all child elements must be empty.

Protocols with no built-in support for arrays

To store an array of values encode it as a JSON and store as a string value.

For example to store an array of 3 elements 10, 15 and 25 for key "redis" create the following key/value pair:

redis = [10, 15, 25]

Semantic Conventions

Add the following paragraph to Semantic Conventions document:

Resource and Span attributes may be one of the following types:
  - A primitive type: string, boolean or numeric.
  - An array of primitive type values. The array MUST be homogeneous, i.e. it MUST NOT contain values of different types.

The type of the attribute SHOULD be specified in the semantic convention for that attribute. For protocols that do not natively support array values such values MUST be represented as JSON strings.

@Oberon00
Copy link
Member

I would not require support for non-homogenous arrays and I would even go as far as forbidding semantic conventions to use them. From the top of my head, I can't think of any use cases for them and I think that they will lead to complicated or ugly APIs in some languages.

@fbogsany
Copy link
Contributor Author

I would not require support for non-homogenous arrays and I would even go as far as forbidding semantic conventions to use them. From the top of my head, I can't think of any use cases for them and I think that they will lead to complicated or ugly APIs in some languages.

Agreed. My use case only requires homogenous arrays. I'm looking for an array of strings, but I could also see a use for arrays of booleans (which keys in a memcached multi get were hits) and arrays of numbers (size of each value returned).

@tigrannajaryan
Copy link
Member

Updated the proposal to require homogeneous arrays.

@tigrannajaryan
Copy link
Member

Note that the requirement to have homogeneous arrays also places limits in arbitrary nested structures that we may want to support in the future. If such support is added most likely there will be a desire to allow non-homogeneous array elements (so that you can e.g. directly represent JSON data).

@bogdandrutu
Copy link
Member

I don't think the json representation is useful for Metrics. Hanse one more reason to not have the same representation for attributes in tracing and labels in metrics

tigrannajaryan pushed a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2019
Resolves open-telemetry#341

Array values are useful for representing attributes that can have
multiple values (e.g. representing a Memcached `get` that can be
over multiple keys and making the keys a span attribute).

This change allows homogeneous array values for span attributes
and specifies that array values should be JSON encoded stirng when
exporting using protocols that do not natively support array values.
tigrannajaryan pushed a commit to tigrannajaryan/opentelemetry-specification that referenced this issue Dec 2, 2019
Resolves open-telemetry#341

Array values are useful for representing attributes that can have
multiple values (e.g. representing a Memcached `get` that can be
over multiple keys and making the keys a span attribute).

This change allows homogeneous array values for span attributes
and specifies that array values should be JSON encoded stirng when
exporting using protocols that do not natively support array values.
@jmacd
Copy link
Contributor

jmacd commented Dec 3, 2019

The homogeneous requirement is fine, thanks!

yurishkuro pushed a commit that referenced this issue Dec 6, 2019
Resolves #341

Array values are useful for representing attributes that can have
multiple values (e.g. representing a Memcached `get` that can be
over multiple keys and making the keys a span attribute).

This change allows homogeneous array values for span attributes
and specifies that array values should be JSON encoded stirng when
exporting using protocols that do not natively support array values.
@codefromthecrypt
Copy link

simple joining on comma is cheaper to index split and join and more supportable vs json arrays. I know folks are already doing this and it requires no more coordination on backend than making a json array. the only thing you gain with json is a an escaping syntax. I know ship has sailed, but this one needs an alternative opinion mentioned for posterity

@codefromthecrypt
Copy link

also the use case of tagging the input to multi-get is generalizable as request tagging. it is very common to have json strings when serializing requests into tags. A concrete non-request tagging use case for array values are things like experiment IDs. Usually these are bounded to much lower cardinality, yet need to be indexed for lookup. It is possible that someone wants to index all redis keys, also, but yeah I think this is more a request logging thing vs anything else.

hope the context helps for the next similar discussions

SergeyKanzhelev pushed a commit to SergeyKanzhelev/opentelemetry-specification that referenced this issue Feb 18, 2020
Resolves open-telemetry#341

Array values are useful for representing attributes that can have
multiple values (e.g. representing a Memcached `get` that can be
over multiple keys and making the keys a span attribute).

This change allows homogeneous array values for span attributes
and specifies that array values should be JSON encoded stirng when
exporting using protocols that do not natively support array values.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants
@codefromthecrypt @Oberon00 @bogdandrutu @fbogsany @jmacd @tigrannajaryan and others