Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Mandatory Unique Identifier For Telemetry Sources #194

Closed
wants to merge 2 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# Mandatory unique identifier for sdk-based telemetry sources

Provide an explicit mandatory unique identifier for sdk-based telemetry sources.

## Motivation

Having a way to uniquely identify a telemetry source is helpful in many ways, like in processing and storing data from that source, visualizing them in a backend UI or debugging issues with that source and it's data.

For sdk-based telemetry sources, as of now `service.name` (and related attributes `service.namespace` and `service.instance_id`) are the implicit standard for that due to `service.name` being enforced as mandatory by the [Resource SDK specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md#sdk-provided-resource-attributes) and [Resource Semantic Conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md#semantic-attributes-with-sdk-provided-default-value).

But, because those attributes are not **explicitly** available to uniquely identify a sdk-based telemetry source, multiple issues are calling out problems with the current state:

* [opentelemetry-specification/issues#1034](https://github.com/open-telemetry/opentelemetry-specification/issues/1034) calls out that `service.instance.id` is poorly defined right now and should be replaced by something more meaningful, that can help to uniquely identify a SDK-based telemetry source.
* [open-telemetry/opentelemetry-specification#2111](https://github.com/open-telemetry/opentelemetry-specification/pull/2111) calls out that there is no proper definition of what a `Service` is and that a proper definition is important since `service.name` is such an important attribute
* [open-telemetry/opentelemetry-specification#2115](https://github.com/open-telemetry/opentelemetry-specification/pull/2115) asks for introducing `app.name` and others alongside `service.name` since client-side applications (browser, mobile) are **not** services and end-users might be confused by calling them a _Service_.
* [open-telemetry/opentelemetry-specification#2192](https://github.com/open-telemetry/opentelemetry-specification/pull/2192)) is providing a middle ground between `app.name` and `service.name` by suggesting `telemetry.source.name` as broader term.

To address all requirements outlined in those approaches, we are proposing the following combined approach for uniquely identifying a SDK-based telemetry source:

* Introduce an `telemetry.sdk.source.id` attribute, which MUST either be autogenerated by the SDK at application start or be supplied via an environment variable to the SDK. This will be the unique identifier for an SDK-based telemetry.source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why telemetry.sdk.source.id and not telemetry.source.id? How is "sdk" relevant to the description of the workload that is producing telemetry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's one of my open questions:

Is the namespace telemetry.sdk.source.* suitable? Alternative names could be used: [...] telemetry.source.* as suggested by open-telemetry/opentelemetry-specification#2192. The difference is that[telemetry.sdk.source] does state explicitly that only SDK-based telemetry sources are covered. This is not necessarily bad, since other telemetry sources could decide to use it as well. [...]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to drop the word "source" and have telemetry.sdk.id (or telemetry.sdk.instance.id) to identify the SDK. This will match other telemetry.sdk.* attributes. This also fits the requirements of possible future remote SDK management capabilities, where the SDK is actually the entity that is being managed.

To identify the Service we will (continue to) use service.* attributes, to identify other types of sources which are not Services we can introduce new sets of attributes (e.g. for client-side apps as was discussed earlier).

If we call this attribute telemetry.source.id it is not clear what does it identify. Does it identify the Service? If yes then do we remove service.instance.id? If we remove it then do we use a mix of telemetry.source.id+service.* attributes to identify the Service? This does not look very consistent to me.

Generally, for any particular kind of entity that we want to identify I think we should have a set of attributes <kind>.* that are defined in semantic conventions specifically for that particular kind. This has been the practice so far and telemetry.source.id would be a conceptual deviation from the practice which I don't see why we need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan personally I am not completely sold on telemetry.source.id and I think it should be a different OTEP (i.e. assigning an opaque unique ID to a source vs. identifying the source via a collection of domain-specific attributes, like k8s.pod etc.) from the question of how to generalize service.name to different types of workloads.

I also find it more intuitive to have service.name and app.name vs. a single (and ambiguous) telemetry.source.name. The issue is how that affects existing backends like Jaeger where service.name is a core concept and app.name means nothing at all. Should the exporters to such backends translate app.name (OTEL) -> service.name (Jaeger)? This translation becomes easier / more consistent if we use a single telemetry.source.name instead of domain-specific attributes, but I agree that it diverges from most of the existing conventions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is how that affects existing backends like Jaeger where service.name is a core concept and app.name means nothing at all. Should the exporters to such backends translate app.name (OTEL) -> service.name (Jaeger)?

I think this is the right approach. Each exporter should be responsible for specialized needs of the particular backend they target.

I don't think we should try to reduce Otel conceptually to the lowest common denominator of features supported by all backends we want to support. We should probably accept the fact that some backends may have different requirements and if a particular Otel capability is not exactly representable in a particular backend that's fine as long as there is a reasonable mapping. So the logic in Jaeger exporter can be for example:

if otel service.name is defined:
  set jaeger service.name to otel service.name 
else if otel app.name is defined:
  set jaeger service.name to otel app.name 
else
  set jaeger service.name to "unknown-service:"+ otel process.executable.name // mimic current behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine if this was only Jaeger, but from the comments in different threads it seems a lot of backends would prefer a single attribute. Maybe a fair question is: are there (how many) backends that would be completely fine with not having a single attribute identifying the source by low cardinality name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question, we probably want vendors to chime in.

Just one data point: the AWS Xray exporter in Collector for example probes a series of different attributes (including but not limited to service.name) to arrive at a "name", see: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/af3d9b1dc9f1da2e4f71944a32e388368fb0f5e6/exporter/awsxrayexporter/internal/translator/segment.go#L131

I would expect each vendor to either have an equivalent logic that makes sense for their backend or perhaps we extract this as a common logic somewhere for all vendors to use.

Note that vendors likely need to be ready for this anyway since Collector doesn't necessarily set service.name for all data it collects/produces on its own.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one data point: the AWS Xray exporter in Collector for example probes a series of different attributes (including but not limited to service.name) to arrive at a "name"

I think this is a very good example, why there should be clarity on how to identify a telemetry source and why I opened my issues initially and now this OTEP:

As of now it is really hard to say "this telemetry source is called "?

I would expect each vendor to either have an equivalent logic that makes sense for their backend or perhaps we extract this as a common logic somewhere for all vendors to use.

That common logic is what I am looking for. My initial assumption was that service.namespace, service.name and service.instance.id are doing the job, but as we know now this has the following issues:

  1. not everybody thinks of their workload as a "service"
  2. it's not clear how service.instance.id is filled (see Guidance for filling (or consider removing?) service.instance.id opentelemetry-specification#1034)

The intend of telemetry.(sdk?).source.(id|name|namespace) was to avoid that kind of lengthy checks we see in the AWS Xray exporter (@martinkuba brought up this argument here as well: open-telemetry/opentelemetry-specification#2192)

An alternative, that was suggested at some point, is a <whatever>.type field which is then app or service or ... and can help the implementation to jump to the correct attributes immediately.

* Remove `service.instance.id` as attribute, since it is superseded by the `telemetry.sdk.source.id`
* Replace `service.name` and `service.namespace` with attributes `telemetry.sdk.source.name` and `telemetry.sdk.source.namespace`, to have a more broad term for identification.
* Make `telemetry.sdk.source.name` the attribute which MUST be provided by the SDK.
* Provide backward compatibility with service.name by adopting [open-telemetry/oteps#161](https://github.com/open-telemetry/oteps/pull/161)
* Backend specific exporters who rely on `service.name` should set a default value themselves if the attribute is missing
* Add a term definition for `Service` and `App` to the specification glossary, which are non-overlapping.
* Introduce further attributes to describe the telemetry source where needed, e.g. `telemetry.sdk.source.version`, `app.bundle`, `app.short_version`, ...

## Explanation

With those changes in place, the following use cases will be covered:

* If one of many instances of a SDK-based telemetry source is in an erroneous state, the user can quickly identify that instance using the `telemetry.sdk.source.id` and fix the issue. This will improve observability of OTel SDKs themselves.
* With replacing `service.name` with `telemetry.sdk.source.name` frontend applications and other sources, which are not seen as _Services_ by their application owners can be named in a more user-expected way. They also can use different scopes like `app` for additional attributes which might not be reasonable for a backend `service`
* Collectors & backends can use `telemetry.sdk.source.id` (or the combination of `id`, `name` and `namespace`) as unique identifier for storing data, processing data & displaying data.

## Internal details

Replacing `service.instance.(id|name|namespace)` with `telemetry.sdk.source.(id|name|namespace)` will require a mechanism to provide backward-compatibility. For this we are suggesting to adopt [open-telemetry/oteps#161](https://github.com/open-telemetry/oteps/pull/161).

Language specific implementations of the SDK used for instrumenting backend services will need to update their code to expect `telemetry.sdk.*` where `service.*` was used so far. This requires significant effort, although we believe that going down this route earlier is better than going on with a less-invasive change which has different drawbacks (see alternatives below).

Language specific implementations of the SDK for other kinds of telemetry sources, like client side applications, gain the flexibility to use a different scope like `app` for additional attributes of their telemetry source.

Implementations of the SDK need to add a mechanism to either load the `telemetry.sdk.source.id` from an environment variable or to autogenerate a value at application start. For the auto-generated ID the existing recommendation for `service.instance.id`, to use a random Version 1 or Version 4 RFC 4122 UUID, can be used.

Different modules in the collector and implementations of the backend will need to adopt this change. The solution for those backend-specific exporters would be to set some default value for `service.name`, to satisfy their particular backends.

## Alternatives

We think that the proposed approach is the best among many. The following list provides existing alternatives and reasons why they have been rejected:

1. Provide a broad definition for the term `Service`, which then would also cover client-side applications. With that `service.(instance_id|name|namespace)` could be used as unique identifier. It is possible to extend the definition of `Service` to cover that ([open-telemetry/opentelemetry-specification#2111](https://github.com/open-telemetry/opentelemetry-specification/pull/2111)), but frontend application developers & owners do not think about their applications as services and might be confused by this broad definition. Additionally it is not forseeable if other future SDK-based telemetry sources might need a different name which could not be covered by this definition.

2. Introduce `app.(instance_id|name|namespace)` alongside `service.(instance_id|name|namespace)` and require that either `app.name` or `service.name` MUST be provided by the SDK. While this approach addresses the issues of (1), it comes with the disadvantage that a processor like the collector or backend needs to check multiple attributes to identify the type of the telemetry source. This creates additional unnecessary overhead. Also, this _may_ lead to attribute explosion if further SDK-based telemetry sources are introduced and are looking into providing attributes for an id, name, namespace, version or other similar attributes.

## Open questions

* Is the namespace `telemetry.sdk.source.*` suitable? Alternative names could be used
* `telemetry.source.*` as suggested by [open-telemetry/opentelemetry-specification#2192](https://github.com/open-telemetry/opentelemetry-specification/pull/2192). The difference is that it does state explicitly that only SDK-based telemetry sources are covered. This is not necessarily bad, since other telemetry sources _could_ decide to use it as well.
* `telemetry.instance.*`
* `source.*`
* `telemetry.sdk.*` can not be used since `telemetry.sdk.name` is already used
* Should duplication of attributes be allowed, e.g. that `telemetry.sdk.source.name` and `service.name`and `app.name` are specified and possible to be set, or should an attribute that exists in `telemetry.sdk.source.*` not be allowed in `service.*` and `app.*`?
* How should additional attributes like `version`, `bundle`, `firmware_version`, `short_name`, `short_version` be treated? Does it make sense to provide a rule, that attributes common to all sources (like `version`) should also be part of `telemetry.sdk.source.*` and only specific attributes like `bundle` or `firmware_version` should live in a different namespace?

## Future possibilities

While the discussion right now is between backend and frontend services, in the future additional SDK-based telemetry sources like different kinds of devices could be introduced without the need to re-use `service.name` as a mandatory attribute and with the possibility to simply introduce their own scope of additional specific attributes.