Re-open: Clarify that service.* conventions apply to all telemetry sources #671

jack-berg · 2024-01-25T16:33:08Z

This is the second attempt at #630, which was reverted over concerns that it was merged too quickly.

Including the original PR description below because it still stands:

The question of whether the service.* conventions are applicable only to web services is an important one that has come up several times. If the answer is yes, then everything that isn't a web service needs its own version of service.name, service.instance.id, service.namespace, service.version to uniquely define the thing producing telemetry. We've discussed this at length several times and while we don't have anything written down yet in the spec / semantic-conventions, the actions we've taken (i.e. rejecting alternatives like telemetry.source, app.name, etc) confirm that the service.* attributes are applicable to all telemetry services, not some narrower subset that some people consider a web service.

This PR aims to clarify this to avoid repeating the same discussion.

Some PRs, issues that are related:

The conversation on #630 that took place after merging is relevant to all reviewers.

…urces" (open-telemetry#638) This reverts commit e37eac7.

pyohannes · 2024-01-26T09:59:58Z

docs/resource/README.md

@@ -77,7 +77,7 @@ as specified in the [Resource SDK specification](https://github.com/open-telemet

 **type:** `service`

-**Description:** A service instance.
+**Description:** A telemetry source. OpenTelemetry has adopted a broad interpretation such that every telemetry source is a service. Examples include, but are not limited to: web services, hosts, mobile applications, browser application, edge computing devices, functions as a service, databases, message brokers, etc. Specific types of telemetry sources may have additional conventions defining domain specific information, but the `service` conventions are applicable to all telemetry sources.


What would qualify as telemetry source when scraping metrics from a Kubernetes cluster? Is it the cluster component responsible for creating and providing the metrics (the "source" in a technical sense)? Or is it a deployment, node, or container (the entity a piece of telemetry data primarily refers to, in a logical sense)?

Its a good question and probably a key aspect of the design of the scraping component. The prometheus model (commonly used to scrape metrics from k8s clusters) every endpoint scraped is an "instance", and multiple replicated endpoints of an instance are a "job". These ideas map well to the idea of service.name (i.e. prometheus job) and service.instance.id (i.e. prometheus instance). The next question is can the scraper ascribe any other useful bits of information describing / identifying the telemetry source? I.e. does it know its scraping information about a node, deployment, or pod, or does everything role up into just a cluster?

The idea of a common set of identifying resource attributes applicable across all telemetry sources works nicely with the #575 idea of adding a service.type attribute. With these in place, a OTLP receiver can reliably know the identity and type of telemetry producer.

The next question to answer is how correlation works: If all telemetry sources have the same identifying attributes, how do we indicate that a service is running on a host in a cluster? We have attributes for a host and cluster, but when everything uses service.* for identity, we need conventions about how a host and a cluster populate the service.* attributes.

For example, consider a setup with a service running on a host. Both the service and host are telemetry producers:

// Host resource { service.name: "jberg", service.instance.id: "abc123", service.type: "host" } // Service resource { service.name: "my-service", "service.instance.id: "def456", service.type: "web_service", host.name: "jberg", host.id: "abc123" }

If we have a convention that states that hosts set service.name=${host.name}, service.instance.id=${host.id}, service.type=host, then we can reliably known that "my-service" is running on the host because it has matching host.name, host.id resource attributes.

I see, so service.type would identify the kind of entity that service attributes refer to.

However, it seems to me that with its current proposed definition it doesn't fully cover this use case:

The service.type identifies the product that is deployed as the service.

A value like "host" doesn't match that description, the proposed definition of service.type would need to be generalized to account for that.

Correct. This PR doesn't solve all the problems associated with identifying and correlating telemetry sources. It aims to codify just one concept (which I assert we already imply) - that all telemetry sources use the same attributes for identity - which we can build off of in future steps.

joaopgrassi · 2024-01-29T12:45:21Z

model/resource/service.yaml

-      A service instance.
+      A telemetry source. OpenTelemetry has adopted a broad interpretation such that every
+      telemetry source is a service. Examples include, but are not limited to: web services,
+      hosts, mobile applications, browser application, edge computing devices, functions as


Suggested change

hosts, mobile applications, browser application, edge computing devices, functions as

hosts, mobile applications, browser applications, edge computing devices, functions as

joaopgrassi · 2024-01-29T12:50:05Z

CC @tigrannajaryan @yurishkuro a new PR is here to continue the discussions.

yurishkuro

How is this PR different from #630? Is seems to propose the exact same changes and does not address any of the concerns raised in #630 discussion.

Blocking so it doesn't get merged by accident again.

jack-berg · 2024-01-29T16:29:35Z

Its the same as #630 because while there was disagreement, there were also 7 approvals.

I like the solution you proposed:

Here's a possible solution:

resource.type: service | host | ...

resource.name: my-service | my.host.com | ...

Basically just replacing s/service/resource/ seems to address the concerns. And more resource-specific attributes like app.app-store (making it up) are both possible and recognizable because of resource.type. service.name can be treated as a fallback to resource.name (OTEL schema transforms should support such upgrade).

But that doesn't seem possible given this langauge:

Exception: Some resource attributes are embedded in various locations of the
Specification, e.g. the service.* attributes which are required by SDKs to be
produced and have corresponding environment variables defined in general SDK configuration. These resource
attributes MUST NOT be ever changed. They are considered a hard-coded part of
this specification.

yurishkuro · 2024-01-29T16:38:50Z

But that doesn't seem possible given this language:

Heh, the wording is certainly sure of itself. But it doesn't bother me that much by itself, the question is what those "various locations" are and what the impact would be to evolve them in backwards-compatible manner.

tigrannajaryan

I do not think this should be merged because I don't think all telemetry sources are services. My position is that by default the telemetry emitted by Otel SDK is coming from a Service. However, I think there can be other telemetry sources that are not services.

As an example a Host has system metrics that are telemetry and can be reported and analysed. There is no Service in this picture unless we inject it just for the purpose of having one (e.g. the Collector that collects the metrics is the Service).

jack-berg · 2024-01-29T18:05:41Z

If all SDKs MUST include service.name, and there are telemetry sources which service.name does not apply to, then there are telemetry sources which can not use an SDK.

This is wrong - SDKs are not meant to be limited to use to the colloquial definition of a service (i.e. web service). We must find a way to not artificially limit the types of telemetry producers SDKs can represent.

Heh, the wording is certainly sure of itself. But it doesn't bother me that much by itself, the question is what those "various locations" are and what the impact would be to evolve them in backwards-compatible manner.

We could:

Introduce a new more generic prefix (say telemetry.* for the sake of example), which includes identifying attributes applicable to all telemetry sources
Define a mapping between service.* and telemetry.*
Include both service.* and telemetry.* by default, and let users configure their pipeline to drop the redundant attributes

tigrannajaryan · 2024-01-29T18:12:58Z

This is wrong - SDKs are not meant to be limited to use to the colloquial definition of a service (i.e. web service). We must find a way to not artificially limit the types of telemetry producers SDKs can represent.

I agree.

But that doesn't seem possible given this langauge:

I am not sure I read it that way. Do you see it as a prohibition to have non-Service Resources? It seems to say "don't change attributes which are defined as part of env variables".

pyohannes · 2024-01-30T09:29:32Z

I like the solution you proposed:

Here's a possible solution:

resource.type: service | host | ...
resource.name: my-service | my.host.com | ...

With a slight modification, such a solution could be fully backward compatible:

resource.type: service | host | ...
<resource.type>.name: my-service | my.host.com | ...

Which would give:

resource.type=host, host.name=my.host.com
resource.type=service, service.name=my-service

This is a pattern that's already used in semantic conventions, for example if messaging.system=kafka, then you can expect messaging.kafka.* attributes.

As a downside, you don't have a single attribute that you can look at. To achieve that, you'd need to process data.

jack-berg · 2024-01-30T20:14:23Z

Earlier I said:

If all SDKs MUST include service.name, and there are telemetry sources which service.name does not apply to, then there are telemetry sources which can not use an SDK.

It appears I misunderstood the spec with respect to all SDKs needing to produce a resource with a service.name. To clarify, the spec says the following:

The resource spec says:

The SDK MUST provide access to a Resource with at least the attributes listed at Semantic Attributes with SDK-provided Default Value.
Note: This means that it is possible to create and associate a resource that does not have all or any of the SDK-provided attributes present. However, that does not happen by default.

The note directly contradicts my conclusion. Its not only possible to produce a SDK without a service.name, but explicitly allowed.

However, as I also noted, the versioning and stability doc mentions:

Exception: Some resource attributes are embedded in various locations of the
Specification, e.g. the service.* attributes which are required by SDKs to be
produced and have corresponding environment variables defined in general SDK configuration. These resource
attributes MUST NOT be ever changed. They are considered a hard-coded part of
this specification.

This appears to reflect a misunderstanding of the resource spec. It should be considered a bug as it came well after the stabilization of the resource spec.

I still think that this PR represents a valid way to proceed, and is consistent with our rejection proposals for alternative identifying attributes (e.g. rejection of app.name). For us to say that alternative sets of identifying attributes are possible, we need a litmus test for when a class of telemetry producers warrants its own identity.

tigrannajaryan · 2024-01-31T05:29:21Z

Its not only possible to produce a SDK without a service.name, but explicitly allowed.

That's my understanding as well.

joaopgrassi · 2024-02-01T13:17:10Z

The resource spec says:

The SDK MUST provide access to a Resource with at least the attributes listed at Semantic Attributes with SDK-provided Default Value.
Note: This means that it is possible to create and associate a resource that does not have all or any of the SDK-provided attributes present. However, that does not happen by default.

The note directly contradicts my conclusion. Its not only possible to produce a SDK without a service.name, but explicitly allowed.

The way I interpret that is: The SDK will only not add the attributes listed at Semantic Attributes with SDK-provided Default Value if the user creates a resource themselves via whatever ways, and associate that instead of the one created by default by the SDK. That makes total sense as it can happen and it's not easy to prevent it.

The paragraph even says: (emphasis mine)

Note: This means that it is possible to create and associate a resource that does not have all or any of the SDK-provided attributes present. However, that does not happen by default

So my interpretation is like yours that SDKs must always fill the service.* required and stable attributes by default.

BTW the PR that introduced that for context: open-telemetry/opentelemetry-specification#1294

jsuereth · 2024-02-12T14:51:42Z

@jack-berg Do you think this deserves an OTEP or "stronger" location in OpenTelemetry? Semconv CANNOT dictate what the specification does/says. Given complexities around modelling services and SDKs, I think this may require an OTEP.

cc @tigrannajaryan on whether this is something we should tackle as part of Entity / Resource-modelling efforts.

jack-berg · 2024-02-12T16:41:04Z

Do you think this deserves an OTEP or "stronger" location in OpenTelemetry? Semconv CANNOT dictate what the specification does/says.

Not sure I understand. This PR doesn't appear to make any mandates on the spec.

I think this may require an OTEP.

I opened this because I was under the impression that the strong language in the spec around default resource attributes and our actions to reject attempts to define alternative sets of identifying resource attributes (i.e. such as app.name) implied that the content of this PR was already the community position. It appears that I was mistaken. This is a frustrating conclusion because it means that key questions around resources continue to not have an answer:

How do you determine the identity of a telemetry producer with a resource which does not have service.* attributes?
Which types of telemetry producers should use service.* attributes?
When is it appropriate for a type of telemetry producer to have its own set of identifying attributes?
How to use the SDK with an alternative set of identifying attributes?
When both service.* and another "defacto" set of identifying attributes are present (i.e. host.*), which represents the identity telemetry producer?

These are really important questions that will continue to frustrate / confuse users, maintainers, and vendors. If we reject this PR, then we as a community (and probably the TC in particular) really ought to make a concerted effort answer these questions.

tigrannajaryan · 2024-02-12T16:59:49Z

@jack-berg those are very good questions. @jsuereth and I are working on a potential OTEP that will likely answer some of these. Can we pause this PR for a bit and see if the OTEP helps?

jack-berg · 2024-02-12T17:24:32Z

@jack-berg those are very good questions. @jsuereth and I are working on a potential OTEP that will likely answer some of these. Can we pause this PR for a bit and see if the OTEP helps?

Yes that's fine with me. I can help with reviews and any prototyping that might be required.

Reapply "Clarify that service.* conventions apply to all telemetry so…

9239a71

…urces" (open-telemetry#638) This reverts commit e37eac7.

jack-berg requested review from a team as code owners January 25, 2024 16:33

github-actions bot assigned joaopgrassi Jan 25, 2024

reyang approved these changes Jan 25, 2024

View reviewed changes

carlosalberto approved these changes Jan 25, 2024

View reviewed changes

pyohannes reviewed Jan 26, 2024

View reviewed changes

joaopgrassi approved these changes Jan 29, 2024

View reviewed changes

yurishkuro requested changes Jan 29, 2024

View reviewed changes

arminru mentioned this pull request Jan 29, 2024

Proposal: Service renaming open-telemetry/oteps#244

Closed

tigrannajaryan requested changes Jan 29, 2024

View reviewed changes

jack-berg mentioned this pull request Feb 13, 2024

Define a common algorithm for service.instance.id #312

Merged

3 tasks

jack-berg closed this Feb 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-open: Clarify that service.* conventions apply to all telemetry sources #671

Re-open: Clarify that service.* conventions apply to all telemetry sources #671

jack-berg commented Jan 25, 2024

pyohannes Jan 26, 2024

jack-berg Jan 26, 2024

pyohannes Jan 26, 2024

jack-berg Jan 26, 2024

joaopgrassi Jan 29, 2024

joaopgrassi commented Jan 29, 2024

yurishkuro left a comment

jack-berg commented Jan 29, 2024

yurishkuro commented Jan 29, 2024 •

edited

tigrannajaryan left a comment

jack-berg commented Jan 29, 2024

tigrannajaryan commented Jan 29, 2024

pyohannes commented Jan 30, 2024

jack-berg commented Jan 30, 2024 •

edited

tigrannajaryan commented Jan 31, 2024

joaopgrassi commented Feb 1, 2024 •

edited

jsuereth commented Feb 12, 2024

jack-berg commented Feb 12, 2024

tigrannajaryan commented Feb 12, 2024

jack-berg commented Feb 12, 2024

	hosts, mobile applications, browser application, edge computing devices, functions as
	hosts, mobile applications, browser applications, edge computing devices, functions as

Re-open: Clarify that service.* conventions apply to all telemetry sources #671

Re-open: Clarify that service.* conventions apply to all telemetry sources #671

Conversation

jack-berg commented Jan 25, 2024

pyohannes Jan 26, 2024

Choose a reason for hiding this comment

jack-berg Jan 26, 2024

Choose a reason for hiding this comment

pyohannes Jan 26, 2024

Choose a reason for hiding this comment

jack-berg Jan 26, 2024

Choose a reason for hiding this comment

joaopgrassi Jan 29, 2024

Choose a reason for hiding this comment

joaopgrassi commented Jan 29, 2024

yurishkuro left a comment

Choose a reason for hiding this comment

jack-berg commented Jan 29, 2024

yurishkuro commented Jan 29, 2024 • edited

tigrannajaryan left a comment

Choose a reason for hiding this comment

jack-berg commented Jan 29, 2024

tigrannajaryan commented Jan 29, 2024

pyohannes commented Jan 30, 2024

jack-berg commented Jan 30, 2024 • edited

tigrannajaryan commented Jan 31, 2024

joaopgrassi commented Feb 1, 2024 • edited

jsuereth commented Feb 12, 2024

jack-berg commented Feb 12, 2024

tigrannajaryan commented Feb 12, 2024

jack-berg commented Feb 12, 2024

yurishkuro commented Jan 29, 2024 •

edited

jack-berg commented Jan 30, 2024 •

edited

joaopgrassi commented Feb 1, 2024 •

edited