Validate that duplicate metrics across receivers are intentional #26499

Closed
mackjmr opened this issue Sep 7, 2023 · 11 comments
Labels
cmd/mdatagen, enhancement, Stale

Comments

@mackjmr
Member

mackjmr commented Sep 7, 2023

Component(s)

cmd/mdatagen

Describe the issue you're reporting

This is a proposal to disallow metrics with the same name across different receivers, to avoid metric collisions. This would be enforced by having the mdatagen validation fail when non-unique metric names are defined in metadata.yaml across receivers.

That said, there are likely to be exceptions to this rule. For example, there is some discussion in #24238 about whether we should allow using the same metric name in receivers that collect metrics representing the same underlying data.

There should be a way to have exceptions and allow different receivers to emit the same metric, but the general rule should be that this is not allowed.

I propose that this should be allowed for receivers that collect data from the same underlying technology (e.g. container runtime) but different implementations (e.g. docker, containerd, cri-o). Fundamentally, this should be for receivers that act as replacements for collecting the same data, and that cannot be used in conjunction.

To address this, I think we should have a field in metadata.yaml, e.g. group, to indicate that receivers pull the same data. For instance, a dockerstats receiver and a containerd receiver would both be part of the container runtime group. For the validation, only receivers from the same group would be allowed to have non-unique metric names.
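
To make the idea concrete, a hypothetical metadata.yaml sketch is shown below; the `group` field and its value are illustrative only and do not exist in mdatagen today:

```yaml
# receiver/dockerstatsreceiver/metadata.yaml (hypothetical "group" field)
type: docker_stats
group: container.runtime  # shared with containerd, cri-o, etc.

metrics:
  container.cpu.usage.total:
    enabled: true
    description: Total CPU time consumed.
    unit: ns
    sum:
      value_type: int
      monotonic: true
      aggregation_temporality: cumulative
```

Validation would then fail on a duplicate metric name only when the receivers that declare it belong to different groups (or to no group at all).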

mackjmr added the needs triage label Sep 7, 2023
@github-actions
Contributor

github-actions bot commented Sep 7, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@hughesjj
Contributor

My 2c:

  • we should not explicitly disallow this, but a report would be very useful
  • We need to link this to the semantic conventions group
  • I could see utility in allowing two different receivers (and thus presumably two different collection mechanisms) to report the same metric to surface issues. I'm thinking Boeing/NASA here, where you have consensus between three different devices measuring the same signal, throw out the one that disagrees, and alert on it so the users of the telemetry can debug the disagreeing source (as an example)

@mackjmr
Member Author

mackjmr commented Sep 14, 2023

Notes from the SIG meeting:

This should act like a linter: it alerts the user in the case of non-unique metric names (by making the validation fail) to make sure the duplication is intentional. If it is intentional, the user can add the metric name to an allow list.
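
As a rough illustration, a linter of this shape could walk the receiver tree and fail on metric names that appear in more than one metadata.yaml unless they are allow-listed. This is a minimal sketch, assuming `gopkg.in/yaml.v3` for parsing; the struct shape, allow-list entries, and error wording are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"

	"gopkg.in/yaml.v3"
)

// metadata mirrors only the parts of metadata.yaml this check needs.
type metadata struct {
	Type    string                 `yaml:"type"`
	Metrics map[string]interface{} `yaml:"metrics"`
}

// allowedDuplicates lists metric names that are intentionally shared
// across receivers (entries here are illustrative).
var allowedDuplicates = map[string]bool{
	"container.cpu.usage.total": true,
}

func main() {
	seen := map[string]string{} // metric name -> receiver type that first defined it
	err := filepath.WalkDir("receiver", func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || d.Name() != "metadata.yaml" {
			return err
		}
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var md metadata
		if err := yaml.Unmarshal(raw, &md); err != nil {
			return err
		}
		for name := range md.Metrics {
			if first, ok := seen[name]; ok && !allowedDuplicates[name] {
				return fmt.Errorf("metric %q is defined by both %q and %q; add it to the allow list if this is intentional", name, first, md.Type)
			}
			seen[name] = md.Type
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```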

@hughesjj

we should not explicitly disallow this, but a report would be very useful

With the approach above, this is not disallowed, but more of a confirmation check.

We need to link this to the semantic conventions group

Can you clarify what you mean here?

I could see utility in allowing two different receivers (and thus presumably two different collection mechanisms) to report the same metric to surface issues. I'm thinking Boeing/NASA here, where you have consensus between three different devices measuring the same signal, throw out the one that disagrees, and alert on it so the users of the telemetry can debug the disagreeing source (as an example)

Agreed 👍

@hughesjj
Contributor

@mackjmr re: "link this to the semantic conventions"

I might be thinking too far ahead, and it's likely not a blocking prerequisite. That said, the ideal state/"north star"/eventual convergence of metric and attribute naming is coupled to the OpenTelemetry semantic conventions. To my understanding, the current plan is for only a subset of all emitted metrics/attributes to be tracked by semantic conventions, but I could see motivation for tracking 100% of SDK instrumentation and collector metric/attribute emission under semantic conventions.

Summary:
My understanding of this proposal is to reduce conflict between the naming of signals across receivers. Semantic conventions are more or less a specification for consistent naming and, well, semantics across signals. Thus, if we ever see a duplicate signal between two receivers, semantic conventions may indicate whether they should be "the same signal" or not.

TL;DR: if we see duplicate metric names between receivers, and said metric name exists in the semantic conventions spec, we probably want to keep the names as-is (assuming they fulfill semantic intent). If the duplicates aren't in semconv, we may wish to bring the collision to the semantic conventions working group.

@hughesjj
Contributor

TL;DR of the TL;DR: we should additionally discuss this in the semantic conventions working group if we haven't already

@mackjmr
Member Author

mackjmr commented Sep 14, 2023

@hughesjj Thank you for clarifying.

Agreed. I think that if semantic conventions for metric_name exist, validating whether the duplicate metrics are OK should be a matter of ensuring they both fulfill semantic intent (and perhaps also making sure there is a way to differentiate them with attributes, e.g. scope name). On the other hand, if semantic conventions for metric_name do not exist, it can become difficult to validate. I will bring this issue to the Semantic Conventions SIG to discuss 🙂

This issue is not specific to receivers, and some SDKs are also affected by this, so perhaps this can also be a broader discussion.

mackjmr added a commit to mackjmr/opentelemetry-collector-contrib that referenced this issue Sep 14, 2023
Issue: open-telemetry#26499

If semantic conventions for the duplicate metric exist, validating is a matter of ensuring they both fulfill semantic intent. For validating duplicate metrics for which there are no semantic conventions, additional discussion is needed with the semantic conventions WG, as mentioned in the issue. For the time being, it will be left to best judgment.
@mackjmr
Member Author

mackjmr commented Sep 15, 2023

Added this issue to the Semantic Convention WG on Monday: https://docs.google.com/document/d/10xG7DNKWRhxNmFGt3yYd3980a9uwS8lMl2LvQL3VNK8/edit.

mackjmr changed the title from "Disallow non unique metric name in receivers" to "Validate that duplicate metrics across receivers are intentional" Sep 18, 2023
@mackjmr
Member Author

mackjmr commented Sep 19, 2023

Notes from the semantic conventions WG meeting:

  • Arguments for having duplicate metrics:
    From a metric data model perspective, duplicates are allowed to flow through; it is expected that the backend resolves them. Additionally, one use case for doing this is being able to see how different systems measure the same value.

  • Arguments against having duplicate metrics:
    Duplicates can provide a bad user experience. They can create confusion for users when aggregated in the backend (the aggregation will not only be spatial but also across scopes). This is solved by adding scope to the aggregation, but the collector does not necessarily add scope everywhere.

JVM metrics had the same issue, and the decision was to make duplicates opt-in on one side to avoid collisions (i.e. only one enabled by default). Having duplicate metrics may not be a good default experience, so duplicates should be opt-in.
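
In metadata.yaml terms, that kind of default could look like one receiver shipping its copy of the shared metric disabled; the receiver and metric below are hypothetical:

```yaml
# Hypothetical: the containerd receiver's copy of a metric that docker_stats
# also emits ships disabled, so only one source is on by default.
metrics:
  container.cpu.usage.total:
    enabled: false  # opt-in; enable only when docker_stats is not in use
```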

This issue should be raised in the specification meeting.

@mackjmr
Member Author

mackjmr commented Sep 21, 2023

From the Specification SIG:

There was a lot of discussion (between 4:50 and 24:24). The TL;DR is: two pieces of instrumentation providing the same data is allowed. That said, it can be a problem, but there is no good generic solution, and how this is addressed should be a collector decision. We could have a way to detect two identical metrics flowing through the collector (this is not only a problem when there are duplicate metrics across receivers; it can also be an issue when a misconfiguration points the same receiver twice at the same instance).
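
For illustration, detecting duplicates in flight could look roughly like the sketch below, which uses the pdata API and keys only on metric name per instrumentation scope; the key choice and where such a check would hook in are assumptions, not an agreed design:

```go
package dedupcheck

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// reportDuplicates flags metric names that appear in more than one
// instrumentation scope within a batch, which can signal two receivers
// (or one receiver configured twice) emitting the same metric.
func reportDuplicates(md pmetric.Metrics) []string {
	var report []string
	seen := map[string]string{} // metric name -> scope name that first emitted it
	rms := md.ResourceMetrics()
	for i := 0; i < rms.Len(); i++ {
		sms := rms.At(i).ScopeMetrics()
		for j := 0; j < sms.Len(); j++ {
			scope := sms.At(j).Scope().Name()
			ms := sms.At(j).Metrics()
			for k := 0; k < ms.Len(); k++ {
				name := ms.At(k).Name()
				if first, ok := seen[name]; ok && first != scope {
					report = append(report, fmt.Sprintf("metric %q emitted by both scope %q and scope %q", name, first, scope))
					continue
				}
				seen[name] = scope
			}
		}
	}
	return report
}
```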


For the collector, I think that #26687 is a good step so that we are aware of duplicates. In the future, we may want to extend mdatagen to document duplicates in the generated README so that users are aware, and if there are reports of duplicates being a bad user experience, perhaps we can consider a rule that only one should be enabled by default at a time, as was done for JVM metrics in the SDKs.

cc @dmitryax and @hughesjj if there are any concerns with #26687.

@dmitryax
Member

The approach sounds good to me. And I like the receiver group concept, e.g. container runtime metrics.

dmitryax pushed a commit that referenced this issue Nov 14, 2023
**Description:**
This test alerts a user if a metric is duplicated across receivers, and
instructs the user to add said metric to an allow list if this is
intentional.

If semantic conventions for the duplicate metric exist, validating is a
matter of ensuring they both fulfill semantic intent. For validating
duplicate metrics for which there are no semantic conventions, additional
discussion is needed with the semantic conventions WG, as mentioned in
the issue. For the time being, it will be left to best judgment.

**Link to tracking Issue:** #26499
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this issue Nov 24, 2023 (same change as the commit above).
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
