Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry Schema and Resource #161

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jsuereth
Copy link
Contributor

@jsuereth jsuereth commented Jun 1, 2021

Specify a mechanism for users of Resource to interact with Resource leveragingTelemetrySchema version for a stable API.

Goals of this OTEP:

  • Exporters in the SDK have access to "stable" resource names. I.e. I can code an exporter against version X of the telemetry schema
  • Improve flexibiltiy in altering Resource semantic conventions through clearly defined (and implemented) conversion tooling.

@jsuereth jsuereth marked this pull request as ready for review June 3, 2021 15:58
@jsuereth jsuereth requested a review from a team as a code owner June 3, 2021 15:58
Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice OTEP. Spot on with the spirit of schemas and nicely builds on top of it.

Makefile Outdated
Comment on lines 48 to 50
.PHONY: check
check: markdown-lint misspell markdown-link-check
echo "All checks complete"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unrelated change.

text/0161-telemetry-schema-and-resource.md Show resolved Hide resolved
*
* <p> Returns an invalid Resource if conversion is not possible.
*/
public Resource convertToSchema(String schemaUrl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is an expected capability of schemas, so no surprises here.
I believe this should also be able to return an error since some conversions are not possible (e.g. between different schema families or if the versions are incompatible).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Resource resolution happens at application startup, I'm nervous about having SDKs have to go and fetch multiple schemas from the public internet as a part of the startup process, in order to perform the conversion. There are failure modes here, and general startup-time issues to be considered (especially, for example, in mobile or FAAS implementations).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Resource resolution happens at application startup, I'm nervous about having SDKs have to go and fetch multiple schemas from the public internet as a part of the startup process, in order to perform the conversion.

That's a good point. The SDKs can have the schema file that they correspond to embedded with the source code so that there is no need to fetch anything for detectors that are part of the SDK. If each SDK includes the latest Otel schema file then any Otel schema conversions can happen without network access.
Custom, non-Otel schema conversions will still require a network fetching. Perhaps we also provide an alternate convertToSchema function which accepts a schema file instead of schema URL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that an alternate conversion method should be required to work from a file, but an alternate URL->bytes function could be provided. I'd imagine possibly a three tier lookup: 1) embedded in the application by the SDK, 2) on disk in a well-known or specified location 3) over the network through traditional URL resolution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per this thread conversation, it may be better to start with a solution that "file"/"bytes" for the schema are provided, and delegate to the user of the API to ensure that is available somewhere to grab. Some users may embed the files in their application, some may embed the schema bytes, etc.

@jkwatson had a very good point, and probably we should postpone the API that fetches from the internet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the notion of providing the local schema instead of fetching from the URL. We may go a step further and define Schema as an opaque object that is loadable from file or from a URL. This opens up the possibility of performance improvements (parse and compile the schema once such that the subsequent conversion calls are fast). See e.g. prototype here that has "parse/compile" and "convert" as separate operations.

text/0161-telemetry-schema-and-resource.md Show resolved Hide resolved
text/0161-telemetry-schema-and-resource.md Show resolved Hide resolved
text/0161-telemetry-schema-and-resource.md Show resolved Hide resolved
text/0161-telemetry-schema-and-resource.md Show resolved Hide resolved
Comment on lines +194 to +202
The notion of telemetry-schema conversion could be expanded and adapted into
its own set of functionality that SDKs or SDK-extensions (like resource
Copy link
Member

@tigrannajaryan tigrannajaryan Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make them independent part of SDKs, consumable separately if possible by code that does not care about the rest of Otel but needs to deal with schemas. It will be very nice though to make this an official part of Otel libaries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need this because you want to be able to also convert Spans/Metrics/Logs that also have schemaURL.


## Open questions

- What kinds of operations on Resource should be considered compatible?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined by schema OTEP. Conversion between families is incompatible. Ambiguous "renames" are incompatible.

## Open questions

- What kinds of operations on Resource should be considered compatible?
- Is "downgrading" a schema version going to be "safe" for exporters?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is safe. Any reason to believe it may not?

@MadVikingGod
Copy link

MadVikingGod commented Jun 3, 2021

Some general questions I have about this proposal:

  1. Does this need to be part of the SDK?
  2. Do we expect this to be used by most users of the SDK? I can see the use in the collector, and backends, but other then resources what does an end user application or an instrumentation library get with this feature?
  3. Could this be accomplished outside of the SDK, or maybe as a feature of the collector?
  4. Does that mean that only URLs defined by the spec are the only valid ones, what if I want to make a convention for Digital Ocean that isn't part of current specs?
    • What does newer mean in such a world? Is otel v1.4.0 newer than Oracle v1.1.2?
    • What does compatible mean in such a world? Is otel v1.4.0 compatible with Oracle v1.1.2?
    • How does one convert from Otel v1.4.0 to Oracle v1.1.2?
  5. Does this mean where ever this code lies (it's proposed for the SDK but see 1) we have to have some way to process these conversions? This has lots of implication like:
    • How does it do forward conversions for specs after the current version was released?
    • How do we limit the amount of code we introduce by updating the conventions?
    • How do we know what supported conversions are available for the current library?

@tigrannajaryan
Copy link
Member

Do we expect this to be used by most users of the SDK? I can see the use in the collector, and backends, but other then resources what does an end user application or an instrumentation library get with this feature?

It will be used by the SDK itself to correctly merge the resources. It can be also used by an external code to do schema translation (e.g. the Collector or a backend).

Could this be accomplished outside of the SDK, or maybe as a feature of the collector?
Schema conversion, once the data leaves the SDK, yes, can be done by the Collector (and it is a planned feature). However, the merging of the resources must happen in the SDK, there does not seem to be a way to avoid that.

Does that mean that only URLs defined by the spec are the only valid ones, what if I want to make a convention for Digital Ocean that isn't part of current specs?

Yes, you can have your own schema that is not Otel schema. You will need to choose a schema family URL and publish the schema files for each version you release.

What does newer mean in such a world? Is otel v1.4.0 newer than Oracle v1.1.2?

They are different schema families, not comparable. "Newer" only makes sense for schemas that belong to the same family.

What does compatible mean in such a world? Is otel v1.4.0 compatible with Oracle v1.1.2?

No, not compatible, since they are different families.

How does one convert from Otel v1.4.0 to Oracle v1.1.2?

You can't.

Note: I planned a concept of parent schemas for the future. This would allow deriving Oracle schema from Otel schema and in that case possibly allow conversion. But it is not part of the current Schemas concept yet.

How do we limit the amount of code we introduce by updating the conventions?

There is no code needed to be introduced when you update the conventions. The schema conversion code is capable of generically handling changes to convention. The rules defined in schema file drive this generic schema conversion code.

@MadVikingGod
Copy link

There is no code needed to be introduced when you update the conventions. The schema conversion code is capable of generically handling changes to convention. The rules defined in schema file drive this generic schema conversion code.

I'm trying to understand how an actual implementation would do this. So does this mean that we are going to either embed the schema file or go fetch the schema file when needed? Then the code that would be written for this would be an interpreter of the schema to dynamically convert one version to the next?

@tigrannajaryan
Copy link
Member

There is no code needed to be introduced when you update the conventions. The schema conversion code is capable of generically handling changes to convention. The rules defined in schema file drive this generic schema conversion code.

I'm trying to understand how an actual implementation would do this. So does this mean that we are going to either embed the schema file or go fetch the schema file when needed?

Both are valid approaches.

Then the code that would be written for this would be an interpreter of the schema to dynamically convert one version to the next?

Yes, exactly.

If you haven't read the Schemas OTEP it may be useful to do it since it answers some of your questions: https://github.com/open-telemetry/oteps/blob/main/text/0152-telemetry-schemas.md

@jsuereth
Copy link
Contributor Author

jsuereth commented Jun 3, 2021

@MadVikingGod

Some general questions I have about this proposal:
Does this need to be part of the SDK?

Yes, we need this for GCP exporters.

Do we expect this to be used by most users of the SDK? I can see the use in the collector, and backends, but other then resources what does an end user application or an instrumentation library get with this feature?

SDK != End users. Instrumentation library = API. This needs to be in the SDK for exporters, and it's a code-only thing. End users should be interacting MOSTLY via configuration or 'code configuration' on the sdk.

Could this be accomplished outside of the SDK, or maybe as a feature of the collector?

We also need this in the collector. This is an interface that should be provided to consumers of Resource in the SDK.

Does that mean that only URLs defined by the spec are the only valid ones, what if I want to make a convention for Digital Ocean that isn't part of current specs?
What does newer mean in such a world? Is otel v1.4.0 newer than Oracle v1.1.2?
What does compatible mean in such a world? Is otel v1.4.0 compatible with Oracle v1.1.2?
How does one convert from Otel v1.4.0 to Oracle v1.1.2?

I think tigran responded to most of these.

Does this mean where ever this code lies (it's proposed for the SDK but see 1) we have to have some way to process these conversions? This has lots of implication like:
How does it do forward conversions for specs after the current version was released?
How do we limit the amount of code we introduce by updating the conventions?
How do we know what supported conversions are available for the current library?

Totally agree these are good questions. I see this migration being a temporary solution to prevent ecosystem breakage as everyone adopts the latest telemetry schema. Let's say a user is using the following components:

  • OpenTelemtery API 1.1
  • Instrumetantion for my hipster microservice framework, 1.2
  • OpenTelemetry SDK 1.4
  • Resource Detector for K8s v1.1
  • Google cloud exporter v1.0

If we take the approach we have today for Resources, it means I need to upgrade them ALL in lockstop for correct behavior. To mitigate this, we've kept Resource detection components + GCP exporters in our own repositories and release them together, so it's easier to know if you have an issue.

However, if folks use any other Resource Detector component besides the one we provide, even if it abides by semantic conventions, we can't tell if we're compatible. Renaming attributes is silent breakage to some extent, and worse if the exporter is unable to declare which version of the schema it was using.

@MadVikingGod
Copy link

So, let me start off I like the ideas presented here, but I think there are a few assumptions that have been made that aren't stated, and it's a bit light on how external conventions interact with the conventions discussed.

And as I'm thinking more on this I think we are solving two separate issues, How to merge Resource Conventions, and How to merge Telemetry Conventions. They are both the same problem, but where we can solve them is different, as well as how we should solve them might be different. We might want to split the issue of resource migration and telemetry migration into different proposals.

The major concern right now with useability. If I want to guarantee any single convention outside of the current semantic conventions, anything that uses that is unable to be mixed in with other data. Meaning even if my convention is a superset of a semconv there is no way to upgrade other libraries to my convention. Nor can I downgrade to a different family of conventions.

The other large concern is that this seems to be leading us to dynamic processing of arbitrary schemas. While this can be made safe, it is a very challenging task to do so safely, and I would highly recommend against it.

SDK != End users. Instrumentation library = API.
Actually the End User is the only one who gets to choose the SDK and is the only entity guaranteed to have it at all. As I've been asking about I think especially for the case of telemetry semconvs we don't need it in the SDK, and adding it for an exporter that may live in contrib seems like the wrong way to reason about this.

For the resource portion, the user/library writers will know when they write the application what conventions will be used. I think we shouldn't have any complicated logic around transforming those but push for a way for an application author to know:

  1. At compile time that different conventions that may be used aren't stepping on each other.
  2. At upgrade time that taking a newer version doesn't mask something they have done.

For the telemetry portion, I think a stand-alone library is much more appropriate. SDKs don't provide any ingest logic and this is an ingest problem.

### Understanding schema compatibility

Here we propose simply providing an implementation of
[SemVer](https://semver.org/) for isCompatible, and using the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused on this. If you had a schema rule like the one above:

  next.version:
    resources:
      changes:
        - rename_attributes:
            attribute_map:
              cloud.zone: cloud.availability_zone
  previous.version:

since there is a transform between versions, would the two schemas be compatible and have the same major version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what the answer to this is, which is why I opened: #162

IF the answer to the quesiton is "yes they're compatible", then I think this OTEP resolves some major pain that kind of change would cause.

@jsuereth
Copy link
Contributor Author

@MadVikingGod You state:

For the telemetry portion, I think a stand-alone library is much more appropriate. SDKs don't provide any ingest logic and this is an ingest problem.

I'd disagree with the later part. Exporters are ingestion adapters and such this is their problem and they live as SDK components. I'd be ok if this were a (stable) library an exporter can rely on, but it needs to be baked in.

@tigrannajaryan
Copy link
Member

@jsuereth do you plan to continue working on this OTEP?

@jsuereth
Copy link
Contributor Author

@tigrannajaryan Yes, tabled it temporarily to work more directly with metrics. I still think this is an unresolved issue w/ semantic conventions + telemetry.

@tigrannajaryan
Copy link
Member

@tigrannajaryan Yes, tabled it temporarily to work more directly with metrics. I still think this is an unresolved issue w/ semantic conventions + telemetry.

@jsuereth I believe we need this change. It evolves the schemas in the right direction.

*
* <p> Returns an invalid Resource if conversion is not possible.
*/
public Resource convertToSchema(String schemaUrl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that an alternate conversion method should be required to work from a file, but an alternate URL->bytes function could be provided. I'd imagine possibly a three tier lookup: 1) embedded in the application by the SDK, 2) on disk in a well-known or specified location 3) over the network through traditional URL resolution.

Comment on lines +146 to +147
- **Else if the Schema urls of the old and updating resources are compatible**
**then the older resource will be converted to newer schema url and merged.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that "older resource" may be used in two senses here. One being the target of a merge operation and the other being the resource with the lesser schema version. Utilizing a different term for one or both of those senses may help avoid confusion.

Comment on lines +159 to +170
- Initially, we could prevent any `rename` alterations to schema for compatible
version numbers. `convertToSchema` would check for semver compatibility and
just return the original resource if compatible.
- Alternatively, we can check the newer schema for any transformations in the
`all` or `resource` section before failing to convert.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this check be limited to attributes actually present on the resources being evaluated, or would the presence of any transformation, even for non-present attributes, prevent merging/migration?

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, I think definitely a prototype is needed :)

My major concern is about providing the downgrading functionality since right now we require changes to be backwards compatible (semver does the same) but not forward compatible.

Please consider to "resolve" all the comments if the discussion got to the resolution point.

@@ -0,0 +1,162 @@
# Add exponential bucketing to histogram protobuf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge conflict.

We propose a new set of SDK features:

- A mechanism to "migrate" a resource up/down compatible schema changes.
- A mechanism to determine if two schema URLs allow compatible changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a long term issue or just a short term need, I had the impression that we should always do backwards compatible changes. So we can always go up, sometimes we may not be able to go down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an issue of how users consume exporters + resource detecters in the SDK. Both are pluggable components and we can expect resource detection to advance ahead of exporters, so the forward-compatible behavior is needed in that kind of distribution environment.


We propose a new set of SDK features:

- A mechanism to "migrate" a resource up/down compatible schema changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should provide only the "go up" version, and enforce everyone to use the most recent (latest) schema. The exporter in your case should be able to adopt (maybe) the rules to work on the Resource schema (if newer) vs downgrading the Resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can start by implementing only go-up, but as I stated, there's an issue with mismatch between various sourced components. If we're not somewhat compatible in a forwards way, we could cause issues across the ecosystem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be comfortable with both up and down directions. The schema files unambiguously defines whether it is safe to go in a particular direction.

*
* <p> Returns an invalid Resource if conversion is not possible.
*/
public Resource convertToSchema(String schemaUrl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per this thread conversation, it may be better to start with a solution that "file"/"bytes" for the schema are provided, and delegate to the user of the API to ensure that is available somewhere to grab. Some users may embed the files in their application, some may embed the schema bytes, etc.

@jkwatson had a very good point, and probably we should postpone the API that fetches from the internet.

labels, e.g.

```java
public static String MY_SCHEMA_VERSION="https://opentelemetry.io/schemas/1.1.0";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary related question: Is this the expected value for the proto fields or just the string number "1.1.0"?

/cc @tigrannajaryan

can be converted to a compatible schema. Specfically resource can be merged
IFF:

- The schemas are compatible
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question, is this something that we need to worry? Should we assume that converting to the "newest" is always possible?


There are three major changes to SDKs for this proposal.

### Understanding schema compatibility
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this section needed because currently the schema url does not respect semver for the version number? I though that was implicit since the schema url version number corresponds to the specs version which follows semver, probably my assumption is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Schema URL could be for a different specification entirely. This OTEP proposed a level of compatibility we could not assume before.

This is needed so we CAN assume semver for the default otel schema.

text/0161-telemetry-schema-and-resource.md Show resolved Hide resolved
e.g. In Java you might have the following for resource:

```java
public abstract class Resource {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say that similar changes will be applied to other places where schemaurl is used, like spans/metrics/logs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, that would be the goal.

Comment on lines +194 to +202
The notion of telemetry-schema conversion could be expanded and adapted into
its own set of functionality that SDKs or SDK-extensions (like resource
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need this because you want to be able to also convert Spans/Metrics/Logs that also have schemaURL.

@jsuereth jsuereth force-pushed the wip-telemetry-schema-and-resource branch from 5ea71c0 to 71155a1 Compare September 16, 2021 19:41
Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is well thought out.

@tigrannajaryan
Copy link
Member

@jsuereth do you plan to continue working on this OTEP?

@tedsuo
Copy link
Contributor

tedsuo commented Feb 6, 2023

@jsuereth I assume we still need this OTEP?

@jsuereth
Copy link
Contributor Author

jsuereth commented Feb 9, 2023

@tedsuo I think we still need it but I've put effort on hold to address higher-priority items. I think we still have time to revisit this post some semconv stability, and this needs two more approvers.

@tedsuo
Copy link
Contributor

tedsuo commented Jul 31, 2023

@jsuereth we are cleaning up stale OTEP PRs. If there is no further action at this time, we will close this PR in one week. Feel free to open it again when it is time to pick it back up.

@tedsuo tedsuo added the stale This issue or PR is stale and will be closed soon unless it is resurrected by the author. label Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sem-conv priority:p1 stale This issue or PR is stale and will be closed soon unless it is resurrected by the author. triaged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet