-
Notifications
You must be signed in to change notification settings - Fork 870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify string content must be valid Unicode sequences #3970
Conversation
specification/common/README.md
Outdated
@@ -39,6 +39,8 @@ An `Attribute` is a key-value pair, which MUST have the following properties: | |||
- An array of primitive type values. The array MUST be homogeneous, | |||
i.e., it MUST NOT contain values of different types. | |||
|
|||
String values MUST be valid Unicode sequences (UTF-8). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that any API which accepts string attribute values that are not valid UTF-8 strings doesn't comply with the specification?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.NET uses UTF-16 to encode the text in a string. I do not think it is a problem given it is properly encoded to OTLP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about all languages, but Go and C++ allow arbitrary bytes sequences in default string types (string
and std::string
) which are used in the OpenTelemetry API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the limitation to valid Unicode sequences.
BTW, the protobuf only encodes the string as UTF-8.
string := valid UTF-8 string (e.g. ASCII);
https://protobuf.dev/programming-guides/encoding/
I'm not sure about all languages, but Go and C++ allow arbitrary bytes sequences in default string types (string and std::string) which are used in the OpenTelemetry API.
I am not sure how C++ handles this. But for Go, since protobuf won't allow arbitrary bytes in string type, the string will be dropped silently if users put arbitrary bytes into the string type. Related issue: open-telemetry/opentelemetry-go#2314
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how C++ handles this. But for Go, since protobuf won't allow arbitrary bytes in string type, the string will be dropped silently if users put arbitrary bytes into the string type.
Besides the issue of (silent) data loss, I'd argue that this change would technically cause the OTel Go API to violate the specification, as it allows users to pass non-Unicode attribute values in API functions.
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) | ||
with the bytes representing the string in the original order and format of the | ||
source string. | ||
String values which are not valid Unicode sequences SHOULD be dropped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking change to the spec. Why should it be accepted? What's wrong with the existing instructions that are being deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This document is experimental (somewhat surprisingly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't answer my question - still a breaking change. The existing wording provided lossless preservation of data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it is language-specific how the string is encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have discussed this in one of the Otel meetings a week or so ago (probably spec meeting, I can't remember exactly).
The conclusion was that the spec intentionally leaves this unspecified since string implementations are language specific and what is a valid string is different from language to language.
I believe we are not going to mandate Unicode since:
- It is a breaking change
- Enforcing it may be expensive in some languages
Blocking this since I don't think the change is desirable.
What we may want to do instead is specify more precisely how OTLP exporter deals with strings that are not valid Unicode, since Protobuf strings need to be unicode. This may need to be done on the https://github.com/open-telemetry/opentelemetry-proto repo.
I agree.
This is good, but I am not sure it solves all the problems. I assume we want to provide a uniform set of features that languages can all implement, meaning if a language can provide this feature, others should be able to provide the same one. Assuming we converted invalid Unicode to bytes in the OTLP exporter to satisfy Protobuf, which is the current design, we allow certain languages to push bytes in protobuf (if this language allows arbitrary bytes in string) while others cannot do this (if this language does not allow arbitrary bytes in string). That means to populate bytes in protobuf, we have to use only certain languages other than all languages to implement them. This breaks having a uniform set of features that languages can all implement. |
I don't think we aim for uniformness at all costs. Otel mostly tries to be uniform when it is reasonable to be so. Yes, I know it is subjective. |
I know we have discussed not adding the |
No, I don't think it is costly. I would be in favour of adding |
Its considered breaking: https://github.com/open-telemetry/opentelemetry-specification/blame/a03382ada8afa9415266a84dafac0510ec8c160f/specification/common/README.md#L84-L85 |
The linked note says:
And one of the following arguments reads:
In my understanding, it should be a breaking change to add any attribute value type that's not a primitive or array of primitives, but extending the list of supported primitives should not be a breaking change. I wonder if others share this understanding, the wording of just the note itself could indeed be interpreted in different ways. |
Yes, my thinking was that |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Fixes #3950
Changes