Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify string content must be valid Unicode sequences #3970

Closed
wants to merge 4 commits into from

Conversation

XSAM
Copy link
Member

@XSAM XSAM commented Mar 31, 2024

Fixes #3950

Changes

  • Limit primitive string value to UTF-8.
  • Drop attribute mapping for non-UTF-8 strings.

@XSAM XSAM requested review from a team as code owners March 31, 2024 04:48
@XSAM XSAM changed the title Clarify string content should be UTF-8 Clarify string content should be valid UTF-8 Mar 31, 2024
@XSAM XSAM changed the title Clarify string content should be valid UTF-8 Clarify string content must be valid UTF-8 Mar 31, 2024
@@ -39,6 +39,8 @@ An `Attribute` is a key-value pair, which MUST have the following properties:
- An array of primitive type values. The array MUST be homogeneous,
i.e., it MUST NOT contain values of different types.

String values MUST be valid Unicode sequences (UTF-8).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that any API which accepts string attribute values that are not valid UTF-8 strings doesn't comply with the specification?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.NET uses UTF-16 to encode the text in a string. I do not think it is a problem given it is properly encoded to OTLP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about all languages, but Go and C++ allow arbitrary bytes sequences in default string types (string and std::string) which are used in the OpenTelemetry API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the limitation to valid Unicode sequences.

BTW, the protobuf only encodes the string as UTF-8.

string := valid UTF-8 string (e.g. ASCII);

https://protobuf.dev/programming-guides/encoding/


I'm not sure about all languages, but Go and C++ allow arbitrary bytes sequences in default string types (string and std::string) which are used in the OpenTelemetry API.

I am not sure how C++ handles this. But for Go, since protobuf won't allow arbitrary bytes in string type, the string will be dropped silently if users put arbitrary bytes into the string type. Related issue: open-telemetry/opentelemetry-go#2314

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how C++ handles this. But for Go, since protobuf won't allow arbitrary bytes in string type, the string will be dropped silently if users put arbitrary bytes into the string type.

Besides the issue of (silent) data loss, I'd argue that this change would technically cause the OTel Go API to violate the specification, as it allows users to pass non-Unicode attribute values in API functions.

[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
with the bytes representing the string in the original order and format of the
source string.
String values which are not valid Unicode sequences SHOULD be dropped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking change to the spec. Why should it be accepted? What's wrong with the existing instructions that are being deleted?

Copy link
Member

@jack-berg jack-berg Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This document is experimental (somewhat surprisingly).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't answer my question - still a breaking change. The existing wording provided lossless preservation of data.

Copy link
Member

@pellared pellared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it is language-specific how the string is encoded.

@XSAM XSAM changed the title Clarify string content must be valid UTF-8 Clarify string content must be valid Unicode sequences Apr 2, 2024
Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed this in one of the Otel meetings a week or so ago (probably spec meeting, I can't remember exactly).

The conclusion was that the spec intentionally leaves this unspecified since string implementations are language specific and what is a valid string is different from language to language.

I believe we are not going to mandate Unicode since:

  • It is a breaking change
  • Enforcing it may be expensive in some languages

Blocking this since I don't think the change is desirable.

What we may want to do instead is specify more precisely how OTLP exporter deals with strings that are not valid Unicode, since Protobuf strings need to be unicode. This may need to be done on the https://github.com/open-telemetry/opentelemetry-proto repo.

@XSAM
Copy link
Member Author

XSAM commented Apr 2, 2024

I believe we are not going to mandate Unicode

  • Enforcing it may be expensive in some languages

I agree.

What we may want to do instead is specify more precisely how OTLP exporter deals with strings that are not valid Unicode, since Protobuf strings need to be unicode.

This is good, but I am not sure it solves all the problems. I assume we want to provide a uniform set of features that languages can all implement, meaning if a language can provide this feature, others should be able to provide the same one.

Assuming we converted invalid Unicode to bytes in the OTLP exporter to satisfy Protobuf, which is the current design, we allow certain languages to push bytes in protobuf (if this language allows arbitrary bytes in string) while others cannot do this (if this language does not allow arbitrary bytes in string).

That means to populate bytes in protobuf, we have to use only certain languages other than all languages to implement them. This breaks having a uniform set of features that languages can all implement.

@tigrannajaryan
Copy link
Member

This breaks having a uniform set of features that languages can all implement.

I don't think we aim for uniformness at all costs. Otel mostly tries to be uniform when it is reasonable to be so. Yes, I know it is subjective.

@XSAM
Copy link
Member Author

XSAM commented Apr 3, 2024

This breaks having a uniform set of features that languages can all implement.

I don't think we aim for uniformness at all costs. Otel mostly tries to be uniform when it is reasonable to be so. Yes, I know it is subjective.

I know we have discussed not adding the bytes type to the Attribute, but regarding the costs. Is adding the bytes type to the Attribute costly? By adding the bytes type, we can still provide uniformness.

@tigrannajaryan
Copy link
Member

I know we have discussed not adding the bytes type to the Attribute, but regarding the costs. Is adding the bytes type to the Attribute costly? By adding the bytes type, we can still provide uniformness.

No, I don't think it is costly. I would be in favour of adding bytes to standard attributes.

@jack-berg
Copy link
Member

No, I don't think it is costly. I would be in favour of adding bytes to standard attributes.

Its considered breaking: https://github.com/open-telemetry/opentelemetry-specification/blame/a03382ada8afa9415266a84dafac0510ec8c160f/specification/common/README.md#L84-L85

@pyohannes
Copy link
Contributor

pyohannes commented Apr 4, 2024

Its considered breaking: https://github.com/open-telemetry/opentelemetry-specification/blame/a03382ada8afa9415266a84dafac0510ec8c160f/specification/common/README.md#L84-L85

The linked note says:

Note: Extending the set of standard attribute value types is a breaking change.

And one of the following arguments reads:

Limiting attribute value types to primitives and arrays of primitives simplifies data consumers' efforts to create search indexes and perform statistical analysis.

In my understanding, it should be a breaking change to add any attribute value type that's not a primitive or array of primitives, but extending the list of supported primitives should not be a breaking change.

I wonder if others share this understanding, the wording of just the note itself could indeed be interpreted in different ways.

@tigrannajaryan
Copy link
Member

Yes, my thinking was that bytes is just another primitive type. No strong opinion, but I think it does not violate the original goal of the restriction, which was to avoid complicating the life of receivers by introducing complex types.
If we don't think the spec allows this, then I am fine with that restriction, I don't intend to reopen that discussion.

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 12, 2024
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify the valid content in primitive type string
7 participants