Clarify string content must be valid Unicode sequences #3970

XSAM · 2024-03-31T04:48:42Z

Changes

Limit primitive string value to UTF-8.
Drop attribute mapping for non-UTF-8 strings.

pyohannes · 2024-04-02T10:58:16Z

specification/common/README.md

@@ -39,6 +39,8 @@ An `Attribute` is a key-value pair, which MUST have the following properties:
  - An array of primitive type values. The array MUST be homogeneous,
    i.e., it MUST NOT contain values of different types.

+String values MUST be valid Unicode sequences (UTF-8).


Does this mean that any API which accepts string attribute values that are not valid UTF-8 strings doesn't comply with the specification?

.NET uses UTF-16 to encode the text in a string. I do not think it is a problem given it is properly encoded to OTLP.

I'm not sure about all languages, but Go and C++ allow arbitrary bytes sequences in default string types (string and std::string) which are used in the OpenTelemetry API.

I changed the limitation to valid Unicode sequences.

BTW, the protobuf only encodes the string as UTF-8.

string := valid UTF-8 string (e.g. ASCII);

https://protobuf.dev/programming-guides/encoding/

I'm not sure about all languages, but Go and C++ allow arbitrary bytes sequences in default string types (string and std::string) which are used in the OpenTelemetry API.

I am not sure how C++ handles this. But for Go, since protobuf won't allow arbitrary bytes in string type, the string will be dropped silently if users put arbitrary bytes into the string type. Related issue: open-telemetry/opentelemetry-go#2314

I am not sure how C++ handles this. But for Go, since protobuf won't allow arbitrary bytes in string type, the string will be dropped silently if users put arbitrary bytes into the string type.

Besides the issue of (silent) data loss, I'd argue that this change would technically cause the OTel Go API to violate the specification, as it allows users to pass non-Unicode attribute values in API functions.

yurishkuro · 2024-04-02T14:09:26Z

specification/common/attribute-type-mapping.md

-[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
-with the bytes representing the string in the original order and format of the
-source string.
+String values which are not valid Unicode sequences SHOULD be dropped.


Breaking change to the spec. Why should it be accepted? What's wrong with the existing instructions that are being deleted?

This document is experimental (somewhat surprisingly).

Doesn't answer my question - still a breaking change. The existing wording provided lossless preservation of data.

pellared

I find it is language-specific how the string is encoded.

tigrannajaryan

We have discussed this in one of the Otel meetings a week or so ago (probably spec meeting, I can't remember exactly).

The conclusion was that the spec intentionally leaves this unspecified since string implementations are language specific and what is a valid string is different from language to language.

I believe we are not going to mandate Unicode since:

It is a breaking change
Enforcing it may be expensive in some languages

Blocking this since I don't think the change is desirable.

What we may want to do instead is specify more precisely how OTLP exporter deals with strings that are not valid Unicode, since Protobuf strings need to be unicode. This may need to be done on the https://github.com/open-telemetry/opentelemetry-proto repo.

XSAM · 2024-04-02T17:11:59Z

I believe we are not going to mandate Unicode

Enforcing it may be expensive in some languages

I agree.

What we may want to do instead is specify more precisely how OTLP exporter deals with strings that are not valid Unicode, since Protobuf strings need to be unicode.

This is good, but I am not sure it solves all the problems. I assume we want to provide a uniform set of features that languages can all implement, meaning if a language can provide this feature, others should be able to provide the same one.

Assuming we converted invalid Unicode to bytes in the OTLP exporter to satisfy Protobuf, which is the current design, we allow certain languages to push bytes in protobuf (if this language allows arbitrary bytes in string) while others cannot do this (if this language does not allow arbitrary bytes in string).

That means to populate bytes in protobuf, we have to use only certain languages other than all languages to implement them. This breaks having a uniform set of features that languages can all implement.

tigrannajaryan · 2024-04-02T18:13:22Z

This breaks having a uniform set of features that languages can all implement.

I don't think we aim for uniformness at all costs. Otel mostly tries to be uniform when it is reasonable to be so. Yes, I know it is subjective.

XSAM · 2024-04-03T15:36:02Z

This breaks having a uniform set of features that languages can all implement.

I don't think we aim for uniformness at all costs. Otel mostly tries to be uniform when it is reasonable to be so. Yes, I know it is subjective.

I know we have discussed not adding the bytes type to the Attribute, but regarding the costs. Is adding the bytes type to the Attribute costly? By adding the bytes type, we can still provide uniformness.

tigrannajaryan · 2024-04-03T17:42:35Z

I know we have discussed not adding the bytes type to the Attribute, but regarding the costs. Is adding the bytes type to the Attribute costly? By adding the bytes type, we can still provide uniformness.

No, I don't think it is costly. I would be in favour of adding bytes to standard attributes.

jack-berg · 2024-04-03T17:45:54Z

No, I don't think it is costly. I would be in favour of adding bytes to standard attributes.

Its considered breaking: https://github.com/open-telemetry/opentelemetry-specification/blame/a03382ada8afa9415266a84dafac0510ec8c160f/specification/common/README.md#L84-L85

pyohannes · 2024-04-04T10:35:07Z

Its considered breaking: https://github.com/open-telemetry/opentelemetry-specification/blame/a03382ada8afa9415266a84dafac0510ec8c160f/specification/common/README.md#L84-L85

The linked note says:

Note: Extending the set of standard attribute value types is a breaking change.

And one of the following arguments reads:

Limiting attribute value types to primitives and arrays of primitives simplifies data consumers' efforts to create search indexes and perform statistical analysis.

In my understanding, it should be a breaking change to add any attribute value type that's not a primitive or array of primitives, but extending the list of supported primitives should not be a breaking change.

I wonder if others share this understanding, the wording of just the note itself could indeed be interpreted in different ways.

tigrannajaryan · 2024-04-04T14:43:42Z

Yes, my thinking was that bytes is just another primitive type. No strong opinion, but I think it does not violate the original goal of the restriction, which was to avoid complicating the life of receivers by introducing complex types.
If we don't think the spec allows this, then I am fine with that restriction, I don't intend to reopen that discussion.

github-actions · 2024-04-12T03:16:48Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2024-04-19T03:16:57Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

XSAM added 2 commits March 30, 2024 21:37

Limiting string value to UTF-8

cd031f3

Drop non-UTF-8 strings

7639ba2

XSAM requested review from a team as code owners March 31, 2024 04:48

github-actions bot assigned arminru Mar 31, 2024

XSAM changed the title ~~Clarify string content should be UTF-8~~ Clarify string content should be valid UTF-8 Mar 31, 2024

Change SHOULD to MUST

1aa4175

XSAM changed the title ~~Clarify string content should be valid UTF-8~~ Clarify string content must be valid UTF-8 Mar 31, 2024

XSAM mentioned this pull request Mar 31, 2024

Clarify the valid content in primitive type string #3950

Open

pyohannes reviewed Apr 2, 2024

View reviewed changes

yurishkuro reviewed Apr 2, 2024

View reviewed changes

pellared requested changes Apr 2, 2024

View reviewed changes

XSAM changed the title ~~Clarify string content must be valid UTF-8~~ Clarify string content must be valid Unicode sequences Apr 2, 2024

Change limitation to Unicode sequences

157fdd6

tigrannajaryan requested changes Apr 2, 2024

View reviewed changes

github-actions bot added the Stale label Apr 12, 2024

github-actions bot closed this Apr 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify string content must be valid Unicode sequences #3970

Clarify string content must be valid Unicode sequences #3970

XSAM commented Mar 31, 2024

pyohannes Apr 2, 2024

pellared Apr 2, 2024

pyohannes Apr 2, 2024

XSAM Apr 2, 2024

pyohannes Apr 2, 2024

yurishkuro Apr 2, 2024

jack-berg Apr 2, 2024 •

edited

Loading

yurishkuro Apr 2, 2024

pellared left a comment •

edited

Loading

tigrannajaryan left a comment

XSAM commented Apr 2, 2024 •

edited

Loading

tigrannajaryan commented Apr 2, 2024

XSAM commented Apr 3, 2024

tigrannajaryan commented Apr 3, 2024

jack-berg commented Apr 3, 2024

pyohannes commented Apr 4, 2024 •

edited

Loading

tigrannajaryan commented Apr 4, 2024

github-actions bot commented Apr 12, 2024

github-actions bot commented Apr 19, 2024

Clarify string content must be valid Unicode sequences #3970

Clarify string content must be valid Unicode sequences #3970

Conversation

XSAM commented Mar 31, 2024

Changes

pyohannes Apr 2, 2024

Choose a reason for hiding this comment

pellared Apr 2, 2024

Choose a reason for hiding this comment

pyohannes Apr 2, 2024

Choose a reason for hiding this comment

XSAM Apr 2, 2024

Choose a reason for hiding this comment

pyohannes Apr 2, 2024

Choose a reason for hiding this comment

yurishkuro Apr 2, 2024

Choose a reason for hiding this comment

jack-berg Apr 2, 2024 • edited Loading

Choose a reason for hiding this comment

yurishkuro Apr 2, 2024

Choose a reason for hiding this comment

pellared left a comment • edited Loading

Choose a reason for hiding this comment

tigrannajaryan left a comment

Choose a reason for hiding this comment

XSAM commented Apr 2, 2024 • edited Loading

tigrannajaryan commented Apr 2, 2024

XSAM commented Apr 3, 2024

tigrannajaryan commented Apr 3, 2024

jack-berg commented Apr 3, 2024

pyohannes commented Apr 4, 2024 • edited Loading

tigrannajaryan commented Apr 4, 2024

github-actions bot commented Apr 12, 2024

github-actions bot commented Apr 19, 2024

jack-berg Apr 2, 2024 •

edited

Loading

pellared left a comment •

edited

Loading

XSAM commented Apr 2, 2024 •

edited

Loading

pyohannes commented Apr 4, 2024 •

edited

Loading