
Clarify that API users SHOULD use valid UTF-8 for exportable data #3421

Open
jmacd opened this issue Sep 8, 2022 · 22 comments

@jmacd
Contributor

jmacd commented Sep 8, 2022

In open-telemetry/opentelemetry-go#3021, a user reports having spans fail to export because of invalid UTF-8.

I believe the SDK should pass this data through to the receiver, regardless of UTF-8 correctness. I am not sure whether gRPC and the Google protobuf library used by each SDK offer an option to bypass UTF-8 checking, but if so, that would be one path forward if we agree.

Another option is to maintain (e.g., via sed) copies of the protocol that replace string with bytes; SDKs could use these definitions instead to bypass UTF-8 checking as well.

@tigrannajaryan
Member

Strings are UTF-8 in Protobuf:

string := valid UTF-8 string, or sequence of 7-bit ASCII bytes; max 2GB
As described, a string must use UTF-8 character encoding. A string cannot exceed 2GB.

@tigrannajaryan
Member

So, bypassing SDK checks is not enough. We would be non-compliant with the Protobuf spec, and it may result in recipients failing to decode the data. Given that this is part of the Stable spec, I don't think we can change string to bytes anymore.

@jmacd
Contributor Author

jmacd commented Sep 8, 2022

Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8?
Should we reject those strings?

@tigrannajaryan
Member

Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8? Should we reject those strings?

Yes, I think these are valid options to consider for the SDK functionality.
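As a sketch of what those two options could look like in an SDK, using only the Go standard library (the function names here are illustrative, not from any OpenTelemetry codebase):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// rejectInvalid is the "reject" option: refuse strings that are not valid UTF-8.
func rejectInvalid(s string) error {
	if !utf8.ValidString(s) {
		return errors.New("value is not valid UTF-8")
	}
	return nil
}

// repairInvalid is the "transform" option: replace each run of invalid
// bytes with the Unicode replacement character U+FFFD.
func repairInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	bad := "caf\xe9" // a lone 0xE9 byte is not a valid UTF-8 sequence
	fmt.Println(rejectInvalid(bad)) // non-nil error
	fmt.Println(repairInvalid(bad)) // "caf" followed by U+FFFD
}
```

The repair path costs an allocation only for invalid input; `strings.ToValidUTF8` returns quickly for strings that are already valid.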

@tsloughter
Member

I believe there is already an issue for this elsewhere in the spec.

For this case in the Go issue, I argued we should limit strings not by byte size but by grapheme, so that truncation doesn't cut them into invalid UTF-8 (or change the meaning of the UTF-8 string).

@tsloughter
Member

I forgot: the Erlang SDK actually doesn't truncate into invalid UTF-8, because we treat the length limit as the number of graphemes: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_attributes.erl#L105

The issue is that the spec says "character" and not "grapheme": https://github.com/open-telemetry/opentelemetry-specification/blob/ca593258953811eb33197976b8e565299b874853/specification/common/README.md#attribute-limits

@jmacd
Contributor Author

jmacd commented Sep 9, 2022

I believe there's a question of what we actually want to have, here.

The term "character" is reasonably well defined in relation to UTF-8, but it's not the same as "grapheme", I think?

Since protobuf (with proto3 syntax) is very strict about UTF-8 and we intend for OTLP/proto to be the standard, we are quite constrained. The Go protobuf implementation rejects such data in both Marshal() and Unmarshal(), but some implementations do not. Without correcting this problem at the protocol level, we are constrained to one or the other of:

  • Validate every string-valued field before generating OTLP, force correction for invalid UTF-8
  • Expect errors somewhere in the Export path

I am opposed to both of these options. The validate-and-force-correction path is simply expensive, especially when the SDK actually repairs invalid data. The expect-errors path is much worse, because generally the entire batch of data is rejected with little clue as to what the actual error was (note the error message in open-telemetry/opentelemetry-go#3021 does not include detail).

Therefore, what I actually want is for SDKs to allow invalid UTF-8. While users are instructed that they SHOULD NOT use invalid UTF-8, SDKs will be specified such that they SHOULD NOT themselves consider invalid UTF-8 to be an error. As mentioned in my earlier comment, I believe this can be accomplished by maintaining two variant .proto definitions, one where every string field is a wire-compatible bytes field instead.

How this solves the problem:

  • SDKs will be advised to use the bytes-form of OTLP so that their marshal and export routines remain ignorant of any UTF-8 problems.
  • Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8.
  • Consumers will be advised that all the bytes fields SHOULD be interpreted as strings and that UTF-8 validation and correction is their responsibility.

Consumers then can accept data with invalid UTF-8, which I think is ultimately what we all want. I find myself wanting to extend the library guidelines on performance to address this situation. Where we currently write:

Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible.

I want to add a statement like so:

Although users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API, the use of invalid UTF-8 SHOULD NOT degrade the observability of telemetry data. SDKs SHOULD permit invalid UTF-8 to be exported without validating UTF-8, since that would degrade performance unnecessarily.

In addition, I would then add that SDKs SHOULD apply attribute "character limits" as though the character set were US-ASCII, so that (as a performance concern) truncation logic need not parse UTF-8 code points or identify grapheme clusters.

@tigrannajaryan
Member

Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8

Isn't every recipient a processing pipeline from this perspective? This would mean a breaking change for every backend that supports receiving OTLP today.

I want to suggest an alternate statement:

Users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API. The use of invalid UTF-8 results in undefined behavior.

So, dear user, you have been warned. Don't do it; if you do and your backend crashes, that's between you and your vendor.

This statement allows you to make choices in the SDKs. If you want to pay the cost and validate it you can do it. If you don't care and want to pass the problem to the recipients you can also do it.

@tsloughter
Member

"SHOULD" is a good compromise (I'd be on the side of requiring valid utf-8) but I'd argue that we should also state for the users that valid utf8 will not be broken.

@jsuereth
Contributor

We already specify that any non-valid UTF-8 string MUST be encoded as a bytes_value.

I'm against the chaos that would ensue with forcing every string to be raw bytes with no notion of encoding. I think the current specification around bytes_value (with no encoding flag) is already a bit rough.

  • Should we allow non-UTF-8? Yes, but only for attribute values. Names of all forms must be UTF-8 for our protocol to work, thanks to protocol buffers.
  • Is it ok for the API/SDK to re-encode names as UTF-8? Sure. This may be where the rub is. We should enforce that this conversion happens early, so error messages reach users faster.
  • Should we modify OTLP to not use the string type in proto? I feel this is rather extreme for little gain. Values today can be in UTF-8, and our semantic conventions tend to also be in UTF-8. I realize this can be restrictive, but changing it would cause major issues in using the protocol that I think are just not necessary.
  • Should we ensure users denote the encoding of non-UTF-8 strings? I'd say yes, but the state of the world today is ok.

@tigrannajaryan
Member

We now have open issues in the spec to handle this in the API/SDK.

Can we close this issue?

@tigrannajaryan
Member

Closing. I assume nothing is needed in this repo.

@tigrannajaryan tigrannajaryan added the wontfix This will not be worked on label Oct 18, 2022
@jmacd
Contributor Author

jmacd commented Oct 18, 2022

We now have open issues in the spec to handle this in the API/SDK.

Please link any spec issues, for the readers following this issue?

@tigrannajaryan
Member

tigrannajaryan commented Oct 19, 2022

Re-opening, can't find the issue I was referring to. :-(

@tigrannajaryan
Member

Moving this to spec repo since there is nothing to be done on the proto.

@tigrannajaryan tigrannajaryan transferred this issue from open-telemetry/opentelemetry-proto Apr 20, 2023
@tigrannajaryan tigrannajaryan changed the title Are API users required to use valid UTF-8? Clarify that API users SHOULD use valid UTF-8 for exportable data Apr 20, 2023
@tigrannajaryan tigrannajaryan added help wanted Extra attention is needed and removed wontfix This will not be worked on labels Apr 20, 2023
@tigrannajaryan
Member

[screenshot of bot activity]

Hey, bot, you can't be serious. This is the gratitude I get for cleaning up the house?

@jmacd
Contributor Author

jmacd commented May 8, 2024

I will elevate this topic again in the next Specification SIG meeting. Recently I observed the following interaction.

A Rust SDK emitted a string-valued attribute containing invalid UTF-8. This was passed to an OTel Collector, whose OTLP receiver parsed it as valid OTLP data. Note that gRPC receivers do not validate UTF-8; they assume the client does this.

The OTel Collector passed this data to a vendor-internal exporter, which is based on gRPC. That exporter tried to send the same data further into the backend, but it failed because of the invalid UTF-8. It could be worse: if the collector has a batching step and the invalid UTF-8 data is combined with other valid data, the whole request will fail. It has become very easy to lose telemetry because some participants require valid UTF-8 while others do not.

What is the ideal outcome?

I think it would be unpopular if we added a requirement that SDKs MUST validate UTF-8, because it is an unnecessary expense. At the same time, valid UTF-8 is stated as a requirement and enforced by Google protobuf libraries. The compromise I want us to reach is: where there is known UTF-8 enforcement, there MUST be an effort to prevent telemetry data from being dropped.

  1. OTel SDKs are obligated to repair the problem where there is known enforcement. For example, if you are using the Go gRPC library, which requires valid UTF-8, then you must fix validation errors before passing data to the exporter. As a counter-example, if you are using the Rust tonic library (which does not enforce valid UTF-8), then you are not obligated to validate UTF-8.
  2. API users are obligated to use non-string attribute formats where the encoding is not UTF-8. They may expect errors from backends that expect valid UTF-8 if they fail to do this correctly.
  3. The OpenTelemetry Collector will be obligated to provide efficient means of enforcing valid UTF-8. This would likely be done with the help of code generation in the pdata framework, to automatically scan all strings before constructing invalid protobuf data, and through the component capabilities framework. Components either are or are not willing to accept invalid UTF-8 data, and if they aren't, the collector is expected to automatically repair the problem. This could be accomplished through automatic insertion of a sanitizer component in every pipeline, as early as possible.
  4. Generally, OTLP receivers should return PartialSuccess responses to warn senders when they have sent invalid UTF-8. If the data was not automatically corrected, then the rejected count associated with the partial success should reflect the number of items that were not accepted due to invalid UTF-8.
  5. Generally, an automatic correction step SHOULD replace any contiguous sub-string of invalid UTF-8 bytes with a configurable string. The default shall be "�".
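The correction rule in point 5 maps directly onto Go's `strings.ToValidUTF8`, which replaces each contiguous run of invalid bytes with a single replacement string. A minimal sketch of such a sanitizer (the `Sanitizer` type is hypothetical, not an actual collector component):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Sanitizer replaces each contiguous run of invalid UTF-8 bytes with a
// configurable replacement string, defaulting to U+FFFD.
type Sanitizer struct {
	Replacement string // empty means the default "\uFFFD"
}

func (sz Sanitizer) Clean(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: valid input passes through with no allocation
	}
	r := sz.Replacement
	if r == "" {
		r = "\uFFFD"
	}
	return strings.ToValidUTF8(s, r)
}

func main() {
	var sz Sanitizer
	fmt.Println(sz.Clean("a\xff\xfeb"))                      // the whole invalid run becomes one U+FFFD
	fmt.Println(Sanitizer{Replacement: "?"}.Clean("a\xffb")) // prints "a?b"
}
```

Note that a run of several invalid bytes collapses into a single replacement, matching the "contiguous sub-string" wording above.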

@pellared
Member

pellared commented May 8, 2024

Possibly related issue: #3950

CC @XSAM

@jmacd
Contributor Author

jmacd commented May 8, 2024

I do see the connection with #3950, which appears equally unresolved. I am surprised to hear about vendors that accept OTLP data and are not prepared to accept byte-slice data, which is already part of the OTLP data model.

I will move forward my proposal above, which would resolve #3950 by requiring strings to be valid UTF-8 AND requiring SDKs to support byte-slice data AND requiring SDKs and collectors to enforce validity where there is known enforcement.

@jmacd
Contributor Author

jmacd commented May 8, 2024

To help focus this conversation: the specification does have language to steer API users away from entering raw byte-valued attributes.

https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/common#attribute

I believe that (a) we should require strings to be valid UTF-8, and (b) we should add "byte slice" as a primitive value type, with no expectation of UTF-8 validity. We may use either this issue or #3950 to discuss.

@pellared
Member

pellared commented May 8, 2024

requiring strings to be valid utf8

Just a note: I think this is only essential for OTLP, and not for the API. For instance, in Java and .NET, strings are UTF-16 encoded, and I do not see any problem with that. Maybe we should say that it must be possible to convert the string to UTF-8 encoding?

String values which are valid UTF-8 sequences SHOULD be converted to
AnyValue's
[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31)
field.
String values which are not valid Unicode sequences SHOULD be converted to
AnyValue's
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
with the bytes representing the string in the original order and format of the
source string.

I think this should be rephrased to something like

If possible, string values MUST be converted to AnyValue's
string_value
field.

Otherwise, string values MUST be converted to AnyValue's
bytes_value
with the bytes representing the string in the original order and format of the source string.
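The string_value/bytes_value split described here can be sketched as a small conversion function. The `AnyValue` struct below is a simplified stand-in for OTLP's AnyValue message (the real type comes from the generated opentelemetry-proto bindings):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// AnyValue is a simplified stand-in for OTLP's AnyValue message, with
// only the two fields relevant to this discussion.
type AnyValue struct {
	StringValue *string
	BytesValue  []byte
}

// toAnyValue applies the proposed rule: values that are valid UTF-8 go
// to string_value; anything else goes to bytes_value, preserving the
// original bytes in order.
func toAnyValue(s string) AnyValue {
	if utf8.ValidString(s) {
		return AnyValue{StringValue: &s}
	}
	return AnyValue{BytesValue: []byte(s)}
}

func main() {
	fmt.Println(toAnyValue("ok").StringValue != nil)    // true: string path
	fmt.Println(toAnyValue("a\xffb").BytesValue != nil) // true: bytes path
}
```

A UTF-16 runtime like Java or .NET would apply the same rule at its string-to-UTF-8 transcoding boundary: a convertible string becomes string_value, and one with unpaired surrogates falls back to bytes_value.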

@jmacd
Contributor Author

jmacd commented May 10, 2024

@pellared The document you link is restricted to attribute values, and my issue here is about all the other strings too. Sure, other language runtimes have an expectation of UTF-16 -- and what should a .NET or Java library do if, by some unsafe mechanism, they produce an invalid UTF-16 string? OpenTelemetry's error-handling guidelines say they can't produce an error.

I don't know how easy it is to produce invalid UTF-16 at runtime in Java or .NET, but it's fairly easy in Go and Rust (using unsafe) to produce invalid UTF-8. I have put together a position statement in the draft OTEP here: open-telemetry/oteps#257
