
Clarify that API users SHOULD use valid UTF-8 for exportable data #3421

Open
jmacd opened this issue Sep 8, 2022 · 22 comments

@jmacd
Contributor

jmacd commented Sep 8, 2022

In open-telemetry/opentelemetry-go#3021, a user reports having spans fail to export because of invalid UTF-8.

I believe the SDK should pass this data through to the receiver, regardless of UTF-8 correctness. I am not sure whether gRPC and the Google protobuf library used by each SDK offer an option to bypass UTF-8 checking, but if so, that would be one path forward if we agree.

Another option is to maintain (e.g., via sed) copies of the protocol that replace string with bytes; SDKs could use these definitions instead to bypass UTF-8 checking as well.

@tigrannajaryan
Member

Strings are UTF-8 in Protobuf:

string := valid UTF-8 string, or sequence of 7-bit ASCII bytes; max 2GB
As described, a string must use UTF-8 character encoding. A string cannot exceed 2GB.

@tigrannajaryan
Member

So, bypassing SDK checks is not enough. We would be non-compliant with the Protobuf spec, and it may result in recipients failing to decode the data. Given that this is part of the Stable spec, I don't think we can change string to bytes anymore.

@jmacd
Contributor Author

jmacd commented Sep 8, 2022

Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8?
Should we reject those strings?

@tigrannajaryan
Member

Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8? Should we reject those strings?

Yes, I think these are valid options to consider for the SDK functionality.
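As a sketch of what those two options could look like in an SDK, using only the Go standard library (the function names here are illustrative, not from any OpenTelemetry codebase):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// rejectInvalid is the "reject" option: refuse strings that are not valid UTF-8.
func rejectInvalid(s string) error {
	if !utf8.ValidString(s) {
		return errors.New("value is not valid UTF-8")
	}
	return nil
}

// repairInvalid is the "transform" option: replace each run of invalid
// bytes with the Unicode replacement character U+FFFD.
func repairInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	bad := "caf\xe9" // a lone 0xE9 byte is not a valid UTF-8 sequence
	fmt.Println(rejectInvalid(bad)) // non-nil error
	fmt.Println(repairInvalid(bad)) // "caf" followed by U+FFFD
}
```

The repair path costs an allocation only for invalid input; `strings.ToValidUTF8` returns quickly for strings that are already valid.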

@tsloughter
Member

I believe there is already an issue for this elsewhere in the spec.

For this case in the Go issue, I argued we should limit strings not by byte size but by grapheme, so that truncation doesn't cut them into invalid UTF-8 (or change the meaning of the UTF-8 string).

@tsloughter
Member

I forgot: the Erlang SDK actually doesn't truncate into invalid UTF-8, because we treat the length limit as the number of graphemes: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_attributes.erl#L105

The issue is that the spec says "character" and not "grapheme": https://github.com/open-telemetry/opentelemetry-specification/blob/ca593258953811eb33197976b8e565299b874853/specification/common/README.md#attribute-limits

@jmacd
Contributor Author

jmacd commented Sep 9, 2022

I believe there's a question of what we actually want to have, here.

The term "character" is reasonably well defined in relation to UTF-8, but it's not the same as "grapheme", I think?

Since protobuf (with proto3 syntax) is very strict about UTF-8 and we intend for OTLP/proto to be the standard, we are quite constrained. The Go protobuf implementation rejects such data in both Marshal() and Unmarshal(), but some implementations do not. Without correcting this problem at the protocol level, we are constrained to one or the other of:

  • Validate every string-valued field before generating OTLP, force correction for invalid UTF-8
  • Expect errors somewhere in the Export path

I am opposed to both of these options. The validate-and-force-correction path is simply expensive, especially when the SDK actually repairs invalid data. The expect-errors path is much worse, because generally the entire batch of data is rejected with little clue as to what the actual error was (note the error message in open-telemetry/opentelemetry-go#3021 does not include detail).

Therefore, what I actually want is for SDKs to allow invalid UTF-8. While users are instructed that they SHOULD NOT use invalid UTF-8, SDKs will be specified such that they SHOULD NOT themselves consider invalid UTF-8 to be an error. As mentioned in my earlier comment, I believe this can be accomplished by maintaining two variant .proto definitions, one where every string field is a wire-compatible bytes field instead.

How this solves the problem:

  • SDKs will be advised to use the bytes-form of OTLP so that their marshal and export routines remain ignorant of any UTF-8 problems.
  • Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8.
  • Consumers will be advised that all the bytes fields SHOULD be interpreted as strings and that UTF-8 validation and correction is their responsibility.

Consumers then can accept data with invalid UTF-8, which I think is ultimately what we all want. I find myself wanting to extend the library guidelines on performance to address this situation. Where we currently write:

Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible.

I want to add a statement like so:

Although users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API, the use of invalid UTF-8 SHOULD NOT degrade the observability of telemetry data. SDKs SHOULD permit invalid UTF-8 to be exported without validating UTF-8, since that would degrade performance unnecessarily.

In addition, I would then add that SDKs SHOULD apply attribute "character limits" as though the character set were US-ASCII, so that (as a performance concern) truncation logic need not parse UTF-8 code points or identify grapheme clusters.

@tigrannajaryan
Member

Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8

Isn't every recipient a processing pipeline from this perspective? This would mean a breaking change for every backend that supports receiving OTLP today.

I want to suggest an alternate statement:

Users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API. The use of invalid UTF-8 results in undefined behavior.

So, dear user, you have been warned. Don't do it; if you do and your backend crashes, that's between you and your vendor.

This statement allows you to make choices in the SDKs. If you want to pay the cost and validate it you can do it. If you don't care and want to pass the problem to the recipients you can also do it.

@tsloughter
Member

"SHOULD" is a good compromise (I'd be on the side of requiring valid utf-8) but I'd argue that we should also state for the users that valid utf8 will not be broken.

@jsuereth
Contributor

We already specify that any non-valid UTF-8 string MUST be encoded as a bytes_value.

I'm against the chaos that would ensue with forcing every string to be raw bytes with no notion of encoding. I think the current specification around bytes_value (with no encoding flag) is already a bit rough.

  • Should we allow non-UTF-8? Yes, but only for attribute values. Names of all forms must be UTF-8 for our protocol to work, thanks to protocol buffers.
  • Is it ok for the API/SDK to re-encode names as UTF-8? Sure. This may be where the rub is. We should enforce that this conversion happens early, so error messages reach users faster.
  • Should we modify OTLP to not use the string type in proto? I feel this is rather extreme for little gain. Values today can be in UTF-8, and our semantic conventions tend to also be in UTF-8. I realize this can be restrictive, but changing it would cause major issues in using the protocol that I think are just not necessary.
  • Should we ensure users denote the encoding of non-UTF-8 strings? I'd say yes, but the state of the world today is ok.

@tigrannajaryan
Member

We now have open issues in the spec to handle this in the API/SDK.

Can we close this issue?

@tigrannajaryan
Member

Closing. I assume nothing is needed in this repo.

@tigrannajaryan tigrannajaryan added the wontfix This will not be worked on label Oct 18, 2022
@jmacd
Contributor Author

jmacd commented Oct 18, 2022

We now have open issues in the spec to handle this in the API/SDK.

Please link any spec issues, for the readers following this issue?

@tigrannajaryan
Member

tigrannajaryan commented Oct 19, 2022

Re-opening, can't find the issue I was referring to. :-(

@tigrannajaryan
Member

Moving this to spec repo since there is nothing to be done on the proto.

@tigrannajaryan tigrannajaryan transferred this issue from open-telemetry/opentelemetry-proto Apr 20, 2023
@tigrannajaryan tigrannajaryan changed the title Are API users required to use valid UTF-8? Clarify that API users SHOULD use valid UTF-8 for exportable data Apr 20, 2023
@tigrannajaryan tigrannajaryan added help wanted Extra attention is needed and removed wontfix This will not be worked on labels Apr 20, 2023
@tigrannajaryan
Member

[screenshot of bot activity]

Hey, bot, you can't be serious. This is the gratitude I get for cleaning up the house?

@jmacd
Contributor Author

jmacd commented May 8, 2024

I will elevate this topic again in the next Specification SIG meeting. Recently I observed the following interaction.

A Rust SDK emitted a string-valued attribute containing invalid UTF-8. This was passed to an OTel Collector, whose OTLP receiver parsed it as valid OTLP data. Note that gRPC receivers do not validate UTF-8; they assume the client does this.

The OTel Collector passed this data to a vendor-internal exporter, which is based on gRPC. That exporter tried to send the same data further into the backend, but it failed because of the invalid UTF-8. It could be worse: if the collector has a batching step and the invalid UTF-8 data is combined with other valid data, the whole request will fail. It has become very easy to lose telemetry because some participants require valid UTF-8 while others do not.

What is the ideal outcome?

I think it would be unpopular if we added a requirement that SDKs MUST validate UTF-8, because it is an unnecessary expense. At the same time, valid UTF-8 is stated as a requirement and enforced by Google protobuf libraries. The compromise I want us to reach is: where there is known UTF-8 enforcement, there MUST be an effort to prevent telemetry data from being dropped.

  1. OTel SDKs are obligated to repair the problem where there is known enforcement. For example, if you are using the Go gRPC library, which requires valid UTF-8, then you must fix validation errors before passing data to the exporter. As a counter-example, if you are using the Rust tonic library (which does not enforce valid UTF-8), then you are not obligated to validate UTF-8.
  2. API users are obligated to use non-string attribute formats where the encoding is not UTF-8. They may expect errors from backends that expect valid UTF-8 if they fail to do this correctly.
  3. The OpenTelemetry Collector will be obligated to provide efficient means of enforcing valid UTF-8. This would likely be done with the help of code generation in the pdata framework, to automatically scan all strings before constructing invalid protobuf data, and through the component capabilities framework. Components either are or are not willing to accept invalid UTF-8 data, and if they aren't, the collector is expected to automatically repair the problem. This could be accomplished through automatic insertion of a sanitizer component in every pipeline, as early as possible.
  4. Generally, OTLP receivers should return PartialSuccess responses to warn senders when they have sent invalid UTF-8. If the data was not automatically corrected, then the rejected count associated with the partial success should reflect the number of items that were not accepted due to invalid UTF-8.
  5. Generally, an automatic correction step SHOULD replace any contiguous sub-string of invalid UTF-8 bytes with a configurable string. The default shall be "�".
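The correction rule in point 5 maps directly onto Go's `strings.ToValidUTF8`, which replaces each contiguous run of invalid bytes with a single replacement string. A minimal sketch of such a sanitizer (the `Sanitizer` type is hypothetical, not an actual collector component):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Sanitizer replaces each contiguous run of invalid UTF-8 bytes with a
// configurable replacement string, defaulting to U+FFFD.
type Sanitizer struct {
	Replacement string // empty means the default "\uFFFD"
}

func (sz Sanitizer) Clean(s string) string {
	if utf8.ValidString(s) {
		return s // fast path: valid input passes through with no allocation
	}
	r := sz.Replacement
	if r == "" {
		r = "\uFFFD"
	}
	return strings.ToValidUTF8(s, r)
}

func main() {
	var sz Sanitizer
	fmt.Println(sz.Clean("a\xff\xfeb"))                      // the whole invalid run becomes one U+FFFD
	fmt.Println(Sanitizer{Replacement: "?"}.Clean("a\xffb")) // prints "a?b"
}
```

Note that a run of several invalid bytes collapses into a single replacement, matching the "contiguous sub-string" wording above.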

@pellared
Member

pellared commented May 8, 2024

Possibly related issue: #3950

CC @XSAM

@jmacd
Contributor Author

jmacd commented May 8, 2024

I do see the connection with #3950, which appears equally unresolved. I am surprised to hear about vendors that accept OTLP data and are not prepared to accept byte-slice data, which is already part of the OTLP data model.

I will move forward my proposal above, which would resolve #3950 by requiring strings to be valid UTF-8 AND requiring SDKs to support byte-slice data AND requiring SDKs and collectors to enforce validity where there is known enforcement.

@jmacd
Contributor Author

jmacd commented May 8, 2024

To help focus this conversation: the specification does have language to steer API users away from entering raw byte-valued attributes.

https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/common#attribute

I believe that (a) we should require strings to be valid UTF-8, and (b) we should add "byte slice" as a primitive value type, with no expectation of UTF-8 validity. We may use either this issue or #3950 to discuss.

@pellared
Member

pellared commented May 8, 2024

requiring strings to be valid utf8

Just a note: I think this is only essential for OTLP, and not for the API. For instance, in Java and .NET, strings are UTF-16 encoded, and I do not see any problem with that. Maybe we should say that it must be possible to convert the string to UTF-8 encoding?

String values which are valid UTF-8 sequences SHOULD be converted to
AnyValue's
[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31)
field.
String values which are not valid Unicode sequences SHOULD be converted to
AnyValue's
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
with the bytes representing the string in the original order and format of the
source string.

I think this should be rephrased to something like

If possible, string values MUST be converted to AnyValue's
string_value
field.

Otherwise, string values MUST be converted to AnyValue's
bytes_value
with the bytes representing the string in the original order and format of the source string.
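The string_value/bytes_value split described here can be sketched as a small conversion function. The `AnyValue` struct below is a simplified stand-in for OTLP's AnyValue message (the real type comes from the generated opentelemetry-proto bindings):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// AnyValue is a simplified stand-in for OTLP's AnyValue message, with
// only the two fields relevant to this discussion.
type AnyValue struct {
	StringValue *string
	BytesValue  []byte
}

// toAnyValue applies the proposed rule: values that are valid UTF-8 go
// to string_value; anything else goes to bytes_value, preserving the
// original bytes in order.
func toAnyValue(s string) AnyValue {
	if utf8.ValidString(s) {
		return AnyValue{StringValue: &s}
	}
	return AnyValue{BytesValue: []byte(s)}
}

func main() {
	fmt.Println(toAnyValue("ok").StringValue != nil)    // true: string path
	fmt.Println(toAnyValue("a\xffb").BytesValue != nil) // true: bytes path
}
```

A UTF-16 runtime like Java or .NET would apply the same rule at its string-to-UTF-8 transcoding boundary: a convertible string becomes string_value, and one with unpaired surrogates falls back to bytes_value.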

@jmacd
Contributor Author

jmacd commented May 10, 2024

@pellared The document you link is restricted to attribute values, and my issue here is about all the other strings too. Sure, other language runtimes have an expectation of UTF-16 -- and what should a .NET or Java library do if, by some unsafe mechanism, they produce an invalid UTF-16 string? OpenTelemetry's error-handling guidelines say they can't produce an error.

I don't know how easy it is to produce invalid UTF-16 at runtime in Java or .NET, but it's fairly easy in Go and Rust (using unsafe) to produce invalid UTF-8. I have put together a position statement in the draft OTEP here: open-telemetry/oteps#257
