Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: utf8: add ErrInvalid #70547

Open
dsnet opened this issue Nov 24, 2024 · 4 comments
Open

proposal: utf8: add ErrInvalid #70547

dsnet opened this issue Nov 24, 2024 · 4 comments
Labels
Milestone

Comments

@dsnet
Copy link
Member

dsnet commented Nov 24, 2024

Proposal Details

I propose the addition of the following sentinel error to the utf8 package:

// ErrInvalid indicates that invalid UTF-8 was encountered within a text string.
// Packages that return this error should usually wrap this error value within
// some local error type to provide further semantic context regarding where
// this error occurred.
var ErrInvalid = errors.New("invalid UTF-8")

Many higher-level formats are built on top of UTF-8 (e.g., XML, JSON, protobuf, etc.) where encountering an UTF-8 encoding problem is a possibility. In many cases such an error is not fatal and processing can still continue such that a function that returns a typical (T, error) result may provide a sensible while also returning an error that matches utf8.ErrInvalid. Even if it is fatal, it is often useful for metrics reporting purposes to specially identify invalid UTF-8 as a particular class of errors.

This could replace internal error values used by the "encoding/json/v2" prototype and also within the protobuf module (e.g., golang/protobuf#1228).

@dsnet dsnet added the Proposal label Nov 24, 2024
@gopherbot gopherbot added this to the Proposal milestone Nov 24, 2024
@gabyhelp
Copy link

Related Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Nov 25, 2024
@robpike
Copy link
Contributor

robpike commented Nov 25, 2024

I have reservations about this. When we were doing the work that led to UTF-8, Ken and I made an explicit choice to use the invalid character instead of having an out-of-band error signal. The easiest way to explain why is to run grep on a file that has UTF-8-encoded text but also other things. On many systems you'll see an endless stream of complaints about bad bytes, hiding the information you're looking for. Now I know the situations don't exactly match up, but it's important almost always to continue processing even when an encoding error arises, and adding this error to the package would encourage people not to do that.

In other words, your "processing can continue" is true and desired, but making the error happen will discourage that. I can't even think of another standard error in the library that is meant to persist until all processing is done. Instead, errors stop processing, always.

@apparentlymart
Copy link

The way I've understood this proposal is that it would only add a special error type to unicode/utf8, and would not change anything else in unicode/utf8 to make use of it.

The goal then is to have a conventionally-agreed-upon value to use for higher-level parsers for languages whose syntax requires valid UTF-8, so that their callers could use errors.Is (and similar) to handle that situation in a special way without needing each higher-level parser to offer its own error value for that case. Is that right?

I suppose in that way it's perhaps similar to io.EOF, which has been treated as a conventional way to report that a parser encountered end-of-input in an invalid location.


I have not personally encountered a need to generically match invalid-UTF-8 errors (or EOF errors, for that matter) across a variety of different callees -- parsers I use or have written for languages that require valid UTF-8 encoding typically treat it as just another kind of syntax error and that's been sufficient for my needs -- but it also seems like the cost of offering it is relatively low and if it's useful to some people then it wouldn't hurt those who it isn't useful for.

The part about returning utf8.ErrInvalid along with a parsed result to indicate "this result is usable aside from having incorrect UTF-8 encoding" seems a little dubious to me, but ultimately that's a decision that can be made by the designer of each higher-level parser and described in their own documentation, rather than being a direct part of this proposal.

@dsnet
Copy link
Member Author

dsnet commented Nov 26, 2024

The goal then is to have a conventionally-agreed-upon value to use for higher-level parsers for languages whose syntax requires valid UTF-8, so that their callers could use errors.Is (and similar) to handle that situation in a special way without needing each higher-level parser to offer its own error value for that case.

Correct. It may very well have been the intent of UTF-8 to avoid needing an out-of-band error signal, but the reality is that many formats strictly require that the format be composed of valid UTF-8 and I've encountered needing to make this distinction multiple times.

The part about returning utf8.ErrInvalid along with a parsed result to indicate "this result is usable aside from having incorrect UTF-8 encoding" seems a little dubious to me, but ultimately that's a decision that can be made by the designer of each higher-level parser and described in their own documentation, rather than being a direct part of this proposal.

Correct. My proposal doesn't specify that returning utf8.ErrInvalid guarantees that the result may be usable. I'm just saying that a given implementation that does make use of utf8.ErrInvalid may document if that is so.

For example, jsontext.AppendUnquote will decode a JSON string completely, but report an error if invalid UTF-8 is encountered. Some callers of that function do not care if invalid UTF-8 exists so long as it is reasonably handled (e.g., coerced into the Unicode replacement character), but other use-cases require that the input be valid UTF-8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

5 participants