proposal: utf8: add ErrInvalid #70547
Comments
I have reservations about this. When we were doing the work that led to UTF-8, Ken and I made an explicit choice to use the invalid character instead of having an out-of-band error signal. The easiest way to explain why is to run grep on a file that has UTF-8-encoded text but also other things. On many systems you'll see an endless stream of complaints about bad bytes, hiding the information you're looking for. Now I know the situations don't exactly match up, but it's important almost always to continue processing even when an encoding error arises, and adding this error to the package would encourage people not to do that. In other words, your "processing can continue" is true and desired, but making the error happen will discourage that. I can't even think of another standard error in the library that is meant to persist until all processing is done. Instead, errors stop processing, always.
The way I've understood this proposal is that it would only add a special error type to The goal then is to have a conventionally-agreed-upon value to use for higher-level parsers for languages whose syntax requires valid UTF-8, so that their callers could use I suppose in that way it's perhaps similar to I have not personally encountered a need to generically match invalid-UTF-8 errors (or EOF errors, for that matter) across a variety of different callees -- parsers I use or have written for languages that require valid UTF-8 encoding typically treat it as just another kind of syntax error and that's been sufficient for my needs -- but it also seems like the cost of offering it is relatively low and if it's useful to some people then it wouldn't hurt those who it isn't useful for. The part about returning
Correct. It may very well have been the intent of UTF-8 to avoid needing an out-of-band error signal, but the reality is that many formats strictly require that the format be composed of valid UTF-8 and I've encountered needing to make this distinction multiple times.
Correct. My proposal doesn't specify that returning For example,
Proposal Details
I propose the addition of the following sentinel error to the utf8 package:
Many higher-level formats are built on top of UTF-8 (e.g., XML, JSON, protobuf) where encountering a UTF-8 encoding problem is a possibility. In many cases such an error is not fatal and processing can still continue, such that a function that returns a typical (T, error) result may provide a sensible value while also returning an error that matches utf8.ErrInvalid. Even if it is fatal, it is often useful for metrics reporting purposes to specially identify invalid UTF-8 as a particular class of errors.
This could replace internal error values used by the "encoding/json/v2" prototype and also within the protobuf module (e.g., golang/protobuf#1228).