The unicode module handles invalid UTF-8 incorrectly #10750

Open
GULPF opened this issue Feb 27, 2019 · 0 comments

GULPF commented Feb 27, 2019

Some procs in the unicode module produce confusing results when given invalid UTF-8 as input.

doAssert "\xC0".runeLenAt(0) == 2
doAssert "\xC0a".runeLen == 1
doAssert "\xC2\xE2".runeAt(0) == "\xC2\xA2".runeAt(0)
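
For context, all three inputs above are ill-formed UTF-8. As a small sketch, the unicode module's own validateUtf8 proc (which returns the index of the first invalid byte, or -1 for valid input) can be used to confirm this before calling the procs above:

```nim
import std/unicode

# validateUtf8 returns the index of the first invalid byte,
# or -1 if the whole string is well-formed UTF-8.
doAssert validateUtf8("\xC0") == 0       # lead byte with no continuation byte
doAssert validateUtf8("\xC0a") == 0      # lead byte followed by an ASCII byte
doAssert validateUtf8("\xC2\xE2") == 0   # lead byte followed by another lead byte
doAssert validateUtf8("\xC2\xA2") == -1  # well-formed encoding of U+00A2
```

This only shows that the inputs are detectably invalid; it does not change what runeLenAt, runeLen or runeAt return for them.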

Some procs, such as toRunes, sometimes map invalid bytes to U+FFFD (the replacement character), but this behavior is undocumented.

I've found 9a59842, which changed the behavior of fastRuneAt from "raise assertion error on invalid UTF-8" to "return garbage on invalid UTF-8".

Here's what the Unicode conformance document says on handling invalid UTF-8:

For example, in UTF-8 every code unit of the form 110xxxxx must be followed by a code unit of the form 10xxxxxx. A sequence such as 110xxxxx 0xxxxxxx is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character.
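
To make the quoted rule concrete, here is a minimal sketch of a decoder that takes the third option (representing the offending code unit with U+FFFD). The proc decodeAt and the constant replacement are hypothetical names for illustration and are not part of the unicode module. Consuming exactly one byte on error is an assumption of this sketch; it is one possible policy, not necessarily what the module should adopt.

```nim
import std/unicode

const replacement = Rune(0xFFFD)

# Hypothetical helper (not part of the unicode module): decodes the sequence
# starting at byte index i and returns the rune plus the bytes consumed.
# Any ill-formed sequence yields U+FFFD and consumes a single byte.
proc decodeAt(s: string, i: int): tuple[rune: Rune, size: int] =
  let b0 = s[i].ord
  var need: int    # number of continuation bytes expected
  var cp: int      # code point accumulated so far
  var minCp: int   # smallest code point allowed for this length
  if b0 < 0x80:
    return (Rune(b0), 1)
  elif (b0 and 0xE0) == 0xC0:
    need = 1
    cp = b0 and 0x1F
    minCp = 0x80
  elif (b0 and 0xF0) == 0xE0:
    need = 2
    cp = b0 and 0x0F
    minCp = 0x800
  elif (b0 and 0xF8) == 0xF0:
    need = 3
    cp = b0 and 0x07
    minCp = 0x10000
  else:
    # Stray continuation byte or a byte that can never start a sequence.
    return (replacement, 1)
  for k in 1 .. need:
    # Every following byte must exist and have the form 10xxxxxx.
    if i + k >= s.len or (s[i + k].ord and 0xC0) != 0x80:
      return (replacement, 1)
    cp = (cp shl 6) or (s[i + k].ord and 0x3F)
  # Reject overlong encodings, surrogates and values above U+10FFFF.
  if cp < minCp or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF):
    return (replacement, 1)
  result = (Rune(cp), need + 1)

let ok = decodeAt("\xC2\xA2", 0)
doAssert ok.rune == Rune(0xA2) and ok.size == 2

let bad = decodeAt("\xC2\xE2", 0)
doAssert bad.rune == replacement and bad.size == 1
```

With this policy, "\xC2\xE2" and "\xC2\xA2" no longer decode to the same rune, and a truncated "\xC0" is reported as one replacement character rather than as a two-byte rune.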
