The unicode module handles invalid UTF-8 incorrectly #10750

GULPF · 2019-02-27T08:43:56Z

Some procs in the unicode module produce confusing results when given invalid UTF-8 as input.

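import unicode

# All three assertions pass, even though every input is ill-formed UTF-8.
doAssert "\xC0".runeLenAt(0) == 2                      # lead byte claims 2 bytes; the string has 1
doAssert "\xC0a".runeLen == 1                          # "a" is counted as part of the rune
doAssert "\xC2\xE2".runeAt(0) == "\xC2\xA2".runeAt(0)  # \xE2 is not a continuation byte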
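
One possible way to avoid results like these in calling code is to check the string with validateUtf8 (also in the unicode module) before using the other procs. This is a minimal sketch, not a fix for the module itself; safeRuneLen is a hypothetical name and the -1 sentinel is an assumption made for illustration.

```nim
import unicode

proc safeRuneLen(s: string): int =
  ## Hypothetical wrapper, not part of the unicode module.
  ## Returns the rune count, or -1 if `s` is not valid UTF-8.
  ## validateUtf8 returns -1 for valid input, otherwise the index
  ## of the first invalid byte.
  if validateUtf8(s) >= 0: -1 else: s.runeLen
```

A caller could then treat -1 as "reject this input" rather than trusting a rune count computed from ill-formed bytes.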

Some procs, such as toRunes, sometimes map invalid bytes to U+FFFD (the replacement character). This behavior is undocumented.

I found commit 9a59842, which changed the behavior of fastRuneAt from "raise an assertion error on invalid UTF-8" to "return garbage on invalid UTF-8".

Here's what the Unicode conformance document says on handling invalid UTF-8:

For example, in UTF-8 every code unit of the form 110xxxxx must be followed by a code unit of the form 10xxxxxx. A sequence such as 110xxxxx 0xxxxxxx is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx as an illegally terminated code unit sequence, for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character.
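
To make the quoted rule concrete, here is a rough sketch of a decoder that follows the "marker" option: it checks every continuation byte and returns U+FFFD for the first code unit of an ill-formed sequence. decodeAt and isCont are hypothetical names, not procs from the unicode module, and replacing one byte at a time is only one of the policies the text allows.

```nim
const replacement = 0xFFFD

proc isCont(s: string, i: int): bool =
  ## True if s[i] exists and has the form 10xxxxxx.
  i < s.len and (ord(s[i]) and 0xC0) == 0x80

proc decodeAt(s: string, i: int): tuple[cp: int, size: int] =
  ## Hypothetical helper: decodes the code point starting at byte i.
  ## Returns the code point and the number of bytes consumed.
  ## Ill-formed input yields (U+FFFD, 1) so the caller resumes at the next byte.
  let b0 = ord(s[i])
  if b0 < 0x80:
    return (b0, 1)
  var need = 0    # number of continuation bytes expected
  var cp = 0      # code point accumulated so far
  var minCp = 0   # smallest value allowed for this length
  if (b0 and 0xE0) == 0xC0:
    need = 1; cp = b0 and 0x1F; minCp = 0x80
  elif (b0 and 0xF0) == 0xE0:
    need = 2; cp = b0 and 0x0F; minCp = 0x800
  elif (b0 and 0xF8) == 0xF0:
    need = 3; cp = b0 and 0x07; minCp = 0x10000
  else:
    return (replacement, 1)    # stray continuation byte or invalid lead byte
  for k in 1 .. need:
    if not isCont(s, i + k):
      return (replacement, 1)  # illegally terminated sequence
    cp = (cp shl 6) or (ord(s[i + k]) and 0x3F)
  # Reject overlong forms, surrogates, and values above U+10FFFF.
  if cp < minCp or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF):
    return (replacement, 1)
  (cp, need + 1)
```

With this sketch, the inputs from the report no longer decode to a valid-looking rune: "\xC0" and "\xC2\xE2" both yield U+FFFD, while "\xC2\xA2" still decodes to U+00A2.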


GULPF added the Standard Library label Feb 27, 2019

GULPF mentioned this issue Aug 17, 2019, in "Improve handling of invalid UTF-8" #11968 (closed, 2 tasks)
