Fix UTF-8 decoding of lazy bytestrings #333

Lysxia · 2021-05-02T03:09:32Z

Fixes #330.

I found another issue that's not yet fixed: both the strict and lazy decodeUtf8With are actually memory-unsafe if you pass a bad onErr argument. We allocate a destination buffer with 2x the number of bytes of the original bytestring, but an onErr function that replaces every invalid byte with a Char encoded as a surrogate pair in UTF-16 can blow up the size taken by the Text to 4x. In practice, onErr is almost always lenientDecode though, so perhaps a better solution than allocating more memory is to either hide decodeUtf8With or clamp the range of onErr.
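To make the arithmetic concrete, here is a minimal sketch using only base. It assumes the destination buffer holds one 16-bit code unit per input byte (that is, 2x the input size in bytes). `utf16Units`, `worstCaseBytes`, and `badOnErr` are hypothetical names for illustration, and `badOnErr` only mirrors the shape of an onErr handler rather than using the library's own type.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Number of 16-bit code units needed to store one Char in UTF-16.
utf16Units :: Char -> Int
utf16Units c
  | ord c < 0x10000 = 1  -- Basic Multilingual Plane: a single unit
  | otherwise       = 2  -- outside the BMP: a surrogate pair

-- Hypothetical handler with the same shape as an onErr argument:
-- it replaces every invalid byte with a Char outside the BMP.
badOnErr :: String -> Maybe Word8 -> Maybe Char
badOnErr _ _ = Just '\x1F600'

-- Bytes written in the worst case, when each of the n input bytes is
-- invalid and gets replaced by the handler's Char.
worstCaseBytes :: Int -> Int
worstCaseBytes n = case badOnErr "example" Nothing of
  Just c  -> n * utf16Units c * 2  -- 2 bytes per 16-bit code unit
  Nothing -> 0

main :: IO ()
main = do
  let n = 100
  putStrLn ("allocated bytes: " ++ show (2 * n))
  putStrLn ("written bytes:   " ++ show (worstCaseBytes n))
```

With a handler that only ever returns BMP characters, `utf16Units` is 1 and the write stays within the allocation, which is why clamping the range of onErr would be enough.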

Lysxia · 2021-05-02T03:22:10Z

I don't understand the doctest errors on GHC 8+. On GHC 7+, the errors are that old versions of bytestring did not have toStrict, which I guess I could fix by concatenating toChunks instead.
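A minimal sketch of that workaround, assuming only `B.concat` and `BL.toChunks`, which older bytestring releases already export. `toStrictCompat` is a hypothetical name, not necessarily what ends up in the patch.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper: same result as BL.toStrict, built only from
-- functions that older bytestring releases already export.
toStrictCompat :: BL.ByteString -> B.ByteString
toStrictCompat = B.concat . BL.toChunks

main :: IO ()
main = do
  let lazy = BL.fromChunks [B.pack [0x61, 0x62], B.pack [0x63]]
  print (toStrictCompat lazy == B.pack [0x61, 0x62, 0x63])
```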

Bodigrim

Doctests are failing in master as well (I guess it has something to do with a new version of the doctest package? Dunno).

Could you please check a coverage report to ensure that all lines are well-tested?

src/Data/Text/Encoding.hs

Lysxia · 2021-05-07T12:29:01Z

There's one uncovered branch because the tests only use an onErr function (lenientDecode) which never returns Nothing. Should I add another test using ignore?
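For reference, such a test could look roughly like the sketch below. It assumes a text release that includes this fix. The input is my own illustration, not taken from the test suite: a 3-byte lead byte ends the first chunk and is invalidated by the ASCII byte that starts the next one, so the handler runs on a byte carried over from the previous chunk.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import Data.Text.Encoding.Error (ignore)
import Data.Text.Lazy.Encoding (decodeUtf8With)

-- 'a', then the lead byte 0xE2 of a 3-byte sequence, then 'b' in a
-- second chunk, which makes the pending sequence invalid.
input :: BL.ByteString
input = BL.fromChunks [B.pack [0x61, 0xE2], B.pack [0x62]]

main :: IO ()
main =
  -- ignore returns Nothing, so the invalid byte should simply be dropped.
  print (decodeUtf8With ignore input == TL.pack "ab")
```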

Bodigrim · 2021-05-08T00:47:10Z

That's probably fine, I guess.

Looks good to me except for the GHC < 7.6 builds.

tests/Tests/QuickCheckUtils.hs

Bodigrim

Nice!

Lysxia · 2021-05-12T20:18:29Z

(Found a silly space.)

Bodigrim · 2021-05-22T12:07:48Z

@Lysxia could you please resolve a conflict?

At the beginning of a new chunk we may be trying to complete a UTF-8 sequence started in the previous chunk (contained in the `undecode0` buffer). If it turns out to be invalid, we must apply the `onErr` handler to every character in that buffer. When we reach the end of the chunk, we must also be more careful about when to keep the previous buffer: a UTF-8 sequence (up to 4 bytes) can span more than two chunks when those chunks are very short (of length 0, 1, or 2).
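The invariant behind this can be sketched as a property: decoding should not depend on how the bytes are split into chunks. The code below is an illustration under that assumption, not the test added in this PR. `chunkingIrrelevant` is a hypothetical name, and it compares the lazy decoder on one-byte chunks against the strict decoder on the same bytes.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Word (Word8)

-- Hypothetical property: decoding must not depend on how the bytes are
-- split into chunks, even with one byte per chunk.
chunkingIrrelevant :: [Word8] -> Bool
chunkingIrrelevant bytes =
  TL.toStrict (TLE.decodeUtf8With lenientDecode oneBytePerChunk)
    == TE.decodeUtf8With lenientDecode (B.pack bytes)
  where
    oneBytePerChunk = BL.fromChunks (map B.singleton bytes)

main :: IO ()
main = do
  -- U+1F600 encoded as four bytes, so the sequence spans four chunks.
  print (chunkingIrrelevant [0xF0, 0x9F, 0x98, 0x80])
  -- A truncated sequence followed by ASCII, to exercise the onErr path.
  print (chunkingIrrelevant [0xF0, 0x9F, 0x62])
```

Feeding the same property arbitrary byte lists and arbitrary chunk boundaries from QuickCheck would cover the short-chunk cases (length 0, 1, or 2) described above.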

Lysxia added the fix:bug label May 2, 2021

Bodigrim reviewed May 6, 2021

Lysxia force-pushed the utf-8-lazy branch 2 times, most recently from b42b59c to de7c071 Compare May 7, 2021 03:53

Lysxia force-pushed the utf-8-lazy branch from de7c071 to 789ff70 Compare May 8, 2021 16:13

Bodigrim reviewed May 8, 2021

Lysxia force-pushed the utf-8-lazy branch from 789ff70 to 48b2d2d Compare May 8, 2021 21:06

Bodigrim previously approved these changes May 8, 2021

Lysxia dismissed Bodigrim’s stale review via 9dc69d9 May 12, 2021 15:50

Lysxia force-pushed the utf-8-lazy branch from 48b2d2d to 9dc69d9 Compare May 12, 2021 15:50

Bodigrim previously approved these changes May 12, 2021

Lysxia added this to the 1.3.0.0 milestone May 12, 2021

Lysxia dismissed Bodigrim’s stale review via 02c97b7 May 12, 2021 20:17

Lysxia force-pushed the utf-8-lazy branch from 9dc69d9 to 02c97b7 Compare May 12, 2021 20:17

Bodigrim previously approved these changes May 12, 2021

Lysxia dismissed Bodigrim’s stale review via c9874d3 May 22, 2021 14:31

Lysxia force-pushed the utf-8-lazy branch from 02c97b7 to c9874d3 Compare May 22, 2021 14:31

Bodigrim approved these changes May 22, 2021

Bodigrim merged commit 204f6ac into haskell:master May 22, 2021

Lysxia deleted the utf-8-lazy branch May 16, 2022 09:21
