Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UTF-8 decoding of lazy bytestrings #333

Merged
merged 1 commit into from
May 22, 2021
Merged

Conversation

Lysxia
Copy link
Contributor

@Lysxia Lysxia commented May 2, 2021

Fixes #330.

I found another issue that's not yet fixed: both strict and lazy decodeUtf8With are actually memory unsafe if you use a bad onErr argument. We allocate a destination buffer with 2x the number of bytes from the original bytestrings, but by using an onErr function which replaces any invalid byte with a Char which is a surrogate pair in UTF-16, it possible to blow up the size taken by a Text to 4x. In practice, onErr is almost always lenientDecode though, so perhaps a better solution than allocating more memory is to either hide decodeUtf8With or clamp the range of onErr.

@Lysxia
Copy link
Contributor Author

Lysxia commented May 2, 2021

I don't understand the doctest errors on GHC 8+, and on GHC 7+ the errors are that old versions of bytestring did not have toStrict, which I guess I could fix by concatenating toChunks instead.

@Lysxia Lysxia added the fix:bug label May 2, 2021
Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doctests are failing in master as well (I guess it has something to do with a new version of doctest package?.. Dunno)

Could you please check a coverage report to ensure that all lines are well-tested?

src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
@Lysxia Lysxia force-pushed the utf-8-lazy branch 2 times, most recently from b42b59c to de7c071 Compare May 7, 2021 03:53
@Lysxia
Copy link
Contributor Author

Lysxia commented May 7, 2021

There's one uncovered branch because the tests only use an onErr function (lenientDecode) which never returns Nothing. Should I add another test using ignore?

Untitled

@Bodigrim
Copy link
Contributor

Bodigrim commented May 8, 2021

That's probably fine, I guess.

Looks good to me except GHC < 7.6 builds.

Bodigrim
Bodigrim previously approved these changes May 8, 2021
Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Bodigrim
Bodigrim previously approved these changes May 12, 2021
@Lysxia
Copy link
Contributor Author

Lysxia commented May 12, 2021

(Found a silly space.)

Bodigrim
Bodigrim previously approved these changes May 12, 2021
@Bodigrim
Copy link
Contributor

@Lysxia could you please resolve a conflict?

At the beginning of a new chunk we may be trying to complete a UTF-8
sequence started in the previous chunk (contained in the `undecode0`
buffer). If it turns out to be invalid, we must apply the `onErr`
handler to every character in that buffer.

When we reach the end of the chunk, we must also be more careful
about when to keep the previous buffer: a UTF-8 sequence (up to 4 bytes)
can span more than two chunks, when those chunks are very short
(of length 0, 1, or 2).
@Bodigrim Bodigrim merged commit 204f6ac into haskell:master May 22, 2021
@Lysxia Lysxia deleted the utf-8-lazy branch May 16, 2022 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent UTF8 decoding behaviour based on the underlying chunking
2 participants