Fix UTF iterators end too early. #9797

Closed

@Uhf7 Uhf7 commented Apr 24, 2021

Fixes the bug in Utf16_Iter described here: #9599 (comment). I introduced this bug in #9599, so it is a regression. Sorry for that. The problem: if the last character of a UTF-16-encoded file is greater than 0x7F, and hence needs more than one byte in UTF-8, only the first byte of its UTF-8 sequence arrives in the text buffer.
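To make the byte counts concrete, here is a minimal sketch of UTF-8 encoding for a single code point. `encodeUtf8` is a hypothetical helper written for illustration only; it is not taken from the Notepad++ sources and does not reflect how Utf16_Iter is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not Notepad++ code): encode one Unicode scalar value
// as UTF-8. Code points above 0x7F produce two to four bytes.
std::string encodeUtf8(std::uint32_t cp)
{
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, `encodeUtf8(0xE4)` (the letter ä) yields the two bytes C3 A4. If a conversion stops after the first byte of the final character, the buffer ends in a lone lead byte C3, which is the kind of truncation described above.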

A similar bug exists in Utf8_Iter and is fixed here too. That one, at least, is not a regression, and it is harder to reproduce. It occurs when writing a UTF-16-encoded file and

  • the code point of the last character in the text buffer is above 0xFFFF, so two 16-bit code units (a surrogate pair) must be written instead of one, and
  • the position of the last character in the text buffer is 65536 (or any multiple of it),

then only the first 16-bit code unit is written to the file.

The 65536 comes from the size of the intermediate buffer used during conversion:

static const int bufSize = 64*1024;
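The general shape of an "ends too early" bug can be sketched with a simplified model. Everything below is hypothetical: `Utf16UnitSource` and `convert` are illustrative names, not the Notepad++ classes, and the model deliberately leaves out the 64 KiB intermediate buffer, so it drops the trailing unit on every run rather than only at a buffer boundary. It only shows why an end condition that looks at the remaining input alone can lose the second half of a surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model (not Notepad++ code): yields UTF-16 code units one at a
// time from a sequence of code points.
class Utf16UnitSource {
public:
    explicit Utf16UnitSource(const std::vector<std::uint32_t>& cps) : cps_(cps) {}

    // True while unread input remains. Ignores a pending low surrogate,
    // so a loop using this check alone stops one unit too early.
    bool inputLeft() const { return pos_ < cps_.size(); }

    // True while input remains OR a low surrogate is still waiting.
    bool unitsLeft() const { return inputLeft() || pendingLow_ != 0; }

    std::uint16_t next()
    {
        assert(unitsLeft());
        if (pendingLow_ != 0) {            // emit second half of the pair
            std::uint16_t low = pendingLow_;
            pendingLow_ = 0;
            return low;
        }
        std::uint32_t cp = cps_[pos_++];
        if (cp <= 0xFFFF)
            return static_cast<std::uint16_t>(cp);
        cp -= 0x10000;                     // split into surrogate pair
        pendingLow_ = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF));
        return static_cast<std::uint16_t>(0xD800 | (cp >> 10));
    }

private:
    const std::vector<std::uint32_t>& cps_;
    std::size_t pos_ = 0;
    std::uint16_t pendingLow_ = 0;         // 0 means "nothing pending"
};

// Drain the source; checkPending selects which end condition the loop uses.
std::vector<std::uint16_t> convert(const std::vector<std::uint32_t>& cps,
                                   bool checkPending)
{
    Utf16UnitSource src(cps);
    std::vector<std::uint16_t> out;
    while (checkPending ? src.unitsLeft() : src.inputLeft())
        out.push_back(src.next());
    return out;
}
```

For the input `{0x41, 0x1F600}`, `convert(..., false)` returns `0x0041, 0xD83D` and loses the low surrogate, while `convert(..., true)` returns `0x0041, 0xD83D, 0xDE00`.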
