Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ByteBufUtil#writeUtf8 subsequence split surrogate edge-case bug #9437

Merged
merged 2 commits into from
Aug 10, 2019

Conversation

njhill
Copy link
Member

@njhill njhill commented Aug 8, 2019

Motivation

#9224 introduced overloads of ByteBufUtil#writeUtf8(...) and related methods to operate on a sub-charsequence directly to save having to allocate substrings, but it missed an edge case where the subsequence does not extend to the end of the CharSequence and the last char in the sequence is a high surrogate.

Due to the catch-IndexOutOfBoundsException optimization that avoids an additional bounds check, it would be possible to read past the specified end char and successfully decode a surrogate pair that would otherwise result in a '?' byte being written.

Modifications

  • Check for end-of-subsequence before reading next char after a high surrogate is encountered in the writeUtf8(AbstractByteBuf,int,CharSequence,int,int) and utf8BytesNonAscii methods
  • Add unit test for this edge case

Result

Bug is fixed.

This removes the bounds-check-avoidance optimization but it does not appear to have a measurable impact on benchmark results, including when the char sequence contains many surrogate pairs (which should be rare in any case).

Motivation

netty#9224 introduced overrides of ByteBufUtil#writeUtf8(...) and related
methods to operate on a sub-CharSequence directly to save having to
allocate substrings, but it missed an edge case where the subsequence
does not extend to the end of the CharSequence and the last char in the
sequence is a high surrogate.

Due to the catch-IndexOutOfBoundsException optimization that avoids an
additional bounds check, it would be possible to read past the specified
end char index and successfully decode a surrogate pair which would
otherwise result in a '?' byte being written.

Modifications

- Check for end-of-subsequence before reading next char after a high
surrogate is encountered in the
writeUtf8(AbstractByteBuf,int,CharSequence,int,int) and
utf8BytesNonAscii methods
- Add unit test for this edge case

Result

Bug is fixed.

This removes the bounds-check-avoidance optimization but it does not
appear to have a measurable impact on benchmark results, including when
the char sequence contains many surrogate pairs (which should be rare in
any case).
@njhill njhill added the defect label Aug 8, 2019
@netty-bot
Copy link

Can one of the admins verify this patch?

@normanmaurer
Copy link
Member

@netty-bot test this please

@normanmaurer
Copy link
Member

@netty-bot test this please

@normanmaurer normanmaurer merged commit fedcc40 into netty:4.1 Aug 10, 2019
@normanmaurer
Copy link
Member

@njhill thanks a lot!

@normanmaurer normanmaurer added this to the 4.1.39.Final milestone Aug 10, 2019
@normanmaurer normanmaurer self-requested a review August 10, 2019 18:54
normanmaurer pushed a commit that referenced this pull request Aug 10, 2019
…9437)


Motivation:

#9224 introduced overrides of ByteBufUtil#writeUtf8(...) and related
methods to operate on a sub-CharSequence directly to save having to
allocate substrings, but it missed an edge case where the subsequence
does not extend to the end of the CharSequence and the last char in the
sequence is a high surrogate.

Due to the catch-IndexOutOfBoundsException optimization that avoids an
additional bounds check, it would be possible to read past the specified
end char index and successfully decode a surrogate pair which would
otherwise result in a '?' byte being written.

Modifications:

- Check for end-of-subsequence before reading next char after a high
surrogate is encountered in the
writeUtf8(AbstractByteBuf,int,CharSequence,int,int) and
utf8BytesNonAscii methods
- Add unit test for this edge case

Result:

Bug is fixed.

This removes the bounds-check-avoidance optimization but it does not
appear to have a measurable impact on benchmark results, including when
the char sequence contains many surrogate pairs (which should be rare in
any case).
@njhill njhill deleted the utf8-subseq-fix branch August 11, 2019 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants