Major overhaul of mbstring (part 20) #8257

Closed
wants to merge 25 commits into from

alexdowad
Contributor

We are almost reaching the point where the new, faster interface for converting text encodings in mbstring is implemented for all supported legacy text encodings. Actually, all that is left now is the non-encodings 'HTML-ENTITIES', UUEncode, Base64, and QPrint.

Aside from being faster, the new code in this PR does fix a number of bugs. As with the last couple of PRs, an automated test harness was used to generate vast numbers of random strings and find cases where the output of the new and old code was different. In close to 90% of such cases, a careful examination of the differences revealed that the old code was incorrect. The remaining ~10% were caused by bugs in the new code, which have been fixed.
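
The differential testing approach described above can be sketched roughly as follows. This is a minimal illustration, not the harness actually used for this PR: `old_convert` and `new_convert` are hypothetical stand-ins for the legacy and new conversion paths, and the toy "bug" in `old_convert` is invented purely so that the two disagree on some inputs.

```python
import random


def old_convert(data: bytes) -> str:
    # Hypothetical stand-in for the legacy converter: treats the input as
    # ASCII but wrongly passes bytes >= 0x80 through as Latin-1 characters.
    return data.decode("latin-1")


def new_convert(data: bytes) -> str:
    # Hypothetical stand-in for the new converter: bytes >= 0x80 are not
    # valid ASCII, so each one is replaced by an error marker ('?').
    return "".join(chr(b) if b < 0x80 else "?" for b in data)


def random_input(rng: random.Random, max_len: int = 8) -> bytes:
    # Short random byte strings; short inputs make differences easier to analyse.
    length = rng.randrange(max_len + 1)
    return bytes(rng.randrange(256) for _ in range(length))


def find_differences(old, new, iterations: int = 1000, seed: int = 0):
    # Run both converters on the same random inputs and record every input
    # on which they disagree, together with both outputs.
    rng = random.Random(seed)
    differences = []
    for _ in range(iterations):
        data = random_input(rng)
        old_out, new_out = old(data), new(data)
        if old_out != new_out:
            differences.append((data, old_out, new_out))
    return differences


if __name__ == "__main__":
    for data, old_out, new_out in find_differences(old_convert, new_convert)[:5]:
        print(data.hex(), repr(old_out), repr(new_out))
```

Recording the triggering input alongside both outputs, and fixing the random seed, means each reported difference can be reproduced later and turned into a regression test.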

FYA @nikic @cmb69

@alexdowad
Contributor Author

Just fixed one more bug. All tests are passing now.

@alexdowad
Contributor Author

Have you ever heard that old nugget of programmer's wisdom which says that if you find a bug somewhere, there is probably another similar one elsewhere in the same codebase?

After seeing that the GitHub CI process caught one bug which had escaped my own testing process, that thought echoed in my ears. I decided to examine this PR again to see if there could be another instance of the same problem... and wouldn't you know it, there were not one but three in mbfilter_sjis_mobile.c. 😮‍💨

It so happened that it was a bit tricky to construct a test case which 'tickled' the bug, but I did so, and added it as a regression test case.

Member

@derickr derickr left a comment

FWIW, I had a quick look at this PR, and at a glance can't see any obvious issues. I'm not sure whether I'm qualified to approve it though? In general, as you've added so many tests, I can't see a problem by just doing so though.

@alexdowad
Contributor Author

> FWIW, I had a quick look at this PR, and at a glance can't see any obvious issues. I'm not sure whether I'm qualified to approve it though? In general, as you've added so many tests, I can't see a problem by just doing so though.

@derickr, thanks for the review!

Nikita was last listed as the primary maintainer of mbstring in EXTENSIONS, so it would be nice to hear from him, but if not... I may go ahead and merge.

Member

@nikic nikic left a comment

Two (related) concerns:

  • A number of commits mention fixes relative to the legacy implementation, but most of those don't seem to come with additional test coverage?
  • I'm concerned that we're deviating in behavior between the legacy and fast implementation here. The problem is that the fast implementation is currently only used in some places (only encoding conversion?) so you may see different "interpretations" of the same string depending on which function you use and which encoding/decoding hooks that function happens to use.

Resolved review comment on ext/mbstring/libmbfl/filters/mbfilter_sjis_2004.c (outdated)
Resolved review comment on ext/mbstring/tests/utf8_mobile_encodings.phpt (outdated)

@alexdowad
Contributor Author

@nikic Excellent review (as usual)!

For the record, here are bugs I found in the old conversion code while fuzzing it to look for differences between the old and new code:

• In some cases, when converting to UTF-7, just a bare "+" was emitted
• JIS and ISO-2022-JP would accept a bare 0x1B (ESC) without emitting an error
• UCS-4BE, UCS-4LE, and UCS-4 would accept codepoints up to 0x200000 without error, though the highest valid codepoint is 0x10FFFF
• Some 2-character strings starting with "+" would produce no output at all when converted from UTF-7 to some other encoding
• For some error conditions, GB18030 would emit a null byte (0x00) rather than an error marker
• For some error conditions, CP50220 would not emit an error marker
• When converting some single-character strings from JIS or ISO-2022-JP to CP50220, no output at all was produced
• JIS, CP50221, and CP50222 would not return to ASCII mode at the end of a string, which is needed for strings to be safely concatenated
• HZ would not return to ASCII mode after "~~"
• CP50220 would 'eat' a trailing null byte without producing any output
• ISO-2022-JP-2004 did not distinguish between JIS X 0213 plane 1 and plane 2; after it switched to one plane, it would not emit the correct escape code to switch to the other plane if necessary
• If an ISO-2022-JP-KDDI string ended with a character which could have been part of a special KDDI emoji, it would not emit the escape code to return to ASCII mode
• EUC-JP-2004, SJIS-2004, ISO-2022-JP-2004, SJIS-mac, JIS, ISO-2022-JP did not properly call the next "flush function" in the chain when ending a conversion operation; depending on what the destination encoding was, this could cause the output to be truncated (especially when converting to UTF-7, UTF-7-IMAP, ISO-2022-JP, CP50220, CP50221, CP50222, or any of the KDDI, Softbank, or Docomo-specific encodings)
• UTF-7-IMAP converted U+0000 to 0x00; it seems this was deliberate, but it clearly violates the RFC
• ISO-2022-JP-2004 would pass bytes from 0x80-0x9F straight through to the output without emitting any escape codes to switch to the proper mode
• U+FF95 and a few other codepoints which have a special representation in EUC-JP-2004 were converted to the same special value for ISO-2022-JP-2004, which is not correct
• In some cases, HZ would not return to ASCII mode at the end of a string
• In some cases, ISO-2022-KR would take a codepoint which is not in KSC 5601 or KS X 1001 at all, subtract 0x8080 from it, and then use it as a KS X 1001 code sequence (totally mangling it)
• ISO-2022-JP-KDDI could not emit the special KDDI emoji for national flags
• ISO-2022-KR used an incorrect test to determine whether an escape code is needed to return to ASCII mode at the end of a string, so in some cases, the escape code was not emitted correctly
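
Several of the items above concern stateful encodings which must return to ASCII mode at the end of a string. Why that matters for concatenation can be shown with a small sketch. Note that this uses Python's built-in `iso2022_jp` codec purely as a stand-in; it is not mbstring, and the helper names `concat_and_decode` and `strip_final_escape` are made up for the illustration.

```python
ESC_TO_ASCII = b"\x1b(B"  # ISO-2022-JP escape sequence which selects ASCII


def concat_and_decode(first: bytes, second: bytes) -> str:
    # Concatenate two independently encoded strings and decode the result.
    # Invalid sequences become replacement characters instead of raising.
    return (first + second).decode("iso2022_jp", errors="replace")


def strip_final_escape(encoded: bytes) -> bytes:
    # Simulate an encoder which forgets to return to ASCII mode at the end.
    if encoded.endswith(ESC_TO_ASCII):
        return encoded[: -len(ESC_TO_ASCII)]
    return encoded


if __name__ == "__main__":
    kanji = "日本".encode("iso2022_jp")
    ascii_part = "abc".encode("iso2022_jp")

    # With the trailing escape, the ASCII text survives concatenation.
    print(concat_and_decode(kanji, ascii_part))

    # Without it, the decoder is still in JIS X 0208 mode when it reaches
    # the ASCII bytes, so they are misread as double-byte characters.
    print(repr(concat_and_decode(strip_final_escape(kanji), ascii_part)))
```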

I can go through and add a few tests to cover these specific situations.

Regarding the differences between the 'old' and 'new' conversion functions... when is the next release coming up? If there is still enough time, I could just finish the switchover within the available time. Otherwise, I could fix these bugs in the old conversion code, so we don't have differences between functions which use the old conversion code and those which use the new conversion code.

@alexdowad
Contributor Author

By the way, I am just adding implementations of the 'fast' conversion interface for UUEncode and Base64... we want to eventually remove those from mbstring, but I want to move ahead and switch over completely to the new interface before that, so we need temporary implementations for them...

@nikic
Member

nikic commented Apr 24, 2022

> Regarding the differences between the 'old' and 'new' conversion functions... when is the next release coming up? If there is still enough time, I could just finish the switchover within the available time. Otherwise, I could fix these bugs in the old conversion code, so we don't have differences between functions which use the old conversion code and those which use the new conversion code.

I don't think a schedule for PHP 8.2 is up yet, but I'd expect it to be about the same as https://wiki.php.net/todo/php81 plus one year. Probably makes sense to focus on switching to the new conversion functions and getting rid of the old ones entirely.

@alexdowad
Contributor Author

> I don't think a schedule for PHP 8.2 is up yet, but I'd expect it to be about the same as https://wiki.php.net/todo/php81 plus one year. Probably makes sense to focus on switching to the new conversion functions and getting rid of the old ones entirely.

Great! Thanks.

@alexdowad
Contributor Author

Haven't yet added more tests as suggested by @nikic, but I have just added a 'fast' conversion filter for HTML-ENTITIES.

@alexdowad
Contributor Author

Have added more tests as suggested by @nikic, though they do not exhaustively cover every issue discovered by fuzzing. (Some of the issues were extremely obscure and I am finding it hard to reproduce them, since I didn't keep a record of the input strings which triggered the differing outputs.)

Fast conversion filters for Base64, UUEncode, QPrint, and HTML-ENTITIES are included. That is all the text encodings supported by mbstring.

Next step is to start using the faster conversion filters throughout.

@alexdowad
Contributor Author

Test failure is spurious. (It's testing the effect of lstat.)

Resolved review comment on ext/mbstring/libmbfl/filters/mbfilter_uuencode.c (outdated)
Resolved review comment on ext/mbstring/libmbfl/filters/mbfilter_base64.c (outdated)
Resolved review comment on ext/mbstring/libmbfl/filters/mbfilter_base64.c (outdated)
Resolved review comment on ext/mbstring/libmbfl/filters/mbfilter_base64.c (outdated)

@alexdowad
Contributor Author

@nikic just saved my skin here by spotting a buffer overrun.

This tells me that I need to do more testing of this code... or at least read through it all another time looking for any other possible buffer overruns.

@alexdowad
Contributor Author

Looks like the CI build is broken. Same test failure again:

=====================================================================
FAILED TEST SUMMARY
---------------------------------------------------------------------
Test lstat() and stat() functions: usage variations - effects changing permissions of link [ext/standard/tests/file/lstat_stat_variation15.phpt]
=====================================================================

@alexdowad
Contributor Author

Well, this is interesting. @nikic's discovery of a buffer overrun in my UUEncode conversion code prompted me to add a38c7e5. Wouldn't you just know it... that immediately revealed another buffer overrun bug in my UTF-7/UTF7-IMAP code.

Trying to write correct, non-trivial code is no joke!

I am going to do some more personal code review, as well as some more fuzzing, of this PR.

For now, all of @nikic's feedback has been addressed.

@alexdowad
Contributor Author

(By the way, the UTF-7 conversion code with the buffer overrun did not go into any public release of PHP. So there is no need to release a patch for 8.1 or anything.)

@alexdowad
Contributor Author

😅

As a tiny little stress test of my own code, I tried modifying mb_fast_convert to use a tiny buffer for passing wchars between the input and output stages.

That immediately revealed a bug in my code for SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, and SJIS-Mobile#SOFTBANK.
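
The idea behind that stress test can be sketched with a toy two-stage converter. Everything here is hypothetical and unrelated to the real `mb_fast_convert` code: the "encoding" maps each byte to one codepoint, except that byte 0xFF expands to two codepoints (loosely modelled on an emoji which decodes to a pair of codepoints). The point is only that the output must not depend on the size of the intermediate buffer.

```python
FLAG = [0x1F1EF, 0x1F1F5]  # hypothetical two-codepoint expansion of byte 0xFF


def decode_chunk(data: bytes, pos: int, buf_len: int):
    # Decode from data[pos:] until the intermediate buffer is full.
    # Returns the decoded codepoints and the new input position.
    out = []
    while pos < len(data):
        byte = data[pos]
        needed = 2 if byte == 0xFF else 1
        if len(out) + needed > buf_len:
            break  # not enough room: leave this byte for the next call
        out.extend(FLAG if byte == 0xFF else [byte])
        pos += 1
    return out, pos


def convert(data: bytes, buf_len: int) -> str:
    # The buffer must hold the largest single expansion, or no progress
    # could ever be made on a 0xFF byte.
    if buf_len < 2:
        raise ValueError("buffer must hold at least 2 codepoints")
    pos, pieces = 0, []
    while pos < len(data):
        chunk, pos = decode_chunk(data, pos, buf_len)
        pieces.append("".join(chr(cp) for cp in chunk))
    return "".join(pieces)


if __name__ == "__main__":
    sample = bytes([0x41, 0xFF, 0xFF, 0x42, 0xFF])
    reference = convert(sample, 128)
    for buf_len in range(2, 10):
        assert convert(sample, buf_len) == reference, buf_len
    print("output is independent of buffer size")
```

A decoder which checked only for one free slot before handling 0xFF would pass with a large buffer and fail with a tiny one, which is the kind of bug a reduced buffer size is meant to expose.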

Just need to add a regression test for that...

@alexdowad
Contributor Author

If you want to see the bugfix which was just added, git diff e4ef64b.

I need to pound harder on this new code and see if I can shake more bugs out. I think it's time to break afl out and see if it can find anything.

An overly complex boolean test was used to check if a 3-byte code unit
was valid. Convert it to an equivalent test with fewer terms.

When testing the preceding commits, I used a script to generate a large
number of random strings and try to find strings which would yield
different outputs from the new and old encoding conversion code.
Some were found. In most cases, analysis revealed that the new code
was correct and the old code was not.

In all cases where the new code was incorrect, regression tests were
added. However, there may be some value in adding regression tests
for cases where the old code was incorrect as well. That is done here.

This does not cover every case where the new and old code yielded
different results. Some of them were very obscure, and it is proving
difficult even to reproduce them (since I did not keep a record of
all the input strings which triggered the differing output).

After Nikita Popov found a buffer overrun bug in one of my pull
requests, I was prompted to add more assertions in a38c7e5 to help
me catch such bugs myself more easily in testing.

Wouldn't you just know it... as soon as I added those assertions, the
mbstring test suite caught another buffer overrun bug in my UTF-7
conversion code, which I wrote the better part of a year ago.

Then, when I started fuzzing the code with libfuzzer, I found
and fixed another buffer overflow:

If we enter the main loop, which normally outputs 3 decoded Base64
characters, where the first half of a surrogate pair had appeared at
the end of the previous run, but the second half does not appear
on this run, we need to output one error marker.

Then, at the end of the main loop, if the Base64 input ends at an
unexpected position AND the last character was not a legal
Base64-encoded character, we need to output two error markers
for that. The three error markers plus two valid, decoded bytes
can push us over the available space in our wchar buffer.
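
The arithmetic in that description can be made concrete with a small sketch. It is an illustration under stated assumptions, not the actual UTF-7 code: `WcharBuffer` and `run_passes` are hypothetical names, and the buffer capacity of 8 is arbitrary. The worst case per pass (1 error marker + 2 decoded values + 2 error markers) is taken from the description above; the "only 3 slots needed" check models the assumption that a pass normally outputs 3 values.

```python
# Worst case for one pass through the loop, per the description above:
# 1 error marker + 2 decoded values + 2 error markers.
WORST_CASE_PER_PASS = 1 + 2 + 2


class WcharBuffer:
    # Hypothetical fixed-capacity output buffer which reports overruns
    # instead of silently writing past the end.
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = []

    def room(self) -> int:
        return self.capacity - len(self.items)

    def push(self, value: int) -> None:
        if self.room() <= 0:
            raise OverflowError("wchar buffer overrun")
        self.items.append(value)


def run_passes(buf: WcharBuffer, outputs_per_pass: int, required_room: int) -> int:
    # Start a new pass only while at least `required_room` slots are free;
    # each pass emits `outputs_per_pass` values. Returns the number of passes.
    if outputs_per_pass <= 0 or required_room <= 0:
        raise ValueError("outputs_per_pass and required_room must be positive")
    passes = 0
    while buf.room() >= required_room:
        for _ in range(outputs_per_pass):
            buf.push(0)
        passes += 1
    return passes


if __name__ == "__main__":
    # Reserving the worst case up front: the second pass is never started.
    print(run_passes(WcharBuffer(8), WORST_CASE_PER_PASS, WORST_CASE_PER_PASS))

    # Reserving only the normal 3 slots: a worst-case pass overruns.
    try:
        run_passes(WcharBuffer(8), WORST_CASE_PER_PASS, 3)
    except OverflowError as exc:
        print("caught:", exc)
```
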
@alexdowad
Contributor Author

All bugs which were found using php-fuzz-mbstring have been fixed. I think this is ready to merge now.

@nikic
Member

nikic commented May 27, 2022

I think you accidentally included more commits than intended, e.g. there's one marked WIP at the end.

@alexdowad
Contributor Author

> I think you accidentally included more commits than intended, e.g. there's one marked WIP at the end.

You are absolutely right, my bad. Fixed that now.

@alexdowad
Contributor Author

The one commit which is a bit new here is 35e4768. That was done so that php-fuzz-mbstring would exercise the new conversion code.

Resolved review comment on ext/mbstring/mbstring.c (outdated)
Resolved review comment on ext/mbstring/mbstring.h
Resolved review comment on ext/mbstring/mbstring.c (outdated)
Resolved review comment on ext/mbstring/mbstring.c (outdated)

@alexdowad
Contributor Author

As always, thanks to @nikic for the thorough code review.

That's what all existing callers want anyways. This avoids 2
unnecessary copies of the converted string.
@alexdowad alexdowad closed this May 28, 2022
@alexdowad alexdowad deleted the cleanup-mbstring-20 branch May 28, 2022 19:59
@alexdowad
Contributor Author

Merged.

Thanks all for your help.
