Major overhaul of mbstring (part 20) #8257

alexdowad · 2022-03-27T20:23:42Z

We have almost reached the point where the new, faster interface for converting text encodings in mbstring is implemented for all supported legacy text encodings. All that is left now are the non-encodings 'HTML-ENTITIES', UUEncode, Base64, and QPrint.

Aside from being faster, the new code in this PR fixes a number of bugs. As with the last couple of PRs, an automated test harness generated vast numbers of random strings to find cases where the output of the new and old code differed. In close to 90% of such cases, careful examination of the differences revealed that the old code was incorrect. The remaining ~10% were caused by bugs in the new code, which have been fixed.
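The comparison approach described above can be sketched roughly as follows. This is only a minimal illustration, not the harness actually used for this PR: `convert_fn`, `find_mismatch`, `toy_old`, and `toy_new` are hypothetical names, and the two toy converters merely stand in for the legacy and fast conversion code (the "old" one silently drops a bare ESC byte, the "new" one replaces it with an error marker).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical converter signature: writes at most out_cap bytes, returns bytes written. */
typedef size_t (*convert_fn)(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_cap);

/* Toy stand-in for the legacy code: silently drops a bare ESC (0x1B). */
static size_t toy_old(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_cap) {
    size_t n = 0;
    for (size_t i = 0; i < in_len && n < out_cap; i++) {
        if (in[i] != 0x1B) out[n++] = in[i];
    }
    return n;
}

/* Toy stand-in for the new code: emits '?' as an error marker for a bare ESC. */
static size_t toy_new(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_cap) {
    size_t n = 0;
    for (size_t i = 0; i < in_len && n < out_cap; i++) {
        out[n++] = (in[i] == 0x1B) ? (uint8_t)'?' : in[i];
    }
    return n;
}

/* Small deterministic PRNG (xorshift32), so a failing input can be replayed from its seed. */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Feed `rounds` random strings to both converters. Returns the index of the
 * first round whose outputs differ (copying that input to `failing`), or -1. */
static long find_mismatch(convert_fn old_fn, convert_fn new_fn, uint32_t seed,
                          long rounds, uint8_t *failing, size_t *failing_len) {
    uint8_t in[16], out_old[64], out_new[64];
    assert(seed != 0); /* xorshift32 must not be seeded with zero */
    uint32_t state = seed;
    for (long i = 0; i < rounds; i++) {
        size_t len = next_random(&state) % (sizeof(in) + 1);
        for (size_t j = 0; j < len; j++) in[j] = (uint8_t)next_random(&state);
        size_t n_old = old_fn(in, len, out_old, sizeof(out_old));
        size_t n_new = new_fn(in, len, out_new, sizeof(out_new));
        if (n_old != n_new || memcmp(out_old, out_new, n_old) != 0) {
            memcpy(failing, in, len); /* keep the input so it can become a regression test */
            *failing_len = len;
            return i;
        }
    }
    return -1;
}
```

Recording the failing input (or at least the seed) matters: each mismatch still has to be examined by hand to decide which side is wrong, and a saved input can be turned directly into a regression test.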

FYA @nikic @cmb69

alexdowad · 2022-03-28T12:16:23Z

Just fixed one more bug. All tests are passing now.

alexdowad · 2022-03-28T20:35:25Z

Have you ever heard that old nugget of programmer's wisdom which says that if you find a bug somewhere, there is probably another similar one elsewhere in the same codebase?

After seeing that the GitHub CI process caught one bug which had escaped my own testing process, that thought echoed in my ears. I decided to examine this PR again to see if there could be another instance of the same problem... and wouldn't you know it, there were not one but three in mbfilter_sjis_mobile.c. 😮‍💨

It so happened that it was a bit tricky to construct a test case which 'tickled' the bug, but I did so, and added it as a regression test case.

derickr

FWIW, I had a quick look at this PR, and at a glance can't see any obvious issues. I'm not sure whether I'm qualified to approve it though? In general, as you've added so many tests, I can't see a problem by just doing so though.

alexdowad · 2022-04-22T11:06:43Z

> FWIW, I had a quick look at this PR, and at a glance can't see any obvious issues. I'm not sure whether I'm qualified to approve it though? In general, as you've added so many tests, I can't see a problem by just doing so though.

@derickr, thanks for the review!

Nikita was last listed as the primary maintainer of mbstring in EXTENSIONS, so it would be nice to hear from him, but if not... I may go ahead and merge.

nikic

Two (related) concerns:

1. A number of commits mention fixes relative to the legacy implementation, but most of those don't seem to come with additional test coverage?
2. I'm concerned that we're deviating in behavior between the legacy and fast implementation here. The problem is that the fast implementation is currently only used in some places (only encoding conversion?) so you may see different "interpretations" of the same string depending on which function you use and which encoding/decoding hooks that function happens to use.

ext/mbstring/libmbfl/filters/mbfilter_sjis_2004.c

ext/mbstring/tests/utf8_mobile_encodings.phpt

alexdowad · 2022-04-23T17:51:29Z

@nikic Excellent review (as usual)!

For the record, here are the bugs I found in the old conversion code while fuzzing it to look for differences between the old and new code:

• In some cases, when converting to UTF-7, just a bare "+" was emitted
• JIS and ISO-2022-JP would accept a bare 0x1B (ESC) without emitting an error
• UCS-4BE, UCS-4LE, and UCS-4 would accept codepoints up to 0x200000 without error, though the highest valid codepoint is 0x10FFFF
• Some 2-character strings starting with "+" would produce no output at all when converted from UTF-7 to some other encoding
• For some error conditions, GB18030 would emit a null byte (0x00) rather than an error marker
• For some error conditions, CP50220 would not emit an error marker
• When converting some single-character strings from JIS or ISO-2022-JP to CP50220, no output at all was produced
• JIS, CP50221, and CP50222 would not return to ASCII mode at the end of a string, which is needed for strings to be safely concatenated
• HZ would not return to ASCII mode after "~~"
• CP50220 would 'eat' a trailing null byte without producing any output
• ISO-2022-JP-2004 did not distinguish between JIS X 0213 plane 1 and plane 2; after it switched to one plane, it would not emit the correct escape code to switch to the other plane if necessary
• If an ISO-2022-JP-KDDI string ended with a character which could have been part of a special KDDI emoji, it would not emit the escape code to return to ASCII mode
• EUC-JP-2004, SJIS-2004, ISO-2022-JP-2004, SJIS-mac, JIS, ISO-2022-JP did not properly call the next "flush function" in the chain when ending a conversion operation; depending on what the destination encoding was, this could cause the output to be truncated (especially when converting to UTF-7, UTF-7-IMAP, ISO-2022-JP, CP50220, CP50221, CP50222, or any of the KDDI, Softbank, or Docomo-specific encodings)
• UTF-7-IMAP converted U+0000 to 0x00; it seems this was deliberate, but it clearly violates the RFC
• ISO-2022-JP-2004 would pass bytes from 0x80-0x9F straight through to the output without emitting any escape codes to switch to the proper mode
• For U+FF95 and a few other codepoints which have a special representation in EUC-JP-2004, they were converted to the same special value for ISO-2022-JP-2004, which is not correct
• In some cases, HZ would not return to ASCII mode at the end of a string
• In some cases, ISO-2022-KR would take a codepoint which is not in KSC 5601 or KS X 1001 at all, subtract 0x8080 from it, and then use it as a KS X 1001 code sequence (totally mangling it)
• ISO-2022-JP-KDDI could not emit the special KDDI emoji for national flags
• ISO-2022-KR used an incorrect test to determine whether an escape code is needed to return to ASCII mode at the end of a string, so in some cases, the escape code was not emitted correctly
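
To make one item from the list above concrete, here is a minimal sketch of a UCS-4BE decoder that rejects anything above 0x10FFFF. It is not the mbstring implementation: `ucs4be_decode` and `ERROR_MARKER` are hypothetical names chosen for this example, and the error value is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CODEPOINT 0x10FFFFu
#define ERROR_MARKER  0xFFFFFFFFu /* hypothetical in-band error value for this sketch */

/* Decode big-endian UCS-4 into codepoints.
 * `out` must have room for (len + 3) / 4 entries. Returns entries written. */
static size_t ucs4be_decode(const uint8_t *in, size_t len, uint32_t *out) {
    assert(in != NULL && out != NULL);
    size_t n = 0, i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t cp = ((uint32_t)in[i] << 24) | ((uint32_t)in[i + 1] << 16)
                    | ((uint32_t)in[i + 2] << 8) | (uint32_t)in[i + 3];
        /* Anything above U+10FFFF is not a valid codepoint, so flag it. */
        out[n++] = (cp <= MAX_CODEPOINT) ? cp : ERROR_MARKER;
    }
    if (i < len) {
        out[n++] = ERROR_MARKER; /* trailing bytes do not form a full 4-byte unit */
    }
    return n;
}
```

With this check, an input unit such as 00 11 00 00 (0x110000) is reported as an error instead of being passed through.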

I can go through and add a few tests to cover these specific situations.
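
As an illustration of what such a test would exercise, here is a toy stateful encoder for the "return to ASCII mode at the end of a string" case. It is a sketch under stated assumptions, not mbstring code: it handles only ASCII and halfwidth katakana (U+FF61..U+FF9F), using the ISO-2022 escape sequences ESC ( I for JIS X 0201 katakana and ESC ( B for ASCII, and all names (`toy_encode`, `toy_flush`, `struct encoder`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mode { MODE_ASCII, MODE_KANA };

struct encoder { enum mode mode; };

/* Emit an escape sequence only if the encoder is not already in the wanted mode. */
static size_t switch_mode(struct encoder *enc, enum mode wanted, uint8_t *out) {
    if (enc->mode == wanted) return 0;
    out[0] = 0x1B;
    out[1] = '(';
    out[2] = (wanted == MODE_KANA) ? 'I' : 'B';
    enc->mode = wanted;
    return 3;
}

/* Encode codepoints; anything outside ASCII and halfwidth katakana becomes '?'.
 * `out` must have room for 4 * len bytes (escape sequence plus one byte each). */
static size_t toy_encode(struct encoder *enc, const uint32_t *in, size_t len, uint8_t *out) {
    assert(enc != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xFF61 && cp <= 0xFF9F) {
            n += switch_mode(enc, MODE_KANA, out + n);
            out[n++] = (uint8_t)(cp - 0xFF61 + 0x21);
        } else {
            n += switch_mode(enc, MODE_ASCII, out + n);
            out[n++] = (cp < 0x80) ? (uint8_t)cp : (uint8_t)'?';
        }
    }
    return n;
}

/* End of string: return to ASCII mode so the output can be safely concatenated.
 * `out` must have room for 3 bytes. */
static size_t toy_flush(struct encoder *enc, uint8_t *out) {
    assert(enc != NULL);
    return switch_mode(enc, MODE_ASCII, out);
}
```

A regression test for this class of bug only needs an input that ends while the encoder is in a non-ASCII mode, followed by a check that the final bytes of the output are the escape sequence back to ASCII.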

Regarding the differences between the 'old' and 'new' conversion functions... when is the next release coming up? If there is still enough time, I could just finish the switchover within the available time. Otherwise, I could fix these bugs in the old conversion code, so we don't have differences between functions which use the old conversion code and those which use the new conversion code.

alexdowad · 2022-04-23T17:57:37Z

By the way, I am just adding implementations of the 'fast' conversion interface for UUEncode and Base64... we want to eventually remove those from mbstring, but I want to move ahead and switch over completely to the new interface before that, so we need temporary implementations for them...

nikic · 2022-04-24T10:51:59Z

> Regarding the differences between the 'old' and 'new' conversion functions... when is the next release coming up? If there is still enough time, I could just finish the switchover within the available time. Otherwise, I could fix these bugs in the old conversion code, so we don't have differences between functions which use the old conversion code and those which use the new conversion code.

I don't think a schedule for PHP 8.2 is up yet, but I'd expect it to be about the same as https://wiki.php.net/todo/php81 plus one year. Probably makes sense to focus on switching to the new conversion functions and getting rid of the old ones entirely.

alexdowad · 2022-04-24T11:09:33Z

> I don't think a schedule for PHP 8.2 is up yet, but I'd expect it to be about the same as https://wiki.php.net/todo/php81 plus one year. Probably makes sense to focus on switching to the new conversion functions and getting rid of the old ones entirely.

Great! Thanks.

alexdowad · 2022-04-24T14:11:56Z

Haven't yet added more tests as suggested by @nikic, but I have just added a 'fast' conversion filter for HTML-ENTITIES.

alexdowad · 2022-05-03T19:23:44Z

Have added more tests as suggested by @nikic, though they do not exhaustively cover every issue discovered by fuzzing. (Some of the issues were extremely obscure and I am finding it hard to reproduce them, since I didn't keep a record of the input strings which triggered the differing outputs.)

Fast conversion filters for Base64, UUEncode, QPrint, and HTML-ENTITIES are included. With those, every text encoding supported by mbstring now has a fast conversion filter.

Next step is to start using the faster conversion filters throughout.

alexdowad · 2022-05-03T20:20:39Z

The test failure is spurious. (The failing test covers lstat(), not mbstring.)

ext/mbstring/libmbfl/filters/mbfilter_uuencode.c

ext/mbstring/libmbfl/filters/mbfilter_base64.c

alexdowad · 2022-05-07T21:09:19Z

@nikic just saved my skin here by spotting a buffer overrun.

This tells me that I need to do more testing of this code... or at least read through it all another time looking for any other possible buffer overruns.

alexdowad · 2022-05-08T10:46:36Z

Looks like the CI build is broken. Same test failure again:

=====================================================================
FAILED TEST SUMMARY
---------------------------------------------------------------------
Test lstat() and stat() functions: usage variations - effects changing permissions of link [ext/standard/tests/file/lstat_stat_variation15.phpt]
=====================================================================

alexdowad · 2022-05-08T13:10:39Z

Well, this is interesting. @nikic's discovery of a buffer overrun in my UUEncode conversion code prompted me to add a38c7e5. Wouldn't you just know it... that immediately revealed another buffer overrun bug in my UTF-7/UTF-7-IMAP code.

Trying to write correct, non-trivial code is no joke!

I am going to do some more personal code review, as well as some more fuzzing, of this PR.

For now, all of @nikic's feedback has been addressed.

alexdowad · 2022-05-08T13:11:52Z

(By the way, the UTF-7 conversion code with the buffer overrun did not go into any public release of PHP. So there is no need to release a patch for 8.1 or anything.)

alexdowad · 2022-05-08T19:11:57Z

😅

As a tiny little stress test of my own code, I tried modifying mb_fast_convert to use a tiny buffer for passing wchars between the input and output stages.

That immediately revealed a bug in my code for SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, and SJIS-Mobile#SOFTBANK.
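
The idea behind that stress test can be sketched as follows. This is not mb_fast_convert itself: it is a toy two-stage pipeline (Latin-1 bytes to codepoints, then codepoints to UTF-8) with hypothetical names, where the size of the intermediate buffer is a parameter. A correct converter must produce the same output no matter how small that buffer is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stage 1: decode Latin-1 bytes into codepoints until the input is exhausted
 * or the intermediate buffer is full. Advances *in past the consumed bytes. */
static size_t decode_latin1(const uint8_t **in, const uint8_t *in_end,
                            uint32_t *buf, size_t buf_cap) {
    size_t n = 0;
    while (*in < in_end && n < buf_cap) {
        buf[n++] = *(*in)++;
    }
    return n;
}

/* Stage 2: encode codepoints below U+0800 as UTF-8. Returns bytes written. */
static size_t encode_utf8(const uint32_t *buf, size_t n, uint8_t *out) {
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] < 0x80) {
            out[w++] = (uint8_t)buf[i];
        } else {
            out[w++] = (uint8_t)(0xC0 | (buf[i] >> 6));
            out[w++] = (uint8_t)(0x80 | (buf[i] & 0x3F));
        }
    }
    return w;
}

/* Run both stages, passing codepoints through a buffer of `buf_cap` entries.
 * `out` must have room for 2 * len bytes. */
static size_t toy_convert(const uint8_t *in, size_t len, uint8_t *out, size_t buf_cap) {
    uint32_t buf[128];
    assert(buf_cap >= 1 && buf_cap <= 128);
    const uint8_t *p = in, *end = in + len;
    size_t total = 0;
    while (p < end) {
        size_t n = decode_latin1(&p, end, buf, buf_cap);
        total += encode_utf8(buf, n, out + total);
    }
    return total;
}
```

Comparing the output for `buf_cap = 1` against the output for a large buffer forces every possible chunk boundary to occur, which is where state carried between calls tends to go wrong in more complicated encodings.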

Just need to add a regression test for that...

alexdowad · 2022-05-08T19:28:37Z

To see the bugfix which was just added, run git diff e4ef64b.

I need to pound harder on this new code and see if I can shake more bugs out. I think it's time to break afl out and see if it can find anything.

When testing the preceding commits, I used a script to generate a large number of random strings and try to find strings which would yield different outputs from the new and old encoding conversion code. Some were found. In most cases, analysis revealed that the new code was correct and the old code was not. In all cases where the new code was incorrect, regression tests were added. However, there may be some value in adding regression tests for cases where the old code was incorrect as well. That is done here. This does not cover every case where the new and old code yielded different results. Some of them were very obscure, and it is proving difficult even to reproduce them (since I did not keep a record of all the input strings which triggered the differing output).

After Nikita Popov found a buffer overrun bug in one of my pull requests, I was prompted to add more assertions in a38c7e5 to help me catch such bugs myself more easily in testing. Wouldn't you just know it... as soon as I added those assertions, the mbstring test suite caught another buffer overrun bug in my UTF-7 conversion code, which I wrote the better part of a year ago.

Then, when I started fuzzing the code with libFuzzer, I found and fixed another buffer overflow. If we enter the main loop, which normally outputs 3 decoded Base64 characters, where the first half of a surrogate pair had appeared at the end of the previous run, but the second half does not appear on this run, we need to output one error marker. Then, at the end of the main loop, if the Base64 input ends at an unexpected position AND the last character was not a legal Base64-encoded character, we need to output two error markers for that. The three error markers plus two valid, decoded bytes can push us over the available space in our wchar buffer.
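
The general pattern behind this kind of fix can be sketched as follows: the loop only starts another iteration if the buffer has room for the worst case that one iteration can produce, and every write is guarded by an assertion. This is an illustrative toy, not the UTF-7 code; `MAX_OUT_PER_STEP` is set to 5 to mirror the worst case described above (three error markers plus two decoded values), and the rule for how many entries a step emits is invented purely for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUT_PER_STEP 5 /* worst case for one iteration in this sketch */

/* Toy step: one input unit produces between 0 and MAX_OUT_PER_STEP entries. */
static size_t emit_step(uint8_t unit, uint32_t *out, const uint32_t *limit) {
    size_t count = unit % (MAX_OUT_PER_STEP + 1);
    for (size_t i = 0; i < count; i++) {
        assert(out + i < limit); /* catches an overrun in testing */
        out[i] = unit;
    }
    return count;
}

/* Fill `buf` from the input, stopping early when the worst case no longer fits.
 * Advances *in past the consumed input and returns the number of entries written. */
static size_t decode_chunk(const uint8_t **in, const uint8_t *in_end,
                           uint32_t *buf, size_t buf_cap) {
    uint32_t *out = buf;
    const uint32_t *limit = buf + buf_cap;
    assert(buf_cap >= MAX_OUT_PER_STEP);
    while (*in < in_end && (size_t)(limit - out) >= MAX_OUT_PER_STEP) {
        out += emit_step(*(*in)++, out, limit);
    }
    return (size_t)(out - buf);
}
```

If the bound in the loop condition is smaller than what a single iteration can really emit, the assertion inside `emit_step` is what turns a silent overrun into an immediate test failure.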

alexdowad · 2022-05-27T19:57:49Z

All bugs which were found using php-fuzz-mbstring have been fixed. I think this is ready to merge now.

nikic · 2022-05-27T21:58:17Z

I think you accidentally included more commits than intended, e.g. there's one marked WIP at the end.

alexdowad · 2022-05-28T07:11:49Z

> I think you accidentally included more commits than intended, e.g. there's one marked WIP at the end.

You are absolutely right, my bad. Fixed that now.

alexdowad · 2022-05-28T07:13:24Z

The one commit which is a bit new here is 35e4768. That was done so that php-fuzz-mbstring would exercise the new conversion code.

ext/mbstring/mbstring.c

ext/mbstring/mbstring.h

ext/mbstring/mbstring.c

sapi/fuzzer/fuzzer-mbstring.c

alexdowad · 2022-05-28T19:07:31Z

As always, thanks to @nikic for the thorough code review.

alexdowad · 2022-05-28T19:59:31Z

Merged.

Thanks all for your help.

alexdowad force-pushed the cleanup-mbstring-20 branch from 161ee22 to fff72ba Compare March 28, 2022 11:09

alexdowad force-pushed the cleanup-mbstring-20 branch from fff72ba to 71c5558 Compare March 28, 2022 20:29

alexdowad force-pushed the cleanup-mbstring-20 branch from 71c5558 to a7ed6f6 Compare March 28, 2022 21:44

alexdowad mentioned this pull request Apr 9, 2022

mb_convert_case($str, MB_CASE_TITLE, "UTF-8"); doesn't convert the Greek last letter to ς but to σ #8096

Closed

derickr requested review from cmb69 and nikic April 22, 2022 10:11

derickr added the Waiting on Review label Apr 22, 2022

derickr reviewed Apr 22, 2022

nikic reviewed Apr 23, 2022

ext/mbstring/libmbfl/filters/mbfilter_sjis_2004.c

ext/mbstring/tests/utf8_mobile_encodings.phpt

alexdowad force-pushed the cleanup-mbstring-20 branch from a7ed6f6 to 62103bb Compare May 3, 2022 19:21

nikic reviewed May 7, 2022

alexdowad force-pushed the cleanup-mbstring-20 branch from 62103bb to e4ef64b Compare May 8, 2022 13:04

nikic approved these changes May 8, 2022

alexdowad force-pushed the cleanup-mbstring-20 branch from e4ef64b to 4924c6c Compare May 8, 2022 19:27

alexdowad added 12 commits May 27, 2022 21:51

Implement fast text conversion interface for UUENCODE

3ced851

Implement fast text conversion interface for Base64

5bfbdf2

Simplify code for converting UTF-8

8804db3

An overly complex boolean test was used to check if a 3-byte code unit was valid. Convert it to an equivalent test with fewer terms.
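
For reference, a compact validity check for a 3-byte UTF-8 sequence might look like the sketch below. It is not the code from this commit; the function names are hypothetical, and the conditions simply follow the UTF-8 definition (lead byte 0xE0..0xEF, two continuation bytes, no overlong forms, no surrogates).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if (c1, c2, c3) is a well-formed 3-byte UTF-8 sequence. */
static bool utf8_3byte_valid(uint8_t c1, uint8_t c2, uint8_t c3) {
    if ((c1 & 0xF0) != 0xE0) return false;                          /* not a 3-byte lead */
    if ((c2 & 0xC0) != 0x80 || (c3 & 0xC0) != 0x80) return false;   /* bad continuation */
    if (c1 == 0xE0 && c2 < 0xA0) return false;                      /* overlong (below U+0800) */
    if (c1 == 0xED && c2 >= 0xA0) return false;                     /* surrogate U+D800..U+DFFF */
    return true;
}

/* Decode a sequence that has already been validated. */
static uint32_t utf8_3byte_decode(uint8_t c1, uint8_t c2, uint8_t c3) {
    assert(utf8_3byte_valid(c1, c2, c3));
    return ((uint32_t)(c1 & 0x0F) << 12) | ((uint32_t)(c2 & 0x3F) << 6) | (uint32_t)(c3 & 0x3F);
}
```

For example, the bytes E2 82 AC pass the check and decode to U+20AC, while E0 80 80 (overlong) and ED A0 80 (a surrogate) are rejected.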

Implement fast text conversion interface for HTML-ENTITIES

3bd69a6

Implement fast text conversion interface for QPrint

e055b42

For JIS/ISO-2022-JP, treat a truncated escape sequence as error

7d37ba6

Add assertions to help catch buffer overflows in mbstring text conversion code

38c4193

Use fast text conversion filters to implement php_mb_convert_encoding_ex

35e4768

Fix buffer overflow bug in HZ text conversion code

7d716b8

Fix buffer overflow bugs in CP50222 text conversion code

0eb000f

alexdowad force-pushed the cleanup-mbstring-20 branch from 4924c6c to 27d0125 Compare May 27, 2022 19:54

alexdowad force-pushed the cleanup-mbstring-20 branch from 27d0125 to 0eb000f Compare May 28, 2022 07:11

nikic reviewed May 28, 2022

ext/mbstring/mbstring.c

nikic reviewed May 28, 2022

ext/mbstring/mbstring.h

ext/mbstring/mbstring.c

nikic reviewed May 28, 2022

ext/mbstring/mbstring.c

nikic reviewed May 28, 2022

sapi/fuzzer/fuzzer-mbstring.c

alexdowad force-pushed the cleanup-mbstring-20 branch from eb71ae6 to 9aca533 Compare May 28, 2022 19:06

php_mb_convert_encoding{,_ex} returns zend_string

df05cde

That's what all existing callers want anyways. This avoids 2 unnecessary copies of the converted string.

alexdowad force-pushed the cleanup-mbstring-20 branch from 9aca533 to df05cde Compare May 28, 2022 19:11

nikic approved these changes May 28, 2022

alexdowad closed this May 28, 2022

alexdowad deleted the cleanup-mbstring-20 branch May 28, 2022 19:59
