-
Notifications
You must be signed in to change notification settings - Fork 7.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Major overhaul of mbstring (part 20) #8257
Conversation
161ee22
to
fff72ba
Compare
Just fixed one more bug. All tests are passing now. |
fff72ba
to
71c5558
Compare
Have you ever heard that old nugget of programmer's wisdom which says that if you find a bug somewhere, there is probably another similar one elsewhere in the same codebase? After seeing that the GitHub CI process caught one bug which had escaped my own testing process, that thought echoed in my ears. I decided to examine this PR again to see if there could be another instance of the same problem... and wouldn't you know it, there were not one but three in It so happened that it was a bit tricky to construct a test case which 'tickled' the bug, but I did so, and added it as a regression test case. |
71c5558
to
a7ed6f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I had a quick look at this PR, and at a glance can't see any obvious issues. I'm not sure whether I'm qualified to approve it though? In general, as you've added so many tests, I can't see a problem by just doing so though.
@derickr, thanks for the review! Nikita was last listed as the primary maintainer of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two (related) concerns:
- A number of commits mention fixes relative to the legacy implementation, but most of those don't seem to come with additional test coverage?
- I'm concerned that we're deviating in behavior between the legacy and fast implementation here. The problem is that the fast implementation is currently only used in some places (only encoding conversion?) so you may see different "interpretations" of the same string depending on which function you use and which encoding/decoding hooks that function happens to use.
@nikic Excellent review (as usual)! For the record, here are bugs I found in the old conversion code while fuzzing it to look for differences between the old and new code: • In some cases, when converting to UTF-7, just a bare "+" was emitted I can go through and add a few tests to cover these specific situations. Regarding the differences between the 'old' and 'new' conversion functions... when is the next release coming up? If there is still enough time, I could just finish the switchover within the available time. Otherwise, I could fix these bugs in the old conversion code, so we don't have differences between functions which use the old conversion code and those which use the new conversion code. |
By the way, I am just adding implementations of the 'fast' conversion interface for UUEncode and Base64... we want to eventually remove those from |
I don't think a schedule for PHP 8.2 is up yet, but I'd expect it to be about the same as https://wiki.php.net/todo/php81 plus one year. Probably makes sense to focus on switching to the new conversion functions and getting rid of the old ones entirely. |
Great! Thanks. |
Haven't yet added more tests as suggested by @nikic, but I have just added a 'fast' conversion filter for HTML-ENTITIES. |
a7ed6f6
to
62103bb
Compare
Have added more tests as suggested by @nikic, though they do not exhaustively cover every issue discovered by fuzzing. (Some of the issues were extremely obscure and I am finding it hard to reproduce them, since I didn't keep a record of the input strings which triggered the differing outputs.) Fast conversion filters for Base64, UUEncode, QPrint, and HTML-ENTITIES are included. That is all the text encodings supported by mbstring. Next step is to start using the faster conversion filters throughout. |
Test failure is spurious. (It's testing the effect of |
@nikic just saved my skin here by spotting a buffer overrun. This tells me that I need to do more testing of this code... or at least read though it all another time looking for any other possible buffer overruns. |
Looks like the CI build is broken. Same test failure again:
|
62103bb
to
e4ef64b
Compare
Well, this is interesting. @nikic's discovery of a buffer overrun in my UUEncode conversion code prompted me to add a38c7e5. Wouldn't you just know it... that immediately revealed another buffer overrun bug in my UTF-7/UTF7-IMAP code. Trying to write correct, non-trivial code is no joke! I am going to do some more personal code review, as well as some more fuzzing, of this PR. For now, all of @nikic's feedback has been addressed. |
(By the way, the UTF-7 conversion code with the buffer overrun did not go into any public release of PHP. So there is no need to release a patch for 8.1 or anything.) |
😅 As a tiny little stress test of my own code, I tried modifying That immediately revealed a bug in my code for SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, and SJIS-Mobile#SOFTBANK. Just need to add a regression test for that... |
e4ef64b
to
4924c6c
Compare
If you want to see the bugfix which was just added, I need to pound harder on this new code and see if I can shake more bugs out. I think it's time to break afl out and see if it can find anything. |
An overly complex boolean test was used to check if a 3-byte code unit was valid. Convert it to an equivalent test with fewer terms.
When testing the preceding commits, I used a script to generate a large number of random strings and try to find strings which would yield different outputs from the new and old encoding conversion code. Some were found. In most cases, analysis revealed that the new code was correct and the old code was not. In all cases where the new code was incorrect, regression tests were added. However, there may be some value in adding regression tests for cases where the old code was incorrect as well. That is done here. This does not cover every case where the new and old code yielded different results. Some of them were very obscure, and it is proving difficult even to reproduce them (since I did not keep a record of all the input strings which triggered the differing output).
After Nikita Popov found a buffer overrun bug in one of my pull requests, I was prompted to add more assertions in a38c7e5 to help me catch such bugs myself more easily in testing. Wouldn't you just know it... as soon as I added those assertions, the mbstring test suite caught another buffer overrun bug in my UTF-7 conversion code, which I wrote the better part of a year ago. Then, when I started fuzzing the code with libfuzzer, I found and fixed another buffer overflow: If we enter the main loop, which normally outputs 3 decoded Base64 characters, where the first half of a surrogate pair had appeared at the end of the previous run, but the second half does not appear on this run, we need to output one error marker. Then, at the end of the main loop, if the Base64 input ends at an unexpected position AND the last character was not a legal Base64-encoded character, we need to output two error markers for that. The three error markers plus two valid, decoded bytes can push us over the available space in our wchar buffer.
4924c6c
to
27d0125
Compare
All bugs which were found using |
I think you accidentally included more commits than intended, e.g. there's one marked WIP at the end. |
27d0125
to
0eb000f
Compare
You are absolutely right, my bad. Fixed that now. |
The one commit which is a bit new here is 35e4768. That was done so that |
eb71ae6
to
9aca533
Compare
As always, thanks to @nikic for the thorough code review. |
That's what all existing callers want anyways. This avoids 2 unnecessary copies of the converted string.
9aca533
to
df05cde
Compare
Merged. Thanks all for your help. |
We are almost reaching the point where the new, faster interface for converting text encodings in mbstring is implemented for all supported legacy text encodings. Actually, all that is left now is the non-encodings 'HTML-ENTITIES', UUEncode, Base64, and QPrint.
Aside from being faster, the new code in this PR does fix a number of bugs. As with the last couple of PRs, an automated test harness was used to generate vast numbers of random strings and find cases where the output of the new and old code was different. In close to 90% of such cases, a careful examination of the differences revealed that the old code was incorrect. The remaining ~10% were caused by bugs in the new code, which have been fixed.
FYA @nikic @cmb69