Major overhaul of mbstring (part 11) #7419
Conversation
…FFFFF Some text encodings supported by mbstring (such as UCS-4) accept 4-byte characters. When mbstring encounters an illegal byte sequence for the encoding it is using, it should emit an 'illegal character' marker, which can either be a single character like '?', an HTML hexadecimal entity, or a marker string like 'BAD+XXXX'. Because of the use of signed integers to hold 4-byte characters, illegal 4-byte sequences with a 'negative' value (one with the high bit set) were not handled correctly when emitting the illegal char marker. The result is that such illegal sequences were just skipped over (and the marker was not emitted to the output). Fix that.
After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.
There was a bit of legacy code here which looks like the original author of mbstring intended to allow conversion of Unicode Private Use Area codepoints to ISO-2022-JP-KDDI. However, that code never worked. It set the output variable to values which were not matched by any of the 'if' clauses below, which meant that nothing was actually emitted to the output. In other words, if one tried to convert Unicode to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints, they would be quietly 'swallowed' and disappear. I don't know what ISO-2022-JP-KDDI byte sequences the author wanted to map those PUA codepoints to, and anyway, this use case is so obscure that there is little point in worrying about it. However, it is better to remove the non-functioning code than to leave it in. This means that if one now tries to convert PUA codepoints to ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather than silently ignored.
Sigh. I included tests which were intended to check this case in the test suite for ISO-2022-JP-MS, but those tests were faulty and didn't actually test what they were supposed to. Fixing the tests revealed that there were still bugs in this area.
This is to match the way that we handle UCS-2. When a BOM is found at the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take note of the intended byte order and handle the string accordingly, but do NOT emit a BOM to the output. Rather, we just use the default byte order for the requested output encoding. Some might argue that if the input string used a BOM, and we are emitting output in a text encoding where both big-endian and little-endian byte orders are possible, we should include a BOM in the output string. To such hypothetical debaters of minutiae, I can only offer you a shoulder shrug. No reasonable program which handles UCS-2 and UCS-4 text should require a BOM. Really, the concept of the BOM is a poor idea and should not have been included in Unicode. Standardizing on a single byte order would have been much better, similar to 'network byte order' for the Internet Protocol. But this is not the place to speak at length of such things.
If you are looking for test data, https://github.com/web-platform-tests/wpt/tree/master/encoding has data for a handful of legacy encodings, though you'll need to pluck that data out of the HTML and JS files. The data is designed to test the WHATWG Encoding specification, so you may run into cases that don't match up if mbstring's implementation differs from what is described in that specification. The names of the HTML data files typically follow the pattern <encoding_name>_chars or <encoding_name>_errors. A few of the JS files have a small amount of data embedded in them, but most contain only test code.
Looks great, nice work!
The 'long' substitutions seem like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?
😮 😮 😮 You don't mind if we just drop support for the long substitutions??? That would really help to simplify the code. Since this library has ~20 years of history and many users, I am trying to be cautious to minimize the amount of breakage. But if you don't think anybody will care... it would definitely be nice to drop them. If I was creating the library from scratch, I would certainly never have added such a feature.
Based on the grep.app dataset, there's one use of And there's another use of I think we need to distinguish between two things here though: The part where a Unicode codepoint can't be encoded in the target encoding (the Not sure about the other WCSPLANEs, but the part that seems really unnecessary here is the case where we're trying to transfer raw illegal bytes from the input encoding into the output encoding using
What's the point of
The code point is valid, but cannot be encoded in the target encoding. |
OK. So we keep the What about
Just illegal_substchar I'd expect. Same as the fallback case for Though in any case, I think we should merge this PR first and then do any cleanup on top of that. Also worth noting that PHP-8.1 branches off tomorrow.
@nikic Merged as advised. If PHP 8.1 is branching off tomorrow... perhaps I can do some more work on this and submit another PR today. If we are eliminating the
Yeah, using -1 for this makes sense to me.
Test failure on Travis CI after merging this PR looks spurious.
Are you talking about
With this PR, there are (at least some) tests for all legacy text encodings supported by mbstring. BUT! I checked our test coverage of mbstring using gcov, and much to my dismay, there is still a lot of code in the library which is not executed at all by the existing tests.
Some of the non-tested code is actually dead code, but for most of it, we just need to keep adding more tests. To keep my backlog of unmerged commits from growing out of control, I have split off half of them and am submitting them here.
@nikic FYA...