Major overhaul of mbstring (part 11) #7419

alexdowad · 2021-08-28T09:19:06Z

With this PR, there are (at least some) tests for all legacy text encodings supported by mbstring. BUT! I checked our test coverage of mbstring using gcov, and much to my dismay, there is still a lot of code in the library which is not executed at all by the existing tests.

Some of the non-tested code is actually dead code, but for most of it, we just need to keep adding more tests. To avoid letting my backlog of unmerged commits from growing out of control, I have split off half of them and are submitting them here.

@nikic FYA...

…FFFFF Some text encodings supported by mbstring (such as UCS-4) accept 4-byte characters. When mbstring encounters an illegal byte sequence for the encoding it is using, it should emit an 'illegal character' marker, which can either be a single character like '?', an HTML hexadecimal entity, or a marker string like 'BAD+XXXX'. Because of the use of signed integers to hold 4-byte characters, illegal 4-byte sequences with a 'negative' value (one with the high bit set) were not handled correctly when emitting the illegal char marker. The result is that such illegal sequences were just skipped over (and the marker was not emitted to the output). Fix that.

After mb_substitute_character("long"), mbstring will respond to erroneous input by inserting 'long' error markers into the output. Depending on the situation, these error markers will either look like BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it converts to Unicode codepoints which cannot be represented in the output encoding), or an encoding-specific marker like JISX+XXXX or W932+XXXX. We have almost no tests for this feature. Add a bunch of tests to ensure that all our legacy encoding handlers work in a reasonable way when 'long' error markers are enabled.

There was a bit of legacy code here which looks like the original author of mbstring intended to allow conversion of Unicode Private Use Area codepoints to ISO-2022-JP-KDDI. However, that code never worked. It set the output variable to values which were not matched by any of the 'if' clauses below, which meant that nothing was actually emitted to the output. In other words, if one tried to convert Unicode to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints, they would be quietly 'swallowed' and disappear. I don't know what ISO-2022-JP-KDDI byte sequences the author wanted to map those PUA codepoints to, and anyways, this use case is so obscure that there is little point in worrying about it. However, it is better to remove the non-functioning code than to leave it in. This means that if now one tries to convert PUA codepoints to ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather than silently ignored.

Sigh. I included tests which were intended to check this case in the test suite for ISO-2022-JP-MS, but those tests were faulty and didn't actually test what they were supposed to. Fixing the tests revealed that there were still bugs in this area.

…ugs)

This is to match the way that we handle UCS-2. When a BOM is found at the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take note of the intended byte order and handle the string accordingly, but do NOT emit a BOM to the output. Rather, we just use the default byte order for the requested output encoding. Some might argue that if the input string used a BOM, and we are emitting output in a text encoding where both big-endian and little-endian byte orders are possible, we should include a BOM in the output string. To such hypothetical debaters of minutiae, I can only offer you a shoulder shrug. No reasonable program which handles UCS-2 and UCS-4 text should require a BOM. Really, the concept of the BOM is a poor idea and should not have been included in Unicode. Standardizing on a single byte order would have been much better, similar to 'network byte order' for the Internet Protocol. But this is not the place to speak at length of such things.

TRowbotham · 2021-08-28T14:18:50Z

If you are looking for test data, https://github.com/web-platform-tests/wpt/tree/master/encoding, has data for a handful of legacy encodings, though you'll need to pluck that data out of the html and js files. The data is designed to test the WHATWG Encoding specification, so you may run into cases that don't match up if mbstring's implementation is different than what is described in that specification.

For the most part the name of the html data files typically follow the pattern <encoding_name>_chars or <encoding_name>_errors. A few of the js files have a small amount of data embedded in them, but most contain only test code.

nikic

Looks great, nice work!

The long substitutions seems like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?

ext/mbstring/libmbfl/mbfl/mbfl_convert.c

ext/mbstring/tests/euc_tw_encoding.phpt

alexdowad · 2021-08-30T10:48:37Z

Looks great, nice work!

The long substitutions seems like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?

😮 😮 😮 You don't mind if we just drop support for the long substitutions??? That would really help to simplify the code.

Since this library has ~20 years of history and many users, I am trying to be cautious to minimize the amount of breakage. But if you don't think anybody will care... it would definitely be nice to drop them. If I was creating the library from scratch, I would certainly never have added such a feature.

nikic · 2021-08-30T12:24:27Z

Looks great, nice work!
The long substitutions seems like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?

open_mouth open_mouth open_mouth You don't mind if we just drop support for the long substitutions??? That would really help to simplify the code.

Since this library has ~20 years of history and many users, I am trying to be cautious to minimize the amount of breakage. But if you don't think anybody will care... it would definitely be nice to drop them. If I was creating the library from scratch, I would certainly never have added such a feature.

Based on the grep.app dataset, there's one use of long: https://github.com/exflickr/flamework/blob/b6b8b7523af3907fa43aace4900abbc86a5635ab/www/include/lib_sanitize.php#L279

And there's another use of entity that looks a bit more legitimate: https://github.com/esperecyan/url/blob/7514ea30f442cee6f270209320914600315f6ae5/src/lib/URLencoding.php#L194

I think we need to distinguish between two things here though: The part where a Unicode codepoint can't be encoded in the target encoding (the < MBFL_WCSGROUP_UCS4MAX case, also the only case handled by entity). I think that's relatively reasonable functionality (especially in the entity format), and it shouldn't really cost us anything to keep it, right?

Not sure about the other WCSPLANEs, but the part that seems really unnecessary here is the case where we're trying to transfer raw illegal bytes from the input encoding into the output encoding using BAD+ prefix. And that also looks like the primary source of complexity because we need to reconstruct the original byte sequence from the cached state. So I think it may be worthwhile to drop this part.

ext/mbstring/tests/big5_encoding.phpt

kamil-tekiela · 2021-08-30T12:41:58Z

What's the point of entity? It looks like it converts to an HTML entity. If it's an invalid code point then what good does it do to convert to HTML entity?

nikic · 2021-08-30T12:45:02Z

What's the point of entity? It looks like it converts to an HTML entity. If it's an invalid code point then what good does it do to convert to HTML entity?

The code point is valid, but cannot be encoded in the target encoding.

alexdowad · 2021-08-30T13:51:16Z

I think we need to distinguish between two things here though: The part where a Unicode codepoint can't be encoded in the target encoding (the < MBFL_WCSGROUP_UCS4MAX case, also the only case handled by entity). I think that's relatively reasonable functionality (especially in the entity format), and it shouldn't really cost us anything to keep it, right?

Not sure about the other WCSPLANEs, but the part that seems really unnecessary here is the case where we're trying to transfer raw illegal bytes from the input encoding into the output encoding using BAD+ prefix. And that also looks like the primary source of complexity because we need to reconstruct the original byte sequence from the cached state. So I think it may be worthwhile to drop this part.

OK. So we keep the U+XXXX long substitutions when the input text decodes to a valid Unicode codepoint, but the output encoding can't handle it.

What about JIS+XXXX and so on? If we don't want those, what should the 'error marker' be for input text which can't be decoded?

nikic · 2021-08-30T14:23:55Z

What about JIS+XXXX and so on? If we don't want those, what should the 'error marker' be for input text which can't be decoded?

Just illegal_substchar I'd expect. Same as the fallback case for entity.

Though in any case, I think we should merge this PR first and then do any cleanup on top of that. Also worth noting that PHP-8.1 branches off tomorrow.

alexdowad · 2021-08-30T14:50:36Z

@nikic Merged as advised.

If PHP 8.1 is branching off tomorrow... perhaps I can do some more work on this and submit another PR today.

If we are eliminating the BAD+XXXX output syntax, then there seems to be little reason to have different flags in the top bits of a wchar to indicate different kinds of bad input. Just a single 'bad input' value should be enough. Probably this would be -1. Agree?

nikic · 2021-08-30T14:52:12Z

Yeah, using -1 for this makes sense to me.

alexdowad · 2021-08-30T15:12:15Z

Test failure on Travis CI after merging this PR looks spurious.

kamil-tekiela · 2021-08-30T15:19:12Z

Are you talking about ext/date/tests/bug30096.phpt? This seems unrelated. I don't know why it would fail after your changes. The LDAP and SNMP tests fail all the time, don't worry about them.

alexdowad added 22 commits August 27, 2021 23:09

Declare ARMSCII-8 conversion functions as 'static'

02c0938

Simplify code for converting UHC to Unicode

284a550

Correct comment in mbfilter_ucs4.c

ea9e57c

Add more tests for ISO-2022-JP-KDDI text conversion

c936525

Add more tests for ISO-2022-JP-MS text conversion

e050283

Add more tests for ISO-2022-JP/JIS7/JIS8 text conversion

29db56e

Add more tests for SJIS text conversion

9f44207

Add more tests for SJIS-2004 text conversion

35bca82

Extraneous trailing UCS-4 bytes are treated as error

d555396

Remove useless variable: mbfl_encoding_utf8_kddi_a_aliases

5a0de3c

Add test suite for EUC-JP-WIN (or EUC-JP-MS) text encoding (and fix b…

3250812

…ugs)

Add test suite for mobile variants of UTF-8 (and fix bugs)

7e6f16a

Add tests for '7bit' and '8bit' text encodings in mbstring

54e9ba4

Add tests for UCS-2 text encoding

2a6b584

Add tests for UCS-4 text conversion

46d05b3

Add another test for UTF-16LE

6063024

Add more tests for UTF-8 text conversion

d3655b4

alexdowad force-pushed the cleanup-mbstring-11 branch from a10d214 to d3655b4 Compare August 28, 2021 09:23

nikic approved these changes Aug 30, 2021

View reviewed changes

ext/mbstring/libmbfl/mbfl/mbfl_convert.c Show resolved Hide resolved

ext/mbstring/tests/euc_tw_encoding.phpt Show resolved Hide resolved

nikic reviewed Aug 30, 2021

View reviewed changes

ext/mbstring/tests/big5_encoding.phpt Show resolved Hide resolved

alexdowad merged commit 15ba73c into php:master Aug 30, 2021

alexdowad mentioned this pull request Aug 30, 2021

Major overhaul of mbstring (part 12) #7430

Merged

alexdowad deleted the cleanup-mbstring-11 branch September 6, 2021 11:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Major overhaul of mbstring (part 11) #7419

Major overhaul of mbstring (part 11) #7419

Uh oh!

alexdowad commented Aug 28, 2021

Uh oh!

TRowbotham commented Aug 28, 2021

Uh oh!

nikic left a comment

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

Uh oh!

kamil-tekiela commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

kamil-tekiela commented Aug 30, 2021

Uh oh!

Uh oh!

Major overhaul of mbstring (part 11) #7419

Major overhaul of mbstring (part 11) #7419

Uh oh!

Conversation

alexdowad commented Aug 28, 2021

Uh oh!

TRowbotham commented Aug 28, 2021

Uh oh!

nikic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

Uh oh!

kamil-tekiela commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

nikic commented Aug 30, 2021

Uh oh!

alexdowad commented Aug 30, 2021

Uh oh!

kamil-tekiela commented Aug 30, 2021

Uh oh!

Uh oh!