Skip to content

Major overhaul of mbstring (part 11) #7419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Aug 30, 2021

Conversation

alexdowad
Copy link
Contributor

With this PR, there are (at least some) tests for all legacy text encodings supported by mbstring. BUT! I checked our test coverage of mbstring using gcov, and much to my dismay, there is still a lot of code in the library which is not executed at all by the existing tests.

Some of the non-tested code is actually dead code, but for most of it, we just need to keep adding more tests. To avoid letting my backlog of unmerged commits from growing out of control, I have split off half of them and are submitting them here.

@nikic FYA...

…FFFFF

Some text encodings supported by mbstring (such as UCS-4) accept 4-byte
characters. When mbstring encounters an illegal byte sequence for the
encoding it is using, it should emit an 'illegal character' marker,
which can either be a single character like '?', an HTML hexadecimal
entity, or a marker string like 'BAD+XXXX'.

Because of the use of signed integers to hold 4-byte characters,
illegal 4-byte sequences with a 'negative' value (one with the high
bit set) were not handled correctly when emitting the illegal char
marker. The result is that such illegal sequences were just skipped
over (and the marker was not emitted to the output). Fix that.
After mb_substitute_character("long"), mbstring will respond to
erroneous input by inserting 'long' error markers into the output.
Depending on the situation, these error markers will either look like
BAD+XXXX (for general bad input), U+XXXX (when the input is OK, but it
converts to Unicode codepoints which cannot be represented in the
output encoding), or an encoding-specific marker like JISX+XXXX or
W932+XXXX.

We have almost no tests for this feature. Add a bunch of tests to
ensure that all our legacy encoding handlers work in a reasonable
way when 'long' error markers are enabled.
There was a bit of legacy code here which looks like the original author
of mbstring intended to allow conversion of Unicode Private Use Area
codepoints to ISO-2022-JP-KDDI. However, that code never worked.
It set the output variable to values which were not matched by any
of the 'if' clauses below, which meant that nothing was actually
emitted to the output. In other words, if one tried to convert Unicode
to ISO-2022-JP-KDDI, and the Unicode string contained PUA codepoints,
they would be quietly 'swallowed' and disappear.

I don't know what ISO-2022-JP-KDDI byte sequences the author wanted
to map those PUA codepoints to, and anyways, this use case is so obscure
that there is little point in worrying about it. However, it is better
to remove the non-functioning code than to leave it in.

This means that if now one tries to convert PUA codepoints to
ISO-2022-JP-KDDI, those codepoints will be treated as erroneous rather
than silently ignored.
Sigh. I included tests which were intended to check this case in the
test suite for ISO-2022-JP-MS, but those tests were faulty and didn't
actually test what they were supposed to.

Fixing the tests revealed that there were still bugs in this area.
This is to match the way that we handle UCS-2. When a BOM is found at
the beginning of a 'UCS-2' string (NOT 'UCS-2BE' or 'UCS-2LE'), we take
note of the intended byte order and handle the string accordingly, but
do NOT emit a BOM to the output. Rather, we just use the default byte
order for the requested output encoding.

Some might argue that if the input string used a BOM, and we are
emitting output in a text encoding where both big-endian and
little-endian byte orders are possible, we should include a BOM in the
output string. To such hypothetical debaters of minutiae, I can only
offer you a shoulder shrug. No reasonable program which handles UCS-2
and UCS-4 text should require a BOM.

Really, the concept of the BOM is a poor idea and should not have been
included in Unicode. Standardizing on a single byte order would have
been much better, similar to 'network byte order' for the Internet
Protocol. But this is not the place to speak at length of such things.
@alexdowad alexdowad force-pushed the cleanup-mbstring-11 branch from a10d214 to d3655b4 Compare August 28, 2021 09:23
@TRowbotham
Copy link
Contributor

If you are looking for test data, https://github.com/web-platform-tests/wpt/tree/master/encoding, has data for a handful of legacy encodings, though you'll need to pluck that data out of the html and js files. The data is designed to test the WHATWG Encoding specification, so you may run into cases that don't match up if mbstring's implementation is different than what is described in that specification.

For the most part the name of the html data files typically follow the pattern <encoding_name>_chars or <encoding_name>_errors. A few of the js files have a small amount of data embedded in them, but most contain only test code.

Copy link
Member

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, nice work!

The long substitutions seems like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?

@alexdowad
Copy link
Contributor Author

Looks great, nice work!

The long substitutions seems like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?

😮 😮 😮 You don't mind if we just drop support for the long substitutions??? That would really help to simplify the code.

Since this library has ~20 years of history and many users, I am trying to be cautious to minimize the amount of breakage. But if you don't think anybody will care... it would definitely be nice to drop them. If I was creating the library from scratch, I would certainly never have added such a feature.

@nikic
Copy link
Member

nikic commented Aug 30, 2021

Looks great, nice work!
The long substitutions seems like more hassle than they're worth -- we might want to drop support for them. I guess they're just there as a debugging aid?

open_mouth open_mouth open_mouth You don't mind if we just drop support for the long substitutions??? That would really help to simplify the code.

Since this library has ~20 years of history and many users, I am trying to be cautious to minimize the amount of breakage. But if you don't think anybody will care... it would definitely be nice to drop them. If I was creating the library from scratch, I would certainly never have added such a feature.

Based on the grep.app dataset, there's one use of long: https://github.com/exflickr/flamework/blob/b6b8b7523af3907fa43aace4900abbc86a5635ab/www/include/lib_sanitize.php#L279

And there's another use of entity that looks a bit more legitimate: https://github.com/esperecyan/url/blob/7514ea30f442cee6f270209320914600315f6ae5/src/lib/URLencoding.php#L194

I think we need to distinguish between two things here though: The part where a Unicode codepoint can't be encoded in the target encoding (the < MBFL_WCSGROUP_UCS4MAX case, also the only case handled by entity). I think that's relatively reasonable functionality (especially in the entity format), and it shouldn't really cost us anything to keep it, right?

Not sure about the other WCSPLANEs, but the part that seems really unnecessary here is the case where we're trying to transfer raw illegal bytes from the input encoding into the output encoding using BAD+ prefix. And that also looks like the primary source of complexity because we need to reconstruct the original byte sequence from the cached state. So I think it may be worthwhile to drop this part.

@kamil-tekiela
Copy link
Member

What's the point of entity? It looks like it converts to an HTML entity. If it's an invalid code point then what good does it do to convert to HTML entity?

@nikic
Copy link
Member

nikic commented Aug 30, 2021

What's the point of entity? It looks like it converts to an HTML entity. If it's an invalid code point then what good does it do to convert to HTML entity?

The code point is valid, but cannot be encoded in the target encoding.

@alexdowad
Copy link
Contributor Author

I think we need to distinguish between two things here though: The part where a Unicode codepoint can't be encoded in the target encoding (the < MBFL_WCSGROUP_UCS4MAX case, also the only case handled by entity). I think that's relatively reasonable functionality (especially in the entity format), and it shouldn't really cost us anything to keep it, right?

Not sure about the other WCSPLANEs, but the part that seems really unnecessary here is the case where we're trying to transfer raw illegal bytes from the input encoding into the output encoding using BAD+ prefix. And that also looks like the primary source of complexity because we need to reconstruct the original byte sequence from the cached state. So I think it may be worthwhile to drop this part.

OK. So we keep the U+XXXX long substitutions when the input text decodes to a valid Unicode codepoint, but the output encoding can't handle it.

What about JIS+XXXX and so on? If we don't want those, what should the 'error marker' be for input text which can't be decoded?

@nikic
Copy link
Member

nikic commented Aug 30, 2021

What about JIS+XXXX and so on? If we don't want those, what should the 'error marker' be for input text which can't be decoded?

Just illegal_substchar I'd expect. Same as the fallback case for entity.

Though in any case, I think we should merge this PR first and then do any cleanup on top of that. Also worth noting that PHP-8.1 branches off tomorrow.

@alexdowad alexdowad merged commit 15ba73c into php:master Aug 30, 2021
@alexdowad
Copy link
Contributor Author

@nikic Merged as advised.

If PHP 8.1 is branching off tomorrow... perhaps I can do some more work on this and submit another PR today.

If we are eliminating the BAD+XXXX output syntax, then there seems to be little reason to have different flags in the top bits of a wchar to indicate different kinds of bad input. Just a single 'bad input' value should be enough. Probably this would be -1. Agree?

@nikic
Copy link
Member

nikic commented Aug 30, 2021

Yeah, using -1 for this makes sense to me.

@alexdowad
Copy link
Contributor Author

Test failure on Travis CI after merging this PR looks spurious.

@kamil-tekiela
Copy link
Member

Are you talking about ext/date/tests/bug30096.phpt? This seems unrelated. I don't know why it would fail after your changes. The LDAP and SNMP tests fail all the time, don't worry about them.

@alexdowad alexdowad deleted the cleanup-mbstring-11 branch September 6, 2021 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants