Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GH-9535: The behavior of mb_strcut in mbstring has been changed in PHP8.1 #9562

Closed
wants to merge 2 commits into from

Conversation

NathanFreeman
Copy link
Contributor

No description provided.

@NathanFreeman
Copy link
Contributor Author

cc @alexdowad
Hi!
Do you have time to review it?
Thanks.

@alexdowad
Copy link
Contributor

Hi, @NathanFreeman. Thanks very much for working on mbstring.

I am currently travelling and do not have a machine with me which is set up to build PHP, or I would pull your changes, build, and do some testing. In any case, the issue raised in #9535 is 100% a bug, and your changes do apparently fix it.

Our test suite for mb_strcut is obviously inadequate, or this bug would have been caught earlier. I would be more comfortable if, rather than just adding one test which covers this bug (but who knows how many other bugs are still there), the test suite was filled out to make it really complete. When I started working on mbstring, there were only a handful of tests for mb_strcut; I added a handful more, but they are not enough.

Best would be to add a test which generates several thousands (or 10,000s) of random strings, of different lengths, and in different encodings, runs them through mb_strcut using random indices, and checks to make sure that in all cases, the strings are cut at character boundaries (further, they should be the closest character boundaries on the left side of the requested cut points). I think you can use mb_str_split to find out where the character boundaries are; have a look at my existing test code in mb_strcut.phpt.

Set the RNG seed to a fixed value at the beginning of the test (with srand) so the results are consistent.

Does that sound like something which might be feasible? If not, please let me know. If you do it, it will be very interesting to see if any other bugs are found...

@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Sep 19, 2022

Hi, @alexdowad
BASE64, HTML-ENTITIES, Quoted-Printable, UTF-16, UTF-16BE, UTF-16LE, UTF-7, UTF7-IMAP, JIS, ISO-2022-JP, ISO-2022-JP-MS, GB18030, HZ, ISO-2022-KR, ISO-2022-JP-2004, ISO-2022-JP-MOBILE#KDDI, CP50220, CP50221, CP50222

I find that the above encodings will cause this bug. It seems that the used substitution character may not be supported by the target character encoding and use ? to indicate. So that is why the ? is output in the console.

So after executing the filter_flush function I need to check the device.buffer changes to find out the location of the illegal subcharacters in it and modify the value of the device.pos.

position = device.pos;
(*encoder->filter_flush)(encoder);
illegal_substchar = (unsigned char) encoder->illegal_substchar;

while(device.pos > position) {
       /* check illegal output */
	if (device.buffer[position++] == illegal_substchar) {
		device.pos = position - 1;
		break;
	}
}

@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Sep 19, 2022

Chinese Japanese and other languages are easier to trigger this bug after executing mb_convert_encoding .
I m not sure such modification is reasonable but it looks little different from the other PHP versions(PHP7 and PHP8.0).
https://3v4l.org/tL90V

@alexdowad
Copy link
Contributor

@NathanFreeman Thank you very much for taking the time to add more tests. I strongly suspected that more testing would reveal something interesting. (It usually does!)

As mentioned, I am on a trip. Yesterday I took a bit of time to install everything needed on the laptop which I am carrying so that I can build PHP. I am just contemplating whether I should try to squeeze some time to step through this adjusted code in a debugger and understand the issue and fix in more detail. It would be easier once I get back home.

In any case, your code does fix a bug and you have added a number of additional tests, which gives a bit more confidence that the fix doesn't break anything. I am thinking that I may approve the fix for now...

I would just kindly like to ask that you amend the commit log message for the commit which adds the new tests. Right now we have two commits which are both called "fix #9535".

@alexdowad
Copy link
Contributor

I guess this fix will be merged into PHP-8.1 first, then from there to PHP-8.2, then from there to master.

If @NathanFreeman can update the description for the 2nd commit as requested, I could merge, or else any other core developer could do it...

@NathanFreeman
Copy link
Contributor Author

Hi, @alexdowad
I have modified the description for the 2nd commit.

@alexdowad
Copy link
Contributor

Before I merge, do any others who are interested in mbstring want to say anything? @cmb69 @nikic @kamil-tekiela

@cmb69
Copy link
Contributor

cmb69 commented Sep 19, 2022

It would be easier once I get back home.

There is no need to hurry. :) Next 8.1/8.2 RCs are only scheduled for tagging on 2022-10-11.

Regarding the fix: could this cause issues if there are ?s in the string to cut? For instance, would https://3v4l.org/uhPsX show the same output?

@NathanFreeman
Copy link
Contributor Author

Hi, @cmb69
It shows the same output.
https://3v4l.org/4m5X4

My server.
微信图片_20220919215041

@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Sep 20, 2022

var_dump(mb_strcut('???', 0, 2, 'CP50220'));

It's a special case.
It will add valid ? to device.buffer , making the original judgment less reasonable.
Here are the new conditions of the judgment.

if ((encoder->status && ((encoder->status & 0xF) || (encoder->status == 0x11))) || encoder->cache) {
    illegal_substchar = (unsigned char) encoder->illegal_substchar;
    while (device.pos > position) {
        /* check illegal output */
        if (device.buffer[position++] == illegal_substchar) {
            device.pos = position - 1;
            break;
        }
    }
}

@alexdowad
Copy link
Contributor

Hi @NathanFreeman, and thank you again for diligently working on this issue.

I think this time I may actually build PHP with your patch and step through in a debugger to see how it works.

@alexdowad
Copy link
Contributor

Hmm. I have built the code and walked through it to see how the fix works. I'm afraid that we may need to do more to get a robust fix, which will resolve this problem without breaking anything else.

@NathanFreeman, before we go further,

  1. Do you understand how mbstring's concept of "conversion filters" works? Since you have got this far in working on a bug fix, I guess you do.
  2. Do you understand what mbstring's "flush functions" are for, when they are normally called, etc?
  3. Do you understand how mbstring signals errors in the input string, and how those are replaced by illegal_substchar, depending on the current error-handling mode?
  4. Do you understand the algorithm which the current implementation of mb_strcut is using?

@NathanFreeman
Copy link
Contributor Author

@alexdowad Hi.
Could you please tell me the answer?
I only know what the code does but don t know why the code does it😵

Answer 1: The filters will process characters one by one according to certain rules, and modify the filter properties.
Answer 2: I think the flush function is finally used to add some special characters to result and making the result more consistent with the target encoding rules.
Answer 3: if (encoder->status && ((encoder->status & 0xF) || (encoder->status == 0x11))) || encoder->cache is true, it mean There some error in the input string and the flush function will use illegal_substchar to replace it in a specific location (device.pos).
Answer 4: sorry, I dont know well about it.

@alexdowad
Copy link
Contributor

1 is true. You should also note that some filters convert bytes in an "input" text encoding to wchars, while others convert wchars to bytes in an "output" text encoding. Both types are represented by the same C struct, but some of the rules which they follow are a bit different.

A "wchar" is a Unicode codepoint, stored in a 32-bit or larger value (such as a C int).

To convert text from one encoding to another, we typically join together two filters, one to convert the input text to wchars, and another to convert the wchars to the desired destination encoding.

2: See

static int mbfl_filt_conv_wchar_cp50220_flush(mbfl_convert_filter *filter)
{
int mode = MBFL_HAN2ZEN_KATAKANA | MBFL_HAN2ZEN_GLUE;
if (filter->cache) {
int s = mb_convert_kana_codepoint(filter->cache, 0, NULL, NULL, mode);
filter->filter_function = mbfl_filt_conv_wchar_cp50221;
mbfl_filt_conv_wchar_cp50221(s, filter);
filter->filter_function = mbfl_filt_conv_wchar_cp50220;
filter->cache = 0;
}
return mbfl_filt_conv_any_jis_flush(filter);
}
as an example.

For some text encodings, it is necessary to look ahead to the next byte/wchar to know how to convert the current one. This is the case when converting wchars to CP50220, since some combinations of 2 codepoints are represented as a single "character" in CP50220. Since the conversion filters just receive one byte/wchar at a time, the only way they can compare with the following one is to store the current one (usually in filter->cache) and then perform the comparison on the next call.

However, what should then happen when we reach the end of the input string? After the last call to the filter function, there will still be a codepoint sitting in filter->cache. If we terminate the conversion operation just like that, one character will be missing from the end of the output string.

The answer is provided by the "flush functions". The flush functions are called at the end of a conversion operation, to "flush out" any cached data which is being held in filter->cache. Generally, when filters are joined into a pipeline, each flush function will call the next one in the pipeline.

(Since calling the flush functions is the last thing which is done before completing a conversion operation, it is also a good place to signal an error if the input string ended in an illegal state.)

Do you see why mb_strcut repeatedly attempts to call the flush functions for its conversion pipeline?

  1. I wasn't asking about your code, but about mbstring in general. Try looking through mbfilter_utf8.c. What does the input → wchar function (mbfl_filt_conv_utf8_wchar) do if it detects an error in the input? What about the wchar → output function (mbfl_filt_conv_wchar_utf8)?

  2. I will explain the algorithm used by mb_strcut once you understand the important points referenced above.

One more point... I have explained above what filter->cache is used for. But what about filter->status? If you look at mbfilter_utf8.c or mbfilter_utf16.c (or the conversion code for any other encoding which you are interested in), what is done with filter->status?

@NathanFreeman
Copy link
Contributor Author

Thank you for your reply.
I'll take some time to understand this knowledge.
I wrote it wrong. I meat encoder->status instead of filter->status.

@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Oct 8, 2022

@alexdowad @cmb69 Hi.
This PR is not perfect enough.
It is possible that both function filter_function and flush_function will add illegal_substchar to device multiple times during the conversion process.
Maybe we need to check illegal_substchar multiple times during the conversion process rather than after the conversion process and skip the process and restore the state if found it.

@alexdowad
Copy link
Contributor

Hi @NathanFreeman. I guess from your latest message that you are starting to understand more about how mbstring works?

I think right now our concern is only about flush_functions. If the input string is actually invalid, then yes, the filter_functions will emit error markers, and that is expected. I think it would be a mistake to remove them.

As far as the flush_functions, you are very right that this PR is not perfect enough yet. Don't worry, we will get there.

We need you to understand mbstring a bit more, then I think it will become clear what needs to be done. (Hint: there are better ways of suppressing error output from the flush_functions rather than checking for the illegal_substchar.)

Please let me know when you are ready to respond to my questions above.

@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Oct 8, 2022

@alexdowad Hi.
I found:
mbfl_filt_conv_utf8_wchar will call mbfl_filt_conv_wchar_utf8 (decoder->filter_function, encoder->output_function) if it detects an error in the input and mbfl_filt_conv_wchar_utf8 will call mbfl_filt_conv_illegal_output and mbfl_memory_device_output to write illegal_substchar into device directly if it detects an error in the input.

In shorts, The encoder let the decoder to check for the illegal characters.

I read the details of mbfl_filt_conv_illegal_output, and maybe we can set decoder->illegal_mode = MBFL_OUTPUTFILTER_ILLEGAL_MODE_NONE; to suppressing error output from the flush_functions.

@alexdowad
Copy link
Contributor

@NathanFreeman 👍🏻 👍🏻 👍🏻 Sounds like a great idea.

I suggest you set decoder->illegal_mode to MBFL_OUTPUTFILTER_ILLEGAL_MODE_NONE before calling a flush function and then set it back to the original value afterwards. However, in cases where mbfl_strcut saves the encoder/decoder state and then restores it, illegal_mode should also be restored, so in such places there's no need to explicitly restore the state.

@NathanFreeman
Copy link
Contributor Author

@alexdowad Hi.
I have pushed a new commit.
Only the last flush function did I set decoder->illegal_mode to MBFL_OUTPUTFILTER_ILLEGAL_MODE_NONE.
Because I found that if all the flush function have set decoder->illegal_mode to MBFL_OUTPUTFILTER_ILLEGAL_MODE_NONE, the result is different.

@alexdowad
Copy link
Contributor

I'm happy with this. 👍 Thanks very much!

@NathanFreeman Could you squash this series of 7 commits into one commit?

😅 And I guess the next question is... are you interested in working on mbstring some more? If so, I can give you another issue to work on.

find the position of illegal subcharacters and modify the value of
device.pos

add ((encoder->status && ((encoder->status & 0xF) || (encoder->status == 0x11))) || encoder->cache) condition.

fix test

fix test

delay initialization parameter illegal_substchar

optimize code
@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Oct 10, 2022

@alexdowad Hi.
I have squashed this series of 7 commits into one commit but I dont konw why all ci are failed.
Maybe I need to merge other commits from PHP-8.1.
You can give another issue, I think I can try to fix it.

@alexdowad
Copy link
Contributor

@NathanFreeman I have just built this commit and ran the tests locally, I am getting a failure in the new test as follows:

--
     UTF-16BE: 宛如繁星般宛如
     UTF-16LE: 宛如繁星般宛如
     UTF-7: 宛如繁星
009+ JIS: 宛如繁星
010+ ISO-2022-JP: 宛如繁星
009- JIS: 宛如繁星般
010- ISO-2022-JP: 宛如繁星般
     ISO-2022-JP-MS: 宛如繁星
     GB18030: 宛如繁星般宛如
     HZ: 宛如繁星般
--
     GB18030: 星のように月のように
     HZ: 星のように月のよ
     ISO-2022-KR: 星のように月の
035+ ISO-2022-JP-2004: 星のように月
035- ISO-2022-JP-2004: 星のように月の
     ISO-2022-JP-MOBILE#KDDI: 星のように月の
     CP50220: 星のように月の
     CP50221: 星のように月の
--%                         

I analyzed the reason for these failures. The expected test results which you put in gh9535.phpt do appear reasonable. If we were to cut out $bytes_length bytes from the input strings, and take everything up to the last character boundary in that range, we would indeed get the results which you put in the test file. (Good work!)

The problem is that mbfl_strcut does something a little bit different from what one might reasonably expect. This is one of the relevant lines:

if (device.pos > length) {

As well as this one:

if (device.pos > length) {

Notice that when it is deciding whether or not to continue consuming more bytes from the input string, mbfl_strcut does this by comparing the number of requested bytes to the length of the string in the output buffer. If the length of the output exceeds the number of requested bytes, then it backs up to the previously saved state and uses it to generate the return value.

This is problematic for a couple of reasons. For one thing, if there are invalid byte sequences in the middle of the requested range, those will be converted to error markers, and the error markers may have a different byte length, which would cause mbfl_strcut to either go further or not as far in the input string as requested.

More to the point in this case, some of the flush functions emit escape sequences when ending a converted string. mbfl_strcut is counting these added escape sequences against the requested byte length, which is making it not go as far in the input string as it should.

For JIS/ISO-2022-JP, this is the relevant bit:

/* back to latin */
if ((filter->status & 0xff00) != 0) {
CK((*filter->output_function)(0x1b, filter->data)); /* ESC */
CK((*filter->output_function)(0x28, filter->data)); /* '(' */
CK((*filter->output_function)(0x42, filter->data)); /* 'B' */
}

For ISO-2022-JP-2004, it's here:

/* If we had switched to a different charset, go back to ASCII mode
* This makes it possible to concatenate arbitrary valid strings
* together and get a valid string */
if (filter->status & 0xff00) {
CK((*filter->output_function)(0x1b, filter->data)); /* ESC */
CK((*filter->output_function)('(', filter->data));
CK((*filter->output_function)('B', filter->data));
}

Hopefully the comment in the 2nd one makes it clear why this is done. (If it doesn't make sense, I will explain more.)

Historical note: The added escape sequence at the end of a JIS/ISO-2022-JP string was part of the original code from libmbfl, but the one for ISO-2022-JP-2004 was added by me.

While we want these ending escape sequences to appear in the output, we do not want their byte length to be subtracted from the size of the extracted portion.

It seems to me that the condition device.pos > length is really just wrong and inherently unreliable. I can think of at least three other ways we can detect where the extracted portion should end, and all of them are more reliable than this.

Since @NathanFreeman is now setting the illegal_mode to NONE for the final "flush" operation, one very attractive option might be... completely do away with this (silly) business of consuming bytes one by one, saving the state of the filter, and restoring it. We could just consume all the bytes up to the requested end point, and rely on the new illegal_mode to avoid an extra error marker appearing at the end of the string if we happen to cut it in the middle of a multi-byte character.

I'm thinking about this and it seems that it should work. Is there anything I'm missing?

@alexdowad
Copy link
Contributor

By the way, @NathanFreeman... since this is a separate issue from the one you are trying to fix in this PR...

Another option would be that you put comments in the implementation code and test code explaining this issue, and we leave it to be resolved in another PR. It is a pre-existing bug and has nothing to do with your change.

You could set the new test to XFAIL to make CI pass.

@cmb69
Copy link
Contributor

cmb69 commented Oct 10, 2022

You could set the new test to XFAIL to make CI pass.

Maybe it's better to split the test case into two test cases, and only mark the failing test as XFAIL.

@alexdowad
Copy link
Contributor

Hmm, update here.

I tried implementing the idea mentioned above. After examining the inputs/outputs of some test cases byte-by-byte, I can say that the output seems more logical than before. As a side benefit, about 150 lines of very tricky (and brittle) code are eliminated. However, the output is significantly different for all encodings which have selectable 'modes', including HZ, all the ISO-2022 encodings, and CP5022{0,1,2}.

Output is also significantly different for UTF-7 and UTF-7-IMAP. This is because in UTF-7, all non-ASCII characters are Base64-encoded, and Base64-encoding automatically pads the output to a multiple of 4 bytes. Since the legacy implementation of mbfl_strcut looks at the output length to determine when it has consumed enough of the input, this means that historically, it has systemically produced truncated output for UTF-7 and UTF-7-IMAP.

The output is also changed for GB18030, which seems odd, since GB18030 does not include anything like escape codes, nor does it pad its output to a certain byte length. I haven't dug in to see what the cause of the change for GB18030 is.

Therefore, while this simplified implementation might arguably be considered more "correct", it would be a significant BC break. I can just imagine the complaints from users whose existing code would be broken.

I am wondering if we should just define the existing behavior (i.e. deducting the length of an ending escape sequence from the number of bytes which were requested from the input string) to be "correct" and stick to it.

If we do that, then @NathanFreeman could just adjust the expected values for the failing test cases so that the test passes.

@NathanFreeman, another observation on your code: on line 1284, you restore the previous illegal_mode for decoder... but this is after we are finished using decoder, when it is just about to be de-allocated. Since decoder is not used again after the final flush, you don't need to restore illegal_mode. So you don't need to save the original illegal_mode either.

@cmb69 @Girgias @nikic Any comments on whether it is a good idea to define the historical behavior of mb_strcut to be "correct" to avoid BC breaks, although it doesn't seem to make sense in certain cases?

@alexdowad
Copy link
Contributor

alexdowad commented Oct 10, 2022

@NathanFreeman If you are available to work on another mbstring issue, it would be helpful if you could resolve this one:

The Unicode standard includes special casing (uppercase, lowercase, etc) rules for Greek, Lithuanian, Turkish, and Azeri. (https://unicode.org/Public/UNIDATA/SpecialCasing.txt) We do not implement any of these 'conditional mapping' rules right now. The most important conditional mapping which we would like to implement is the special case for the Greek letter sigma.

This should only be done on master (so it goes into the next major release of PHP).

If you look in php_unicode.c, you will see that it implements various case conversion 'modes'. We do not want any of the _SIMPLE modes to implement the conditional case mappings from SpecialCasing.txt. The special rule for Greek sigma should only be for PHP_UNICODE_CASE_LOWER mode.

@Girgias
Copy link
Member

Girgias commented Oct 11, 2022

It might make sense to consider this as a bug fix, but due to the BC only land this change in PHP 8.2 and master? (as we're still in an RC I think it still makes sense to patch 8.2) as producing a string which is not valid within the encoding sounds pretty bad to me...

@alexdowad
Copy link
Contributor

@Girgias for clarity, the current implementation of mb_strcut does not produce strings which are not valid for their encoding. The issue here is that the portion which is extracted from the input string is shorter than what one would probably expect.

For example, if you indicate that you want to extract 15 bytes from the input string, mb_strcut makes sure that the return value is no longer than 15 bytes. However, this does not mean that it actually encodes all the characters which are contained in the 15-byte range which you wanted to extract. If the return value contains added escape sequences, then those will take up some of the 15 byte limit, and the number of characters actually encoded will be reduced accordingly.

@NathanFreeman
Copy link
Contributor Author

NathanFreeman commented Oct 11, 2022

@NathanFreeman I have just built this commit and ran the tests locally, I am getting a failure in the new test as follows:

@alexdowad Hi.
Thanks for your careful instructions.
filter_flush function will add additional string to process the input and perhaps the actual length is greater than the request length so that you have to discard this part of the input.

For ISO-2022-JP-2004.
I think the point of this is that we need to add extra characters to make the return result more reasonable.
I can only think of that.

/* If we had switched to a different charset, go back to ASCII mode
* This makes it possible to concatenate arbitrary valid strings
* together and get a valid string */
if (filter->status & 0xff00) {
CK((*filter->output_function)(0x1b, filter->data)); /* ESC */
CK((*filter->output_function)('(', filter->data));
CK((*filter->output_function)('B', filter->data));
}

@NathanFreeman
Copy link
Contributor Author

By the way, @NathanFreeman... since this is a separate issue from the one you are trying to fix in this PR...

Another option would be that you put comments in the implementation code and test code explaining this issue, and we leave it to be resolved in another PR. It is a pre-existing bug and has nothing to do with your change.

You could set the new test to XFAIL to make CI pass.

I have modified my test by following your suggestions.

@NathanFreeman
Copy link
Contributor Author

@NathanFreeman, another observation on your code: on line 1284, you restore the previous illegal_mode for decoder... but this is after we are finished using decoder, when it is just about to be de-allocated. Since decoder is not used again after the final flush, you don't need to restore illegal_mode. So you don't need to save the original illegal_mode either.

I have removed the unuseful code. Thanks.

@cmb69
Copy link
Contributor

cmb69 commented Oct 11, 2022

Any comments on whether it is a good idea to define the historical behavior of mb_strcut to be "correct" to avoid BC breaks, although it doesn't seem to make sense in certain cases?

I don't think it is a good idea to keep the historical behavior for ever, if it sometimes doesn't make sense. Now, if we assume that some users rely on the current behavior, we should consider a deprecation phase (if feasible) telling users that a certain mb_strcut() will change its result, and than actually change the behavior in the next major version.

@alexdowad
Copy link
Contributor

@NathanFreeman I would like to apologize that it took a very long time to merge this bug fix. Thanks for working on it.

When merging, I squashed commits and edited the commit log message to make it more informative. I also updated NEWS and credited you for the fix there.

Further, I also split the passing/failing tests into two different test files, as suggested by @cmb69, and put XFAIL only on the failing ones.

Then merged up into PHP-8.2 and master.

@alexdowad alexdowad closed this Nov 13, 2022
@alexdowad
Copy link
Contributor

Any comments on whether it is a good idea to define the historical behavior of mb_strcut to be "correct" to avoid BC breaks, although it doesn't seem to make sense in certain cases?

I don't think it is a good idea to keep the historical behavior for ever, if it sometimes doesn't make sense. Now, if we assume that some users rely on the current behavior, we should consider a deprecation phase (if feasible) telling users that a certain mb_strcut() will change its result, and than actually change the behavior in the next major version.

OK. Not sure when this will happen.

But definitely there are cases where the current (legacy) behavior of mb_strcut does not make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants