Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_detect_encoding(): wrong results with null $encodings #9008

Closed
machitgarha opened this issue Jul 13, 2022 · 11 comments
Closed

mb_detect_encoding(): wrong results with null $encodings #9008

machitgarha opened this issue Jul 13, 2022 · 11 comments

Comments

@machitgarha
Copy link

machitgarha commented Jul 13, 2022

Description

First of all, a special thanks to the progress made by all developers!

Calling mb_detect_encoding() with the second argument being null may return a really strange result, while changing it to mb_detect_order() fixes the issue. Based on the official PHP documentation, they must behave the same.

I tried something like four hours, but cannot figure out what the problem exactly is. Even I cannot create a small re-producable example; maybe the problem is environment- or context-dependent.

Steps to Reproduce

  1. Clone Serenata and navigate to its directory. Checkout the commit d3c9dcb3426a9b5ffe442a436e2179063ea6c9d7 (current master).
  2. Run composer install.
  3. Run ./vendor/bin/phpunit.

You should see there are lots of failures (i.e. 346). So what the hell? Wait, start editing the file src/Analysis/SourceCodeReading/TextToUtf8Converter.php, go to line 15 and change it from

        $encoding = mb_detect_encoding($code, null, true);

to

        $encoding = mb_detect_encoding($code, mb_detect_order(), true);

. Now re-run ./vendor/bin/phpunit. With this simple change, the issue is (almost) fixed and only one failure remains.

I'm pretty sure that this is a new bug introduced in PHP 8.1. I compiled both PHP 8.0.19 and 8.1.8 the same way, with default configurations, and ran PHPUnit under both.

Under PHP 8.0, passing both values as the second argument works perfectly, and ALL tests pass. Under PHP 8.1, I expect $encoding to hold either ASCII or UTF-8 (or maybe even false), but it strangely equals to UUENCODE. Also, the one remaining failed test after the change is also related to mb_detect_encoding(), and I expect the same, but it randomly returns UUENCODE or UTF-7, causing the + signs of the string to disappear after mb_convert_encoding() (line 22 of the file).

PHP Version

PHP 8.1.7, PHP 8.1.8

Operating System

Fedora Workstation 36

@machitgarha
Copy link
Author

machitgarha commented Jul 13, 2022

I don't know if it helps or not, but after installing Composer dependencies, the point mb_detect_encoding() starts to misbehave resides in vendor/phpunit/phpunit/src/TextUI/Command.php, line 563 (i.e. FileLoader::checkAndLoad($filename)). Try putting the following code snippet before and after that line:

echo mb_detect_encoding("Oops!") . PHP_EOL;

Then run ./vendor/bin/phpunit on a failing test, e.g.

./vendor/bin/phpunit --filter testRetrievesAllClasses tests/Integration/Analysis/ClasslikeListProviderTest.php

The output on my system is:

ASCII
UUENCODE
...

Although passing mb_detect_encoding() as the second parameter fixes the issue, but it's not always the case. Try the doing the same thing but with the following code snippet:

echo mb_detect_encoding("+") . PHP_EOL;

Outputs:

ASCII
BASE64

Amazingly enough, adding mb_detect_order() produces another wrong output:

ASCII
UTF-7

If you try this:

var_dump(mb_detect_order());

, you'll see that its size increases from 2 (containing only "ASCII" and "UTF-8") to 80. Amazing!

@cmb69
Copy link
Member

cmb69 commented Jul 14, 2022

you'll see that its size increases from 2 (containing only "ASCII" and "UTF-8") to 80. Amazing!

Indeed: https://gitlab.com/Serenata/Serenata/-/blob/master/src/Bootstrap.php#L55-61

@machitgarha
Copy link
Author

machitgarha commented Jul 14, 2022

@cmb69, nice catch. 👏 So maybe the bug actually is, the order of the encodings passed to mb_detect_encoding() (sometimes) is not regarded; right?

@cmb69
Copy link
Member

cmb69 commented Jul 18, 2022

Simple reproducer: https://3v4l.org/arIJ5.

The difference is because the second mb_detect_encoding() call calls remove_non_encodings_from_elist(), but the first does not.

static void remove_non_encodings_from_elist(const mbfl_encoding **elist, size_t *size)
{
/* mbstring supports some 'text encodings' which aren't really text encodings
* at all, but really 'byte encodings', like Base64, QPrint, and so on.
* These should never be returned by `mb_detect_encoding`. */
int shift = 0;
for (int i = 0; i < *size; i++) {
const mbfl_encoding *encoding = elist[i];
if (encoding->no_encoding <= mbfl_no_encoding_charset_min) {
shift++; /* Remove this encoding from the list */
} else if (shift) {
elist[i - shift] = encoding;
}
}
*size -= shift;
}

@alexdowad, shouldn't we call remove_non_encodings_from_elist() also if null is passed as $encodings (even though we would need to duplicate the elist in that case)? Or has this been done deliberately, and we "just" need to document the change?

@cmb69 cmb69 removed their assignment Jul 18, 2022
@alexdowad
Copy link
Contributor

@cmb69, what path through the code is taken if $encodings is null? Isn't it this one?

from_encodings = &MBSTRG(current_internal_encoding);
num_from_encodings = 1;
free_from_encodings = 0;

@cmb69
Copy link
Member

cmb69 commented Jul 18, 2022

@alexdowad, this is not about mb_convert_encoding(), but rather about mb_detect_encoding(). So in PHP-8.1, it is

} else {
elist = MBSTRG(current_detect_order_list);
size = MBSTRG(current_detect_order_list_size);
free_elist = 0;
}

And since free_elist = 0, the following then-clause is not executed:

if (free_elist) {
remove_non_encodings_from_elist(elist, &size);
if (size == 0) {
efree(ZEND_VOIDP(elist));
RETURN_FALSE;
}
}

However, if encodings is not null, remove_non_encodings_from_elist() is called, so there is different behavior, if someone added some non-text encodings to detect_order. And while it doesn't make sense to add non-text encodings to detect_order, some do that; probably not that uncommon: https://gitlab.com/Serenata/Serenata/-/blob/master/src/Bootstrap.php#L55-61

@alexdowad
Copy link
Contributor

Ah, I see.

One option is to automatically filter out non-text encodings from current_detect_order_list whenever it's set. Do you think that's a good or bad idea?

@alexdowad
Copy link
Contributor

Thanks to @machitgarha for finding this problem, by the way.

@cmb69
Copy link
Member

cmb69 commented Jul 18, 2022

One option is to automatically filter out non-text encodings from current_detect_order_list whenever it's set.

Ah, right. That may confuse users though (mb_detect_order($encodings); $encodings != mb_detect_order()). OTOH, this is something we can document.

@alexdowad
Copy link
Contributor

Maybe it's better to just copy elist to avoid a possible BC break.

The non-text encodings will go away in a future release of PHP anyways.

@cmb69 cmb69 self-assigned this Jul 20, 2022
cmb69 added a commit to cmb69/php-src that referenced this issue Jul 20, 2022
Passing `null` to `$encodings` is supposed to behave like passing the
result of `mb_detect_order()`.  Therefore, we need to remove the non-
encodings from the `elist` in this case as well.  Thus, we duplicate
the global `elist`, so we can modify it.
@cmb69 cmb69 changed the title mb_detect_encoding() sometimes not working with null as its second argument mb_detect_encoding(): wrong results with null $encodings Jul 20, 2022
@cmb69 cmb69 closed this as completed in c2bdaa4 Jul 20, 2022
cmb69 added a commit that referenced this issue Jul 20, 2022
* PHP-8.1:
  Fix GH-9008: mb_detect_encoding(): wrong results with null $encodings
@machitgarha
Copy link
Author

@cmb69 and @alexdowad, thank you for fixing the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants