mb_trim() inaccurate $characters default value #13815
Comments
Looks like a bug, but it's not with the default character list. It's because passing characters and omitting them work differently, and that should not be the case. The relevant functions are all near each other: php-src/ext/mbstring/mbstring.c, lines 3058 to 3146 in 4d51bfa.
These two pairs of outputs should be the same:
$input_utf8 = "\u{3000}abc\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, encoding: "UTF-8"))); // 3
$trimable_utf8 = "\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, $trimable_utf8, "UTF-8"))); // 3
//
$input_sjis = mb_convert_encoding($input_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, encoding: "Shift_JIS"))); // 7
$trimable_sjis = mb_convert_encoding($trimable_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, $trimable_sjis, "Shift_JIS"))); // 3
Yes.
Yes. The literal array in .c is already in wide (Unicode) form.
By changing the signature to
And it looks like what the code intended.
Whether the parameter's default value is a string or is null is not the problem. The problem is that this mb_trim($input, /* $characters is "\x20\x0C...", */ encoding: "Shift_JIS") and this mb_trim($input, "\x20\x0C...", "Shift_JIS") behave differently. That is what should be fixed, and there are multiple ways it could be fixed. CC @youkidearitai @alexdowad @nielsdos, who were part of #12459, where I see some brief conversation that touched on the subject of omitting the $characters list.
If we change the default parameter to
@ranvis What should we do to solve this issue? I don't understand what "inaccurate" refers to here, or what the correct behavior would be (my English is poor).
The problem is that the stub file is UTF-8, so the Unicode characters in the default value are encoded as UTF-8 too. When a different character encoding is used, the encoding of the default list no longer matches the encoding of the input. Changing the argument type would indeed fix this.
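To make the mismatch concrete, here is a minimal sketch, assuming the mbstring extension is loaded. It only compares the byte sequences of one default character (U+3000, taken from the example earlier in the thread) in the two encodings; it does not call mb_trim() itself.

```php
<?php
// U+3000 IDEOGRAPHIC SPACE as it appears in a UTF-8 source or stub file.
$utf8 = "\u{3000}";

// The same code point expressed in Shift_JIS.
$sjis = mb_convert_encoding($utf8, 'Shift_JIS', 'UTF-8');

echo bin2hex($utf8), "\n"; // e38080
echo bin2hex($sjis), "\n"; // 8140

// Same character, different bytes: a UTF-8 default list cannot be
// interpreted correctly when it is read as Shift_JIS.
var_dump($utf8 === $sjis); // bool(false)
```

A default list stored as UTF-8 bytes is therefore only meaningful when $encoding is UTF-8.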
Thank you @nielsdos for describing it concisely :)
@youkidearitai
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3139,8 +3139,10 @@ static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode)
 	}

 	if (what) {
+		puts("mb_trim_what_chars()");
 		RETURN_STR(mb_trim_what_chars(str, what, mode, enc));
 	} else {
+		puts("mb_trim_default_chars()");
 		RETURN_STR(mb_trim_default_chars(str, mode, enc));
 	}
 }
<?php
echo "single argument: ";
mb_strlen(mb_trim("\u{3000}"));
echo "named argument: ";
mb_strlen(mb_trim("\u{3000}", encoding: 'UTF-8'));
This will print:
So, if a user calls the function using a named argument without the
Because the default characters are defined in the stub file, and the stub file is UTF-8 (typically), the characters are encoded in the string as UTF-8. When using a different character encoding, there is a mismatch between what mb_trim expects and the UTF-8 encoded string it gets. One way of solving this is by making the characters argument nullable, which would mean that it always uses the internal code path that has the unicode codepoints that are defaulted actually stored as codepoint numbers instead of in a string. Co-authored-by: @ranvis
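The idea of a nullable default that falls back to code point numbers can be sketched in userland PHP. This is only an illustration, not the mbstring implementation and not the code in the PR: trim_by_codepoint() is a hypothetical helper, its default set is a shortened subset chosen for the example, and it swaps in a different technique (convert to UTF-8, trim with a regex, convert back), which assumes the input round-trips through UTF-8 without loss.

```php
<?php
// Hypothetical helper for illustration only; not part of mbstring.
function trim_by_codepoint(string $str, ?string $characters = null, string $encoding = 'UTF-8'): string
{
    // Shortened, illustrative default set stored as code point numbers
    // (an assumption for this sketch, not the full list mb_trim() uses).
    $defaults = [0x20, 0x09, 0x0A, 0x0D, 0x00, 0x0B, 0x3000];

    // Work in UTF-8 so the regex engine compares code points, not raw bytes.
    $utf8 = mb_convert_encoding($str, 'UTF-8', $encoding);

    if ($characters === null) {
        // Default path: build the character class from code point numbers.
        $set = '';
        foreach ($defaults as $cp) {
            $set .= sprintf('\x{%X}', $cp);
        }
    } else {
        // Explicit path: the caller's list is given in $encoding.
        $set = preg_quote(mb_convert_encoding($characters, 'UTF-8', $encoding), '/');
    }

    if ($set === '') {
        return $str; // nothing to trim
    }

    $trimmed = preg_replace('/\A[' . $set . ']+|[' . $set . ']+\z/u', '', $utf8);

    return mb_convert_encoding($trimmed, $encoding, 'UTF-8');
}

// Both calls should agree, regardless of the encoding in use.
$input = mb_convert_encoding("\u{3000}abc\u{3000}", 'Shift_JIS', 'UTF-8');
$chars = mb_convert_encoding("\u{3000}", 'Shift_JIS', 'UTF-8');
var_dump(trim_by_codepoint($input, null, 'Shift_JIS'));
var_dump(trim_by_codepoint($input, $chars, 'Shift_JIS'));
```

Because the default set never exists as an encoded string, the omitted-argument path and the explicit-argument path cannot disagree about how the list is encoded.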
I've made a PoC PR to fix this with the proposed solution (i.e. making the argument null by default): #13820. Maybe there are other, nicer solutions possible.
Ah, I got it. There is also the problem in #13789. I think #13820, posted by @nielsdos, makes sense.
OK. I'll try to keep an eye on it. I probably should have focused on providing examples.
mb_internal_encoding('Shift_JIS');
$str = mb_convert_encoding('俄には信じ難い?', 'Shift_JIS', 'UTF-8');
var_dump(mb_convert_encoding(mb_trim($str, encoding: 'Shift_JIS'), 'UTF-8'));
// string(19) "には信じ難い?"
Thanks all.
Description
The default values for the parameter $characters of the new mb_trim functions are not accurate.
When the very same value as the default is passed explicitly to $characters, like the code below, mb_trim() does not necessarily work the same way, because $characters also depends on $encoding.
The parameter should be typed as ?string $characters = null instead.
PHP Version
PHP dev-master
Operating System
No response