Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_trim() inaccurate $characters default value #13815

Closed
ranvis opened this issue Mar 26, 2024 · 10 comments
Closed

mb_trim() inaccurate $characters default value #13815

ranvis opened this issue Mar 26, 2024 · 10 comments

Comments

@ranvis
Copy link
Contributor

ranvis commented Mar 26, 2024

Description

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}...\u{180E}", ?string $encoding = null): string {}

The default values for the parameter $characters of the new mb_trim functions are not accurate.
When the very same value as the default is implied to $characters like the code below, mb_trim() does not necessarily work the same way, because $characters also depends on $encoding.
The parameter should be typed as ?string $characters = null instead.

<?php
$strUtf8 = "\u{3042}\u{3000}"; // U+3000: fullwidth space
var_dump(mb_strlen(mb_trim($strUtf8))); // 1
var_dump(mb_strlen(mb_trim($strUtf8, encoding: 'UTF-8'))); // 1

mb_internal_encoding('Shift_JIS');
$strSjis = mb_convert_encoding($strUtf8, 'Shift_JIS', 'UTF-8');
var_dump(mb_strlen(mb_trim($strSjis))); // 1
var_dump(mb_strlen(mb_trim($strSjis, encoding: 'Shift_JIS'))); // 2

PHP Version

PHP dev-master

Operating System

No response

@damianwadley
Copy link
Member

Looks like a bug, but it's not with the default character list - it's because giving characters and not giving characters works differently, and that should not be the case.

Relevant functions are all near each other:

static zend_string* mb_trim_default_chars(zend_string *str, mb_trim_mode mode, const mbfl_encoding *enc)
{
const uint32_t trim_default_chars[] = {
0x20, 0x0C, 0x0A, 0x0D, 0x09, 0x0B, 0x00, 0xA0, 0x1680,
0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007,
0x2008, 0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
0x85, 0x180E
};
size_t trim_default_chars_length = sizeof(trim_default_chars) / sizeof(uint32_t);
HashTable what_ht;
zval val;
ZVAL_TRUE(&val);
zend_hash_init(&what_ht, trim_default_chars_length, NULL, NULL, false);
for (size_t i = 0; i < trim_default_chars_length; i++) {
zend_hash_index_add_new(&what_ht, trim_default_chars[i], &val);
}
zend_string* retval = trim_each_wchar(str, &what_ht, NULL, 0, mode, enc);
zend_hash_destroy(&what_ht);
return retval;
}
static zend_string* mb_trim_what_chars(zend_string *str, zend_string *what, mb_trim_mode mode, const mbfl_encoding *enc)
{
unsigned char *what_in = (unsigned char*)ZSTR_VAL(what);
uint32_t what_wchar_buf[128];
size_t what_out_len = 0;
unsigned int state = 0;
size_t what_len = ZSTR_LEN(what);
HashTable what_ht;
zval val;
bool hash_initialized = false;
while (what_len) {
what_out_len = enc->to_wchar(&what_in, &what_len, what_wchar_buf, 128, &state);
ZEND_ASSERT(what_out_len <= 128);
if (what_out_len <= 4 && !hash_initialized) {
return trim_each_wchar(str, NULL, what_wchar_buf, what_out_len, mode, enc);
} else {
if (!hash_initialized) {
hash_initialized = true;
ZVAL_TRUE(&val);
zend_hash_init(&what_ht, what_len, NULL, NULL, false);
}
for (size_t i = 0; i < what_out_len; i++) {
zend_hash_index_add(&what_ht, what_wchar_buf[i], &val);
}
}
}
if (UNEXPECTED(!hash_initialized)) {
/* This is only possible if what is empty */
return zend_string_copy(str);
}
zend_string *retval = trim_each_wchar(str, &what_ht, NULL, 0, mode, enc);
zend_hash_destroy(&what_ht);
return retval;
}
static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode)
{
zend_string *str;
zend_string *what = NULL;
zend_string *encoding = NULL;
ZEND_PARSE_PARAMETERS_START(1, 3)
Z_PARAM_STR(str)
Z_PARAM_OPTIONAL
Z_PARAM_STR(what)
Z_PARAM_STR_OR_NULL(encoding)
ZEND_PARSE_PARAMETERS_END();
const mbfl_encoding *enc = php_mb_get_encoding(encoding, 3);
if (!enc) {
RETURN_THROWS();
}
if (what) {
RETURN_STR(mb_trim_what_chars(str, what, mode, enc));
} else {
RETURN_STR(mb_trim_default_chars(str, mode, enc));
}
}

Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified.
Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.

These two pairs of outputs should be the same.

$input_utf8 = "\u{3000}abc\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, encoding: "UTF-8"))); // 3

$trimable_utf8 = "\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, $trimable_utf8, "UTF-8"))); // 3

//

$input_sjis = mb_convert_encoding($input_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, encoding: "Shift_JIS"))); // 7

$trimable_sjis = mb_convert_encoding($trimable_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, $trimable_sjis, "Shift_JIS"))); // 3

@ranvis
Copy link
Contributor Author

ranvis commented Mar 27, 2024

Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified.

Yes.

Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.

Yes. The literal array in .c is already in wide (Unicode) form.
Not sure if I read correctly. Let me describe some more to make my intention clearer.

  1. If mb_trim is called with one argument from PHP, the C function receives just one parameter. The remaining two are not updated on ZPP; what is null.
  2. If encoding: is specified in PHP but $characters is not, the default value of $characters defined in the stub (mbstring_arginfo.h) is copied to call_args before the internal function invocation, as internal functions cannot support named arguments directly. As a result, the function receives three parameters; what is what is defined as a default value in the stub.
    This causes the problem. While $encoding is an arbitrary value specified in userland, $characters is taken from the hard-coded stub, which is currently a sequence of UTF-8 bytes.

By changing the signature to ?string $characters = null, what is null on the second case too; the documentation might be able to say like:

When $characters is null, trimmed characters are as follows: U+0020, U+000C, U+000A, U+000D, U+0009, U+000B, U+0000, U+00A0, U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+2028, U+2029, U+202F, U+205F, U+3000, U+0085 and U+180E.

And it looks like what the code intended.
The implementation works fine. However, the stub's default value declaration " \f\n\r\t\v\x00\u{00A0}... causes the problem.

@damianwadley
Copy link
Member

Whether the parameter's default value is a string or is null is not the problem. The problem is that this

mb_trim($input, /* $characters is "\x20\x0C...", */ encoding: "Shift_JIS")

and this

mb_trim($input, "\x20\x0C...", "Shift_JIS")

behave differently. That is what should to be fixed. And there are multiple ways it could be fixed.
Yes, that could also be a time to decide whether to make the parameter use null as the default.

CC @youkidearitai @alexdowad @nielsdos who were part of #12459, and where I see some brief conversation that touched on the subject of omitting the $characters list.

@youkidearitai
Copy link
Contributor

If change to default parameter to $characters = null, I think need an RFC.

@ranvis What should we do solve this issue? I can't understand "inaccurate" that collect behavior (because my English is poor).

@nielsdos
Copy link
Member

The problem is that the stub file is utf8, and so the unicode characters in the default value are encoded as utf8 too. That means when using a different character encoding, we have a mismatch in encoding.

Changing the argument type would indeed fix this.
We need to bring it up on the mailing list and go from there.

@ranvis
Copy link
Contributor Author

ranvis commented Mar 27, 2024

Thank you @.nielsdos for describing concisely :)

@youkidearitai
Depending on how the function is called, the different internal function is called. See below:

--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3139,8 +3139,10 @@ static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode)
        }

        if (what) {
+               puts("mb_trim_what_chars()");
                RETURN_STR(mb_trim_what_chars(str, what, mode, enc));
        } else {
+               puts("mb_trim_default_chars()");
                RETURN_STR(mb_trim_default_chars(str, mode, enc));
        }
 }
<?php
echo "single argument: ";
mb_strlen(mb_trim("\u{3000}"));
echo "named argument: ";
mb_strlen(mb_trim("\u{3000}", encoding: 'UTF-8'));

This will print:

single argument: mb_trim_default_chars()
named argument: mb_trim_what_chars()

So, if user call the function using "named argument" without the $characters parameter specified, what is set to "\x20\x0c\x0a\x0d...\xe3\x80\x80\xc2\x85\xe1\xa0\x8e", the UTF-8 encoded string. But mb_trim expects what to be in encoding encoding.
If encoding is not UTF-8, the above two calls works differently.
I called it inaccuracy for the function doesn't actually use the advertised default value for "single argument" case. Or should I have called it discrepancy?
Well, sorry if my subject confused you. I hope this clarifies your question.

nielsdos added a commit to nielsdos/php-src that referenced this issue Mar 27, 2024
Because the default characters are defined in the stub file, and the
stub file is UTF-8 (typically), the characters are encoded in the string
as UTF-8. When using a different character encoding, there is a mismatch
between what mb_trim expects and the UTF-8 encoded string it gets.

One way of solving this is by making the characters argument nullable,
which would mean that it always uses the internal code path that has the
unicode codepoints that are defaulted actually stored as codepoint
numbers instead of in a string.

Co-authored-by: @ranvis
@nielsdos
Copy link
Member

nielsdos commented Mar 27, 2024

I've made a PoC PR to fix this with the proposed solution (i.e. making the argument null by default): #13820. Maybe there are other, nicer, solutions possible.
Note: this is a proof-of-concept, not a commitment to this change. It must be discussed on the mailing list. This is just to show that the proposed solution would work.

@youkidearitai
Copy link
Contributor

Ah, I got it.
$character is dependent to UTF-8, then encoding: SJIS is not compatible $character between $encoding, right?
In a sense, according to specifications, but it's not intuitive. (And, not compatible mapping UTF-8 between SJIS, for example, \u{00A0} convert to ? (means can not convert))

There is also the problem of #13789, I think posted #13820 by @nielsdos seems to make sense.
Thank you post the issue @ranvis , I will post and discussion PHP internals.

@ranvis
Copy link
Contributor Author

ranvis commented Mar 28, 2024

OK. I'll try to keep an eye on it.

Probably I could have focused on making examples.

mb_internal_encoding('Shift_JIS');
$str = mb_convert_encoding('俄には信じ難い?', 'Shift_JIS', 'UTF-8');
var_dump(mb_convert_encoding(mb_trim($str, encoding: 'Shift_JIS'), 'UTF-8'));
// string(19) "には信じ難い?"

@youkidearitai
Copy link
Contributor

Thanks all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants