mb_trim() inaccurate $characters default value #13815

ranvis · 2024-03-26T22:33:11Z

Description

function mb_trim(string $string, string $characters = " \f\n\r\t\v\x00\u{00A0}...\u{180E}", ?string $encoding = null): string {}

The default values for the parameter $characters of the new mb_trim functions are not accurate.
When the very same value as the default is implied to $characters like the code below, mb_trim() does not necessarily work the same way, because $characters also depends on $encoding.
The parameter should be typed as ?string $characters = null instead.

<?php
$strUtf8 = "\u{3042}\u{3000}"; // U+3000: fullwidth space
var_dump(mb_strlen(mb_trim($strUtf8))); // 1
var_dump(mb_strlen(mb_trim($strUtf8, encoding: 'UTF-8'))); // 1

mb_internal_encoding('Shift_JIS');
$strSjis = mb_convert_encoding($strUtf8, 'Shift_JIS', 'UTF-8');
var_dump(mb_strlen(mb_trim($strSjis))); // 1
var_dump(mb_strlen(mb_trim($strSjis, encoding: 'Shift_JIS'))); // 2

PHP Version

PHP dev-master

Operating System

No response

The text was updated successfully, but these errors were encountered:

damianwadley · 2024-03-27T02:27:47Z

Looks like a bug, but it's not with the default character list - it's because giving characters and not giving characters works differently, and that should not be the case.

Relevant functions are all near each other:

php-src/ext/mbstring/mbstring.c

Lines 3058 to 3146 in 4d51bfa

    
           static zend_string* mb_trim_default_chars(zend_string *str, mb_trim_mode mode, const mbfl_encoding *enc) 
        
           { 
        
           	const uint32_t trim_default_chars[] = { 
        
           		0x20, 0x0C, 0x0A, 0x0D, 0x09, 0x0B, 0x00, 0xA0, 0x1680, 
        
           		0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 
        
           		0x2008, 0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000, 
        
           		0x85, 0x180E 
        
           	}; 
        
           	size_t trim_default_chars_length = sizeof(trim_default_chars) / sizeof(uint32_t); 
        
           	HashTable what_ht; 
        
           	zval val; 
        
           	ZVAL_TRUE(&val); 
        
           	zend_hash_init(&what_ht, trim_default_chars_length, NULL, NULL, false); 
        
           	for (size_t i = 0; i < trim_default_chars_length; i++) { 
        
           		zend_hash_index_add_new(&what_ht, trim_default_chars[i], &val); 
        
           	} 
        
           	zend_string* retval = trim_each_wchar(str, &what_ht, NULL, 0, mode, enc); 
        
           	zend_hash_destroy(&what_ht); 
        
           	return retval; 
        
           } 
        
           static zend_string* mb_trim_what_chars(zend_string *str, zend_string *what, mb_trim_mode mode, const mbfl_encoding *enc) 
        
           { 
        
           	unsigned char *what_in = (unsigned char*)ZSTR_VAL(what); 
        
           	uint32_t what_wchar_buf[128]; 
        
           	size_t what_out_len = 0; 
        
           	unsigned int state = 0; 
        
           	size_t what_len = ZSTR_LEN(what); 
        
           	HashTable what_ht; 
        
           	zval val; 
        
           	bool hash_initialized = false; 
        
           	while (what_len) { 
        
           		what_out_len = enc->to_wchar(&what_in, &what_len, what_wchar_buf, 128, &state); 
        
           		ZEND_ASSERT(what_out_len <= 128); 
        
           		if (what_out_len <= 4 && !hash_initialized) { 
        
           			return trim_each_wchar(str, NULL, what_wchar_buf, what_out_len, mode, enc); 
        
           		} else { 
        
           			if (!hash_initialized) { 
        
           				hash_initialized = true; 
        
           				ZVAL_TRUE(&val); 
        
           				zend_hash_init(&what_ht, what_len, NULL, NULL, false); 
        
           			} 
        
           			for (size_t i = 0; i < what_out_len; i++) { 
        
           				zend_hash_index_add(&what_ht, what_wchar_buf[i], &val); 
        
           			} 
        
           		} 
        
           	} 
        
           	if (UNEXPECTED(!hash_initialized)) { 
        
           		/* This is only possible if what is empty */ 
        
           		return zend_string_copy(str); 
        
           	} 
        
           	zend_string *retval = trim_each_wchar(str, &what_ht, NULL, 0, mode, enc); 
        
           	zend_hash_destroy(&what_ht); 
        
           	return retval; 
        
           } 
        
           static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode) 
        
           { 
        
           	zend_string *str; 
        
           	zend_string *what = NULL; 
        
           	zend_string *encoding = NULL; 
        
           	ZEND_PARSE_PARAMETERS_START(1, 3) 
        
           		Z_PARAM_STR(str) 
        
           		Z_PARAM_OPTIONAL 
        
           		Z_PARAM_STR(what) 
        
           		Z_PARAM_STR_OR_NULL(encoding) 
        
           	ZEND_PARSE_PARAMETERS_END(); 
        
           	const mbfl_encoding *enc = php_mb_get_encoding(encoding, 3); 
        
           	if (!enc) { 
        
           		RETURN_THROWS(); 
        
           	} 
        
           	if (what) { 
        
           		RETURN_STR(mb_trim_what_chars(str, what, mode, enc)); 
        
           	} else { 
        
           		RETURN_STR(mb_trim_default_chars(str, mode, enc)); 
        
           	} 
        
           }

Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified.
Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.

These two pairs of outputs should be the same.

$input_utf8 = "\u{3000}abc\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, encoding: "UTF-8"))); // 3

$trimable_utf8 = "\u{3000}";
var_dump(mb_strlen(mb_trim($input_utf8, $trimable_utf8, "UTF-8"))); // 3

//

$input_sjis = mb_convert_encoding($input_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, encoding: "Shift_JIS"))); // 7

$trimable_sjis = mb_convert_encoding($trimable_utf8, "Shift_JIS", "UTF-8");
var_dump(mb_strlen(mb_trim($input_sjis, $trimable_sjis, "Shift_JIS"))); // 3

ranvis · 2024-03-27T05:11:30Z

Calls mb_trim_what_chars when there is a character list, said characters are enc->to_wchar-ified.

Yes.

Calls mb_trim_default_chars when there is no character list, default characters are not enc->to_wchar-ified.

Yes. The literal array in .c is already in wide (Unicode) form.
Not sure if I read correctly. Let me describe some more to make my intention clearer.

If mb_trim is called with one argument from PHP, the C function receives just one parameter. The remaining two are not updated on ZPP; what is null.
If encoding: is specified in PHP but $characters is not, the default value of $characters defined in the stub (mbstring_arginfo.h) is copied to call_args before the internal function invocation, as internal functions cannot support named arguments directly. As a result, the function receives three parameters; what is what is defined as a default value in the stub.
This causes the problem. While $encoding is an arbitrary value specified in userland, $characters is taken from the hard-coded stub, which is currently a sequence of UTF-8 bytes.

By changing the signature to ?string $characters = null, what is null on the second case too; the documentation might be able to say like:

When $characters is null, trimmed characters are as follows: U+0020, U+000C, U+000A, U+000D, U+0009, U+000B, U+0000, U+00A0, U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+2028, U+2029, U+202F, U+205F, U+3000, U+0085 and U+180E.

And it looks like what the code intended.
The implementation works fine. However, the stub's default value declaration " \f\n\r\t\v\x00\u{00A0}... causes the problem.

damianwadley · 2024-03-27T05:45:18Z

Whether the parameter's default value is a string or is null is not the problem. The problem is that this

mb_trim($input, /* $characters is "\x20\x0C...", */ encoding: "Shift_JIS")

and this

mb_trim($input, "\x20\x0C...", "Shift_JIS")

behave differently. That is what should to be fixed. And there are multiple ways it could be fixed.
Yes, that could also be a time to decide whether to make the parameter use null as the default.

CC @youkidearitai @alexdowad @nielsdos who were part of #12459, and where I see some brief conversation that touched on the subject of omitting the $characters list.

youkidearitai · 2024-03-27T07:47:09Z

If change to default parameter to $characters = null, I think need an RFC.

@ranvis What should we do solve this issue? I can't understand "inaccurate" that collect behavior (because my English is poor).

nielsdos · 2024-03-27T07:57:34Z

The problem is that the stub file is utf8, and so the unicode characters in the default value are encoded as utf8 too. That means when using a different character encoding, we have a mismatch in encoding.

Changing the argument type would indeed fix this.
We need to bring it up on the mailing list and go from there.

ranvis · 2024-03-27T12:45:52Z

Thank you @.nielsdos for describing concisely :)

@youkidearitai
Depending on how the function is called, the different internal function is called. See below:

--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3139,8 +3139,10 @@ static void php_do_mb_trim(INTERNAL_FUNCTION_PARAMETERS, mb_trim_mode mode)
        }

        if (what) {
+               puts("mb_trim_what_chars()");
                RETURN_STR(mb_trim_what_chars(str, what, mode, enc));
        } else {
+               puts("mb_trim_default_chars()");
                RETURN_STR(mb_trim_default_chars(str, mode, enc));
        }
 }

<?php
echo "single argument: ";
mb_strlen(mb_trim("\u{3000}"));
echo "named argument: ";
mb_strlen(mb_trim("\u{3000}", encoding: 'UTF-8'));

This will print:

single argument: mb_trim_default_chars()
named argument: mb_trim_what_chars()

So, if user call the function using "named argument" without the $characters parameter specified, what is set to "\x20\x0c\x0a\x0d...\xe3\x80\x80\xc2\x85\xe1\xa0\x8e", the UTF-8 encoded string. But mb_trim expects what to be in encoding encoding.
If encoding is not UTF-8, the above two calls works differently.
I called it inaccuracy for the function doesn't actually use the advertised default value for "single argument" case. Or should I have called it discrepancy?
Well, sorry if my subject confused you. I hope this clarifies your question.

@ranvis

Because the default characters are defined in the stub file, and the stub file is UTF-8 (typically), the characters are encoded in the string as UTF-8. When using a different character encoding, there is a mismatch between what mb_trim expects and the UTF-8 encoded string it gets. One way of solving this is by making the characters argument nullable, which would mean that it always uses the internal code path that has the unicode codepoints that are defaulted actually stored as codepoint numbers instead of in a string. Co-authored-by: @ranvis

nielsdos · 2024-03-27T21:26:52Z

I've made a PoC PR to fix this with the proposed solution (i.e. making the argument null by default): #13820. Maybe there are other, nicer, solutions possible.
Note: this is a proof-of-concept, not a commitment to this change. It must be discussed on the mailing list. This is just to show that the proposed solution would work.

youkidearitai · 2024-03-27T22:52:41Z

Ah, I got it.
$character is dependent to UTF-8, then encoding: SJIS is not compatible $character between $encoding, right?
In a sense, according to specifications, but it's not intuitive. (And, not compatible mapping UTF-8 between SJIS, for example, \u{00A0} convert to ? (means can not convert))

There is also the problem of #13789, I think posted #13820 by @nielsdos seems to make sense.
Thank you post the issue @ranvis , I will post and discussion PHP internals.

ranvis · 2024-03-28T01:02:25Z

OK. I'll try to keep an eye on it.

Probably I could have focused on making examples.

mb_internal_encoding('Shift_JIS');
$str = mb_convert_encoding('俄には信じ難い?', 'Shift_JIS', 'UTF-8');
var_dump(mb_convert_encoding(mb_trim($str, encoding: 'Shift_JIS'), 'UTF-8'));
// string(19) "には信じ難い?"

youkidearitai · 2024-04-24T07:23:10Z

Thanks all.

ranvis added Bug Status: Needs Triage labels Mar 26, 2024

ranvis mentioned this issue Mar 26, 2024

Build failed at mbstring_arginfo.h on Windows(Visual C++) #13789

Closed

damianwadley added Extension: mbstring Status: Verified and removed Status: Needs Triage labels Mar 27, 2024

Ayesh mentioned this issue Mar 31, 2024

Mb trim functions symfony/polyfill#469

Open

youkidearitai mentioned this issue Apr 7, 2024

Fix GH-13789 build failed mbstring_arginfo.h when Visual C++ on Windows #13906

Merged

nielsdos closed this as completed in f813708 Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mb_trim() inaccurate $characters default value #13815

mb_trim() inaccurate $characters default value #13815

ranvis commented Mar 26, 2024

damianwadley commented Mar 27, 2024

ranvis commented Mar 27, 2024

damianwadley commented Mar 27, 2024

youkidearitai commented Mar 27, 2024

nielsdos commented Mar 27, 2024

ranvis commented Mar 27, 2024

nielsdos commented Mar 27, 2024 •

edited

youkidearitai commented Mar 27, 2024

ranvis commented Mar 28, 2024

youkidearitai commented Apr 24, 2024

mb_trim() inaccurate $characters default value #13815

mb_trim() inaccurate $characters default value #13815

Comments

ranvis commented Mar 26, 2024

Description

PHP Version

Operating System

damianwadley commented Mar 27, 2024

ranvis commented Mar 27, 2024

damianwadley commented Mar 27, 2024

youkidearitai commented Mar 27, 2024

nielsdos commented Mar 27, 2024

ranvis commented Mar 27, 2024

nielsdos commented Mar 27, 2024 • edited

youkidearitai commented Mar 27, 2024

ranvis commented Mar 28, 2024

youkidearitai commented Apr 24, 2024

nielsdos commented Mar 27, 2024 •

edited