Multi-byte word problem with TrimStrings Middleware. #40577

nshiro · 2022-01-24T06:52:48Z

Laravel Version: 9.0.0-beta.3
PHP Version: 8.1.0

Description:

In 9.x, the TrimStrings middleware also trims NBSP (see #38117).
This causes a problem with multi-byte text (Japanese, in my case): some characters come out garbled.
When だ or ム is the last character of the text, that character is garbled.

Steps to Reproduce (quick version)

Add the line below to welcome.blade.php.

<h1>Hello {{ request()->name }}</h1>

Access these URLs:
http://localhost/?name=やま
http://localhost/?name=やまだ

(Replace localhost with your domain. やま or やまだ may appear percent-encoded in the URL field of your browser.)

やま is displayed correctly, but once だ is appended the name is garbled.

Steps To Reproduce: (Original version)

[Caution]
You cannot just copy and paste the code below, because the NBSP in it has been replaced with a normal space.
Please use NBSP as the second argument of the trim function.
You can copy NBSP from the real source code. (Please don't copy it from the GitHub website.)

$str = 'あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらをわがぎぐげござしずぜぞだぢづでどばびぶべぼぱぴぷぺぽ'
    . 'アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラヲワガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ';

$words = preg_split('//u', $str); // split the string into individual characters (not related to the problem)

foreach ($words as $word) {
    // The second argument of trim() must be an NBSP (U+00A0); it appears here as a normal space.
    echo $word.' '.bin2hex($word).' '.trim($word, " ")."<br>";
}

I guess the reason is that だ is encoded in UTF-8 as e3 81 a0 and ム as e3 83 a0, while NBSP is U+00A0, which is c2 a0 in UTF-8. trim() compares single bytes, not characters.
So if I put だ at the end of the text, its last byte a0 is trimmed and the character is garbled.
We (Japanese) also use Chinese characters, which I haven't looked into.
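
The byte-level explanation above can be checked with a minimal sketch. It assumes PHP 7.0+ (for the `\u{...}` escapes, which also keep the NBSP from being lost in copy and paste); the variable names are only illustrative.

```php
<?php

$nbsp = "\u{00A0}"; // NBSP, encoded in UTF-8 as the two bytes c2 a0
$da   = "\u{3060}"; // だ, encoded in UTF-8 as e3 81 a0

// trim() strips any leading or trailing byte found in the character list,
// so the final a0 byte of だ is removed just like a real NBSP would be.
$trimmed = trim($da, $nbsp);

echo bin2hex($nbsp), PHP_EOL;    // c2a0
echo bin2hex($da), PHP_EOL;      // e381a0
echo bin2hex($trimmed), PHP_EOL; // e381

// With the u modifier, preg_match() returns false for invalid UTF-8.
var_dump(preg_match('//u', $trimmed) === 1); // bool(false)
```

The remaining bytes e3 81 are an incomplete UTF-8 sequence, which is why the browser shows a garbled character.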

Thank you for reading.


driesvints · 2022-01-24T08:15:05Z

Ping @allowing

rodrigopedra · 2022-01-24T20:20:52Z

Hey @nshiro , I can't replicate this issue...

I used an expanded version of your code:

<?php

$str = 'あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらをわがぎぐげござしずぜぞだぢづでどばびぶべぼぱぴぷぺぽ'
    . 'アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラヲワガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ';

// split the string into individual characters (not related to the problem)
$words = preg_split('//u', $str);

// Maybe try adding this to your code sample
echo '<meta charset="utf-8">', PHP_EOL;

echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    // raw word
    echo '<td>', $word, '</td>', PHP_EOL;
    echo '<td>', bin2hex($word), '</td>', PHP_EOL;

    // raw trim
    echo '<td>', trim($word), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word)), '</td>', PHP_EOL;

    // trim with just a space
    echo '<td>', trim($word, " "), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " ")), '</td>', PHP_EOL;

    // trim as PR #38117
    echo '<td>', trim($word, "  \t\n\r\0\x0B"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, "  \t\n\r\0\x0B")), '</td>', PHP_EOL;

    // trim as I use in my projects
    // I add \x08 to the default character list described in the PHP docs
    // https://www.php.net/manual/en/function.trim
    echo '<td>', trim($word, " \n\r\t\v\x00\x08"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " \n\r\t\v\x00\x08")), '</td>', PHP_EOL;


    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

I got these results on firefox:

And these results on chromium:

Note I added the <meta charset="utf-8"> to the script's output (see line 10)

Also check if your code editor is saving the file with UTF-8 encoding:

nshiro · 2022-01-25T00:33:47Z

@rodrigopedra
Thank you for your response and I'm sorry for the inconvenience.
I assume you were using normal space not NBSP.
I updated the comment.

It looks like the GitHub website replaces NBSP with a normal space.
So please use NBSP as the second argument of the trim function.
You can copy NBSP from the real source code. (Please don't copy it from the GitHub website.)

Please check the result below. The problem still happens. I used part of your script.

I also added another version that can be reproduced quickly. Please see the first comment.

rodrigopedra · 2022-01-25T02:28:37Z

@nshiro thanks for the heads-up. When I copied and pasted the code, my browser, OS, or editor probably converted the NBSP to a regular space.

Looking through my recent projects, I found that I now use a newer approach to deal with NBSP. I use preg_replace with the u (Unicode) modifier:

<?php

$str = 'あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらをわがぎぐげござしずぜぞだぢづでどばびぶべぼぱぴぷぺぽ'
    . 'アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラヲワガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ';

// split the string into individual characters (not related to the problem)
$words = preg_split('//u', $str);

// adding two more test characters, each with mixed padding
// (NBSP is written as the escape \u{00A0} so it survives copy and paste)
$words[] = "\u{00A0}\u{00A0}だ  "; // 2x NBSP before, 2x regular spaces after
$words[] = "  だ\u{00A0}\u{00A0}"; // 2x regular spaces before, 2x NBSP after
$words[] = "\u{00A0} だ\u{00A0} "; // NBSP + regular space before and after

$words[] = "\u{00A0}\u{00A0}ム  "; // 2x NBSP before, 2x regular spaces after
$words[] = "  ム\u{00A0}\u{00A0}"; // 2x regular spaces before, 2x NBSP after
$words[] = "\u{00A0} ム\u{00A0} "; // NBSP + regular space before and after

// Maybe try adding this to your code sample
echo '<meta charset="utf-8">', PHP_EOL;

echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    // raw word
    echo '<td>', $word, '</td>', PHP_EOL;
    echo '<td>', bin2hex($word), '</td>', PHP_EOL;

    // raw trim
    echo '<td>', trim($word), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word)), '</td>', PHP_EOL;

    // trim with just a space
    echo '<td>', trim($word, " "), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " ")), '</td>', PHP_EOL;

    // trim as PR #38117 (regular space plus NBSP, written here as \u{00A0})
    echo '<td>', trim($word, " \u{00A0}\t\n\r\0\x0B"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " \u{00A0}\t\n\r\0\x0B")), '</td>', PHP_EOL;

    // trim as I previously used in my projects
    // I add \x08 to the default character list described in the PHP docs
    // https://www.php.net/manual/en/function.trim
    echo '<td>', trim($word, " \n\r\t\v\x00\x08"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " \n\r\t\v\x00\x08")), '</td>', PHP_EOL;

    // transform I now use in my projects
    $value = preg_replace('~^\s+|\s+$~iu', '', $word);

    echo '<td>', $value, '</td>', PHP_EOL;
    echo '<td>', bin2hex($value), '</td>', PHP_EOL;

    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

You can see I added some additional cases to test it better, and the results are these:

I sent PR #40600 to modify the TrimStrings middleware to use preg_replace
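
A minimal sketch of the preg_replace approach as a standalone function, assuming PHP 7.0+. The helper name `unicode_trim` and the fallback to the unmodified input are assumptions made for illustration; this is not the code from the PR.

```php
<?php

// Hypothetical helper for illustration; not the code merged in the PR.
function unicode_trim(string $value): string
{
    // With the u modifier the subject is matched as UTF-8 code points,
    // so \s can match NBSP without touching bytes inside other characters.
    $result = preg_replace('~^\s+|\s+$~u', '', $value);

    // preg_replace() returns null on error (for example invalid UTF-8),
    // in which case the input is returned unchanged.
    return $result ?? $value;
}

$input = "\u{00A0} \u{3060} \u{00A0}"; // NBSP, space, だ, space, NBSP

echo bin2hex(unicode_trim($input)), PHP_EOL; // e381a0
```

Because the pattern works on whole code points, the trailing a0 byte of だ is left alone while the surrounding NBSPs and regular spaces are removed.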

nshiro · 2022-01-25T04:19:19Z

@rodrigopedra
I tested your script and there was no problem.
I also tested it with some kanji, emojis and other characters and found no problem.
Looks good to me !

Thank you for your support.

rodrigopedra · 2022-01-26T06:22:11Z

ping @nshiro and @foremtehan

Could you take a look at my comment on PR #40600 about supporting the word-joiner?

#40600 (comment)

I didn't want to spam a closed PR to avoid annoying the maintainers.

allowing · 2022-04-13T10:38:57Z

@nshiro Haha, hello to my friend from Japan. Sorry, I only saw this issue today. My solution for NBSP was too simple and crude.
Could you check whether this one causes the same problem: #41949?

driesvints added the bug label Jan 24, 2022

driesvints mentioned this issue Jan 24, 2022

[9.x] TrimString can now remove NBSP #38117

Merged

rodrigopedra mentioned this issue Jan 25, 2022

Handle unicode characters on TrimStrings middleware #40596

Closed

rodrigopedra mentioned this issue Jan 25, 2022

[9.x] Handle unicode characters on TrimStrings middleware #40600

Merged

driesvints linked a pull request Jan 25, 2022 that will close this issue

[9.x] Handle unicode characters on TrimStrings middleware #40600

Merged

driesvints closed this as completed Jan 25, 2022
