Multi-byte word problem with TrimStrings Middleware. #40577

Closed
nshiro opened this issue Jan 24, 2022 · 7 comments · Fixed by #40600

@nshiro
Contributor

nshiro commented Jan 24, 2022

  • Laravel Version: 9.0.0-beta.3
  • PHP Version: 8.1.0

Description:

In 9.x, the TrimStrings middleware also removes NBSP (#38117).
This causes a problem with multi-byte text (in my case Japanese): some characters become garbled.
When だ or ム is at the end of the text, that character is garbled.

Steps to Reproduce (quick version)

Add the line below to welcome.blade.php.

<h1>Hello {{ request()->name }}</h1>

Open these URLs:
http://localhost/?name=やま
http://localhost/?name=やまだ

(Replace localhost with your domain. やま and やまだ may appear percent-encoded in your browser's URL field.)

やま is displayed correctly, but when だ is added the output is garbled.

[Screenshot: 2022-01-25_09h46_35]

Steps To Reproduce: (Original version)

[Caution]
GitHub replaces NBSP with a normal space when it renders code, so the NBSP below is written as the escape sequence "\u{00A0}".
If you use a literal NBSP as the second argument of the trim function instead, copy it from the real source code. (Please don't copy it from the GitHub website.)

$str = 'あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらをわがぎぐげござしずぜぞだぢづでどばびぶべぼぱぴぷぺぽ'
    . 'アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラヲワガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ';

// Split the string into individual characters. (This part is unrelated to the problem.)
$words = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

foreach ($words as $word) {
    echo $word.' '.bin2hex($word).' '.trim($word, "\u{00A0}")."<br>";
}

[Screenshot: 2022-01-25_09h16_51]

I guess the reason is that だ is e381a0 and ム is e383a0 in UTF-8, while NBSP is U+00A0 in Unicode (c2a0 in UTF-8).
So if だ or ム is at the end of the text, its last byte a0 is trimmed and the character is garbled.
We (Japanese) also use Chinese characters, which I haven't looked into.
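
The byte-level guess above can be checked with a short sketch. It assumes the NBSP reaches trim as its UTF-8 bytes (written here as the escape "\u{00A0}"). Because trim treats its second argument as a list of single bytes, c2 and a0 are each stripped on their own. The variable names are illustrative only.

```php
<?php

// NBSP (U+00A0) written as an escape so it survives copy and paste.
$nbsp = "\u{00A0}";

// In UTF-8, NBSP is the two bytes c2 a0.
echo 'NBSP ', bin2hex($nbsp), PHP_EOL;

foreach (['や', 'ま', 'だ', 'ム'] as $char) {
    // trim() works on bytes, so it strips a leading or trailing c2 or a0
    // byte even when that byte belongs to a multi-byte character.
    $trimmed = trim($char, $nbsp);

    // preg_match with the u modifier fails on invalid UTF-8,
    // which makes it a cheap validity check without extra extensions.
    $valid = preg_match('//u', $trimmed) === 1;

    echo $char, ' ', bin2hex($char), ' -> ', bin2hex($trimmed),
        $valid ? ' (valid UTF-8)' : ' (broken UTF-8)', PHP_EOL;
}
```

In this sketch だ (e381a0) comes back as e381 and ム (e383a0) as e383, which are incomplete UTF-8 sequences, while や and ま are left untouched.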

[Screenshots: 2022-01-24_15h06_27, 2022-01-24 15 37 02 trim719fac98c688]

Thank you for reading.

@driesvints
Member

Ping @allowing

@rodrigopedra
Contributor

rodrigopedra commented Jan 24, 2022

Hey @nshiro, I can't replicate this issue...

I used an expanded version of your code:

<?php

$str = 'あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらをわがぎぐげござしずぜぞだぢづでどばびぶべぼぱぴぷぺぽ'
    . 'アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラヲワガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ';

// split a string into each words. (This part is not related with the problem)
$words = preg_split('//u', $str);

// Maybe try adding this to your code sample
echo '<meta charset="utf-8">', PHP_EOL;

echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    // raw word
    echo '<td>', $word, '</td>', PHP_EOL;
    echo '<td>', bin2hex($word), '</td>', PHP_EOL;

    // raw trim
    echo '<td>', trim($word), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word)), '</td>', PHP_EOL;

    // trim with just a space
    echo '<td>', trim($word, " "), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " ")), '</td>', PHP_EOL;

    // trim as PR #38117
    echo '<td>', trim($word, "  \t\n\r\0\x0B"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, "  \t\n\r\0\x0B")), '</td>', PHP_EOL;

    // trim as I use in my projects
    // I add \x08 to the default characters described in the PHP docs
    // https://www.php.net/manual/en/function.trim
    echo '<td>', trim($word, " \n\r\t\v\x00\x08"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " \n\r\t\v\x00\x08")), '</td>', PHP_EOL;


    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

I got these results on Firefox:

[two screenshots]

And these results on Chromium:

[two screenshots]

Note that I added <meta charset="utf-8"> to the script's output (see line 10).

Also check whether your code editor saves the file with UTF-8 encoding:

[screenshot]

@nshiro
Contributor Author

nshiro commented Jan 25, 2022

@rodrigopedra
Thank you for your response, and I'm sorry for the inconvenience.
I assume you were using a normal space, not NBSP.
I have updated my comment.

It looks like the GitHub website replaces NBSP with a normal space.
So please use NBSP as the second argument of the trim function.
You can copy NBSP from the real source code. (Please don't copy it from the GitHub website.)

Please check the result below. The problem still happens. I used part of your script.

[Screenshot: 2022-01-25_09h26_23]

I also added another version that reproduces the problem quickly. Please see the first comment.

@rodrigopedra
Contributor

rodrigopedra commented Jan 25, 2022

@nshiro thanks for the heads up. Maybe when I copied and pasted the code, my browser, OS, or editor converted the NBSP to a regular space.

I looked into my recent projects, and I actually use a newer approach to deal with NBSP: preg_replace with the Unicode modifier.

<?php

$str = 'あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらをわがぎぐげござしずぜぞだぢづでどばびぶべぼぱぴぷぺぽ'
    . 'アイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラヲワガギグゲゴザジズゼゾダヂヅデドバビブベボパピプペポ';

// split a string into each words. (This part is not related with the problem)
$words = preg_split('//u', $str);

// adding two more test characters
// (NBSP is written as \u{00A0} because GitHub renders a literal NBSP as a regular space)
$words[] = "\u{00A0}\u{00A0}だ  "; // 2x NBSP before, 2x regular spaces after
$words[] = "  だ\u{00A0}\u{00A0}"; // 2x regular spaces before, 2x NBSP after
$words[] = "\u{00A0} だ\u{00A0} "; // NBSP + regular space before and after

$words[] = "\u{00A0}\u{00A0}ム  "; // 2x NBSP before, 2x regular spaces after
$words[] = "  ム\u{00A0}\u{00A0}"; // 2x regular spaces before, 2x NBSP after
$words[] = "\u{00A0} ム\u{00A0} "; // NBSP + regular space before and after

// Maybe try adding this to your code sample
echo '<meta charset="utf-8">', PHP_EOL;

echo '<table border="1">', PHP_EOL;

foreach ($words as $word) {
    echo '<tr>', PHP_EOL;

    // raw word
    echo '<td>', $word, '</td>', PHP_EOL;
    echo '<td>', bin2hex($word), '</td>', PHP_EOL;

    // raw trim
    echo '<td>', trim($word), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word)), '</td>', PHP_EOL;

    // trim with just a space
    echo '<td>', trim($word, " "), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " ")), '</td>', PHP_EOL;

    // trim as PR #38117 (NBSP written as \u{00A0} so it survives copy and paste)
    echo '<td>', trim($word, "\u{00A0} \t\n\r\0\x0B"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, "\u{00A0} \t\n\r\0\x0B")), '</td>', PHP_EOL;

    // trim as I **used** in my projects
    // I add \x08 to the default characters described in the PHP docs
    // https://www.php.net/manual/en/function.trim
    echo '<td>', trim($word, " \n\r\t\v\x00\x08"), '</td>', PHP_EOL;
    echo '<td>', bin2hex(trim($word, " \n\r\t\v\x00\x08")), '</td>', PHP_EOL;

    // transform I now use in my projects
    $value = preg_replace('~^\s+|\s+$~iu', '', $word);

    echo '<td>', $value, '</td>', PHP_EOL;
    echo '<td>', bin2hex($value), '</td>', PHP_EOL;

    echo '</tr>', PHP_EOL;
}

echo '</table>', PHP_EOL;

I added some extra cases to test it better. The results:

[screenshot]

I sent PR #40600 to modify the TrimStrings middleware to use preg_replace.
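
For trying the preg_replace approach outside the middleware, here is a minimal sketch. The helper name unicode_trim is hypothetical, and this is not the code from PR #40600. It reuses the pattern from the script above (without the i flag, which has no effect on \s) and assumes that PHP's u modifier makes \s match Unicode whitespace such as NBSP, as the results above suggest.

```php
<?php

// Hypothetical helper for illustration, not Laravel's implementation.
function unicode_trim(string $value): string
{
    // With the u modifier the pattern works on code points, not bytes,
    // so the a0 byte inside だ or ム is never treated as whitespace.
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $value);

    // preg_replace returns null on failure (for example invalid UTF-8);
    // fall back to the untouched input in that case.
    return $trimmed ?? $value;
}

$nbsp = "\u{00A0}"; // NBSP written as an escape so it survives copy and paste

$inputs = [
    "{$nbsp}{$nbsp}だ  ",     // 2x NBSP before, 2x regular spaces after
    "  ム{$nbsp}{$nbsp}",     // 2x regular spaces before, 2x NBSP after
    "{$nbsp} やまだ{$nbsp} ", // NBSP + regular space on both sides
];

foreach ($inputs as $input) {
    $output = unicode_trim($input);
    echo bin2hex($input), ' -> ', $output, ' ', bin2hex($output), PHP_EOL;
}
```

Falling back to the original value on a null result is a design choice of this sketch; a caller could just as well fall back to plain trim.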

@nshiro
Contributor Author

nshiro commented Jan 25, 2022

@rodrigopedra
I tested your script and found no problem.
I also tested it with some kanji, emoji, and other characters and found no problem.
Looks good to me!

Thank you for your support.

@rodrigopedra
Contributor

ping @nshiro and @foremtehan

Could you take a look at my comment on PR #40600 about supporting the word-joiner?

#40600 (comment)

I didn't want to spam a closed PR to avoid annoying the maintainers.

@allowing
Contributor

@nshiro Haha, hello to my friend from Japan. Sorry, I only saw this issue today. My solution for NBSP was too crude.
Could you check whether this one causes the same problem: #41949
