Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_substr returns a different value in PHP 8.4 #12972

Closed
MauricioFauth opened this issue Dec 19, 2023 · 4 comments · Fixed by #12983
Closed

mb_substr returns a different value in PHP 8.4 #12972

MauricioFauth opened this issue Dec 19, 2023 · 4 comments · Fixed by #12983

Comments

@MauricioFauth
Copy link
Member

MauricioFauth commented Dec 19, 2023

Description

The following code:

<?php
var_dump(bin2hex(mb_substr(hex2bin("8964"), 0, 1)));
var_dump(mb_substr('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellus sit lectus.', 19, -1));

Resulted in this output:

string(2) "3f"
string(109) "it amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellu"

But I expected this output instead:

string(2) "89"
string(121) "it amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellus sit lectus"

https://3v4l.org/KZrMo
https://3v4l.org/KZrMo/rfc#vgit.master

https://3v4l.org/B0sf5
https://3v4l.org/B0sf5/rfc#vgit.master

PHP Version

PHP 8.4 cffdeb8 (This is the commit I tested, not necessarily the one that introduced the issue.)

Operating System

Debian Trixie

@youkidearitai
Copy link
Contributor

It seems behavior changes by #12337

@MauricioFauth
Copy link
Member Author

I added one more weird behavior of mb_substr to the first comment.

<?php
var_dump(mb_substr('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellus sit lectus.', 19, -1));

Resulted in this output (Note the end of the result string):

string(109) "it amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellu"

But I expected this output instead:

string(121) "it amet, consectetur adipiscing elit. Vestibulum dapibus feugiat ex non cursus. Pellentesque vestibulum tellus sit lectus"

https://3v4l.org/B0sf5
https://3v4l.org/B0sf5/rfc#vgit.master

@MauricioFauth
Copy link
Member Author

Hi @alexdowad. Looks like this is related to your recent changes to mbstring.

@alexdowad
Copy link
Contributor

@MauricioFauth Thanks very much.

Change #1 is indeed related to my recent changes to mbstring, and it is the expected behavior. The test string which you kindly provided ("\x89\x64") is not a UTF-8 string, but you are trying to extract UTF-8 characters from it, which does not make sense. (I am assuming that mbstring.internal_encoding is UTF-8.)

So for point 1, the best things you can do are: A) explicitly specify the text encoding for your string when calling mb_substr, and B) ensure that you are operating on valid strings. You can do this by calling mb_check_encoding before operating on strings passed in by application users. In most cases, your application will be more secure if you refuse to operate on strings with invalid encoding (i.e. just return an error).

Change #2 is clearly a bug, and it is not caused by the recent changes to mbstring! The recent changes to mbstring just revealed the bug, because it is now affecting UTF-8 strings, whereas before, it would only affect certain other legacy text encodings.

The bug was introduced in 0c0774f5, back in October 2022. It's interesting that such a serious bug, and in such a heavily used function, was not noticed for 14 months. I think what this tells me is that (likely) the bulk of mbstring users are all operating on UTF-8 text, so bugs which only affect legacy text encodings are not so readily noticed.

So, @MauricioFauth, thank you for finally finding my mistake after 14 long months. Your test case will be added to the PHP test suite as a regression test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants