Description
The following code:
<?php
mb_substitute_character(ord('?'));
var_dump(mb_convert_encoding("\xDF\xC0", 'UTF-8', 'UTF-8'));
var_dump(mb_convert_encoding("\xEF\xBF\xC0", 'UTF-8', 'UTF-8'));
var_dump(mb_convert_encoding("\xF4\x8F\xBF\xC0", 'UTF-8', 'UTF-8'));
Resulted in this output:
string(1) "?"
string(1) "?"
string(1) "?"
But I expected this output instead (as it is in PHP 7.x and 8.0):
string(2) "??"
string(2) "??"
string(2) "??"
As far as I know, mb_convert_encoding(…, 'UTF-8', 'UTF-8') is commonly used to convert (potentially) non-conforming UTF-8 to valid UTF-8.
Before PHP 8.1.0 this worked as expected and matched the behavior of all major web browsers. Since PHP 8.1.0, however, a UTF-8 sequence whose continuation byte is out of range is replaced with only one substitute character (usually rendered as �) instead of two.
To my understanding this is defined in the encoding specification: https://encoding.spec.whatwg.org/#utf-8-decoder
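Step 8.1.1.4.2, “Prepend byte to ioQueue.”, seems to be skipped in these cases: the out-of-range byte appears to be consumed together with the truncated sequence instead of being decoded again on its own.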
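To illustrate the behavior I expect, here is a minimal userland sketch of the decoder loop as I read it in the specification. It is not mbstring's implementation; the function name `whatwg_utf8_sanitize` and its `$substitute` parameter are made up for this example. The "prepend byte" step is modeled by stepping the index back so that the offending byte is decoded again.

```php
<?php
// Hypothetical helper, not part of mbstring: walks the bytes the way the
// WHATWG UTF-8 decoder does and replaces every decoding error with $substitute.
function whatwg_utf8_sanitize(string $input, string $substitute = '?'): string
{
    $out = '';
    $needed = 0;      // continuation bytes still required by the current sequence
    $seen = 0;        // continuation bytes accepted so far
    $lower = 0x80;    // allowed range for the next continuation byte
    $upper = 0xBF;
    $start = 0;       // offset of the lead byte of the current sequence
    $len = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $byte = ord($input[$i]);

        if ($needed === 0) {
            if ($byte <= 0x7F) {
                $out .= $input[$i];               // ASCII passes through
            } elseif ($byte >= 0xC2 && $byte <= 0xDF) {
                $needed = 1;
                $start = $i;
            } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
                if ($byte === 0xE0) { $lower = 0xA0; }  // reject overlong forms
                if ($byte === 0xED) { $upper = 0x9F; }  // reject surrogates
                $needed = 2;
                $start = $i;
            } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
                if ($byte === 0xF0) { $lower = 0x90; }  // reject overlong forms
                if ($byte === 0xF4) { $upper = 0x8F; }  // reject > U+10FFFF
                $needed = 3;
                $start = $i;
            } else {
                $out .= $substitute;              // invalid lead byte
            }
            continue;
        }

        if ($byte < $lower || $byte > $upper) {
            // Error: the sequence is cut short by an out-of-range byte.
            $needed = 0;
            $seen = 0;
            $lower = 0x80;
            $upper = 0xBF;
            $i--;                                 // "Prepend byte to ioQueue."
            $out .= $substitute;
            continue;
        }

        // Valid continuation byte.
        $lower = 0x80;
        $upper = 0xBF;
        $seen++;
        if ($seen === $needed) {
            $out .= substr($input, $start, $needed + 1);  // copy the whole sequence
            $needed = 0;
            $seen = 0;
        }
    }

    if ($needed !== 0) {
        $out .= $substitute;                      // input ended mid-sequence
    }

    return $out;
}
```

With this sketch, each of the three inputs from the reproduction above yields two substitute characters: one for the truncated sequence and one for the `\xC0` byte, which is not a valid lead byte when decoded on its own.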
PHP Version
PHP 8.1.5 (all 8.1.x are affected)
Operating System
No response