mbstring: UTF-8 sequences with out-of-bounds follow bytes are no longer substituted correctly #8360

@ausi

Description

The following code:

<?php
mb_substitute_character(ord('?'));
var_dump(mb_convert_encoding("\xDF\xC0", 'UTF-8', 'UTF-8'));
var_dump(mb_convert_encoding("\xEF\xBF\xC0", 'UTF-8', 'UTF-8'));
var_dump(mb_convert_encoding("\xF4\x8F\xBF\xC0", 'UTF-8', 'UTF-8'));

Resulted in this output:

string(1) "?"
string(1) "?"
string(1) "?"

But I expected this output instead (as it is in PHP 7.x and 8.0):

string(2) "??"
string(2) "??"
string(2) "??"

See https://3v4l.org/Qt2HJ

As far as I know, mb_convert_encoding(…, 'UTF-8', 'UTF-8') is commonly used to convert (potentially) non-conforming UTF-8 to valid UTF-8.

Before PHP 8.1.0 this worked as expected and matched the behavior of all major web browsers. Since PHP 8.1.0, a UTF-8 sequence with an out-of-bounds follow byte is replaced with only one substitute character (usually rendered as �) instead of two.

To my understanding this is defined in the encoding specification: https://encoding.spec.whatwg.org/#utf-8-decoder
The step 8.1.1.4.2 “Prepend byte to ioQueue.” seems to be missing in these cases.
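
To illustrate why two substitute characters are expected, here is a minimal pure-PHP sketch of the decoder loop as I read the spec. The function name `sketch_utf8_scrub` is hypothetical and not part of mbstring; the sketch copies valid sequences through unchanged and only models the error handling, including the "prepend byte" step that re-examines the offending byte as a possible new lead byte.

```php
<?php
// Hypothetical helper for illustration only; not mbstring's implementation.
function sketch_utf8_scrub(string $bytes, string $substitute = '?'): string
{
    $out = '';
    $needed = 0;    // continuation bytes still expected
    $start = 0;     // offset of the current lead byte
    $lower = 0x80;  // allowed range for the next continuation byte
    $upper = 0xBF;
    $len = strlen($bytes);

    for ($i = 0; $i < $len; $i++) {
        $byte = ord($bytes[$i]);

        if ($needed === 0) {
            $start = $i;
            if ($byte <= 0x7F) {
                $out .= $bytes[$i];
            } elseif ($byte >= 0xC2 && $byte <= 0xDF) {
                $needed = 1;
            } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
                if ($byte === 0xE0) { $lower = 0xA0; }
                if ($byte === 0xED) { $upper = 0x9F; }
                $needed = 2;
            } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
                if ($byte === 0xF0) { $lower = 0x90; }
                if ($byte === 0xF4) { $upper = 0x8F; }
                $needed = 3;
            } else {
                $out .= $substitute; // not a valid lead byte
            }
            continue;
        }

        if ($byte < $lower || $byte > $upper) {
            // Out-of-bounds follow byte: report one error for the
            // truncated sequence, then "prepend" the byte so the next
            // iteration looks at it again as a potential lead byte.
            $needed = 0;
            $lower = 0x80;
            $upper = 0xBF;
            $out .= $substitute;
            $i--;
            continue;
        }

        $lower = 0x80;
        $upper = 0xBF;
        $needed--;
        if ($needed === 0) {
            // Sequence complete: copy the original bytes through.
            $out .= substr($bytes, $start, $i - $start + 1);
        }
    }

    if ($needed !== 0) {
        $out .= $substitute; // input ended in the middle of a sequence
    }

    return $out;
}

var_dump(sketch_utf8_scrub("\xDF\xC0"));         // string(2) "??"
var_dump(sketch_utf8_scrub("\xEF\xBF\xC0"));     // string(2) "??"
var_dump(sketch_utf8_scrub("\xF4\x8F\xBF\xC0")); // string(2) "??"
```

In each input the trailing 0xC0 first terminates the pending sequence (one substitute) and is then rejected on its own because 0xC0 is never a valid lead byte (a second substitute). Dropping the `$i--` line reproduces the single "?" seen in 8.1.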

PHP Version

PHP 8.1.5 (all 8.1.x are affected)

Operating System

No response
