Error handling for UTF-8 complies with WHATWG specification
In 7502c86, I adjusted the number of error markers emitted on invalid UTF-8 text to be more consistent with mbstring's behavior on other text encodings (generally, it emits one error marker for one unexpected byte). I didn't expect that anybody would actually care one way or the other, but felt that it was better to be consistent than not.

Later, Martin Auswöger kindly pointed out that the WHATWG encoding specification, which governs how various text encodings are handled by web browsers, does actually specify how many error markers should be generated for any given piece of invalid UTF-8 text.

Until now, we have never really paid much attention to the WHATWG specification, but we do want to comply with as many relevant specifications as possible. And since PHP is commonly used for web applications, compatibility with the behavior of web browsers is obviously a good thing.
Showing 5 changed files with 81 additions and 29 deletions.
@@ -0,0 +1,56 @@
--TEST--
Confirm error handling for UTF-8 complies with WHATWG spec
--EXTENSIONS--
mbstring
--FILE--
<?php
/* The WHATWG specifies not just how web browsers should handle _valid_
 * UTF-8 text, but how they should handle _invalid_ UTF-8 text (such
 * as how many error markers each invalid byte sequence should decode
 * to).
 * That specification is followed by the JavaScript Encoding API.
 *
 * The API documentation for mb_convert_encoding does not specify how
 * many error markers we will emit for each possible invalid byte
 * sequence, so we might as well comply with the WHATWG specification.
 *
 * Thanks to Martin Auswöger for pointing this out... and another big
 * thanks for providing test cases!
 *
 * Ref: https://encoding.spec.whatwg.org/#utf-8-decoder
 */
mb_substitute_character(0x25);

$testCases = [
  ["\x80", "%"],
  ["\xFF", "%"],
  ["\xC2\x7F", "%\x7F"],
  ["\xC2\x80", "\xC2\x80"],
  ["\xDF\xBF", "\xDF\xBF"],
  ["\xDF\xC0", "%%"],
  ["\xE0\xA0\x7F", "%\x7F"],
  ["\xE0\xA0\x80", "\xE0\xA0\x80"],
  ["\xEF\xBF\xBF", "\xEF\xBF\xBF"],
  ["\xEF\xBF\xC0", "%%"],
  ["\xF0\x90\x80\x7F", "%\x7F"],
  ["\xF0\x90\x80\x80", "\xF0\x90\x80\x80"],
  ["\xF4\x8F\xBF\xBF", "\xF4\x8F\xBF\xBF"],
  ["\xF4\x8F\xBF\xC0", "%%"],
  ["\xFA\x80\x80\x80\x80", "%%%%%"],
  ["\xFB\xBF\xBF\xBF\xBF", "%%%%%"],
  ["\xFD\x80\x80\x80\x80\x80", "%%%%%%"],
  ["\xFD\xBF\xBF\xBF\xBF\xBF", "%%%%%%"]
];

foreach ($testCases as $testCase) {
  $result = mb_convert_encoding($testCase[0], 'UTF-8', 'UTF-8');
  if ($result !== $testCase[1]) {
    die("Expected UTF-8 string " . bin2hex($testCase[0]) . " to convert to UTF-8 string " . bin2hex($testCase[1]) . "; got " . bin2hex($result));
  }
}

echo "All done!\n";

?>
--EXPECT--
All done!
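
To make the expected values in the test above easier to follow, here is a minimal pure-PHP sketch of the error-counting rules in the WHATWG UTF-8 decoder (https://encoding.spec.whatwg.org/#utf-8-decoder). It is an illustration only, not the code mbstring uses: the function name `whatwg_utf8_scrub` and its `$marker` parameter are hypothetical, and instead of emitting code points it simply copies valid byte sequences through unchanged.

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): copy valid UTF-8 sequences
 * through unchanged and emit $marker once per decoding error, following
 * the state machine described in the WHATWG UTF-8 decoder.
 */
function whatwg_utf8_scrub(string $input, string $marker = '%'): string
{
    $output  = '';
    $pending = '';    // bytes of the multi-byte sequence being decoded
    $needed  = 0;     // continuation bytes still expected
    $lower   = 0x80;  // allowed range for the next continuation byte
    $upper   = 0xBF;
    $length  = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $byte = ord($input[$i]);

        if ($needed === 0) {
            if ($byte <= 0x7F) {
                $output .= $input[$i];               // ASCII passes through
            } elseif ($byte >= 0xC2 && $byte <= 0xDF) {
                $needed = 1;
            } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
                $needed = 2;
                if ($byte === 0xE0) { $lower = 0xA0; } // excludes overlong forms
                if ($byte === 0xED) { $upper = 0x9F; } // excludes surrogates
            } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
                $needed = 3;
                if ($byte === 0xF0) { $lower = 0x90; } // excludes overlong forms
                if ($byte === 0xF4) { $upper = 0x8F; } // excludes > U+10FFFF
            } else {
                $output .= $marker;                  // byte can never start a sequence
            }
            if ($needed > 0) {
                $pending = $input[$i];
            }
            continue;
        }

        if ($byte < $lower || $byte > $upper) {
            // One marker for the interrupted sequence; then the offending
            // byte is decoded again from the initial state.
            $output .= $marker;
            $pending = '';
            $needed  = 0;
            $lower   = 0x80;
            $upper   = 0xBF;
            $i--;
            continue;
        }

        // Valid continuation byte: only the first one has a narrowed range.
        $lower = 0x80;
        $upper = 0xBF;
        $pending .= $input[$i];
        $needed--;
        if ($needed === 0) {
            $output .= $pending;
            $pending = '';
        }
    }

    // Input ended in the middle of a sequence: one marker for the whole tail.
    if ($needed !== 0) {
        $output .= $marker;
    }

    return $output;
}
```

Under these rules an interrupted sequence yields a single marker and the interrupting byte is then reprocessed on its own, which is why `"\xC2\x7F"` maps to `"%\x7F"` while `"\xDF\xC0"` maps to `"%%"` in the test cases.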