Miscellaneous Unicode utility functions.
Namespace pcrov\Unicode.
Translates a UTF-16 surrogate pair into a single code point. Wikipedia's UTF-16 article explains what this is fairly well.
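
The arithmetic behind that translation is short enough to show. The following is a minimal sketch, not this library's implementation; the function name `combine_surrogates()` and its range checks are assumptions made for illustration.

```php
/**
 * Hypothetical helper (not part of pcrov\Unicode): combines a UTF-16
 * high/low surrogate pair into the code point it represents.
 */
function combine_surrogates(int $high, int $low): int
{
    // High (lead) surrogates occupy U+D800..U+DBFF.
    if ($high < 0xD800 || $high > 0xDBFF) {
        throw new InvalidArgumentException("High surrogate out of range.");
    }
    // Low (trail) surrogates occupy U+DC00..U+DFFF.
    if ($low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException("Low surrogate out of range.");
    }
    // Each surrogate carries 10 bits; the result is offset by 0x10000
    // because surrogate pairs only encode supplementary code points.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}
```

For example, `combine_surrogates(0xD83D, 0xDE00)` yields `0x1F600`.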
Returns the position of the first invalid byte sequence or null if the input is valid.
Returns the first invalid byte sequence or null if the input is valid.
Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.
It is a nested array of the form [byte => [valid next byte => [...], ...], ...]
Example use:

    function utf8_generate_all_code_points(): string
    {
        $generator = function (array $machine, string $buffer = "") use (&$generator) {
            // Completed a UTF-8 encoded code point.
            if ($buffer !== "" && isset($machine["\x0"])) {
                return $buffer;
            }

            $out = "";
            foreach ($machine as $byte => $next) {
                $out .= $generator($next, $buffer . $byte);
            }
            return $out;
        };

        return $generator(utf8_get_state_machine());
    }

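
The same structure can also be walked over existing input, one byte at a time. The sketch below rests on an assumption inferred from the example above: that finishing a code point leads back to a state containing the `"\x0"` key. The helper `utf8_machine_accepts()` and the toy machine are hypothetical; the toy covers only one- and two-byte sequences so that the block runs without the library.

```php
/**
 * Hypothetical helper: walks a machine shaped like
 * [byte => [valid next byte => ...], ...] over $bytes.
 */
function utf8_machine_accepts(array $machine, string $bytes): bool
{
    $state = $machine;
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $byte = $bytes[$i];
        // No transition for this byte from the current state.
        if (!isset($state[$byte])) {
            return false;
        }
        $state = $state[$byte];
    }
    // Assumption: only the start state accepts "\x0", so its presence
    // means the input did not stop in the middle of a sequence.
    return isset($state["\x0"]);
}

// Toy machine for illustration only: ASCII plus two-byte sequences.
$toy = [];
foreach (range(0x00, 0x7F) as $b) {
    $toy[chr($b)] = &$toy; // An ASCII byte completes a code point.
}
$continuation = [];
foreach (range(0x80, 0xBF) as $b) {
    $continuation[chr($b)] = &$toy; // A trailing byte completes it.
}
foreach (range(0xC2, 0xDF) as $b) {
    $toy[chr($b)] = $continuation; // Lead byte of a two-byte sequence.
}
```

With the library installed, the array returned by `utf8_get_state_machine()` would presumably take the place of `$toy`.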
Does what it says on the box.
The test/data directory holds two files containing every valid UTF-8 encoded code point,
all 1,112,064 of them: one as plain text, the other as JSON. These are not included in
packaged stable releases but can be generated with the example utf8_generate_all_code_points()
function above, which returns the plain-text string.
Excerpts from the Unicode 10.0.0 standard:
Recreated here for ease of reference. Nobody likes PDFs.

| Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
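
Read row by row, the bit distribution above amounts to a few shifts and masks. The following is a minimal sketch of that mapping; `encode_scalar_value()` is a hypothetical name and not a function of this library.

```php
/**
 * Hypothetical helper: encodes a Unicode scalar value as UTF-8 by
 * distributing its bits as in the table above.
 */
function encode_scalar_value(int $cp): string
{
    // Surrogates (U+D800..U+DFFF) are not scalar values.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value.");
    }
    if ($cp <= 0x7F) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}
```

For example, `encode_scalar_value(0x20AC)` returns the three bytes `E2 82 AC`.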

| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
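
The byte ranges above are enough to write a simple validity check directly. What follows is a sketch of a table-driven scan and not how this library works internally (it offers a state machine for that); the name `find_ill_formed_utf8()` is hypothetical.

```php
/**
 * Hypothetical helper: returns the byte offset where the first
 * ill-formed sequence starts, or null if $s is well-formed UTF-8.
 */
function find_ill_formed_utf8(string $s): ?int
{
    // One row per multi-byte line of the table:
    // [first byte min, first byte max, second byte min, second byte max,
    //  number of trailing bytes]. Third and fourth bytes are always 80..BF.
    $rows = [
        [0xC2, 0xDF, 0x80, 0xBF, 1],
        [0xE0, 0xE0, 0xA0, 0xBF, 2],
        [0xE1, 0xEC, 0x80, 0xBF, 2],
        [0xED, 0xED, 0x80, 0x9F, 2],
        [0xEE, 0xEF, 0x80, 0xBF, 2],
        [0xF0, 0xF0, 0x90, 0xBF, 3],
        [0xF1, 0xF3, 0x80, 0xBF, 3],
        [0xF4, 0xF4, 0x80, 0x8F, 3],
    ];

    $length = strlen($s);
    $i = 0;
    while ($i < $length) {
        $first = ord($s[$i]);
        if ($first <= 0x7F) {
            $i++; // Single-byte (ASCII) sequence.
            continue;
        }
        $row = null;
        foreach ($rows as $candidate) {
            if ($first >= $candidate[0] && $first <= $candidate[1]) {
                $row = $candidate;
                break;
            }
        }
        if ($row === null) {
            return $i; // Byte can never start a sequence (80..C1, F5..FF).
        }
        [, , $secondMin, $secondMax, $trailing] = $row;
        if ($i + $trailing >= $length) {
            return $i; // Input ends before the sequence is complete.
        }
        for ($k = 1; $k <= $trailing; $k++) {
            $byte = ord($s[$i + $k]);
            $min = $k === 1 ? $secondMin : 0x80;
            $max = $k === 1 ? $secondMax : 0xBF;
            if ($byte < $min || $byte > $max) {
                return $i;
            }
        }
        $i += $trailing + 1;
    }
    return null;
}
```

The restricted second-byte ranges are what reject overlong forms (`E0 80..9F`, `F0 80..8F`), encoded surrogates (`ED A0..BF`), and values above U+10FFFF (`F4 90..BF`).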