Skip to content
Extensive, portable and performant handling of UTF-8 and grapheme clusters for PHP
PHP
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.
class
tests
.gitattributes
README.md
bootup.utf8.php
composer.json

README.md

Patchwork UTF-8

Patchwork UTF-8 provides both :

  • a portability layer for Unicode handling in PHP, and
  • a class that mirrors the quasi complete set of native string functions, enhanced to UTF-8 grapheme clusters awareness.

It can also serve as a documentation source referencing the practical problems that arise when handling UTF-8 in PHP: Unicode concepts, related algorithms, bugs in PHP core, workarounds, etc.

Portability

Unicode handling in PHP is best performed using a combo of mbstring, iconv, intl and pcre with the u flag enabled. But when an application is expected to run on many servers, you should be aware that these 4 extensions are not always enabled.

Patchwork UTF-8 provides pure PHP implementations for 3 of those 4 extensions. Here is the set of portability-fallbacks that are currently implemented:

  • utf8_encode, utf8_decode,
  • mbstring: mb_convert_encoding, mb_decode_mimeheader, mb_encode_mimeheader, mb_convert_case, mb_internal_encoding, mb_list_encodings, mb_strlen, mb_strpos, mb_strrpos, mb_strtolower, mb_strtoupper, mb_substitute_character, mb_substr, mb_stripos, mb_stristr, mb_strrchr, mb_strrichr, mb_strripos, mb_strstr,
  • iconv: iconv, iconv_mime_decode, iconv_mime_decode_headers, iconv_get_encoding, iconv_set_encoding, iconv_mime_encode, ob_iconv_handler, iconv_strlen, iconv_strpos, iconv_strrpos, iconv_substr,
  • intl: Normalizer, grapheme_extract, grapheme_stripos, grapheme_stristr, grapheme_strlen, grapheme_strpos, grapheme_strripos, grapheme_strrpos, grapheme_strstr, grapheme_substr.

pcre compiled with unicode support is currently required.

Patchwork\Utf8

Grapheme clusters should always be considered when working with generic Unicode strings. The Patchwork\Utf8 class implements the quasi-complete set of native string functions that need UTF-8 grapheme clusters awareness. Function names, arguments and behavior carefully replicates native PHP string functions so that usage is very easy.

Some more functions are also provided to help handling UTF-8 strings:

  • isUtf8(): checks if a string contains well formed UTF-8 data,
  • toAscii(): generic UTF-8 to ASCII transliteration,
  • strtocasefold(): unicode transformation for caseless matching,
  • strtonatfold(): generic case sensitive transformation for collation matching

Mirrored string functions are: strlen, substr, strpos, stripos, strrpos, strripos, strstr, stristr, strrchr, strrichr, strtolower, strtoupper, wordwrap, chr, count_chars, ltrim, ord, rtrim, trim, str_ireplace, str_pad, str_shuffle, str_split, str_word_count, strcmp, strnatcmp, strcasecmp, strnatcasecmp, strncasecmp, strncmp, strcspn, strpbrk, strrev, strspn, strtr, substr_compare, substr_count, substr_replace, ucfirst, lcfirst, ucwords, number_format, utf8_encode, utf8_decode. Missing are printf-family functions.

Usage

Including the bootup.utf8.php file is the easiest way to enable the portability layer and configure PHP for an UTF-8 aware and portable application.

Classes are named following PSR-0 autoloader interoperability recommandations, so other loading scheme are easy to implement.

The Patchwork\Utf8 class exposes its features through static methods. Just add a use Patchwork\Utf8 as u; at the beginning of your files, then when UTF-8 awareness is required, prefix the string function by u::: echo strlen("déjà"); may become echo u::strlen("déjà"); eg.

Just run phpunit in the tests/ directory to see the code in action.

Make sure that you are confident about using UTF-8 by reading Character Sets / Character Encoding Issues and Handling UTF-8 with PHP.

You should also get familar with the concept of Unicode Normalization and Grapheme Clusters.

Do not blindly replace all use of PHP's string functions. Most of the time you will not need to, and you will be introducing a significant performance overhead to your application.

Screen your input on the outer perimeter so that only well formed UTF-8 pass through. When dealing with badly formed UTF-8, you should not try to fix it. Instead, consider it as ISO-8859-1 and use utf8_encode() to get an UTF-8 string. Don't forget also to choose one unicode normalization form and stick to it. NFC is the most in use today.

This library is orthogonal to mbstring.func_overload and will not work if the php.ini setting is enabled.

Licensing

Patchwork\Utf8 is free software; you can redistribute it and/or modify it under the terms of the (at your option):

Unicode handling requires tedious work to be implemented and maintained on the long run. As such, contributions such as unit tests, bug reports, comments or patches licensed under both licenses are really welcomed.

I hope many projects could adopt this code and together help solve the unicode subject for PHP.

Something went wrong with that request. Please try again.