New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Unicode Supplementary Characters support in Joomla! #7173
Conversation
Add utf8mb4 connection charset support in MySQL, MySQLi and PDOMySQL drivers
Use utf8mb4 if available when creating a database or updating its character set using the JDatabaseDriver methods.
Add JDatabaseDriver method to get the connection collation
Display the connection collation in the System Information core component. This will come in handy should we need to debug character set issues.
Added alterTableCharacterSet method to JDatabaseDriver to allow Joomla! to transparently convert tables to utf8/utf8mb4
Added getName and getServerType methods to JDatabaseDriver to let the CMS (and its extensions) figure out the type of the database driver in use.
Extensions installer: downgrade utf8mb4 to utf8 when there's no multibyte support in the database server
Joomla! web installer: downgrade utf8mb4 to utf8 when there's no multibyte support in the database server
Update installation schema creation SQL
Create update SQL to convert tables to utf8mb4
When utf8mb4 is not supported do not run the CONVERT TO CHARACTER SET database queries (they have no effect, just waste time)
Fixing the update SQL file to work with the database fix code
Fixing the update SQL file to work with the database fix code
Make the Database Fix feature really convert an existing database to utf8mb4. The kicker? It's not required in any installation or upgrade scenario. It's only required to let people test the pull request of this feature. [sigh]
Strip 4-byte characters when the server doesn't support multibyte unicode characters
JFilterInput should work even when a MySQL connection is not available.
@@ -1,30 +0,0 @@ | |||
<?xml version="1.0" encoding="UTF-8" ?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we needed this for people upgrading from 3.4 to 3.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually 3.3 and up. It auto-removes if you do an update in the CMS and the component isn't present. Probably what happened here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
New language string seemed odd but after reading the mysql docs I understand it and it makes sense so good to go for the new string |
…itted the removal that shouldn't have happened. Story of my life.
``` FILE: ...ome/travis/build/joomla/joomla-cms/libraries/joomla/database/driver.php -------------------------------------------------------------------------------- FOUND 4 ERROR(S) AFFECTING 4 LINE(S) -------------------------------------------------------------------------------- 738 | ERROR | Concat operator must be followed by one space 774 | ERROR | Cast statements must be followed by a single space; expected | | "(object) $column" but found "(object)$column" 1085 | ERROR | Return comment requires a blank newline after it 1105 | ERROR | Return comment requires a blank newline after it -------------------------------------------------------------------------------- UPGRADE TO PHP_CODESNIFFER 2.0 TO FIX ERRORS AUTOMATICALLY -------------------------------------------------------------------------------- ```
``` FILE: ...vis/build/joomla/joomla-cms/libraries/joomla/database/driver/mysqli.php -------------------------------------------------------------------------------- FOUND 1 ERROR(S) AFFECTING 1 LINE(S) -------------------------------------------------------------------------------- 941 | ERROR | Return comment requires a blank newline after it -------------------------------------------------------------------------------- UPGRADE TO PHP_CODESNIFFER 2.0 TO FIX ERRORS AUTOMATICALLY -------------------------------------------------------------------------------- ```
``` FILE: ...avis/build/joomla/joomla-cms/libraries/joomla/database/driver/mysql.php -------------------------------------------------------------------------------- FOUND 1 ERROR(S) AFFECTING 1 LINE(S) -------------------------------------------------------------------------------- 500 | ERROR | Return comment requires a blank newline after it -------------------------------------------------------------------------------- UPGRADE TO PHP_CODESNIFFER 2.0 TO FIX ERRORS AUTOMATICALLY -------------------------------------------------------------------------------- ```
This PR produces for Travis a PHP Error in the unit tests?
BTW here is a Travis CS Issues fix PR 😄 https://github.com/nikosdion/joomla-cms/pull/9 Thanks 😄 |
Fix Travis Codestyle with CMS: #7173
Modify the fake JDatabaseDriverNosql driver used in unit tests to conform to the new JDatabaseDriver base class
The latest commit should fix that issue. |
More unit tests to amend yet :( |
Fixed a problem with JFilterInput and arrays.
OK, the last notice from phpcs is fixed. |
Now that the Unit Tests are all green could some people please test it on their servers? I've tried it with PHP 5.4, 5.5 and 5.6 and MySQL 5.1 and 5.5 on Mac OS X and Linux. Any other environment will be highly appreciated. |
After following the instructions and using Unicode Supplementary Characters Test Data from http://www.i18nguy.com/unicode/supplementary-test.html#utf8 I can confirm that the issue is resolved. Tested with php 5.5.23 and 5.3.29 and mysql 5.5.42 on OSX This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7173. |
Test Ok, now we are talking ... an article with a SMILING FACE on title and alias!! Tested on php 5.5.9 and MySQL 5.5.43 on Linux Mint 17.1 This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7173. |
@wilsonge Since we have two successful tests can we please set it RTC for the 3.5 branch? |
Setting RTC This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7173. |
Merged with 04f09d5 |
There are VERY important reasons for this PR do not blindly revert |
@brianteeman see: #8268 I'm not reverting I'm fixing but there is still one open point ;) |
@nikosdion The changing of the alias to 191 chars is causing all sort of issues around the place. After @zero-24 lost his PR due to a mistake rebasing it we merged in manually everything except his alias changes and @roland-d has redone that in #8472 however there have been all sorts of issues raised in that PR. I've been rereading your description of by the 191 character change was made but what I'm not clear on is that does this mean we are going to loose existing characters in alias' in sites etc? If you could reply on #8472 it would be hugely appreciated :) |
Byte count. 255 quad octets are larger than what MySQL allows plain and simple. If you increase that limit stuff breaks. If you don't bc breaks. We can't change how MySQL works. I explained everything in my PR months ago. |
Pinging @wilsonge and @Bakual about this.
Executive summary
Joomla! uses the UTF-8 character set to acquire, process and store information in the database. Due to the way MySQL 5.5.2 and earlier versions handle UTF-8 any Unicode character requiring 4 bytes for storage leads to text gone missing. This is both annoying (loss of information) and potentially dangerous (see Unicode security).
This PR addresses the issue in two ways:
Backwards compatibility
This PR is fully backwards compatible. Data is migrated automatically on upgrade. Existing extensions (core and third party) don't have to change anything in their code, everything is handled at the database lever.
However, if you have written your own custom database driver you MUST implement utf8mb4 support. If you don't there is no actual problem: your code will STILL work fine. The 4-byte Unicode characters will be stripped off. This is NOT a backwards compatibility break. Before this PR a 4-byte Unicode character would lead to loss of data. After the PR the 4-byte character is replaced with the Unicode "uncertainty" character and there is no loss of data after it. Therefore even if you don't bother upgrading your database driver this PR will still guard you against data loss and potential security implications. You're welcome.
Translation impact
None (not applicable)
Test instructions
Reproducing the issue
Title ???? and it's gone!
This part of the text works. ???? The devil lies in the details.
IMPORTANT Make sure you enter the text above using an actual Emoji, NOT a picture reference, i.e. do NOT copy-paste the text from GitHub
Title
and the body text only readsThis part of the text works.
. That's because MySQL eats up the 4-byte Unicode characters using the utf8 encoding :(Installation
Database Connection Collation
Alternatively just install a new site using the ZIP file of the PR's branch
Verify the connection
If it reads "utf8_general_ci" please confirm that the Database Version reported is less than 5.5.3. If this is the case please skip to "Testing on MySQL 5.5.2 or earlier (including 5.1)"
Testing under PHP 5.5.3 and later
If you get a question mark instead of the chinese symbol and the Emoji you have forgotten to go through the Extensions, Extension Manager, Database, Fix routine. Still, even though you screwed up this PR made sure you don't get truncated text therefore mitigating the issue.
If, however, you got either a blank rectangle or a question mark inside a hollow diamond this means that either your PHP version is very old or the PHP mysql, mysqli, mysqlnd or PDO driver is linked against a very old MySQL client library. In this case please consult "Testing on MySQL 5.5.2 or earlier (including 5.1)". That's a limitation of PHP but STILL you get proper mitigation of the issue.
Testing on MySQL 5.5.2 or earlier (including 5.1)
Notes and technical information
We try to guess if MySQL supports utf8mb4 by first looking at the driver version. If the PHP driver (mysql, mysqli, mysqlnd, PDO) doesn't support utf8mb4 we drop back to utf8. If the driver supports utf8mb4 we try to connect using the utf8mb4 client encoding. If that fails we retry connecting with the utf8 encoding and mark the driver as NOT supporting utf8mb4. This is normal and expected when you are using MySQL server 5.5.2 or earlier (commonly: 5.1.x) with a relatively new PHP version.
JInputFilter now has a
$stripUSC
parameter which tells it whether it should strip 4-byte Unicode characters. If this is not set, or set to -1, JInputFilter will get a reference to the database object through JFactory and ask it if it supports utf8mb4. If it doesn't the$stripUSC
flag is set to 1 (strip). In this case all 4-byte Unicode characters are replaced with the 3-byte Unicode "uncertainty" character which renders either as a small blank rectangle or as a question mark inside a hollow diamond. This prevents the MySQL issue of losing all text after such a character and its potential security implications.Upon upgrade all core tables are upgraded to utf8mb4 by virtue of the 3.5.0-2015-01-01.sql script. DO NOT CHANGE THE NAME OF THIS SCRIPT!!! The name HAD to be hard-coded in com_installer's Database, Fix action to allow this feature to handle upgrade to utf8mb4. Unfortuantely there is no simple way to either list core tables (without having a hardcoded list) or tell the default character set of a table. Therefore the only way to ensure the database is up-to-date –especially when transferring a site from a MySQL 5.1 server to a MySQL 5.5.3+ server– is always running this conversion script at the end of the database fix.
All CREATE TABLE and ALTER TABLE queries MUST be written using the utf8mb4 character set and encodings. If Joomla! detects that the database does not support utf8mb4 it will automatically downgrade them to utf8. This simplifies installation.
Should you need to upgrade a table from PPH code you can use
JFactory::getDbo()->alterTableCharacterSet('#__your_table')
.Finally, the JDatabaseDriver now has two methods getName and getServerType to get the name of the driver (works like $db->name) and the server family type. The latter is used to determine whether we're running under MySQL.
Special considerations for third party developers
This section is meant for inclusion in our documentation
The following applies only on MySQL databases. Other database servers are not affected.
utf8 vs utf8mb4
Joomla! up to 3.4 uses the "utf8" character set for its database. This means that MySQL will use 1 to 3 bytes to store each Unicode character. The problem is that certain 4-byte Unicode characters such as Emoji, extended Chinese and others –collectively called Unicode Supplementary Characters– cannot be expressed under this scheme. Instead of simply dropping the offending character, MySQL will drop not just this character but all the text that follows it. This is both annoying (data loss) and a potential security issue (depending on how data is rendered by your component).
Joomla! 3.5.0 onwards uses the backwards-compatible utf8mb4 encoding when available, namely when Joomla! is installed on a server using MySQL 5.5.3 or later and an up-to-date version of the mysql, mysqli or mysqlng PHP extension. On top of that, Joomla! will filter out 4-byte characters when they cannot be stored in the database, therefore preventing this data loss and potential security issue.
From your PHP code's point of view you do not have to make any change whatsoever in your PHP code to support it. Joomla! does all the hard work for you! You SHOULD, however, make changes to your extensions' schema installation (.sql) files to support storing 4-byte long Unicode Supplementary Characters.
Writing your schema files to support extended characters
Your schema files MUST ALWAYS use the utf8mb4 character set when creating or changing tables (CREATE TABLE and ALTER TABLE queries). Joomla! will figure out at extension installation / update time whether the MySQL server supports utf8mb4 or not. If it doesn't, Joomla! will automatically convert your query to use the old utf8 character set and the respective utf8_* collations instead of the utf8mb4 character set and utf8mb4_* collations in your .sql file.
Example:
As you can see, everything is set up to use utf8mb4. If a user tries to install your extension on a server using MySQL 5.1 Joomla! will automatically rewrite the query above as:
You DO NOT and MUST NOT write a query using utf8 manually. Let Joomla! do it for you!
If you do not use utf8mb4 in your tables, Joomla! is running on a MySQL server supporting utf8mb4 and the user enters a Unicode Supplementary Character no real harm is done (since Joomla! 3.5.0): the character is automatically replaced by MySQL with a question mark (?).
Updating your existing tables
When a user updates their site to Joomla! 3.5.0 only the core database tables will be converted to utf8mb4. The update will not convert third party extensions' tables. Likewise, the "mysql" schema installation scripts are not automatically converted to utf8mb4. This is by design and has to do with MySQL's key length.
If you are using InnoDB tables the maximum size for an index / key is 767 bytes. When using the old "utf8" character set each character of a CHAR or VARCHAR column counts as 3 bytes, whereas under the utf8mb4 character set each character counts as 4 bytes. This means that your old indices will probably be too big.
A practical example can be found in the
#__menu
core table. Before Joomla! 3.5 there was this index:Client ID is a tinyint (1 byte), parent_id an int (4 bytes), language a 7-character CHAR column (21 bytes under utf8, 28 under utf8mb4) and alias is a 255 character VARCHAR (765 bytes under utf8, 1020 bytes under utf8mb4). As you can understand, the index becomes too long for MySQL to handle under utf8mb4.
There is a workaround to that. We can tell MySQL to use the first n bytes of a column for the index. In our case, we can tell MySQL to use the first 191 characters of the alias column, therefore reserving 191 characters x 4 bytes per character = 764 bytes which is less than the 767 bytes limit of InnoDB. You can also see why trying to use 192 characters would fail: 192 characters x 4 bytes per character = 768 bytes which is one byte too long for MySQL's index limit.
In order to fix the index we need to drop and create it again in a single query:
Please keep in mind that the DROP KEY and ADD KEY should be on the same query, separated by a comma.
After having resized the index we need to convert any columns which do not use the same collation as the rest of the table. In
#__menu
's case there's thealias
column which uses the utf8_bin collation. We have to convert it to the utf8mb4_bin collation instead. It's a simple ALTER TABLE command:Finally, we will have to convert the table itself to utf8mb4 encoding.
IMPORTANT: Do NOT try to convert the table before resizing the indices, it will lead to an error!
The query to convert a table to utf8mb4 is as straightforward as it gets:
So, the complete SQL script to convert the
#__menu
table is:Now you might think, but what about my users who are NOT using utf8mb4? As we already saw, Joomla! is smart enough to figure this out for you. It automatically converts the queries referencing the utf8mb4 character set and collations to utf8. So you do not need to worry about it, Joomla! will sort everything out for you as long as you use utf8mb4 in your .sql files.
If you have written a custom MySQL database driver
You SHOULD make small changes to your database driver unless you are extending from
JDatabaseDriverMysqli
orJDatabaseDriverMysql
AND you have not overridden thesetUTF
method. These changes are not necessary, but highly recommended.First you need to do:
in your
__construct
method.Secondly you need to copy the
setUTF
method fromJDatabaseDriverMysqli
orJDatabaseDriverMysql
. This method determines if utf8mb4 is supported by the database server. If you fail to do that you might end up with broken text.