Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selecting from JSON columns does not decode them from UTF-8 under mysql_enable_utf8/mb4 #309

Open
Grinnz opened this issue Mar 8, 2019 · 9 comments
Labels
bug utf8 Unicode and UTF-8 handling

Comments

@Grinnz
Copy link
Contributor

Grinnz commented Mar 8, 2019

I don't have a new enough MySQL handy, but in a bug report for Mojo::mysql it was indicated with a test case that when selecting from a JSON type column, the output is not UTF-8 decoded (jhthorsen/mojo-mysql#57 (comment)). MySQL documents that when JSON is interacted with as a string it is treated as the utf8mb4 charset so it should be returned that way, the same way it is passed as a utf8mb4 string.

Confusing the issue is DBD::mysql's misuse of the internal encoding of strings, but it's possible to avoid that confusion by ensuring the passed JSON string is utf8::upgraded or constructing the JSON within the query from bytes.

The issue should be reproducible by inserting into a JSON type column with a query like so:

INSERT INTO unicode_json_test SET my_json = JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4));

then selecting that column. JSON and encoding is a tricky subject because it is commonly represented as UTF-8 encoded bytes rather than text, but as MySQL treats it on input and output as a character string, it should be decoded as the input string is encoded; this also is what happens in DBD::Pg.

@dveeden
Copy link
Collaborator

dveeden commented Mar 8, 2019

#!/usr/bin/perl
use v5.28.0;
use strict;
use warnings;
use DBI;

my $dsn = "DBI:mysql:test;host=127.0.0.1;port=8013;mysql_ssl=1";
my $dbh = DBI->connect($dsn, "msandbox", "msandbox");

$dbh->do("DROP TABLE IF EXISTS unicode_json_test");
$dbh->do("CREATE TABLE unicode_json_test (id int auto_increment primary key, my_json JSON)");
$dbh->do("INSERT INTO unicode_json_test SET my_json = JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4))");
my $result = $dbh->selectcol_arrayref("SELECT my_json FROM unicode_json_test")->[0];
print $result . "\n";

say join " ", map { sprintf '%X', ord } split //, $result;

$dbh->disconnect();

I use dbdeployer to test with multiple versions of MySQL and MariaDB.

The above code for me returns:

{"drink": "Szőlőlé"}
7B 22 64 72 69 6E 6B 22 3A 20 22 53 7A C5 91 6C C5 91 6C C3 A9 22 7D

This is with MySQL 8.0.13 and perl v5.28.1
Is this also what you got?
As far as I know the JSON standard dictates that it is not JSON if it is not UTF-8.

@pali
Copy link
Member

pali commented Mar 8, 2019

As far as I know the JSON standard dictates that it is not JSON if it is not UTF-8.

Which JSON standard? There are at least 5 JSON standards which are with each other incompatible.
So I would not be surprised that one of the JSON standards have above requirement. Apparently I know two JSON standards which allows any of following encoding UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE :-)

So everybody who is going to support JSON, should exactly specify which JSON (according to which standard) supports...

@Grinnz
Copy link
Contributor Author

Grinnz commented Mar 8, 2019

The above code for me returns:

{"drink": "Szőlőlé"}
7B 22 64 72 69 6E 6B 22 3A 20 22 53 7A C5 91 6C C5 91 6C C3 A9 22 7D

This is the incorrect response. It should be decoded from UTF-8.

@Grinnz
Copy link
Contributor Author

Grinnz commented Mar 8, 2019

As I mentioned, it is a tricky problem because JSON as a standard implies the UTF-8 encoding. But in the case where the data interchange format also defines a charset, the JSON encoding and decoding should not also encode/decode the charset or you will double encode the data, and the data stored in the binary format will end up being the UTF-8 bytes instead of the correct JSON data. This is how it works already when you pass a JSON string as a parameter: it should not be encoded, since DBD::mysql will encode it to UTF-8 (wrongly, but that is the intention). So on retrieval, it should have been decoded as a text response from UTF-8, and the application then decodes it from JSON without decoding the charset again.

As mentioned, the correct behavior is the behavior of DBD::Pg with json and jsonb columns, and in another test it seems to be the case if you select the output of a JSON_OBJECT() function call or similar; but not when selecting a JSON column from MySQL.

As for the JSON standard dictating UTF-8, MySQL already adheres to this by the nature of always interpreting strings as utf8mb4 in the JSON context.

@Tekki
Copy link

Tekki commented Mar 22, 2019

@dveeden We have a small test at jhthorsen/mojo-mysql#57 (comment) followed by some discussion. I'd appreciate to hear your opinion about that.

@bes-internal
Copy link

bes-internal commented Apr 15, 2019

use utf8;

require DBI;
$dsn = "DBI:mysql:database=$dbname;host=$dbhost;port=$dbport;";
$dbh = DBI->connect($dsn,$dbuser,$dbpass, {AutoCommit => 1, mysql_enable_utf8mb4 => 1});

$dbh->do("DROP TABLE IF EXISTS unicode_json_test");
$dbh->do("CREATE TABLE unicode_json_test (id int auto_increment primary key, my_str VARCHAR(255), my_json JSON)");
$dbh->do("INSERT INTO unicode_json_test SET my_str=JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4)), my_json = JSON_OBJECT('drink', CONVERT(UNHEX('537AC5916CC5916CC3A9') USING utf8mb4))");
($str, $json) = $dbh->selectrow_array("SELECT my_str, my_json FROM unicode_json_test");
$out .= utf8::is_utf8($str) || "no";
$out .= "\n";
# # 1
$out .= utf8::is_utf8($json) || "no";
$out .= "\n";
# # no
$out .= join " ", map { sprintf '%X', ord } split //, $str;
$out .= "\n";
# # 7B 22 64 72 69 6E 6B 22 3A 20 22 53 7A 151 6C 151 6C E9 22 7D
$out .= join " ", map { sprintf '%X', ord } split //, $json;
$out .= "\n";
# # 7B 22 64 72 69 6E 6B 22 3A 20 22 53 7A C5 91 6C C5 91 6C C3 A9 22 7D

print $out;

# same for utf8

from https://metacpan.org/pod/DBD::mysql#mysql_enable_utf8 :

When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string

Iis not clear why json data type is different from any other text types. This is obviously a bug in DBD::mysql and needs to be fixed.

@Grinnz
Copy link
Contributor Author

Grinnz commented Jun 7, 2019

FYI: I have opened a bug against MySQL here, but I still have no way to reproduce or test it myself. https://bugs.mysql.com/bug.php?id=95698

@dveeden dveeden removed their assignment Jan 12, 2022
@yahermann
Copy link

yahermann commented Jan 15, 2022

Hello folks, I'm stumbling into this bug now. I posted it as a question on Stack Overflow before finding this discussion. My post includes a self-contained unit test that replicates/illustrates the problem:

https://stackoverflow.com/questions/70724642/utf-8-decoding-when-reading-a-mysql-column-of-type-json-vs-text-using-perl-d

My question is, while waiting for a fix, does anyone have any recommendations or suggestions for how to work around it in a way that is backward-compatible to a possible eventual fix to DBD::Mysql. I wouldn't want to add my own decode after retrieving a JSON column, only to have some later DBD::Mysql also decode.

Thanks!

@cmanley
Copy link

cmanley commented Sep 14, 2023

You need to specify your connection encoding to mysql server.
If your database is called test and you expect utf8 strings in your client app, then use this dsn:
dbi:mysql(mysql_enable_utf8mb4=>1):database=test;host=localhost
This dsn won't work b.t.w., I think it's a bug:
dbi:mysql:database=test;host=localhost;mysql_enable_utf8mb4=1

Then in a Perl script, dump these variables so that you can see what's going on:

			$q->print('<table border="1" style="border-collapse: collapse; margin-top: 1em"><thead><tr><th colspan="2">DBD::mysql version ' . $DBD::mysql::VERSION . ' (DBI version ' . $DBI::VERSION . ")</th></tr></thead><tbody>\n");
			foreach my $k (
				'mysql_client_found_rows',
				'mysql_hostinfo',
				'mysql_info',
				'mysql_protoinfo',
				'mysql_serverinfo',
				'mysql_server_prepare',
				'mysql_stat',
				'mysql_serverversion',
				'mysql_enable_utf8',
				'mysql_enable_utf8mb4',
			) {
				my $v = $dbh->{$k} // '';
				$q->print('<tr><td>' . HTML::Entities::encode($k) . '</td><td>' . HTML::Entities::encode($v) . "</td></tr>\n");
			}
			$q->print("</tbody></table>\n");

and these variables as well....

			$q->print("<table border='1' style='border-collapse: collapse; margin-top: 1em'><thead><tr><th colspan='2'>MySQL variables</th></tr></thead><tbody>\n");
			tie(my %map, 'Tie::IxHash');
			foreach my $what (
				'character_set',
				'collation',
				'version',
			) {
				my $sth = $dbh->prepare('SHOW VARIABLES LIKE "' . $what . '%"');
				$sth->execute();
				while (my ($k, $v) = $sth->fetchrow_array()) {
					if ($k eq 'version_source_revision') {
						# nop
					}
					else {
						$map{$k} = $v;
					}
				}
				$sth->finish();
			}
			if (my ($k,$v) = $dbh->selectrow_array('SHOW STATUS LIKE "Uptime"')) {
				$map{$k} = $v;
			}
			if (my $v = $dbh->selectrow_array('SELECT @@wait_timeout')) {
				$map{'wait_timeout'} = $v;
			}
			foreach my $k (keys %map) {
				my $v = $map{$k};
				$q->print('<tr><td>' . HTML::Entities::encode($k) . '</td><td>' . HTML::Entities::encode($v) . "</td></tr>\n");
			}
			$q->print("</tbody></table>\n");

then select a single character and check how it ends up in your application:

			#SELECT CHAR(0xC3BC USING utf8mb4), LENGTH(CHAR(0xC3BC USING utf8mb4)), CHAR_LENGTH(CHAR(0xC3BC USING utf8mb4));
			#+----------------------------+------------------------------------+-----------------------------------------+
			#| CHAR(0xC3BC USING utf8mb4) | LENGTH(CHAR(0xC3BC USING utf8mb4)) | CHAR_LENGTH(CHAR(0xC3BC USING utf8mb4)) |
			#+----------------------------+------------------------------------+-----------------------------------------+
			#| ü                          |                                  2 |                                       1 |
			#+----------------------------+------------------------------------+-----------------------------------------+
			tie(my %char_hex_to_encoding, 'Tie::IxHash',
				'0xE282AC' => 'utf8mb4', # €
				'0x80'     => 'latin1',  # €
				'0xC3BC'   => 'utf8mb4', # ü
				'0xFC'     => 'latin1',  # ü
				'0x40'     => 'utf8mb4', # @
				'0x40'     => 'latin1',  # @
			);
			foreach my $hex (keys %char_hex_to_encoding) {
				my $encoding = $char_hex_to_encoding{$hex};
				my $sql = "SELECT CHAR($hex USING $encoding), LENGTH(CHAR($hex USING $encoding)), CHAR_LENGTH(CHAR($hex USING $encoding))";
				$q->print('<table border="1" style="border-collapse: collapse; margin-top: 1em"><thead><tr><th colspan="2">' . HTML::Entities::encode($sql) . "</th></tr></thead><tbody>\n");
				my ($char, $byte_length, $char_length) = $dbh->selectrow_array($sql);
				$q->print('<tr><td>character raw</td><td>' . $char . "</td></tr>\n");
				$q->print('<tr><td>character HTML encoded</td><td>' . HTML::Entities::encode($char) . "</td></tr>\n");
				$q->print('<tr><td>is_utf8?</td><td>' . int(Encode::is_utf8($char)) . "</td></tr>\n");
				$q->print('<tr><td>mysql byte length</td><td>' . $byte_length . "</td></tr>\n");
				$q->print('<tr><td>mysql char length</td><td>' . $char_length . "</td></tr>\n");
				#require bytes;
				$q->print('<tr><td>perl byte length</td><td>' . bytes::length($char) . "</td></tr>\n");
				$q->print('<tr><td>perl char length</td><td>' . length($char) . "</td></tr>\n");
				$q->print("</tbody></table>\n");
			}

Make sure that the character and byte lengths make sense depending on your database and connection encoding and check if the utf8 flag is set when appropriate. That's how you can debug your issues.

@dveeden dveeden added the utf8 Unicode and UTF-8 handling label Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug utf8 Unicode and UTF-8 handling
Projects
None yet
Development

No branches or pull requests

7 participants