Fix support for mysql_enable_utf8 and mysql_enable_utf8mb4 #67

pali · 2016-10-22T15:04:41Z

WARNING: This pull request changes the behavior of the mysql_enable_utf8 and mysql_enable_utf8mb4 attributes. The new behavior is documented in the POD section. The changes should be properly reviewed and tested, as in some situations they can be backward incompatible...

People affected by UTF-8 bugs in DBD::mysql should probably test and verify that it really fixes the reported problems. Maybe we should discuss this pull request more.

Reported bugs:
https://rt.cpan.org/Public/Bug/Display.html?id=25590
https://rt.cpan.org/Public/Bug/Display.html?id=53130
https://rt.cpan.org/Public/Bug/Display.html?id=60987
https://rt.cpan.org/Public/Bug/Display.html?id=87428

Important changes:

Fix support for prepared statement numeric storage and DBI SQL bind types
Make DBI, mysql and perl types consistent
Do not UTF-8 encode bound input params which are of DBI SQL BINARY types
Do not UTF-8 decode output fields which are not sent in the utf8 (resp. utf8mb4) charset by the mysql server
Use SvPVutf8() for retrieving a UTF-8 encoded char * because SvPV() can return a char * in Latin1
Remove #ifdef checks for sv_utf8_decode and SvUTF8 as they are not needed anymore

Part of this pull request is @davel's (Dave Lambley) test for UTF-8 params from davel/DBD-mysql@8695068, referenced by https://rt.cpan.org/Public/Bug/Display.html?id=60987. I slightly modified it, but it is still a great test case for scenarios with a different table charset, a different session charset and a different state of mysql_enable_utf8.

Prepared statements now use buffers only as long as necessary for integer
types (not always 64-bit anymore). The DBI SQL types SQL_BOOLEAN and
SQL_TINYINT now handle 8-bit integers, SQL_SMALLINT 16-bit integers,
SQL_INTEGER 32-bit integers and SQL_BIGINT 64-bit integers. SQL_FLOAT uses
native floats and SQL_DOUBLE or SQL_REAL native doubles.

When prepared statements are enabled and explicit storage of numeric
values (different integer and floating point sizes) is requested via DBI
SQL types, conversion to the storage type is now done by the DBD::mysql
driver and not by the mysql server. This means that overflow/underflow is
not detected by the mysql server and the application needs to handle it.
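Since range checking is now left to the application, a caller may want to validate values before binding them with a narrow integer type. The following is only a sketch under stated assumptions: the helper name `fits_sql_int` is hypothetical, the type names are used as plain strings rather than DBI constants, and the ranges assume signed columns (an unsigned column would need different limits).

```perl
use strict;
use warnings;

# Signed ranges for the integer widths named above. Treating the column as
# signed is an assumption; adjust the limits for unsigned columns.
my %SIGNED_RANGE = (
    SQL_TINYINT  => [ -128,        127 ],
    SQL_SMALLINT => [ -32768,      32767 ],
    SQL_INTEGER  => [ -2147483648, 2147483647 ],
);

# Hypothetical helper: returns 1 when $value is a plain integer that fits
# the named type, 0 otherwise.
sub fits_sql_int {
    my ($value, $type_name) = @_;
    my $range = $SIGNED_RANGE{$type_name}
        or die "unknown type $type_name";
    return 0 unless defined $value && $value =~ /\A-?[0-9]+\z/;
    return ($value >= $range->[0] && $value <= $range->[1]) ? 1 : 0;
}
```

A caller could run this check before `bind_param` and fall back to an untyped bind, or raise its own error, when it returns 0.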

SQL_BOOLEAN is treated as a numeric type and SQL_DECIMAL is handled by mysql
as a string type (due to decimal precision). SQL_BIT, SQL_BLOB, SQL_BINARY,
SQL_VARBINARY and SQL_LONGVARBINARY are treated as perl binary types.

MYSQL_TYPE_BIT, MYSQL_TYPE_MEDIUM_BLOB and MYSQL_TYPE_LONG_BLOB are also
treated as binary perl types. MYSQL_TYPE_LONGLONG is a numeric perl type on
platforms where perl has 64-bit integers and a string perl type on others.
MYSQL_TYPE_NULL is now perl's undef.

For each fetched field the mysql server also tells us the charset id. Before
this commit, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded
all fields with a charset id other than 63 (which means binary).

Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or
utf8mb4. By default the mysql server sends data in the encoding specified
by the SET NAMES command, which defaults to Latin1. So received Latin1 data
is not UTF-8 decoded anymore.
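The decoding rule can be illustrated at the Perl level. This is a minimal sketch, not the driver's C code: `decode_field` and `%UTF8_CHARSETNR` are hypothetical names, and the table holds only a small, deliberately incomplete subset of charset ids (the real list is much longer).

```perl
use strict;
use warnings;
use Encode ();

# Illustrative, incomplete subset of MySQL charset ids that denote UTF-8.
my %UTF8_CHARSETNR = map { $_ => 1 } (
    33,    # utf8_general_ci
    83,    # utf8_bin
    45,    # utf8mb4_general_ci
    46,    # utf8mb4_bin
);

# Hypothetical helper: decode a fetched field only when the attribute is
# enabled AND the server reported a utf8/utf8mb4 charset id for the field.
sub decode_field {
    my ($octets, $charsetnr, $enable_utf8) = @_;
    return $octets unless $enable_utf8 && $UTF8_CHARSETNR{$charsetnr};
    my $copy = $octets;    # work on a copy, leave the input untouched
    return Encode::decode('UTF-8', $copy);
}
```

With this rule a field reported as latin1 (for example id 8, latin1_swedish_ci) or binary (id 63) comes back as the original octets, while a field reported as id 33 comes back as a character string.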

Before this commit, perl scalars (statements or bind parameters) without the
UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was
enabled. As a result, perl scalars with internal Latin1 encoding were
sent to the mysql server as Latin1 even if mysql_enable_utf8 was enabled.

Now all statements and bind parameters which are not of a DBI binary type
(SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are
automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.
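The point about the UTF8 status flag can be shown without a database. In this sketch the helper `text_param_octets` is a hypothetical stand-in for what the description says happens to a non-binary parameter; it is not the driver's code.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the described rule: a text parameter is
# encoded to UTF-8 octets whatever perl's internal representation is.
sub text_param_octets {
    my ($param) = @_;
    my $copy = $param;      # never modify the caller's scalar
    utf8::encode($copy);    # characters -> UTF-8 octets
    return $copy;
}

my $latin1_internal = "caf\xE9";    # UTF8 flag off, 4 characters
my $upgraded        = "caf\xE9";
utf8::upgrade($upgraded);           # same 4 characters, UTF8 flag on

# Both scalars hold the same characters, so both produce the same octets.
print unpack('H*', text_param_octets($latin1_internal)), "\n";    # 636166c3a9
print unpack('H*', text_param_octets($upgraded)),        "\n";    # 636166c3a9
```

Under the old behavior described above, the first scalar would have gone out as the four bytes `63 61 66 e9` instead.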

If mysql_enable_utf8 is not enabled and a statement or bind parameter
contains a wide Unicode character, then DBD::mysql shows a warning. If a
binary parameter contains a wide Unicode character, then DBD::mysql shows a
warning too. This is similar to the print function without the :utf8 perlio
layer.
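The warn-instead-of-die behavior can be sketched in pure Perl. The helper name `octets_or_warn` and the warning text are hypothetical, and falling back to UTF-8 octets is borrowed from what `print` does on a handle without an encoding layer; what the driver actually sends in that case is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical helper: return octets for a scalar that must be sent as
# bytes, warning (not dying) when it holds characters above 0xFF.
sub octets_or_warn {
    my ($value, $what) = @_;
    my $copy = $value;                 # leave the caller's scalar alone
    if (utf8::downgrade($copy, 1)) {   # second arg: return false, do not die
        return $copy;                  # every character fit into one byte
    }
    warn "Wide character in $what\n";
    utf8::encode($copy);               # assumed fallback: UTF-8 octets
    return $copy;
}
```

A scalar such as `"\xE9"` passes through unchanged and silently, while `"\x{263A}"` triggers the warning.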

Grinnz · 2016-10-22T21:18:47Z

Thank you so much. This fix is sorely needed. I will try to find time to test but one question, should SQL_BLOB also be considered a column type to prevent encoding? It seems to be used mostly by SQLite and several database abstraction tools.

pali · 2016-10-22T21:32:44Z

Looks like SQL_BLOB should not be encoded/decoded either... But which types are considered binary and which are not is probably a question for the DBI (API) developers. I think every DBI driver needs similar logic if it wants to properly support Unicode perl scalars...

…64bit IV

When perl is compiled with a 64-bit IV, use newSVuv() to store 64-bit integers instead of our own int-to-string function. Also rename my_ulonglong2str() to my_ulonglong2sv() so the function name matches what it does.

…s and make consistency between DBI, mysql and perl types

Prepared statements now use buffers only as long as necessary for integer types (not always 64-bit anymore). The DBI SQL types SQL_BOOLEAN and SQL_TINYINT now handle 8-bit integers, SQL_SMALLINT 16-bit integers, SQL_INTEGER 32-bit integers and SQL_BIGINT 64-bit integers. SQL_FLOAT uses native floats and SQL_DOUBLE or SQL_REAL native doubles.

When prepared statements are enabled and explicit storage of numeric values (different integer and floating point sizes) is requested via DBI SQL types, conversion to the storage type is now done by the DBD::mysql driver and not by the mysql server. This means that overflow/underflow is not detected by the mysql server and the application needs to handle it.

SQL_BOOLEAN is treated as a numeric type and SQL_DECIMAL is handled by mysql as a string type (due to decimal precision). SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY are treated as perl binary types.

MYSQL_TYPE_BIT, MYSQL_TYPE_MEDIUM_BLOB and MYSQL_TYPE_LONG_BLOB are also treated as binary perl types. MYSQL_TYPE_LONGLONG is a numeric perl type on platforms where perl has 64-bit integers and a string perl type on others. MYSQL_TYPE_NULL is now perl's undef.

The default mysql type is MYSQL_TYPE_STRING, which can represent any supported mysql value.

… and SvUTF8 Boolean variables enable_utf8 and enable_utf8mb4 are needed for UTF-8 support. Macros sv_utf8_decode and SvUTF8 are part of perl 5.6.0 and DBD::mysql already needs at least perl 5.8.1.

For each fetched field the mysql server also tells us the charset id. Before this commit, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded all fields with a charset id other than 63 (which means binary).

Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or utf8mb4. By default the mysql server sends data in the encoding specified by the SET NAMES command, which defaults to Latin1. So received Latin1 data is not UTF-8 decoded anymore.

The mysql server sends a charset id, not a charset name. Each pair of charset name and collation has its own charset id. The new function charsetnr_is_utf8() hardcodes all utf8 and utf8mb4 charset ids from the mysql (up to 8.0.0) and mariadb (up to 10.2.2) source code. It looks like existing ids have not changed since mysql 5.0; new ones are just added.

It is needed for upcoming UTF-8 support of statement string.

…as C char*

The upcoming UTF-8 support needs the code that extracts a char* from a perl scalar to live in one place: the input functions. The structs imp_sth_ph_st and imp_sth_st store their own copies of the statement and parameters, so these do not disappear after the XS function returns.

… mysql_enable_utf8 is enabled

Before this commit, perl scalars (statements or bind parameters) without the UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was enabled. As a result, perl scalars with internal Latin1 encoding were sent to the mysql server as Latin1 even if mysql_enable_utf8 was enabled.

Now all statements and bind parameters which are not of a DBI binary type (SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.

If mysql_enable_utf8 is not enabled and a statement or bind parameter contains a wide Unicode character, then DBD::mysql shows a warning. If a binary parameter contains a wide Unicode character, then DBD::mysql shows a warning too. This is similar to the print function without the :utf8 perlio layer.

Perl's SvPV() returns a char* from a perl scalar, and a following SvUTF8() call on that scalar tells whether SvPV() returned data in UTF-8 (true) or in Latin1 (false). SvPVutf8() always returns data in UTF-8, but has the side effect of modifying the scalar by upgrading it to UTF-8. To avoid modifying the original scalar we create a new mortal (temporary) one for the modification. Because invariant UTF-8 characters (7-bit ASCII) are exactly the same in Latin1, we do not need to do any encoding when all characters are UTF-8 invariants.

SvPVbyte() first downgrades the scalar to Latin1 and then returns data in Latin1. If the downgrade is not possible, it croaks. So for binary data a manual conversion, which only throws a warning, is used instead of SvPVbyte().

pali · 2016-12-08T23:42:04Z

A big update of this pull request is here! It addresses the problem reported by schmorp in the discussion at https://rt.cpan.org/Public/Bug/Display.html?id=87428

pali · 2016-12-08T23:42:21Z

lib/DBD/mysql.pm

+C<mysql_enable_utf8> attribute state.  They are treated as sequence of octets
+and it is your responsibility to decode them correctly.
+
+B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour


Version XXX needs to be changed to the next release after the merge...

Also please re-check that the newly rewritten documentation is OK... I'm not a native English speaker and documentation is important!

Grinnz · 2016-12-09T00:30:33Z

lib/DBD/mysql.pm

 handle, when creating the statement handle or after it has been created.
 See L</"STATEMENT HANDLES">.

 =item mysql_enable_utf8


Easier to repaste the whole section with grammar fixes than comment on each part. (Sorry but difficult to fix the line lengths in github's editor.) This documentation section is greatly improved from the released version btw!

=item mysql_enable_utf8

This attribute affects input data from DBI (statement and bind parameters) and output data from the mysql server. If used as a part of the call to C<connect()> then it issues the command C<SET NAMES utf8>.

When set, any statement or bind parameter which is not of binary type is automatically encoded to UTF-8 octets before being sent to the MySQL server. Any retrieved MySQL data with a charset of C<utf8> or C<utf8mb4> from a textual column type (char, varchar, etc) is automatically UTF-8 decoded and returned as a perl Unicode scalar (with SvUTF8 flag on). That enables character semantics on those retrieved UTF-8 strings.

The MySQL charset of a retrieved value is affected by the last C<SET NAMES> command and also could be affected by the database, table and column configuration. For more information, see the I<Character Set Support> chapter in the MySQL manual: L<http://dev.mysql.com/doc/refman/5.7/en/charset.html>

When unset and a statement or bind parameter contains a wide Unicode character then DBD::mysql gives the warning C<Wide character in ... but mysql_enable_utf8 not set>. The MySQL protocol does not support wide characters and so DBD::mysql does not know how to send a statement with wide characters when C<mysql_enable_utf8> is not set.

Please note that when C<mysql_enable_utf8> is set, the input statement and bind parameters are encoded to UTF-8 octets even if the current MySQL session charset is not C<utf8> or C<utf8mb4>! You are responsible for calling the C<SET NAMES utf8> or C<SET NAMES utf8mb4> command when setting the C<mysql_enable_utf8> attribute after connecting. The same applies to unsetting the C<mysql_enable_utf8> attribute. You are responsible for calling C<SET NAMES latin1> (resp. with correct charset) and then passing perl scalars in the correct encoding. Otherwise strings will be sent to MySQL server incorrectly!

Input bind parameters of binary types (C<SQL_BIT>, C<SQL_BLOB>, C<SQL_BINARY>, C<SQL_VARBINARY> and C<SQL_LONGVARBINARY>) are not touched regardless of the C<mysql_enable_utf8> attribute state. They are treated as a sequence of octets and sent to the MySQL server as is. If that bind parameter contains a wide Unicode character then DBD::mysql gives the warning C<Wide character in binary field ...> because binary data is a sequence of octets, not Unicode characters!

Output data fetched from the MySQL server which does not have a C<utf8> or C<utf8mb4> charset (so also binary data) is not UTF-8 decoded regardless of the C<mysql_enable_utf8> attribute state. They are treated as a sequence of octets and it is your responsibility to decode them correctly.

B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour when the attribute C<mysql_enable_utf8> was enabled! Input statement and bind parameters were never encoded to UTF-8 octets and retrieved columns were always UTF-8 decoded regardless of the column charset (except binary charsets).

=item mysql_enable_utf8mb4

Exactly the same as the attribute C<mysql_enable_utf8>. Additionally if used as a part of the call to C<connect()> then it issues the command C<SET NAMES utf8mb4> instead of C<utf8>.

MySQL's C<utf8mb4> charset is capable of handling 4-byte UTF-8 characters. MySQL's C<utf8> charset is capable of handling only up to 3-byte UTF-8 characters! See MySQL manual for more information: L<http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html>

You should use MySQL's C<utf8mb4> charset instead of C<utf8> to prevent problems with data exchange. When the C<utf8> charset is used then you are responsible for 3-byte UTF-8 sequence checks on input perl scalar strings. Otherwise MySQL server can reject or modify the input statement!
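For readers wondering what the "3-byte UTF-8 sequence check" mentioned at the end of that section might look like on the application side: UTF-8 needs four bytes exactly for code points above U+FFFF, so a check can be written against the character string before it is encoded. A minimal sketch, where `fits_mysql_utf8` is a hypothetical helper name and not part of DBD::mysql:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when every character of $string lies at or
# below U+FFFF, i.e. encodes to at most 3 bytes of UTF-8; returns 0 when a
# character would need a 4-byte sequence.
sub fits_mysql_utf8 {
    my ($string) = @_;
    return $string =~ /[\x{10000}-\x{10FFFF}]/ ? 0 : 1;
}
```

An application that must keep using the C<utf8> charset could reject or replace such characters up front; switching to C<utf8mb4> avoids the need for the check.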

Thank you! Text updated.

…ritten to the database.

…_server_prepare configurations

mbeijen · 2016-12-10T09:36:27Z

This is SUPER awesome! Thanks a lot. I'll put out a new test release on CPAN with it.

Of course testing, eyeballs and comments are welcome.

pali · 2016-12-10T11:05:09Z

lib/DBD/mysql.pm

+C<mysql_enable_utf8> attribute state.  They are treated as a sequence of octets
+and it is your responsibility to decode them correctly.
+
+B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour


Do not forget to set the correct version instead of XXX.

mbeijen · 2016-12-10T11:07:10Z

Sure, thanks for the pointer. I will modify that and amend the change log and then push a new development release out tomorrow.

pali force-pushed the master branch 2 times, most recently from 14515c8 to 75ecd62 Compare October 22, 2016 16:10

pali force-pushed the master branch 2 times, most recently from 1c710d2 to cc9120b Compare October 27, 2016 16:24

pali force-pushed the master branch from 9d6931b to cc9120b Compare November 17, 2016 12:52

mbeijen force-pushed the master branch from f0b7a44 to 59f9b51 Compare November 30, 2016 07:17

pali added 8 commits December 8, 2016 23:41

Always compile enable_utf8 and remove ifdef checks for sv_utf8_decode…

bb964e7

… and SvUTF8 Boolean variables enable_utf8 and enable_utf8mb4 are needed for UTF-8 support. Macros sv_utf8_decode and SvUTF8 are part of perl 5.6.0 and DBD::mysql already needs at least perl 5.8.1.

Use dbd_st_prepare_sv instead dbd_st_prepare

e57dcaa

It is needed for upcoming UTF-8 support of statement string.

Add tests for mysql_enable_utf8 and mysql_enable_utf8mb4

0626173

pali force-pushed the master branch 2 times, most recently from 32a4118 to 99c42e2 Compare December 8, 2016 23:35

pali and others added 4 commits December 9, 2016 17:59

Update POD documentation for mysql_enable_utf8 and mysql_enable_utf8mb4

ef978a1

Watch the interaction between the UTF-8 flag and what actually gets w…

c041cfe

…ritten to the database.

Fix test t/90utf8_params.t for situations when mysql_enable_utf8=0

ece4326

Extend test t/90utf8_params.t with also different SET NAMES and mysql…

e2705dc

…_server_prepare configurations

pali force-pushed the master branch from 99c42e2 to e2705dc Compare December 9, 2016 17:01

mbeijen merged commit e2705dc into perl5-dbi:master Dec 10, 2016

pali mentioned this pull request Apr 15, 2017

4.042 improperly encoding blobs when sql_type is SQL_UNKNOWN_TYPE #117

Open
