@pali pali commented Oct 22, 2016

WARNING: This pull request changes the behavior of the mysql_enable_utf8 and mysql_enable_utf8mb4 attributes. The new behavior is documented in the POD section. The changes should be properly reviewed and tested, as in some situations they can be backward incompatible.

People affected by UTF-8 bugs in DBD::mysql should probably test and verify that it really fixes the reported problems. Maybe we should discuss this pull request more.

Reported bugs:
https://rt.cpan.org/Public/Bug/Display.html?id=25590
https://rt.cpan.org/Public/Bug/Display.html?id=53130
https://rt.cpan.org/Public/Bug/Display.html?id=60987
https://rt.cpan.org/Public/Bug/Display.html?id=87428


Important changes:

  • Fix support for prepared statement numeric storage and DBI SQL bind types
  • Make DBI, mysql and perl types consistent
  • Do not UTF-8 encode input bind params which are of DBI SQL binary types
  • Do not UTF-8 decode output fields which are not sent in the utf8 (or utf8mb4) charset by the mysql server
  • Use SvPVutf8() for retrieving a UTF-8 encoded char * because SvPV() can return a char * in Latin1
  • Remove #ifdef checks for sv_utf8_decode and SvUTF8 as they are no longer needed

Part of this pull request is @davel's (Dave Lambley) test for UTF-8 params from davel/DBD-mysql@8695068, referenced by https://rt.cpan.org/Public/Bug/Display.html?id=60987. I slightly modified it, but it is still a great test case for scenarios with a different table charset, a different session charset and different states of mysql_enable_utf8.


Prepared statements now use buffers only as large as necessary for integer
types (not always 64-bit anymore). DBI SQL types SQL_BOOLEAN and SQL_TINYINT
now handle 8-bit integers, SQL_SMALLINT 16-bit integers, SQL_INTEGER 32-bit
integers and SQL_BIGINT 64-bit integers. SQL_FLOAT uses native floats and
SQL_DOUBLE or SQL_REAL native doubles.

Now when prepared statements are enabled and explicit storage of numeric
values (a specific integer or floating point size) is requested via a DBI
SQL type, conversion to the storage type is done by the DBD::mysql driver
and not by the mysql server. This means that overflow or underflow is not
detected by the mysql server and the application needs to handle it.
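
Since the server no longer catches out-of-range values, an application may want to check the range itself before binding. A minimal sketch in C of such a check, assuming two's-complement signed storage of 8, 16, 32 or 64 bits; the helper name fits_signed_width is hypothetical and not part of DBD::mysql:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not DBD::mysql code): returns true when `value`
 * can be stored in a signed integer that is `bits` wide.
 * `bits` is expected to be 8, 16, 32 or 64. */
bool fits_signed_width(int64_t value, unsigned bits)
{
    if (bits == 0)
        return false;           /* no storage at all */
    if (bits >= 64)
        return true;            /* int64_t always fits in 64 bits */

    int64_t max = ((int64_t)1 << (bits - 1)) - 1;  /* e.g. 127 for 8 bits */
    int64_t min = -max - 1;                        /* e.g. -128 for 8 bits */
    return value >= min && value <= max;
}
```

With this, a value bound as SQL_TINYINT would be checked with a width of 8 and a value bound as SQL_INTEGER with a width of 32, matching the sizes listed above.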

SQL_BOOLEAN is treated as a numeric type and SQL_DECIMAL is handled by mysql
as a string type (due to decimal precision). SQL_BIT, SQL_BLOB, SQL_BINARY,
SQL_VARBINARY and SQL_LONGVARBINARY are treated as perl binary types.

MYSQL_TYPE_BIT, MYSQL_TYPE_MEDIUM_BLOB and MYSQL_TYPE_LONG_BLOB are also
treated as binary perl types. MYSQL_TYPE_LONGLONG is treated as a numeric
perl type on platforms where perl has 64-bit integers and as a string perl
type on others. MYSQL_TYPE_NULL is now perl's undef.
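
The DBI-side classification described above can be sketched as a small lookup. This is only an illustration, not the driver's code: the numeric values are the DBI sql_types constants as I recall them and should be verified against DBI, and the names perl_kind and classify_sql_type are hypothetical.

```c
/* DBI SQL type codes. Assumption: values as exported by DBI's
 * :sql_types constants; verify against your DBI version. */
enum {
    SQL_BIT = -7, SQL_TINYINT = -6, SQL_BIGINT = -5,
    SQL_LONGVARBINARY = -4, SQL_VARBINARY = -3, SQL_BINARY = -2,
    SQL_DECIMAL = 3, SQL_INTEGER = 4, SQL_SMALLINT = 5,
    SQL_FLOAT = 6, SQL_REAL = 7, SQL_DOUBLE = 8,
    SQL_BOOLEAN = 16, SQL_BLOB = 30
};

/* Hypothetical categories mirroring the description above. */
typedef enum { KIND_STRING, KIND_NUMERIC, KIND_BINARY } perl_kind;

perl_kind classify_sql_type(int sql_type)
{
    switch (sql_type) {
    case SQL_BOOLEAN: case SQL_TINYINT: case SQL_SMALLINT:
    case SQL_INTEGER: case SQL_BIGINT:
    case SQL_FLOAT: case SQL_REAL: case SQL_DOUBLE:
        return KIND_NUMERIC;    /* stored in native integer or float buffers */
    case SQL_BIT: case SQL_BLOB: case SQL_BINARY:
    case SQL_VARBINARY: case SQL_LONGVARBINARY:
        return KIND_BINARY;     /* octets, never UTF-8 encoded */
    default:
        return KIND_STRING;     /* includes SQL_DECIMAL, kept as a string */
    }
}
```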


For each fetched field the mysql server also tells us its charset id. Before
this commit, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded all
fields with a charset id other than 63 (which means binary).

Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or
utf8mb4. By default the mysql server sends data in the encoding specified by
the SET NAMES command, which is Latin1 by default. So received Latin1 data
is no longer UTF-8 decoded.


Before this commit, perl scalars (statements or bind parameters) without the
UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was
enabled. As a result, perl scalars with internal Latin1 encoding were sent
to the mysql server as Latin1 even if mysql_enable_utf8 was enabled.

Now all statements and bind parameters which are not of a DBI binary type
(SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are
automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.

If mysql_enable_utf8 is not enabled and a statement or bind parameter
contains a wide Unicode character, DBD::mysql shows a warning. If a binary
parameter contains a wide Unicode character, DBD::mysql shows a warning
too. This is similar to the print function without the :utf8 perlio layer.

pali force-pushed the master branch 2 times, most recently from 14515c8 to 75ecd62, October 22, 2016 16:10
Contributor

Grinnz commented Oct 22, 2016

Thank you so much. This fix is sorely needed. I will try to find time to test, but one question: should SQL_BLOB also be considered a column type to prevent encoding? It seems to be used mostly by SQLite and several database abstraction tools.

Member Author

pali commented Oct 22, 2016

Looks like SQL_BLOB should not be encoded/decoded either... But this is probably a question for the DBI (API) developers: which types are considered binary and which are not. I think that every DBI driver needs to have similar logic if it wants to have proper support for Unicode perl scalars...

pali added 8 commits December 8, 2016 23:41
…64bit IV

When perl is compiled with a 64-bit IV, use newSVuv() for storing a
64-bit integer instead of our own int-to-string function.

Also rename my_ulonglong2str() to my_ulonglong2sv() so the function name
matches what it does.
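
For readers unfamiliar with the string fallback, here is a rough sketch of what an unsigned 64-bit integer to decimal string conversion looks like. It is a generic illustration with a hypothetical name (u64_to_decimal), not the body of my_ulonglong2str() or my_ulonglong2sv():

```c
#include <stdint.h>

/* Hypothetical illustration: writes the decimal digits of `value` into
 * the tail of `buf` (which must hold at least 21 bytes: 20 digits for
 * the largest 64-bit value plus the terminating NUL) and returns a
 * pointer to the first digit inside `buf`. */
char *u64_to_decimal(uint64_t value, char buf[21])
{
    char *p = buf + 20;
    *p = '\0';
    do {
        *--p = (char)('0' + (int)(value % 10));  /* least significant digit first */
        value /= 10;
    } while (value != 0);
    return p;
}
```

On perls whose IV cannot hold 64 bits, returning such a string preserves the full value, which is why MYSQL_TYPE_LONGLONG is described as a string perl type there.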
…s and make consistency between DBI, mysql and perl types

Prepared statements now use buffers only as large as necessary for integer
types (not always 64-bit anymore). DBI SQL types SQL_BOOLEAN and SQL_TINYINT
now handle 8-bit integers, SQL_SMALLINT 16-bit integers, SQL_INTEGER 32-bit
integers and SQL_BIGINT 64-bit integers. SQL_FLOAT uses native floats and
SQL_DOUBLE or SQL_REAL native doubles.

Now when prepared statements are enabled and explicit storage of numeric
values (a specific integer or floating point size) is requested via a DBI
SQL type, conversion to the storage type is done by the DBD::mysql driver
and not by the mysql server. This means that overflow or underflow is not
detected by the mysql server and the application needs to handle it.

SQL_BOOLEAN is treated as a numeric type and SQL_DECIMAL is handled by mysql
as a string type (due to decimal precision). SQL_BIT, SQL_BLOB, SQL_BINARY,
SQL_VARBINARY and SQL_LONGVARBINARY are treated as perl binary types.

MYSQL_TYPE_BIT, MYSQL_TYPE_MEDIUM_BLOB and MYSQL_TYPE_LONG_BLOB are also
treated as binary perl types. MYSQL_TYPE_LONGLONG is treated as a numeric
perl type on platforms where perl has 64-bit integers and as a string perl
type on others. MYSQL_TYPE_NULL is now perl's undef.

The default mysql type is MYSQL_TYPE_STRING, which can represent any
supported mysql value.
… and SvUTF8

Boolean variables enable_utf8 and enable_utf8mb4 are needed for UTF-8
support. Macros sv_utf8_decode and SvUTF8 have been part of perl since 5.6.0
and DBD::mysql already requires at least perl 5.8.1.
For each fetched field the mysql server also tells us its charset id. Before
this commit, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded all
fields with a charset id other than 63 (which means binary).

Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or
utf8mb4. By default the mysql server sends data in the encoding specified by
the SET NAMES command, which is Latin1 by default. So received Latin1 data
is no longer UTF-8 decoded.

The mysql server sends a charset id, not a charset name. Each combination of
charset name and collation has its own charset id. The new function
charsetnr_is_utf8() hardcodes all utf8 and utf8mb4 charset ids from the
mysql (up to 8.0.0) and mariadb (up to 10.2.2) source code. It looks like
existing ids have not changed since mysql 5.0; only new ones are added.
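
To illustrate the idea of an id-based check, here is a deliberately incomplete sketch. It is not the charsetnr_is_utf8() from this pull request: the name charsetnr_is_utf8_sketch is hypothetical and only a handful of well-known ids are listed, whereas the real function covers every utf8 and utf8mb4 collation.

```c
#include <stdbool.h>

/* Hypothetical, non-exhaustive sketch: returns true for a few well-known
 * utf8/utf8mb4 charset ids. The real driver function lists many more. */
bool charsetnr_is_utf8_sketch(unsigned int id)
{
    switch (id) {
    case 33:   /* utf8_general_ci */
    case 83:   /* utf8_bin */
    case 192:  /* utf8_unicode_ci */
    case 45:   /* utf8mb4_general_ci */
    case 46:   /* utf8mb4_bin */
    case 224:  /* utf8mb4_unicode_ci */
        return true;
    default:   /* e.g. 63 (binary) or 8 (latin1_swedish_ci) */
        return false;
    }
}
```

A fetched field would then be UTF-8 decoded only when the check on its charset id succeeds and mysql_enable_utf8 is enabled.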
It is needed for the upcoming UTF-8 support of the statement string.
…as C char*

For the upcoming UTF-8 support, the code that extracts a char* from a perl
scalar needs to be in one place: in the input functions.

The structs imp_sth_ph_st and imp_sth_st store their own copy of the
statement and parameters, so they do not disappear after the XS function
returns.
… mysql_enable_utf8 is enabled

Before this commit, perl scalars (statements or bind parameters) without the
UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was
enabled. As a result, perl scalars with internal Latin1 encoding were sent
to the mysql server as Latin1 even if mysql_enable_utf8 was enabled.

Now all statements and bind parameters which are not of a DBI binary type
(SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are
automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.

If mysql_enable_utf8 is not enabled and a statement or bind parameter
contains a wide Unicode character, DBD::mysql shows a warning. If a binary
parameter contains a wide Unicode character, DBD::mysql shows a warning
too. This is similar to the print function without the :utf8 perlio layer.

Perl's SvPV() returns a char* from a perl scalar, and a following SvUTF8()
call for that scalar tells whether SvPV() returned the data in UTF-8 (true)
or in Latin1 (false).

SvPVutf8() always returns data in UTF-8, but has the side effect that it
modifies and upgrades the scalar to UTF-8. To prevent modification of the
original scalar, we create a new mortal (temporary) one for modification.
Because invariant UTF-8 characters (7-bit ASCII) are exactly the same in
Latin1, we do not need to do any encoding when all characters are UTF-8
invariants.
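
The invariant shortcut and the Latin1 to UTF-8 step can be sketched in plain C, outside of the perl API. Both function names below are hypothetical, and the sketch assumes an ASCII platform (on EBCDIC the set of invariants differs); the driver itself works on perl scalars rather than raw buffers.

```c
#include <stdbool.h>
#include <stddef.h>

/* True when every byte is 7-bit ASCII, i.e. the buffer reads the same
 * in Latin1 and in UTF-8 and needs no conversion. */
bool is_utf8_invariant(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return false;
    return true;
}

/* Encodes `len` Latin1 bytes from `in` as UTF-8 into `out`, which must
 * have room for 2 * len bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                                  /* invariant */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));   /* lead byte */
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F)); /* continuation */
        }
    }
    return n;
}
```

A caller would test is_utf8_invariant() first and skip the copy entirely when it returns true, which is the optimisation the commit message describes.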

SvPVbyte() first downgrades the scalar to Latin1 and then returns the data
in Latin1. If the downgrade is not possible, it croaks. So for binary data,
instead of SvPVbyte(), a manual conversion is used which only emits a
warning.
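
A manual downgrade that reports a problem instead of aborting might look roughly like the sketch below. The function name is hypothetical, the input is assumed to be well-formed UTF-8, and passing the bytes of wide characters through unchanged is my assumption (modelled on how print behaves), not a confirmed detail of the driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: converts well-formed UTF-8 in `in` to Latin1 in
 * `out` (which must hold at least `len` bytes). Code points above 0xFF
 * cannot be represented in Latin1; their bytes are copied unchanged and
 * *wide is set so the caller can emit a warning instead of dying.
 * Returns the number of bytes written. */
size_t utf8_to_latin1_lossy(const unsigned char *in, size_t len,
                            unsigned char *out, bool *wide)
{
    size_t n = 0, i = 0;
    *wide = false;
    while (i < len) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                       /* ASCII, copied as is */
            i++;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence for U+0080..U+00FF becomes one byte */
            out[n++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            *wide = true;                       /* part of a wide character */
            out[n++] = c;
            i++;
        }
    }
    return n;
}
```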
pali force-pushed the master branch 2 times, most recently from 32a4118 to 99c42e2, December 8, 2016 23:35
Member Author

pali commented Dec 8, 2016

A big update of this pull request is here! It addresses the problem reported by schmorp in the discussion at https://rt.cpan.org/Public/Bug/Display.html?id=87428

C<mysql_enable_utf8> attribute state. They are treated as sequence of octets
and it is your responsibility to decode them correctly.
B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour
Member Author

Version XXX needs to be changed to the next release after the merge...

Also, please re-check that the rewritten documentation is OK... I'm not a native English speaker and documentation is important!

handle, when creating the statement handle or after it has been created.
See L</"STATEMENT HANDLES">.
=item mysql_enable_utf8
Contributor

Easier to repaste the whole section with grammar fixes than comment on each part. (Sorry but difficult to fix the line lengths in github's editor.) This documentation section is greatly improved from the released version btw!

=item mysql_enable_utf8

This attribute affects input data from DBI (statement and bind parameters) and
output data from the mysql server.

If used as a part of the call to C<connect()> then it issues the command
C<SET NAMES utf8>.

When set, any statement or bind parameter which is not of binary type is
automatically encoded to UTF-8 octets before being sent to the MySQL server.  Any
retrieved MySQL data with a charset of C<utf8> or C<utf8mb4> from a textual column
type (char, varchar, etc) is automatically UTF-8 decoded and returned as a perl
Unicode scalar (with SvUTF8 flag on).  That enables character semantics on those
retrieved UTF-8 strings.  The MySQL charset of a retrieved value is affected by the
last C<SET NAMES> command and also could be affected by the database, table and column
configuration.  For more information, see the I<Character Set Support> chapter in
the MySQL manual: L<http://dev.mysql.com/doc/refman/5.7/en/charset.html>

When unset and a statement or bind parameter contains a wide Unicode character then
DBD::mysql gives the warning C<Wide character in ... but mysql_enable_utf8 not set>.
The MySQL protocol does not support wide characters and so DBD::mysql does not know
how to send a statement with wide characters when C<mysql_enable_utf8> is not set.

Please note that when C<mysql_enable_utf8> is set, the input statement and bind
parameters are encoded to UTF-8 octets even if the current MySQL session charset is
not C<utf8> or C<utf8mb4>!  You are responsible for calling the C<SET NAMES utf8> or
C<SET NAMES utf8mb4> command when setting the C<mysql_enable_utf8> attribute after connecting.
The same applies to unsetting the C<mysql_enable_utf8> attribute.  You are responsible
for calling C<SET NAMES latin1> (or C<SET NAMES> with the correct charset) and then
passing perl scalars in the correct encoding.  Otherwise strings will be sent to the
MySQL server incorrectly!

Input bind parameters of binary types (C<SQL_BIT>, C<SQL_BLOB>, C<SQL_BINARY>,
C<SQL_VARBINARY> and C<SQL_LONGVARBINARY>) are not touched regardless of the
C<mysql_enable_utf8> attribute state.  They are treated as a sequence of octets
and sent to the MySQL server as is.  If that bind parameter contains a wide Unicode
character then DBD::mysql gives the warning C<Wide character in binary field ...>
because binary data is a sequence of octets, not Unicode characters!

Output data fetched from the MySQL server which does not have a C<utf8> or C<utf8mb4>
charset (so also binary data) is not UTF-8 decoded regardless of the
C<mysql_enable_utf8> attribute state.  They are treated as a sequence of octets
and it is your responsibility to decode them correctly.

B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour
when the attribute C<mysql_enable_utf8> was enabled!  Input statement and bind
parameters were never encoded to UTF-8 octets and retrieved columns were
always UTF-8 decoded regardless of the column charset (except binary charsets).

=item mysql_enable_utf8mb4

Exactly the same as the attribute C<mysql_enable_utf8>.

Additionally if used as a part of the call to C<connect()> then it issues
the command C<SET NAMES utf8mb4> instead of C<utf8>.

MySQL's C<utf8mb4> charset is capable of handling 4-byte UTF-8 characters.
MySQL's C<utf8> charset is capable of handling only up to 3-byte UTF-8
characters! See MySQL manual for more information:
L<http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html>

You should use MySQL's C<utf8mb4> charset instead of C<utf8> to prevent problems
with data exchange.  When the C<utf8> charset is used, you are responsible for
checking that input perl scalar strings contain only UTF-8 sequences of up to
3 bytes.  Otherwise the MySQL server can reject or modify the input statement!
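
As an illustration of the 3-byte check mentioned above, the sketch below scans an already UTF-8 encoded buffer for 4-byte sequences. The function name is hypothetical and the input is assumed to be well-formed UTF-8, in which any byte of 0xF0 or above starts a 4-byte sequence:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when the well-formed UTF-8 buffer
 * contains a 4-byte sequence (a character outside the Basic Multilingual
 * Plane), which MySQL's 3-byte utf8 charset cannot store. */
bool utf8_needs_utf8mb4(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0xF0)       /* lead byte of a 4-byte sequence */
            return true;
    return false;
}
```

Strings for which this returns true are the ones that need the C<utf8mb4> charset.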

Member Author

Thank you! Text updated.

Contributor

mbeijen commented Dec 10, 2016

This is SUPER awesome! Thanks a lot. I'll put out a new test release on CPAN with it.

Of course testing, eyeballs and comments are welcome.

@mbeijen mbeijen merged commit e2705dc into perl5-dbi:master Dec 10, 2016
C<mysql_enable_utf8> attribute state. They are treated as a sequence of octets
and it is your responsibility to decode them correctly.
B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour
Member Author

Do not forget to set the correct version instead of XXX.

Contributor

mbeijen commented Dec 10, 2016 via email
