Fix support for mysql_enable_utf8 and mysql_enable_utf8mb4 #67
Force-pushed from 14515c8 to 75ecd62.
Thank you so much. This fix is sorely needed. I will try to find time to test, but one question: should SQL_BLOB also be considered a column type that prevents encoding? It seems to be used mostly by SQLite and several database abstraction tools.
Looks like SQL_BLOB should not be encoded/decoded either... But which types are considered binary and which are not is probably a question for the DBI (API) developers. I think every DBI driver needs similar logic if it wants proper support for Unicode perl scalars...
Force-pushed from 1c710d2 to cc9120b.
…64bit IV When perl is compiled with a 64-bit IV, use newSVuv() to store 64-bit integers instead of our own int-to-string function. Also rename my_ulonglong2str() to my_ulonglong2sv() so that the function name matches what it does.
…s and make consistency between DBI, mysql and perl types Prepared statements now use only buffers as large as necessary for integer types (no longer always 64-bit). DBI SQL types SQL_BOOLEAN and SQL_TINYINT now handle 8-bit integers, SQL_SMALLINT 16-bit integers, SQL_INTEGER 32-bit integers and SQL_BIGINT 64-bit integers; SQL_FLOAT handles native floats and SQL_DOUBLE or SQL_REAL native doubles. When prepared statements are enabled and the storage of numeric values (integer size or floating point size) is chosen explicitly via DBI SQL types, conversion to the storage type is now done by the DBD::mysql driver and not by the mysql server. This means that overflow/underflow is not detected by the mysql server and the application needs to handle it. SQL_BOOLEAN is treated as a numeric type and SQL_DECIMAL is handled by mysql as a string type (due to decimal precision). SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY are treated as perl binary types. MYSQL_TYPE_BIT, MYSQL_TYPE_MEDIUM_BLOB and MYSQL_TYPE_LONG_BLOB are likewise treated as perl binary types. MYSQL_TYPE_LONGLONG is treated as a perl numeric type on platforms with 64-bit perl integers and as a perl string type on others. MYSQL_TYPE_NULL is now perl's undef. The default mysql type is MYSQL_TYPE_STRING, which can represent any supported mysql value.
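Because the narrowing now happens in the driver, an application that binds with an explicit integer type may want to range-check values itself. The following is only a sketch of such a check in plain C, not code from the pull request: the enum and the function `sketch_fits_signed` are hypothetical names, and signed storage is an assumption (the commit message does not say how unsigned values are treated).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the DBI integer types named in the commit message. */
enum sketch_sql_type {
    SKETCH_SQL_TINYINT,   /* 8-bit, also used for SQL_BOOLEAN */
    SKETCH_SQL_SMALLINT,  /* 16-bit */
    SKETCH_SQL_INTEGER,   /* 32-bit */
    SKETCH_SQL_BIGINT     /* 64-bit */
};

/* Return true if `value` fits the (assumed signed) storage type unchanged. */
bool sketch_fits_signed(int64_t value, enum sketch_sql_type type)
{
    switch (type) {
    case SKETCH_SQL_TINYINT:  return value >= INT8_MIN  && value <= INT8_MAX;
    case SKETCH_SQL_SMALLINT: return value >= INT16_MIN && value <= INT16_MAX;
    case SKETCH_SQL_INTEGER:  return value >= INT32_MIN && value <= INT32_MAX;
    case SKETCH_SQL_BIGINT:   return true;  /* int64_t always fits 64-bit storage */
    }
    return false;  /* unknown type: refuse rather than truncate silently */
}
```

A caller would reject or clamp the value when the check fails, since neither the driver nor the server is described as reporting the truncation.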
… and SvUTF8 Boolean variables enable_utf8 and enable_utf8mb4 are needed for UTF-8 support. Macros sv_utf8_decode and SvUTF8 have been part of perl since 5.6.0, and DBD::mysql already requires at least perl 5.8.1.
For each fetched field the mysql server also tells us the charset id. Before this commit, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded all fields whose charset id was different from 63 (which means binary). Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or utf8mb4. By default the mysql server sends data in the encoding specified by the SET NAMES command, which defaults to Latin1, so received Latin1 data is no longer UTF-8 decoded. The mysql server sends a charset id, not a charset name. Each pair of charset name and collation has its own charset id. The new function charsetnr_is_utf8() hardcodes all utf8 and utf8mb4 charset ids from the mysql (up to 8.0.0) and mariadb (up to 10.2.2) source code. It looks like existing ids have not changed since mysql 5.0; only new ones are added.
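To make the difference between the old and new decoding decision concrete, here is a small sketch in C. It is not the `charsetnr_is_utf8()` from the pull request: the function names are hypothetical and the id list is a deliberately small, illustrative subset of well-known MySQL collation ids, whereas the real function is described as hardcoding the complete list.

```c
#include <stdbool.h>

/*
 * Illustrative subset only. The real charsetnr_is_utf8() covers every
 * utf8 and utf8mb4 collation id; this sketch lists just a few of them.
 */
bool sketch_charsetnr_is_utf8(unsigned int id)
{
    switch (id) {
    case 33:   /* utf8_general_ci */
    case 83:   /* utf8_bin */
    case 192:  /* utf8_unicode_ci */
    case 45:   /* utf8mb4_general_ci */
    case 46:   /* utf8mb4_bin */
    case 224:  /* utf8mb4_unicode_ci */
        return true;
    default:
        return false;  /* e.g. 8 (latin1_swedish_ci) or 63 (binary) */
    }
}

/* Old rule described in the commit message: decode everything except binary. */
bool sketch_old_should_decode(unsigned int id)
{
    return id != 63;
}
```

With these two functions, a latin1 field (id 8) is decoded under the old rule but left alone under the new one, which is exactly the behaviour change the commit describes.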
It is needed for the upcoming UTF-8 support of the statement string.
…as C char* For the upcoming UTF-8 support, the code that extracts char* from a perl scalar needs to live in one place: the input functions. The structs imp_sth_ph_st and imp_sth_st store their own copy of the statement and parameters, so these do not disappear after the XS functions return.
… mysql_enable_utf8 is enabled Before this commit, perl scalars (statements or bind parameters) without the UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was enabled. As a result, perl scalars with internal Latin1 encoding were sent to the mysql server as Latin1 even if mysql_enable_utf8 was enabled. Now all statements and bind parameters which are not of a DBI binary type (SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are automatically encoded to UTF-8 when mysql_enable_utf8 is enabled. If mysql_enable_utf8 is not enabled and a statement or bind parameter contains a wide Unicode character, DBD::mysql shows a warning. If a binary parameter contains a wide Unicode character, DBD::mysql shows a warning too. This is similar to the function print without the :utf8 perlio layer. Perl's SvPV() returns a char* from a perl scalar, and a following SvUTF8() call for that scalar tells whether SvPV() returned the data in UTF-8 (true) or in Latin1 (false). SvPVutf8() always returns data in UTF-8, but has the side effect that it modifies the scalar by upgrading it to UTF-8. To prevent modification of the original scalar we create a new mortal (temporary) one and modify that instead. Because UTF-8 invariant characters (7-bit ASCII) are exactly the same in Latin1, we do not need to do any encoding when all characters are UTF-8 invariants. SvPVbyte() first downgrades the scalar to Latin1 and then returns the data in Latin1. If the downgrade is not possible, it croaks. So for binary data, a manual conversion which only warns is used instead of SvPVbyte().
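The encoding step and the invariant shortcut can be sketched without the perl API. The code below is an assumption-laden illustration in plain C, not the driver's code: both function names are hypothetical, and it works on a raw octet buffer instead of a perl scalar, so it shows only the Latin1-to-UTF-8 transformation and the reason 7-bit ASCII input needs no work.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* True if every octet is 7-bit ASCII, i.e. identical in Latin1 and UTF-8. */
bool sketch_all_invariant(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0x80)
            return false;
    return true;
}

/*
 * Encode Latin1 octets as UTF-8 into a newly malloc()ed buffer.
 * Each octet >= 0x80 becomes exactly two octets, so 2 * len always suffices.
 * Returns NULL on allocation failure; the caller frees the result.
 */
unsigned char *sketch_latin1_to_utf8(const unsigned char *s, size_t len,
                                     size_t *out_len)
{
    unsigned char *out = malloc(len ? 2 * len : 1);
    size_t j = 0;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            out[j++] = s[i];                                   /* invariant */
        } else {
            out[j++] = (unsigned char)(0xC0 | (s[i] >> 6));    /* lead octet */
            out[j++] = (unsigned char)(0x80 | (s[i] & 0x3F));  /* continuation */
        }
    }
    *out_len = j;
    return out;
}
```

A caller would test `sketch_all_invariant()` first and send the original buffer unchanged when it returns true, which mirrors the shortcut described in the commit message.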
Force-pushed from 32a4118 to 99c42e2.
A big update of this pull request is here! It addresses the problem reported by schmorp in the discussion at https://rt.cpan.org/Public/Bug/Display.html?id=87428
C<mysql_enable_utf8> attribute state. They are treated as sequence of octets
and it is your responsibility to decode them correctly.
B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour
version XXX needs to be changed to match the next release after the merge...
Also, please re-check that the newly rewritten documentation is OK... I'm not a native English speaker and documentation is important!
handle, when creating the statement handle or after it has been created.
See L</"STATEMENT HANDLES">.
=item mysql_enable_utf8
Easier to repaste the whole section with grammar fixes than comment on each part. (Sorry but difficult to fix the line lengths in github's editor.) This documentation section is greatly improved from the released version btw!
=item mysql_enable_utf8

This attribute affects input data from DBI (statement and bind parameters) and
output data from the mysql server.

If used as a part of the call to C<connect()> then it issues the command
C<SET NAMES utf8>.

When set, any statement or bind parameter which is not of binary type is
automatically encoded to UTF-8 octets before being sent to the MySQL server. Any
retrieved MySQL data with a charset of C<utf8> or C<utf8mb4> from a textual column
type (char, varchar, etc) is automatically UTF-8 decoded and returned as a perl
Unicode scalar (with SvUTF8 flag on). That enables character semantics on those
retrieved UTF-8 strings. The MySQL charset of a retrieved value is affected by the
last C<SET NAMES> command and also could be affected by the database, table and column
configuration. For more information, see the I<Character Set Support> chapter in
the MySQL manual: L<http://dev.mysql.com/doc/refman/5.7/en/charset.html>

When unset and a statement or bind parameter contains a wide Unicode character then
DBD::mysql gives the warning C<Wide character in ... but mysql_enable_utf8 not set>.
The MySQL protocol does not support wide characters and so DBD::mysql does not know
how to send a statement with wide characters when C<mysql_enable_utf8> is not set.

Please note that when C<mysql_enable_utf8> is set, the input statement and bind
parameters are encoded to UTF-8 octets even if the current MySQL session charset is
not C<utf8> or C<utf8mb4>! You are responsible for calling the C<SET NAMES utf8> or
C<SET NAMES utf8mb4> command when setting the C<mysql_enable_utf8> attribute after connecting.

The same applies to unsetting the C<mysql_enable_utf8> attribute. You are responsible
for calling C<SET NAMES latin1> (or the appropriate charset) and then passing perl
scalars in the correct encoding. Otherwise strings will be sent to the MySQL server
incorrectly!

Input bind parameters of binary types (C<SQL_BIT>, C<SQL_BLOB>, C<SQL_BINARY>,
C<SQL_VARBINARY> and C<SQL_LONGVARBINARY>) are not touched regardless of the
C<mysql_enable_utf8> attribute state. They are treated as a sequence of octets
and sent to the MySQL server as is. If such a bind parameter contains a wide Unicode
character then DBD::mysql gives the warning C<Wide character in binary field ...>
because binary data is a sequence of octets, not Unicode characters!

Output data fetched from the MySQL server which does not have a C<utf8> or C<utf8mb4>
charset (so also binary data) is not UTF-8 decoded regardless of the
C<mysql_enable_utf8> attribute state. It is treated as a sequence of octets
and it is your responsibility to decode it correctly.

B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour
when the attribute C<mysql_enable_utf8> was enabled! Input statement and bind
parameters were never encoded to UTF-8 octets and retrieved columns were
always UTF-8 decoded regardless of the column charset (except binary charsets).

=item mysql_enable_utf8mb4

Exactly the same as the attribute C<mysql_enable_utf8>.

Additionally if used as a part of the call to C<connect()> then it issues
the command C<SET NAMES utf8mb4> instead of C<utf8>.

MySQL's C<utf8mb4> charset is capable of handling 4-byte UTF-8 characters.
MySQL's C<utf8> charset is capable of handling only up to 3-byte UTF-8
characters! See the MySQL manual for more information:
L<http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html>

You should use MySQL's C<utf8mb4> charset instead of C<utf8> to prevent problems
with data exchange. When the C<utf8> charset is used then you are responsible for
3-byte UTF-8 sequence checks on input perl scalar strings. Otherwise the MySQL
server can reject or modify the input statement!
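
The 3-byte check mentioned in the last paragraph can be illustrated at the octet level. This is a hedged sketch in C with a hypothetical function name, not something DBD::mysql provides: it assumes the input is already well-formed UTF-8, in which case an octet of 0xF0 or above can only be the lead octet of a 4-byte sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * True if a well-formed UTF-8 octet string contains a 4-byte sequence,
 * i.e. a character that MySQL's 3-byte utf8 charset cannot store.
 * Assumes valid UTF-8; malformed input is not diagnosed here.
 */
bool sketch_has_4byte_utf8(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 0xF0)  /* lead octet of a 4-byte sequence */
            return true;
    return false;
}
```

An application restricted to the C<utf8> charset could run a check like this on the encoded octets and reject or replace such input before sending it.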
Thank you! Text updated.
…ritten to the database.
…_server_prepare configurations
This is SUPER awesome! Thanks a lot. I'll put out a new test release on CPAN with it. Of course testing, eyeballs and comments are welcome.
C<mysql_enable_utf8> attribute state. They are treated as a sequence of octets
and it is your responsibility to decode them correctly.
B<WARNING>: DBD::mysql prior to version XXX had different and buggy behaviour
Do not forget to set correct version instead of XXX.
Sure, thanks for the pointer.
I will modify that and amend the change log and then push a new development
release out tomorrow.
On Sat 10 Dec 2016 at 12:05, pali <notifications@github.com> wrote:
> Do not forget to set correct version instead of XXX.
WARNING: This pull request changes the behavior of the mysql_enable_utf8 and mysql_enable_utf8mb4 attributes. The new behavior is documented in the POD section. The changes should be properly reviewed and tested, as in some situations this can be a backward-incompatible change...
People affected by UTF-8 bugs in DBD::mysql should probably test and verify that it really fixes the reported problems. Maybe we should discuss this pull request further.
Reported bugs:
https://rt.cpan.org/Public/Bug/Display.html?id=25590
https://rt.cpan.org/Public/Bug/Display.html?id=53130
https://rt.cpan.org/Public/Bug/Display.html?id=60987
https://rt.cpan.org/Public/Bug/Display.html?id=87428
Important changes:
Part of this pull request is @davel's (Dave Lambley) test for UTF-8 params from davel/DBD-mysql@8695068, referenced by https://rt.cpan.org/Public/Bug/Display.html?id=60987. I slightly modified it, but it is still a great test case for scenarios with different table charsets, different session charsets and different states of mysql_enable_utf8.
Prepared statements now use only buffers as large as necessary for integer
types (no longer always 64-bit). DBI SQL types SQL_BOOLEAN and SQL_TINYINT
now handle 8-bit integers, SQL_SMALLINT 16-bit integers, SQL_INTEGER 32-bit
integers and SQL_BIGINT 64-bit integers; SQL_FLOAT handles native floats and
SQL_DOUBLE or SQL_REAL native doubles.
When prepared statements are enabled and the storage of numeric values
(integer size or floating point size) is chosen explicitly via DBI SQL
types, conversion to the storage type is now done by the DBD::mysql driver
and not by the mysql server. This means that overflow/underflow is not
detected by the mysql server and the application needs to handle it.
SQL_BOOLEAN is treated as a numeric type and SQL_DECIMAL is handled by mysql
as a string type (due to decimal precision). SQL_BIT, SQL_BLOB, SQL_BINARY,
SQL_VARBINARY and SQL_LONGVARBINARY are treated as perl binary types.
MYSQL_TYPE_BIT, MYSQL_TYPE_MEDIUM_BLOB and MYSQL_TYPE_LONG_BLOB are likewise
treated as perl binary types. MYSQL_TYPE_LONGLONG is treated as a perl
numeric type on platforms with 64-bit perl integers and as a perl string
type on others. MYSQL_TYPE_NULL is now perl's undef.
For each fetched field the mysql server also tells us the charset id. Before
this commit, when mysql_enable_utf8 was enabled, DBD::mysql UTF-8 decoded all
fields whose charset id was different from 63 (which means binary).
Now DBD::mysql UTF-8 decodes only those fields whose charset is utf8 or
utf8mb4. By default the mysql server sends data in the encoding specified
by the SET NAMES command, which defaults to Latin1, so received Latin1 data
is no longer UTF-8 decoded.
Before this commit, perl scalars (statements or bind parameters) without the
UTF8 status flag were not encoded to UTF-8 even if mysql_enable_utf8 was
enabled. As a result, perl scalars with internal Latin1 encoding were
sent to the mysql server as Latin1 even if mysql_enable_utf8 was enabled.
Now all statements and bind parameters which are not of a DBI binary type
(SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are
automatically encoded to UTF-8 when mysql_enable_utf8 is enabled.
If mysql_enable_utf8 is not enabled and a statement or bind parameter
contains a wide Unicode character, DBD::mysql shows a warning. If a binary
parameter contains a wide Unicode character, DBD::mysql shows a warning
too. This is similar to the function print without the :utf8 perlio layer.