Tail data may be lost when accessing a MySQL VARCHAR field containing a UTF-8 string. #219
Comments
I suspect a MySQL bug or misconfiguration, like a mismatch of utf8 (the old proprietary 3-byte encoding) vs utf8mb4. You need to investigate it to find out.
From https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html:

CHAR(M): M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set. The compact family of InnoDB row formats optimize storage for variable-length character sets; see COMPACT Row Format Storage Characteristics.

VARCHAR, VARBINARY, and the BLOB and TEXT types are variable-length types. For each, the storage requirements depend on these factors:

To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multibyte characters. In particular, when using a utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes. utf8mb3 and utf8mb4 character sets can require up to three and four bytes per character, respectively. For a breakdown of the storage used for different categories of utf8mb3 or utf8mb4 characters, see Section 10.9, “Unicode Support”.
For example, a VARCHAR(255) column can hold a string with a maximum length of 255 characters. Assuming that the column uses the latin1 character set (one byte per character), the actual storage required is the length of the string (L), plus one byte to record the length of the string. For the string 'abcd', L is 4 and the storage requirement is five bytes. If the same column is instead declared to use the ucs2 double-byte character set, the storage requirement is 10 bytes: the length of 'abcd' is eight bytes and the column requires two bytes to store lengths because the maximum length is greater than 255 (up to 510 bytes).

InnoDB encodes fixed-length fields greater than or equal to 768 bytes in length as variable-length fields, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4. |
I see the same issue with unixODBC and DB2; the problem is that data is truncated if a column contains non-ASCII characters. This happens on Windows too. I found a workaround: changing the system locale to English (United States) fixes it, but with an Arabic locale the problem arises again and I could not solve it. |
I'm not a MySQL user myself, so I lack the experience and also the time for thorough testing of such corner cases. Perhaps this issue is related to the use of limited
Lines 317 to 321 in d7138a9
|
Hey @mloskot: Trying to think through this as well. I think we may run into this with other back ends too, for example Snowflake, where apparently all character data types are equivalent and support UTF-8. Am I correct in reading the ODBC documentation that ODBC has poor support for UTF-8? In fact:
If that's the case, with databases such as If I am correct up to this point, I am trying to figure out what the takeaway for end users is:
At any rate, with UTF-8 conquering the world, it seems like this issue might crop up more often. It looks like SQL Server started supporting UTF-8 in varchar columns with the 2019 release, though from what I can tell they have done it more intelligently. |
Hi @detule, I appreciate your input and considerations. This is indeed something we will have to solve in/for nanodbc, either with an implementation or at least with best-practice recommendations. I currently don't have any solution to offer. It would be good to collect examples of what UTF-8-aware backends recommend for accessing data via ODBC. I don't know Snowflake. I mostly use nanodbc to access SQL Server and I have not been concerned about UTF-8. For the time being, the casting to

P.S. @detule I've added you Triage permissions to the nanodbc repo. In case you are willing to review PRs, we'll appreciate your help. |
The only examples I was able to find in the Snowflake documentation use a pre-set maximum buffer size, rather than querying the back end for the storage size of the buffer bound to each column in the result. Not very useful. Microsoft does mention a known deficiency related to SQLBindParameter that seems related (from a distance, and if I squint my eyes). Rather than

Thanks / happy to help as time allows. |
+1 for overallocating. AFAICT, currently each codepoint takes at most four bytes in UTF-8. |
I don't mind overallocating. It could be opt-in via a CMake variable. The best way to move forward with this is to propose an enhancement via a PR; then we can discuss more concrete details over the code.
Environment
Actual behavior
When accessing a MySQL VARCHAR (or CHAR) field containing a UTF-8 string, if the size in bytes of the UTF-8 string is larger than the size in characters of the VARCHAR(N) field, nanodbc returns only the first N bytes and loses the tail data. A MySQL VARCHAR field has a variable byte length, but SQLDescribeCol (nanodbc.cpp:2724) returns only the static maximum number of characters (sqlsize), and nanodbc copies only sqlsize bytes:
const char* s = col.pdata_ + rowset_position_ * col.clen_;
(nanodbc.cpp:3021)

Expected behavior
Minimal Working Example
E4 B8 96 E7 95 8C E4 BD A0 E5 A5 BD
This is a four-character Chinese string encoded in UTF-8, 3 bytes per character (12 bytes total). The string can be stored in a MySQL VARCHAR(10) field, but only the first 10 bytes are read back.