Tail data may be lost when accessing a MySQL VARCHAR field containing a UTF-8 string #219

Closed
alexqlyu opened this issue May 21, 2019 · 10 comments · Fixed by #380

Comments

@alexqlyu

Environment

  • nanodbc version: master
  • DBMS name/version: mysql 5.7.26
  • ODBC connection string: Driver={MySQL ODBC 8.0 ANSI Driver};Server=xxx;Database=xxx;Uid=xxx;Pwd=xxx;CharSet=UTF8;
  • OS and Compiler: Ubuntu 18.04
  • CMake settings: NA

Actual behavior

When accessing a MySQL VARCHAR (or CHAR) field containing a UTF-8 string, if the size in bytes of the UTF-8 string exceeds the size in characters of the VARCHAR(N) field, nanodbc returns only the first N bytes and loses the tail data. The byte length of a MySQL VARCHAR value is variable, but SQLDescribeCol (nanodbc.cpp:2724) only reports the fixed maximum number of characters (sqlsize), and only sqlsize bytes are copied:
const char* s = col.pdata_ + rowset_position_ * col.clen_; (nanodbc.cpp:3021)

Expected behavior

Minimal Working Example

E4 B8 96 E7 95 8C E4 BD A0 E5 A5 BD
These are four Chinese characters (世界你好) encoded in UTF-8, 3 bytes per character. The string can be stored in a MySQL VARCHAR(10) field, but only the first 10 bytes are read back.
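
A minimal reproduction sketch (not from the original report; the table name t219 is made up, the connection string placeholders mirror the environment above, and the source file is assumed to be saved as UTF-8):

#include <nanodbc/nanodbc.h>
#include <iostream>

int main()
{
    nanodbc::connection conn(
        "Driver={MySQL ODBC 8.0 ANSI Driver};Server=xxx;Database=xxx;"
        "Uid=xxx;Pwd=xxx;CharSet=UTF8;");

    nanodbc::just_execute(conn, "DROP TABLE IF EXISTS t219");
    nanodbc::just_execute(conn, "CREATE TABLE t219 (v VARCHAR(10)) CHARACTER SET utf8");
    // Four characters, twelve bytes in UTF-8.
    nanodbc::just_execute(conn, "INSERT INTO t219 VALUES ('世界你好')");

    nanodbc::result r = nanodbc::execute(conn, "SELECT v FROM t219");
    r.next();
    std::string v = r.get<std::string>(0);
    // With the behavior described above, only the first 10 bytes come back.
    std::cout << v.size() << " bytes read, expected 12\n";
}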

@mloskot
Member

mloskot commented May 21, 2019

I suspect a MySQL bug or misconfiguration, like a mismatch of utf8 (the old proprietary 3-byte encoding) vs utf8mb4 (the default in MySQL 8?).
Or double re-encoding to UTF-8 (superfluous CharSet setting?) - doesn't the 5.7+ connector always transmit strings as UTF-8?

You need to investigate it to find out.

SQLDescribeCol returns the number of characters by definition.
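
To illustrate (an added sketch, not from the original thread; stmt is assumed to be a statement handle with the SELECT already executed):

#include <sql.h>
#include <sqlext.h>

// Returns the size SQLDescribeCol reports for the first result column.
// For a MySQL VARCHAR(10) utf8 column this is 10 characters, even though the
// stored value may occupy up to 30 (utf8mb3) or 40 (utf8mb4) bytes.
SQLULEN first_column_size(SQLHSTMT stmt)
{
    SQLCHAR name[256];
    SQLSMALLINT name_len = 0, data_type = 0, digits = 0, nullable = 0;
    SQLULEN column_size = 0; // reported in characters, not bytes

    SQLDescribeCol(stmt, 1, name, static_cast<SQLSMALLINT>(sizeof(name)), &name_len,
                   &data_type, &column_size, &digits, &nullable);
    return column_size;
}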

@alexqlyu
Author

https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html

CHAR(M) The compact family of InnoDB row formats optimize storage for variable-length character sets. See COMPACT Row Format Storage Characteristics. Otherwise, M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set.

To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multibyte characters. In particular, when using a utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes. utf8mb3 and utf8mb4 character sets can require up to three and four bytes per character, respectively. For a breakdown of the storage used for different categories of utf8mb3 or utf8mb4 characters, see Section 10.9, “Unicode Support”.

VARCHAR, VARBINARY, and the BLOB and TEXT types are variable-length types. For each, the storage requirements depend on these factors:

  • The actual length of the column value
  • The column's maximum possible length
  • The character set used for the column, because some character sets contain multibyte characters

For example, a VARCHAR(255) column can hold a string with a maximum length of 255 characters. Assuming that the column uses the latin1 character set (one byte per character), the actual storage required is the length of the string (L), plus one byte to record the length of the string. For the string 'abcd', L is 4 and the storage requirement is five bytes. If the same column is instead declared to use the ucs2 double-byte character set, the storage requirement is 10 bytes: The length of 'abcd' is eight bytes and the column requires two bytes to store lengths because the maximum length is greater than 255 (up to 510 bytes).

InnoDB encodes fixed-length fields greater than or equal to 768 bytes in length as variable-length fields, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.
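
Applied to the report above: a VARCHAR(10) column with the utf8 (utf8mb3) character set can hold up to 10 × 3 = 30 bytes of data (10 × 4 = 40 with utf8mb4), while SQLDescribeCol still reports a column size of 10. The 12-byte test string therefore fits in the column but not in a 10-byte buffer.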

@alitavakol

I suspect a MySQL bug or misconfiguration, like a mismatch of utf8 (the old proprietary 3-byte encoding) vs utf8mb4 (the default in MySQL 8?). Or double re-encoding to UTF-8 (superfluous CharSet setting?) - doesn't the 5.7+ connector always transmit strings as UTF-8?

You need to investigate it to find out.

SQLDescribeCol returns the number of characters by definition.

I see the same issue with unixODBC and DB2: data is truncated if a column contains non-ASCII characters, e.g. حسن25656 truncates to حسن25, but isql and iusql (from the unixODBC package) fetch and print it correctly.

This happens on Windows too. I found a workaround: changing the system locale to English (United States) fixes it, but with an Arabic locale the problem arises again.

I could not solve the problem.

@mloskot
Member

mloskot commented Jul 7, 2022

I'm not a MySQL user myself, so I lack both the experience and the time for thorough testing of such corner cases.
I'm afraid this issue will have to wait for a MySQL expert willing to chime in and help.

Perhaps this issue is related to the use of the limited std::wstring_convert, which has been deprecated since C++17.
I'd suggest trying nanodbc built with Boost enabled, so that the Boost.Locale conversions are used instead:

nanodbc/nanodbc/nanodbc.cpp

Lines 317 to 321 in d7138a9

inline void convert(wide_char_t const* beg, size_t n, std::string& out)
{
#ifdef NANODBC_ENABLE_BOOST
using boost::locale::conv::utf_to_utf;
out = utf_to_utf<char>(beg, beg + n);

@detule
Collaborator

detule commented Jan 21, 2023

Hey @mloskot:

Trying to think through this too. I think we may run into this with other back ends as well - for example Snowflake, where apparently all character data types are equivalent and support UTF-8.

Am I correct in reading the ODBC documentation that ODBC has poor support for UTF-8? In fact:

  • ODBC only supports fixed-length Unicode (UCS-2); and
  • The expectation is that such data is stored in columns of the wide string types (WCHAR, VARWCHAR, TEXT, etc.)

If that's the case, then with databases such as Snowflake, where UTF-8 data seems likely to be stored in a VARCHAR column, we should expect that retrieving results where we have auto-bound a buffer to a string column may lead to truncated data (since the number of characters/code points is not equivalent to the number of bytes in the buffer we reserve).

If I am correct up to this point, I am trying to figure out what the takeaway for end users is:

  • Cast your columns to wide data types when you suspect there is Unicode data stored in CHAR/VARCHAR columns. This seems pretty ugly from an end-user perspective.
  • We could ask them to unbind string columns for back ends/drivers that may be storing Unicode data in CHAR/VARCHAR columns. SQLGetData, I think, would handle this more gracefully.
  • We could over-allocate in nanodbc; maybe [2,4] * current buffer sizes (see the sketch after this list). It would come at some cost to the current memory footprint - TBD if this over-allocation is meaningful.
  • ....
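
A rough sketch of the over-allocation option (illustration only, not the change that was eventually merged; the helper name is made up):

#include <cstddef>

// Bytes to reserve for the narrow-character buffer bound to a CHAR/VARCHAR
// column whose size SQLDescribeCol reported in characters. UTF-8 encodes a
// code point in at most 4 bytes; +1 leaves room for the null terminator.
std::size_t utf8_buffer_bytes(std::size_t column_size_in_chars)
{
    return column_size_in_chars * 4 + 1;
}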

At any rate, with UTF-8 conquering the world, it seems like this issue might crop up more often. It looks like SQL Server started supporting UTF-8 in varchar columns with the 2019 release, though from what I can tell they have done it more intelligently.

@mloskot
Member

mloskot commented Jan 25, 2023

Hi @detule, I appreciate your input and considerations. This is indeed something that we will have to solve in/for nanodbc, either with an implementation or at least with best-practice recommendations. I currently don't have any solution to offer.

It would be good to collect examples of what UTF-8-aware back ends recommend for accessing data via ODBC.
Does Snowflake document it in any way?

I don't know Snowflake. I mostly use nanodbc to access SQL Server, and I have not been concerned about UTF-8.

For the time being, casting to NVARCHAR seems like the simplest workaround to apply, even if it is not friendly or pretty.

p.s. @detule I've given you Triage permissions on the nanodbc repo. In case you are willing to review PRs, we'll appreciate your help.

@detule
Collaborator

detule commented Jan 29, 2023

The only examples I was able to find in the Snowflake documentation use a preset maximum buffer size, rather than querying the back end for the storage size of the buffer bound to each column in the result. Not very useful.

Microsoft does mention a known deficiency related to SQLBindParameter that seems related (from a distance, and if I squint my eyes). Rather than SQLBindParameter, I think we are seeing something that manifests similarly with SQLDescribeCol/SQLBindCol.

Thanks / happy to help as time allows.

@krlmlr

krlmlr commented Mar 10, 2023

+1 for overallocating. AFAICT, currently each codepoint takes at most four bytes in UTF-8.

@mloskot
Member

mloskot commented Mar 12, 2023

I don't mind overallocating. It could be opt-in using a CMake variable.

The best way to move forward with this is to propose an enhancement via a PR; then we can discuss the more concrete details.

@mloskot
Member

mloskot commented Mar 21, 2023

@detule Thank you for your PR #380 which now has been merged.
