Description
The driver currently has some issues related to fetching 1-byte string data (not nvarchar) that contains non-ASCII characters. For context: both server and client are running on 64-bit Linux, my test database uses latin1-encoded varchars, and when you execute a select with a hardcoded string like in the examples below, the driver receives SQL_CHAR as the SQL type and the exact length in characters of that string as the column size.
Here are some examples that don't work:
```python
cursor.execute("select 'Felix Graßl'").fetchall()
```
This raises `RuntimeError: Error fetching LOB for column 1, cType=1, loop=1, SQLGetData return=-1.`, which is a SQLGetData error. When calling `check_error(ddbc_sql_const.SQL_HANDLE_STMT.value, cursor.hstmt, -1)` we get the more precise message `Invalid cursor position`.
A similar issue happens with
```python
cursor.execute("select 'Felix Graßl'").fetchmany(1)
```
The only difference is that `check_error` now returns `Invalid Descriptor Index`.
The SQLGetData error may be a bit misleading here. The issue is that SQLGetData gets called on a column that was already bound and fetched via SQLBindCol, so SQLGetData should never have been called in the first place. I assume it got called as an attempt to handle a previous error. Maybe this fallback mechanism should be removed and the error raised earlier, since to my knowledge it is never OK to call both SQLBindCol and SQLGetData on the same column.
Another example is
```python
print(cursor.execute("select 'Felix Graßl'").fetchone())
```
Here, the driver does not error, but it returns only `'l'` (the last letter) instead of the full string.
```python
print(cursor.execute("select 'Félix Graßl'").fetchone())
```
If we add one more special character (é), we get back `b'\x9fl'`, which is the last two bytes of the string. Since ß takes 2 bytes in UTF-8 and the first one somehow gets lost, the result is no longer valid UTF-8, which raises a decoding error; mssql_python returns the raw bytes as a fallback. I think I would prefer an error in such a situation as well, since decoding errors like that should not be possible without a bug in the driver code.
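To illustrate (a standalone sketch, independent of the driver): the last two bytes of the UTF-8 encoding of `'Félix Graßl'` are exactly `b'\x9fl'`, and they do not decode as UTF-8 on their own, because `\x9f` is a continuation byte of the two-byte encoding of ß:

```python
s = 'Félix Graßl'
encoded = s.encode('utf-8')   # ß encodes as b'\xc3\x9f'
tail = encoded[-2:]           # b'\x9fl' -- matches what the driver returned
try:
    tail.decode('utf-8')
except UnicodeDecodeError as exc:
    print('not valid UTF-8:', exc)
```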
Underlying issue
The situation became more clear to me when I looked at how my proposed arrow fetching method #354 behaves.
```python
print(cursor.execute("select 'Felix Graßl'").arrow())
```
Here, only the final `'l'` is cut off. The arrow fetching functions share the column binding code with the Python fetching path: the `SQLBindColums` function abstracts over `SQLBindCol_ptr` and allocates appropriate buffers. In the case of `SQL_CHAR`s the buffer is simply too small. I assume the confusion comes from the expectation that the ODBC driver simply returns the varchar's bytes as they are stored in the database. With latin1 encoding, ß fits into one byte. Somewhat unexpectedly, though, the ODBC driver does not return latin1-encoded bytes; instead it converts to UTF-8. Since ß takes 2 bytes in UTF-8, the buffer is now too small. If we cast to a larger type, it works totally fine:
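The size mismatch can be reproduced without the driver (a standalone sketch): the string is 11 characters, which fits an 11-byte latin1 buffer exactly, but its UTF-8 encoding needs 12 bytes:

```python
s = 'Felix Graßl'
print(len(s))                    # 11 characters
print(len(s.encode('latin-1')))  # 11 bytes -- as stored in the database
print(len(s.encode('utf-8')))    # 12 bytes -- what the driver hands back
```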
```python
print(cursor.execute("select cast('Felix Graßl' as varchar(100))").fetchall())
```
The reason `fetchone` shows different symptoms in the code above is that it uses `SQLGetData` directly instead of `SQLBindCol`, which is a different code path, but the issue likely stems from the same misunderstanding.
Solution
Just use a bigger buffer. Assuming that every character in every single-byte SQL Server collation takes at most 2 bytes to represent in UTF-8, multiplying the column size by 2 would be sufficient. UTF-8 itself can use up to 4 bytes per character, but I think such characters cannot be represented without using a UTF-8 collation, and in that case the column size we get is already correct.
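A minimal sketch of the proposed sizing rule (the function name and shape are hypothetical, not the driver's actual API; the SQL_CHAR constant value is assumed):

```python
SQL_CHAR = 1  # ODBC type code for single-byte character data (assumed)

def char_buffer_bytes(sql_type: int, column_size: int) -> int:
    """Bytes to allocate per row for a character column, incl. NUL terminator."""
    if sql_type == SQL_CHAR:
        # Single-byte collations: each stored byte may expand to at most
        # 2 bytes once the ODBC driver converts the value to UTF-8.
        return column_size * 2 + 1
    # UTF-8 collations already report a byte-accurate column size.
    return column_size + 1

print(char_buffer_bytes(SQL_CHAR, 11))  # 23 -- enough for 'Felix Graßl' in UTF-8
```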
Alternative: Request wide chars
ODBC also allows binding SQL_CHAR columns to the SQL_C_WCHAR type, which tells the driver to convert the result to UTF-16-LE. This seems to be the most reliable option across different kinds of ODBC drivers, and as far as I know it is also what pyodbc does by default. But since we only care about the specific SQL Server driver here, that would probably just cost performance for no benefit.
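For reference (a standalone sketch, outside the driver): UTF-16-LE uses a fixed 2 bytes for every character of a single-byte collation, so a `column_size * 2` wide-char buffer never truncates:

```python
s = 'Felix Graßl'
wide = s.encode('utf-16-le')
print(len(wide))     # 22 bytes -- exactly 2 per character
print(wide[18:20])   # b'\xdf\x00' -- ß as a single UTF-16 code unit
```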
What about setdecoding
It seems kind of pointless to have this option when the driver already converts varchars to UTF-8 no matter the collation. If you can verify with the ODBC driver team that this behavior is something we can rely on, then the whole mechanism could simply be removed (or raise an exception if anything other than SQL_CHAR + utf-8 or SQL_WCHAR + utf-16-le is passed).
Currently it seems to me that changing the encoding in setdecoding away from utf-8 can only corrupt data.
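To see why (a standalone sketch): if the driver hands back UTF-8 bytes but setdecoding tells the package to decode them as latin1, ß turns into mojibake:

```python
raw = 'ß'.encode('utf-8')    # b'\xc3\x9f' -- what the driver actually returns
print(raw.decode('latin-1'))  # 'Ã\x9f' -- corrupted, no longer 'ß'
```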
Here is an example showing that the driver converts to UTF-8 even when the bytes stored in the database are completely different:
```python
cursor.execute("""
drop table if exists t1
create table t1 (
    a varchar(2) collate Latin1_General_100_CI_AI_SC_UTF8,
    /* N'ß' would also fit into varchar(1), but then the buffer size bug would trigger */
    b varchar(2) collate SQL_Latin1_General_CP1_CI_AS
)
insert into t1 values (N'ß', N'ß')
select a, b, cast(a as varbinary(100)), cast(b as varbinary(100)) from t1
""")
cursor.nextset()
(ret,) = cursor.fetchall()
assert tuple(ret) == ('ß', 'ß', b'\xc3\x9f', b'\xdf')
```