Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to insert UTF-8 character with fast_executemany option #617

Closed
maimat1 opened this issue Sep 2, 2019 · 17 comments
Closed

Unable to insert UTF-8 character with fast_executemany option #617

maimat1 opened this issue Sep 2, 2019 · 17 comments

Comments

@maimat1
Copy link

maimat1 commented Sep 2, 2019

I am trying to copy data from one SQL Server instance to another using pyODBC package and I encountered an error while handling UTF-8 characters on Windows.

I have narrowed it down to this snippet, which fails for me:

import pyodbc
import sys
print('system:' + str(sys.getwindowsversion()))
print('python:' + sys.version)
print('pyodbc:' + pyodbc.version)
cnxn = pyodbc.connect(
    r'''Driver=ODBC Driver 17 for SQL Server;Server=localhost;'''
    r'''Database=DB_Pros;port=1433;Trusted_Connection=yes'''
    )
crsr = cnxn.cursor()
crsr.fast_executemany=True

crsr.execute("""DROP TABLE IF EXISTS dbo.testing_table""")
crsr.execute("""CREATE TABLE dbo.testing_table (col1 nvarchar(max) null)""")
crsr.executemany("""INSERT INTO dbo.testing_table (col1) VALUES (?)""", [[' 🎥 ',]])

return_value = crsr.execute("SELECT * FROM dbo.testing_table").fetchone()

print(return_value )

crsr.close()
cnxn.close()

I get the following error:

system:sys.getwindowsversion(major=10, minor=0, build=18362, platform=2, service_pack='')
python:3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
pyodbc:4.0.27
Traceback (most recent call last):
  File "test.py", line 15, in <module>
    crsr.executemany("""INSERT INTO dbo.testing_table (col1) VALUES (?)""", [[' 🎥 ',]])
pyodbc.DataError: ('22026', '[22026] [Microsoft][ODBC Driver 17 for SQL Server]String data, length mismatch (0) (SQLParamData)')

Everything works fine when I set the crsr.fast_executemany=False.

This seems to be somewhat related to #246 , but this happens on Windows system.

@v-chojas
Copy link
Contributor

v-chojas commented Sep 4, 2019

Could you post an ODBC trace?

@maimat1
Copy link
Author

maimat1 commented Sep 5, 2019

Hi, thank you for looking into this.

Below attached is the ODBC trace, I have executed the snippet posted in the initial comment:
SQL.LOG

@v-makouz
Copy link
Contributor

v-makouz commented Sep 6, 2019

It looks like Python3 Unicode conversions don't really handle characters that require two UTF-16 codepoints very well, so "space", "🎥", "space" gets mangled.

I need to dig a lot deeper to see if there is a neat fix for this, but one work around is to manually escape the codepoints with a "\u" in front of each one of the pair, so in case of "🎥", which is "d83c dfa5" in UTF-16, you'd put "\ud83c\udfa5"

@gordthompson
Copy link
Collaborator

gordthompson commented Sep 8, 2019

For me, with Python 3.7.4 under Windows 7, " \ud83c\udfa5 " doesn't seem to produce a valid string:

str2 = " \ud83c\udfa5 "
print(str2)
# UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1-2: surrogates not allowed

However, " \U0001F3A5 " works fine.

str2 = " \U0001f3a5 "
print(str2)
#  🎥 

Unfortunately, it isn't really a workaround because the string parameter value is the same whether it is defined using a string literal (" 🎥 ") or an escaped Unicode code point (as above).

I also notice that the example code fails if the column is declared as nvarchar(max) (which triggers DAE) but does not fail if the column is declared as nvarchar(50).

@maimat1
Copy link
Author

maimat1 commented Sep 10, 2019

I also notice that the example code fails if the column is declared as nvarchar(max) (which triggers DAE) but does not fail if the column is declared as nvarchar(50).

Yes, I forgot to mention this in the original description - this is only an issue if you insert to nvarchar(max) column with fast_executemany flag using cursor.executemany.

However, the string " 🎥 " was originally only a part of a much longer text stored in nvarchar(max) column from which I isolated the problematic part in the snippet, so changing the datatype of the column is unfortunately not a solution.

@v-makouz
Copy link
Contributor

I know what the problem is now, it's due to some confusion between number of characters and number of codepoints in conversion code. I checked in a fix that should work for this case:

v-makouz@606b4a9

However in the process of testing it, I discovered that when using Python2 on Windows, inserting any Unicode character in this manner doesn't work, because it doesn't do any conversion and so gives the driver the data in the source format (i.e. UTF-8), which the driver doesn't interpret correctly, as Windows itself is not set to UTF-8, so the driver treats it as 1252

So I will see what to do about that and will make a PR once I have a fix for that case as well.

@gordthompson
Copy link
Collaborator

I discovered that when using Python2 on Windows, inserting any Unicode character in this manner doesn't work

If @mkleehammer feels so inclined perhaps we could take this opportunity to discuss the future of Python2 support.

@maimat1
Copy link
Author

maimat1 commented Sep 15, 2019

Thank you for your support, I have manually built pyodbc with proposed fix (v-makouz@606b4a9) and it seems to work flawlessly since then (on Windows 10, Python 3.7.1).

@mkleehammer
Copy link
Owner

@gordthompson What are your thoughts on Python 2 support? I'd love to stop updates for it. I personally maintain a large code base still using Python 2 for some of the microservices, but because they are critical 24x7 services I rarely want to update the the pyodbc versions. (They'll hopefully be ported in 12 months or so.). I'd be happy to make occasional security / crash fixes on a dedicated Python 2 branch for a couple of years.

If we take that step. we should also look at what the minimum Python 3 version should be. I guess that would depend on which LTS OS versions are out there.

@gordthompson
Copy link
Collaborator

gordthompson commented Oct 11, 2019

I'd be happy to make occasional security / crash fixes on a dedicated Python 2 branch for a couple of years.

In that case I would propose the following:

At the end of 2019, create a pyodbc4 branch. The master branch will then be for v5 (and beyond) and will be for Python_3(+) only. Probably no need to do a massive cleanup to remove all Python_2 code from master; as other changes are made to a particular source file just remove the Python_2-specific stuff at the same time. Eventually the Python_2 remnants will disappear.

If we take that step. we should also look at what the minimum Python 3 version should be. I guess that would depend on which LTS OS versions are out there.

My personal preference would be for 3.6+, but I notice that Ubuntu 16.04 LTS still distributes 3.5. Ubuntu 16.04 is on Standard Support until April 2021 (and EOL three years later) so it will be around for a while. So, we might have to go with Python_3.5+, at least at the source level (e.g., no f'{string}'s 😢) for the time being. For Windows we could probably get away with only building 3.6+ wheels. (I don't know about Macs.)

@timnyborg
Copy link

I know what the problem is now, it's due to some confusion between number of characters and number of codepoints in conversion code. I checked in a fix that should work for this case:

v-makouz@606b4a9

However in the process of testing it, I discovered that when using Python2 on Windows, inserting any Unicode character in this manner doesn't work, because it doesn't do any conversion and so gives the driver the data in the source format (i.e. UTF-8), which the driver doesn't interpret correctly, as Windows itself is not set to UTF-8, so the driver treats it as 1252

So I will see what to do about that and will make a PR once I have a fix for that case as well.

Does this fix enable high unicode chars for inserts that don't use parameterized queries?
E.g.
cnxn.execute("""INSERT INTO dbo.testing_table (col1) VALUES (N'🎥')""")

instead of
cnxn.execute("""INSERT INTO dbo.testing_table (col1) VALUES (?)""", [[' 🎥 ',]])

@gordthompson
Copy link
Collaborator

@timnyborg - Inserts using N'...' literals already work with supplementary characters. Do you have an example of a case where they don't?

@timnyborg
Copy link

Well, while the parameterized method works on my stack (Python 3.7, MSSQL, Pyodbc 4.0.27, FreeTDS 1.00.40), the literals cause a sql syntax error. Perhaps it's a separate issue from the interaction with FreeTDS?

cnxn = pyodbc.connect(...)
res = cnxn.execute("update person set surname = ? where id = 1", '🎥') # Works
res = cnxn.execute("update person set surname = N'🎥' where id = 1")

pyodbc.ProgrammingError: ('42000', "[42000] [FreeTDS][SQL Server]Unclosed quotation mark 
after the character string ''. (105) (SQLExecDirectW)")

@v-chojas
Copy link
Contributor

v-chojas commented Nov 6, 2019

Try the Microsoft ODBC Driver for SQL Server instead: https://docs.microsoft.com/en-us/sql/connect/odbc/microsoft-odbc-driver-for-sql-server

@timnyborg
Copy link

Yep, works with the Microsoft drivers (didn't even know they were available!)

However, needed to implement an OpenSSL fix on Debian 10 to get around error 0x2746:
microsoft/msphpsql#1021 (comment)

Thanks, all.

@ambarbour
Copy link

Is the fix for this problem (parameters not literals) currently scheduled for any release?

In the interim is there a workaround, a way of 'pre-treating' the strings to bypass the error? The closest I can get is val.encode('unicode-escape').decode('utf-8') anything going to nvarchar(max), but then they look pretty bad because it encodes even unicode chars below 256, such as common western language accented vowels.

I load/reload a lot of data daily, many millions of rows, so I need fast_executemany. I just swapped out turbodbc, because it has its own problems with unicode strings longer than 1600 chars, and now I have hit this issue.

Any advice is appreciated, thanks.

@ambarbour
Copy link

Well, it's not pretty but it works:

re_multipoint_unicode = re.compile(f'[{chr(0x100)}-{chr(0x10FFFF)}]+')
bad_str = chr(0xFF) + chr(0x1F609)
good_str = re_multipoint_unicode.sub(lambda v: codecs.decode(codecs.unicode_escape_encode(v.group())[0]), bad_str)

If there is a better way, and there probably is, please let me know...

Thanks

mkleehammer added a commit that referenced this issue Dec 21, 2019
* Merging updates. (#1)

Merging updates.

* fix for smalldatetime issue

* Fixed a bad merge

* Fix for inserting high unicode chars

* merge with main branch

* Fix for function sequence error

* reverted unnecessary file changes

* removed obsolete include

* fix for 540

* fix for TVP type mismatch issue

* Combined the IFs

* Fix for high unicode insertion, WIP

* Fix python2 high unicode insertion

* Renamed a table to t1

Co-authored-by: v-chojas <25211973+v-chojas@users.noreply.github.com>
Co-authored-by: Michael Kleehammer <michael@kleehammer.com>
@maimat1 maimat1 closed this as completed Feb 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants