Describe the bug
I found mssql-python recently and started testing it as a potential replacement for pyodbc on our team. It generally worked well, and I enjoyed the Arrow functionality for quickly getting data into pandas.
However, most of our DB tables use non-Unicode/VARCHAR string column types with collations similar to the one in the example. This meant data coming back from the DB arrived as incorrect Python strings, e.g. from .fetchall(), and the .arrow().to_pandas() method would fail because the strings were invalid Unicode - see the MWE below.
Using .setdecoding fixes these string issues, though I'm not sure whether that then causes problems elsewhere. Casting to nvarchar in the query also fixes the strings.
No conversion seems to be necessary with pyodbc, so I wondered whether this is a bug or just a different workflow in this library.
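To make the root cause concrete, here is a standalone sketch (no DB required) showing why the byte 0xA3 decodes fine as Windows-1252 but is rejected as UTF-8 - this is the same mismatch the MWE below triggers:

```python
# The raw bytes a Latin1-collated VARCHAR column would hold for 'Price: £100'.
raw = b"Price: \xa3100"

# Decoding with the column's actual code page works:
print(raw.decode("windows-1252"))  # 'Price: £100'

# Decoding the same bytes as UTF-8 fails, because a lone 0xA3 byte
# is not a valid UTF-8 sequence:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(f"utf-8 decode fails: {e}")
```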
To reproduce
import mssql_python
SERVER = ...
DATABASE = ...
connection_string = ";".join(
    [
        f"Server={SERVER}",
        f"Database={DATABASE}",
        "Authentication=ActiveDirectoryIntegrated",
        "TrustServerCertificate=yes",
    ]
)
conn = mssql_python.connect(connection_string)
cursor = conn.cursor()
# Create a temp table on a database with Latin1_General_CI_AS collation.
# The £ sign is 0xA3 in Windows-1252 — valid in Latin1 collations,
# but 0xA3 is not a valid single-byte UTF-8 sequence.
cursor.execute("""
    CREATE TABLE #test_encoding (
        id INT,
        name VARCHAR(100) COLLATE Latin1_General_CI_AS
    )
""")
cursor.execute("INSERT INTO #test_encoding VALUES (1, 'Price: £100')")
cursor.execute("INSERT INTO #test_encoding VALUES (2, 'Résumé')")
# ASCII only, works fine
cursor.execute("INSERT INTO #test_encoding VALUES (3, 'Hello')")
conn.commit()
# --- fetchall() ---
cursor.execute("SELECT id, name FROM #test_encoding")
rows = cursor.fetchall()
print("fetchall() result:", rows)
# Expected: [(1, 'Price: £100'), (2, 'Résumé'), (3, 'Hello')]
conn.setdecoding(mssql_python.SQL_CHAR, encoding="windows-1252")
# --- fetchall() works correctly after setting decoding ---
cursor.execute("SELECT id, name FROM #test_encoding")
rows = cursor.fetchall()
print("fetchall() result:", rows)
# --- cursor.arrow().to_pandas() ---
cursor.execute("SELECT id, name FROM #test_encoding")
try:
    df = cursor.arrow().to_pandas()
    print("to_pandas() result:", df)
except Exception as e:
    print(f"to_pandas() raised {type(e).__name__}: {e}")
# Expected error: ArrowInvalid: Invalid UTF-8 sequence in string - decoding doesn't apply
# --- Workaround: CAST to nvarchar in SQL ---
cursor.execute("SELECT id, CAST(name AS nvarchar(100)) AS name FROM #test_encoding")
table = cursor.arrow()
print("Workaround result:", table.to_pandas())
# Works correctly
cursor.execute("DROP TABLE #test_encoding")
conn.close()
Expected behavior
Special characters in VARCHAR strings should be returned correctly without needing any manual conversion.
Further technical details
Python version: 3.10.11
SQL Server version: 2019
Operating system: Windows 10