Skip to content

VARCHAR non-ascii character parsing #553

@benmatwil

Description

@benmatwil

Describe the bug

Found mssql-python recently and started testing to see about usage in our team to replace pyodbc. Generally worked well and enjoyed the arrow functionality to quickly get data to pandas.

However most of our DB tables use non-unicode/VARCHAR string column types with similar collations to that in the example. This means the data coming back from the DB was coming back with incorrect unicode python strings e.g. from .fetchall() and the .arrow().to_pandas() method would fail as the strings were invalid unicode - see MWE.

Using .setdecoding would fix these string issues. Although no sure if that then plays other havoc elsewhere. Converting to nvarchar in the query also fixes the strings.

No conversion seems to be necessary from pyodbc so wondered if this is an issue or just a different workflow in this library.

To reproduce

import mssql_python

SERVER = ...
DATABASE = ...


connection_string = ";".join(
    [
        f"Server={SERVER}",
        f"Database={DATABASE}",
        "Authentication=ActiveDirectoryIntegrated",
        "TrustServerCertificate=yes",
    ]
)


conn = mssql_python.connect(connection_string)

cursor = conn.cursor()

# Create a temp table on a database with Latin1_General_CI_AS collation.
# The £ sign is 0xA3 in Windows-1252 — valid in Latin1 collations,
# but 0xA3 is not a valid single-byte UTF-8 sequence.
cursor.execute("""
    CREATE TABLE #test_encoding (
        id   INT,
        name VARCHAR(100) COLLATE Latin1_General_CI_AS
    )
""")
cursor.execute("INSERT INTO #test_encoding VALUES (1, 'Price: £100')")
cursor.execute("INSERT INTO #test_encoding VALUES (2, 'Résumé')")
# ASCII only, works fine
cursor.execute("INSERT INTO #test_encoding VALUES (3, 'Hello')")
conn.commit()

# --- fetchall() ---
cursor.execute("SELECT id, name FROM #test_encoding")
rows = cursor.fetchall()
print("fetchall() result:", rows)
# Expected: [(1, 'Price: £100'), (2, 'Résumé'), (3, 'Hello')]

conn.setdecoding(mssql_python.SQL_CHAR, encoding="windows-1252")

# --- fetchall() works correctly after setting decoding ---
cursor.execute("SELECT id, name FROM #test_encoding")
rows = cursor.fetchall()
print("fetchall() result:", rows)

# --- cursor.arrow().to_pandas() ---
cursor.execute("SELECT id, name FROM #test_encoding")
try:
    df = cursor.arrow().to_pandas()
    print("to_pandas() result:", df)
except Exception as e:
    print(f"to_pandas() raised {type(e).__name__}: {e}")
# Expected error: ArrowInvalid: Invalid UTF-8 sequence in string - decoding doesn't apply

# --- Workaround: CAST to nvarchar in SQL ---
cursor.execute("SELECT id, CAST(name AS nvarchar(100)) AS name FROM #test_encoding")
table = cursor.arrow()
print("Workaround result:", table.to_pandas())
# Works correctly

cursor.execute("DROP TABLE #test_encoding")
conn.close()

Expected behavior

Special characters in strings to be returned without any conversion

Further technical details

Python version: 3.10.11
SQL Server version: 2019
Operating system: Windows 10

Metadata

Metadata

Assignees

Labels

triage neededFor new issues, not triaged yet.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions