Skip to content

Decoding VARCHAR with UTF-8 collations fails #531

@dvtxc

Description

@dvtxc

Describe the bug

According to the following blog post in 2019, Microsoft has added Unicode support to CHAR and VARCHAR columns using variable length UTF-8 encoding with the new UTF-8 collations: https://techcommunity.microsoft.com/blog/sqlserver/introducing-utf-8-support-for-sql-server/734928

This is also described in the Microsoft documentation: https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver17#utf-8-support

Although this seems to work in the command line interface, I am not able to store/encode and retrieve/decode UTF-8 encoded strings using mssql-python. Looking at the repository, there seems to be no test that covers this functionality.

e.g. the following excerpt from tests/test_013_encoding_decoding.py shows that the developers are aware that mssql-python does not properly support UTF-8 collations, but it is not mentioned in the documentation. That's why I am flagging this as a bug.

def test_setdecoding_with_unicode_data(db_connection):
    """Test setdecoding with actual Unicode data operations.

    Note: VARCHAR columns in SQL Server use the database's default collation
    (typically Latin1/CP1252) and cannot reliably store Unicode characters.
    Only NVARCHAR columns properly support Unicode. This test focuses on
    NVARCHAR columns and ASCII-safe data for VARCHAR columns.
    """

To reproduce

# Quick test script:
import mssql_python
import dotenv
import os


SQL_CHAR = 1
SQL_WCHAR = -8
TEST_DB_TABLE = "test_utf8"

dotenv.load_dotenv()

# Try different connection string formats:
conn_str = (
    f"SERVER={os.getenv('DB_SERVER')};"
    f"DATABASE={os.getenv('DATABASE')};"
    f"UID={os.getenv('DB_USER')};"
    f"PWD={os.getenv('DB_PASSWORD')};"
    "TrustServerCertificate=yes"
)

conn = mssql_python.connect(conn_str)
cursor = conn.cursor()

# Check if table exists before running tests
print("Checking if table exists...")
result = cursor.execute(f"SELECT object_id FROM sys.tables WHERE name = '{TEST_DB_TABLE}'")
exists = result.fetchone() is not None
if not exists:
    print(f"\nCreating table '{TEST_DB_TABLE}'")
    cursor.execute(f"""
        CREATE TABLE {TEST_DB_TABLE} (
            id INT PRIMARY KEY,
            varchar_data VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
            varchar_data_workaround VARBINARY(100),  -- Store UTF-8 bytes directly
            nvarchar_data NVARCHAR(100)
        )
    """)
    cursor.commit()
    print(f"✅ Table '{TEST_DB_TABLE}' created successfully!")


# Test Unicode data
unicode_test_cases = [
    ("ASCII", "Hello World"),
    ("German", "Grüße"),
    ("Chinese", "你好世界"),
    ("Japanese", "こんにちは"),
    ("Russian", "Привет"),
    ("Mixed", "Hello 世界"),
    ("Emoji", "😀😃😄😁"),
]

# Configure encodings properly:
# - SQL_CHAR encoding affects VARCHAR columns
# - SQL_WCHAR encoding affects NVARCHAR columns
conn.setencoding(encoding="utf-8", ctype=SQL_CHAR)  # For VARCHAR
conn.setdecoding(SQL_CHAR, encoding="utf-8", ctype=SQL_CHAR)

# NVARCHAR always uses UTF-16LE (SQL_WCHAR)
conn.setencoding(encoding="utf-16le", ctype=SQL_WCHAR)  # For NVARCHAR
conn.setdecoding(SQL_WCHAR, encoding="utf-16le", ctype=SQL_WCHAR)

# Clear table
cursor.execute(f"DELETE FROM {TEST_DB_TABLE}")
cursor.commit()

results = []

for i, (test_name, unicode_text) in enumerate(unicode_test_cases):

    # Insert Unicode data
    cursor.execute(f"INSERT INTO {TEST_DB_TABLE} VALUES ({i + 1}, '{unicode_text}', 0x{unicode_text.encode('utf-8').hex()}, '{unicode_text}')")
    cursor.commit()

    # Retrieve data
    cursor.execute(
        f"SELECT varchar_data, varchar_data_workaround, nvarchar_data FROM {TEST_DB_TABLE} WHERE id = {i + 1}"
    )
    result = cursor.fetchone()

    # Store all results
    results.append((test_name, unicode_text, result))



# Display all results in tabular format
print("\n" + "=" * 80)
print(f"{'Test Case':<15} {'Original':<25} {'VARCHAR Result':<25} {'VARCHAR Workaround':<25} {'NVARCHAR Result':<25}")
print("=" * 80)
for test_name, unicode_text, result in results:
    varchar_result = repr(result[0])[:23]
    varchar_workaround_result = repr(result[1].decode('utf-8'))[:23]
    nvarchar_result = repr(result[2])[:23]

    print(f"{test_name:<15} {repr(unicode_text)[:23]:<25} {varchar_result:<25} {varchar_workaround_result:<25} {nvarchar_result:<25}")

Decoding the VARCHAR result fails.

================================================================================
Test Case       Original                  VARCHAR Result            VARCHAR Workaround        NVARCHAR Result
================================================================================
ASCII           'Hello World'             'Hello World'             'Hello World'             'Hello World'
German          'Grüße'                   b'Gr\xfc\xdfe'            'Grüße'                   'Grüße'
Chinese         '你好世界'                    '????'                    '你好世界'                    '你好世界'
Japanese        'こんにちは'                   '?????'                   'こんにちは'                   'こんにちは'
Russian         'Привет'                  '??????'                  'Привет'                  'Привет'
Mixed           'Hello 世界'                'Hello ??'                'Hello 世界'                'Hello 世界'
Emoji           '😀😃😄😁'                    '????????'                '😀😃😄😁'                    '😀😃😄😁'

I assume the problem lies with the decoding, as decoding seems to work in sqlcmd:

1> :setvar SQLCMDMAXVARTYPEWIDTH 30
2> :setvar SQLCMDMAXFIXEDTYPEWIDTH 30
3> SELECT varchar_data, varchar_data_workaround, nvarchar_data FROM test_utf8;
4> GO
varchar_data                   varchar_data_workaround          nvarchar_data
------------------------------ -------------------------------- ------------------------------
Hello World                    0x48656C6C6F20576F726C64         Hello World
Grüße                          0x4772C3BCC39F65                 Grüße
你好世界                           0xE4BDA0E5A5BDE4B896E7958C       你好世界
こんにちは                          0xE38193E38293E381ABE381A1E381AF こんにちは
Привет                         0xD09FD180D0B8D0B2D0B5D182       Привет
Hello 世界                       0x48656C6C6F20E4B896E7958C       Hello 世界
😀😃😄😁                       0xF09F9880F09F9883F09F9884F09F98 😀😃😄😁

Expected behavior

Proper decoding of Unicode strings in VARCHAR columns with new UTF-8 collations using:

conn.setencoding(encoding="utf-8", ctype=SQL_CHAR)  # For VARCHAR
conn.setdecoding(SQL_CHAR, encoding="utf-8", ctype=SQL_CHAR)

Further technical details

Python version: 3.14.4
SQL Server version: SQL Server 2022 (docker mssql/server:2022-latest)
Operating system: Docker container

Metadata

Metadata

Assignees

Labels

triage neededFor new issues, not triaged yet.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions