Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failures with NCLOB text containing four byte Unicode emoji #596

Open
zzzeek opened this issue Dec 19, 2021 · 19 comments
Open

Failures with NCLOB text containing four byte Unicode emoji #596

zzzeek opened this issue Dec 19, 2021 · 19 comments
Labels

Comments

@zzzeek
Copy link

zzzeek commented Dec 19, 2021

Hi Anthony / Christopher -

I've narrowed down at least one of the failures I'm having re: LOB to the NCLOB datatype, and it seems to be independent of whether or not I'm using setinputsizes and whether or not I use an outputtype handler. What's most disturbing is that the failure is non-deterministic, failing only sometimes for large strings of text that have many random occurrences of characters in random orders.

This is using NLS_LANG=AMERICAN_AMERICA.AL32UTF8 so we would normally assume the full range of Unicode codepoints should round trip accurately.

The character in question is this: 🐍 , the Python snake (on some GUI elements it looks more like a duck, though if I look closely, it's still a snake).

Oracle server 18c, client version info:

>>> import sys
>>> import platform
>>> 
>>> print("platform.platform:", platform.platform())
platform.platform: Linux-5.14.9-200.fc34.x86_64-x86_64-with-glibc2.33
>>> print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
sys.maxsize > 2**32: True
>>> print("platform.python_version:", platform.python_version())
platform.python_version: 3.10.0
>>> 
>>> import cx_Oracle
>>> print("cx_Oracle.version:", cx_Oracle.version)
cx_Oracle.version: 8.3.0
>>> print("cx_Oracle.clientversion:", cx_Oracle.clientversion())
cx_Oracle.clientversion: (19, 3, 0, 0, 0)

The test case below inserts and returns a row that contains the character embedded in a randomized field of characters. the larger the field of characters, the more chance that it fails. See sample run at the end.

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")


def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type in (
        cx_Oracle.CLOB,
        cx_Oracle.NCLOB,
    ):
        return cursor.var(cx_Oracle.LONG_STRING, size, cursor.arraysize)

# fetching LOB without the output type handler also doesn't help
conn.outputtypehandler = output_type_handler


def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()


def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    # print(f"test data has {len(data.split('🐍'))} of the problematic character 🐍")

    # this actually doesn't help
    # cursor.setinputsizes(
    #     **{"x": cx_Oracle.NUMBER, "y": cx_Oracle.NCLOB, "z": cx_Oracle.NUMBER}
    # )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1] == data
    finally:
        cursor.close()
        conn.rollback()


def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

sample run

$ python test3.py 

datasize: 10
datasize: 10 test run 0 succeeded
datasize: 10 test run 1 succeeded
datasize: 10 test run 2 succeeded
datasize: 10 test run 3 succeeded
datasize: 10 test run 4 succeeded
datasize: 10 test run 5 succeeded
datasize: 10 test run 6 succeeded
datasize: 10 test run 7 succeeded
datasize: 10 test run 8 succeeded
datasize: 10 test run 9 succeeded

datasize: 25
datasize: 25 test run 0 succeeded
datasize: 25 test run 1 succeeded
datasize: 25 test run 2 succeeded
datasize: 25 test run 3 succeeded
datasize: 25 test run 4 succeeded
datasize: 25 test run 5 succeeded
datasize: 25 test run 6 succeeded
datasize: 25 test run 7 succeeded
datasize: 25 test run 8 succeeded
datasize: 25 test run 9 succeeded

datasize: 50
datasize: 50 test run 0 succeeded
datasize: 50 test run 1 succeeded
datasize: 50 test run 2 succeeded
datasize: 50 test run 3 succeeded
datasize: 50 test run 4 succeeded
datasize: 50 test run 5 succeeded
datasize: 50 test run 6 succeeded
datasize: 50 test run 7 succeeded
datasize: 50 test run 8 succeeded
datasize: 50 test run 9 succeeded

datasize: 100
datasize: 100 test run 0 failed
datasize: 100 test run 1 succeeded
datasize: 100 test run 2 succeeded
datasize: 100 test run 3 failed
datasize: 100 test run 4 succeeded
datasize: 100 test run 5 failed
datasize: 100 test run 6 succeeded
datasize: 100 test run 7 failed
datasize: 100 test run 8 failed
datasize: 100 test run 9 succeeded

datasize: 250
datasize: 250 test run 0 succeeded
datasize: 250 test run 1 failed
datasize: 250 test run 2 failed
datasize: 250 test run 3 failed
datasize: 250 test run 4 failed
datasize: 250 test run 5 failed
datasize: 250 test run 6 failed
datasize: 250 test run 7 failed
datasize: 250 test run 8 failed
datasize: 250 test run 9 failed
@zzzeek zzzeek added the bug label Dec 19, 2021
@zzzeek
Copy link
Author

zzzeek commented Dec 19, 2021

just to make sure I'm not doing anything silly I ported the above program to the mysqlclient driver and ran against MySQL with charset=utf8mb4, there's no failures on that platform, so this is definitely something on the Oracle side.

@zzzeek
Copy link
Author

zzzeek commented Dec 19, 2021

character set info reported by the server

>>> 
>>> cursor.execute("""
... SELECT value AS db_charset
... FROM nls_database_parameters
... WHERE parameter = 'NLS_CHARACTERSET'
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> 
>>> cursor.fetchall()
[('AL32UTF8',)]
>>> cursor.execute("""
... SELECT value AS db_ncharset
... FROM nls_database_parameters
... WHERE parameter = 'NLS_NCHAR_CHARACTERSET'
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> cursor.fetchall()
[('AL16UTF16',)]
>>> cursor.execute("""
... SELECT DISTINCT client_charset AS client_charset
... FROM v$session_connect_info
... WHERE sid = SYS_CONTEXT('USERENV', 'SID')
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> cursor.fetchall()
[('AL32UTF8',)]

@anthony-tuininga
Copy link
Member

I am able to replicate with the script you provided. It is not, in fact, intermittent, but very consistent. It depends on the data that is being supplied to the database -- it was intermittent for you simply because you are randomly creating the string to send to the database! I'll see what the problem is and report back once I figure it out. Thanks for supplying this to me!

@zzzeek
Copy link
Author

zzzeek commented Dec 20, 2021

Yes I figured there's some pattern in my random production of strings that causes the issue. looks like a fairly deep issue w/ the encoding.

@cjbj cjbj changed the title sporadic failures with NCLOB text containing four byte Unicode emoji Failures with NCLOB text containing four byte Unicode emoji Dec 21, 2021
@cjbj
Copy link
Member

cjbj commented Dec 21, 2021

@anthony-tuininga has opened an Oracle bug against the DB itself. The number is 33691692

@anthony-tuininga
Copy link
Member

To provide a little more information: the issue has nothing to do with insertng the data, only with fetching it using the output type handler.

@zzzeek
Copy link
Author

zzzeek commented Dec 21, 2021

OK for my script above, I can take the outputtypehandler away totally and it still fails in the same way; change the row[1] to row[1].read() and I get the same pattern of early success / late failures. can you try the script below?

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")


def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()


def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1].read() == data
    finally:
        cursor.close()
        conn.rollback()


def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

@anthony-tuininga
Copy link
Member

I already tried that...and I don't get any failures that way! I am using 21.3 database, however, so not sure if this is something that may have been fixed? I'll try with an older database a little later and let you know.

@zzzeek
Copy link
Author

zzzeek commented Dec 21, 2021

im seeing it on 18c which is not too old

@anthony-tuininga
Copy link
Member

Ok. I tried with a 12.1 database. It works fine, there. I did need to add this code:

cursor.setinputsizes(y=cx_Oracle.DB_TYPE_NVARCHAR)

That's because that database is using a single-byte character set, though. If the data type is NCLOB you should use that setinputsizes() call to avoid the double conversion (and possible data loss when the primary character set is a single-byte character set). For CLOB and BLOB no setinputsizes() call should be needed.

Since you're getting failures, can you show the first place in the string where things are failing?

@zzzeek
Copy link
Author

zzzeek commented Dec 21, 2021

are you able to try with an 18c database? it's the same test case, so I dont know what method you used to narrow down the pattern of characters that doesnt come back, I guess you could trap the error then compare the strings character by character?

@cjbj
Copy link
Member

cjbj commented Dec 21, 2021

I just started 18c XE for @anthony-tuininga

(Also our doc on setinputsizes needs some love....)

@anthony-tuininga
Copy link
Member

I am able to replicate with 18c. I will look into that as well and get back to you!

@anthony-tuininga
Copy link
Member

@zzzeek, can you adjust your test case to add the following?

cursor.setinputsizes(y=cx_Oracle.DB_TYPE_NVARCHAR)

just prior to the execute that inserts the row into the database? That resolves the issue for me.

Similarly, for the output type handler, use this:

    if default_type in (oracledb.DB_TYPE_CLOB, oracledb.DB_TYPE_NCLOB):
        return cursor.var(oracledb.DB_TYPE_NVARCHAR, size, cursor.arraysize)

That resolves the fetch side. We'll look at improving the documentation and suggesting an improvement to SQLAlchemy to resolve these matters long-term -- please confirm that this resolves the issue for you, too! Thanks.

@zzzeek
Copy link
Author

zzzeek commented Dec 22, 2021

the second test works if I use DB_TYPE_NVARCHAR as well as if I use NCLOB. But DB_TYPE_NVARCHAR is the one that wont build up a separate statement handle, so I should use that one.

the original test is giving me a new error though:

 cx_Oracle.DatabaseError: DPI-1037: column at array position 0 fetched with error 1406

here's exactly how I am running it:

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")


def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type in (
        cx_Oracle.CLOB,
        cx_Oracle.NCLOB,
    ):
        return cursor.var(cx_Oracle.DB_TYPE_NVARCHAR, size, cursor.arraysize)

# fetching LOB without the output type handler also doesn't help
conn.outputtypehandler = output_type_handler


def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()


def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    # print(f"test data has {len(data.split('🐍'))} of the problematic character 🐍")

    # this actually doesn't help
    cursor.setinputsizes(
        **{"x": cx_Oracle.NUMBER, "y": cx_Oracle.DB_TYPE_NVARCHAR, "z": cx_Oracle.NUMBER}
    )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1] == data
    finally:
        cursor.close()
        conn.rollback()


def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

@anthony-tuininga
Copy link
Member

Replace the line in the output type handler that creates the variable with this one:

        return cursor.var(cx_Oracle.DB_TYPE_NVARCHAR, 131072, cursor.arraysize)

That appears to resolve the issue. The size has to be longer than 32767 in order for the "LONG" conversion to kick in. With that change your test works on the 18c database that Chris provided -- so in theory it should work for you, too. We'll look into making that a little less messy!

@zzzeek
Copy link
Author

zzzeek commented Dec 22, 2021

okey doke, that works, so this gives me a path to removing the BLOB setinputsize. 131072 is...number of units for a 1G size? it would be helpful if I could document the origin of this number ?

@zzzeek
Copy link
Author

zzzeek commented Dec 22, 2021

that is, I hope this number isn't a fixed buffer size I have to estimate based on data size, since we dont have that

@anthony-tuininga
Copy link
Member

No, it just has be a number larger than 32767. Any number will do. :-) I don't like that myself so will see about making it a little less "disgusting". ;-) Thanks for confirming that the adjusted code works for you!

sqlalchemy-bot pushed a commit to sqlalchemy/sqlalchemy that referenced this issue Jun 10, 2022
Adjustments made to the BLOB / CLOB / NCLOB datatypes in the cx_Oracle and
oracledb dialects, to improve performance based on recommendations from
Oracle developers.

References: oracle/python-cx_Oracle#596
Fixes: #7494
Change-Id: I0d8cc3579140aa65cacf5b7d3373f7e1929a8f85
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants