Failures with NCLOB text containing four byte Unicode emoji #596

zzzeek · 2021-12-19T17:44:23Z

Hi Anthony / Christopher -

I've narrowed down at least one of the failures I'm having re: LOB to the NCLOB datatype, and it seems to be independent of whether or not I'm using setinputsizes and whether or not I use an outputtype handler. What's most disturbing is that the failure is non-deterministic, failing only sometimes for large strings of text that have many random occurrences of characters in random orders.

This is using NLS_LANG=AMERICAN_AMERICA.AL32UTF8 so we would normally assume the full range of Unicode codepoints should round trip accurately.

The character in question is this: 🐍 , the Python snake (on some GUI elements it looks more like a duck, though if I look closely, it's still a snake).

Oracle server 18c, client version info:

>>> import sys
>>> import platform
>>> 
>>> print("platform.platform:", platform.platform())
platform.platform: Linux-5.14.9-200.fc34.x86_64-x86_64-with-glibc2.33
>>> print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
sys.maxsize > 2**32: True
>>> print("platform.python_version:", platform.python_version())
platform.python_version: 3.10.0
>>> 
>>> import cx_Oracle
>>> print("cx_Oracle.version:", cx_Oracle.version)
cx_Oracle.version: 8.3.0
>>> print("cx_Oracle.clientversion:", cx_Oracle.clientversion())
cx_Oracle.clientversion: (19, 3, 0, 0, 0)

The test case below inserts and returns a row that contains the character embedded in a randomized field of characters. the larger the field of characters, the more chance that it fails. See sample run at the end.

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")


def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type in (
        cx_Oracle.CLOB,
        cx_Oracle.NCLOB,
    ):
        return cursor.var(cx_Oracle.LONG_STRING, size, cursor.arraysize)

# fetching LOB without the output type handler also doesn't help
conn.outputtypehandler = output_type_handler


def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()


def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    # print(f"test data has {len(data.split('🐍'))} of the problematic character 🐍")

    # this actually doesn't help
    # cursor.setinputsizes(
    #     **{"x": cx_Oracle.NUMBER, "y": cx_Oracle.NCLOB, "z": cx_Oracle.NUMBER}
    # )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1] == data
    finally:
        cursor.close()
        conn.rollback()


def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

sample run

$ python test3.py 

datasize: 10
datasize: 10 test run 0 succeeded
datasize: 10 test run 1 succeeded
datasize: 10 test run 2 succeeded
datasize: 10 test run 3 succeeded
datasize: 10 test run 4 succeeded
datasize: 10 test run 5 succeeded
datasize: 10 test run 6 succeeded
datasize: 10 test run 7 succeeded
datasize: 10 test run 8 succeeded
datasize: 10 test run 9 succeeded

datasize: 25
datasize: 25 test run 0 succeeded
datasize: 25 test run 1 succeeded
datasize: 25 test run 2 succeeded
datasize: 25 test run 3 succeeded
datasize: 25 test run 4 succeeded
datasize: 25 test run 5 succeeded
datasize: 25 test run 6 succeeded
datasize: 25 test run 7 succeeded
datasize: 25 test run 8 succeeded
datasize: 25 test run 9 succeeded

datasize: 50
datasize: 50 test run 0 succeeded
datasize: 50 test run 1 succeeded
datasize: 50 test run 2 succeeded
datasize: 50 test run 3 succeeded
datasize: 50 test run 4 succeeded
datasize: 50 test run 5 succeeded
datasize: 50 test run 6 succeeded
datasize: 50 test run 7 succeeded
datasize: 50 test run 8 succeeded
datasize: 50 test run 9 succeeded

datasize: 100
datasize: 100 test run 0 failed
datasize: 100 test run 1 succeeded
datasize: 100 test run 2 succeeded
datasize: 100 test run 3 failed
datasize: 100 test run 4 succeeded
datasize: 100 test run 5 failed
datasize: 100 test run 6 succeeded
datasize: 100 test run 7 failed
datasize: 100 test run 8 failed
datasize: 100 test run 9 succeeded

datasize: 250
datasize: 250 test run 0 succeeded
datasize: 250 test run 1 failed
datasize: 250 test run 2 failed
datasize: 250 test run 3 failed
datasize: 250 test run 4 failed
datasize: 250 test run 5 failed
datasize: 250 test run 6 failed
datasize: 250 test run 7 failed
datasize: 250 test run 8 failed
datasize: 250 test run 9 failed

The text was updated successfully, but these errors were encountered:

zzzeek · 2021-12-19T17:47:54Z

just to make sure I'm not doing anything silly I ported the above program to the mysqlclient driver and ran against MySQL with charset=utf8mb4, there's no failures on that platform, so this is definitely something on the Oracle side.

zzzeek · 2021-12-19T23:46:46Z

character set info reported by the server

>>> 
>>> cursor.execute("""
... SELECT value AS db_charset
... FROM nls_database_parameters
... WHERE parameter = 'NLS_CHARACTERSET'
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> 
>>> cursor.fetchall()
[('AL32UTF8',)]
>>> cursor.execute("""
... SELECT value AS db_ncharset
... FROM nls_database_parameters
... WHERE parameter = 'NLS_NCHAR_CHARACTERSET'
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> cursor.fetchall()
[('AL16UTF16',)]
>>> cursor.execute("""
... SELECT DISTINCT client_charset AS client_charset
... FROM v$session_connect_info
... WHERE sid = SYS_CONTEXT('USERENV', 'SID')
... """)
<cx_Oracle.Cursor on <cx_Oracle.Connection to scott@oracle18c>>
>>> cursor.fetchall()
[('AL32UTF8',)]

anthony-tuininga · 2021-12-20T22:05:43Z

I am able to replicate with the script you provided. It is not, in fact, intermittent, but very consistent. It depends on the data that is being supplied to the database -- it was intermittent for you simply because you are randomly creating the string to send to the database! I'll see what the problem is and report back once I figure it out. Thanks for supplying this to me!

zzzeek · 2021-12-20T23:10:33Z

Yes I figured there's some pattern in my random production of strings that causes the issue. looks like a fairly deep issue w/ the encoding.

cjbj · 2021-12-21T07:13:55Z

@anthony-tuininga has opened an Oracle bug against the DB itself. The number is 33691692

anthony-tuininga · 2021-12-21T14:31:39Z

To provide a little more information: the issue has nothing to do with insertng the data, only with fetching it using the output type handler.

zzzeek · 2021-12-21T15:06:07Z

OK for my script above, I can take the outputtypehandler away totally and it still fails in the same way; change the row[1] to row[1].read() and I get the same pattern of early success / late failures. can you try the script below?

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")


def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()


def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1].read() == data
    finally:
        cursor.close()
        conn.rollback()


def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

anthony-tuininga · 2021-12-21T15:33:41Z

I already tried that...and I don't get any failures that way! I am using 21.3 database, however, so not sure if this is something that may have been fixed? I'll try with an older database a little later and let you know.

zzzeek · 2021-12-21T15:34:11Z

im seeing it on 18c which is not too old

anthony-tuininga · 2021-12-21T18:51:50Z

Ok. I tried with a 12.1 database. It works fine, there. I did need to add this code:

cursor.setinputsizes(y=cx_Oracle.DB_TYPE_NVARCHAR)

That's because that database is using a single-byte character set, though. If the data type is NCLOB you should use that setinputsizes() call to avoid the double conversion (and possible data loss when the primary character set is a single-byte character set). For CLOB and BLOB no setinputsizes() call should be needed.

Since you're getting failures, can you show the first place in the string where things are failing?

zzzeek · 2021-12-21T22:08:42Z

are you able to try with an 18c database? it's the same test case, so I dont know what method you used to narrow down the pattern of characters that doesnt come back, I guess you could trap the error then compare the strings character by character?

cjbj · 2021-12-21T22:30:33Z

I just started 18c XE for @anthony-tuininga

(Also our doc on setinputsizes needs some love....)

anthony-tuininga · 2021-12-21T23:43:14Z

I am able to replicate with 18c. I will look into that as well and get back to you!

anthony-tuininga · 2021-12-22T01:05:11Z

@zzzeek, can you adjust your test case to add the following?

cursor.setinputsizes(y=cx_Oracle.DB_TYPE_NVARCHAR)

just prior to the execute that inserts the row into the database? That resolves the issue for me.

Similarly, for the output type handler, use this:

    if default_type in (oracledb.DB_TYPE_CLOB, oracledb.DB_TYPE_NCLOB):
        return cursor.var(oracledb.DB_TYPE_NVARCHAR, size, cursor.arraysize)

That resolves the fetch side. We'll look at improving the documentation and suggesting an improvement to SQLAlchemy to resolve these matters long-term -- please confirm that this resolves the issue for you, too! Thanks.

zzzeek · 2021-12-22T01:24:00Z

the second test works if I use DB_TYPE_NVARCHAR as well as if I use NCLOB. But DB_TYPE_NVARCHAR is the one that wont build up a separate statement handle, so I should use that one.

the original test is giving me a new error though:

 cx_Oracle.DatabaseError: DPI-1037: column at array position 0 fetched with error 1406

here's exactly how I am running it:

import cx_Oracle
import random

conn = cx_Oracle.connect(user="scott", password="tiger", dsn="oracle18c")


def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type in (
        cx_Oracle.CLOB,
        cx_Oracle.NCLOB,
    ):
        return cursor.var(cx_Oracle.DB_TYPE_NVARCHAR, size, cursor.arraysize)

# fetching LOB without the output type handler also doesn't help
conn.outputtypehandler = output_type_handler


def setup():
    cursor = conn.cursor()

    try:
        cursor.execute("drop table long_text")
    except:
        pass

    cursor.execute(
        """
    CREATE TABLE long_text (
        x INTEGER,
        y NCLOB,
        z INTEGER
    )
    """
    )

    cursor.close()


def run_test(datasize):

    cursor = conn.cursor()

    # the third character is the failure character
    word_seed = u"ab🐍’«cdefg"

    data = u" ".join(
        "".join(random.choice(word_seed) for j in range(150))
        for i in range(datasize)
    )

    # print(f"test data has {len(data.split('🐍'))} of the problematic character 🐍")

    # this actually doesn't help
    cursor.setinputsizes(
        **{"x": cx_Oracle.NUMBER, "y": cx_Oracle.DB_TYPE_NVARCHAR, "z": cx_Oracle.NUMBER}
    )

    cursor.execute(
        "INSERT INTO long_text (x, y, z) VALUES (:x, :y, :z)",
        {"x": 5, "y": data, "z": 10},
    )

    cursor.execute(
        "SELECT long_text.x, long_text.y, long_text.z FROM long_text"
    )
    row = cursor.fetchone()

    try:
        assert row[1] == data
    finally:
        cursor.close()
        conn.rollback()


def harness(datasize, num_runs):
    for i in range(num_runs):
        try:
            run_test(datasize)
        except AssertionError as err:
            print(f"datasize: {datasize} test run {i} failed")
        else:
            print(f"datasize: {datasize} test run {i} succeeded")

setup()

for datasize in (10, 25, 50, 100, 250):
    print(f"\ndatasize: {datasize}")
    harness(datasize, 10)

anthony-tuininga · 2021-12-22T02:51:35Z

Replace the line in the output type handler that creates the variable with this one:

        return cursor.var(cx_Oracle.DB_TYPE_NVARCHAR, 131072, cursor.arraysize)

That appears to resolve the issue. The size has to be longer than 32767 in order for the "LONG" conversion to kick in. With that change your test works on the 18c database that Chris provided -- so in theory it should work for you, too. We'll look into making that a little less messy!

zzzeek · 2021-12-22T13:39:20Z

okey doke, that works, so this gives me a path to removing the BLOB setinputsize. 131072 is...number of units for a 1G size? it would be helpful if I could document the origin of this number ?

zzzeek · 2021-12-22T13:40:28Z

that is, I hope this number isn't a fixed buffer size I have to estimate based on data size, since we dont have that

anthony-tuininga · 2021-12-22T15:50:45Z

No, it just has be a number larger than 32767. Any number will do. :-) I don't like that myself so will see about making it a little less "disgusting". ;-) Thanks for confirming that the adjusted code works for you!

Adjustments made to the BLOB / CLOB / NCLOB datatypes in the cx_Oracle and oracledb dialects, to improve performance based on recommendations from Oracle developers. References: oracle/python-cx_Oracle#596 Fixes: #7494 Change-Id: I0d8cc3579140aa65cacf5b7d3373f7e1929a8f85

zzzeek added the bug label Dec 19, 2021

cjbj changed the title ~~sporadic failures with NCLOB text containing four byte Unicode emoji~~ Failures with NCLOB text containing four byte Unicode emoji Dec 21, 2021

zzzeek mentioned this issue Dec 22, 2021

update cx_oracle for BLOB/CLOB/NCLOB adjustments sqlalchemy/sqlalchemy#7494

Closed

zzzeek mentioned this issue Jun 8, 2022

DB_TYPE_NVARCHAR recipe from cx_Oracle does not seem to work with oracledb oracle/python-oracledb#12

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failures with NCLOB text containing four byte Unicode emoji #596

Failures with NCLOB text containing four byte Unicode emoji #596

zzzeek commented Dec 19, 2021

zzzeek commented Dec 19, 2021 •

edited

Loading

zzzeek commented Dec 19, 2021 •

edited

Loading

anthony-tuininga commented Dec 20, 2021

zzzeek commented Dec 20, 2021

cjbj commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

zzzeek commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

zzzeek commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

zzzeek commented Dec 21, 2021

cjbj commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

anthony-tuininga commented Dec 22, 2021

zzzeek commented Dec 22, 2021

anthony-tuininga commented Dec 22, 2021

zzzeek commented Dec 22, 2021

zzzeek commented Dec 22, 2021

anthony-tuininga commented Dec 22, 2021

Failures with NCLOB text containing four byte Unicode emoji #596

Failures with NCLOB text containing four byte Unicode emoji #596

Comments

zzzeek commented Dec 19, 2021

zzzeek commented Dec 19, 2021 • edited Loading

zzzeek commented Dec 19, 2021 • edited Loading

anthony-tuininga commented Dec 20, 2021

zzzeek commented Dec 20, 2021

cjbj commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

zzzeek commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

zzzeek commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

zzzeek commented Dec 21, 2021

cjbj commented Dec 21, 2021

anthony-tuininga commented Dec 21, 2021

anthony-tuininga commented Dec 22, 2021

zzzeek commented Dec 22, 2021

anthony-tuininga commented Dec 22, 2021

zzzeek commented Dec 22, 2021

zzzeek commented Dec 22, 2021

anthony-tuininga commented Dec 22, 2021

zzzeek commented Dec 19, 2021 •

edited

Loading

zzzeek commented Dec 19, 2021 •

edited

Loading