Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting hexa code values instead of characters for Unicode code characters. #483

Closed
amitc005 opened this issue Oct 7, 2020 · 12 comments
Closed
Labels

Comments

@amitc005
Copy link

amitc005 commented Oct 7, 2020

  1. What versions are you using?
    platform.platform: Linux-5.4.0-48-generic-x86_64-with-debian-buster-sid
    sys.maxsize > 2**32: True
    platform.python_version: 3.6.5
    cx_Oracle.version: 6.3
    cx_Oracle.clientversion: (12, 2, 0, 1, 0)

  2. Describe the problem
    When I am trying to fetch data from a database(NLS_CHARACTERSET is set to "WE8ISO8859P15"), Unicode texts are coming as "??" which is expected because cx_oracle is using ASCII by default. But when I am running this line export NLS_LANG=AMERICAN_AMERICA.AL32UTF8 and trying to fetch the same data, Unicode texts are coming as there hexa values like this "\x80\x99"

for example --> 'â\x80\x99'

I was wondering if someone can help me in this situation.

@cjbj
Copy link
Member

cjbj commented Oct 7, 2020

See https://cx-oracle.readthedocs.io/en/latest/user_guide/globalization.html#setting-the-client-character-set

With your version of cx_Oracle try something like:

connection = cx_Oracle.connect(connectString, encoding="UTF-8",  nencoding="UTF-8")

@amitc005
Copy link
Author

amitc005 commented Oct 8, 2020

@cjbj Thanks for the reply, I am actually passing those arguments when I am creating the connection but still, I am getting hex values.

connection = cx_Oracle.connect("XXXX", "XXXXX", "192.168.30.15/XXXXXX", encoding="UTF-8", nencoding="UTF-8")

'â\x80\x99'

@cjbj
Copy link
Member

cjbj commented Oct 8, 2020

Corrupt data? Some other layer (Windows command shell?) not understanding the encoding?

What happens if you insert new data and select it?

@amitc005
Copy link
Author

amitc005 commented Oct 8, 2020

Okay, I tried to set this character in one of the columns " ’ " and when I fetched the record, it is now coming as " ¿ " this.
I also executed this query.
SELECT DISTINCT client_charset AS client_charset FROM v$session_connect_info WHERE sid = SYS_CONTEXT('USERENV', 'SID'
and the result of this query is ('AL32UTF8',)

@cjbj
Copy link
Member

cjbj commented Oct 8, 2020

Show us a complete runnable example Python script.

@amitc005
Copy link
Author

amitc005 commented Oct 8, 2020

I am actually using shell so this how it looks like now.

$ export NLS_LANG="AMERICAN_AMERICA.AL32UTF8"
$ python
Python 3.6.5 (default, Dec  3 2019, 08:05:27) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import cx_Oracle
>>> connection = cx_Oracle.connect("XXXX", "XXXXX", "XXX.XXX.XXX.XXX/XXX", encoding="UTF-8", nencoding='UTF-8')
>>> connection.cursor()
>>> cursor.execute("update rspf set rspf_external_resource_url = '’' WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'")
>>> cursor.execute("commit")
>>> result = cursor.execute("SELECT rspf_external_resource_url FROM RSPF WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'")
>>> for i in result:
...     print(i)
... 
('¿',)
>>> result = cursor.execute("SELECT DISTINCT client_charset AS client_charset FROM v$session_connect_info WHERE sid = SYS_CONTEXT('USERENV', 'SID')")
>>> for i in result:
...     print(i)
... 
('AL32UTF8',)

@amitc005
Copy link
Author

amitc005 commented Oct 8, 2020

I've also tried to use the latest oracle client library 18.5 Basic Light Package and cx_Oracle 8.0.01 but still no luck.

@anthony-tuininga
Copy link
Member

You've retrieved the client character set. But it would be important to also get the database character set as shown here. The other thing you can do is use SQL*Plus to perform the same query, remembering to set NLS_LANG before running it as you did for Python (if you use the parameter in cx_Oracle, though, you don't need to set NLS_LANG, and in cx_Oracle 8, the default is UTF-8 anyway).

@amitc005
Copy link
Author

amitc005 commented Oct 8, 2020

@anthony-tuininga Thank you so much for the information.

I think It might be the database character set problem because it is set to NLS_CHARACTERSET: WE8ISO8859P15 and I've also read here https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch2charset.htm#i1007228 that

Data loss occurs when the database character set does not include all of the characters in the client character set. The database character set is US7ASCII. The client's character set is WE8MSWIN1252, and the language used by the client is German. When the client inserts a string that contains ß, the database replaces ß with ?, resulting in lost data."

So we could be experiencing the data loss because WE8ISO8859P15 does not have the characters which I am using in my test.
And also one of our product which is actually working fine with this database setting because internally that application treats the data as a byte string, and just shoves bytes into oracle and reads them back. So when I am inserting the data into using that product and retrieving the same data back from the same database using python, I am getting the hex values because internally cx_oracle or oracle client parsing that data as a string and returning back hex values.

I tried to add byte strings as the value but in database, it stored as Unicode value "E28099"

>>> query = "Update rspf set rspf_external_resource_url = :weird_value WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'"
>>> weird_value = '’'.encode()
>>> cursor.execute(query, weird_value=weird_value)
>>> cursor.execute('commit')
>>> result = cursor.execute("SELECT * FROM RSPF WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'")
>>> for i in result:
...     print(i)
... 
('E28099',)

So I think we have to solve this problem internally.
However, I have a question that can we add the byte values and cx_oracle will send straight to the database as bytes (for varchar column type) and when we reterive back the value as it is or internally perform the conversion(oracle client might do) and send it back as sting value.

@anthony-tuininga
Copy link
Member

With Python 2, VARCHAR data is transferred to/from the database as bytes. With Python 3, however, the data is decoded into a string using the client encoding (in this case UTF-8), so the data must be stored in the database properly (with the advertised database character set) in order for the conversion to the client character set (UTF-8) to perform properly. With Python 2 (and other tools) it is possible to store data improperly in the database -- and once that is done you'll need to correct that corrupt data. There is an enhancement request (#385) that would allow you to do the same with Python 3 by simply returning/accepting bytes. This is intended solely to fix the corrupted data, not to maintain the corruption, of course! 👍

As for using byte strings, the problem there is that Oracle sees the bytes, treats them as RAW and converts them to a hex string in order to store them in the database! You can see that by doing the following:

'’'.encode().hex()

which will return the string e28099 -- exactly what you got back from the database!

@amitc005
Copy link
Author

amitc005 commented Oct 8, 2020

One of the hack we tried that we encode the return value by iso8859_15 because this the database character encoding and then decode the return value

import cx_Oracle


# Establish the database connection
userpwd = "xxxxxx"
connection = cx_Oracle.connect("xxxxx", userpwd, "192.168.30.15/xxxxxxx", encoding="utf-8")

# Obtain a cursor
cursor = connection.cursor()

# sql = "ALTER SESSION SET NLS_LANGUAGE='WE8ISO8859P15'"
# cursor.execute(sql)

sql = "SELECT rspf_external_resource_url FROM RSPF WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'"

cursor.execute(sql)

# Loop over the result set
value = None
for row in cursor:
    value = row[0]

encoded_value = value.encode("iso8859_15")
print(encoded_value)
print(encoded_value.decode())

sql = "UPDATE rspf set rspf_external_resource_url=:value WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'"

update_value = b"\xe2\x80\x99"
update_value = update_value.decode("iso8859-15")

cursor.execute(sql, value=update_value)
cursor.execute("commit")

sql = "SELECT rspf_external_resource_url FROM RSPF WHERE CS_ID = 74 and rspf_prod_n = '     1001-01'"
cursor.execute(sql)
# Loop over the result set
value = None
for row in cursor:
    value = row[0]

encoded_value = value.encode("iso8859_15")
print(encoded_value)
print(encoded_value.decode())
b'\xe2\x80\x99'
’
b'\xe2\x80\x99'
’

So If I set export NLS_LANG="AMERICAN_AMERICA.AL32UTF8"
and perform operations like before and do the encode with iso8859_15 and decode to utf-8, I get the correct result.
It works but I think we have to change the database character encoding to utf-8.

@anthony-tuininga
Copy link
Member

Yes, you definitely need to change your database encoding to AL32UTF8 -- or at least one that can store all of the characters that you want to store! Glad you have a workaround for now -- not a very pleasant one, but as you noted, it works! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants