Thin mode should support DB_NCHARSET 'UTF8' #16

Closed
damarvin opened this issue Jun 11, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@damarvin

Connecting with encoding='UTF8' (because DB_NCHARSET is set to that; the same happens with 'utf-8'), I get 'DPY-3012: national character set id 871 is not supported by python-oracledb in thin mode'.
I do not want a national character set, just plain standard 'utf-8'.
What is the "Thin" mode for, when it does not support the basics, or do I misinterpret things? Is there a work-around or missing extra parameter?

So I propose the enhancement: Thin mode should support DB_NCHARSET 'UTF8'.

@damarvin damarvin added the enhancement New feature or request label Jun 11, 2022
@anthony-tuininga
Member

anthony-tuininga commented Jun 11, 2022

The national character set (871 - UTF8) is an older implementation of the current standard UTF-8. It is known today by the name CESU-8 and Python does not have built-in support for it. It is no longer recommended for use but (clearly!) some databases were built that way and are still in use. Adding support for CESU-8 would require writing our own encoder/decoder for that character set. One possibility, however, might be to simply defer raising the error until the first attempt to actually use the national character set is made. That sounds like it might resolve your situation, since you aren't using any NCHAR, NVARCHAR2 or NCLOB columns? For now the only option you have is to use thick mode.

@cjbj
Member

cjbj commented Jun 11, 2022

  • The differences between Thin and Thick modes are described in the documentation section "Differences between python-oracledb Thin and Thick Modes".
  • Your solutions with the current 1.0.0 release are to use Thick mode, see here, or connect to a modern database which has a national character set of AL16UTF16.
  • In this first release of the Thin mode, it already supports a heap of functionality. Work is ongoing to add more.
  • Each DB has two character sets. The 'national character set' referenced in the error is used by NCHAR, NVARCHAR2 and NCLOB columns. Thin mode does support a national character set of AL16UTF16 but not the older UTF8. (Again, this is for the national character set, not the basic database character set.)
  • Anthony's suggestion of only throwing DPY-3012 when one of the N* types is used by the app, not at connect time, is a good way forward. To fully support the older UTF8 national character set is a non-trivial amount of work, and we want to look forward, not backwards (which is also why Thin mode connects to Oracle DB 12.1 or later, whereas Thick mode can connect to older databases).
  • The old connection encoding parameter is ignored by python-oracledb, see the doc. This parameter used to relate to the basic database character set. It was the nencoding parameter that related to the national character set. This parameter is also ignored by python-oracledb.
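To make the points above concrete, here is a minimal sketch of how one might check which national character set a database uses and, if needed, connect in Thick mode. The helper names (`national_charset`, `thin_mode_supports_nchar`, `connect_thick`) are hypothetical and not part of python-oracledb; the query assumes the standard `nls_database_parameters` view is readable, and `connect_thick` assumes Oracle Client libraries are installed where `oracledb.init_oracle_client()` can find them. The check has to run over a connection that already succeeds, for example a Thick mode one.

```python
# Hypothetical helpers; only oracledb.init_oracle_client() and
# oracledb.connect() are real python-oracledb calls.
NCHAR_CHARSET_QUERY = """
    select value
    from nls_database_parameters
    where parameter = 'NLS_NCHAR_CHARACTERSET'
"""


def national_charset(connection) -> str:
    """Return the database's national character set name, e.g. 'AL16UTF16'."""
    with connection.cursor() as cursor:
        cursor.execute(NCHAR_CHARSET_QUERY)
        (value,) = cursor.fetchone()
    return value


def thin_mode_supports_nchar(connection) -> bool:
    """True if N* columns are usable in Thin mode, per the discussion above."""
    return national_charset(connection) == "AL16UTF16"


def connect_thick(user: str, password: str, dsn: str):
    """Connect in Thick mode (assumes Oracle Client libraries are available)."""
    import oracledb  # imported lazily so the helpers above work without it

    # Must be called before any connection is created in this process.
    oracledb.init_oracle_client()
    return oracledb.connect(user=user, password=password, dsn=dsn)
```

A typical use would be `conn = connect_thick(user, password, dsn)` followed by `thin_mode_supports_nchar(conn)` to see whether the database would work fully in Thin mode.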

@damarvin
Author

damarvin commented Jun 13, 2022

Thanks a lot for the fast, precise and comprehensive clarification.
Not sure I can convince the DB operator; still, I withdraw the request, since it "would require writing an own encoder/decoder".

@cjbj
Member

cjbj commented Jun 13, 2022

@damarvin I'll leave this closed but I am tracking the general problem so we know how to prioritize our efforts. I believe @anthony-tuininga's suggested enhancement will be good for many people.

@doerwalter

Since the difference between UTF-8 and CESU-8 is only how surrogates are encoded, it might be possible to implement decoding CESU-8 with Python's standard utf-8 codec and a codec error handler. See PEP 293 for details: https://peps.python.org/pep-0293/
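As a rough illustration of that idea, the sketch below uses Python's built-in `surrogatepass` error handler rather than a custom PEP 293 handler. The function names `cesu8_decode` and `cesu8_encode` are hypothetical, and this is not how python-oracledb handles the character set. It is also lenient: the decoder accepts regular 4-byte UTF-8 sequences, which strict CESU-8 does not allow.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes (sketch; also accepts plain UTF-8 input)."""
    # Step 1: let the utf-8 codec pass 3-byte encoded surrogates through
    # as individual surrogate code points instead of raising an error.
    halves = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so each high/low surrogate pair
    # is combined into one supplementary character. An unpaired
    # surrogate makes the final decode raise UnicodeDecodeError.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (sketch)."""
    # Split the text into UTF-16 code units, so supplementary characters
    # become surrogate pairs.
    utf16 = text.encode("utf-16-le")
    units = [
        int.from_bytes(utf16[i:i + 2], "little")
        for i in range(0, len(utf16), 2)
    ]
    # Encode every code unit, surrogates included, as its own UTF-8 sequence.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")
```

For characters in the Basic Multilingual Plane the result is identical to UTF-8; only supplementary characters (emoji, for example) come out as two 3-byte sequences instead of one 4-byte sequence.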

anthony-tuininga added a commit that referenced this issue Jun 22, 2022
supported; an error is now raised only when the first attempt to use
NCHAR, NVARCHAR2 or NCLOB data is made (#16).
@anthony-tuininga
Member

I've just pushed code that allows you to connect to a database using national character set UTF8 and raises the exception only upon attempting to use NCHAR, NVARCHAR2 or NCLOB data.

anthony-tuininga added a commit that referenced this issue Jul 15, 2022
supported; an error is now raised only when the first attempt to use
NCHAR, NVARCHAR2 or NCLOB data is made (#16).

4 participants