Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding error on py3.4 #331

Closed
ilvalle opened this issue Jun 14, 2015 · 12 comments
Closed

encoding error on py3.4 #331

ilvalle opened this issue Jun 14, 2015 · 12 comments
Milestone

Comments

@ilvalle
Copy link

ilvalle commented Jun 14, 2015

The following works on python2.7 but fails on python3.4.

>>> import psycopg2
>>> psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
>>> adapted = psycopg2.extensions.adapt('ἀγοραζε')
>>> adapted.getquoted()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-6: ordinal not in range(256)
>>> psycopg2.__version__
'2.6 (dt dec pq3 ext lo64)'

Is it a bug or am I missing something?

@dvarrazzo
Copy link
Member

It crashes on Py2 too with:

adapted = psycopg2.extensions.adapt('ἀγοραζε'.decode('utf8'))

it seems a small bug, yes: normally it shouldn't be triggered, unless you really want to write Greek chars into a latin1 connection, which would fail anyway downstream. The adapter uses the connection encoding to encode the strings; if no connection is set, as here, it uses latin1 as (not enough) "sensible" default.

Normally the adapters are "prepared": this is what happens behind the scene of a query:

In [15]: cnn = psycopg2.connect('')
In [16]: adapted.prepare(cnn)
In [17]: adapted.getquoted()
Out[17]: "'\xe1\xbc\x80\xce\xb3\xce\xbf\xcf\x81\xce\xb1\xce\xb6\xce\xb5'"

I'll take a look if it would be possible to set the default encoding to utf8

@ilvalle
Copy link
Author

ilvalle commented Jun 15, 2015

Well, beside this issue, I don't see any drawback in switching the default encoding to utf8

@dvarrazzo
Copy link
Member

Update about this issue:

Changing the default encoding from latin1 to utf8 is something that would likely break existing code: as much as the user opening this ticket expected it to work with utf8, there must be someone around expecting to work with latin1. So I'm afraid bluntly switching the default in 2.6.2 is out of question.

One possibility would be to use a 'replace' strategy to emit ?s instead of crashing upon chars that can't be handled, but that's dangerous in itself as there would be strings for which simulating passing them to the connection would work, while running a query for real would crash. Or worse, if the result is really used for entry somewhere else there will be silent corruption in the data.

So I think the right solution would be to expose the encoder used on the adapter as a writeable property, leaving the default as latin1 but allowing users to change it (so that it can be customized without creating a connection, which seems useful e.g. in vertica/vertica-python#112). Maybe the default could be switched to utf8 in psycopg 2.7.

dvarrazzo added a commit that referenced this issue Jul 1, 2016
Would help using adapt(unicode) to quote strings without a connection,
see ticket #331.

Currently in heisenbug state: if test_connection_wins_anyway and
test_encoding_default run (in this order), the latter fail because the
returned value is "'\xe8 '", with an extra space. Skipping the first
test, the second succeed.

The bad value is returned by the libpq:

    ql = PQescapeString(to+eq+1, from, len);

just returns len = 2 and an extra space in the string... meh.
dvarrazzo added a commit that referenced this issue Jul 1, 2016
Would help using adapt(unicode) to quote strings without a connection,
see ticket #331.

Currently in heisenbug state: if test_connection_wins_anyway and
test_encoding_default run (in this order), the latter fail because the
returned value is "'\xe8 '", with an extra space. Skipping the first
test, the second succeed.

The bad value is returned by the libpq:

    ql = PQescapeString(to+eq+1, from, len);

just returns len = 2 and an extra space in the string... meh.
@dvarrazzo
Copy link
Member

The encoding of the adapted string is now settable. Unfortunately it doesn't work in a generic way: it only works in applications that use consistently the same encoding.

Psycopg uses PQescapeStringConn when escaping with a connection avaliable, PQescapeString otherwise. The latter will use a global encoding to validate the chars (taken from the last database it connected, it seems). This means that it will only work ok for programs that connect to a single database or to databases with a consistent encoding.

So, the feature is in, but I'll leave it only available to people who want to use it, hence I'll leave it undocumented.

@JoeSham
Copy link

JoeSham commented Mar 15, 2017

So is there currently a way to set the default encoding for adapt globally?

@dvarrazzo
Copy link
Member

@JoeSham what is your use case? the encoding is a property of the connection and the database you are talking to, not really a global thing. The only thing global, the encoding the libpq wants to use in a context where there is no connection, is out of psycopg control and not something I'd rely on.

@JoeSham
Copy link

JoeSham commented Mar 15, 2017

@dvarrazzo

My db's encoding is utf-8.
show CLIENT_ENCODING; -- UTF8

I am using airflow and there is the following line in postgres_hook.py:
psycopg2.extensions.adapt(cell).getquoted().decode('utf-8')

This gives me an error for non latin-1 characters (mostly for russian and greek alphabet):
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

I was able to "solve" it in my own test script by using prepare(), but if I don't want to change airflow's code, this solution is not usable:

[...]
con.set_client_encoding('utf8')  # this does nothing in my case, I just tried it
results = cur.fetchall()
for r in results:
    cell = r[1]
    try:
        # the following line is taken from airflow/hooks/postgres_hook.py
        psycopg2.extensions.adapt(cell).getquoted().decode('utf-8')
    except:
        print(cell)
        adapted = psycopg2.extensions.adapt(cell)
        adapted.prepare(con)
        # now it works
        print(adapted.getquoted().decode('utf-8'))

So I hoped there could be a solution where I just set some global variable or perhaps modify the adapter or something, basically anything where I don't have to modify airflow's code.

@dvarrazzo
Copy link
Member

If you don't want to modify airflow code you can write your own customised adapter for the text types. It should receive the data from the database as bytes and you can convert it to any python object the way you want. However you will change a global behaviour: if something else in the process uses psycopg it may not work as expected. Can't you just fix airflow?

To answer more concisely to your question: no, there is no thing such as a global encoding: there is only a connection encoding, plus some global implementation accidents that may or not may work for you.

@soaxelbrooke
Copy link

For other people running into this looking for a super explicit solution, add this in your adapter:

class AnalyzedTextAdapter:
    def __init__(self, text: AnalyzedText):
        self.text = text

    def prepare(self, conn):
        self.conn = conn

    def getquoted(self):
        content = adapt(self.text.content)
        content.prepare(self.conn)

        terms = adapt(self.text.terms)
        terms.prepare(self.conn)

        return f'({content},' \
               f'{adapt(self.text.sentiment)},' \
               f'{terms},' \
               f'\'{json.dumps(self.text.meta)}\'::JSONB)'

(you only need it for fields that may contain UTF-8 chars)

@Sieboldianus
Copy link

Sieboldianus commented Jul 10, 2018

I understand all earlier responses, but in my case this does not work: I don't have a connection available when I need to use Escape strings:

I am writing Escaped SQL Values to a file in the same way /Copy .. to ... does so I can import these files later with /Copy ... from ... myfilewithescapedvalues.copy

Is there a way to create a passive default connection without connecting? (e.g. a connection that has utf-8 set). Or any other way to specify utf-8 explicitly?

@f0rk
Copy link

f0rk commented Nov 16, 2018

As @dvarrazzo mentions above that encoding is exposed, you can solve this problem like so:

adapted = psycopg2.extensions.adapt(cell)
adapted.encoding = "utf-8"

@Sieboldianus
Copy link

Thank you @f0rk , I'll look into it! I don't remember if I solved this problem or surrounded it somehow..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants