encoding error on py3.4 #331

ilvalle · 2015-06-14T19:45:10Z

The following works on python2.7 but fails on python3.4.

>>> import psycopg2
>>> psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
>>> adapted = psycopg2.extensions.adapt('ἀγοραζε')
>>> adapted.getquoted()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-6: ordinal not in range(256)
>>> psycopg2.__version__
'2.6 (dt dec pq3 ext lo64)'

Is it a bug or am I missing something?

dvarrazzo · 2015-06-15T03:19:21Z

It crashes on Py2 too with:

adapted = psycopg2.extensions.adapt('ἀγοραζε'.decode('utf8'))

It seems to be a small bug, yes. Normally it shouldn't be triggered unless you really want to write Greek chars into a latin1 connection, which would fail downstream anyway. The adapter uses the connection encoding to encode the strings; if no connection is set, as here, it falls back to latin1 as a (not sensible enough) default.

Normally the adapters are "prepared": this is what happens behind the scenes of a query:

In [15]: cnn = psycopg2.connect('')
In [16]: adapted.prepare(cnn)
In [17]: adapted.getquoted()
Out[17]: "'\xe1\xbc\x80\xce\xb3\xce\xbf\xcf\x81\xce\xb1\xce\xb6\xce\xb5'"

I'll look into whether it would be possible to set the default encoding to utf8.
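The difference between the two outcomes above comes down to which codec encodes the string before it is quoted. Here is a minimal pure-Python sketch of that step, assuming nothing beyond the standard library: `naive_quote` is a hypothetical stand-in for illustration only, not psycopg2's implementation and not a safe SQL escaper.

```python
# 'ἀγοραζε', written with escapes so the code points are unambiguous
text = '\u1f00\u03b3\u03bf\u03c1\u03b1\u03b6\u03b5'


def naive_quote(value, encoding):
    """Hypothetical helper: double single quotes, encode, wrap in quotes.

    This only mimics the encode-then-quote order described in the thread;
    real escaping is done by libpq.
    """
    return b"'" + value.replace("'", "''").encode(encoding) + b"'"


# With the latin1 fallback (no connection), encoding fails
try:
    naive_quote(text, 'latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# With a utf8 encoding (as on a prepared adapter), it succeeds
utf8_quoted = naive_quote(text, 'utf-8')

print(latin1_ok)
print(utf8_quoted)
```

The utf8 result has the same bytes as the `getquoted()` output shown after `prepare(cnn)`.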

ilvalle · 2015-06-15T12:10:10Z

Well, besides this issue, I don't see any drawback in switching the default encoding to utf8.

dvarrazzo · 2016-06-10T13:03:33Z

Update about this issue:

Changing the default encoding from latin1 to utf8 is something that would likely break existing code: as much as the user opening this ticket expected it to work with utf8, there must be someone around expecting it to work with latin1. So I'm afraid bluntly switching the default in 2.6.2 is out of the question.

One possibility would be to use a 'replace' strategy to emit ?s instead of crashing upon chars that can't be handled, but that's dangerous in itself as there would be strings for which simulating passing them to the connection would work, while running a query for real would crash. Or worse, if the result is really used for entry somewhere else there will be silent corruption in the data.
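To make the risk concrete, here is a small pure-Python sketch of what a 'replace' strategy would do to the string from the original report. It uses only the standard codec machinery, not psycopg2, so it illustrates the encoding step rather than the adapter itself.

```python
# 'ἀγοραζε', written with escapes so the code points are unambiguous
text = '\u1f00\u03b3\u03bf\u03c1\u03b1\u03b6\u03b5'

# errors='replace' swaps every unencodable character for '?'
lossy = text.encode('latin-1', errors='replace')
print(lossy)  # b'???????'

# Decoding gives back question marks: the original text is gone
round_trip = lossy.decode('latin-1')
print(round_trip == text)  # False
```

No exception is raised at any point, which is exactly why the corruption would go unnoticed.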

So I think the right solution would be to expose the encoder used on the adapter as a writeable property, leaving the default as latin1 but allowing users to change it (so that it can be customized without creating a connection, which seems useful e.g. in vertica/vertica-python#112). Maybe the default could be switched to utf8 in psycopg 2.7.

Would help using adapt(unicode) to quote strings without a connection, see ticket #331. Currently in a heisenbug state: if test_connection_wins_anyway and test_encoding_default run (in this order), the latter fails because the returned value is "'\xe8 '", with an extra space. When the first test is skipped, the second succeeds. The bad value is returned by libpq: ql = PQescapeString(to+eq+1, from, len); just returns len = 2 and an extra space in the string... meh.

dvarrazzo · 2016-07-01T17:16:47Z

The encoding of the adapted string is now settable. Unfortunately it doesn't work in a generic way: it only works in applications that consistently use the same encoding.

Psycopg uses PQescapeStringConn when escaping with a connection available, PQescapeString otherwise. The latter will use a global encoding to validate the chars (taken from the last database it connected to, it seems). This means that it will only work ok for programs that connect to a single database or to databases with a consistent encoding.

So, the feature is in, but I'll leave it only available to people who want to use it, hence I'll leave it undocumented.

JoeSham · 2017-03-15T15:05:06Z

So is there currently a way to set the default encoding for adapt globally?

dvarrazzo · 2017-03-15T15:18:02Z

@JoeSham what is your use case? The encoding is a property of the connection and the database you are talking to, not really a global thing. The only global thing, the encoding that libpq wants to use in a context where there is no connection, is out of psycopg's control and not something I'd rely on.

JoeSham · 2017-03-15T15:29:28Z

@dvarrazzo

My db's encoding is utf-8.
show CLIENT_ENCODING; -- UTF8

I am using airflow and there is the following line in postgres_hook.py:
psycopg2.extensions.adapt(cell).getquoted().decode('utf-8')

This gives me an error for non latin-1 characters (mostly from the Russian and Greek alphabets):
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

I was able to "solve" it in my own test script by using prepare(), but if I don't want to change airflow's code, this solution is not usable:

[...]
con.set_client_encoding('utf8')  # this does nothing in my case, I just tried it
results = cur.fetchall()
for r in results:
    cell = r[1]
    try:
        # the following line is taken from airflow/hooks/postgres_hook.py
        psycopg2.extensions.adapt(cell).getquoted().decode('utf-8')
    except UnicodeError:  # covers both encode and decode failures
        print(cell)
        adapted = psycopg2.extensions.adapt(cell)
        adapted.prepare(con)
        # now it works
        print(adapted.getquoted().decode('utf-8'))

So I hoped there could be a solution where I just set some global variable or perhaps modify the adapter or something, basically anything where I don't have to modify airflow's code.

dvarrazzo · 2017-03-15T15:48:57Z

If you don't want to modify airflow code you can write your own customised adapter for the text types. It should receive the data from the database as bytes and you can convert it to any python object the way you want. However you will change a global behaviour: if something else in the process uses psycopg it may not work as expected. Can't you just fix airflow?

To answer your question more concisely: no, there is no such thing as a global encoding. There is only a connection encoding, plus some global implementation accidents that may or may not work for you.

soaxelbrooke · 2017-10-19T13:08:22Z

For other people running into this looking for a super explicit solution, add this in your adapter:

class AnalyzedTextAdapter:
    def __init__(self, text: AnalyzedText):
        self.text = text

    def prepare(self, conn):
        self.conn = conn

    def getquoted(self):
        content = adapt(self.text.content)
        content.prepare(self.conn)

        terms = adapt(self.text.terms)
        terms.prepare(self.conn)

        return f'({content},' \
               f'{adapt(self.text.sentiment)},' \
               f'{terms},' \
               f'\'{json.dumps(self.text.meta)}\'::JSONB)'

(you only need it for fields that may contain UTF-8 chars)

Sieboldianus · 2018-07-10T11:12:13Z

I understand all the earlier responses, but in my case this does not work: I don't have a connection available when I need to escape strings.

I am writing escaped SQL values to a file in the same way /Copy .. to ... does, so I can import these files later with /Copy ... from ... myfilewithescapedvalues.copy

Is there a way to create a passive default connection without connecting? (e.g. a connection that has utf-8 set). Or any other way to specify utf-8 explicitly?

f0rk · 2018-11-16T19:04:04Z

As @dvarrazzo mentions above, the encoding is exposed, so you can solve this problem like so:

adapted = psycopg2.extensions.adapt(cell)
adapted.encoding = "utf-8"

Sieboldianus · 2018-11-16T19:48:07Z

Thank you @f0rk, I'll look into it! I don't remember if I solved this problem or worked around it somehow.

dvarrazzo added this to the psycopg 2.6.2 milestone Aug 13, 2015

stas mentioned this issue Sep 24, 2015

Make array field adapter independent from the connection. coleifer/peewee#721

Closed

FredrikAppelros mentioned this issue Apr 29, 2016

Support for named parameters is broken in Python 3 vertica/vertica-python#112

Closed

dvarrazzo mentioned this issue Jun 18, 2016

using psycopg2.extensions.adapt(u"∴").getquoted() with a unicode string that can't be encoded with latin-1 #441

Closed

dvarrazzo closed this as completed Jul 1, 2016

dvarrazzo mentioned this issue Jul 1, 2016

Backslash escaping changes after creating a connection #394

Closed

zeroheure mentioned this issue Mar 23, 2017

Whole Odoo breaks if one database name use a non ascii char odoo/odoo#16002

Closed

dlackty mentioned this issue May 4, 2017

[AIRFLOW-1171] Fix up encoding for Postgres apache/airflow#2273

Closed

satterly mentioned this issue Mar 20, 2018

'utf-8' codec can't decode byte error while acking an alert using postgres backend alerta/alerta#492

Closed
