
utf8 decoding in dbapi looks like it could be sped up #473

Closed
ugtar opened this Issue Sep 6, 2016 · 3 comments

ugtar commented Sep 6, 2016

I was doing some timing tests loading columns from the db, comparing SQLAlchemy with using the DBAPI connection directly, and I discovered that psycopg2's utf-8 decoding appears to be noticeably slower than the "fallback" decoding provided by SQLAlchemy's C extensions.
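
A rough sketch of what the raw-DBAPI side of such a comparison can look like (the DSN, table and column names are placeholders; on Python 2 the UNICODE typecaster has to be registered for psycopg2 to return unicode at all):

import time
import psycopg2
import psycopg2.extensions

# placeholder DSN
conn = psycopg2.connect("dbname=test")
# make psycopg2 return unicode instead of bytes on Python 2
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE, conn)

cur = conn.cursor()
t0 = time.time()
cur.execute("SELECT id, name, description FROM customer")  # placeholder table
rows = cur.fetchall()  # utf-8 decoding happens here
print("fetched %d rows in %.3f sec" % (len(rows), time.time() - t0))
conn.close()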

Here is some info from Mike Bayer's test:

If we OTOH use native_unicode=False and use the Unicode type for the
columns we care about, that seems a lot faster, e.g. changing the
mapping to:

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(Unicode(255))
    description = Column(Unicode(255))

This will only take effect with ORM and Core loads, not raw psycopg2.
The core load of 100K rows now takes:

test_core_fetchall : Load Core result rows using fetchall. (100000 iterations); total time 0.182480 sec

this is better than the 0.37719 sec that PG's unicode handling took.
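
For reference, a minimal sketch of that setup follows; the connection URL is a placeholder, and if I'm not mistaken the engine flag is spelled use_native_unicode in SQLAlchemy's psycopg2 dialect:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://localhost/test",  # placeholder URL
    use_native_unicode=False,  # let SQLAlchemy's C extension do the decoding
)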

And here is some explanation of why SQLAlchemy has its own C extension for unicode decoding.

Well, originally we did it all ourselves when we didn't have C
extensions, and DBAPIs barely did it. Then the DBAPIs started supplying
it natively, and especially with the coming of Python 3, they all had
to; compared to SQLAlchemy doing it all in pure Python, there was no
contest. But then our C extensions came along and sped things up, and
then we started doing things like caching the codec object which is
probably what the DBAPIs aren't doing yet and gained even more speed, so
it seems like we've surpassed the DBAPIs in just this one area, which is
ridiculous because the pure C DBAPIs are always so much faster in every
way; it's quite annoying that this weird incongruity seems to be present.

Under Python 3 we generally don't have an option here, the DBAPIs now
all do unicode encoding automatically. So the architectures have been
pushed to assume that's the default, but in the case of cx_Oracle and
now apparently psycopg2 we're seeing that in Py2X their unicode
facilities still seem to perform worse than SQLAlchemy's.
There's not a clear answer. I'd prefer it if the DBAPIs that are written
in pure C anyway, like psycopg2, could just allow their performance to be
faster here; I'd maybe report it to them.

Don't know if this type of optimization is worthwhile to you but just letting you know.

The full thread is here:
https://groups.google.com/forum/#!topic/sqlalchemy/TtIel3LTGMY

dvarrazzo commented Sep 7, 2016

Thank you for the pointer, will take a look.

At my workplace, caching the codec actually provided a good speedup in the past. The codec there was only utf8 and the Python version was well known. In psycopg there is more variability, but there are probably fast paths deserving a look.

dvarrazzo commented Oct 12, 2016

I've played a bit with the idea and I think it's good stuff.

I've tried a quick test: storing in the connection a pointer to a fast C decode function for known codecs (e.g. for a utf8 connection, store the pointer to PyUnicode_DecodeUTF8). The results are totally worth more work. This script generates unicode data on the server and measures the decode time (decoding happens on fetch*(), so the operation is not I/O bound but CPU and memory access). Decoding 400K strings of 1KB each shows a 17% speedup:

$ PYTHONPATH=orig python ./timeenc.py -s 1000 -c 4096 -m 100
timing for strsize: 1000, chrrange: 4096, mult: 100
times: 2.588915, 2.310006, 2.308195, 2.305879, 2.304648 sec
best: 2.304648 sec

$ PYTHONPATH=fast python ./timeenc.py -s 1000 -c 4096 -m 100
timing for strsize: 1000, chrrange: 4096, mult: 100
times: 2.159055, 1.922977, 1.922651, 1.933926, 1.932110 sec
best: 1.922651 sec

Because the overhead paid for the codec lookup is per string, not per data size, the improvement is more pronounced when decoding the same amount of data split into more, shorter strings: 55% for 4M strings of 100B each:

$ PYTHONPATH=orig python ./timeenc.py -s 100 -c 4096 -m 1000
timing for strsize: 100, chrrange: 4096, mult: 1000
times: 5.997742, 5.909936, 5.914419, 5.967713, 6.779648 sec
best: 5.909936 sec

$ PYTHONPATH=fast python ./timeenc.py -s 100 -c 4096 -m 1000
timing for strsize: 100, chrrange: 4096, mult: 1000
times: 2.738192, 2.669642, 2.647298, 2.657130, 2.651866 sec
best: 2.647298 sec
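
In Python terms the idea is roughly the following (the names are made up for illustration; the actual change stores a C function pointer such as PyUnicode_DecodeUTF8 on the connection object):

import codecs

class ConnDecoder(object):
    """Illustration only: look the decoder up once per connection..."""
    def __init__(self, encoding):
        self._decode = codecs.getdecoder(encoding)

    def cast(self, value):
        """...and pay no codec registry lookup per fetched value."""
        if value is None:
            return None
        return self._decode(value)[0]

dec = ConnDecoder("utf-8")
value = dec.cast(b"caf\xc3\xa9")  # -> u'café'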

Other things to do:

  • the lookup can be cached for other encodings too, not only the two blessed ones for which there is a public C function in the Python API (similar, in Python, to saving codecs.getdecoder() instead of calling codecs.decode(); see the sketch after this list)
  • encoding data to the connection can be optimised the same way.
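
Something along these lines shows the per-string cost of the registry lookup that the first bullet refers to (just a sketch; exact numbers will vary):

import codecs
import timeit

data = [("x" * 100).encode("utf-8")] * 10000
cached = codecs.getdecoder("utf-8")

lookup_each_time = timeit.timeit(
    lambda: [codecs.decode(s, "utf-8") for s in data], number=50)
decoder_cached = timeit.timeit(
    lambda: [cached(s)[0] for s in data], number=50)

print("codecs.decode per string: %.3f" % lookup_each_time)
print("cached getdecoder:        %.3f" % decoder_cached)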

If someone wants to contribute to the idea, the first commit is in this branch. Any feedback or help is welcome.

dvarrazzo added this to the psycopg 2.7 milestone Oct 12, 2016

dvarrazzo commented Dec 29, 2016

Merged to master together with all the above "things to do".

dvarrazzo closed this Dec 29, 2016
