SQLAlchemy in io.sql to manage different SQL dialects #2717

Closed
mangecoeur opened this Issue Jan 21, 2013 · 21 comments

Comments

Projects
None yet
7 participants
Contributor

mangecoeur commented Jan 21, 2013

Currently, read_frame and write_frame in sql are specific to sqlite/mysql dialects (see #4163).

Rather than adding all possible dialects to pandas, another option is to detect whether sqlalchemy is installed and prefer to use its DB support.

Contributor

garaud commented Jan 21, 2013

Quite interesting. Maybe get a look at issue #1662 which deals SQL connection improvements. I would like to contribute to these pandas features. I'll try to write something about it this week.

Contributor

mangecoeur commented Feb 11, 2013

I started work on this idea, very much Work in Progress, branch is here: https://github.com/mangecoeur/pandas/tree/sqlalchemy-integration

Contributor

danielballan commented Mar 7, 2013

I also ran across #191 -- apparently this idea has been broached before. Any progress?

Contributor

mangecoeur commented Mar 13, 2013

I've commited a couple more changes, notably started work on a autoload_frame which will uses sqlalchemy to figure out the contents of a table and turn it into a dataframe. I still need to figure out how to handle type conversions as well as some tests.

derrley commented Nov 22, 2013

We've built parts of an ETL tool on top of SQLAlchemy. When an extract is pointed at a database flavor that doesn't support bulk copy (read: Oracle) we simply use for row in table.select(). We've decided to move away from this, because of the overhead SQLAlchemy introduces. Plan on spending 2-4x the CPU cycles on top of your database driver to load the same number of rows. I landed on this thread as part of my hopes that pandas could do better. :)

In any case, unless this feature is always intended to load fairly small tables into dataframes, I'd recommend against going the route of SQLAlchemy as part of a library that is, in most other aspects, quite fast. SQLAlchemy's power is really its OO query builder and ORM framework. Too much cruft for something like this.

Contributor

jtratner commented Nov 22, 2013

@derrley I've experienced this with SQLAlchemy too.

Contributor

mangecoeur commented Nov 28, 2013

@derrley I have difficulty seeing how you would provide compatibility for all the DBs that SQLAlchemy supports without introducing the same amount of overhead, and adding the burden of maintaining the compatibility layer. Perhaps a better strategy would be to work with the SQLAlchemy guys to see how to optimize the kind of operations that Pandas needs to be fast.

zzzeek commented Jan 10, 2014

@derrley the row fetching overhead of SQLAlchemy's core ResultProxy/RowProxy is nothing like 2x-4x the CPU cycles of plain DBAPI, unless you have integrated type-processing functions like in-Python unicode conversion or somnething like that. Within row fetching, most of what's more than negligible is ported to C functions. There may be specific aspects of your experience that were slowing it down, do you have any benchmarks illustrating your results?

zzzeek commented Jan 10, 2014

@derrley here is an actual test against MySQL, comparing the SQLAlchemy Core result proxy with C extensions installed to the raw MySQLdb driver. To execute a query with 50K rows, fetch all the rows and fetch a single column from the row takes 44 calls / .032 sec on MySQLdb raw and 82 calls / .057 sec with SQLA core. So sure, SQLA introduces overhead but it is not very much - by the time you implement your own logic on top of the raw MySQLdb cursor, you'd be pretty much at the same place or worse: https://gist.github.com/zzzeek/8346896

zzzeek commented Jan 10, 2014

@derrley also as far as Oracle, the SQLAlchemy cx_oracle dialect goes through (documented) effort in order to fix some issues with the driver, most notably being able to return numerics with full precision, rather than receiving floating points. There is overhead to this process which is detailed here: http://docs.sqlalchemy.org/en/rel_0_9/dialects/oracle.html#precision-numerics . If this process is specific to the performance issues you've been seeing, this feature can be turned off by specifying coerce_to_decimal=False.

derrley commented Jan 17, 2014

Appreciate the suggestions.

Just tried both the coerce trick and the cdecimal trick, and neither prevent talking directly to cx_Oracle from being 3-4x faster, depending on the table. :/

On Jan 10, 2014, at 11:21 AM, mike bayer notifications@github.com wrote:

@derrley also as far as Oracle, the SQLAlchemy cx_oracle dialect goes through (documented) effort in order to fix some issues with the driver, most notably being able to return numerics with full precision, rather than receiving floating points. There is overhead to this process which is detailed here: http://docs.sqlalchemy.org/en/rel_0_9/dialects/oracle.html#precision-numerics . If this process is specific to the performance issues you've been seeing, this feature can be turned off by specifying coerce_to_decimal=False.


Reply to this email directly or view it on GitHub.

zzzeek commented Jan 17, 2014

@derrley if you can provide self-contained test scripts with sample tables/data I can isolate the cause of a 400% slowdown.

zzzeek commented Jan 17, 2014

let me run my above script against an Oracle database here first just to make sure nothing funny is going on...

zzzeek commented Jan 17, 2014

nope, nothing unusual, script + output is at https://gist.github.com/zzzeek/8479592

SQLAlchemy Core: 100058 function calls in 0.302 CPU seconds
cx_oracle: 100012 function calls in 0.263 CPU seconds

so that's around 1.2 times slower. Feel free to show me your code and also make sure you're running the C extensions.

zzzeek commented Jan 17, 2014

ah, lets try again, SQLA's output type handler leaked into that, one moment

zzzeek commented Jan 17, 2014

OK, so in both cases it's the coercion to unicode adding the majority of overhead. https://gist.github.com/zzzeek/8479592 is now updated to run both tests without any coercion - in the SQLAlchemy case we are using an event to "undo" the cursor.outputtypehandler used to coerce to unicode. I will look today into current cx_oracle releases to see if cx_oracle has decided to coerce to unicode for us yet (this is required of it in Python 3), and if so I will add version detection for this feature; otherwise, I will add a flag to turn it off with a documentation note.

with unicode coercion turned off, we again have similar results of:

SQLA core: 56 function calls in 0.113 CPU seconds
cx_oracle: 9 function calls in 0.086 CPU seconds

this is again about 1.3 times slower. Feel free to apply this event to your application:

from sqlalchemy import event
@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    dbapi_connection.outputtypehandler = None

that will disable all numeric/unicode type conversion within the cx_oracle driver.

zzzeek commented Jan 17, 2014

I hope it's clear that when using SQLAlchemy, one needn't "plan on spending 2-4x the CPU cycles on top of your database driver to load the same number of rows." I've demonstrated that in the specific case of cx_oracle, we have converters in place to accommodate cx_oracle's default behavior of returning inaccurate decimal data and encoded bytestrings, as SQLAlchemy prefers to return the correct result first versus the fastest - normalizing behavior across DBAPIs is one of SQLAlchemy's primary features and in the case of cx_oracle it requires us to do more work than that of a driver like psycopg2. These converters can however be disabled and I will add further documentation and potential features regarding being able to customize this.

derrley commented Jan 17, 2014

It's sufficiently tangled up in our ETL tool (and the data I'm extracting is private).

I can probably reproduce it with fixture data over a weekend some time. The SQLAlchemy interface is much nicer to use, so I'd love if this didn't produce the slowdown (or if I was discovered to be a moron).

ubuntu@test-slave-jenkins-i-25e5480b:~$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import sqlalchemy
import sqlalchemy.cprocessors
print sqlalchemy.version
0.8.3
import cx_Oracle
print cx_Oracle.version
5.1.2

select * from product_component_version yields
NLSRTL 11.2.0.3.0 Production
Oracle Database 11g Enterprise Edition 11.2.0.3.0 64bit Production
PL/SQL 11.2.0.3.0 Production
TNS for Linux: 11.2.0.3.0 Production

The "fast" hack is:

  try:
    with self._engine.connect() as connection:
      # SQLAlchemy is no good for the hot path of extraction. Too much
      # overhead. Instead, use the underlying connection object and the
      # python DBAPI.
      connection = connection.connection.connection
      cursor = connection.cursor()
      cursor.arraysize = 3000
      compiled_query = query.compile(bind=self._engine)
      params = compiled_query.params

      if isinstance(self._engine.dialect,
                    sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect):
        # The SQLite dialect seems to stupidly compile expressions that
        # always have ? characters for parameters but returns a dictionary
        # representation of parameter values. In order to bridge this gap
        # (and make unit tests work), I convert the params here. I don't
        # want to do this outside of this if block, because the code seems
        # dangerous and I don't want it running against real databases
        # (which seem to compile their expressions just fine).
        params = [v for k, v in sorted(params.items())]

      logger.debug("Extract query\n %s\nparameters:\n%r",
                   compiled_query, params)
      cursor.execute(str(compiled_query), params)

      cols = [d[0].lower() for d in cursor.description]

      for row in cursor:
        yield {c: row[i] for i, c in enumerate(cols)}

  except sys.modules[type(connection).__module__].DatabaseError as e:
    msg = str(e.message)
    if any(k in msg for k in KEYBOARD_INTERRUPT_STRINGS):
      raise KeyboardInterrupt()
    raise civetl.source.DataSourceError(e)

The original SQLA code was:

  try:
    for row in self._engine.execute(query):
      yield dict(row)

  except sqlalchemy.exc.SQLAlchemyError as e:
    raise civetl.source.DataSourceError(e)

On Jan 17, 2014, at 1:05 PM, mike bayer notifications@github.com wrote:

@derrley if you can provide self-contained test scripts with sample tables/data I can isolate the cause of a 400% slowdown.


Reply to this email directly or view it on GitHub.

zzzeek commented Jan 17, 2014

for your code above, use [params[key] for key in compiled_query.positiontup]. Sorting params.items() is not going to produce the correct order, that's not a sorted dictionary.

Also, if the overhead issue on the result fetching side, you should stick with connection.execute() - then, use result.cursor to get at the raw DBAPI cursor.

zzzeek commented Jan 17, 2014

I've made a change to the Oracle dialect in http://www.sqlalchemy.org/trac/ticket/2911 such that we no longer use cx_oracle's "outputtypehandler" to coerce to unicode; SQLAlchemy's own converters have minimal overhead while cx_Oracle's within Py2K seems to have full blown Python function overhead (but oddly not when run under Py3K). So a result set with cx_oracle will in 0.9.2 no longer have any string conversion overhead for plain strings, minimal overhead for Python unicode. I've enhanced the C extensions to better provide for DBAPIs like cx_Oracle that sometimes return unicode and sometimes str.

Closing this, as SQLAlchemy integration in io.sql is now merged: #5950

@zzzeek zzzeek referenced this issue in zzzeek/sqlalchemy Mar 9, 2014

Closed

Moved speedups to a separate package. #77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment