SQLAlchemy in io.sql to manage different SQL dialects #2717

mangecoeur · 2013-01-21T09:14:33Z

Currently, read_frame and write_frame in sql are specific to sqlite/mysql dialects (see #4163).

Rather than adding all possible dialects to pandas, another option is to detect whether sqlalchemy is installed and prefer to use its DB support.
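As a rough illustration of the detection step, here is a minimal sketch. The helper name `pick_sql_backend` and the two backend labels are hypothetical, made up for this example; this is not how pandas implements it.

```python
import importlib.util


def pick_sql_backend():
    """Return "sqlalchemy" if the package is importable, else "dbapi".

    Hypothetical helper, for illustration only.
    """
    # find_spec returns None when the top-level package is not installed,
    # so nothing is imported unless it is actually available.
    if importlib.util.find_spec("sqlalchemy") is not None:
        return "sqlalchemy"
    return "dbapi"


backend = pick_sql_backend()
print("using backend:", backend)
```

With a check like this, the existing sqlite/mysql code paths could remain as the fallback when SQLAlchemy is absent.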

garaud · 2013-01-21T10:30:27Z

Quite interesting. Maybe have a look at issue #1662, which deals with SQL connection improvements. I would like to contribute to these pandas features. I'll try to write something about it this week.

mangecoeur · 2013-02-11T23:43:57Z

I started work on this idea, very much Work in Progress, branch is here: https://github.com/mangecoeur/pandas/tree/sqlalchemy-integration

danielballan · 2013-03-07T21:23:37Z

I also ran across #191 -- apparently this idea has been broached before. Any progress?

mangecoeur · 2013-03-13T15:41:52Z

I've committed a couple more changes, notably starting work on an autoload_frame, which will use sqlalchemy to figure out the contents of a table and turn it into a dataframe. I still need to figure out how to handle type conversions, and to write some tests.
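The general idea (discover a table's columns instead of asking the caller for them) can be sketched with the standard library alone. This is only a stand-in: it uses `sqlite3` and `cursor.description` rather than SQLAlchemy's reflection, the function name `autoload_columns` is hypothetical, and it returns a plain dict of column lists instead of a DataFrame.

```python
import sqlite3


def autoload_columns(conn, table):
    """Read a whole table into a dict of column name -> list of values.

    Hypothetical stand-in for the autoload_frame idea: column names are
    discovered from the cursor rather than supplied by the caller.
    """
    # Quote the identifier; table names cannot be bound as parameters.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute("SELECT * FROM " + quoted)
    names = [d[0] for d in cursor.description]
    columns = {name: [] for name in names}
    for row in cursor:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [("AAA", 1.5), ("BBB", 2.25)])
columns = autoload_columns(conn, "prices")
print(columns)
```

A dict shaped like this could be handed to `pandas.DataFrame(columns)`. The type-conversion question raised above is not addressed here.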

derrley · 2013-11-22T03:35:33Z

We've built parts of an ETL tool on top of SQLAlchemy. When an extract is pointed at a database flavor that doesn't support bulk copy (read: Oracle) we simply use for row in table.select(). We've decided to move away from this, because of the overhead SQLAlchemy introduces. Plan on spending 2-4x the CPU cycles on top of your database driver to load the same number of rows. I landed on this thread as part of my hopes that pandas could do better. :)

In any case, unless this feature is always intended to load fairly small tables into dataframes, I'd recommend against going the route of SQLAlchemy as part of a library that is, in most other aspects, quite fast. SQLAlchemy's power is really its OO query builder and ORM framework. Too much cruft for something like this.

jtratner · 2013-11-22T08:08:38Z

@derrley I've experienced this with SQLAlchemy too.

mangecoeur · 2013-11-28T16:03:23Z

@derrley I have difficulty seeing how you would provide compatibility for all the DBs that SQLAlchemy supports without introducing the same amount of overhead, and adding the burden of maintaining the compatibility layer. Perhaps a better strategy would be to work with the SQLAlchemy guys to see how to optimize the kind of operations that Pandas needs to be fast.

zzzeek · 2014-01-10T03:45:45Z

@derrley the row fetching overhead of SQLAlchemy's core ResultProxy/RowProxy is nothing like 2x-4x the CPU cycles of plain DBAPI, unless you have integrated type-processing functions like in-Python unicode conversion or something like that. Within row fetching, most of what's more than negligible is ported to C functions. There may be specific aspects of your experience that were slowing it down; do you have any benchmarks illustrating your results?

zzzeek · 2014-01-10T04:14:16Z

@derrley here is an actual test against MySQL, comparing the SQLAlchemy Core result proxy with C extensions installed to the raw MySQLdb driver. To execute a query with 50K rows, fetch all the rows and fetch a single column from the row takes 44 calls / .032 sec on MySQLdb raw and 82 calls / .057 sec with SQLA core. So sure, SQLA introduces overhead but it is not very much - by the time you implement your own logic on top of the raw MySQLdb cursor, you'd be pretty much at the same place or worse: https://gist.github.com/zzzeek/8346896
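For readers who want to run a comparable measurement, here is a minimal sketch of the profiling harness. It is not the gist linked above: it uses the standard-library `sqlite3` driver as a stand-in for MySQLdb, the table and function names are made up for the example, and the numbers it prints will not match the figures quoted.

```python
import cProfile
import pstats
import sqlite3

N_ROWS = 50000  # same row count as the test described above


def make_db():
    # Build an in-memory table to query against.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x INTEGER)")
    conn.executemany("INSERT INTO t (x) VALUES (?)",
                     ((i,) for i in range(N_ROWS)))
    return conn


def fetch_one_column(conn):
    # Execute the query, fetch every row, and read one column per row.
    cursor = conn.cursor()
    cursor.execute("SELECT id, x FROM t")
    return [row[1] for row in cursor.fetchall()]


conn = make_db()
profiler = cProfile.Profile()
values = profiler.runcall(fetch_one_column, conn)
stats = pstats.Stats(profiler)
print("%d function calls in %.3f sec" % (stats.total_calls, stats.total_tt))
```

Running the same `fetch_one_column` shape once through the raw driver and once through the abstraction layer under test gives the two "calls / seconds" pairs to compare.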

zzzeek · 2014-01-10T17:20:56Z

@derrley also as far as Oracle, the SQLAlchemy cx_oracle dialect goes through (documented) effort in order to fix some issues with the driver, most notably being able to return numerics with full precision, rather than receiving floating points. There is overhead to this process which is detailed here: http://docs.sqlalchemy.org/en/rel_0_9/dialects/oracle.html#precision-numerics . If this process is specific to the performance issues you've been seeing, this feature can be turned off by specifying coerce_to_decimal=False.

derrley · 2014-01-17T18:52:55Z

Appreciate the suggestions.

Just tried both the coerce trick and the cdecimal trick, and neither prevents talking directly to cx_Oracle from being 3-4x faster, depending on the table. :/

zzzeek · 2014-01-17T19:05:16Z

@derrley if you can provide self-contained test scripts with sample tables/data I can isolate the cause of a 400% slowdown.

zzzeek · 2014-01-17T19:09:13Z

let me run my above script against an Oracle database here first just to make sure nothing funny is going on...

zzzeek · 2014-01-17T19:17:52Z

nope, nothing unusual, script + output is at https://gist.github.com/zzzeek/8479592

SQLAlchemy Core: 100058 function calls in 0.302 CPU seconds
cx_oracle: 100012 function calls in 0.263 CPU seconds

so that's around 1.2 times slower. Feel free to show me your code and also make sure you're running the C extensions.
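For reference, the ratio implied by the two timings above can be worked out directly; the variable names are just labels for the quoted figures.

```python
sqla_core_sec = 0.302  # "SQLAlchemy Core" timing quoted above
cx_oracle_sec = 0.263  # "cx_oracle" timing quoted above

ratio = sqla_core_sec / cx_oracle_sec
overhead_pct = (ratio - 1.0) * 100.0
print("ratio: %.2f, overhead: %.0f%%" % (ratio, overhead_pct))
```

That comes to a ratio of roughly 1.15, which is consistent with the "around 1.2" figure once rounded.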

zzzeek · 2014-01-17T19:20:21Z

ah, let's try again, SQLA's output type handler leaked into that, one moment

zzzeek · 2014-01-17T19:29:27Z

OK, so in both cases it's the coercion to unicode adding the majority of overhead. https://gist.github.com/zzzeek/8479592 is now updated to run both tests without any coercion - in the SQLAlchemy case we are using an event to "undo" the cursor.outputtypehandler used to coerce to unicode. I will look today into current cx_oracle releases to see if cx_oracle has decided to coerce to unicode for us yet (this is required of it in Python 3), and if so I will add version detection for this feature; otherwise, I will add a flag to turn it off with a documentation note.

with unicode coercion turned off, we again have similar results of:

SQLA core: 56 function calls in 0.113 CPU seconds
cx_oracle: 9 function calls in 0.086 CPU seconds

this is again about 1.3 times slower. Feel free to apply this event to your application:

from sqlalchemy import event
@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    dbapi_connection.outputtypehandler = None

that will disable all numeric/unicode type conversion within the cx_oracle driver.

zzzeek · 2014-01-17T19:32:17Z

I hope it's clear that when using SQLAlchemy, one needn't "plan on spending 2-4x the CPU cycles on top of your database driver to load the same number of rows." I've demonstrated that in the specific case of cx_oracle, we have converters in place to accommodate cx_oracle's default behavior of returning inaccurate decimal data and encoded bytestrings, as SQLAlchemy prefers to return the correct result first versus the fastest - normalizing behavior across DBAPIs is one of SQLAlchemy's primary features and in the case of cx_oracle it requires us to do more work than that of a driver like psycopg2. These converters can however be disabled and I will add further documentation and potential features regarding being able to customize this.
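The effect of a driver-level string converter can be tried out without an Oracle database. The sketch below is only an analogy: it uses the standard-library `sqlite3` driver and its `text_factory` setting, not cx_oracle's outputtypehandler, and the table and helper names are made up. Timings vary by machine, so no particular difference is claimed.

```python
import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 (("value-%d" % i,) for i in range(10000)))


def fetch_all(text_factory):
    # text_factory controls how the driver converts TEXT columns:
    # str decodes UTF-8 into unicode, bytes returns the raw encoding.
    conn.text_factory = text_factory
    return conn.execute("SELECT s FROM t").fetchall()


decoded = fetch_all(str)
raw = fetch_all(bytes)

# Time both settings on the same query and data.
for factory in (str, bytes):
    seconds = timeit.timeit(lambda: fetch_all(factory), number=20)
    print("%s: %.4f sec" % (factory.__name__, seconds))
```

The trade-off is the one described above: turning the converter off hands back whatever the driver produces natively, so the caller becomes responsible for decoding.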

derrley · 2014-01-17T19:32:17Z

It's sufficiently tangled up in our ETL tool (and the data I'm extracting is private).

I can probably reproduce it with fixture data over a weekend some time. The SQLAlchemy interface is much nicer to use, so I'd love if this didn't produce the slowdown (or if I was discovered to be a moron).

ubuntu@test-slave-jenkins-i-25e5480b:~$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import sqlalchemy
>>> import sqlalchemy.cprocessors
>>> print sqlalchemy.__version__
0.8.3
>>> import cx_Oracle
>>> print cx_Oracle.version
5.1.2

select * from product_component_version yields
NLSRTL 11.2.0.3.0 Production
Oracle Database 11g Enterprise Edition 11.2.0.3.0 64bit Production
PL/SQL 11.2.0.3.0 Production
TNS for Linux: 11.2.0.3.0 Production

The "fast" hack is:

  try:
    with self._engine.connect() as connection:
      # SQLAlchemy is no good for the hot path of extraction. Too much
      # overhead. Instead, use the underlying connection object and the
      # python DBAPI.
      connection = connection.connection.connection
      cursor = connection.cursor()
      cursor.arraysize = 3000
      compiled_query = query.compile(bind=self._engine)
      params = compiled_query.params

      if isinstance(self._engine.dialect,
                    sqlalchemy.dialects.sqlite.pysqlite.SQLiteDialect):
        # The SQLite dialect seems to stupidly compile expressions that
        # always have ? characters for parameters but returns a dictionary
        # representation of parameter values. In order to bridge this gap
        # (and make unit tests work), I convert the params here. I don't
        # want to do this outside of this if block, because the code seems
        # dangerous and I don't want it running against real databases
        # (which seem to compile their expressions just fine).
        params = [v for k, v in sorted(params.items())]

      logger.debug("Extract query\n %s\nparameters:\n%r",
                   compiled_query, params)
      cursor.execute(str(compiled_query), params)

      cols = [d[0].lower() for d in cursor.description]

      for row in cursor:
        yield {c: row[i] for i, c in enumerate(cols)}

  except sys.modules[type(connection).__module__].DatabaseError as e:
    msg = str(e.message)
    if any(k in msg for k in KEYBOARD_INTERRUPT_STRINGS):
      raise KeyboardInterrupt()
    raise civetl.source.DataSourceError(e)

The original SQLA code was:

  try:
    for row in self._engine.execute(query):
      yield dict(row)

  except sqlalchemy.exc.SQLAlchemyError as e:
    raise civetl.source.DataSourceError(e)

zzzeek · 2014-01-17T19:37:03Z

for your code above, use [params[key] for key in compiled_query.positiontup]. Sorting params.items() is not going to produce the correct order, that's not a sorted dictionary.

Also, if the overhead issue is on the result-fetching side, you should stick with connection.execute() and then use result.cursor to get at the raw DBAPI cursor.
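The parameter-ordering point can be seen with plain Python. In this sketch the dict and list are hypothetical stand-ins for `compiled_query.params` and `compiled_query.positiontup`; the key names and values are invented for the example.

```python
# Hypothetical stand-ins for compiled_query.params and
# compiled_query.positiontup from a positional-style dialect.
params = {"name_1": "widget", "id_1": 42}
positiontup = ["name_1", "id_1"]  # order in which the ? markers appear

# Sorting by key puts id_1 first, which does not match the statement.
by_sorted_keys = [v for k, v in sorted(params.items())]

# Following positiontup keeps each value aligned with its placeholder.
by_position = [params[key] for key in positiontup]

print(by_sorted_keys)
print(by_position)
```

Sorting only happens to work when the alphabetical order of the generated parameter names matches the order of the placeholders.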

zzzeek · 2014-01-17T22:41:52Z

I've made a change to the Oracle dialect in http://www.sqlalchemy.org/trac/ticket/2911 such that we no longer use cx_oracle's "outputtypehandler" to coerce to unicode; SQLAlchemy's own converters have minimal overhead while cx_Oracle's within Py2K seems to have full-blown Python function overhead (but oddly not when run under Py3K). So a result set with cx_oracle will in 0.9.2 no longer have any string conversion overhead for plain strings, and only minimal overhead for Python unicode. I've enhanced the C extensions to better provide for DBAPIs like cx_Oracle that sometimes return unicode and sometimes str.

jorisvandenbossche · 2014-02-07T21:44:14Z

Closing this, as SQLAlchemy integration in io.sql is now merged: #5950
