ENH: sql support #4163

Closed
hayd opened this Issue Jul 8, 2013 · 176 comments

@hayd
Contributor
hayd commented Jul 8, 2013

UPDATE: this is continued in #6292 after the refactor based on sqlalchemy.


This is a central issue to track sql support (migrated from the mysql support ticket - #2482 and #1662).

I think we're agreed that we will move to SQLAlchemy, in the hope of comprehensive cross-platform support.

Todo:

  • travis should not skip tests (e.g. mysql #4177)
  • add flavor argument to read_sql
  • handling no row returns (#3745)
  • Specify chunksize (and iterator) for read_frame (#2908)
  • read and write the index (and multi-index)
  • mangle_dup_columns for read_frame (#2738)
  • bug in write_frame replace (#4110, #2971, PR #4304, #6164)
  • option in config to specify default engine/connection for SQL (maybe?)
  • refactor SQLAlchemy/sqlite tests (@hayd was thinking different flavors could inherit tests from a base class, so tests are only flavor-specific where they obviously need to be...)

Better Dtype coercion:

Performance

  • Vbenches! (for various flavors even?)

cc @garaud @mangecoeur @danielballan @changhiskhan

@garaud
Contributor
garaud commented Jul 8, 2013

I think it would be good if the "way to go" were defined before doing the remaining tasks. I think SQLAlchemy is a good way:

  • less code
  • more maintainable
  • less specific to a given SQL engine

I don't know SQLAlchemy well. I can't tell whether all the wanted features can be implemented with it, or whether it will cause some tricky problems.

Damien G.

@hayd
Contributor
hayd commented Jul 8, 2013

@garaud I agree, I suspect it'll create fewer tricky problems than maintaining every flavour. :)

I think I've pulled in all the open issues re sql (as they'll have to be addressed in either case).

@danielballan
Contributor

And while we're addressing the big picture, write_frame ignores the index. HDFStore, by contrast, saves it. It's easy enough to promote the index to a column if you want to write it to a SQL table, but would it be better to write the index column by default? I have no good ideas for how to implement this cleanly; I'm raising the issue in case someone else can think of a way.

@jreback
Contributor
jreback commented Jul 8, 2013

you should write the index for sure (and provide index_col= on reading)

I guess this is a bit tricky because you don't have control of the table format (in HDFStore it's opaque to the user)

maybe provide an option on writing, e.g. index_col='in_table', or False for don't write it?

or have the user provide a column name (which, if it's already being written, won't then do anything)

@danielballan
Contributor

I guess the question is what to call it, then, if "index" is taken.

@danielballan
Contributor

...and whether to attempt to use it as the PRIMARY KEY

@hayd
Contributor
hayd commented Jul 8, 2013

index.name or index.names (do we have MultiIndex support?), otherwise 'index' (or if Multi: [index_0, index_1, ...])?

Can check if it's already a column name and raise; I guess that also falls under multiple columns with the same name... raise (or mangle_dup_columns first)... :s

Not sure we can use it as a PK since the index may not be unique... does a PK matter?

@hayd
Contributor
hayd commented Jul 8, 2013

@danielballan what am I supposed to be adding to get mysql tests to pass? Does travis run them?

SKIP: Cannot connect to database. Create a group of connection parameters under the heading [pandas] in your system's mysql default file, typically located at ~/.my.cnf or /etc/.my.cnf.

@danielballan
Contributor

Travis does not run them. (I wrote those tests before I had any experience with Travis, but I will start a branch to install MySQL on the full_deps build.)

Add a block to your MySQL configuration file that provides valid connection parameters for pandas to use.

[pandas]
host = localhost
user = ...
password = ...
database = any_database_that_exists
@jreback
Contributor
jreback commented Jul 10, 2013

I think #4110 should be in here/closed; is it a dup?

@danielballan
Contributor

No, not a dup, and still open I believe. @hayd can add it to the checklist.

Half the issues on the list could be closed in an afternoon's work, but maybe we should first decide whether we are moving to SQLAlchemy. @mangecoeur?

@hayd
Contributor
hayd commented Jul 10, 2013

I rebased @mangecoeur's branch cleanly (no conflicts!) to master a few days ago, but couldn't get the tests working on my system - haven't got time to play for a couple of days.

Do you mind having a go @danielballan ?

@danielballan
Contributor

Sure... To avoid git surgery down the road, I'll ask a dumb question now. I should add your fork of pandas as a remote and work on a local copy of this branch, correct?

@jreback
Contributor
jreback commented Jul 10, 2013

fyi... need to add SQLAlchemy as a build dep (in the requirements file) for full deps, @cpcloud? (it's already in ci/print_versions.py though)

@cpcloud
Contributor
cpcloud commented Jul 10, 2013

fyi you can add to the requirements file "at will" (meaning, of course, when it makes sense) and things will be pulled from PyPI unless I put up a wheel. Wheels should be used only if the package is slow to install.

@mangecoeur
Contributor

@danielballan sadly because of workload I haven't had time to look into this much - I know pretty much what needs doing but I just don't have time to actually implement and test it. At the simplest level, reading a frame using SQLAlchemy as a db abstraction looks like this:
https://gist.github.com/mangecoeur/5383970

(That's code I actually needed in production, so it works and is neater than what's in my branch.)

The code for writing is in my branch; the principle is simple: map each pandas dtype to an SQLAlchemy-supported type (most Python builtins included), create a Table, write the values.
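
A minimal sketch of that write path, assuming the SQLAlchemy 0.8-era core API (the helper name frame_to_table and the type mapping are illustrative, not pandas API):

import numpy as np
import sqlalchemy as sa

def frame_to_table(df, table_name, engine):
    def sa_type(dtype):
        # very rough pandas-dtype -> SQLAlchemy-type mapping
        if np.issubdtype(dtype, np.floating):
            return sa.Float
        if np.issubdtype(dtype, np.integer):
            return sa.Integer
        return sa.Text

    meta = sa.MetaData()
    table = sa.Table(table_name, meta,
                     *(sa.Column(col, sa_type(df[col].dtype)) for col in df.columns))
    table.create(engine)  # SQLAlchemy emits the flavor-appropriate CREATE TABLE
    rows = [dict(zip(df.columns, row)) for row in df.itertuples(index=False)]
    engine.execute(table.insert(), rows)  # write the values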

Testing is the kicker. I have very little experience with TDD, but I do know that running tests that require DBs is generally a pain. On the other hand, SQLAlchemy's cross-DB support is very good, so if it works with one DB it should work for them all, with SQLite as a kind of lowest common denominator.

Finally, I would recommend making SQLAlchemy a soft dependency (like a lot of the other IO) with a fallback to SQLite support only, since that's built into Python - there's no point preventing people from accessing SQLite just because they don't have SQLAlchemy installed. It's a simple try/except import: set an "sqlalchemy mode" flag and choose either the generic or the SQLite functions accordingly.
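
A sketch of that soft-dependency pattern (the flag name and the _read_sql_generic/_read_sql_sqlite helpers are hypothetical, just to show the shape):

try:
    import sqlalchemy
    _HAS_SQLALCHEMY = True
except ImportError:
    sqlalchemy = None
    _HAS_SQLALCHEMY = False

def read_sql(sql, con):
    if _HAS_SQLALCHEMY:
        return _read_sql_generic(sql, con)   # any SQLAlchemy-supported DB
    return _read_sql_sqlite(sql, con)        # fallback: built-in sqlite3 only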

@hayd
Contributor
hayd commented Jul 10, 2013

@danielballan I didn't get the travis build working anyway so feel free throw away my commit (I'd have to rebase after we pull in your mysql travis testing anyway).

@jreback
Contributor
jreback commented Jul 10, 2013

I think that read_frame/write_frame should become internal-only methods, and instead use the external-facing read_sql/to_sql for consistency with the rest of the i/o (currently read_sql is just aliased to read_frame). I know that you will almost certainly just read/write frames, but this leaves open the possibility of dealing with Panel/Series (Series is sort of trivial); e.g. in HDFStore, Panel is essentially converted to a multi-index frame (though storing meta-data is different in SQL, so not sure that is an option here).

@hayd
Contributor
hayd commented Jul 10, 2013

Perhaps get SQLAlchemy working (passing the same tests as MySQL) first and then worry about precisely how to call it from read_sql and to_sql... That said, other operations use an engine argument (maybe that makes sense here too).

@mangecoeur
Contributor

@jreback or maybe it should be more explicit: read_sqlite and read_sqlalchemy.
SQLAlchemy supports raw SQL as well as its own "expression syntax" (which abstracts over the different SQL flavours used by different DBs) - so it would be nice in the future if read_sqlalchemy could also accept a query object as well as an SQL string.
At the moment, if you use SQLAlchemy to read a DB with a raw SQL query, even though it connects to all DBs transparently, it's up to the user to make sure their SQL is formatted for the DB they're using - there's nothing you can really do about that for reading, but it would be nice to make it possible to use SQLAlchemy's abstractions (though this doesn't need to be done just yet).
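
A hypothetical example of what passing an expression object might look like (read_sqlalchemy is just the name floated above, not an existing function; 0.8-era select syntax):

import sqlalchemy as sa

meta = sa.MetaData()
users = sa.Table('users', meta,
                 sa.Column('id', sa.Integer),
                 sa.Column('name', sa.String))

# flavor-neutral query object; SQLAlchemy renders it for whatever DB the engine targets
query = sa.select([users.c.name]).where(users.c.id > 100)
# df = read_sqlalchemy(query, engine)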

@jreback
Contributor
jreback commented Jul 10, 2013

added an extra cool SQL label for these (CSV/HDF5 labels also exist)

@jreback
Contributor
jreback commented Jul 10, 2013

@mangecoeur I disagree; write_frame has a flavor= arg, so read_sql should too.
Then this becomes straightforward, e.g. try this using say flavor=sqlite, or use the default (SQLAlchemy if installed, otherwise sqlite?)

you could also provide a module level set_default_flavor maybe

@cpcloud
Contributor
cpcloud commented Jul 10, 2013

that could go in config

@jreback
Contributor
jreback commented Jul 10, 2013

ahh... yes, that makes more sense; a fallback really doesn't make sense, and people rarely use more than one SQL backend

@cpcloud
Contributor
cpcloud commented Jul 10, 2013

probably some of the other module level variables floating around could go in config as well

@mangecoeur
Contributor

@jreback thought about it some more and actually I agree with you now.

The SQL string given to read_sql has to be adapted by the user to fit the DB they know they are using, so there's no point having different functions for different DBs. The function should accept both DBAPI connections and SQLAlchemy engines. There's no need to specify the flavor for read, since pandas just needs to run the provided query through the given DB connection or engine; it doesn't need to modify it. Therefore read will work so long as you provide valid SQL for your DB and either a DBAPI-compliant connection or an SQLAlchemy engine for that DB.

I think however we need to distinguish between 2 distinct user stories (use cases):

  1. "I want to read an arbitrary dataset from a database using an arbitrarily complex SQL query" plus "I want to store dataframe values in an SQL table"

This case is served by read_sql and to_sql, which just need to be able to save values; a user will write an appropriate SQL query to turn them back into a DataFrame if needed. There is no need for a "flavor" for to_sql because we can infer the correct column type mapping and let SQLAlchemy generate the correct SQL syntax based on the supplied engine. However, we would like to have SQLite available as an option even if SQLAlchemy is not installed (this is what I meant by fallback, not that it would try to write to one database and fall back to writing to sqlite).

  1. "I want to store the Dataframes I'm working in an SQL DB and retrieve them transparently later."

This is different from 1. because we have to assume a save format convention, or save additional metadata. You don't want to have to specify an SQL string when loading because the save/load layer should deal with it for you. This is akin to an ORM mapping, while 1. is dealing with raw SQL.

I believe the read_sql and to_sql feature under discussion should only attempt to address use case 1. Use case 2 is a different feature and might deserve its own read_ and to_ functions.

I think it's important to keep this difference in mind in order to have a clear understanding of what the IO functions should and should not attempt to do, and what problems they're actually trying to address.

It should also make the api simpler since, as I said above, there's no need for the user to specify SQL flavours as it can all be done automatically, and we don't need any additional machinery for defining save conventions because we assume they do not apply for this specific feature.

@jreback
Contributor
jreback commented Jul 10, 2013

@mangecoeur

but if sqlalchemy is the flavor, then you don't need to specify DB-specific sql (which is a good thing), but you DO need to know the backend (I agree conn already has some of this info, so maybe the flavor can be inferred)

just trying to keep api as simple as possible (but not too simple!)

@mangecoeur
Contributor

@jreback I think you slightly misunderstood how sqlalchemy works. It provides you with a variety of abstractions in order to generate DB-specific SQL (e.g. you create a Table object and SQLAlchemy figures out the correct CREATE TABLE syntax for the DB you're using). It also deals with things like transactions and connection pools for you. However, if you tell the engine to execute a plain SQL string, it will not attempt to "translate" that string - how could it, seeing as there is in practice no "strictly correct" SQL syntax and feature set for it to translate from in the first place?
What you can do is use their expression api to write queries in python which will be translated to the correct SQL flavour.

So if you supply raw sql, you do need to know the DB you are using and to write SQL specific to that DB. SQLAlchemy also knows the db you're using, since you have to specify it when you create the engine with the connection URI, and it deals with all the communication business for you. The main strength of using SQLAlchemy in this case is not reading with a provided SQL string (where it just provides a light abstraction over DBAPI) but writing to the DB, because we can programmatically generate an SQLAlchemy Table and let it generate the correct SQL based on the db you provided through the engine. All the information needed for this to happen is contained in the engine, so no flavour is required.

The only question I see is whether you have flavour={'sqlite', 'sqlalchemy'}, rather than having two functions to_sqlite and to_sqlalchemy (or some other name).
However:

  1. it's easy to detect if sqlalchemy support is present through a try import... If it's not present you can't even try to use it, because you can't supply an sqlalchemy engine. Setting the flag just raises an error, which you could have done simply during the try import...
  2. If you set sqlite the function would accept a DBAPI connection and work with Sqlite featureset only, while setting sqlalchemy would accept an sqlalchemy engine. So the flavour is already implied by the arguments.
  3. That said, I personally don't like the idea of a single to_sql function that could have very different output depending on whether it was using sqlalchemy (with advanced type mapping) vs basic built in mapping.
@jreback
Contributor
jreback commented Jul 10, 2013

@mangecoeur

a user just wants to to_sql (in reality I don't even want to specify the connection!),

that's why flavor is useful (obviously would also have to pass host/user/pw) and with sqlite3 maybe as the default this would just work

what you are talking about is more advanced for sure, but a user shouldn't have to think too hard about this

so I am actually in favor of not requiring conn (of course if it's there, then use it), and alternatively specifying user/host/pw etc....

@hayd
Contributor
hayd commented Jul 10, 2013

@mangecoeur What is "engine" in this context?

I think it probably makes sense to copy the sqlalchemy syntax/arguments for passing in a raw_query.

I also think we should keep around at least the sqlite flavour (for one thing, we can use it to test against some of the SQLAlchemy functionality), and this should be an argument in to_sql/read_sql (probably the default). If engine/flavor is reserved (has some other meaning in sqlalchemy), we should think of some other argument name.

Agree with @jreback that as much as possible should "just work"™; perhaps a default connection could also go in config...

@mangecoeur
Contributor

@jreback ah i see, yes supplying user/pw etc instead of conn would be another way to do it.

But I don't think that's practical - it would mean hard-coding DB driver support, since you don't give the user control over which db driver to use (there are 5 or 6 for MySQL alone, and different reasons for using each), unless you copied the sqlalchemy connection URI syntax. A bigger problem is that you want more dynamic control over the connections you're using - the system I was working on involved connecting to 6 different databases and pulling and combining data from all of them, letting another part of my system manage the connections, which I passed to read_frame as needed. Creating a connection or SQLAlchemy engine and passing it to read_sql is not difficult for basic use (1 extra line) and maintains flexibility - and trust me, many people using pandas are dealing with much more complex use-cases than mine and would appreciate that flexibility!

@hayd engine in this context is: http://docs.sqlalchemy.org/en/rel_0_8/core/connections.html#

I also argue against default connections in config - the zen of python says "explicit is better than implicit" and I don't like the idea of a connection that exists "by magic". There should be no more defaults set than already exist in the DB drivers used. I'm also concerned it may be a safety risk, as it would allow someone to modify an existing database unintentionally if they accidentally didn't override the defaults. Again, with a nice documentation page it's very easy for someone to set up a db connection in 1 line, while the alternative introduces headaches for other use cases.

@jreback
Contributor
jreback commented Jul 10, 2013

@mangecoeur

how about we do this:

if conn is a string

e.g. a standard connection string (that you then pass to sqlalchemy)
(which will then open/close the db too)

pd.read_sql("postgresql+psycopg2://user.pw@host/db")

else conn provides the connection (which is opened/closed by the user)

you could also allow args (flavor,user,pw,host,db) and default flavor=sqlite3

that way a user can pass a connection, a string to their favorite db, or use the connection parms
and get sqlite3

I didn't mean default connections, just a default for the flavor

?

@mangecoeur
Contributor

@jreback I think accepting conn or the sqlalchemy string would be fine. I think adding user/pw etc. args on top of that gets cumbersome, and is mixing different concerns. I would really, really prefer that the sqlite support not be mixed with the sqlalchemy support, because the results they give could end up being very different - the last thing you want from an API is for it to surprise you! You could easily get a situation where you use this function to write to SQLite through SQLAlchemy, write the same data to sqlite through the "raw" support, and get different data in the database because of subtle differences in conversions.

I would suggest:
read_sql(sql: string, engine: {string or SqlAlchemy Engine}) => throws error if SqlAlchemy not installed
and
read_sqlite(sql: string, conn: DBAPI (default None), user, pw, host, db) => always works

@jreback
Contributor
jreback commented Jul 10, 2013

to make this back_compat (and to avoid having 2 functions), I propose this:

read_sql(sql: string,
             conn: if a string, then use ``flavor/conn_kw`` to figure out how to construct the connection
                      or use the connection (e.g. user created the connection) (or pass to ``SQLAlchemy``
                      if an ``SQLAlchemy Engine``)
             conn_parms: a dict of connection arguments (e.g. user/pw/host/db), could also be sep arguments
             flavor : string, default is ``None``, if specified will create that type of connection
                       e.g. ``sqlite3`` or ``sqlalchemy`` directly
             )

@mangecoeur

case 1:

connect to sqlalchemy
read_sql(conn=None, flavor='sqlalchemy', conn_parms=dict(user=...., host=....))

case 2:

connect to sqlite

read_sql(conn=None, flavor='sqlite', conn_parms=dict(user=....., host=....))

@danielballan
Contributor

I don't see the need for read_sqlite. To elaborate on @jreback's idea:

  • If conn is formatted like a connection string 'postgresql+psycopg2://user.pw@host/db', use SQLAlchemy and raise if it is not there.
  • If conn is formatted like a local file path, use sqlite, which (!) takes no username or password -- just a path.
  • If conn is a connection object, determine whether it is a sqlite3.Connection, and try SqlAlchemy if it is not.
  • Also accept user/passwd/host/flavor kwargs, which will always need SqlAlchemy.

So the simplest possible use case would be read_sql('data.sql').
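
A rough sketch of the dispatch described in the list above (the helper name _resolve_conn and the details are illustrative only):

import sqlite3

def _resolve_conn(conn):
    if isinstance(conn, str):
        if '://' in conn:                 # looks like an SQLAlchemy connection string
            import sqlalchemy             # raises ImportError if SQLAlchemy is missing
            return sqlalchemy.create_engine(conn)
        return sqlite3.connect(conn)      # treat as a local sqlite file path (or ':memory:')
    if isinstance(conn, sqlite3.Connection):
        return conn                       # use the built-in sqlite3 support
    return conn                           # otherwise assume an SQLAlchemy engine/connection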

@hayd
Contributor
hayd commented Jul 10, 2013

Could it also pass an engine? (The suggestion from those docs @mangecoeur linked to is that an engine should be created only once, "not per-object or per-function call".)

Does this mean a connection string is frowned upon (but maybe we could cache the created engine? :S)

@jreback
Contributor
jreback commented Jul 10, 2013

isn't an engine the same as a connection? from a user perspective?

@hayd
Contributor
hayd commented Jul 10, 2013

@jreback agree it should be in the connection argument. Just wondering, if they pass a string like 'postgresql+psycopg2://user.pw@host/db', whether we should cache, or whether we shouldn't accept it at all. (The docs seem to suggest an engine shouldn't be created on each function call, but should be "global".)

@jreback
Contributor
jreback commented Jul 10, 2013

I think if you want to cache it, then you should create it externally and pass in conn (which the sql engine shouldn't close). But if you provide a connection string (or args/flavor), then pandas should open/close for you (consistent with other i/o parts)

@jreback
Contributor
jreback commented Jul 10, 2013

my 2c

I think allowing the common connection args is fine, e.g.

read_sql(sql,user='foo',host=..,db=....) makes it pretty natural

and I don't think this will break back compat as basically making conn optional

@hayd
Contributor
hayd commented Jul 10, 2013

Perhaps I'm wrong but the impression I got was that the engine opened and closed the connection on each call. The point being it wants you to pass it an engine (and not create an engine each time), then the engine opens and closes a connection.

+1 on standard connection args.

@jreback
Contributor
jreback commented Jul 10, 2013

@hayd I think that is right

as an aside... it might make sense to have this in 2 pieces internally: basically the connection manager guy and then the ``parser`` (takes the rows and produces a frame)

then, a power user could use the parser only and manage the connections themselves (say they wanted to do connection pooling or whatever)

@hayd
Contributor
hayd commented Jul 10, 2013

That is, we'd cache the Engine (but I think we needn't worry about this yet... too early to be that clever), so:

  • If conn is formatted like a connection string 'postgresql+psycopg2://user.pw@host/db', use SQLAlchemy and raise if it is not there.
  • If conn is an Engine, use SQLAlchemy (we can shove create_engine, etc. into io.sql).
  • If conn is formatted like a local file path, use sqlite, which (!) takes no username or password -- just a path.
  • If conn is a connection object, determine whether it is a sqlite3.Connection, and try SqlAlchemy if it is not.
  • Also accept user/passwd/host/flavor kwargs, which will always need SqlAlchemy.

Not sure if the Engine would necessarily have already been given the kwargs.

@jreback
Contributor
jreback commented Jul 10, 2013

also flavor can override your last point (of trying sqlalchemy first)

@danielballan
Contributor

I'll take a shot at this, leaving all our existing code but ultimately aiming for SQLAlchemy to take over the current mysql code, once it passes all tests with the original usage.

@jreback
Contributor
jreback commented Jul 10, 2013

gr8!

@danielballan
Contributor

Progress report, requesting some input...

My revamped read_sql accepts:

  • ':memory:' -> Connect to an in-memory sqlite DB. Use sqlalchemy if it is available; fall back on pandas code based on the built-in sqlite3.
  • '/path/to/file.db' -> Connect to a sqlite DB. Ditto above...
  • sqlalchemy.engine.Connection -> Ready to use!
  • sqlalchemy.engine.Engine -> Connect to engine.connect()
  • connection parameters kwargs: dialect (a.k.a., "flavor"), username, password, database, host, optional port and driver -> Use sqlalchemy to create an Engine and connect to it.

But what if it receives one of these?

  • sqlite3.Connection Suggestion: If sqlalchemy is not available, just proceed without it (as with the first two bullet points above). If it IS available, warn that more robust support is available if the user builds the connection using SQLAlchemy.
  • MySQLdb.Connection Suggestion: Whether or not sqlalchemy is installed, warn the user that support for connections created from MySQLdb is buggy and will be removed in an upcoming release. Recommend creating one from SQLAlchemy instead, or pass the connection parameters so that read_sql can create one (using SQLAlchemy.)
@jreback
Contributor
jreback commented Jul 11, 2013

@danielballan that looks great...

the only comment I have is that IIRC, @mangecoeur mentioned that you may have a different preference for using sqlite3 even if SQLAlchemy is installed. This could easily be handled by:

arg: flavor=None

if flavor == 'sqlite3':
    ....
else:
    # try to process via sqlalchemy
@mangecoeur
Contributor

@danielballan @jreback yes that would be my preference: if SQLAlchemy is not installed, only support SQLite. The logic is that sqlite is always available since it comes with python, and therefore it makes sense to support it without additional dependencies. If SQLAlchemy is installed it should be used to support all databases (i.e. no special case for MySQL). @jreback code snippet is correct.

@hayd
Contributor
hayd commented Jul 13, 2013

@danielballan Now I look at this (was interested / checking out your branch), and probably you knew this already: I think we'll need a pretty big refactor to use SQLAlchemy notation throughout write_frame, i.e. let SQLAlchemy derive the SQL from the engine (at the moment it doesn't). I had a little go at this, but I don't think my sqlalchemy foo is up to it (yet?); I may be able to look at it later in the week.

Same for the tests: my idea was that we could have a base TestSQLAlchemy class which has self.engine = sqlite memory, and everything else is platform independent, i.e. once this works all you have to do is subclass it and override engine in setUp with a super call (or perhaps a flavor_setup method) - a sketch follows below.

The good thing being we can use sqlalchemy.types.
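
A minimal sketch of that test layout (class names and the MySQL connection URI are illustrative; only the engine changes per flavor):

import unittest
import sqlalchemy

class _TestSQLAlchemyBase(unittest.TestCase):
    def connect(self):
        # default: in-memory SQLite, always available
        return sqlalchemy.create_engine('sqlite:///:memory:')

    def setUp(self):
        self.engine = self.connect()

    def test_roundtrip(self):
        # flavor-agnostic tests go here, written only against self.engine
        pass

class TestMySQLAlchemy(_TestSQLAlchemyBase):
    def connect(self):
        # hypothetical connection URI; everything else is inherited from the base class
        return sqlalchemy.create_engine('mysql+pymysql://user:pw@localhost/pandas_test')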

@danielballan
Contributor

@hayd Ah, I hadn't pushed any interesting stuff to github yet. See my branch now -- get_connection is the bulk of my progress.

I have no SQLAlchemy-fu whatsoever; I'm just learning as I go here. From what I've seen so far, I think the substantial refactoring will be worth the trouble for types and generality over SQL dialects.

To support old code that passes a MySQLdb.Connection (at least for the next version or two) I'm thinking our new read_sql and write_sql will hand off to legacy copies of these functions as they were before the SQLAlchemy refactor. Maybe we'll stash them in a separate file, sql_legacy_support.py. sqlalchemy.engine.Connection objects are just too different to support both inside the same function.

Can I look at your go at deriving SQL from the engine?

@meteore
Contributor
meteore commented Jul 13, 2013

Hi all,

I just thought this is the right place to remind of the major problem of SQL-to-numpy: performance. This message (2006) from Francesc Alted (with benchmarks) shows that intermediate conversion to Python objects degrades performance. Does anybody know if there were any developments on DBAPI since then?

@jreback
Contributor
jreback commented Jul 13, 2013

@meteore performance is of course important
I think getting the io correct (eg types inferred correctly) and dealing with flavors properly has to be the first goal

optimization can be done for reading/writing at some point; however SQL is not optimized for array access and by definition will be slower than say HDF5

@meteore
Contributor
meteore commented Jul 13, 2013

Thanks Jeff, that is a good reference for order-of-magnitude comparisons; It seems that PGSQL could cut some of the lost terrain, but also that the man-in-the-middle of conversion to Python datastructures is still there.

@jreback
Contributor
jreback commented Jul 13, 2013

@meteore absolutely there are some optimizations

eg I have some cython routines that take a frame and replace NaN with None, for example

@hayd
Contributor
hayd commented Jul 13, 2013

@danielballan this was the commit but was really just hacking around, although the test structure may be worth using (please take a look, code in main class obviously doesn't work yet, I haven't even tried to run it!).

I don't understand engine vs meta yet, so I'm not really sure what's going on (mainly with regard to finding all tables; they seem not to stay updated with each other... not read up on it). Even something as simple as dropping all tables doesn't seem to work as expected on that front :S

But the idea seems to be that you describe your tables with the sqlalchemy syntax and it worries about writing and executing statements. My opinion is we should stay in SQLAlchemy syntax as much as possible:

In [63]: t
Out[63]: Table(u't', MetaData(bind=Engine(sqlite:///:memory:)), Column(u'x', INTEGER(), table=<t>, primary_key=True), Column(u'y', VARCHAR(), table=<t>), Column(u'z', VARCHAR(), table=<t>), schema=None)

In [64]: t.select()
Out[64]: <sqlalchemy.sql.expression.Select at 0x102e598d0; Select object>

In [65]: t.select().execute()  # prefer to use this
Out[65]: <sqlalchemy.engine.result.ResultProxy at 0x102e59ed0>

In [66]: str(t.select())  # and completely avoid this
Out[66]: 'SELECT t.x, t.y, t.z \nFROM t'

So perhaps read_frame and write_frame should accept sql.expressions... I suspect this will end up being so vastly different that we should throw the old sql stuff into a separate sql_legacy file (and not even try to if/else it all), or maybe even put this stuff in an sql_alchemy file so the sql file just contains the logic deciding whether to look at sql_alchemy or sql_legacy?
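
For example, staying inside SQLAlchemy syntax, the executed result can be turned into a DataFrame roughly like this (a sketch against the 0.8-era bound-metadata API used in the session above):

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine('sqlite:///:memory:')
meta = sa.MetaData(bind=engine)
t = sa.Table('t', meta,
             sa.Column('x', sa.Integer, primary_key=True),
             sa.Column('y', sa.String),
             sa.Column('z', sa.String))
meta.create_all()

result = t.select().execute()                        # ResultProxy, no raw SQL strings
df = pd.DataFrame(result.fetchall(), columns=result.keys())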

@danielballan
Contributor

@hayd I like the test structure, and I'm with you on staying with SQLAlchemy syntax as much as possible for code within the module. I think write_sql can be done entirely within the SQLAlchemy syntax.

I would like to support user code like 'SELECT t.x, t.y, t.z \nFROM t'. Someone who comes to pandas with SQL experience but no real understanding of SQLAlchemy should be able to feed a select statement to read_sql.

@hayd
Contributor
hayd commented Jul 13, 2013

@danielballan totally agree, excellent point.

@hayd
Contributor
hayd commented Jul 13, 2013

@danielballan I think there also needs to be a way to do it from a connection (or cursor?) and not an engine (unless you can make an engine from either of those); this is also mentioned in the relevant PEP. Which perhaps makes it trickier.

Another option would be that read_sql could actually read a table name (just like write_sql), e.g. read_sql(name='t', engine=engine); wonder if we can do that from a connection...?

@danielballan
Contributor

@hayd Absolutely. It looks like we can do it from a connection, maybe not from a cursor. (But that's probably OK.)

@mangecoeur
Contributor

@hayd since you were wondering about meta vs engine: first read these if you haven't
http://docs.sqlalchemy.org/en/rel_0_8/core/engines.html
http://docs.sqlalchemy.org/en/latest/core/schema.html#describing-databases-with-metadata

Basically engine replaces the "connection" used when you are using raw DBAPI access. It adds a level of abstraction, for example by handling connection pooling, timeouts, etc. One engine might be maintaining several connections in a pool. When you access the engine it may be opening and closing connections to provide the db cursor - you don't need to know or care. It also keeps track of DB-specific information (such as SQL dialect stuff) which allows it to take expressions and turn them into SQL.

Metadata is a fairly different beast. It contains information about the tables themselves, so column definitions, indexes, character encoding, table settings and references (in case you're not familiar, most modern DBs can handle columns where the values refer to rows in other tables, and check/enforce consistency on those references).

In short, you deal with Engine to connect to the DB and MetaData to manipulate tables. You could create one MetaData with some table definitions but apply it to several databases by doing metadata.create_all(engine) with several different engines. The engine deals with turning everything into appropriate SQL.
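
A rough illustration of that split (SQLAlchemy core; the postgres URI is hypothetical):

import sqlalchemy as sa

meta = sa.MetaData()
users = sa.Table('users', meta,
                 sa.Column('id', sa.Integer, primary_key=True),
                 sa.Column('name', sa.String(50)))

sqlite_engine = sa.create_engine('sqlite:///:memory:')
pg_engine = sa.create_engine('postgresql+psycopg2://user:pw@host/db')

# the same table definitions can be applied to several databases;
# each engine emits the CREATE TABLE syntax appropriate to its backend
meta.create_all(sqlite_engine)
# meta.create_all(pg_engine)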

@hayd
Contributor
hayd commented Jul 15, 2013

@mangecoeur it seems giving an engine leads to being able to reflect a meta (i.e. get all the table info from the database, and therefore what the dtype would be), as well as create SQL dialect stuff... and so life is much easier if you're given an engine

Is it possible to do both those from just a connection (or to create an engine from a connection?)

@mangecoeur
Contributor

@hayd indeed, because the engine knows about the DB it's connected to it can reflect lots of information from DB tables and generate MetaData (essentially reversing the process of creating metadata and applying it with an engine). You can actually reflect entire ORM object structures this way (have a look at SQLSoup). On a lower level it makes it easy to get column types which can be used to turn them into Dtypes according to some kind of mapping (note that this isn't trivial for more complicated data such as complex numbers - you need some kind of serialization/deserialization convention/configuration)
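
A short reflection sketch along those lines (assuming an existing database file and a 'users' table; SQLAlchemy core):

import sqlalchemy as sa

engine = sa.create_engine('sqlite:///existing.db')   # hypothetical database file
meta = sa.MetaData()
meta.reflect(bind=engine)                             # pull table definitions from the DB

users = meta.tables['users']
for column in users.columns:
    # column.type is an SQLAlchemy type object that a dtype mapping could be built from
    print(column.name, column.type)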

To my knowledge you can't create an engine from a connection, only a connection from an engine. Which makes sense because an engine is an abstraction on top of a connection, and allows you to select from different DB drivers to create that connection.

I would simply follow the recommendation from before and let the function accept an engine or an sqlalchemy-compliant connection URI. While that's different behaviour from currently, it appears the consensus is to change the sql-related function names anyway. The old functions could be kept for legacy access with deprecation warnings for 1 pandas release before being removed (I don't know if pandas has a deprecation policy yet).

@jreback
Contributor
jreback commented Jul 20, 2013

@hayd / @danielballan

maybe you want to have a master PR for this and then various other PR's can be incorporated, e.g. #4304

you might want to create a separate master branch to do this (conceptually easier that way I think), that you track via a PR

@hayd
Contributor
hayd commented Jul 20, 2013

I think an upstream/sql branch is probably a good/sensible idea.... I created upstream/sql for the purpose, please PR away. @danielballan

Wondering how to deal with #4304... if it's not going to make it into 0.12 should we stick it in sql branch?

I think we should get the sql/sqllegacy separation in the codebase asap (after 0.12) to make contributions to legacy easier (there will be some merge conflicts from the above pr...)

@jreback
Contributor
jreback commented Jul 20, 2013

that's fine to separate legacy and not (maybe rename sql.py to sql_legacy.py), but I think you may need to have it in the same PR;
it's pretty tightly coupled (as read_sql needs to figure out the dispatch)

@danielballan
Contributor

I segregated the files, and the tests pass, falling back where appropriate. I think I PRed into pandas/master by mistake. Still unclear how this went wrong or what I should do. I will return to this tomorrow.

@hayd
Contributor
hayd commented Aug 1, 2013

@jreback wow, that's interesting. I guess we shouldn't forget how performance critical we are (good to have something to compete with). Do you think memory footprint/speed could be improved later by cythonising (could that speed up writing to ndarrays)? Do you think we can get it down to something similar? (I don't think SQLAlchemy necessarily impacts speed... unless it does a load of slow type conversion :S) Thoughts?

@jreback
Contributor
jreback commented Aug 1, 2013

actually prob easier to support their version of pyodbc as an optional dep (use if available), with the new methods http://docs.continuum.io/iopro/pyodbc.html ( and the deps for this are simple, NOT the whole numba stack)

API looks similar to the current one

hmmm....doesn't look open source.....

@hayd
Contributor
hayd commented Aug 1, 2013

@jreback I'm confused is this part of iopro free? (Either way we could support it.)

@jreback
Contributor
jreback commented Aug 1, 2013

do you have a ref for io free? looks like maybe only in io pro.....they should open source it.....

@hayd
Contributor
hayd commented Aug 1, 2013

Ah no, I was/am confused, I think this is based off an open source project (which hasn't been updated in a couple of years), I guess it's not free - truth is in the name!

Hopefully cythonizing will get a similar speed boost.

@danielballan
Contributor

We should try to reproduce his "pandas" performance measurement, at least approximately, and then see about cythonizing to push it down. If I understand correctly, the bottleneck that iopro addresses is buried in MySQLdb and the like. Optimizing those connections, with cython or otherwise, is feasible but ambitious.

@hayd
Contributor
hayd commented Aug 5, 2013

Related discussion https://groups.google.com/forum/#!topic/pydata/NaBbrPZW8SE about "ExpressionFile: given SQL Expression Language, provide file-like CSV object." which uses chunksize to sql (think we should probably add chunksize to TODO list, ooop it's already there :) ) ...

Also, perhaps (way future) could use eval's expression language...

@hayd hayd referenced this issue Aug 6, 2013
Closed

Sql alchemy tests #4475

@jtratner
Contributor

@hayd I think the current sql branch is py3 incompatible right now (e.g., the raise statements), okay if I push some changes to fix that? I also suggest that you split these out into different classes, so you don't need to do dict lookups in every function. (I'd put up a PR for that though)

@hayd
Contributor
hayd commented Sep 12, 2013

@jtratner I have some big changes coming in order to get working (flavor-independent) tests, and I had to tweak the main code quite a bit to allow that. That said, there'll be scope for a big refactor once it's merged (with working tests)... :)

@hayd
Contributor
hayd commented Sep 12, 2013

@jtratner it feels like there ought to be a clever OOP refactor similar to Excel (?) etc. to take out the tangled con/cur/engine jumping.

@jtratner
Contributor

@hayd okay, I'll wait before working on any changes. and if you put together everything with tests...I'd be happy to refactor that afterwards.

@hayd
Contributor
hayd commented Sep 13, 2013
@jtratner
Contributor

@hayd btw - one good thing would be to change the custom errors to inherit from something other than Exception (e.g., IOError or ValueError)

@hayd
Contributor
hayd commented Sep 13, 2013

Tests pass with SQLite and SQLAlchemy in a flavor-agnostic way; I have a small bug in MySQL with NaN. Will push into the branch when this one passes :)

Needs a big ol' refactor though, which I wanted to discuss; one idea was to use an SQL class (similar to others):

class SQLHelper(PandasObject):
    def __new__(cls, cur=None, con=None, engine=None):
        # dispatch to the appropriate helper subclass based on what was passed in
        if engine is not None:
            return object.__new__(SQLHelperWithEngine)
        # etc. for cur and con
        raise SomeError('Must be called with either con, cur or engine')

    def execute(self, *args, **kwargs):
        # overridden in each subclass
        raise NotImplementedError

    # etc.

class SQLHelperWithEngine(SQLHelper):
    def execute(self, *args, **kwargs):
        # have access to self.engine
        pass

class SQLHelperWithCur(SQLHelper):
    pass

class SQLHelperWithCon(SQLHelper):
    pass

Keep execute, read_sql etc. as globals to stay agnostic (i.e. they select which class to use and work out which helper applies).

Maybe there's a better way to get cleaner code here?

@jtratner
Contributor

@hayd can you link to your current sql branch?

@hayd
Contributor
hayd commented Sep 13, 2013

Sorry, here it is: https://github.com/hayd/pandas/tree/sql_tests

And the failing test is the one at the bottom (TestSQLA_MySQLdb .test_write_row_by_row): https://travis-ci.org/hayd/pandas/jobs/11314669

@hayd
Contributor
hayd commented Sep 24, 2013

So, I think we should provide an engine constructor function (which creates an engine url/pandas_sql object from dialect, username, password, etc.), but connection and cursor should be user-defined (the user should set them up themselves) - except in the special case of an sqlite file w/o sqlalchemy.

imo these are three distinct concerns, engine / conn / cursor... and tbh I liked my factory for creating these (I think it made the logic for persistence really clear - although we weren't using connections before, it was pretty easy to drop them in)...

  • Engine (methods create and close connections, does so within SQLAlchemy loveliness)
  • Connection (methods create and close cursors), using flavor info
  • Cursor, using flavor info

I think creating the pandas_sql object (or whatever) once and querying against it, without passing the engine or connection to every method, is super useful (you can also have multiple dbs as different variables to query against), and I think these should definitely be "public" (we could pretend they are all PandasSQL objects in the repr anyway)...

@danielballan
Contributor

@hayd, can you clarify with an example how PandasSQL objects fit into the API? (P.S. I'm still game to be involved, with testing or whatever is needed, but as is now apparent, I'm not finding time to keep up with the conversation as fast as Team hayd & jtratner, so I'm content to take the back seat here.)

@hayd
Contributor
hayd commented Sep 24, 2013

@danielballan In sql atm (which is different from @jtratner's branch!), you create via a connection, cur or engine:

pandas_sql = PandasSQL(engine=..)  # obviously would use descriptive variable name for your db!
# pandas_sql = PandasSQL(con=..)

and the idea would be that you would query against that:

pandas_sql.read_sql('SELECT * ...')
pandas_sql.read_sql('table_name')  # is possible if it's an engine..
pandas_sql.to_sql(df, 'table_x')

You can still work with the old syntax too (which just calls the above):

sql.read_sql('SELECT * ...', engine=..) # or con=... or cur=...

I would propose that read_sql, to_sql and PandasSQL were the only things fed to io.api.

@danielballan
Contributor

I assume we also still support df.to_sql(table_name, con). Why shouldn't pandas_sql.read_sql('table_name') work on connections as well as engines?

@hayd
Contributor
hayd commented Sep 24, 2013

@danielballan That would be neat, I just didn't know how to do it :)

Definitely, df.to_sql(table_name, con=...) would still be needed, but I guess we could also pass in the pandas_sql object here: df.to_sql(table_name, con=pandas_sql) (maybe using the kwarg con for everything is bad..)

@jtratner
Contributor

@danielballan is it problematic to create an engine or connection separately and then pass it to read_sql every time? Only requires one extra argument...

@jtratner
Contributor

also, at everyone (I know I asked before) - test cases would be really appreciated!

@dragoljub

Are there any plans to support MSSQL flavor via pymssql?

@danielballan
Contributor

@jtratner I would say it is not problematic, and it resembles the usage familiar to MySQLdb users, FWIW.

The PandasSQL class suggested by @hayd invites a comparison with HDFStore (obviously). I think pandas_sql.to_sql and pandas_sql.read_sql seem a little redundant ("sql"). Could we devise something closer to store? Maybe this is a bad idea, but I want to put it into the discussion before the API congeals....

@dragoljub [EDITED -- I misread that at first] No, but it is reasonable to propose including this in the release. Any interest in helping?

@jtratner
Contributor

@danielballan where does the handing of MSSQL flavor via pymssql come in to the code?

@danielballan
Contributor

@jtratner Yeah, sorry, I was typing my edit when your reply came up. Misread MSSQL as MySQL, which I see we are supporting via pymysql as well as / instead of MySQLdb now.

@jtratner
Contributor

btw - is there any harm in always going to a cursor? Seems much easier to just do that in the actual querying and reading...

@hayd
Contributor
hayd commented Sep 24, 2013

@danielballan pandas_sql is just a variable name, you'd call it something descriptive to do with the database's name, and not do the sql repeating... but happy if we have something better.

Anything SQLAlchemy supports and travis can handle, we can add to the SQLAlchemy tests. May be able to support MSSQL (travis-ci/travis-ci#216) certainly SQLAlchemy works with it.

@jtratner
Contributor

Also, @danielballan please check out this refactor I'm proposing - I think it should be a lot cleaner - https://github.com/jtratner/pandas/tree/sql . I'm just not clear on how you generate a connection.

@danielballan
Contributor

I'd say always using a cursor is preferred, and as far as I know there is never harm in it. Specifically, cleanup (closing the cursor) and transactional operations (e.g., rollback in the event of an exception) are more explicit.

I agree that pandas_sql.table_names (or something like that) is much better than using a string like pandas_sql.read_sql('table_names').

@jtratner, I like BackendBase and that structure. I agree some PandasSQL object in the API, associated with an engine and a persistent connection, is needed to avoid repeating connection parameters and, moreover, opening the connection each time, which can take ~ 1 s. A connection should be opened when the Backend class is instantiated. I believe this will be generic (i.e., can be in BackendBase) with a special case for sqlite.

@hayd
Contributor
hayd commented Sep 25, 2013

@danielballan Ah, so you want the tables as properties like DataFrame columns? That's a really great idea.

@jtratner I think there is some happy middle ground between our two implementations (they are very similar!). Can we mix in the flavor and whether it's a connection/cursor (and possibly engine)? Essentially a connection wraps each cursor method call by first creating a cursor (perhaps we could make engine the same, but I'm not sure whether we lose some SQLAlchemy power by doing that or not).

I like having SQL in the class name to describe what it does (although I wouldn't use it in variable names in real life!)... so BackendBase doesn't really do it for me; HDFStore has HDF in the name, so what about SQLStore? Now I think about it, SQLsomething will make it easier to find (PandasSQL is a silly name).

@danielballan
Contributor

Yes, to be concrete, pandas_sql.tables would return ['some_table_name', 'another_table_name']. And maybe pandas_sql['some_table_name'] could effectively SELECT * FROM table_name, for consistency with HDFStore. I prefer SQLdb over SQLStore.

@jtratner
Contributor

@hayd I'd actually written out something similar to what you were discussing above.


Let's get all of this working with cursors first, then. Once we have that part working, it's very simple to wrap the engine part. In fact, I'd imagine an engine class that instantiates a cursor each time and then instantiates the appropriate backend; i.e., some class like the following, which allows you to pass args, works as a context manager, and rolls back on any exception (just need to fill in a bit of logic around creating connections):

class PandasSQLConnection(object):
    """Wrapper that maintains a connection and delegates to a particular backend based on flavor"""
    def __init__(self, flavor='sqlite', *args, **kwargs):  # connection parameters
        self.flavor = self.create_connection(flavor, *args, **kwargs)
        if self.flavor not in flavor_mapping:
            raise ValueError("Unknown flavor %s" % flavor)
        self.backend = flavor_mapping[self.flavor]

    def read_sql(self, table_name, *args, **kwargs):
        reader = self.backend(self.cursor())
        return reader.read_sql(table_name, *args, **kwargs)

    def write_frame(self, df, table_name, *args, **kwargs):
        writer = self.backend(self.cursor())
        writer.write_frame(df, table_name, *args, **kwargs)

    def create_connection(self, flavor, *args, **kwargs):
        # try sqlite
        # try postgres/other special-cased connections
        # try SQLAlchemy (in which case return 'sqla')
        con = None  # placeholder: connection-creation logic goes here
        self.con = con  # I'm pretty sure that SQLAEngine and Conn will work similarly
        return flavor

    # separating this out b/c not sure if SQLA needs to work differently,
    # or has to deal with some other non-standard weird thing
    def cursor(self):
        return self.con.cursor()

    def rollback(self):
        return self.con.rollback()  # this should always be available, but may not do anything

    def close(self):
        return self.con.close()

    def commit(self):
        return self.con.commit()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is None:
            self.commit()
        self.close()  # sends a rollback if we haven't committed
        return False  # bubble up the exception

The nice property of the above class is that it means that the backends only ever have to handle cursors, you can use the class in a with statement, and it has the potential to make it dead-simple to connect to sql databases... It also means that BackendBase never has to worry about committing changes.
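
Hypothetical usage of the wrapper sketched above (argument names are illustrative, and df is assumed to be an existing DataFrame):

with PandasSQLConnection('sqlite', database=':memory:') as store:
    store.write_frame(df, 'my_table')
    result = store.read_sql('my_table')
# on a clean exit the transaction is committed; an exception triggers a rollback on close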

This does add some complications, and after writing up the class above, I'm thinking we may be able to add a few methods to BackendBase to encapsulate the above.

@jtratner
Contributor

Also, I'm really unconvinced that sql should work like hdf5 here. SQL already has a full query language and a lot of flexibility and you can already accomplish querying by passing a cursor. Why is it helpful to reimplement or wrap things that other libraries (like SQLAlchemy and specific drivers) already provide?

@hayd
Contributor
hayd commented Sep 25, 2013

@jtratner No, sorry, I wasn't saying it should work with HDF5!

Can live with SQLConnection (drop the Pandas) :)

@jtratner
Contributor

Not work with, but work like (i.e., I don't want us to wrap querying and such like PyTables does).

Also @danielballan here's potential get/setitem, it just means setting whatever args or kwargs need to be set ahead of time.

def __getitem__(self, name):
    return self.read_sql(name)
def __setitem__(self, name, df):
    return self.write_sql(df, name)
@hayd
Contributor
hayd commented Sep 25, 2013

Also, one of the key things is that read_sql accepts sql! So it's sql_statement_or_table_name (if it's a table_name then we write the sql statement for the user).

Would be cool to have tab completion for table names (or am I being far too fussy :) ), so on init we would run a query to get all tables and then add them as properties and/or get items.

@jtratner
Contributor

@hayd you'd do what DataFrame does: add them to the list in __dir__ (i.e., object.dir(self) + list(self.tables)) and then override __getattr__ to first try to look up the attribute on the object and then see if it's in the tables.
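
A rough sketch of that idea (a hypothetical SQLStore-style class; an SQLAlchemy engine and a read-table-by-name method are assumed):

class SQLStore(object):
    def __init__(self, engine):
        self.engine = engine
        self.tables = list(engine.table_names())   # table names read once at init

    def read_sql(self, table_name):
        pass  # read the named table into a DataFrame (not shown here)

    def __dir__(self):
        # expose table names to tab completion alongside normal attributes
        return sorted(set(dir(type(self))) | set(self.__dict__) | set(self.tables))

    def __getattr__(self, name):
        # called only when normal attribute lookup fails
        if name in self.__dict__.get('tables', ()):
            return self.read_sql(name)
        raise AttributeError(name)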

@jtratner
Contributor

I'll try to pull something together tonight. It may make sense to merge the backend and SQLConnection object... I'll see.


@hayd
Contributor
hayd commented Sep 27, 2013

Thinking again about this, perhaps SQLStore is better than SQLConnection since it's not a connection object...

@jreback
Contributor
jreback commented Oct 3, 2013

how's the sql stuff coming?

@jreback
Contributor
jreback commented Oct 7, 2013

@hayd what's the SQL status?

@cmorgan
cmorgan commented Oct 10, 2013

i too would be keen to hear news on this, with a view to contributing

@stringfellow

I think there might be a bug in the postgres flavour - it calls con.commit() in read_frame() which is not always usable (since you can be operating in a transaction-free environment). I guess it should catch the TransactionManagementError and carry on instead?

@danielballan
Contributor

@stringfellow Yes, I think that's the way to go for now. We'll be more careful with transactions in this rewrite.

We have two possible designs to reconcile, @hayd's PandasSQL and @jtratner's BackendBase. Once the bones are in place, I think it would be possible to divide tasks among more contributors. @jtratner, were you going to pull together a branch merging these designs?

@jtratner
Contributor

@stringfellow I think what we're going to do is move all the transactional handling to the user if they pass in a connection, and if we create the connection, we'll start off a transaction (where possible) and always roll back if there's an error.

@jtratner
Contributor

The branch I posted as PR pretty much already does all the merging of the designs (because it converted what @hayd had already done 😄) except that it's not hooked into the connection-creating function - if I have time this weekend, I'll try to work on that.

My issue is tests - we don't have enough of them and that makes me concerned about correctness. Definitely going to keep working on this, but if worst comes to worst, we can add this in as an experimental module and explicitly make no promises about it working correctly.

@stringfellow

@jtratner @danielballan Great! Thanks for replying! I don't know if it is useful, but as a use case - I'm using the connection object supplied by Django (and doing some raw SQL with it).

@jtratner
Contributor

Steve - can you send some minimal test cases of how you'd use it? Bonus points if you can provide a minimal DB seed too :)

@stringfellow

@jtratner how do you want it? I guess it will be a whole Django project - I can't attach tar here can I?

@stringfellow

@jtratner Actually - I think I'm just not doing it right. For anyone that gets here via some search, this only really affects you when you are in django's shell, and you can get round it like this:

>>> from django.db import transaction
>>> transaction.enter_transaction_management()
>>> transaction.managed(True)
>>> transaction.is_managed()
True

I might stackoverflow this one too, just in case. Thanks folks!

@jtratner
Contributor

Need to think about this - might make sense to create a separate pydata repo for SQL acceptance testing (vs. unit tests in pandas proper). I don't think it's worth the time to seed multiple DBs for every test (esp when SQL is unlikely to be a moving target after we finish the refactoring).


@hayd
Contributor
hayd commented Oct 22, 2013

@jtratner perhaps one option is to trigger an sql travis build, similar to "travis faster please" (say, if the commit message has sql in it), where it installs dbs and sets up seeds etc.?

@jtratner
Contributor

Sure. For the moment we can see how big the sql branch with tests gets and then go from there.

@danielballan
Contributor

Note to self/selves: I saved a DataFrame to a SQL database, and I wish I could query rows based on location. The SQL database implicitly creates a unique, sequential primary key. Pandas should create it explicitly so that we have access to it and can retrieve chunks by location.
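
A hedged sketch of what that could allow, assuming the post-refactor read_sql/to_sql signatures being discussed in this thread (table and column names are made up):

# write the index out as an explicit integer column so it can act as a key
df.reset_index().to_sql('measurements', engine, index=False)

# later, retrieve a chunk of rows by position using that column
chunk = pd.read_sql('SELECT * FROM measurements ORDER BY "index" LIMIT 1000 OFFSET 5000',
                    engine)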

@jtratner
Contributor

I don't want to hold this up and I don't think I'm going to have time to finish up the changes I proposed any time soon. @danielballan or @hayd do you want to take the lead on this? If you want to incorporate that branch I proposed go ahead, your call.

@danielballan
Contributor

Maybe. What's the trajectory in light of this and Wes's interest in a "low-level integration with ODBC"?

@cmorgan
cmorgan commented Nov 22, 2013

Why not rely on SQLAlchemy to manage SQL dialects and connectivity? Then ODBC integration is handled with:
import urllib
import sqlalchemy
params = urllib.quote_plus("DRIVER={FreeTDS};Server=my.db.server;Database=mydb;UID=myuser;PWD=mypwd;TDS_Version=8.0;Port=1433;")
sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=" + params).connect()

@danielballan
Contributor

@cmorgan We were (maybe are) headed in that direction. We drafted some code that introduced a SQLAlchemy dependency, handling engines and so on. The idea is to rely on SQLAlchemy as much as possible, but also accept user-defined connections made through sqlite3, MySQLdb, etc. (At a minimum, we need to support those or we'll break old code.) SQLAlchemy handles data types from different flavors, and that would save us a lot of trouble.

But the links above suggest that (1) SQLAlchemy is lacking in performance, and pandas is all about speed; and (2) there is some interest in a more ambitious, low-level solution.

@mangecoeur
Contributor

@danielballan @hayd @jtratner Just wondering what the current status of SQL support is? I'm finally back on this topic and am available to help out. What is the most up-to-date branch? I cloned @hayd's "sql_tests_amazing" branch since it seems to have been updated recently. Will give it a whirl in real-world use and see how it holds up.

@mangecoeur
Contributor

I forked the sql branch of the main repo: https://github.com/mangecoeur/pandas/tree/mangecoeur_sql

I've cleaned up the current sql.py file and worked with it in SQLAlchemy mode, but not yet in legacy mode. You can go look at my changes

@hayd
Contributor
hayd commented Dec 4, 2013

@mangecoeur I'm not really sure where to see your commit(s) ?

...I'll rebase the pandas/sql branch. I have done a little more on this, but I don't think I was able to get it passing MySQL tests, so more eyes would be great. I should really go through my dead branches!

The status is that this was left in the air with two possible ways forward: an SQLStore (a la HDFStore) or BackendBase ( @jtratner )... ?

@jtratner
Contributor
jtratner commented Dec 4, 2013

@hayd - really it's just that someone needs to take the BackendBase code
and finish it up so it works with the crazy connection function. Then we
can add in whatever SQLStore features we want. Just needs to not mish-mash
all the different SQL syntax together.

@mangecoeur
Contributor

@hayd ah forgot to push my commits https://github.com/mangecoeur/pandas/commits/mangecoeur_sql
I mostly just refactored things that were in the pandas/sql branch to be cleaner and removed unused code. Because I actually need this for work I'm doing I didn't test much, just made sure the stuff I use actually works (and it does beautifully!)

I personally would not recommend pushing for an SQLStore OR BackendBase at this point; instead, focus on tests, API, docs, and getting this into a pandas release ASAP rather than blocking on extra features - we have the read/write functions and a nice flexible system for handling SQL backend support.

I would go so far as to make SQLStore/BackendBase a new ENH proposal that extends this one - the nature of SQL stores means that while something like an HDFStore is useful in some situations it's just one use case on top of the lower-level ability to read/write tabular data from an SQL query.

Also, the SQLAlchemy-specific "read a table by name" function is awesome (as the code comment says), though we need to choose a name - in my version it's just "read_table", but we need something that's clear and concise, and docs to go with it.

@hayd
Contributor
hayd commented Dec 6, 2013

Cool, will take a look over the next couple of days. Agree we should merge soon after 0.13 release.

Since read_sql takes a string of SQL, we could just check first whether it is a table name; if it is, use the awesome function. Then there's no need for a separate global function?
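A minimal sketch of that dispatch (1.x-era SQLAlchemy calls; the function body is illustrative, not the branch's actual code):

import pandas as pd
import sqlalchemy

def read_sql(sql, engine):
    # If the string names an existing table, read the whole table via reflection;
    # otherwise treat it as a query.
    if engine.has_table(sql):
        meta = sqlalchemy.MetaData()
        table = sqlalchemy.Table(sql, meta, autoload=True, autoload_with=engine)
        result = engine.execute(table.select())
    else:
        result = engine.execute(sql)
    return pd.DataFrame(result.fetchall(), columns=result.keys())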

RE names, I hate the names read_sql and to_sql (I find them really confusing); I think they should change in future to read_db and to_db (perhaps it's just me who finds them confusing)...

@mangecoeur
Contributor

Looking again over the code there's still API cleanup to do. This includes:

write_sql should be to_sql to match up to the actual docs! http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-12-0-july-24-2013

The "retry" option is pointless, IMHO this doesn't handle a general use case and it should be up to the user to manage errors in their DB connection - if it failed once, why should it succeed the second time? But if retying works, why just try twice? Its pretty arbitrary. If a user thinks their DB connection is flaky they need to handle that appropriately . On top of that, its only half implemented.

The legacy handler for "append" option appears to handle a case that was never documented (or working) in the first place, I think we can safely nuke this.

The read_sql function accepts a bunch of username/password/driver info, doesn't do anything with it at the moment. IMHO this is already covered by the DBAPI for creating a connection, is DB driver specific (e.g some MYSQL drivers accept a "passord" arg, others "passwd" - do we really want to deal with that?), and so is kinda redundant for us to include - by just accepting a connection object we let the dbapi implementation handle it

The option to supply either a connection or a cursor is causing some confusion in the code - we need to think about the DB connection lifecycle and decide how we want this to work - do we expect the user to just provide connections and we manage the cursors, or do we let the user manage cursors (e.g. for custom parallel handling). Does handling both make sense, and does the implementation actually support it?

I don't think the "read table by name" function should be merged with the others, since it depends on SQLAlchemy. I don't see any advantage in merging it either; it means combining too many different bits of functionality into one place (harder to test), making the function more complex for a user to understand rather than easier.

Am working my way through these in my repo, feel free to change, comment, etc

@mangecoeur
Contributor

Update on connection/cursor (non-SQLAlchemy mode only):
It's good practice to allow users to use both the top-level utility functions and the classes that actually provide the functionality.

With that in mind, it makes sense to think of PandasSQL as a wrapper around either the SQLAlchemy engine or the DBAPI connection.

In the DBAPI case it should be the connection, because cursors get "used up" when you fetch data from them. So if you created a PandasSQL object wrapping a cursor, you could only use it once:

pd_sql = PandasSQL(cursor, ...)
a = pd_sql.read_sql(...)  # fine
b = pd_sql.read_sql(...)  # doesn't work, because the cursor has already executed and fetched

on the other hand, if PandasSQL wraps the connection

pd_sql = PandasSQL(connection, ...)
a = pd_sql.read_sql(...)  # fine
b = pd_sql.read_sql(...)  # also fine, because we get a new cursor for the new function call

Therefore we need PandasSQLWithCon instead of PandasSQLWithCur. It may be possible to still accept a cursor in the utility function if we can get the connection from it.

Please let me know if I'm wrong about this...
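For illustration, the plain-DBAPI behaviour behind that reasoning (a sqlite3 sketch, nothing pandas-specific):

import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (x INTEGER)')
con.executemany('INSERT INTO t VALUES (?)', [(1,), (2,)])

# Wrapping a cursor: the result set is consumed by the first fetch.
cur = con.cursor()
cur.execute('SELECT * FROM t')
print(cur.fetchall())  # [(1,), (2,)]
print(cur.fetchall())  # [] -- already used up

# Wrapping the connection: each call gets a fresh cursor.
for _ in range(2):
    cur = con.cursor()
    cur.execute('SELECT * FROM t')
    print(cur.fetchall())  # [(1,), (2,)] both times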

@tony
tony commented Jan 10, 2014

Hey. As an update on the SQLAlchemy front, it's on Github now at https://github.com/zzzeek/sqlalchemy.

#2717 mentioned something about incorporation of sqlalchemy. Are there any other places where potential sqlalchemy<->pandas development is taking place not mentioned here? @zzzeek created https://bitbucket.org/zzzeek/calchipan/ before. It would be good if people involved with pandas and sqlalchemy knew about this; they may be interested. 🎱

@y-p
Contributor
y-p commented Jan 13, 2014

calchipan is about using a pandas DataFrame as a backend database; this issue is about pandas
grabbing data from databases (and writing back?).

Seeing as SQLAlchemy is an ORM, what's the selling point for it here? Most of that abstraction
layer doesn't seem to apply here.

Also, what's the pain point in supporting multiple databases? Are SQL flavors that different
for the limited scope of pandas<->db interaction? Divergent type support across DBs?

I'd appreciate a concise description of what this issue is trying to tackle;
is there a consensus?

@mangecoeur
Contributor

@y-p SQLAlchemy is not just an ORM - it consists of a Core and an ORM part. The Core is what we use, which provides "SQL rendering engine, DBAPI integration, transaction integration, and schema description services".

SQL flavors ARE significantly different - different native type support, subtle SQL syntax differences, string escaping rules, text encoding handling, etc. These alone are complicated enough to warrant the use of SQLAlchemy; I speak from experience.

Using SQLAlchemy's engine to manage connections also frees the user from a lot of the subtleties of connection and transaction life-cycles (e.g. dealing with connection pool recycling on MySQL).

SQLAlchemy also provides DB and table reflection tools, which are very useful and whose implementation varies a lot between databases. This makes it possible to automatically create SQL tables based on a pandas DataFrame and, likewise, to build DataFrames from named SQL DB tables and get the correct types - this is only realistically possible using SQLAlchemy; see the read_table function in my sql.py version in my branch:

https://github.com/mangecoeur/pandas/blob/mangecoeur_sql/pandas/io/sql.py
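A minimal sketch of the reflection machinery being referred to (1.x-era SQLAlchemy calls, independent of the branch's actual read_table implementation):

import sqlalchemy
import pandas as pd

engine = sqlalchemy.create_engine('sqlite:///:memory:')
engine.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, price REAL, label TEXT)')
engine.execute("INSERT INTO demo VALUES (1, 9.5, 'a')")

# Reflection discovers the table and its column types without hand-written schema code.
meta = sqlalchemy.MetaData()
meta.reflect(bind=engine)
demo = meta.tables['demo']
print([(c.name, c.type) for c in demo.columns])

# Read the named table straight into a DataFrame; the reflected types can then
# drive dtype coercion on the pandas side.
result = engine.execute(demo.select())
frame = pd.DataFrame(result.fetchall(), columns=result.keys())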

SQLAlchemy also provides a lot of scope for building more advanced SQL-DB features in the future, such as mirroring the HDFStore functionality provided via PyTables.

In order to maintain legacy support, limited functionality is provided for sqlite, mysql, and postgres - I suggest placing these in maintenance-only mode as a fallback for people without SQLAlchemy installed.

That said, SQLAlchemy is trivial to install and is included in all the Scientific/data oriented Python distributions that I know (EPD/Canopy, Anaconda, PythonXY, SAGE) - if you can install Pandas, you can install SQLAlchemy.

@y-p
Contributor
y-p commented Jan 14, 2014

Good. What's ready for 0.14?

@mangecoeur
Contributor

My branch works in practice, but is currently a bit short on automated tests.

It probably also needs a little code cleanup and possibly some deprecation warnings on old function names.

I'm writing some basic tests now, which I will push when I can.

@y-p
Contributor
y-p commented Jan 14, 2014

Ok, and with your branch pandas will be able to...?

@mangecoeur
Contributor

Pandas will be able to do the following with any DB supported by SQLAlchemy (currently in core: Drizzle, Firebird, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SQLite, Sybase):

@y-p
Contributor
y-p commented Jan 14, 2014

That's great. Are there docs ready?

@mangecoeur
Contributor

There are api docs but they need checking and cleanup. There are not yet sufficient manual-style docs.

@y-p
Contributor
y-p commented Jan 14, 2014

API docs meaning docstrings? or is there a new section in the manual that just needs more work?

Are those ~5 commits at the tip of mangecoeur_sql all of it, or did you merge other bits not in master (other PRs)
into your branch?

@mangecoeur
Contributor

Yes API docs meaning docstrings. The manual section:

http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries
Also needs updating, since it doesn't reflect the updated API (already introduced in 0.13, I think), which uses read_sql and to_sql for consistency with e.g. the CSV API.

My changes are based on Master (last merged 2013-12-4) with no other PRs. I have changes I haven't pushed yet.

@mangecoeur
Contributor

Update: Basic tests and docs now in place on my branch. They need some review still. If we can get this into beta builds of pandas so that more people can try it out, that would be great!

@y-p
Contributor
y-p commented Jan 14, 2014

That's great, good job.
If everything checks out we can perhaps get it in as experimental early into 0.14.

The missing bit for me is how your branch stands in relation to all the rest of the SQL discussions
and PRs scattered around the tracker; I lack perspective.

@jreback
Contributor
jreback commented Jan 14, 2014

@mangecoeur it probably makes sense to make this a new io module, so we can leave the existing one in place as well

this could involve different function names or an option (io.compat)

@mangecoeur
Contributor

@jreback these updates should be backwards compatible, so there's no need for two modules. If sqlalchemy is not installed or a DBAPI connection is provided instead of an engine it automatically falls back to legacy mode and issues a warning.
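Schematically, the fallback being described looks something like this (class and function names here are illustrative stand-ins, not the module's actual internals):

import warnings

class EngineBackedSQL(object):   # stand-in for the SQLAlchemy-backed wrapper
    def __init__(self, engine):
        self.engine = engine

class LegacySQL(object):         # stand-in for the DBAPI-connection fallback
    def __init__(self, con):
        self.con = con

def get_pandas_sql(con):
    """Use the SQLAlchemy path when an engine is supplied, else warn and fall back."""
    try:
        from sqlalchemy.engine import Engine
    except ImportError:
        Engine = None
    if Engine is not None and isinstance(con, Engine):
        return EngineBackedSQL(con)
    warnings.warn('No SQLAlchemy engine supplied (or SQLAlchemy not installed); '
                  'falling back to legacy DBAPI mode.')
    return LegacySQL(con)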

If you think there may be a functional difference the best thing to do would be to write a test for currently documented API in the current version of pandas and run that same test against the new version.

I have already made tests for read_sql, to_sql, execute in legacy mode so feel free to have a look and see if you can think of e.g. error conditions that need to be caught.

I based my version on the various work found in this thread (I had some difficulty doing pull/merges from other forks so some of it was just copy-pasted) so in theory it should represent some semblance of consensus around design and API.

@jreback
Contributor
jreback commented Jan 14, 2014

@mangecoeur that is great then! We probably want to have an existing ci/requirements file that does NOT use sqlalchemy, for compat purposes (maybe the 2.6 and 3.2 builds)

@mangecoeur
Contributor

@jreback ok, so the tests will automatically skip the SQLAlchemy-dependent parts if it is not installed. I've added sqlalchemy as a requirement for Python 2.7 and 3.3 only; seems to be passing everywhere.
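The skip pattern in question is roughly this (a nose-era sketch, not the branch's exact test code):

import nose

try:
    import sqlalchemy
    SQLALCHEMY_INSTALLED = True
except ImportError:
    SQLALCHEMY_INSTALLED = False

class TestSQLAlchemyMode(object):
    def setUp(self):
        # Skip all SQLAlchemy-dependent tests when the optional dependency is
        # missing, so builds without it (e.g. 2.6 and 3.2) still pass.
        if not SQLALCHEMY_INSTALLED:
            raise nose.SkipTest('SQLAlchemy not installed, skipping')

    def test_engine_roundtrip(self):
        pass  # engine-based to_sql / read_sql round-trip would go here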

Docs now updated also and build tested.
Anyway I'm pretty content with the state things are in, issuing PR shortly...

@mangecoeur
Contributor

Pull request: #5950

@jreback jreback added a commit to jreback/pandas that referenced this issue Feb 6, 2014
@danielballan @jreback danielballan + jreback ENH #4163 Use SQLAlchemy for DB abstraction
TST Import sqlalchemy on Travis.

DOC add docstrings to read sql

ENH read_sql connects via Connection, Engine, file path, or :memory: string

CLN Separate legacy code into new file, and fallback so that all old tests pass.

TST to use sqlachemy syntax in tests

CLN sql into classes, legacy passes

FIX few engine vs con calls

CLN pep8 cleanup

add postgres support for pandas.io.sql.get_schema

WIP: cleaup of sql io module - imported correct SQLALCHEMY type, delete redundant PandasSQLWithCon

TODO: renamed _engine_read_table, need to think of a better name.
TODO: clean up get_conneciton function

ENH: cleanup of SQL io

TODO: check that legacy mode works
TODO: run tests

correctly enabled coerce_float option

Cleanup and bug-fixing mainly on legacy mode sql.
IMPORTANT - changed legacy to require connection rather than cursor. This is still not yet finalized.
TODO: tests and doc

Added Test coverage for basic functionality using in-memory SQLite database

Simplified API by automatically distinguishing between engine and connection. Added warnings

ENH #4163 Added tests and documentation

Initial draft of doc updates

minor doc updates

Added tests and reduced code repetition. Updated Docs. Added test coverage for legacy names

Documentation updates, more tests

Added depreciation warnings for legacy names.

Updated docs and test doc build

ENH #4163 - finalized tests and docs, ready for wider use…

TST added sqlalchemy to TravisCI build dep for py 2.7 and 3.3

TST Import sqlalchemy on Travis.

DOC add docstrings to read sql

ENH read_sql connects via Connection, Engine, file path, or :memory: string

CLN Separate legacy code into new file, and fallback so that all old tests pass.

ENH #4163 added version added coment

ENH #4163 added depreciation warning for tquery and uquery

ENH #4163 Documentation and tests

ENH #4163 Added more robust type coertion, datetime parsing, and parse date options. Updated optional dependancies

Added columns optional arg to read_table, removed failing legacy tests.

Added columns to doc

ENH #4163 Fixed class renaming, expanded docs

ENH #4163 Fixed tests in legacy mode

ENH #4163 Tweaks to docs, avoid mutable default args, mysql tests

ENH #4163 Introduce DataFrame Index support. Refactor to introduce PandasSQLTable for cleaner OOP design

ENH #4163 Fix bug in index + parse date interaction, added test case for problem

ENH #4163 Fixed missing basestring import for py3.3 compat

ENH #4163 Fixed missing string_types import for py3.3 compat
f47f9fe
@jreback jreback added a commit that referenced this issue Feb 6, 2014
@danielballan @jreback danielballan + jreback ENH #4163 Use SQLAlchemy for DB abstraction 0222d39

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@jreback Jonathan Chambers + jreback ENH #4163 Added tests and documentation 145ab5b

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@jreback Jonathan Chambers + jreback ENH #4163 Added more robust type coertion, datetime parsing, and parse date options. Updated optional dependancies e57e125

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@danielballan @jreback danielballan + jreback ENH #4163 Use SQLAlchemy for DB abstraction 74d091f

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@jreback Jonathan Chambers + jreback ENH #4163 Added tests and documentation e600131

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@jreback Jonathan Chambers + jreback ENH #4163 Added more robust type coertion, datetime parsing, and parse date options 0aa7d84

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@jreback Jonathan Chambers + jreback ENH #4163 Tweaks to docs, avoid mutable default args, mysql tests 2e3f83d

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@jreback Jonathan Chambers + jreback ENH #4163 Introduce DataFrame Index support. Refactor to introduce PandasSQLTable for cleaner OOP design de64af1

@jreback jreback added a commit that referenced this issue Feb 6, 2014
@mangecoeur @jreback mangecoeur + jreback ENH #4163 Fixed missing basestring import for py3.3 compat; ENH #4163 Fixed missing string_types import for py3.3 compat 8e8d067
@jreback
Contributor
jreback commented Feb 7, 2014

@jorisvandenbossche do you want to close this and open another issue with the things that are not covered by #6292 (if they are at all relevant) but that we may want to address later?

@jorisvandenbossche
Member

@hayd You opened this issue, what do you think? Do you have a clear view of what on this list is still relevant?

@hayd
Contributor
hayd commented Feb 8, 2014

Let's close and move parts into #6292; this issue has got very long.

@hayd hayd closed this Feb 8, 2014