Description
Is your feature request related to a problem?
pandas.read_sql_query supports the Python generator pattern when given a chunksize argument. This is not very helpful with large datasets, because the whole result set is first retrieved from the database into client-side memory and only then split into separate frames of chunksize rows. Large datasets easily run out of memory with this approach.
Describe the solution you'd like
PostgreSQL/psycopg2 address this problem with server-side cursors, but pandas does not support them.
API breaking implications
The is_cursor argument of SQLDatabase and SQLiteDatabase is not exposed in pandas.read_sql_query or pandas.read_sql_table, for no apparent reason. It should be exposed so that a named (server-side) cursor can be passed in.
Describe alternatives you've considered
Instead of doing:
import pandas as pd

chunks = pd.read_sql_query(query,
                           conn,
                           index_col='col1',
                           chunksize=chunksize)
I tried reimplementing it like this:
from pandas.io.sql import SQLiteDatabase

curs = conn.cursor(name='cur_name')  # server-side cursor creation
curs.itersize = chunksize
pandas_sql = SQLiteDatabase(curs, is_cursor=True)
chunks = pandas_sql.read_query(
    query,
    index_col='col1',
    chunksize=chunksize)
but it fails because SQLiteDatabase tries to access cursor.description, which is None for some reason with server-side cursors (any idea why?).
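For illustration, here is a minimal sketch of the chunked fetching this request is after, written against the generic DB-API cursor interface rather than pandas internals. The helper name iter_chunks is hypothetical, and the demo uses an in-memory sqlite3 database only so that it runs without a server. The sketch assumes that a server-side cursor may not populate cursor.description until the first fetch, so it reads the column names only after fetchmany has returned rows.

```python
import sqlite3


def iter_chunks(cursor, query, chunksize):
    """Yield (column_names, rows) pairs with at most `chunksize` rows each.

    Hypothetical helper, not part of pandas. It relies only on the DB-API
    methods execute(), fetchmany() and the description attribute.
    """
    cursor.execute(query)
    columns = None
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        if columns is None:
            # Assumption: with a server-side cursor, description may only be
            # available after the first fetch, so it is read here and not
            # right after execute().
            columns = [col[0] for col in cursor.description]
        yield columns, rows


# Demo with sqlite3 (stand-in for a real server-side cursor).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(5)]
)

chunks = list(
    iter_chunks(conn.cursor(), "SELECT col1, col2 FROM t ORDER BY col1", 2)
)
```

Each yielded pair can be turned into a frame with pandas.DataFrame(rows, columns=columns), so only one chunk of rows is held in client memory at a time. With psycopg2, the cursor passed in would be the named cursor from conn.cursor(name='cur_name').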