
ENH: Support PostgreSQL server-side cursors to prevent memory hog on large datasets #35689

@cloud-rocket

Description


Is your feature request related to a problem?

pandas.read_sql_query supports the Python generator pattern via the chunksize argument. However, this is of little help when working with large datasets, since the entire result set is first retrieved from the DB into client-side memory and only afterwards split into separate frames of chunksize rows. Large datasets easily run into out-of-memory problems with this approach.

Describe the solution you'd like

Postgres/psycopg2 address this problem with server-side cursors, but pandas does not support them.

API breaking implications

The is_cursor argument of SQLDatabase / SQLiteDatabase is not exposed in pandas.read_sql_query or pandas.read_sql_table, for no apparent reason. It should be exposed so that a named (server-side) cursor can be provided.

Describe alternatives you've considered

Instead of doing:

iter = sql.read_sql_query(sql,
                          conn,
                          index_col='col1',
                          chunksize=chunksize)

I tried reimplementing it like this:

from pandas.io.sql import SQLiteDatabase

curs = conn.cursor(name='cur_name')  # server-side cursor creation
curs.itersize = chunksize

pandas_sql = SQLiteDatabase(curs, is_cursor=True)
iter = pandas_sql.read_query(sql,
                             index_col='col1',
                             chunksize=chunksize)

but it fails because SQLiteDatabase tries to access cursor.description, which is None for some reason with server-side cursors (any idea why?).
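A workaround sketch that bypasses pandas' SQL layer entirely: a small helper that yields DataFrames chunk by chunk from any DB-API cursor via fetchmany, so at most chunksize rows sit in client memory at a time. With psycopg2 you would create the cursor as conn.cursor(name='cur_name') to make it server-side; the demo below uses an in-memory sqlite3 database only so the sketch is self-contained. The read_query_chunked name is made up for illustration, not a pandas API.

```python
import sqlite3

import pandas as pd


def read_query_chunked(cursor, chunksize):
    """Yield DataFrames of up to `chunksize` rows from an executed DB-API cursor."""
    # cursor.description is populated once the query has run;
    # the first element of each entry is the column name (PEP 249).
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield pd.DataFrame(rows, columns=columns)


# Demo with sqlite3 purely to keep the sketch runnable; with psycopg2
# you would instead do: curs = conn.cursor(name='cur_name'); curs.execute(sql)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(10)])

cur = conn.cursor()
cur.execute("SELECT col1, val FROM t ORDER BY col1")
chunks = list(read_query_chunked(cur, chunksize=4))
# 10 rows at chunksize=4 -> chunks of 4, 4 and 2 rows
```

Only the rows of the current chunk ever cross the network, which is exactly the behavior a server-side cursor gives you on the database side.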

Additional references
