Using a buffer to write a psycopg3 copy result through pandas #461

Closed
FlipperPA opened this issue Dec 16, 2022 · 3 comments

FlipperPA commented Dec 16, 2022

I read up on #316 but have a slightly different issue.

Using psycopg2, I could write large query results to various formats on disk using copy_expert and a BytesIO buffer, then hand the buffer to pandas, like this:

from io import BytesIO
import pandas as pd

copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"

buffer = BytesIO()
# cursor comes from a psycopg2 connection; copy_expert streams the COPY
# output into the file-like object in blocks of `size` bytes.
cursor.copy_expert(copy_sql, buffer, size=8192)
buffer.seek(0)
pd.read_csv(buffer, engine="c").to_excel(self.output_file)

However, I can't figure out how to use a buffer like that with psycopg3's new copy() command in place of copy_expert. Is this still possible? It is quite likely I just haven't figured out the proper formula; if someone can help, I'd be happy to PR an addition to the documentation.

Another question: is the psycopg2 code above doing what I intended and minimizing memory usage, or does it actually copy the entire query result into buffer before pandas writes the Excel file with to_excel?

Thank you for your efforts, they are very much appreciated!

dvarrazzo (Member) commented:

It is explained in the documentation: you can use copy() block-by-block:

copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"
buffer = BytesIO()
with cursor.copy(copy_sql) as copy:
    # Iterate over the COPY stream block by block, appending each block.
    for data in copy:
        buffer.write(data)

buffer.seek(0)
pd.read_csv(buffer, engine="c").to_excel(self.output_file)

This is what psycopg2 does internally. Moving the iteration of the copy stream out of an internal consumer function and into the hands of the user (or of the file, if you copy into the database instead) allows different operation patterns, such as async copy, including mixing an async connection with a sync file, or the other way around.
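
For example, a minimal sketch of the async case, mixing an async connection with a plain file on disk (the DSN and the output file name here are placeholders, not from the example above):

import psycopg

async def dump_big_table():
    # Placeholder DSN; use your own connection parameters.
    async with await psycopg.AsyncConnection.connect("dbname=test") as aconn:
        async with aconn.cursor() as cur:
            # A plain synchronous file used together with an async connection.
            with open("big_table.csv", "wb") as f:
                async with cur.copy(
                    "COPY (SELECT * FROM big_table) TO STDOUT CSV"
                ) as copy:
                    async for data in copy:
                        f.write(data)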

FlipperPA commented Dec 17, 2022

@dvarrazzo Thanks so much for the reply and explanation. I'm guessing that in both cases (psycopg2 and psycopg3), the entire result set of the query is loaded into Python's memory footprint - or am I wrong here? What I was trying to do was stream the result from psycopg to pandas in chunks. Some of our queries get quite large - hundreds of GB large. :) Would using tempfile make more sense? Sorry for the questions, this is the first time I've been this deep in the weeds.

dvarrazzo (Member) commented:

Yes, your pattern would load the entire dataset in memory, both with psycopg 2 and 3, because BytesIO is just a file interface over a chunk of memory.

Using a tempfile object, such as a SpooledTemporaryFile, would allow you to start using the disk only after a certain memory threshold is exceeded.
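
For example, a minimal sketch reusing the snippet above with a spooled temporary file (the 64 MB threshold and the output file name are illustrative, and cursor is assumed to be a psycopg 3 cursor):

import tempfile
import pandas as pd

copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"

# Data stays in memory up to max_size bytes, then spills to a temp file on disk.
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024) as buffer:
    with cursor.copy(copy_sql) as copy:
        for data in copy:
            buffer.write(data)
    buffer.seek(0)
    pd.read_csv(buffer, engine="c").to_excel("output.xlsx")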
