Using a buffer to write a psycopg3 copy result through pandas #461

Closed
FlipperPA opened this issue Dec 16, 2022 · 3 comments

FlipperPA commented Dec 16, 2022

I read up on #316 but have a slightly different issue.

Using psycopg2, I could write large query results to various formats on disk using copy_expert and a BytesIO buffer, then hand the buffer to pandas, like this:

from io import BytesIO
import pandas as pd

copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"

buffer = BytesIO()
# cursor comes from a psycopg2 connection; copy_expert streams the COPY
# output into the file-like object in blocks of `size` bytes.
cursor.copy_expert(copy_sql, buffer, size=8192)
buffer.seek(0)
pd.read_csv(buffer, engine="c").to_excel(self.output_file)

However, I can't figure out how to use a buffer like that with psycopg3's new copy() command in place of copy_expert. Is this still possible? It is quite likely I just haven't figured out the proper formula; if someone can help, I'd be happy to PR an addition to the documentation.

Another question: is the psycopg2 code above doing what I intended and minimizing memory usage, or does it actually copy the entire query result into buffer before pandas writes the Excel file with to_excel?

Thank you for your efforts, they are very much appreciated!

dvarrazzo (Member) commented:

It is explained in the documentation: you can use copy() block-by-block:

copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"
buffer = BytesIO()
with cursor.copy(copy_sql) as copy:
    # Iterate over the COPY stream block by block, appending each block.
    for data in copy:
        buffer.write(data)

buffer.seek(0)
pd.read_csv(buffer, engine="c").to_excel(self.output_file)

This is what psycopg2 does internally. Moving the iteration of the copy stream out of an internal consumer function and into the hands of the user (or of the file, if you copy into the database instead) allows different operation patterns, such as async copy, including mixing an async connection with a sync file, or the other way around.
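
For example, a minimal sketch of the async case, mixing an async connection with a plain file on disk (the DSN and the output file name here are placeholders, not from the example above):

import psycopg

async def dump_big_table():
    # Placeholder DSN; use your own connection parameters.
    async with await psycopg.AsyncConnection.connect("dbname=test") as aconn:
        async with aconn.cursor() as cur:
            # A plain synchronous file used together with an async connection.
            with open("big_table.csv", "wb") as f:
                async with cur.copy(
                    "COPY (SELECT * FROM big_table) TO STDOUT CSV"
                ) as copy:
                    async for data in copy:
                        f.write(data)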

FlipperPA commented Dec 17, 2022

@dvarrazzo Thanks so much for the reply and explanation. I'm guessing that in both cases (psycopg2 and psycopg3), the entire result set of the query is loaded into Python's memory footprint - or am I wrong here? What I was trying to do was stream the result from psycopg to pandas in chunks. Some of our queries get quite large - hundreds of GB large. :) Would using tempfile make more sense? Sorry for the questions, this is the first time I've been this deep in the weeds.

dvarrazzo (Member) commented:

Yes, your pattern would load the entire dataset in memory, both with psycopg 2 and 3, because BytesIO is just a file interface over a chunk of memory.

Using a tempfile object, such as a SpooledTemporaryFile, would allow you to start using the disk only after a certain memory threshold is exceeded.
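
For example, a minimal sketch reusing the snippet above with a spooled temporary file (the 64 MB threshold and the output file name are illustrative, and cursor is assumed to be a psycopg 3 cursor):

import tempfile
import pandas as pd

copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"

# Data stays in memory up to max_size bytes, then spills to a temp file on disk.
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024) as buffer:
    with cursor.copy(copy_sql) as copy:
        for data in copy:
            buffer.write(data)
    buffer.seek(0)
    pd.read_csv(buffer, engine="c").to_excel("output.xlsx")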
