New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using a buffer to write a psycopg3 copy result through pandas #461
Comments
It is explained in the documentation: you can use copy_sql = "COPY (SELECT * FROM big_table) TO STDOUT CSV"
buffer = BytesIO()
with cursor.copy(copy_sql) as copy:
for data in copy:
buffer.write(data)
buffer.seek(0)
pd.read_csv(buffer, engine="c").to_excel(self.output_file) This is what psycopg2 does internally; moving the iteration of the copy stream from inside a consumer function to the hands of the user (or of the file, if you copy into the database instead) allows different operation patterns, such as async copy (including mixing up an async connection with a sync file, or the other way around). |
@dvarrazzo Thanks so much for the reply and explanation. I'm guessing that in both cases (psycopg2 and psycopg3), the entire result set of the query is loaded into Python's memory footprint - or am I wrong here? What I was trying to do was stream the result from psycopg to pandas in chunks. Some of our queries get quite large - hundreds of GB large. :) Would using tempfile make more sense? Sorry for the questions, this is the first time I've been this deep in the weeds. |
Yes, your pattern would load the entire dataset in memory, both with psycopg 2 and 3, because BytesIO is just a file interface over a chunk of memory. Using a tempfile object, such as spooled temp file would allow to start using the disk after a certain memory threshold. |
I read up on #316 but have a slightly different issue.
Using
psycopg2
, I could write large results to various formats on disk usingcopy_expert
and aBytesIO
buffer like this withpandas
:However, I can't figure out how to replace the
buffer
incopy_expert
withpsycopg3
's new copy command. Is this still possible? It is quite likely I just haven't figured out the proper formula; if someone can help, I'd be happy to PR an addition to the documentation.Another question: is the above code in psycopg2 doing what I intended, and minimizing memory usage, or does it actually copy the entire query result into
buffer
before usingpandas
to write to Excel format withto_excel
?Thank you for your efforts, they are very much appreciated!
The text was updated successfully, but these errors were encountered: