Impl Polars cursor #436

Open
laughingman7743 opened this issue May 1, 2023 · 7 comments

Comments

@laughingman7743
Owner

https://www.pola.rs/
https://pypi.org/project/polars/
https://pola-rs.github.io/polars/py-polars/html/reference/

@darkcofy

polars cursor would be a godsend!

@sacundim

sacundim commented Jul 28, 2023

Polars uses Arrow as its memory representation, so, as I understand it, supporting Polars in PyAthena is mostly just a syntactic shortcut, right? Polars' documentation for the from_arrow method says:

This operation will be zero copy for the most part. Types that are not supported by Polars may be cast to the closest supported type.

So except for that note about unsupported types, the following code should have basically no overhead already today:

import polars as pl
import pyathena
from pyathena.arrow.cursor import ArrowCursor

cursor = pyathena.connect(
    s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
    region_name="us-west-2",
    cursor_class=ArrowCursor,
).cursor()

# This should be zero-copy most of the time
polars_df = pl.from_arrow(cursor.execute("SELECT * FROM many_rows").as_arrow())

I actually tried PyAthena → Arrow → Polars this way the other day, so I can at least confirm that it is functional: it populates a working Polars DataFrame. I did not verify anything about copying or performance overhead.
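
To illustrate the "syntactic shortcut" idea, the two-step conversion above could be wrapped in a small helper. This is only a sketch: `fetch_polars` and its `from_arrow` parameter are hypothetical names, not part of PyAthena or Polars, and the helper assumes a cursor whose `execute()` returns an object with an `as_arrow()` method, as in the snippet above.

```python
from typing import Any, Callable, Optional


def fetch_polars(
    cursor: Any,
    sql: str,
    from_arrow: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Run `sql` on an Arrow-capable cursor and convert the result for Polars.

    Hypothetical helper, not part of PyAthena. `from_arrow` defaults to
    `polars.from_arrow`; it is a parameter only so the converter can be
    swapped out, for example in tests.
    """
    if from_arrow is None:
        # Imported lazily so the helper can be defined without Polars installed.
        import polars as pl

        from_arrow = pl.from_arrow

    # Same chain as in the comment above: execute, fetch as Arrow, convert.
    arrow_table = cursor.execute(sql).as_arrow()
    return from_arrow(arrow_table)
```

With the `cursor` from the snippet above, `fetch_polars(cursor, "SELECT * FROM many_rows")` would be equivalent to the `pl.from_arrow(...)` line, so it adds no conversion work of its own.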

@mazzma12

mazzma12 commented Feb 19, 2024

Hi, may I ask which version of pyarrow you are using? I get an error with version 15.0.0:

OperationalError: When reading information for key 'test/670672a1-dab2-4635-ba3b-1c6a16dc0b6f.csv' in bucket '{BucketName}': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

Or does anything else come to mind that could explain this error? Thank you.
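
Since the question turns on which versions are installed, a snippet like the following can collect them for a report. It is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is an assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("PyAthena", "pyarrow", "polars")):
    """Return a {package: version or None} mapping for the given distributions."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```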

@sacundim

This was many months ago and I no longer recall which version I was using. But your error is, to all appearances, a network connectivity problem; it says so right there in the message.

@mazzma12

Yes, I totally agree, but it is puzzling to me because it works with other cursors (PandasCursor, for example).

@laughingman7743
Owner Author

laughingman7743 commented Mar 16, 2024

FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/a580fb77-99b1-49c8-8f70-cc3eaf663089' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_fetchall[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/93571118-03bb-4b01-9772-4b1f99dc9f61.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany_fetch[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f7a80ee7-26c4-4103-bf49-c94b29c6eea0' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/573dd21d-a3fe-4bbe-a7b5-aa1807dfd2a6.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload_as_arrow[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f23455fe-6929-439d-864b-d52b55b7be7a' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iterator[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/b94d398d-1cac-45b2-b5d3-9210897b6d5f.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c823bf59-cd6b-4e0c-9600-690de08d3f18' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_iterator[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c3433f82-a070-4044-be85-1d18786d1311' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_arraysize[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/64c7bec3-8281-4807-ae95-fb19ca8d0159' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iceberg_table - pyathena.error.OperationalError: When reading information for key 'tmp/bbf31c91-e845-4ccf-8b8b-588147fcf4e7.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/76e1c3c4-ed18-4a18-ad6f-3fe9dc5db8a1.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/4f059472-9026-49ac-892f-cfd47e5eac81' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/170fbcee-1269-4367-8d3e-b8e43f838b79' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_description[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/baf70311-3540-4dce-ae43-8aecc19566b1' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_query_execution[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/0723649b-bb5c-450b-baa2-d52ea3d8a7aa.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

These errors occurred when I ran the tests in my local environment. 🤔
They do not occur in GitHub Actions.
#520

@laughingman7743
Owner Author

apache/arrow#36007
