PanicException on reading Parquet file from S3 #17864
Comments
You should edit the title to note that the issue is about reading from cloud storage. Please run this and post the full output; maybe that'll help.
The team treats all panic errors as bugs, but it's likely that you need to set storage_options. See here https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html#polars-scan-parquet and here https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html The environment variables that pyarrow (via fsspec) looks for aren't 100% in sync with what polars (via object_store) looks for, so that's probably how to fix the issue.
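As a hedged illustration of the environment-variable route: the key names below come from the object_store AmazonS3ConfigKey docs linked above, the bucket path is a placeholder, and whether polars picks these up automatically is an assumption here.

```python
import os

# Sketch: set the credential variables that object_store (polars' native
# cloud reader) documents, before polars does any cloud I/O. Names follow
# the AmazonS3ConfigKey docs linked above; values are placeholders.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "<region>"

import polars as pl

df = pl.read_parquet("s3://bucket/file.parquet")  # placeholder path
```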
Updated as suggested
As per your suggestion to use storage_options:

```python
# WORKS
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False,
    storage_options={
        "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
        "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
        "aws_region": AWS_REGION})

# DOES NOT WORK
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True,
    storage_options={
        "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
        "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
        "aws_region": AWS_REGION})
```

The error message is:

```
TypeError: AioSession.__init__() got an unexpected keyword argument 'aws_access_key_id'
```
The new error is because there isn't parity between the key names fsspec expects and the key names object_store expects. From here it looks like you should set the environment variable
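For illustration, a sketch of what the use_pyarrow=True path would look like with fsspec-style key names. This assumes s3fs is the backend fsspec selects for s3:// URLs (the `key`/`secret`/`client_kwargs` names are s3fs's, not polars'), and the path and region are placeholders:

```python
import os
import polars as pl

# Sketch: with use_pyarrow=True, storage_options are forwarded to fsspec
# (s3fs for s3:// paths), which uses different key names than object_store.
df = pl.read_parquet(
    "s3://bucket/file.parquet",  # placeholder path
    use_pyarrow=True,
    storage_options={
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {"region_name": "us-east-1"},  # placeholder region
    },
)
```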
If you want a minimal reproduction of the panic:

```python
import polars as pl

pl.scan_parquet("s3://foobar/file.parquet").collect()
```
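To get more detail out of the panic when running the reproduction, a hedged sketch: RUST_BACKTRACE is a standard Rust convention for any Rust-backed library (not polars-specific), and POLARS_VERBOSE is polars' own logging switch; that this is the output the commenter above asked for is an assumption.

```python
import os

# Standard Rust panic backtrace; must be set before the panic occurs.
os.environ["RUST_BACKTRACE"] = "1"
# Polars' verbose logging switch.
os.environ["POLARS_VERBOSE"] = "1"

import polars as pl

pl.scan_parquet("s3://foobar/file.parquet").collect()
```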
Checks
Reproducible example
Log output
Issue description
Running read_parquet with use_pyarrow=False raises a PanicException.
I noticed that the below works OK when I add use_pyarrow=True, but it seems very slow.
The file that I am reading is stored in S3. The S3 file path (parquet_file_name) has no spaces in it.
If I download the parquet file locally and open it from local disk, polars does not raise any issues.
Also note: before I upgraded polars to version 1.2.1, I was using version 0.17 and read_parquet did not raise any issues!
Expected behavior
DataFrame is read without errors.
Installed versions
```
--------Version info---------
Polars: 1.2.1
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec: 2024.6.1
gevent:
great_tables:
hvplot:
matplotlib: 3.9.1
nest_asyncio: 1.6.0
numpy: 2.0.1
openpyxl:
pandas: 2.2.2
pyarrow: 17.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```