PanicException on reading Parquet file from S3 #17864
Comments
You should edit the title to note that the issue is about reading from cloud storage. Please run this and post the full output; maybe that'll help.
The team treats all panic errors as bugs, but it's likely that you need to set storage_options. See here https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html#polars-scan-parquet and here https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html The environment variables that pyarrow (via fsspec) looks for aren't 100% in sync with what polars (via object_store) looks for, so that's probably how to fix the issue.
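As a hedged illustration of the environment-variable route: the key names below come from the object_store AmazonS3ConfigKey docs linked above, the bucket path is a placeholder, and whether polars picks these up automatically is an assumption here.

```python
import os

# Sketch: set the credential variables that object_store (polars' native
# cloud reader) documents, before polars does any cloud I/O. Names follow
# the AmazonS3ConfigKey docs linked above; values are placeholders.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_DEFAULT_REGION"] = "<region>"

import polars as pl

df = pl.read_parquet("s3://bucket/file.parquet")  # placeholder path
```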
Updated as suggested
As per your suggestion to use storage_options:

```python
# WORKS
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False,
    storage_options={
        "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
        "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
        "aws_region": AWS_REGION})

# DOES NOT WORK
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True,
    storage_options={
        "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
        "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
        "aws_region": AWS_REGION})
```

The error message is:

```
TypeError: AioSession.__init__() got an unexpected keyword argument 'aws_access_key_id'
```
The new error is because there isn't parity between the key names fsspec expects and the key names object_store expects. From here it looks like you should set the environment variable
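For illustration, a sketch of what the use_pyarrow=True path would look like with fsspec-style key names. This assumes s3fs is the backend fsspec selects for s3:// URLs (the `key`/`secret`/`client_kwargs` names are s3fs's, not polars'), and the path and region are placeholders:

```python
import os
import polars as pl

# Sketch: with use_pyarrow=True, storage_options are forwarded to fsspec
# (s3fs for s3:// paths), which uses different key names than object_store.
df = pl.read_parquet(
    "s3://bucket/file.parquet",  # placeholder path
    use_pyarrow=True,
    storage_options={
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {"region_name": "us-east-1"},  # placeholder region
    },
)
```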
If you want a minimal reproduction of the panic:

```python
import polars as pl

pl.scan_parquet("s3://foobar/file.parquet").collect()
```
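To get more detail out of the panic when running the reproduction, a hedged sketch: RUST_BACKTRACE is a standard Rust convention for any Rust-backed library (not polars-specific), and POLARS_VERBOSE is polars' own logging switch; that this is the output the commenter above asked for is an assumption.

```python
import os

# Standard Rust panic backtrace; must be set before the panic occurs.
os.environ["RUST_BACKTRACE"] = "1"
# Polars' verbose logging switch.
os.environ["POLARS_VERBOSE"] = "1"

import polars as pl

pl.scan_parquet("s3://foobar/file.parquet").collect()
```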
Checks
Reproducible example
Log output
Issue description
Running read_parquet with use_pyarrow=False raises a PanicException.
I noticed that the below works OK when I add use_pyarrow=True, but it seems very slow.
The file that I am reading is stored in S3. The S3 file path (parquet_file_name) has no spaces in it.
If I download the parquet file locally and open it from local disk, polars does not raise any issues.
Also note: before I upgraded polars to version 1.2.1, I was using version 0.17 and read_parquet did not raise any issues!
Expected behavior
DataFrame is read without errors.
Installed versions
```
--------Version info---------
Polars: 1.2.1
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec: 2024.6.1
gevent:
great_tables:
hvplot:
matplotlib: 3.9.1
nest_asyncio: 1.6.0
numpy: 2.0.1
openpyxl:
pandas: 2.2.2
pyarrow: 17.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```