
PanicException on reading Parquet file from S3 #17864

Open

jonimatix opened this issue Jul 25, 2024 · 5 comments
Labels: A-io-cloud (reading/writing to cloud storage), A-panic (code that results in panic exceptions), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)

Comments

jonimatix commented Jul 25, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

parquet_file_name = 's3://test-inputs-dataset/GAME_INFO/server_id=3345/date=2023-03-01/hour=06/GAME_INFO_3345_2023-03-01_06.parquet'

# col_list is a list of column names to select
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)
# or
df = pl.scan_parquet(parquet_file_name).collect()

# This works fine (scan alone is lazy; the panic only surfaces on collect):
# df = pl.scan_parquet(parquet_file_name)

Log output

PanicException                            Traceback (most recent call last)
File c:\Projects\Proj\scripts\main.py:1
----> 1 df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) # scan_parquet # n_rows=10,

File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\_utils\deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\_utils\deprecation.py:91, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     86 @wraps(function)
     87 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88     _rename_keyword_argument(
     89         old_name, new_name, kwargs, function.__qualname__, version
     90     )
---> 91     return function(*args, **kwargs)

File c:\Users\jm\.conda\envs\env\Lib\site-packages\polars\io\parquet\functions.py:206, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, try_parse_hive_dates, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    203     else:
    204         lf = lf.select(columns)
--> 206 return lf.collect()
...
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

PanicException: called `Result::unwrap()` on an `Err` value: InvalidHeaderValue

Issue description

Running read_parquet with use_pyarrow=False raises a PanicException.

I noticed that the call below works when I add use_pyarrow=True, but it seems very slow:

df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True)

The file that I am reading is stored in S3, and the S3 path (parquet_file_name) contains no spaces.
If I download the Parquet file and open it from local disk, Polars does not raise any issues.

Also note that before I upgraded Polars to 1.2.1 I was on version 0.17, and read_parquet did not raise any issues!

Expected behavior

The DataFrame is read without errors.

Installed versions

--------Version info---------
Polars: 1.2.1
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec: 2024.6.1
gevent:
great_tables:
hvplot:
matplotlib: 3.9.1
nest_asyncio: 1.6.0
numpy: 2.0.1
openpyxl:
pandas: 2.2.2
pyarrow: 17.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:

jonimatix added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Jul 25, 2024
deanm0000 (Collaborator) commented Jul 25, 2024

You should edit the title to say that the issue is about reading from cloud storage. Please run this and post the full output; maybe that will help.

import os
os.environ['RUST_BACKTRACE'] = 'full'  # the Rust env var is RUST_BACKTRACE; set it before the panic occurs

import polars as pl

parquet_file_name = ...
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)

The team treats all panics as bugs, but it's likely that you need to set storage_options.

See here https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html#polars-scan-parquet

and here

https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html

The environment variables that pyarrow (via fsspec) looks for aren't fully in sync with the ones Polars (via object_store) looks for, so that mismatch is probably how to fix the issue.
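
For illustration, a rough mapping of the two option vocabularies (an assumption drawn from the two pages linked above; treat the exact key names as version-dependent):

# Option-name mapping (assumption based on the linked docs, not verified
# against this exact Polars/pyarrow version):
#
#   polars via object_store        pyarrow via fsspec/s3fs
#   -----------------------        -----------------------
#   aws_access_key_id              key
#   aws_secret_access_key          secret
#   aws_region                     client_kwargs={"region_name": ...}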

deanm0000 added the P-low (Priority: low), A-io-cloud (reading/writing to cloud storage), and A-panic (code that results in panic exceptions) labels and removed the needs triage (Awaiting prioritization by a maintainer) label on Jul 25, 2024
jonimatix changed the title from "PanicException on reading Parquet file" to "PanicException on reading Parquet file from S3" on Jul 25, 2024
jonimatix (Author) commented Jul 25, 2024

Updated the title as suggested.

jonimatix (Author) commented Jul 25, 2024

As per your suggestion to use storage_options:

# WORKS
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False, 
                     storage_options = {
                            "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
                            "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
                            "aws_region": AWS_REGION})

# DOES NOT WORK 
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=True, 
                     storage_options = {
                            "aws_access_key_id": os.environ.get('AWS_ACCESS_KEY_ID'),
                            "aws_secret_access_key": os.environ.get('AWS_SECRET_ACCESS_KEY'),
                            "aws_region": AWS_REGION})

The error message is:

TypeError: AioSession.__init__() got an unexpected keyword argument 'aws_access_key_id'
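
For what it's worth, with use_pyarrow=True the storage_options appear to be handed to fsspec/s3fs rather than object_store, so s3fs-style key names would be needed. A sketch under that assumption (the key names come from the s3fs API, not from the Polars docs):

# Sketch only: s3fs-style option names, assuming use_pyarrow=True routes
# storage_options to fsspec/s3fs.
df = pl.read_parquet(
    parquet_file_name,
    columns=col_list,
    use_pyarrow=True,
    storage_options={
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "client_kwargs": {"region_name": AWS_REGION},
    },
)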

deanm0000 (Collaborator) commented Jul 25, 2024

The new error is because there isn't parity between the key names fsspec expects and the key names object_store expects. If you set AWS_REGION as an environment variable, does pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) work? I don't use S3, so I'm just guessing at that.

From the object_store docs it looks like you should set the environment variable AWS_DEFAULT_REGION to whatever it should be, and then I think pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False) would work.
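
A minimal sketch of that suggestion (the region value is a placeholder):

import os
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # placeholder: use the bucket's actual region

import polars as pl
df = pl.read_parquet(parquet_file_name, columns=col_list, use_pyarrow=False)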

Kuinox commented Aug 27, 2024

If you want a minimal reproduction of the panic:

import polars as pl
pl.scan_parquet("s3://foobar/file.parquet").collect()
