read_parquet binary not working any more (since 0.20.0) #13099

michael72 · 2023-12-18T09:02:44Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

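import polars as pl
import s3fs

BUCKET_NAME = 'my-bucket'  # placeholder: replace with your bucket name

s3 = s3fs.S3FileSystem()

# Read a Parquet file from S3 through an open binary file handle
with s3.open(f'{BUCKET_NAME}/my.parquet', mode='rb') as f:
    print(pl.read_parquet(f).head())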

Log output

08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/io/parquet/functions.py:163, in read_parquet(source, columns, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
08:20:05     150         return pl.DataFrame._read_parquet(
08:20:05     151             source_prep,
08:20:05     152             columns=columns,
08:20:05    (...)
08:20:05     159             rechunk=rechunk,
08:20:05     160         )
08:20:05     162 # For other inputs, defer to `scan_parquet`
08:20:05 --> 163 lf = scan_parquet(
08:20:05     164     source,
08:20:05     165     n_rows=n_rows,
08:20:05     166     row_count_name=row_count_name,
08:20:05     167     row_count_offset=row_count_offset,
08:20:05     168     parallel=parallel,
08:20:05     169     use_statistics=use_statistics,
08:20:05     170     hive_partitioning=hive_partitioning,
08:20:05     171     rechunk=rechunk,
08:20:05     172     low_memory=low_memory,
08:20:05     173     cache=False,
08:20:05     174     storage_options=storage_options,
08:20:05     175     retries=retries,
08:20:05     176 )
08:20:05     178 if columns is not None:
08:20:05     179     if is_int_sequence(columns):
08:20:05 
08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/io/parquet/functions.py:303, in scan_parquet(source, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, cache, storage_options, retries)
08:20:05     301     source = normalize_filepath(source)
08:20:05     302 else:
08:20:05 --> 303     source = [normalize_filepath(source) for source in source]
08:20:05     305 return pl.LazyFrame._scan_parquet(
08:20:05     306     source,
08:20:05     307     n_rows=n_rows,
08:20:05    (...)
08:20:05     317     retries=retries,
08:20:05     318 )
08:20:05 
08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/io/parquet/functions.py:303, in <listcomp>(.0)
08:20:05     301     source = normalize_filepath(source)
08:20:05     302 else:
08:20:05 --> 303     source = [normalize_filepath(source) for source in source]
08:20:05     305 return pl.LazyFrame._scan_parquet(
08:20:05     306     source,
08:20:05     307     n_rows=n_rows,
08:20:05    (...)
08:20:05     317     retries=retries,
08:20:05     318 )
08:20:05 
08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/utils/various.py:228, in normalize_filepath(path, check_not_directory)
08:20:05     226 """Create a string path, expanding the home directory if present."""
08:20:05     227 # don't use pathlib here as it modifies slashes (s3:// -> s3:/)
08:20:05 --> 228 path = os.path.expanduser(path)  # noqa: PTH111
08:20:05     229 if (
08:20:05     230     check_not_directory
08:20:05     231     and os.path.exists(path)  # noqa: PTH110
08:20:05     232     and os.path.isdir(path)  # noqa: PTH112
08:20:05     233 ):
08:20:05     234     raise IsADirectoryError(f"expected a file path; {path!r} is a directory")
08:20:05 
08:20:05 File /usr/lib/python3.10/posixpath.py:258, in expanduser(path)
08:20:05     256 name = path[1:i]
08:20:05     257 if isinstance(name, bytes):
08:20:05 --> 258     name = str(name, 'ASCII')
08:20:05     259 try:
08:20:05     260     pwent = pwd.getpwnam(name)
08:20:05 
08:20:05 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 1: ordinal not in range(128)

Issue description

The issue was introduced in Polars 0.20.0; it worked up to and including 0.19.19.
The read_parquet function is supposed to accept binary input (here, from reading S3 data), and it used to.

Now normalize_filepath is called on the binary input, which does not make sense.
In addition, read_parquet, whose signature supports BinaryIO, internally calls scan_parquet, which no longer does.
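
The traceback is consistent with the open file object being iterated as if it were a list of paths: iterating a binary file yields bytes chunks split at newline bytes, and each chunk would then be handed to the path normalizer. A minimal stdlib-only sketch of that behavior, using an in-memory buffer as a stand-in for the S3 file (the payload is made up for illustration and is not real Parquet data):

```python
import io

# Made-up binary payload standing in for a Parquet file (illustrative only).
fake_file = io.BytesIO(b"PAR1\x15\x04\n~\xf0\x9f\x00\nmore-bytes")

# Iterating a binary file object yields bytes "lines", split after each b"\n".
chunks = [chunk for chunk in fake_file]

for chunk in chunks:
    print(type(chunk).__name__, chunk)
```

A chunk that happens to start with ~ and contains a non-ASCII byte is plausibly the kind of input on which the os.path.expanduser call shown in the traceback fails.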

Expected behavior

Reading a Parquet file passed as binary content should work.

Installed versions

--------Version info---------
Polars:               0.20.0
Index type:           UInt32
Platform:             Linux-5.19.0-46-generic-x86_64-with-glibc2.35
Python:               3.11.4 (main, Jun  7 2023, 12:45:48) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
matplotlib:           3.7.2
numpy:                1.26.0
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


ritchie46 · 2023-12-18T10:12:06Z

You should pass the s3 path directly to the reader. This is an intended change.

Try: pl.read_parquet(f'{BUCKET_NAME}/my.parquet')

ritchie46 · 2023-12-18T12:13:12Z

@stinodego can we raise a more informative error here? We essentially get an object we don't expect, right?

michael72 · 2023-12-19T05:53:36Z

> You should pass the s3 path directly to the reader. This is an intended change.
>
> Try: pl.read_parquet(f'{BUCKET_NAME}/my.parquet')

OK - thanks! Yes, that works now!

However, it would be good to at least update the documentation and type hints for the source parameter of read_parquet to match scan_parquet, i.e. only source: str | Path | list[str] | list[Path], removing BinaryIO | BytesIO | bytes, since they no longer seem to be supported. Alternatively, add tests that check those parameter types and fix the handling.

see https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html

stinodego · 2023-12-19T09:09:10Z

They are supported, but only for local filesystems. I think I actually made a mistake in rewriting the function and your input should still work. Will take a look later today or tomorrow.

You can use Ritchie's workaround for now - it's the recommended usage anyway.
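
To illustrate the distinction discussed here, the following is a rough sketch of how a reader might route its input: only path-like sources get path normalization, while bytes and file-like objects are passed through as binary. The helper name prepare_source and its behavior are hypothetical and are not taken from the Polars code base or from the eventual fix.

```python
import io
import os
from pathlib import Path
from typing import BinaryIO, Union

Source = Union[str, Path, bytes, BinaryIO]


def prepare_source(source: Source) -> Union[str, BinaryIO]:
    """Hypothetical helper: normalize paths, pass binary inputs through."""
    if isinstance(source, (str, Path)):
        # Only real paths get home-directory expansion.
        return os.path.expanduser(str(source))
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes are wrapped so downstream code can treat them as a file.
        return io.BytesIO(source)
    if hasattr(source, "read"):
        # File-like objects (e.g. an open s3fs file) are returned untouched.
        return source
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

With this kind of dispatch, an open binary handle never reaches the path-handling branch, so it is not iterated or decoded as a file name.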

michael72 added the bug (Something isn't working) and python (Related to Python Polars) labels on Dec 18, 2023

ritchie46 added the invalid (A bug report that is not actually a bug) label and removed the bug (Something isn't working) label on Dec 18, 2023

stinodego mentioned this issue Dec 20, 2023

feat(python)!: Use Object Store instead of fsspec for read_parquet #13044

Merged

stinodego added the accepted (Ready for implementation) and bug (Something isn't working) labels and removed the invalid (A bug report that is not actually a bug) label on Dec 21, 2023

stinodego self-assigned this Dec 21, 2023

astrojuanlu mentioned this issue Dec 21, 2023

Error Reading a Parquet Dataset as an EagerPolarsDataset in a Kedro Pipeline kedro-org/kedro-plugins#500

Closed

stinodego mentioned this issue Dec 23, 2023

fix(python): Correctly use read_parquet for all binary inputs #13218

Merged

ritchie46 closed this as completed in #13218 Dec 23, 2023
