read_parquet binary not working any more (since 0.20.0) #13099

Closed
2 tasks done
michael72 opened this issue Dec 18, 2023 · 4 comments · Fixed by #13218
Labels: accepted (Ready for implementation), bug (Something isn't working), python (Related to Python Polars)

Comments

@michael72

michael72 commented Dec 18, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import s3fs

s3 = s3fs.S3FileSystem()

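BUCKET_NAME = "my-bucket"  # placeholder; substitute your own bucket

# Read the Parquet file through an open binary file handle
with s3.open(f'{BUCKET_NAME}/my.parquet', mode='rb') as f:
    print(pl.read_parquet(f).head())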

Log output

08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/io/parquet/functions.py:163, in read_parquet(source, columns, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
08:20:05     150         return pl.DataFrame._read_parquet(
08:20:05     151             source_prep,
08:20:05     152             columns=columns,
08:20:05    (...)
08:20:05     159             rechunk=rechunk,
08:20:05     160         )
08:20:05     162 # For other inputs, defer to `scan_parquet`
08:20:05 --> 163 lf = scan_parquet(
08:20:05     164     source,
08:20:05     165     n_rows=n_rows,
08:20:05     166     row_count_name=row_count_name,
08:20:05     167     row_count_offset=row_count_offset,
08:20:05     168     parallel=parallel,
08:20:05     169     use_statistics=use_statistics,
08:20:05     170     hive_partitioning=hive_partitioning,
08:20:05     171     rechunk=rechunk,
08:20:05     172     low_memory=low_memory,
08:20:05     173     cache=False,
08:20:05     174     storage_options=storage_options,
08:20:05     175     retries=retries,
08:20:05     176 )
08:20:05     178 if columns is not None:
08:20:05     179     if is_int_sequence(columns):
08:20:05 
08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/io/parquet/functions.py:303, in scan_parquet(source, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, cache, storage_options, retries)
08:20:05     301     source = normalize_filepath(source)
08:20:05     302 else:
08:20:05 --> 303     source = [normalize_filepath(source) for source in source]
08:20:05     305 return pl.LazyFrame._scan_parquet(
08:20:05     306     source,
08:20:05     307     n_rows=n_rows,
08:20:05    (...)
08:20:05     317     retries=retries,
08:20:05     318 )
08:20:05 
08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/io/parquet/functions.py:303, in <listcomp>(.0)
08:20:05     301     source = normalize_filepath(source)
08:20:05     302 else:
08:20:05 --> 303     source = [normalize_filepath(source) for source in source]
08:20:05     305 return pl.LazyFrame._scan_parquet(
08:20:05     306     source,
08:20:05     307     n_rows=n_rows,
08:20:05    (...)
08:20:05     317     retries=retries,
08:20:05     318 )
08:20:05 
08:20:05 File /usr/local/lib/python3.10/dist-packages/polars/utils/various.py:228, in normalize_filepath(path, check_not_directory)
08:20:05     226 """Create a string path, expanding the home directory if present."""
08:20:05     227 # don't use pathlib here as it modifies slashes (s3:// -> s3:/)
08:20:05 --> 228 path = os.path.expanduser(path)  # noqa: PTH111
08:20:05     229 if (
08:20:05     230     check_not_directory
08:20:05     231     and os.path.exists(path)  # noqa: PTH110
08:20:05     232     and os.path.isdir(path)  # noqa: PTH112
08:20:05     233 ):
08:20:05     234     raise IsADirectoryError(f"expected a file path; {path!r} is a directory")
08:20:05 
08:20:05 File /usr/lib/python3.10/posixpath.py:258, in expanduser(path)
08:20:05     256 name = path[1:i]
08:20:05     257 if isinstance(name, bytes):
08:20:05 --> 258     name = str(name, 'ASCII')
08:20:05     259 try:
08:20:05     260     pwent = pwd.getpwnam(name)
08:20:05 
08:20:05 UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 1: ordinal not in range(128)

Issue description

This regression was introduced in Polars 0.20.0; the same code worked up to and including 0.19.19.
According to its signature, read_parquet accepts binary input (here, a file object opened from S3), and it used to work that way.

Now normalize_filepath is called on the binary input, which does not make sense.
In addition, read_parquet, whose signature accepts BinaryIO, internally calls scan_parquet, which no longer does.
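
One possible reading of the traceback, sketched below with the standard library only: because the file object is neither a str nor a Path, the list comprehension in scan_parquet iterates over it, and iterating a binary file yields newline-delimited chunks of bytes. Any chunk that happens to start with `~` is then treated by expanduser as a `~user` path. The byte string here is synthetic (it is not a valid Parquet file), and `try_expand` is a hypothetical helper used only for illustration.

```python
import io
import posixpath

# Synthetic stand-in for binary Parquet content; NOT a valid Parquet file.
fake_parquet = io.BytesIO(b"PAR1\x15\x04\n~a\xf0\x9f\n\x00PAR1")

# Iterating a binary file object yields newline-delimited bytes chunks,
# which is what `[normalize_filepath(source) for source in source]` sees.
chunks = list(fake_parquet)


def try_expand(chunk: bytes) -> str:
    """Hypothetical helper: report what expanduser does with one chunk."""
    try:
        posixpath.expanduser(chunk)
    except Exception as exc:
        # On the Python 3.10 shown in the traceback, a chunk such as
        # b"~a\xf0..." raises UnicodeDecodeError here; other Python
        # versions may behave differently.
        return type(exc).__name__
    return "ok"


results = [try_expand(chunk) for chunk in chunks]
print(list(zip(chunks, results)))
```

Chunks that do not start with `~` pass through expanduser unchanged, which would explain why the failure depends on the file's contents.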

Expected behavior

Reading a Parquet file passed as binary content works.

Installed versions

--------Version info---------
Polars:               0.20.0
Index type:           UInt32
Platform:             Linux-5.19.0-46-generic-x86_64-with-glibc2.35
Python:               3.11.4 (main, Jun  7 2023, 12:45:48) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
matplotlib:           3.7.2
numpy:                1.26.0
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              14.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
michael72 added the bug and python labels on Dec 18, 2023
@ritchie46
Member

ritchie46 commented Dec 18, 2023

You should pass the s3 path directly to the reader. This is an intended change.

Try: pl.read_parquet(f'{BUCKET_NAME}/my.parquet')
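
A minimal sketch of this workaround, assuming the path is given as a full `s3://` URI and that credentials are available in the environment. `s3_uri` and `read_parquet_from_s3` are hypothetical helper names, not part of Polars; the bucket and key are placeholders.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Hypothetical helper: build an s3:// URI from a bucket and object key."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def read_parquet_from_s3(bucket: str, key: str):
    """Pass the path string to Polars instead of an open file handle."""
    import polars as pl  # imported lazily so the URI helper has no dependencies

    return pl.read_parquet(s3_uri(bucket, key))


# Usage (requires Polars and S3 access, so it is left commented out):
# df = read_parquet_from_s3("my-bucket", "my.parquet")
# print(df.head())
```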

ritchie46 added the invalid label and removed the bug label on Dec 18, 2023
@ritchie46
Member

@stinodego can we raise a more informative error here? We essentially get an object we don't expect, right?
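
As an illustration of what a more informative error could look like, here is a sketch of an up-front type check. It is not the Polars implementation: `classify_source` is a hypothetical helper, and the assumption that remote file handles derive from `io.IOBase` is mine.

```python
from __future__ import annotations

import io
from pathlib import Path
from typing import Any


def classify_source(source: Any) -> str:
    """Hypothetical helper: label a source as 'path', 'paths', or 'binary'."""
    if isinstance(source, (str, Path)):
        return "path"
    # Assumption: binary inputs are raw bytes or objects derived from io.IOBase.
    if isinstance(source, (bytes, io.IOBase)):
        return "binary"
    if isinstance(source, (list, tuple)) and all(
        isinstance(item, (str, Path)) for item in source
    ):
        return "paths"
    # Fail early with a message that names the offending type.
    raise TypeError(
        f"unsupported source type {type(source).__name__!r}; expected a path, "
        "a list of paths, or a binary file-like object"
    )
```

Checking the type before any path normalization means an unexpected object fails with a message about the input, not with a decoding error from deep inside `os.path`.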

@michael72
Author

> You should pass the s3 path directly to the reader. This is an intended change.
>
> Try: pl.read_parquet(f'{BUCKET_NAME}/my.parquet')

OK - thanks! Yes, that works now!

However, it would be good to at least update the documentation and type hints for the source parameter of read_parquet to match scan_parquet, i.e. only source: str | Path | list[str] | list[Path], and to remove BinaryIO | BytesIO | bytes, since they no longer seem to be supported. Alternatively, add tests that check these parameter types and fix the handling.

See https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html

@stinodego
Member

They are supported, but only for local filesystems. I think I actually made a mistake in rewriting the function and your input should still work. Will take a look later today or tomorrow.

You can use Ritchie's workaround for now - it's the recommended usage anyway.
