feat(python)!: Use Object Store instead of fsspec for read_parquet #13044

Merged
ritchie46 merged 10 commits into main from read-scan-collect on Dec 15, 2023



@stinodego stinodego commented Dec 14, 2023

Ref #13040

Changes

  • Dispatch read_parquet to scan_parquet where appropriate. This means that for purposes of cloud reading, it no longer uses fsspec.
  • Add hive_partitioning and retries parameters.
  • Shuffle around some parameters to make the order more sensible.
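
As a rough illustration of the first bullet, the dispatch can be pictured as a type check on the source. This is a minimal sketch, not the Polars implementation: `pick_code_path` is a hypothetical helper, and the assumption that in-memory binary inputs stay on the eager reader while paths and URIs go through the lazy scan is an inference from the PR description.

```python
import io
import os
from pathlib import Path


def pick_code_path(source):
    """Return "eager" or "lazy" for a source (hypothetical helper, not Polars API)."""
    # Assumption: in-memory binary inputs stay on the eager reader.
    if isinstance(source, (bytes, io.BytesIO)):
        return "eager"
    # Assumption: paths and URIs are handed to the lazy scan and then
    # collected, which is where cloud reads stop going through fsspec.
    if isinstance(source, (str, os.PathLike)):
        return "lazy"
    raise TypeError(f"unsupported source type: {type(source).__name__}")


print(pick_code_path("s3://bucket/data.parquet"))
print(pick_code_path(Path("data.parquet")))
print(pick_code_path(io.BytesIO(b"PAR1")))
```

Raising on unknown types is a choice made for this sketch only; it keeps an unexpected input from being silently treated as a path.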

@github-actions github-actions bot added the performance (Performance issues or improvements) and python (Related to Python Polars) labels Dec 14, 2023
@stinodego stinodego changed the title from "perf(python): Use Polars parquet reader for read_parquet instead of FSSPEC" to "perf(python): Dispatch read_parquet to scan_parquet internally" Dec 14, 2023
@stinodego stinodego changed the title from "perf(python): Dispatch read_parquet to scan_parquet internally" to "feat(python): Dispatch read_parquet to scan_parquet internally" Dec 14, 2023
@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature) label Dec 14, 2023
@stinodego stinodego changed the title from "feat(python): Dispatch read_parquet to scan_parquet internally" to "feat(python)!: Dispatch read_parquet to scan_parquet internally" Dec 14, 2023
@stinodego stinodego removed the performance (Performance issues or improvements) label Dec 14, 2023
@github-actions github-actions bot added the breaking (Change that breaks backwards compatibility) label Dec 14, 2023
@stinodego stinodego marked this pull request as ready for review December 15, 2023 04:47
@stinodego stinodego changed the title from "feat(python)!: Dispatch read_parquet to scan_parquet internally" to "feat(python)!: Use Object Store instead of fsspec for read_parquet" Dec 15, 2023
@stinodego stinodego added this to the 0.20.0 milestone Dec 15, 2023
@ritchie46 ritchie46 merged commit fc03c4a into main Dec 15, 2023
14 checks passed
@ritchie46 ritchie46 deleted the read-scan-collect branch December 15, 2023 10:12

ldacey commented Dec 20, 2023

Ahh, so I believe this is what broke my code here, where I open the file with an fsspec filesystem and pass the file object to read_parquet?

print(pl.__version__)

fs = path.hook.filesystem

with fs.open(path.dataset_uri) as f:
    test = pl.read_parquet(f)

print(test.height)

0.19.19
75594

After upgrading to 0.20.1:

0.20.1
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[4], line 6
      3 fs = path.hook.filesystem
      5 with fs.open(path.dataset_uri) as f:
----> 6     test = pl.read_parquet(f)
      8 print(test.height)

File ~/.local/lib/python3.11/site-packages/polars/io/parquet/functions.py:163, in read_parquet(source, columns, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    150         return pl.DataFrame._read_parquet(
    151             source_prep,
    152             columns=columns,
   (...)
    159             rechunk=rechunk,
    160         )
    162 # For other inputs, defer to `scan_parquet`
--> 163 lf = scan_parquet(
    164     source,
    165     n_rows=n_rows,
    166     row_count_name=row_count_name,
    167     row_count_offset=row_count_offset,
    168     parallel=parallel,
    169     use_statistics=use_statistics,
    170     hive_partitioning=hive_partitioning,
    171     rechunk=rechunk,
    172     low_memory=low_memory,
    173     cache=False,
    174     storage_options=storage_options,
    175     retries=retries,
    176 )
    178 if columns is not None:
    179     if is_int_sequence(columns):

File ~/.local/lib/python3.11/site-packages/polars/io/parquet/functions.py:303, in scan_parquet(source, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, cache, storage_options, retries)
    301     source = normalize_filepath(source)
    302 else:
--> 303     source = [normalize_filepath(source) for source in source]
    305 return pl.LazyFrame._scan_parquet(
    306     source,
    307     n_rows=n_rows,
   (...)
    317     retries=retries,
    318 )

File ~/.local/lib/python3.11/site-packages/polars/io/parquet/functions.py:303, in <listcomp>(.0)
    301     source = normalize_filepath(source)
    302 else:
--> 303     source = [normalize_filepath(source) for source in source]
    305 return pl.LazyFrame._scan_parquet(
    306     source,
    307     n_rows=n_rows,
   (...)
    317     retries=retries,
    318 )

File ~/.local/lib/python3.11/site-packages/polars/utils/various.py:228, in normalize_filepath(path, check_not_directory)
    226 """Create a string path, expanding the home directory if present."""
    227 # don't use pathlib here as it modifies slashes (s3:// -> s3:/)
--> 228 path = os.path.expanduser(path)  # noqa: PTH111
    229 if (
    230     check_not_directory
    231     and os.path.exists(path)  # noqa: PTH110
    232     and os.path.isdir(path)  # noqa: PTH112
    233 ):
    234     raise IsADirectoryError(f"expected a file path; {path!r} is a directory")

File <frozen posixpath>:266, in expanduser(path)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 4: ordinal not in range(128)
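
One plausible reading of this traceback, not confirmed in the thread: the file object is not treated as a single path, so the list comprehension at line 303 iterates over it, and each chunk of raw bytes is passed to `os.path.expanduser` as if it were a path. The sketch below reproduces that pattern with the standard library only. `fake_file` and `try_expanduser` are hypothetical stand-ins, and whether `expanduser` raises depends on the Python version and platform.

```python
import io
import os

# Stand-in for a binary Parquet file: arbitrary bytes containing newlines.
# The second "line" starts with "~" and contains the non-ASCII byte 0xe6.
fake_file = io.BytesIO(b"PAR1\x15\x04\n~abcd\xe6xyz\nPAR1")

# Iterating a binary file object yields bytes chunks split on b"\n".
chunks = list(fake_file)


def try_expanduser(chunk):
    """Call os.path.expanduser on a chunk and report the outcome as text."""
    try:
        os.path.expanduser(chunk)
    except Exception as exc:  # outcome varies by Python version and platform
        return f"{type(exc).__name__}: {exc}"
    return "ok"


for chunk in chunks:
    print(chunk, "->", try_expanduser(chunk))
```

A chunk that happens to start with `~` is the interesting case, because `expanduser` then tries to interpret the following bytes as a user name.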

It looks like the docs for read_parquet still show that file-like objects are allowed though?

source
Path to a file, or a file-like object. If the path is a directory, files in that directory will all be read.
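
Until a fix lands, one possible workaround is to read the remote file into an in-memory buffer before handing it to `read_parquet`. This is a sketch under assumptions: `to_bytes_buffer` and `MinimalFile` are hypothetical names, and it has not been verified in this thread that an `io.BytesIO` input takes the eager branch visible at lines 150 to 160 of the traceback on 0.20.1.

```python
import io


def to_bytes_buffer(file_obj):
    """Read a binary file-like object fully into an in-memory io.BytesIO."""
    data = file_obj.read()
    if not isinstance(data, bytes):
        raise TypeError("expected a file opened in binary mode")
    return io.BytesIO(data)


class MinimalFile:
    """Tiny stand-in for a remote file object that only offers read()."""

    def __init__(self, payload):
        self._payload = payload

    def read(self):
        return self._payload


buffer = to_bytes_buffer(MinimalFile(b"PAR1...PAR1"))
print(type(buffer).__name__, len(buffer.getvalue()))

# Hypothetical usage with the names from the report above (unverified):
# with fs.open(path.dataset_uri) as f:
#     test = pl.read_parquet(to_bytes_buffer(f))
```

The trade-off is that the whole file is held in memory at once, which may not suit large datasets.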

@stinodego
Member Author

We are aware of the issue, see #13099.

A fix will come shortly.

Labels: accepted (Ready for implementation), breaking (Change that breaks backwards compatibility), enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars)