
dask 2021.8.0 breaks parquet io #84

Closed
iameskild opened this issue Aug 17, 2021 · 3 comments

Comments

@iameskild
Contributor

Opening issue here for visibility and to track any potential updates.

ALL software version info

(this library, plus any other relevant software, e.g. bokeh, python, notebook, OS, browser, etc)

python 3.8.10
spatialpandas 0.4.3
dask 2021.8.0

Description of expected behavior and the observed behavior

Reading and writing parquet files produces errors with the latest version of dask, 2021.8.0. Reverting to dask 2021.7.2 resolves all of the issues outlined below.

The dask team may already be aware of this issue.

Complete, minimal, self-contained example code that reproduces the issue

import dask
import spatialpandas as spd
from spatialpandas.io import read_parquet_dask

path = "s3://bucketname/data.parquet"
# Substitute real S3 credentials for the placeholders below.
storage_options = {"key": "<key>", "secret": "<secret>"}

ddf = read_parquet_dask(path, storage_options=storage_options)

Both the code block above and ddf.pack_partitions_to_parquet produce the following error:

Traceback (most recent call last):
  File "/app/datum/readers/quadrant.py", line 106, in sort_quadrant_files
    ddf_packed = ddf.pack_partitions_to_parquet(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/spatialpandas/dask.py", line 530, in pack_partitions_to_parquet
    return read_parquet_dask(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/spatialpandas/io/parquet.py", line 321, in read_parquet_dask
    result = _perform_read_parquet_dask(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/spatialpandas/io/parquet.py", line 421, in _perform_read_parquet_dask
    meta = dd_read_parquet(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 537, in read_metadata
    ) = cls._gather_metadata(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 1035, in _gather_metadata
    ds = pa_ds.parquet_dataset(
  File "/opt/conda/envs/datum/lib/python3.8/site-packages/pyarrow/dataset.py", line 457, in parquet_dataset
    factory = ParquetDatasetFactory(
  File "pyarrow/_dataset.pyx", line 2043, in pyarrow._dataset.ParquetDatasetFactory.__init__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Extracting file path from RowGroup failed. The column chunks' file paths should be set, but got an empty file path.
@jorisvandenbossche

I think at some point dask/pyarrow wrote _metadata files without the proper file paths. The latest release of dask switched to engine="pyarrow-dataset" as the default, which may not be able to read such datasets. Can you check whether passing engine="pyarrow-legacy" fixes it? (If so, we should also consider adding a workaround for this to the "pyarrow-dataset" engine.)

@julioasotodv

Having the exact same issue with ddf.pack_partitions_to_parquet(). I believe there is no way to pass engine="pyarrow-legacy" through that function. @jorisvandenbossche, in fact, I cannot find that argument value documented anywhere in Dask.

@ianthomas23
Member

This is fixed by #92. Tests all pass with dask 2021.8.0 (the first version that caused failures) and the latest 2022.7.1.
