
Unable to read hive-style partitioned parquet file using read_parquet #10276

Closed
lmocsi opened this issue Aug 3, 2023 · 10 comments
Labels
A-io (Area: reading and writing data), bug (Something isn't working), python (Related to Python Polars)

Comments

@lmocsi

lmocsi commented Aug 3, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

"""
I have a partitioned parquet dataset like this (the top-level directory is
my_table_name, and it is partitioned by the calendar_date column, see attached):

my_table_name
 |
 +[calendar_date=2023-08-01 00%3A00%3A00]
 | |
 | + part-00000-b3b03a4e.c000.snappy.parquet
 | 
 +[calendar_date=2023-08-02 00%3A00%3A00]
 | |
 | + part-00000-f4b5a541.c000.snappy.parquet
 |
 +[calendar_date=2023-08-03 00%3A00%3A00]
   |
   + part-00000-6cf29fe7.c000.snappy.parquet

"""

import polars as pl

path = "/path/"  # parent directory of my_table_name (as in the error message below)
df = pl.read_parquet(path + 'my_table_name/*')  # on SO it was recommended that /* could be used

"""
It is giving me the error:
ComputeError: error while reading /path/my_table_name/CALENDAR_DATE=2023-08-01 00%3A00%3A00: External format error: File out of specification: underlying IO error: Invalid argument (os error 22)
"""
my_table_name.zip
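
If the attached zip is unavailable, here is a minimal sketch (not the original data) that writes a similarly hive-partitioned dataset with pyarrow, using illustrative column names and values:

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative stand-in for the attached data.
table = pa.table({
    "calendar_date": ["2023-08-01", "2023-08-01", "2023-08-02", "2023-08-03"],
    "USER_ID": [1000.0, 1001.0, 1002.0, 1003.0],
    "TRX_CNT": [434.0, 11.0, 3.0, 555.0],
})

# Writes my_table_name/calendar_date=<value>/part-....parquet directories,
# i.e. the same hive-style layout shown in the tree above.
pq.write_to_dataset(table, root_path="my_table_name", partition_cols=["calendar_date"])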

Issue description

It seems that, as of now, Polars supports only two types of parquet structure:

  1. all the data is in one parquet file
  2. the data is split into separate files within one directory

True hive-style partitioned parquet datasets (where there is a separate directory for each partition) do not seem to be supported. :(

Expected behavior

Being able to read the partitioned parquet dataset, even as a lazy dataframe (just like Spark does).

Installed versions

--------Version info---------
Polars:              0.18.11
Index type:          UInt32
Platform:            Linux-4.18.0-372.51.1.el8_6.x86_64-x86_64-with-glibc2.28
Python:              3.9.13 (main, Oct 13 2022, 21:15:33) 
[GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         2.0.0
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2022.02.0
matplotlib:          3.7.2
numpy:               1.21.6
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            <not installed>
sqlalchemy:          1.4.27
xlsx2csv:            <not installed>
xlsxwriter:          3.1.2

@lmocsi added the bug (Something isn't working) and python (Related to Python Polars) labels Aug 3, 2023
@cmdlineluser
Contributor

There is scan_pyarrow_dataset()

import pyarrow.dataset as ds
import polars as pl

pl.scan_pyarrow_dataset(ds.dataset("my_table_name")).collect()

# shape: (13, 2)
# ┌─────────┬─────────┐
# │ USER_ID ┆ TRX_CNT │
# │ ---     ┆ ---     │
# │ f64     ┆ f64     │
# ╞═════════╪═════════╡
# │ 1000.0  ┆ 434.0   │
# │ 1001.0  ┆ 11.0    │
# │ 1002.0  ┆ 3.0     │
# │ 1003.0  ┆ 555.0   │
# │ …       ┆ …       │
# │ 1001.0  ┆ 21.0    │
# │ 1003.0  ┆ 44.0    │
# │ 1005.0  ┆ 111.0   │
# │ 1008.0  ┆ 222.0   │
# └─────────┴─────────┘
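
Since scan_pyarrow_dataset returns a LazyFrame, filters can be applied before collecting; a small hedged sketch using a column name from the output above:

import pyarrow.dataset as ds
import polars as pl

# Build the lazy scan over the pyarrow dataset and filter before collecting;
# Polars can push simple predicates down into the dataset scan.
lf = pl.scan_pyarrow_dataset(ds.dataset("my_table_name"))
df = lf.filter(pl.col("TRX_CNT") > 100).collect()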

@alexander-beedie
Collaborator

alexander-beedie commented Aug 3, 2023

df = pl.read_parquet(path + 'my_table_name/*')  # on SO it was recommended that /* could be used

This is correct if all the files are in the same directory, but otherwise (as @cmdlineluser says) you need to use scan_pyarrow_dataset to read directory-nested (hive-style) partitioned parquet data.

(I've just committed a small update to the docs that adds a more explicit note to read_parquet and scan_parquet to help direct users to the right method).
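
For contrast, a minimal sketch of the flat layout that read_parquet's glob handling already covers (file names are illustrative):

import polars as pl

# Works when all parquet files sit directly in one directory
# (no nested partition folders), e.g. my_table_name/part-00000.parquet, ...
df = pl.read_parquet("my_table_name/*.parquet")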

@stinodego
Member

stinodego commented Aug 3, 2023

I think read_parquet should support this. If I am trying to read a parquet file, I should be using read_parquet. Even if it is partitioned.

Not sure how hard this is to implement, but it should be a goal, in my opinion.

@stinodego added the accepted (Ready for implementation) label Aug 3, 2023
@lmocsi
Author

lmocsi commented Aug 3, 2023

Could pl.read_parquet() just call this pl.scan_pyarrow_dataset() function under the hood?
Also, reading such a parquet dataset should include the partitioning column (here calendar_date) as well...
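
A hedged sketch of how the partition column can already be recovered through pyarrow's hive partitioning discovery (the partitioning argument is an assumption based on the layout above):

import pyarrow.dataset as ds
import polars as pl

# Declaring the layout as hive-style makes pyarrow parse the
# "calendar_date=..." directory names into a calendar_date column.
dataset = ds.dataset("my_table_name", partitioning="hive")
df = pl.scan_pyarrow_dataset(dataset).collect()
print(df.columns)  # expected to include "calendar_date"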

@alexander-beedie changed the title from "Unable to read partitioned parquet file" to "Unable to read hive-style partitioned parquet file using read_parquet" Aug 3, 2023
@alexander-beedie
Collaborator

(Updated the title to clarify the issue more specifically, so we can reference it easily later).

@lmocsi
Author

lmocsi commented Aug 4, 2023

Added my last comment as a separate issue: #10296

@universalmind303
Collaborator

Looks like #4347 and #426 are duplicates. Since this is the most recent one, I'll keep this open & close out the other two.

@ddutt

ddutt commented Sep 8, 2023

> I think read_parquet should support this. If I am trying to read a parquet file, I should be using read_parquet. Even if it is partitioned.
>
> Not sure how hard this is to implement, but it should be a goal, in my opinion.

scan_pyarrow_dataset performance isn't as good. I'm adding the hive directories as columns myself, and even though I have a lot of folders, that method is still faster than using scan_pyarrow_dataset. Please add this support to scan_parquet and read_parquet.
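
A hedged sketch of that manual approach (not ddutt's actual code; directory and column names are taken from the example at the top of the thread):

import glob
import os
import polars as pl

# Scan each partition directory lazily and attach the directory's
# partition value as a column, then concatenate the lazy frames.
frames = []
for part_dir in glob.glob("my_table_name/calendar_date=*"):
    value = os.path.basename(part_dir).split("=", 1)[1]
    frames.append(
        pl.scan_parquet(os.path.join(part_dir, "*.parquet"))
        .with_columns(pl.lit(value).alias("calendar_date"))
    )

lf = pl.concat(frames)
df = lf.collect()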

@stinodego
Member

This should be fixed by #13044

If not, please comment and I can reopen this issue.
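
For later readers, a short sketch of what this looks like once native hive support lands; the hive_partitioning parameter refers to newer Polars releases and is not available in the version discussed above:

import polars as pl

# On newer Polars releases with native hive partitioning support,
# scanning the nested layout directly exposes calendar_date as a column.
lf = pl.scan_parquet("my_table_name/**/*.parquet", hive_partitioning=True)
df = lf.collect()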

@lmocsi
Author

lmocsi commented Jan 21, 2024 via email

@stinodego added the A-io (Area: reading and writing data) label and removed the accepted (Ready for implementation) label Jan 21, 2024