Support Date type hive partition columns in polars.scan_parquet method #12894

Closed
baycoder0 opened this issue Dec 5, 2023 · 2 comments · Fixed by #17256
Labels: A-io-partitioning (Area: reading/writing (Hive) partitioned files), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments


baycoder0 commented Dec 5, 2023

Description

Currently, polars.scan_parquet supports reading hive-partitioned tables from cloud storage via the hive_partitioning=True parameter. However, the partition column types are inferred automatically. Date columns are among the most common partitioning columns, for example: path/to/table/date=2023-12-04/data.parquet, path/to/table/date=2023-12-05/data.parquet, etc.

polars.scan_parquet always parses such date columns as strings, and there is no way for users to indicate the partition column type. One option is to type-cast afterwards, but that is very slow. Also, after type casting, filters no longer appear to be pushed down to the scan level.

After a discussion on Discord, @ritchie46 suggested adding a try_parse_dates parameter to polars.scan_parquet. This would satisfy my use case. However, if there is an appetite for a more well-rounded solution, it would be great to be able to specify the desired type for each hive partition column in scan_parquet (e.g. a mapping of column name to Polars type). As another example, here is how BigQuery does it: https://cloud.google.com/bigquery/docs/hive-partitioned-queries#custom_partition_key_schema.
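
For reference, a minimal sketch of the cast-afterwards workaround described above (the path glob and date format are assumptions, not part of the original report):

```python
import polars as pl

# Scan a hive-partitioned dataset; the hive column `date` is inferred as a string.
lf = pl.scan_parquet("path/to/table/**/*.parquet", hive_partitioning=True)

# Workaround: cast the partition column after the scan. Filters applied after
# this cast do not appear to be pushed down to the scan itself.
lf = lf.with_columns(pl.col("date").str.to_date("%Y-%m-%d"))

result = lf.filter(pl.col("date") >= pl.date(2023, 12, 4)).collect()
```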

@baycoder0 added the enhancement (New feature or an improvement of an existing feature) label on Dec 5, 2023
@fcocquemas

This seems to be an issue with more than just date types.

I have a dataset where the hive partition is on a string field. However, one of the partition values happens to be "TRUE", which results in the error:

polars.exceptions.ComputeError: expected hive partitioned path, got test.parquet/value=TRUE/file.parquet

This error occurs with hive_partitioning=True when some paths are hive-partitioned and some are not.

This is with polars==0.20.4.
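
A hypothetical minimal reproduction of the above (the directory names are taken from the error message; the exact layout is an assumption):

```python
import polars as pl
from pathlib import Path

# Build a tiny hive-partitioned layout where one partition value is the string "TRUE".
root = Path("test.parquet")
for value in ["TRUE", "other"]:
    part_dir = root / f"value={value}"
    part_dir.mkdir(parents=True, exist_ok=True)
    pl.DataFrame({"x": [1, 2, 3]}).write_parquet(part_dir / "file.parquet")

# With polars==0.20.4 this reportedly raises:
# polars.exceptions.ComputeError: expected hive partitioned path, got test.parquet/value=TRUE/file.parquet
print(pl.scan_parquet("test.parquet/**/*.parquet", hive_partitioning=True).collect())
```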

@ritchie46
Member

What we need first is schema inference on hive partitions; otherwise some partition columns may come back as strings and/or use different date formats. There needs to be something in place for inferring a schema and communicating that result across the partitions first.
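
A rough illustration (not polars internals) of the kind of cross-partition inference being described, assuming the raw string values for a hive key have already been extracted from the paths:

```python
import polars as pl

def infer_hive_dtype(values: list[str]) -> type[pl.DataType]:
    """Pick a single dtype that every partition value of a hive key fits."""
    s = pl.Series(values)
    # Try progressively more specific parses; any value that fails becomes null.
    if s.cast(pl.Int64, strict=False).null_count() == 0:
        return pl.Int64
    if s.str.to_date("%Y-%m-%d", strict=False).null_count() == 0:
        return pl.Date
    # Partitions that disagree (e.g. mixed formats) fall back to strings.
    return pl.String

print(infer_hive_dtype(["2023-12-04", "2023-12-05"]))  # Date
print(infer_hive_dtype(["TRUE", "2023-12-05"]))        # String
```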

@stinodego added the A-io-partitioning (Area: reading/writing (Hive) partitioned files) label on Mar 29, 2024
@c-peters added the accepted (Ready for implementation) label on Jul 1, 2024