Support Date type hive partition columns in polars.scan_parquet method #12894

Closed
baycoder0 opened this issue Dec 5, 2023 · 2 comments · Fixed by #17256
Labels: A-io-partitioning (Area: reading/writing (Hive) partitioned files), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments


baycoder0 commented Dec 5, 2023

Description

Currently, polars.scan_parquet supports reading hive-partitioned tables from cloud storage via the hive_partitioning=True parameter. However, the partition column types are inferred automatically. Date columns are among the most common partitioning columns, for example: path/to/table/date=2023-12-04/data.parquet, path/to/table/date=2023-12-05/data.parquet, etc.

polars.scan_parquet always parses such date columns as strings, and there is no way for users to indicate the partition column type. One option is to type-cast afterwards, but that is very slow. Also, after type casting, filters no longer appear to be pushed down to the scan level.

After a discussion on Discord, @ritchie46 suggested adding a try_parse_dates parameter to polars.scan_parquet. This would satisfy my use case. However, if there is an appetite for a more well-rounded solution, it would be great to be able to specify the desired type for each hive partition column in scan_parquet (e.g. a mapping of column name to Polars type). As another example, here is how BigQuery does it: https://cloud.google.com/bigquery/docs/hive-partitioned-queries#custom_partition_key_schema.
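
For reference, a minimal sketch of the cast-afterwards workaround described above (the path glob and date format are assumptions, not part of the original report):

```python
import polars as pl

# Scan a hive-partitioned dataset; the hive column `date` is inferred as a string.
lf = pl.scan_parquet("path/to/table/**/*.parquet", hive_partitioning=True)

# Workaround: cast the partition column after the scan. Filters applied after
# this cast do not appear to be pushed down to the scan itself.
lf = lf.with_columns(pl.col("date").str.to_date("%Y-%m-%d"))

result = lf.filter(pl.col("date") >= pl.date(2023, 12, 4)).collect()
```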

@baycoder0 added the enhancement (New feature or an improvement of an existing feature) label on Dec 5, 2023
@fcocquemas

This seems to be an issue with more than just date types.

I have a dataset where the hive partition is on a string field. However, one of the partition values happens to be "TRUE", which results in the error:

polars.exceptions.ComputeError: expected hive partitioned path, got test.parquet/value=TRUE/file.parquet

This error occurs with hive_partitioning=True when some paths are hive-partitioned and some are not.

This is with polars==0.20.4.
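
A hypothetical minimal reproduction of the above (the directory names are taken from the error message; the exact layout is an assumption):

```python
import polars as pl
from pathlib import Path

# Build a tiny hive-partitioned layout where one partition value is the string "TRUE".
root = Path("test.parquet")
for value in ["TRUE", "other"]:
    part_dir = root / f"value={value}"
    part_dir.mkdir(parents=True, exist_ok=True)
    pl.DataFrame({"x": [1, 2, 3]}).write_parquet(part_dir / "file.parquet")

# With polars==0.20.4 this reportedly raises:
# polars.exceptions.ComputeError: expected hive partitioned path, got test.parquet/value=TRUE/file.parquet
print(pl.scan_parquet("test.parquet/**/*.parquet", hive_partitioning=True).collect())
```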

@ritchie46
Member

What we need first is schema inference on hive partitions; otherwise some partition columns may come back as strings and/or use different date formats. There needs to be something in place for inferring a schema and communicating that result across the partitions first.
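
A rough illustration (not polars internals) of the kind of cross-partition inference being described, assuming the raw string values for a hive key have already been extracted from the paths:

```python
import polars as pl

def infer_hive_dtype(values: list[str]) -> type[pl.DataType]:
    """Pick a single dtype that every partition value of a hive key fits."""
    s = pl.Series(values)
    # Try progressively more specific parses; any value that fails becomes null.
    if s.cast(pl.Int64, strict=False).null_count() == 0:
        return pl.Int64
    if s.str.to_date("%Y-%m-%d", strict=False).null_count() == 0:
        return pl.Date
    # Partitions that disagree (e.g. mixed formats) fall back to strings.
    return pl.String

print(infer_hive_dtype(["2023-12-04", "2023-12-05"]))  # Date
print(infer_hive_dtype(["TRUE", "2023-12-05"]))        # String
```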

@stinodego added the A-io-partitioning (Area: reading/writing (Hive) partitioned files) label on Mar 29, 2024
@c-peters added the accepted (Ready for implementation) label on Jul 1, 2024