Support Date-typed Hive partition columns in the polars.scan_parquet method
#12894
Labels
A-io-partitioning
Area: reading/writing (Hive) partitioned files
accepted
Ready for implementation
enhancement
New feature or an improvement of an existing feature
Description
Currently, `polars.scan_parquet` supports reading Hive-partitioned tables from cloud storage via the `hive_partitioning=True` parameter. However, the partition column types are inferred automatically. One of the most common kinds of partitioning column is a date column, for example:

```
path/to/table/date=2023-12-04/data.parquet
path/to/table/date=2023-12-05/data.parquet
```

`polars.scan_parquet` always parses such date columns as strings, and there is no way for users to specify the partition column type. One option is to type-cast afterwards, but that is very slow. Also, after the cast, filters no longer appear to be pushed down to the scan level.

After a discussion on Discord, @ritchie46 suggested adding a `try_parse_dates` parameter to `polars.scan_parquet`. This would satisfy my use case. However, if there is appetite for a more well-rounded solution, it would be great to be able to specify the desired type for each Hive partition column in `scan_parquet` (e.g. a mapping of column name to Polars type). As another example, here is how BigQuery does it: https://cloud.google.com/bigquery/docs/hive-partitioned-queries#custom_partition_key_schema