Hive partition ranges #15446

Open

deanm0000 opened this issue Apr 2, 2024 · 5 comments
Labels
A-io (Area: reading and writing data) · A-io-partitioning (Area: reading/writing (Hive) partitioned files) · enhancement (New feature or an improvement of an existing feature)

Comments

@deanm0000
Collaborator

Description

For data where a high-cardinality column is the natural choice to partition by, it'd be nice if we could have ranges instead of strict equality in Hive partitions. I don't think pyarrow or DuckDB do this, so it'd be a trailblazing feature.

Suppose you have data that looks like

{'fruit': String [tens of thousands of unique values],
 'timestamp': Datetime,
 'somedata': Float64}

Most of the time when I query the data I only want a small subset of the keys, but I can't have a Hive partition with tens of thousands of folders or file discovery alone takes forever (using Azure Data Lake Storage Gen2). So it'd be nice if the hive layout could look like

/fruit=(apple-cherry)/file-0.parquet
/fruit=(date-fig)/file-0.parquet
/fruit=(grape-kiwi)/file-0.parquet
/fruit=(lime-raspberry)/file-0.parquet
/fruit=(tangerine-ziziphus)/file-0.parquet

and then have the hive reader set the min and max stats according to what's in the parentheses.

Writing these partitions automatically would obviously be desirable too, but I think the ability to query data laid out like this is 95% of the value of the feature, and having to write the partitions manually is not too bad a burden.
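
To make the idea concrete, here's a rough sketch of the pruning a reader could do by hand with such a layout today (the helper and paths are illustrative only, not an existing Polars API):

```python
import re

import polars as pl

# Hypothetical helper: prune range-encoded partition directories by hand,
# assuming paths follow the /fruit=(lo-hi)/ layout sketched above.
RANGE_RE = re.compile(r"fruit=\(([^-]+)-([^)]+)\)")

def prune_paths(paths: list[str], fruit: str) -> list[str]:
    keep = []
    for path in paths:
        m = RANGE_RE.search(path)
        if m is None:
            keep.append(path)  # no range info: keep the file to be safe
            continue
        lo, hi = m.group(1), m.group(2)
        if lo <= fruit <= hi:  # lexicographic range check on the partition key
            keep.append(path)
    return keep

# List paths first (e.g. via adlfs/fsspec on Azure), prune, then scan only the
# surviving files and filter as usual. hive_partitioning is disabled because the
# directory names are not plain key=value pairs.
paths = [
    "data/fruit=(apple-cherry)/file-0.parquet",
    "data/fruit=(date-fig)/file-0.parquet",
    "data/fruit=(grape-kiwi)/file-0.parquet",
]
lf = pl.scan_parquet(prune_paths(paths, "banana"), hive_partitioning=False).filter(
    pl.col("fruit") == "banana"
)
```

The feature request is essentially for the reader to do this pruning itself, exposing the parsed bounds as min/max statistics.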

@deanm0000 deanm0000 added the enhancement New feature or an improvement of an existing feature label Apr 2, 2024
@stinodego
Member

I don't think this is part of the official Hive spec, although I cannot seem to find an official spec anywhere.

We may support different partitioning strategies in the future, and I can see the strategy you're describing being useful. But I don't think this would fall under Hive partitioning.

@stinodego stinodego added the A-io Area: reading and writing data label Apr 2, 2024
@ion-elgreco
Contributor

@deanm0000 Delta Lake solves this by keeping track of the stats in each file's add action. If Polars adds support for supplying statistics manually per file, then you get the same thing.
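
For reference, a simplified sketch of what one of those add actions looks like in the Delta transaction log (field names follow the Delta protocol; the path and values here are made up):

```python
import json

# Simplified shape of a Delta Lake "add" action from a _delta_log JSON file.
# The per-file min/max stats are what let a reader skip files without opening
# their parquet footers. Path and values are illustrative.
add_action = {
    "add": {
        "path": "part-0.parquet",
        "partitionValues": {},
        "size": 1_048_576,
        "modificationTime": 1712000000000,
        "dataChange": True,
        # In the real log, "stats" is a JSON-encoded string:
        "stats": json.dumps({
            "numRecords": 100_000,
            "minValues": {"fruit": "apple", "timestamp": "2024-01-01T00:00:00"},
            "maxValues": {"fruit": "cherry", "timestamp": "2024-03-31T23:59:59"},
            "nullCount": {"fruit": 0, "timestamp": 0},
        }),
    }
}
```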

@kszlim
Contributor

kszlim commented Apr 3, 2024

> @deanm0000 Delta Lake solves this by keeping track of the stats in each file's add action. If Polars adds support for supplying statistics manually per file, then you get the same thing.

Parquet statistics already achieve this to an extent, but the main problem is that with enough files you end up doing a lot of work just fetching and parsing the statistics, AFAICT.
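
To make that overhead concrete, here's a rough sketch (pyarrow, with a hypothetical column and range) of the per-file work needed when pruning on footer statistics alone:

```python
import pyarrow.parquet as pq

# Rough sketch: pruning on parquet stats alone means fetching and parsing the
# footer of every candidate file. Column name and query range are illustrative.
def file_may_contain(path: str, column: str, lo: str, hi: str) -> bool:
    meta = pq.ParquetFile(path).metadata  # one footer read per file
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            if col.path_in_schema != column:
                continue
            stats = col.statistics
            if stats is None or not stats.has_min_max:
                return True  # no stats recorded: can't prune, must read the file
            if stats.min <= hi and stats.max >= lo:
                return True  # this row group overlaps the queried range
    return False
```

With tens of thousands of files on object storage, each of those footer reads is a round trip, which is exactly the cost the range-encoded directory names would avoid.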

@ion-elgreco
Contributor

@kszlim Those file stats are grabbed from the parquet file during the write.
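
For example (a sketch, not the author's setup; exact column names in the flattened stats output may differ), writing with Polars and then inspecting the log with the deltalake package shows the per-file stats recorded at write time:

```python
import polars as pl
from deltalake import DeltaTable

# Sketch: when a Delta table is written, per-file statistics are recorded in
# the transaction log, so later scans can skip files without touching footers.
df = pl.DataFrame(
    {
        "fruit": ["apple", "banana", "cherry"],
        "somedata": [1.0, 2.0, 3.0],
    }
)
df.write_delta("my_table")  # hypothetical local path

# Each row of the add actions is one data file with its recorded stats.
dt = DeltaTable("my_table")
print(dt.get_add_actions(flatten=True).to_pandas())
```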

@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Apr 3, 2024
@deanm0000
Collaborator Author

@ion-elgreco That is pretty compelling. I have resisted going Delta because I already have my scrapers set up to write plain parquet files, and because I'm uncertain what hurdles there will be given the lack of full Polars integration.

@stinodego The feature will smell equally sweet by any other name so I'm more than happy to not call it a "hive partition".
