Hive partition ranges #15446

Open

deanm0000 opened this issue Apr 2, 2024 · 5 comments
Labels
A-io (Area: reading and writing data) · A-io-partitioning (Area: reading/writing (Hive) partitioned files) · enhancement (New feature or an improvement of an existing feature)

Comments

@deanm0000
Collaborator

Description

For data where a high-cardinality column is the natural choice to partition by, it'd be nice if we could have ranges instead of strict equality in Hive partitions. I don't think pyarrow or DuckDB do this, so it'd be a trailblazing feature.

Suppose you have data that looks like

{'fruit': String [tens of thousands of unique values],
 'timestamp': Datetime,
 'somedata': Float64}

Most of the time when I query the data I only want a small subset of the keys, but I can't have a Hive partition with tens of thousands of folders or file discovery alone takes forever (using Azure Data Lake Storage Gen2). So it'd be nice if the hive layout could look like

/fruit=(apple-cherry)/file-0.parquet
/fruit=(date-fig)/file-0.parquet
/fruit=(grape-kiwi)/file-0.parquet
/fruit=(lime-raspberry)/file-0.parquet
/fruit=(tangerine-ziziphus)/file-0.parquet

and then have the hive reader set the min and max stats according to what's in the parentheses.

Writing these partitions automatically would obviously be desirable too, but I think the ability to query data laid out like this is 95% of the value of the feature, and having to write the partitions manually is not too bad a burden.
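
To make the idea concrete, here's a rough sketch of the pruning a reader could do by hand with such a layout today (the helper and paths are illustrative only, not an existing Polars API):

```python
import re

import polars as pl

# Hypothetical helper: prune range-encoded partition directories by hand,
# assuming paths follow the /fruit=(lo-hi)/ layout sketched above.
RANGE_RE = re.compile(r"fruit=\(([^-]+)-([^)]+)\)")

def prune_paths(paths: list[str], fruit: str) -> list[str]:
    keep = []
    for path in paths:
        m = RANGE_RE.search(path)
        if m is None:
            keep.append(path)  # no range info: keep the file to be safe
            continue
        lo, hi = m.group(1), m.group(2)
        if lo <= fruit <= hi:  # lexicographic range check on the partition key
            keep.append(path)
    return keep

# List paths first (e.g. via adlfs/fsspec on Azure), prune, then scan only the
# surviving files and filter as usual. hive_partitioning is disabled because the
# directory names are not plain key=value pairs.
paths = [
    "data/fruit=(apple-cherry)/file-0.parquet",
    "data/fruit=(date-fig)/file-0.parquet",
    "data/fruit=(grape-kiwi)/file-0.parquet",
]
lf = pl.scan_parquet(prune_paths(paths, "banana"), hive_partitioning=False).filter(
    pl.col("fruit") == "banana"
)
```

The feature request is essentially for the reader to do this pruning itself, exposing the parsed bounds as min/max statistics.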

@deanm0000 deanm0000 added the enhancement New feature or an improvement of an existing feature label Apr 2, 2024
@stinodego
Member

I don't think this is part of the official Hive spec, although I cannot seem to find an official spec anywhere.

We may support different partitioning strategies in the future, and I can see the strategy you're describing being useful. But I don't think this would fall under Hive partitioning.

@stinodego stinodego added the A-io Area: reading and writing data label Apr 2, 2024
@ion-elgreco
Contributor

@deanm0000 Delta Lake solves this by keeping track of the stats in each file's add action. If Polars adds support for supplying statistics manually per file, then you get the same thing.
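
For reference, a simplified sketch of what one of those add actions looks like in the Delta transaction log (field names follow the Delta protocol; the path and values here are made up):

```python
import json

# Simplified shape of a Delta Lake "add" action from a _delta_log JSON file.
# The per-file min/max stats are what let a reader skip files without opening
# their parquet footers. Path and values are illustrative.
add_action = {
    "add": {
        "path": "part-0.parquet",
        "partitionValues": {},
        "size": 1_048_576,
        "modificationTime": 1712000000000,
        "dataChange": True,
        # In the real log, "stats" is a JSON-encoded string:
        "stats": json.dumps({
            "numRecords": 100_000,
            "minValues": {"fruit": "apple", "timestamp": "2024-01-01T00:00:00"},
            "maxValues": {"fruit": "cherry", "timestamp": "2024-03-31T23:59:59"},
            "nullCount": {"fruit": 0, "timestamp": 0},
        }),
    }
}
```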

@kszlim
Contributor

kszlim commented Apr 3, 2024

> @deanm0000 Delta Lake solves this by keeping track of the stats in each file's add action. If Polars adds support for supplying statistics manually per file, then you get the same thing.

Parquet statistics already achieve this to an extent, but the main problem is that with enough files you end up doing a lot of work just fetching and parsing the statistics, AFAICT.
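
To make that overhead concrete, here's a rough sketch (pyarrow, with a hypothetical column and range) of the per-file work needed when pruning on footer statistics alone:

```python
import pyarrow.parquet as pq

# Rough sketch: pruning on parquet stats alone means fetching and parsing the
# footer of every candidate file. Column name and query range are illustrative.
def file_may_contain(path: str, column: str, lo: str, hi: str) -> bool:
    meta = pq.ParquetFile(path).metadata  # one footer read per file
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            if col.path_in_schema != column:
                continue
            stats = col.statistics
            if stats is None or not stats.has_min_max:
                return True  # no stats recorded: can't prune, must read the file
            if stats.min <= hi and stats.max >= lo:
                return True  # this row group overlaps the queried range
    return False
```

With tens of thousands of files on object storage, each of those footer reads is a round trip, which is exactly the cost the range-encoded directory names would avoid.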

@ion-elgreco
Contributor

@kszlim Those file stats are grabbed from the parquet file during the write.
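
For example (a sketch, not the author's setup; exact column names in the flattened stats output may differ), writing with Polars and then inspecting the log with the deltalake package shows the per-file stats recorded at write time:

```python
import polars as pl
from deltalake import DeltaTable

# Sketch: when a Delta table is written, per-file statistics are recorded in
# the transaction log, so later scans can skip files without touching footers.
df = pl.DataFrame(
    {
        "fruit": ["apple", "banana", "cherry"],
        "somedata": [1.0, 2.0, 3.0],
    }
)
df.write_delta("my_table")  # hypothetical local path

# Each row of the add actions is one data file with its recorded stats.
dt = DeltaTable("my_table")
print(dt.get_add_actions(flatten=True).to_pandas())
```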

@stinodego stinodego added the A-io-partitioning Area: reading/writing (Hive) partitioned files label Apr 3, 2024
@deanm0000
Collaborator Author

@ion-elgreco That is pretty compelling. I have resisted going Delta because I already have my scrapers set up to write plain parquet files, and because I'm uncertain what hurdles there will be given the lack of full Polars integration.

@stinodego The feature will smell equally sweet by any other name so I'm more than happy to not call it a "hive partition".
