[FEATURE REQUEST]: Support partitioning and bucketing of the index dataset #351
Comments
Thank you for opening this issue @andrei-ionescu! I'm copying some context from #329 for easier readability.

From: @imback82
From: @andrei-ionescu
From: @imback82
From: @andrei-ionescu
I have some minor questions:
I don't think this matters too much. In either case, at best, the plan gets transformed into reading only the indexes instead of the datasets. Can you shed more light on why this matters in your opinion?
Let me put my point of view in this area...
I'm not against it at all. I think it solves this problem but adds a lot of complexity. I've seen it in Iceberg's partition spec, where they offer such column transformations for partitioning. Adding this new concept will add more complexity to Hyperspace: migration, management, how it can be altered, which column transformations will be used/available, how to specify them when creating the index, etc. I think adding partitioning as an option besides the current bucketing adds less complexity than bringing in the "transformations".
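To make the "transformations" concept concrete, here is a minimal sketch of Iceberg-style partition transforms in plain Python. The function names mirror Iceberg's transform vocabulary, but this is illustrative only, not the Iceberg or Hyperspace API; in particular, Python's built-in `hash` stands in for the Murmur3 hash Iceberg actually uses for its bucket transform:

```python
# Iceberg-style partition transforms: each transform maps a source column
# value to a partition value, and files are grouped by that partition value.

def day(epoch_millis: int) -> int:
    """Truncate an epoch-millisecond timestamp to a day ordinal."""
    return epoch_millis // 86_400_000

def bucket(n: int, value) -> int:
    """Hash a value into one of n buckets (stand-in for Murmur3)."""
    return hash(value) % n

def truncate(width: int, s: str) -> str:
    """Truncate a string to a fixed width."""
    return s[:width]
```

Each such transform is one more thing to specify at index creation time and to keep consistent across index versions, which is the management/migration complexity referred to above.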
It really depends on how the index works. I'm also a beginner with Hyperspace, but I can see its power in the Data Lake world. The use case I'm currently testing Hyperspace on is improving read performance when joining two big Iceberg datasets (billions of rows). The join clause is on two non-time-based columns, while the where clause is on the timestamp. Because I have limited experience with Hyperspace, I may not have been using it the best way. Given this two-dataset join query
with
and
What indexes should I create? Now, regarding your questions "How many predicates do you expect in the queries? And do users care about the order of these predicates?": I've been working with 2–3 predicates, and I usually try to preserve the order. I would also suggest clearly documenting whether the order matters or not.
@andrei-ionescu could you share the plan of the query? With the current version of Hyperspace, you could create these indexes for your join query:
These 2 indexes will remove 2 shuffles for the join. You can use the filter index as you tested, but I guess the performance is similar because Iceberg also handles push-down conditions & partitioning in some way. Note that only equalTo queries are candidates for bucket pruning; the bucketed scan config won't be effective for your query.
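The bucket-pruning remark can be illustrated with a small sketch (plain Python; Python's `hash` stands in for the Murmur3 hashing Spark actually uses when bucketing). An equalTo key hashes to exactly one bucket, so only that bucket's files need scanning; hashing destroys ordering, so a range predicate may match rows in any bucket:

```python
# Why bucket pruning only applies to equality predicates.
NUM_BUCKETS = 200  # matches the 200-file index discussed in this issue

def bucket_of(key) -> int:
    return hash(key) % NUM_BUCKETS

def buckets_for_equal_to(key) -> set:
    # key = constant lands in exactly one bucket: scan 1 of 200 files
    return {bucket_of(key)}

def buckets_for_range() -> set:
    # key > constant: no ordering survives hashing, so every bucket may
    # contain matching rows and all 200 files must be scanned
    return set(range(NUM_BUCKETS))
```

This is why a timestamp range filter on a bucketed index cannot skip files, whereas a partitioned (or min/max-annotated) layout can.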
@sezruby Thanks for confirming my current understanding of how Hyperspace works.
This is something that I want to avoid: duplicating the dataset once more (all columns means the whole dataset, just bucketed and sorted differently). But I guess the need to include all columns in the
Iceberg uses partition pruning, file skipping based on the metadata it has stored, and pushed-down predicates. This is the reason why it is faster than Hyperspace in some cases. For
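The min/max-based file skipping mentioned above can be sketched as follows. The `FileStats` record and the sample ranges are hypothetical, not Iceberg's actual metadata classes; the point is that per-file statistics let the reader drop files without opening them:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Per-file metadata: path plus min/max of a timestamp column."""
    path: str
    min_ts: int
    max_ts: int

def files_to_read(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] overlaps the query's [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]
```

With a timestamp where-clause like the one in this issue, such skipping prunes whole files, which the current hash-bucketed Hyperspace index cannot do.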
Feature requested
In the case of very large datasets (34 billion records), the generated index is formed of big files and suffers performance degradation. Given the following details...
Query
Executed with:
Dataset
- 34 billion rows
- the timestamp field is of timestamp type, with up-to-seconds precision
- 17 145 000 rows selected out of 34 155 510 037
Index
The index has:
- 434GB total index size
- 200 files
- 2.3GB average file size

Explained query
The cluster
I ran the experiment on a Databricks cluster with the following details:
- 64 cores, 432GB memory
- 6 workers: 32 cores, 256GB memory each
- Spark 2.4.5
Results
Time to get the 1000 rows: 17.24s and 16.86s.
Acceptance criteria
The time to get 1000 rows using Hyperspace should be at least twice as fast.

Additional context
For some more context, this has been started on #329 PR.