Add config to use bucketed scan for filter indexes #329
Conversation
I will check back when the test is added.
Resolved comment thread on src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala (outdated)
What's the plan for this PR vs. #332? This PR enables bucketing so that bucket pruning can kick in, but it may cause a regression if bucket pruning does not kick in. Is it possible to check whether bucket pruning would kick in or not? If that is not feasible, maybe introduce a strategy that looks at the physical plan?
The config is disabled by default. Two main points:
For #332, I'd like to exclude "pruned" index data file paths only, using the bucket pruning logic. After that, setting the bucket spec could be a separate option depending on the workload. We might be able to detect such an SMJ and decide whether to set the bucket spec, since filter indexes are applied in the last rule.
Seems like a byproduct of introducing the new config so that we can always disable bucket union for the filter rule (I see).
Hmm, if this is the case (it benefits SMJ), shouldn't the join rule have already kicked in?
Yeah, excluding index data files makes sense. But are we going to introduce an "additional" option for setting the bucket spec? It's usually better to let Spark decide how to distribute the tasks (instead of limiting it to the number of buckets). @apoorvedave1 has done some benchmarks on this.
Btw, I am +1 for having this config; I just have a few questions.
Oh I forgot to disable bucket union for Filter 🤔..
If we don't set the bucket spec, Spark will create splits from the input files by bin packing. One benefit of using bucketSpec is that we can enforce that each file is handled separately by an executor. We could use this config internally, and then decide whether to expose it later. WDYT?
Yes, but in my experience, utilizing the full cluster works out better than limiting the number of tasks to the number of buckets.
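The bin-packing-versus-bucket-count tradeoff can be sketched in plain Scala. This is an illustrative model only, not Hyperspace or Spark internals; the file sizes, the 200-bucket count, and the 128 MB split size are all hypothetical, and the bin-packing model is simplified (it ignores `spark.sql.files.openCostInBytes`, for example):

```scala
// Illustrative sketch (not Hyperspace code): compare how many read tasks
// a bucketed scan vs. Spark's default file bin packing would produce.

// With a bucket spec, the scan produces one task per bucket.
def bucketedTaskCount(numBuckets: Int): Int = numBuckets

// Simplified model of bin packing: greedily pack files into splits of at
// most `maxSplitBytes` each.
def binPackedTaskCount(fileSizes: Seq[Long], maxSplitBytes: Long): Int = {
  var tasks = 0
  var current = 0L
  for (size <- fileSizes.sorted(Ordering[Long].reverse)) {
    if (current + size > maxSplitBytes && current > 0) { tasks += 1; current = 0L }
    current += size
  }
  if (current > 0) tasks += 1
  tasks
}

val mb = 1024L * 1024L
val files = Seq.fill(200)(16 * mb)               // 200 index files, 16 MB each

val bucketed = bucketedTaskCount(200)            // one task per bucket: 200
val packed = binPackedTaskCount(files, 128 * mb) // packs 8 files per task: 25

println(s"bucketed: $bucketed tasks, bin-packed: $packed tasks")
```

Under this toy model the bucketed scan runs 200 small tasks (each file handled separately), while bin packing collapses the same files into 25 larger tasks, which is the "utilize the full cluster vs. limit to the bucket count" tension discussed above.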
Yes, I think we can have this config and document the behavior clearly.
I do like this new feature even if it's added as an undocumented flag. I did some tests over big datasets (billions of records), and having the index on a high-cardinality field without bucketing and bucket pruning does not add any performance improvement. Bucketing (and maybe partitioning by the indexed column) will add the benefit of pruning. I'm looking forward to such a feature.
@andrei-ionescu This is very interesting. Do you have any more information (e.g., workload, numbers) you could share? Also, was it a single query or were you running a concurrent workload? I do agree with supporting partitioning (I'm assuming you are referring to low-cardinality columns). We haven't been able to get to working on partitioning yet, but it's definitely something I've heard from multiple customers. It also has several other benefits, like supporting advanced scenarios such as index retention (e.g., if partitioned by timestamp, we could do clean-up very easily; in the current implementation, we'd have to rewrite the index to a large degree), and I guess we can also leverage dynamic partition pruning. :)
@rapoth Here are some details: Query
Executed with:
Dataset
Index
The index has:
Explained query
The cluster
I did run the experiment on a Databricks cluster with the following details:
Results
Time to get the
Hope all this helps.
Thanks @andrei-ionescu for the info. I guess one solution to this is to allow transformation of the indexed keys (bucketing to minutes/hours instead of seconds, for example). |
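A hypothetical sketch of such a key transformation (not an existing Hyperspace feature): truncate epoch-second timestamps to a coarser granularity before using them as the bucketing key, so nearby values land in the same bucket:

```scala
// Hypothetical sketch: derive a coarser bucketing key from an epoch-second
// timestamp by truncating to minute or hour granularity.
// Not part of the Hyperspace API.

def truncateTo(epochSeconds: Long, granularitySeconds: Long): Long =
  epochSeconds - (epochSeconds % granularitySeconds)

val ts = 1609459261L                 // 2021-01-01 00:01:01 UTC
val byMinute = truncateTo(ts, 60L)   // 1609459260 (00:01:00)
val byHour   = truncateTo(ts, 3600L) // 1609459200 (00:00:00)

println(s"minute: $byMinute, hour: $byHour")
```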
Even if we don't modify (transform) the timestamp to minutes/hours, it is better than the current form. I have, on average, about 2K duplicates for each timestamp value in the index. Bucketing or partitioning just by the values will reduce the query time tremendously. For example, instead of having the index dataset laid out like this:
I would suggest the following:
@andrei-ionescu Would you mind opening a feature request with your original benchmark? You are bringing up a very interesting topic of indexing for timeseries data (I did not know you were looking at timestamp predicates :)). Alternately, I can copy some of this discussion and create one. Please let me know which one you prefer. I have some follow-up questions but I will ask them in the new issue. We are also heavily interested in building specialized indexes for timeseries data so it is definitely awesome you have the same scenarios! |
Hive-partitioning was explored before but abandoned because we needed to create bucket files for each partition, which wasn't scalable in our scenario. But now that we have a specific use case, we can explore this again (probably in the form of a specialized index).
@imback82 I'm proposing to add
somewhere around this place: CreateActionBase.scala#L129-L139. This can be just a flag, or even better, an index config property, since in high-cardinality cases it may produce a lot of folders/partitions. We could go a step further and detect this to choose the best approach.
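The cardinality concern can be sketched in plain Scala (an illustrative model with hypothetical numbers, not the actual index write path): Hive-style partitioning creates one directory per distinct key value, while bucketing caps the layout at a fixed bucket count.

```scala
// Illustrative sketch of the layout tradeoff (not Hyperspace code).
// Hive-style partitionBy creates one directory per distinct key value;
// bucketBy hashes keys into a fixed number of buckets.

def partitionDirCount(keys: Seq[Long]): Int = keys.distinct.size

def bucketFileCount(keys: Seq[Long], numBuckets: Int): Int =
  keys.map(k => math.floorMod(k.hashCode, numBuckets)).distinct.size

// High-cardinality key, e.g. per-second timestamps over one day.
val secondKeys = (0L until 86400L).toSeq

val dirs = partitionDirCount(secondKeys)       // 86400 directories
val buckets = bucketFileCount(secondKeys, 200) // at most 200 buckets

println(s"partition dirs: $dirs, buckets: $buckets")
```

This is why an index config property (rather than always partitioning) seems attractive: a per-second timestamp key explodes into tens of thousands of directories, while bucketing stays bounded.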
@andrei-ionescu I apologize. I only meant to open up a new thread so we can continue the conversation there (I did not mean to say we'd work on it immediately). |
@rapoth One more thing: I don't think it really matters whether it is a time series dataset. For any dataset with lots of data and a correspondingly massive index, adding this option will bring a lot of benefits.
@andrei-ionescu Fair point. I recently talked with a lot of customers who had timeseries data specifically so seeing your comment on timestamps made me forget every other use case 🦖 |
Yes, that's the hive partitioning I was referring to. @apoorvedave1 has done some prototyping on this, and the write wasn't really scaling. So you may want to try it first on your dataset as well.
LGTM (a few nits/questions), thanks @sezruby!
Resolved comment thread on src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala (outdated)
@@ -62,7 +62,8 @@ object FilterIndexRule
        spark,
        index,
        originalPlan,
-       useBucketSpec = false)
+       useBucketSpec = HyperspaceConf.useBucketSpecForFilterRule(spark),
+       useBucketUnionForAppended = false)
Could you add a comment on why this will always be false for the filter index rule? We will never take advantage of bucketing from the union, right?
It might be beneficial for later operations which require bucketing, but just for the filter index, we don't need it.
I think it's better to write a new rule for those cases if needed.
Shall we add the comment to the code (since it's not straightforward to understand)?
test(
  "Append-only: filter rule and non-parquet format, " +
    "appended data should be shuffled and merged by Union even with bucketSpec.") {
  // Note: for delta lake, this test is also eligible as the dataset is partitioned.
How should we handle this "Note"? This sounds like a TODO?
It's a note because this test is already included in the Delta Lake Hybrid Scan tests.
Shall we just remove the comment then? (not sure about the importance unless I am missing something)
…cala Co-authored-by: Terry Kim <yuminkim@gmail.com>
What is the context for this pull request?
What changes were proposed in this pull request?
Add a config for filter index rule to apply bucketing information when reading the index data.
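A sketch of how the new config might be enabled (a configuration fragment; the exact key name is an assumption inferred from `HyperspaceConf.useBucketSpecForFilterRule` in this PR, so check `IndexConstants.scala` for the authoritative constant):

```scala
// Assumed config key; disabled by default per this PR.
spark.conf.set("spark.hyperspace.index.filterRule.useBucketSpec", "true")

// With the flag on, FilterIndexRule passes
// useBucketSpec = HyperspaceConf.useBucketSpecForFilterRule(spark),
// so the index is read as a bucketed scan and bucket pruning can apply
// to equality predicates on the bucketed (indexed) columns.
```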
Does this PR introduce any user-facing change?
Yes, the query plan changes slightly for filter indexes when the config is enabled.
How was this patch tested?
Unit test