This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Fix rank algorithm for Hybrid Scan #164

Merged
merged 30 commits into from
Jan 6, 2021

Conversation

sezruby
Collaborator

@sezruby sezruby commented Sep 15, 2020

What changes were proposed in this pull request?

This PR introduces a new rank algorithm for both FilterIndex and JoinIndex to support Hybrid Scan properly.
You can find more details about Hybrid scan in #150.

To support the rank algorithm for Hybrid Scan efficiently, this PR introduces a new IndexLogEntryTag using #223:

  // COMMON_BYTES stores overlapping bytes of index source files and given relation.
  // This is set in getCandidateIndexes and utilized in rank functions.
  val COMMON_BYTES: IndexLogEntryTag[Long] = IndexLogEntryTag[Long]("commonBytes")
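
As a rough illustration of how such a typed tag could be stored and read back, here is a minimal sketch. It is not the actual Hyperspace implementation: the simplified `IndexLogEntryTag` and `IndexLogEntry` classes and the `setTagValue` / `getTagValue` method names below are assumptions made for this example.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for the tag type named in this PR.
// The type parameter records what kind of value the tag carries.
final case class IndexLogEntryTag[T](name: String)

object IndexLogEntryTags {
  // Overlapping bytes between an index's source files and the given relation.
  val COMMON_BYTES: IndexLogEntryTag[Long] = IndexLogEntryTag[Long]("commonBytes")
}

// Hypothetical, simplified index log entry that only carries tags.
final class IndexLogEntry(val name: String) {
  // Auxiliary information attached while rules are applied (not persisted).
  private val tags: mutable.Map[IndexLogEntryTag[_], Any] = mutable.Map.empty

  // Assumed helper name: store a value under a typed tag.
  def setTagValue[T](tag: IndexLogEntryTag[T], value: T): Unit =
    tags.update(tag, value)

  // Assumed helper name: read the value back with the tag's type, if present.
  def getTagValue[T](tag: IndexLogEntryTag[T]): Option[T] =
    tags.get(tag).map(_.asInstanceOf[T])
}
```

In this sketch, candidate selection would call `setTagValue(IndexLogEntryTags.COMMON_BYTES, bytes)` once per index, and a rank function would later call `getTagValue(IndexLogEntryTags.COMMON_BYTES)` instead of recomputing the overlap.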

If Hybrid Scan is enabled, an outdated index might be chosen even if fresher ones exist.
Since Hybrid Scan can incur additional overhead, such as on-the-fly shuffle for appended files or filtering for deleted files, it is better to pick the indexes with less "diff" data.

For FilterIndexRule, the index with the largest common bytes will be chosen if Hybrid Scan is enabled.
For JoinIndexRule, the number of buckets takes priority over common bytes, since removing shuffles outweighs the overhead from Hybrid Scan.
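
The two ordering rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the `Index` case class, the `FilterRankSketch` / `JoinRankSketch` objects, the "first candidate" fallback when Hybrid Scan is off, and the final tie-breaker on the left index's bucket count are all hypothetical.

```scala
// Hypothetical, simplified model of a candidate index.
final case class Index(name: String, numBuckets: Int, commonBytes: Long)

object FilterRankSketch {
  // With Hybrid Scan enabled, prefer the index sharing the most bytes with the
  // relation. Otherwise this sketch simply takes the first candidate (assumption).
  def rank(candidates: Seq[Index], hybridScanEnabled: Boolean): Option[Index] =
    if (candidates.isEmpty) None
    else if (hybridScanEnabled) Some(candidates.maxBy(_.commonBytes))
    else candidates.headOption
}

object JoinRankSketch {
  // Sort index pairs by, in order:
  //   1. equal bucket counts first (no shuffle needed for the join),
  //   2. more common bytes first, only when Hybrid Scan is enabled,
  //   3. more buckets on the left index first (assumed tie-breaker).
  def rank(pairs: Seq[(Index, Index)], hybridScanEnabled: Boolean): Seq[(Index, Index)] =
    pairs.sortBy { case (l, r) =>
      val shuffleNeeded = if (l.numBuckets == r.numBuckets) 0 else 1
      val common = if (hybridScanEnabled) l.commonBytes + r.commonBytes else 0L
      (shuffleNeeded, -common, -l.numBuckets)
    }
}
```

Under this ordering, a pair with matching bucket counts but little common data still ranks ahead of a pair with a lot of common data and mismatched bucket counts, which reflects the priority described above.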

Does this PR introduce any user-facing change?

Yes.
This PR changes the rank order of candidate indexes when hybrid scan is enabled.

How was this patch tested?

Unit test

assert(FilterIndexRanker.rank(indexes, false).get.equals(ind1))
}

test("testRank: return the index with the largest source file list if HybridScan is enabled") {
Contributor

nit: Why testRank instead of FilterIndexRankerTest? Sorry for the naive question - I'm trying to understand how we should name these tests.

Collaborator Author

I just used the previous one in JoinRankerTest. Yes we need some guideline for the naming 🤔

Contributor

@imback82 Any recommendations?

Contributor

Yea, testRank seems weird. I would do something like test("rank() should return the index with the largest number of source files if HybridScan is enabled"). (Note that FilterIndexRanker is omitted since it's already under FilterIndexRankerTest.)

Contributor

Oh... good you wrote a great name 🙂

I was also hoping: can we agree on a naming convention for tests now that the code base complexity is increasing? For instance:

  1. Tests do not need to contain the test method name
  2. The name of the test should clearly state the hypothesis and call out what is being tested
  3. ...

@rapoth rapoth added this to the 0.4.0 milestone Sep 17, 2020
@rapoth rapoth added the enhancement and intermediate issue labels Sep 18, 2020
@sezruby
Collaborator Author

sezruby commented Oct 6, 2020

Changed to draft since further investigation is required for this issue; we cannot guarantee that a full index (without Hybrid Scan) is always better than an index applied with Hybrid Scan, especially for join indexes.

For example, it might be more efficient to avoid full shuffling of index data. So an index pair with bucketing and Hybrid Scan might be more efficient than a full index pair without bucketing (no Hybrid Scan).

@rapoth rapoth modified the milestones: 0.4.0, 0.5.0 Oct 8, 2020
@imback82 imback82 modified the milestones: October 2020, November 2020 Oct 13, 2020
@sezruby sezruby marked this pull request as ready for review October 20, 2020 10:31
@sezruby sezruby force-pushed the hybridscan_rank branch 3 times, most recently from 6c81e48 to 36ccfda Compare October 20, 2020 10:46
* A mutable map for holding auxiliary information of this index log entry while applying rules.
*/
@JsonIgnore
private val tags: mutable.Map[IndexLogEntryTag[_], Any] = mutable.Map.empty
Collaborator Author

Comment on lines +288 to +290
// If the index contains the source update info, it means the index was validated
// with the latest signature including appended files and deleted files, but
// index data is not updated with those files. Therefore, we need to handle
Collaborator Author

Revised comment for quick refresh PR

@sezruby
Collaborator Author

sezruby commented Dec 10, 2020

@imback82 Could you review this change when you have the time? Thanks!

@imback82
Contributor

Sorry for the delay. I will try to get to this this week. Btw, does this change any of the index selections for our TPCH/TPCDS benchmarks?

@sezruby
Collaborator Author

sezruby commented Dec 11, 2020

> Sorry for the delay. I will try to get to this this week. Btw, does this change any of the index selections for our TPCH/TPCDS benchmarks?

This change only affects the Hybrid Scan case, and currently all of the common bytes will be the same since we don't refresh any of the indexes after modifying the dataset. I didn't check it, but I guess none of the index selections will change.

For explain time, with 100k source files and 100 deleted files (TPC-H queries):

Oct binary

  • hybrid scan - sum of max explain time (of 3 runs): 310625
  • full index - sum of max explain time (of 3): 235606

Dec binary (current master + rank pr)

  • hybrid scan - sum of max explain time (of 3): 233333
  • full index - sum of max explain time (of 3): 333228 (=> I think it's because of signature calculation fix (sorting))

For the hybrid scan result:

  • other optimizations (checking the column schema first, etc.) also contribute to the difference
  • it shows this PR doesn't add too much overhead
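
For readers unfamiliar with the "sum of max explain time (of 3 runs)" metric used above, here is a minimal sketch of how such a number could be aggregated and compared. The `ExplainTimeSketch` object, its method names, and the per-query timings in the usage note are made up for illustration; the thread does not state the time unit, so none is assumed here.

```scala
object ExplainTimeSketch {
  // timings: query name -> explain times from repeated runs of that query.
  // Takes the slowest run per query, then sums across queries.
  // Assumes every query has at least one recorded run.
  def sumOfMax(timings: Map[String, Seq[Long]]): Long =
    timings.values.map(_.max).sum

  // Relative change from `before` to `after` as a fraction
  // (negative means the total went down).
  def relativeChange(before: Long, after: Long): Double =
    (after - before).toDouble / before
}
```

With hypothetical timings `Map("q1" -> Seq(10L, 12L, 11L), "q2" -> Seq(20L, 19L, 25L))`, `sumOfMax` adds 12 and 25. Applied to the totals reported above, `relativeChange(310625L, 233333L)` is roughly -0.25 for hybrid scan, and `relativeChange(235606L, 333228L)` is roughly +0.41 for the full index.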

val actualOrder = JoinIndexRanker.rank(spark, dummy, dummy, indexPairs)
assert(actualOrder.equals(expectedOrder))
}
}
Contributor

Are we testing all the if conditions in the ranker? How about number of default partitions?

Collaborator Author

The condition is removed.

Contributor

@imback82 imback82 left a comment


LGTM (few minor comments), thanks @sezruby!

@imback82 imback82 merged commit 3472bd4 into microsoft:master Jan 6, 2021
@sezruby sezruby deleted the hybridscan_rank branch January 6, 2021 07:41
@imback82 imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
Labels
enhancement New feature or request intermediate issue This is the tag for intermediate issues which involve discussion