Change FilterIndex rule to cover select all columns case #73
Conversation
  }
}

private def verifyTransformedPlanWithIndex2(logicalPlan: LogicalPlan): Unit = {
Can we remove some duplication in verify..Index1 and verify..Index2? I was thinking the following:
def verifyIndex(logicalPlan, indexName): Unit = {
  val logicalRelation = logicalPlan.collect {
    case l: LogicalRelation => l
  }.head
  logicalRelation match {
    case l @ LogicalRelation(HadoopRelation(newLocation..........ParquetFileFormat.....)) =>
      assert(condition1)
      assert(condition2)
      assert(index name related checks)
      ...
    case _ => fail("Unexpected plan")
  }
}
Basically, parameterize indexName in the function.
The only issue is we don't explicitly check if the plan is a Project or a Filter. I guess it's ok.
Let me know if this makes sense or if there's any suggestion to deduplicate. If we want to keep Project and Filter check, it's ok too.
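A possible shape for the deduplicated helper, along the lines suggested above. This is a sketch only: the exact `LogicalRelation`/`HadoopFsRelation` pattern arities follow the snippets elsewhere in this thread, and `getIndexDataFilesPath` is the test helper referenced later; none of this is verified against the actual PR code.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

// Hypothetical deduplicated verifier: the index-specific values are
// parameterized, so verify..Index1 / verify..Index2 collapse into
// one-line calls like verifyIndex(plan, "index1").
private def verifyIndex(logicalPlan: LogicalPlan, indexName: String): Unit = {
  val logicalRelation = logicalPlan.collect { case l: LogicalRelation => l }.head
  logicalRelation match {
    case LogicalRelation(
        HadoopFsRelation(newLocation, _, _, bucketSpec, _, _), _, _, _) =>
      // Index-name-related checks, parameterized instead of duplicated.
      assert(newLocation.rootPaths.head.equals(getIndexDataFilesPath(indexName)))
      assert(bucketSpec.isDefined)
    case _ => fail("Unexpected plan")
  }
}
```
As noted, this variant deliberately does not check whether the matched node sits under a Project or a Filter; that check could be re-added with an extra parameter if desired.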
Done.
_,
_,
_)) =>
try {
Did you try refactoring this as discussed offline?
Done, using the extractor pattern.
// Pattern-2 covers the case where project node is eliminated or not present. An example is
// when all columns are selected.
// Currently, this rule replaces a relation with an index when:
// 1. The index covers all columns from the filter predicate and output columns list, and
// 2. Filter predicate's columns include the first 'indexed' column of the index.
plan transform {
let's be explicit and use transformDown
Done.
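For reference, the Pattern-2 branch described above can be sketched as a match on a Filter sitting directly on the relation, with no intervening Project. The pattern arities follow the diff snippets in this thread; `replaceWithIndexIfApplicable` is a hypothetical placeholder for the rule's existing replacement logic, not an actual function from the PR.

```scala
// Sketch of Pattern-2: Filter directly over the relation (e.g. SELECT *),
// so the Project node is absent and the "output columns" are simply the
// relation's full output.
plan transformDown {
  case filter @ Filter(
      condition: Expression,
      logicalRelation @ LogicalRelation(
        fsRelation @ HadoopFsRelation(_, _, _, _, _, _), _, _, _)) =>
    val outputColumns = logicalRelation.output.map(_.name)
    // Reuse the same index-substitution logic as the Project-over-Filter case.
    replaceWithIndexIfApplicable(filter, outputColumns, condition)
}
```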
A few minor comments, but generally looking good.
_))) =>
plan transformDown {
  case FilterRuleExtractor(
    planHandle,
originalPlan?
changed it.
@@ -85,8 +84,8 @@ object FilterIndexRule extends Rule[LogicalPlan] with Logging {
 * For a given relation, check its available indexes and replace it with the top-ranked index
 * (according to cost model).
 *
 * @param project top-most node in the logical plan that is being optimized.
 * @param projectColumns List of project columns.
 * @param filter Filter node in the subplan that is being optimized.
nit: one space after filter?
Fixed.
object FilterRuleExtractor extends Logging {
  type returnType = (
      LogicalPlan,
nit: LogicalPlan, // original plan
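Putting the nit and the later rename suggestion together, the extractor could look roughly like the following. This is a sketch: the tuple components and the two match arms are inferred from the snippets in this thread, and the exact fields extracted in the real PR may differ.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Sketch of the extractor (renamed ExtractFilterNode later in the thread),
// with each tuple component annotated as suggested in the nit above.
object ExtractFilterNode {
  type ReturnType = (
      LogicalPlan, // original plan (Project over Filter, or a bare Filter)
      Filter,      // the filter node itself
      Seq[String], // output (project) columns
      Seq[String]) // columns referenced by the filter condition

  def unapply(plan: LogicalPlan): Option[ReturnType] = plan match {
    case project @ Project(_, filter @ Filter(condition, _: LogicalRelation)) =>
      Some((project, filter, project.output.map(_.name),
        condition.references.map(_.name).toSeq))
    case filter @ Filter(condition, relation: LogicalRelation) =>
      // Pattern-2: no Project node, so all relation columns are the output.
      Some((filter, filter, relation.output.map(_.name),
        condition.references.map(_.name).toSeq))
    case _ => None
  }
}
```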
@@ -227,3 +221,62 @@ object FilterIndexRule extends Rule[LogicalPlan] with Logging {
    }
  }
}

object FilterRuleExtractor extends Logging {
Logging not used.
added.
_,
_))) =>
plan transformDown {
  case FilterRuleExtractor(
How about ExtractFilterNode?
Renamed.
case filter @ Filter(
    condition: Expression,
    logicalRelation @ LogicalRelation(
      fsRelation @ HadoopFsRelation(location, _, _, _, _, _),
Looks like location: FileIndex is not being used in this file. Can we remove it?
Thanks for catching it. Removed it.
hyperspace.createIndex(df, indexConfig)

def query(): DataFrame = df.filter("c4 == 1").select("c3", "c1", "c2", "c5", "c4")
should we verify if no project node is present in this query?
Switched query to use "SELECT *" and added assert.
src/test/scala/com/microsoft/hyperspace/index/rules/FilterIndexRuleTest.scala
    case _ => fail("Unexpected plan.")
  }
}

private def verifyIndexProperties(
    indexName: String,
    newLocation: InMemoryFileIndex,
Let's use actual (vs. expected) instead of new. Or even removing new altogether is fine.
Done.
    bucketSpec: Option[BucketSpec]): Unit = {
  val allIndexes = IndexCollectionManager(spark).getIndexes(Seq(Constants.States.ACTIVE))
  val expectedLocation = getIndexDataFilesPath(indexName)
  assert(newLocation.rootPaths.head.equals(expectedLocation), "Invalid location.")
This is an existing code, but can we remove the assert message "Invalid blah" in this block? I don't think it adds any value and it is just more burdensome for the developer.
Done.
LGTM except for one nit comment. Thanks @pirz!
def query(): DataFrame = spark.sql("SELECT * from t where c4 = 1")

// Verify no Project node is present in the query plan, as a result of using SELECT *
assert(query().queryExecution.optimizedPlan.collect {
nit:
assert(query().queryExecution.optimizedPlan.collect { case p: Project => p }.isEmpty)
thnx, fixed this.
def query(): DataFrame = spark.sql("SELECT * from t where c4 = 1")

// Verify no Project node is present in the query plan, as a result of using SELECT *
assert(query().queryExecution.optimizedPlan.collect { case p: Project => p }.isEmpty, true)
No need for true.
Oops, I fixed this too. Thnx!
@apoorvedave1 did you need more time to review this?
What changes were proposed in this pull request?
This PR extends the FilterIndex rule to cover the case where all columns are selected from a relation.
Such a case happens when the logical plan is "Scan -> Filter" and the Project node is not present or has been eliminated by other optimizations (for example, in a select * scenario).
Why are the changes needed?
Extend the FilterIndex rule to cover more potential optimization cases.
Fixes issue #16.
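The distinction can be illustrated with two queries over the same table. The table name `t` follows the test snippets in this thread; the plan shapes in the comments describe the expected Catalyst optimizer behavior, not verified output from this PR.

```scala
// Sketch: in the first query the optimizer keeps a Project node above the
// Filter; in the second, selecting all columns lets the optimizer drop the
// Project, leaving the bare Filter-over-Scan shape this PR adds coverage for.
val withProject = spark.sql("SELECT c1, c2 FROM t WHERE c4 = 1") // Project(Filter(Scan))
val allColumns  = spark.sql("SELECT * FROM t WHERE c4 = 1")      // Filter(Scan)
```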
Does this PR introduce any user-facing change?
No
How was this patch tested?
Test cases added under FilterIndexRule tests and E2E tests.