
[Gold Standard] Updated plans for all tpcds queries with spark-only setup #377

Open
wants to merge 28 commits into master

Conversation

apoorvedave1
Contributor

@apoorvedave1 apoorvedave1 commented Mar 10, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

In this PR, we have updated the plans for all TPC-DS queries (q2-q99). Please review the dependency PR #384 first, which contains the code for creating and validating the golden files (query plan files) for the gold standard.
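For context, the gold-standard check in the dependency PR boils down to comparing a freshly generated plan string against a checked-in golden file. Here is a minimal sketch of that idea, assuming a hypothetical directory layout, helper name, and normalization step (the actual PlanStabilitySuite may differ):

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Minimal sketch of a plan-stability check; the directory layout, helper name,
// and normalization rules are assumptions, not the suite's actual code.
def checkPlanStability(spark: SparkSession, queryName: String, queryText: String): Unit = {
  val actual = spark.sql(queryText).queryExecution.executedPlan.toString()
  // Strip run-specific details (expression IDs like "#21", warehouse paths)
  // so the comparison is stable across runs.
  val normalized = actual
    .replaceAll("#\\d+", "#x")
    .replaceAll("file:[^,\\]]*", "[not included in comparison]")
  val golden = new String(
    Files.readAllBytes(Paths.get("goldstandard", queryName, "explain.txt")), "UTF-8")
  assert(normalized.trim == golden.trim, s"Plan for $queryName drifted; regenerate golden files.")
}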

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@apoorvedave1 apoorvedave1 marked this pull request as draft March 10, 2021 18:44
@apoorvedave1 apoorvedave1 self-assigned this Mar 12, 2021
@apoorvedave1 apoorvedave1 changed the title [WIP] Gold Standard: Initial commit with spark only setup, tpcds 1.4 query set Gold Standard: Initial commit with spark only setup Mar 12, 2021
@apoorvedave1 apoorvedave1 marked this pull request as ready for review March 12, 2021 20:33
@apoorvedave1
Contributor Author

Note to reviewers: q49 currently doesn't work well with the build pipelines, so per offline suggestions I have removed it from this PR. I will add it back once the issue is resolved.

@imback82
Contributor

@apoorvedave1 Could you do this?

  1. Take the resource files from OSS Spark and create a PR, and we will merge it quickly.
  2. Rebase this PR.

The reason is that I want to make sure no changes other than the expression reorder went in.

@apoorvedave1
Contributor Author

@apoorvedave1 Could you do this?

  1. Take the resource files from OSS Spark and create a PR, and we will merge it quickly.
  2. Rebase this PR.

The reason is that I want to make sure no changes other than the expression reorder went in.

OK sure, let me get back to you.

@apoorvedave1 apoorvedave1 changed the title Gold Standard: Initial commit with spark only setup [Gold Standard] Initial commit with spark only setup Mar 12, 2021
@imback82
Contributor

This is an initial setup with the Spark-only (non-Hyperspace) version of the gold standard. Please refer to the parent proposal for more details.

Could you update this PR description as well? It's not easy to follow which portion of the parent proposal applies to this PR.

@apoorvedave1 apoorvedave1 changed the title [Gold Standard] Initial commit with spark only setup [Gold Standard] Updated plans for all tpcds queries with spark-only setup Mar 15, 2021
@apoorvedave1 apoorvedave1 mentioned this pull request Mar 15, 2021
…_initial

# Conflicts:
#	src/test/scala/com/microsoft/hyperspace/goldstandard/PlanStabilitySuite.scala
#	src/test/scala/com/microsoft/hyperspace/goldstandard/TPCDSBase.scala
@imback82
Contributor

Are the test failures related to the cross join?

@apoorvedave1
Contributor Author

Are the test failures related to the cross join?

@imback82 Yeah, outside of the test function it was not picking up the config for enabling cross joins. I made slight changes to the code and it works now. Please take a look.
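For reference, the session-level fix looks roughly like this; spark.sql.crossJoin.enabled is the standard Spark 2.x conf, while the builder settings here are illustrative only:

import org.apache.spark.sql.SparkSession

// Enable cross joins on the session itself (not only inside a withSQLConf
// block), so plans built outside the test function can use them too.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("PlanStabilitySuite")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()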

Comment on lines +91 to +92
Union
LocalTableScan [customer_id,year_total] [customer_id,year_total]
Contributor Author

Why is there a Union with a LocalTableScan?

: : : : +- *(2) Project [d_date_sk#21, d_year#13]
: : : : +- *(2) Filter ((isnotnull(d_year#13) && (d_year#13 = 2001)) && isnotnull(d_date_sk#21))
: : : : +- *(2) FileScan parquet default.date_dim[d_date_sk#21,d_year#13] Batched: true, Format: Parquet, Location [not included in comparison]/{warehouse_dir}/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_year), EqualTo(d_year,2001), IsNotNull(d_date_sk)], ReadSchema: struct<d_date_sk:int,d_year:int>
: : : +- LocalTableScan <empty>, [customer_id#24, year_total#25]
Contributor Author

import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

val df = Seq.empty[(String, String, String, String)].toDF("a", "b", "c", "d")
println(df.queryExecution.toString())

prints this physical plan:

== Physical Plan ==
LocalTableScan <empty>, [a#12, b#13, c#14, d#15]
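The same shape shows up whenever the optimizer proves a branch empty, which would explain the Union over LocalTableScan in the plan above. An illustrative example, assuming a SparkSession named spark (Catalyst's PruneFilters replaces an always-false filter with an empty relation):

import org.apache.spark.sql.functions.lit

// An always-false filter is optimized into an empty LocalRelation, printed
// in the physical plan as "LocalTableScan <empty>, [...]".
val pruned = spark.range(5).where(lit(false))
println(pruned.queryExecution.executedPlan)
// LocalTableScan <empty>, [id#0L]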

@@ -1,279 +1,49 @@
== Physical Plan ==
Contributor Author

ok

TakeOrderedAndProject [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
WholeStageCodegen (10)
HashAggregate [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,count] [count(1),cnt1,cnt2,cnt3,cnt4,cnt5,cnt6,count]
TakeOrderedAndProject [cd_credit_rating,cd_dep_college_count,cd_dep_count,cd_dep_employed_count,cd_education_status,cd_gender,cd_marital_status,cd_purchase_estimate,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
Contributor Author

ok
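The two TakeOrderedAndProject lines above carry the same attribute set in different orders; the simplified golden plans appear to list fields alphabetically. A sketch of an order-insensitive comparison helper, an assumption rather than the suite's actual normalization:

// Sort the comma-separated fields inside each [...] so plan lines that differ
// only in attribute order compare equal. Hypothetical helper.
def sortBracketedFields(planLine: String): String =
  """\[([^\]]*)\]""".r.replaceAllIn(
    planLine,
    m => "[" + m.group(1).split(",").map(_.trim).sorted.mkString(",") + "]")

// sortBracketedFields("Project [b,a]") == "Project [a,b]"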

TakeOrderedAndProject [i_category,i_class,i_item_id,i_item_desc,revenueratio,i_current_price,itemrevenue]
WholeStageCodegen (6)
Project [i_item_desc,i_category,i_class,i_current_price,itemrevenue,_w0,_we0,i_item_id]
TakeOrderedAndProject [i_category,i_class,i_current_price,i_item_desc,i_item_id,itemrevenue,revenueratio]
Contributor Author

ok

@@ -1,137 +1,24 @@
== Physical Plan ==
Contributor Author

ok

Project [ss_ext_sales_price,ss_ext_wholesale_cost,ss_quantity]
BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
Project [cd_education_status,cd_marital_status,ss_ext_sales_price,ss_ext_wholesale_cost,ss_hdemo_sk,ss_quantity,ss_sales_price]
BroadcastHashJoin [cd_demo_sk,ss_cdemo_sk]
Contributor Author

Could you check this part? Are more columns/checks pushed down in Spark 3.x?

Contributor Author

Maybe it's because the remaining columns are used in the higher-level broadcast join two lines above:

BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
