
[Gold Standard] Updated plans for all tpcds queries with spark-only setup #377

Open
wants to merge 28 commits into base: master

Conversation

apoorvedave1
Contributor

@apoorvedave1 apoorvedave1 commented Mar 10, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

In this PR we have updated the plans for all TPC-DS queries, q2 through q99. Please review the dependency PR #384 first; it contains the code for creating and validating the golden files (query plan files) for the gold standard.
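For readers new to the setup, here is a minimal sketch of what a plan-stability check typically looks like. The actual implementation lives in PR #384; the helper name, file layout, and the elided normalization step below are assumptions, not this repo's code.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: compare a query's physical plan against a golden file.
// Real suites also normalize volatile details (expression IDs, warehouse
// paths) before comparing; that normalization is elided here.
def checkPlanStability(spark: SparkSession, name: String, query: String): Unit = {
  val actual = spark.sql(query).queryExecution.executedPlan.toString()
  val goldenPath = Paths.get(s"goldstandard/$name/explain.txt") // assumed layout
  val golden = new String(Files.readAllBytes(goldenPath), StandardCharsets.UTF_8)
  assert(actual.trim == golden.trim, s"Plan for $name diverged from golden file")
}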

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

@apoorvedave1 apoorvedave1 marked this pull request as draft March 10, 2021 18:44
@apoorvedave1 apoorvedave1 self-assigned this Mar 12, 2021
@apoorvedave1 apoorvedave1 changed the title from "[WIP] Gold Standard: Initial commit with spark only setup, tpcds 1.4 query set" to "Gold Standard: Initial commit with spark only setup" Mar 12, 2021
@apoorvedave1 apoorvedave1 marked this pull request as ready for review March 12, 2021 20:33
@apoorvedave1
Contributor Author

Note to reviewers: q49 currently doesn't work well with the build pipelines, so per offline suggestions I have removed it from this PR. I will add it back once the issue is resolved.

@imback82
Contributor

@apoorvedave1 Could you do this?

  1. Take resource files from OSS Spark and create a PR, and we will just merge quickly.
  2. Rebase this PR.

The reason is that I want to make sure no changes other than the expression reordering went in.

@apoorvedave1
Contributor Author

@apoorvedave1 Could you do this?

  1. Take resource files from OSS Spark and create a PR, and we will just merge quickly.
  2. Rebase this PR.

The reason is that I want to make sure no changes other than the expression reordering went in.

OK sure, let me get back to you.

@apoorvedave1 apoorvedave1 changed the title from "Gold Standard: Initial commit with spark only setup" to "[Gold Standard] Initial commit with spark only setup" Mar 12, 2021
@imback82
Contributor

This is an initial setup with a Spark-only (non-Hyperspace) version of the gold standard. Please refer to the parent proposal for more details.

Could you update this PR description as well? It's not easy to follow which portion of the parent proposal applies to this PR.

@apoorvedave1 apoorvedave1 changed the title from "[Gold Standard] Initial commit with spark only setup" to "[Gold Standard] Updated plans for all tpcds queries with spark-only setup" Mar 15, 2021
@apoorvedave1 apoorvedave1 mentioned this pull request Mar 15, 2021
…_initial

# Conflicts:
#	src/test/scala/com/microsoft/hyperspace/goldstandard/PlanStabilitySuite.scala
#	src/test/scala/com/microsoft/hyperspace/goldstandard/TPCDSBase.scala
@imback82
Contributor

Are the test failures related to the cross join?

@apoorvedave1
Contributor Author

Are the test failures related to the cross join?

@imback82 Yeah, outside of the test function the config for enabling cross joins was not being picked up. I made slight changes to the code and it works now. Please take a look.
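For context, a minimal sketch of applying the cross-join flag at session scope, so that plan generation running outside a test function also picks it up (this relies on Spark's spark.sql.crossJoin.enabled flag; the builder settings are illustrative, not the PR's actual code):

import org.apache.spark.sql.SparkSession

// Illustrative only: a config set on the session builder is inherited by
// every query planned in this session, including plan generation that runs
// outside a test function's per-test config scope.
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("PlanStabilitySketch") // hypothetical app name
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()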

Comment on lines +91 to +92
Union
LocalTableScan [customer_id,year_total] [customer_id,year_total]
Contributor Author


Why a union with LocalTableScan?

: : : : +- *(2) Project [d_date_sk#21, d_year#13]
: : : : +- *(2) Filter ((isnotnull(d_year#13) && (d_year#13 = 2001)) && isnotnull(d_date_sk#21))
: : : : +- *(2) FileScan parquet default.date_dim[d_date_sk#21,d_year#13] Batched: true, Format: Parquet, Location [not included in comparison]/{warehouse_dir}/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_year), EqualTo(d_year,2001), IsNotNull(d_date_sk)], ReadSchema: struct<d_date_sk:int,d_year:int>
: : : +- LocalTableScan <empty>, [customer_id#24, year_total#25]
Contributor Author


// Needs the session's implicits in scope for .toDF on a local Seq.
import spark.implicits._

val df = Seq.empty[(String, String, String, String)].toDF("a", "b", "c", "d")
println(df.queryExecution.toString())

which produces:

== Physical Plan ==
LocalTableScan <empty>, [a#12, b#13, c#14, d#15]

@@ -1,279 +1,49 @@
== Physical Plan ==
Contributor Author


ok

TakeOrderedAndProject [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
WholeStageCodegen (10)
HashAggregate [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,count] [count(1),cnt1,cnt2,cnt3,cnt4,cnt5,cnt6,count]
TakeOrderedAndProject [cd_credit_rating,cd_dep_college_count,cd_dep_count,cd_dep_employed_count,cd_education_status,cd_gender,cd_marital_status,cd_purchase_estimate,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
Contributor Author


ok

TakeOrderedAndProject [i_category,i_class,i_item_id,i_item_desc,revenueratio,i_current_price,itemrevenue]
WholeStageCodegen (6)
Project [i_item_desc,i_category,i_class,i_current_price,itemrevenue,_w0,_we0,i_item_id]
TakeOrderedAndProject [i_category,i_class,i_current_price,i_item_desc,i_item_id,itemrevenue,revenueratio]
Contributor Author


ok

@@ -1,137 +1,24 @@
== Physical Plan ==
Contributor Author


ok

Project [ss_ext_sales_price,ss_ext_wholesale_cost,ss_quantity]
BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
Project [cd_education_status,cd_marital_status,ss_ext_sales_price,ss_ext_wholesale_cost,ss_hdemo_sk,ss_quantity,ss_sales_price]
BroadcastHashJoin [cd_demo_sk,ss_cdemo_sk]
Contributor Author

@apoorvedave1 apoorvedave1 Apr 15, 2021


Check this part? Are more columns/filters pushed down in Spark 3.x?

Contributor Author


Maybe it's because the remaining columns are used in the higher-level broadcast join two lines above:

BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
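A toy illustration of that pruning behavior, assuming standard Spark column pruning (made-up tables and columns, not the TPC-DS schema):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val fact = Seq((1, 1, 10.0, 5)).toDF("cdemo_sk", "hdemo_sk", "sales_price", "quantity")
val cd = Seq((1, "M", "College")).toDF("demo_sk", "marital_status", "education_status")
val hd = Seq((1, 3)).toDF("hd_demo_sk", "dep_count")

// marital_status survives the first join's Project only because the second
// (higher-level) join's condition still references it; once nothing above
// needs a column, Spark's column pruning drops it from the lower Project.
val step1 = fact.join(cd, fact("cdemo_sk") === cd("demo_sk"))
val step2 = step1.join(
  hd,
  step1("hdemo_sk") === hd("hd_demo_sk") &&
    (step1("marital_status") === "M" || hd("dep_count") === 3))
step2.select("quantity").explain()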
