[Gold Standard] Updated plans for all tpcds queries with spark-only setup #377
base: master
Conversation
src/test/scala/com/microsoft/hyperspace/goldstandard/PlanStabilitySuite.scala
Note to reviewers: currently q49 doesn't work well with the build pipelines, so per offline suggestions I have removed it from this PR. I will add it back once the issue is resolved.
@apoorvedave1 Could you do this?
The reason is that I want to make sure no changes other than the expression reordering went in.
Ok sure, let me get back to you.
src/test/resources/tpcds/spark-2.4/approved-plans-v1_4/q1/explain.txt
Could you update the PR description as well? It's not easy to follow which portion of the parent proposal applies to this PR.
…_initial
# Conflicts:
#	src/test/scala/com/microsoft/hyperspace/goldstandard/PlanStabilitySuite.scala
#	src/test/scala/com/microsoft/hyperspace/goldstandard/TPCDSBase.scala
Are the test failures related to the cross join?
@imback82 Yeah, outside of the test function it was not picking up the config for enabling cross joins. I made slight changes to the code and it works now. Please take a look.
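For reference, a minimal sketch of applying the config when the session is built, so it takes effect for every query rather than only inside individual test functions; the exact wiring in PlanStabilitySuite may differ, and note that spark.sql.crossJoin.enabled defaults to true only from Spark 3.0 onward:

import org.apache.spark.sql.SparkSession

// Build the session once for the whole suite so the cross-join config
// is picked up everywhere, not only by code inside a test function.
val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("PlanStabilitySuite")
  .config("spark.sql.crossJoin.enabled", "true") // needed on Spark 2.x for TPC-DS queries with cross joins
  .getOrCreate()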
Union
LocalTableScan [customer_id,year_total] [customer_id,year_total]
why union with LocalTableScan?
: : : : +- *(2) Project [d_date_sk#21, d_year#13]
: : : : +- *(2) Filter ((isnotnull(d_year#13) && (d_year#13 = 2001)) && isnotnull(d_date_sk#21))
: : : : +- *(2) FileScan parquet default.date_dim[d_date_sk#21,d_year#13] Batched: true, Format: Parquet, Location [not included in comparison]/{warehouse_dir}/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_year), EqualTo(d_year,2001), IsNotNull(d_date_sk)], ReadSchema: struct<d_date_sk:int,d_year:int>
: : : +- LocalTableScan <empty>, [customer_id#24, year_total#25]
import spark.implicits._ // assuming an active SparkSession named spark
val df = Seq.empty[(String, String, String, String)].toDF("a", "b", "c", "d")
println(df.queryExecution.toString())
creates this:
== Physical Plan ==
LocalTableScan <empty>, [a#12, b#13, c#14, d#15]
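In other words, an empty local relation prints as LocalTableScan <empty>; presumably the optimizer (e.g. via PropagateEmptyRelation) reduced a branch it could prove returns no rows to such a relation, which is why it appears under the Union in the approved plan.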
@@ -1,279 +1,49 @@
== Physical Plan == |
ok
TakeOrderedAndProject [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
WholeStageCodegen (10)
HashAggregate [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,count] [count(1),cnt1,cnt2,cnt3,cnt4,cnt5,cnt6,count]
TakeOrderedAndProject [cd_credit_rating,cd_dep_college_count,cd_dep_count,cd_dep_employed_count,cd_education_status,cd_gender,cd_marital_status,cd_purchase_estimate,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
ok
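These diffs appear to change only the ordering of the attribute lists (the regenerated golden files sort field names alphabetically). A hypothetical sketch of that kind of normalization; the helper name normalizeFieldOrder and the regex are illustrative, not the actual code from PR #384:

// Sort the comma-separated attribute names inside each bracketed list so
// plan lines compare stably regardless of the order Spark emits them in.
def normalizeFieldOrder(planLine: String): String =
  "\\[([^\\]]*)\\]".r.replaceAllIn(planLine, m =>
    "[" + m.group(1).split(",").map(_.trim).sorted.mkString(",") + "]")

normalizeFieldOrder("TakeOrderedAndProject [cd_gender,cd_marital_status,cd_education_status]")
// => "TakeOrderedAndProject [cd_education_status,cd_gender,cd_marital_status]"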
TakeOrderedAndProject [i_category,i_class,i_item_id,i_item_desc,revenueratio,i_current_price,itemrevenue]
WholeStageCodegen (6)
Project [i_item_desc,i_category,i_class,i_current_price,itemrevenue,_w0,_we0,i_item_id]
TakeOrderedAndProject [i_category,i_class,i_current_price,i_item_desc,i_item_id,itemrevenue,revenueratio]
ok
@@ -1,137 +1,24 @@
== Physical Plan == |
ok
Project [ss_ext_sales_price,ss_ext_wholesale_cost,ss_quantity]
BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
Project [cd_education_status,cd_marital_status,ss_ext_sales_price,ss_ext_wholesale_cost,ss_hdemo_sk,ss_quantity,ss_sales_price]
BroadcastHashJoin [cd_demo_sk,ss_cdemo_sk]
Can you check this part? Are more columns/checks pushed down in Spark 3.x?
Maybe it's because the remaining columns are used in the higher-level broadcast join two lines above:
BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
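That reading is consistent with how Spark prunes columns: each Project keeps only the attributes its ancestor operators still need, so a join key consumed by a join higher in the tree survives the child's projection. A small self-contained illustration (the schemas and data are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq((1, 2, 9.99)).toDF("ss_cdemo_sk", "ss_hdemo_sk", "ss_sales_price")
val demo  = Seq((1, "M", "College")).toDF("cd_demo_sk", "cd_marital_status", "cd_education_status")

// ss_cdemo_sk is needed only as the join key, so it is pruned away after
// the join; ss_hdemo_sk survives the Project below the join because the
// final select (standing in for the higher-level join) still uses it.
sales.join(demo, $"ss_cdemo_sk" === $"cd_demo_sk")
  .select("ss_hdemo_sk", "ss_sales_price", "cd_marital_status")
  .explain()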
What is the context for this pull request?
What changes were proposed in this pull request?
In this PR we have updated the plans for all TPC-DS queries q2-q99. Please review the dependency PR #384 first, which contains the code for creating and validating the golden files (query plan files) for the gold standard.
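For context, a rough sketch of what such a validation does conceptually; the helper names (getSimplifiedPlan, checkPlanStability) and the simplification rule are illustrative, not the actual code from PR #384:

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.DataFrame

// Hypothetical simplification: strip run-specific noise such as expression
// IDs ("#123") so plans compare stably across runs; the real harness also
// normalizes locations, statistics, etc.
def getSimplifiedPlan(df: DataFrame): String =
  df.queryExecution.executedPlan.toString.replaceAll("#\\d+", "")

// Compare a query's simplified plan against its approved golden file
// (e.g. src/test/resources/tpcds/spark-2.4/approved-plans-v1_4/q1/explain.txt).
def checkPlanStability(queryName: String, df: DataFrame, goldenDir: String): Unit = {
  val approved = new String(
    Files.readAllBytes(Paths.get(goldenDir, queryName, "explain.txt")))
  assert(getSimplifiedPlan(df).trim == approved.trim,
    s"Plan changed for $queryName; regenerate the golden files if this is intended.")
}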
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests