Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Data Maintenance Benchmark #4

Closed
wjxiz1992 opened this issue May 25, 2022 · 1 comment · Fixed by #9
Closed

Support Data Maintenance Benchmark #4

wjxiz1992 opened this issue May 25, 2022 · 1 comment · Fixed by #9
Assignees

Comments

@wjxiz1992
Copy link
Collaborator

Data Maintenance is detailed explained in TPC-DS Specification at Section 5. To simplify, the following 3 steps are needed for it:

  1. Use TPC-DS Tool to regenerate data
    • add "update" option for data generation script
  2. Use SQL files in tests folder in DSGen-software-code-3.2.0rc1, to update table data.
    • add script(currently we have a nds_power.py to read SQL stream files) to read these new SQL files to update the existing table data.
    • the SQL files may contain multiple SQL statement, need to be able to recognize them and execute them one by one.
    • Data Maintenance Functions are at section 5.3.2 in Specification.
    • add new data source schema described under $TPCDS_HOME/tools/tpcds_source.sql
  3. Do Throughput Run again.

Step 2. is called "Data Maintenance". the SQL files are "lf_*.sql" and "dm_*.sql". Note, we need to make them Spark-compatible.

Step 1+2+3 is called Refresh Run.

@wjxiz1992
Copy link
Collaborator Author

The UPDATE data contains some unnecessary data, we should drop them. See details in Specification(v3.2.0) pdf at Page 7 of 141

remove tables that are no longer part of the data maintenance refresh in TPC-DS v2.0 A-1 
(s_zip_to_cmt) A-3 (s_customer) A-7 (s_item) A-10 (s_store) A-11 (s_call_center) A-12 
(s_web_site) A-13 (s_warehouse) A-14 (s_web_page) A-15 (s_promotion) A-20 (s_catalog_page) 
(FogBugz 2178)

The Specification doc doesn't explicitly provide the DELETE query strings, I come up with the following SQLs.
See Section 5.3.11

-- DF_CS
-- for catalog_sales 
delete from catalog_sales where cs_sold_date_sk in (select d_date_sk from date_dim where d_date between DATE1 and DATE2);
-- for catalog_returns 
delete from catalog_returns where cr_order_number in (
select cs_order_number from catalog_sales where cs_sold_date_sk in (select d_date_sk from date_dim where d_date between DATE1 and DATE2)
);

-- DF_SS
-- for store_sales 
delete from store_sales 
where ss_sold_date_sk in 
(select d_date_sk from date_dim where d_date between DATE1 and DATE2);
-- for store_sales 
delete from store_returns where sr_ticket_number in (
select ss_ticket_number from store_sales where ss_sold_date_sk in (select d_date_sk from date_dim where d_date between DATE1 and DATE2)
);

-- DF_WS
-- for web_sales 
delete from web_sales where ws_sold_date_sk in (select d_date_sk from date_dim where d_date between DATE1 and DATE2);
-- for web_returns 
delete from web_returns where wr_order_number in (
select ws_order_number from web_sales where ws_sold_date_sk in (select d_date_sk from date_dim where d_date between DATE1 and DATE2)
);

-- DF_I
-- for inventory
delete from inventory where inv_date_sk in (select d_date_sk from date_dim where d_date between DATE1 and DATE2);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant