Optimize queries with similar subplans (CTE) #19744
Will start looking |
This is also something we have been exploring. To do this right, we need a mechanism to materialize intermediate results. Short of that, this paper describes a holistic framework for identifying and reusing CTEs with streaming/staged in-memory execution |
For part 1, yes I think we need to store the results in a temp table and create a new exchange which will stream it from the temp table to the repeated parts of the plan. Thanks for the reference, will go through. |
CC: @feilong-liu |
Maybe some examples? E.g. TPCH Q15 is a simple CSE (Common SubExpression) case.
How are CSEs detected? In the physical plan? Or in the logical algebra?
What would the algorithm detect as CSEs for this query (where column c<n> comes from table t<n>)?
explain select * from (select b1, c1 from t1 join (select a2, a3 from t2 join t3 on b2=b3) dt3 on a1=a2) dt1, (select c1, d1 from t1 join (select a2, a3 from t3 join t2 on b3=b2) dt4 on a1=a2) dt2;
dt3 and dt4 are logically identical. dt1 and dt2 can be computed from the same CSE if their SELECT lists are UNIONed.
I think the decision has to be cost based (i.e. a cost-based rewrite like pushing joins into unions).
…On Mon, Jun 5, 2023 at 5:03 PM Jay Narale ***@***.***> wrote:
[image]
<https://user-images.githubusercontent.com/19339828/243509619-d41549cb-925b-44fd-8bf0-8612ecf8e6ca.png>
This is the architecture that I was thinking about. Basically, create a table definition in planning, then read from and write to that table while managing state in execution.
In planning, there will be a new CTE optimizer which will detect and choose CTEs by hashing all plans beginning from an exchange, maybe using the CBO. This optimizer will transform the plan to new bridge writer and reader operators which will be based off TableWriter and TableScanOperator respectively.
[image]
<https://user-images.githubusercontent.com/19339828/243510121-14162120-af27-4dc4-9286-3c4ea0fedae0.png>
Table deletion will need to be done from the coordinator after the query lifecycle.
One part that is challenging:
- Since insertion will be handled in execution and we are only creating the table metadata during optimization, the part where we assign which drivers read the respective splits (maybe in the scheduler) is unclear to me.
|
@bmckennaah the physical plan hashing will be tailored to canonicalize the subtrees to detect CTEs accurately, hence dt3 and dt4 would be detected. The decision has to be cost based, of course. Initially, however, I am focusing on a complete end-to-end flow and deciding without the CBO. |
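The canonicalizing hash described above can be sketched as follows. This is a toy model under stated assumptions: `PlanNode` and `canonical_hash` are illustrative names, not Presto classes. For inner joins, child order and the sides of the equi-join predicate are normalized, so dt3 (`t2 JOIN t3 ON b2=b3`) and dt4 (`t3 JOIN t2 ON b3=b2`) hash identically.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy plan node; illustrative only, not Presto's actual plan representation.
@dataclass(frozen=True)
class PlanNode:
    op: str                                 # "scan" or "join"
    args: Tuple[str, ...]                   # table name, or equi-join columns
    children: Tuple["PlanNode", ...] = ()

def canonical_hash(node: PlanNode) -> int:
    """Hash a subtree, canonicalizing commutative operators so that
    logically identical subtrees collide."""
    child_hashes = [canonical_hash(c) for c in node.children]
    args = node.args
    if node.op == "join":                   # inner join is commutative:
        child_hashes.sort()                 # child order must not matter,
        args = tuple(sorted(args))          # nor which side of '=' a column is on
    return hash((node.op, args, tuple(child_hashes)))

t2 = PlanNode("scan", ("t2",))
t3 = PlanNode("scan", ("t3",))
dt3 = PlanNode("join", ("b2", "b3"), (t2, t3))  # t2 JOIN t3 ON b2=b3
dt4 = PlanNode("join", ("b3", "b2"), (t3, t2))  # t3 JOIN t2 ON b3=b2
```

A real implementation would canonicalize far more (projections, filter predicates, outer-join asymmetry), but the commutative-normalization idea is the same.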
I think you would rather detect
t1 join (select a2, a3 from t2 join t3 on b2=b3) on a1=a2
as the CSE than dt3 and dt4 (since dt3 and dt4 are contained within this larger CSE). Even better is to use
select b1, c1, d1 from t1 join (select a2, a3 from t2 join t3 on b2=b3) on a1=a2
as your CSE. My point here is that in general CSEs referenced in different places may have different project lists, e.g.
explain with cte1 as (select a1, b1, c1 from t1 join t2 on a1=a2) select * from cte1 where b1 not in (select a1 from cte1);
Here the build and probe sides of the semijoin will have different projection lists (in the physical plan). Even the two physical instances of t1 will project different sets of columns.
|
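One way to handle the differing project lists discussed above is to materialize the union of the columns referenced by each occurrence of the CSE, and give every reader a projection back down to its own list. A minimal sketch, with illustrative names only:

```python
# Sketch: compute the columns a shared temp table must carry, plus the
# per-occurrence projections applied on top of each read.
def plan_cse_columns(occurrences):
    """occurrences: one list of referenced columns per use of the CSE.
    Returns (columns to materialize, per-occurrence projections)."""
    materialized = sorted(set().union(*map(set, occurrences)))
    projections = [list(occ) for occ in occurrences]  # re-project on each read
    return materialized, projections

# cte1 example above: probe side needs a1, b1, c1; build side only a1.
materialized, projections = plan_cse_columns([["a1", "b1", "c1"], ["a1"]])
```

The trade-off is that occurrences needing few columns now read a wider table, which is one reason the decision should ultimately be cost based.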
So we would need to store b1, c1, and d1 as the common temporary table. I think maybe the hashing can be tailored later to process column-wise and then a project on the common CTE while reading. i.e. if we focus on the same subtree first it can be extended for this use case. |
Thanks for looking at this. This is a feature I've wanted for a while but haven't had a chance to prioritize. |
Yes, you need to union the SELECT lists in some cases. And in the physical plan CTE/CSEs may no longer be present, e.g. if t3 is small and t1 and t2 are large in this query:
explain with cte1 as (select a1, b1, c1 from t1 join t2 on a1=a2) select * from cte1, t3 where a1=a3 and b1 not in (select a1 from cte1)
t1 JOIN t2 may only be present once in the physical plan (since joining t1-t3 in the outer block is cheaper) because of view folding/composition/merging and join ordering. So it seems CTE/CSE analysis needs to happen on the logical expression, not the physical plan.
If you use temp tables for CSEs you can create them, analyze them (collect stats on write), replace the CSEs with the temp tables in the original query, and invoke the planner. You'll get much better estimates in the CBO this way (and hence better plans).
|
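The workflow above (create temp tables for the CSEs, collect stats on write, substitute them into the query, then re-invoke the planner) can be sketched as a loop. All function names here are hypothetical stand-ins, not Presto APIs, and the textual substitution is a simplification of a plan-level rewrite:

```python
def plan_with_cse_materialization(query, detect_cses, execute_ctas, planner):
    """detect_cses(query) -> list of (cse_sql, temp_table_name) pairs.
    execute_ctas(name, sql) -> stats collected on write (e.g. row count).
    planner(query, stats) -> final plan, now with real stats for the CBO."""
    stats = {}
    for cse_sql, name in detect_cses(query):
        stats[name] = execute_ctas(name, cse_sql)  # CREATE TABLE ... AS + analyze
        query = query.replace(cse_sql, name)       # textual substitution (sketch only)
    return planner(query, stats)

# Toy usage with stub callbacks:
detect = lambda q: [("select a1, b1, c1 from t1 join t2 on a1=a2", "tmp_cte1")]
ctas = lambda name, sql: {"row_count": 42}
plan = lambda q, stats: (q, stats)

final_query, stats = plan_with_cse_materialization(
    "with cte1 as (select a1, b1, c1 from t1 join t2 on a1=a2) select * from cte1",
    detect, ctas, plan)
```

The key property is the ordering: the CTAS runs (and its stats land) before the final planning pass, so the CBO sees measured cardinalities instead of estimates.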
Look at ExchangeMaterializationStrategy: when it is set to ALL, I think we already materialize exchanges to Hive. Maybe you can piggyback on that framework for this work. |
Wow thanks! I will take a look |
Thanks everyone for all the pointers! I reused the materialized exchange framework and could create a prototype of detection and replacement for a simple UNION query. I used naive detection logic based on the hash of a plan (did not use the CBO) in a new optimizer, basically added a new exchange type, and used the BasePlanFragmenter to create and store temporary tables, similar to here. There are a lot of sanity tests and a lot of work needed to make this production ready, but the prototype shows that this can be done |
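The detect-and-replace step the prototype describes can be sketched as an optimizer pass that groups subtrees by a structural key and rewrites every repeated subtree into one materialization plus multiple reads. `Node`, `replace_duplicates`, and the `temp_read` label are toy stand-ins, not Presto classes:

```python
from collections import defaultdict

# Toy plan node; illustrative only.
class Node:
    def __init__(self, label, *children):
        self.label, self.children = label, list(children)

    def key(self):
        # structural key of the subtree (no canonicalization, i.e. "naive")
        return (self.label, tuple(c.key() for c in self.children))

def replace_duplicates(root):
    """Replace every subtree occurring more than once with a temp-table read;
    returns the rewritten root and the subtrees to materialize."""
    counts = defaultdict(int)

    def count(n):
        counts[n.key()] += 1
        for c in n.children:
            count(c)
    count(root)

    ctes = {}

    def rewrite(n):
        k = n.key()
        if counts[k] > 1:
            ctes.setdefault(k, n)                   # write once (temp-table CTAS)
            return Node(f"temp_read<{len(ctes)}>")  # read everywhere it occurred
        n.children = [rewrite(c) for c in n.children]
        return n

    return rewrite(root), ctes

# dt3/dt4-style duplication: the same join subtree used twice under a union.
dup1 = Node("join[b2=b3]", Node("scan[t2]"), Node("scan[t3]"))
dup2 = Node("join[b2=b3]", Node("scan[t2]"), Node("scan[t3]"))
root = Node("union", Node("project[b1,c1]", dup1), Node("project[c1,d1]", dup2))
new_root, ctes = replace_duplicates(root)
```

Replacing at the topmost repeated subtree (the rewrite stops recursing once it substitutes) matches the earlier point that the larger containing CSE is preferable to its nested pieces.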
This is cool! Once this framework is set up, it could be interesting to make CTE melding more first-class, like merging filters/projections, based on the filter cardinalities, matching output partitioning, etc. |
Very promising result, Jay!
|
Thank you! I will begin work on breaking down components, creating PRs for them, and making this production ready, along with a more detailed analysis of the performance of temporary tables, any risks, etc. |
And a design doc to start now |
Just a suggestion: write your test cases (positive and negative) first so folks understand what the scope is.
|
Maybe I should open a separate issue, but a Tee operator could be quite useful in Presto in general (also in this specific case). That could help quite a bit with things like window functions as well as this CTE/subplan reuse case. |
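For intuition, Python's `itertools.tee` shows the core idea of a Tee operator: one producer streamed to several consumers without recomputation. It buffers only as far as the slowest consumer lags; an engine-grade Tee would add spilling and backpressure on top of this.

```python
import itertools

def expensive_subplan():
    # stands in for a costly common subplan / CTE
    for i in range(5):
        yield i * i

# Tee the single stream into two independent consumers (e.g. the two
# references to a CTE), each reading at its own pace.
probe, build = itertools.tee(expensive_subplan(), 2)
build_rows = list(build)   # one side drains first
probe_rows = list(probe)   # the other still sees every row
```

The contrast with temp-table materialization is that a Tee keeps data in memory and couples the consumers' lifetimes, while a temp table decouples them at the cost of a write/read round trip.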
I've put together a proposal that's inspired by and based on the paper http://www.vldb.org/pvldb/vol8/p1704-elhelw.pdf, which was recommended by @rschlussel. I'm hoping we can discuss it and improve the idea further. https://docs.google.com/document/d/10J9_j08imqe6xEH9iepteG0DA5B-IpmYv4lSRvCgdJA/edit. |
@kaikalur @rschlussel @ClarenceThreepwood @bmckennaah what do you think about this approach? - link |
@jaystarshot wanted to check if there was any update here |
@rschlussel Unfortunately, no. We got sidetracked by other tasks given the estimated work and ROI for this one, but we plan to dedicate some resources this half. Interested in reprioritizing if we can get more resources from the community. |
We are starting POC development based on the design and will have something this quarter |
I've prepared the POC, accessible here: link. Initially, we executed TPC-DS query 4 and observed a 30% CPU reduction on sf100. On sf1000, Q4, which previously failed, finished within 5 minutes on our test cluster using the PAGEFILE storage format for the temporary table. Additionally, we mimicked production traffic by arbitrarily converting one Common Table Expression (CTE) to be persisted. Unfortunately, 20-40% of the queries encountered an identical internal error: 'Malformed PageFile format, footer length is missing.' However, these errors were not reproducible on subsequent runs. The remaining 60-80% of the queries executed successfully. |
We don't see these errors with the Parquet format for temp tables, though writes are around 5x slower. 10% of queries failed with a Hive bad data error. |
This error message suggests the file is not valid PAGEFILE data. Is it possible that the file is empty, does not exist, or is not in PAGEFILE format (less likely; if so, the error would be thrown later)? |
I debugged with deletion off; the PageFile file was non-empty. It happens with Parquet too.
So my suspicion is now on two things.
|
Finally found the issue: we had a legacy configuration where reading from the observer NameNode was enabled. I think the observer NameNode needs some time to sync with the active NameNode; turning that off works. |
Where are we on this? We are also waiting for this PR to get merged. |
A part of the work (materialize all CTEs) has already been merged.
|
Thank you @jaystarshot ! |
Did you mean to reference something? These links just point to the general page. |
Summary: The Velox HiveConnector supports writing bucketed files only when they are partitioned as well. This presents a feature gap wrt Presto.

Presto behavior (for bucketed but not partitioned tables):
- Supports CTAS into bucketed (but not partitioned) tables
- Cannot append/overwrite to existing bucketed tables (though can append to TEMPORARY ones)

CTAS into bucketed tables has become important because such tables are used for CTE (WITH clause) materialization.

Note: This PR only handles CTAS situations. There will be a separate PR for TEMPORARY tables.
prestodb/presto#19744
prestodb/presto#22630

### Background

#### TableWriter and TableFinish

Presto uses TableWriter PlanNodes to do the writing operations. The TableWriter nodes run on the workers. These nodes write the input rows into data files (in a staging directory, before moving them to a target directory). The TableWriter node works in conjunction with a TableCommit node on the coordinator. The TableCommit node (TableFinishOperator) does the final renaming of the target directory and the commit to the metastore.

It is important to note that plans for bucketed tables involve a LocalExchange that brings all the data to a single driver for the TableWriter so that it can bucket and write the data appropriately.

```
EXPLAIN CREATE TABLE lineitem_bucketed2(orderkey, partkey, suppkey, linenumber, quantity, ds)
WITH (bucket_count = 10, bucketed_by = ARRAY['orderkey'], sorted_by = ARRAY['orderkey'])
AS SELECT orderkey, partkey, suppkey, linenumber, quantity, '2021-12-20' FROM tpch.tiny.lineitem;
```

Plan with TableWriter and TableCommit nodes. Note the LocalExchange moving all data to a single driver.

```
- Output[PlanNodeId 7]
  - TableCommit[PlanNodeId 5][Optional[hive.tpch_bucketed.lineitem_bucketed2]] => [rows_23:bigint]
    - RemoteStreamingExchange[PlanNodeId 299][GATHER] => [rows:bigint, fragments:varbinary, commitcontext:varbinary]
      - TableWriter[PlanNodeId 6] => [rows:bigint, fragments:varbinary, commitcontext:varbinary]
            orderkey := orderkey (1:194)
            partkey := partkey (1:204)
            suppkey := suppkey (1:213)
            linenumber := linenumber (1:222)
            quantity := quantity (1:234)
            ds := expr (1:244)
        - LocalExchange[PlanNodeId 330][SINGLE] () => [orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, expr:varchar(10)] >
          - RemoteStreamingExchange[PlanNodeId 298][REPARTITION] => [orderkey:bigint, partkey:bigint, suppkey:bigint, linenumber:integer, quantity:double, expr:varcha>
            - ScanProject[PlanNodeId 0,187][table = TableHandle {connectorId='tpch', connectorHandle='lineitem:sf0.01', layout='Optional[lineitem:sf0.01]'}, project>
                  expr := VARCHAR'2021-12-20'
                  suppkey := tpch:suppkey (1:262)
                  partkey := tpch:partkey (1:262)
                  linenumber := tpch:linenumber (1:262)
                  orderkey := tpch:orderkey (1:262)
                  quantity := tpch:quantity (1:262)
```

The above command creates 10 files, as follows (10 is the bucket count):

```
Aditis-MacBook-Pro:lineitem_bucketed aditipandit$ pwd
${DATA_DIR}/hive_data/tpch/lineitem_bucketed
Aditis-MacBook-Pro:lineitem_bucketed2 aditipandit$ ls
000000_0_20240507_221727_00018_73r2r  000003_0_20240507_221727_00018_73r2r  000006_0_20240507_221727_00018_73r2r  000009_0_20240507_221727_00018_73r2r
000001_0_20240507_221727_00018_73r2r  000004_0_20240507_221727_00018_73r2r  000007_0_20240507_221727_00018_73r2r
000002_0_20240507_221727_00018_73r2r  000005_0_20240507_221727_00018_73r2r  000008_0_20240507_221727_00018_73r2r
```

#### TableWriter output

The TableWriter output contains three columns per fragment (one for each individual target file). This format is presented for completeness. **There are no special changes for bucketed tables here. The only important difference is that the writePath/targetPath would not contain the partition directory.**

| TableWriter output row |
|--------|
| ROW<rows:BIGINT,fragments:VARBINARY,commitcontext:VARBINARY> |

| Rows | Fragments | CommitContext |
|--------|--------|--------|
| N (numPartitionUpdates) | NULL | TaskCommitContext |
| NULL | PartitionUpdate0 | |
| NULL | PartitionUpdate1 | |
| NULL | ... | |
| NULL | PartitionUpdateN | |

The fragments column contains JSON strings of PartitionUpdate in the following format:

```
{
  "Name": "ds=2022-08-06/partition=events_pcp_product_finder_product_similartiy__groupby__999999998000212604",
  "updateMode": "NEW",
  "writePath": "",
  "targetPath": "",
  "fileWriteInfos": [
    {
      "writeFileName": "",
      "targetFileName": "",
      "fileSize": 3517346970
    },
    {
      "writeFileName": "",
      "targetFileName": "",
      "fileSize": 4314798687
    }
  ],
  "rowCount": 3950431150,
  "inMemoryDataSizeInBytes": 4992001194927,
  "onDiskDataSizeInBytes": 1374893372141,
  "containsNumberedFileNames": false
}
```

The commitcontext column is a constant vector of TaskCommitContext as a JSON string:

```
{
  "lifespan": "TaskWide",
  "taskId": "20220822_190126_00000_78c2f.1.0.0",
  "pageSinkCommitStrategy": "TASK_COMMIT",
  "lastPage": false
}
```

#### Empty buckets

The TableWriter generates PartitionUpdate messages only for the files it has written. So if there are empty buckets, there is no PartitionUpdate message for them. If PartitionUpdate output messages are missing for any bucket, the TableFinish operator fixes up the HiveMetastore with empty files for each such bucket.
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java#L1794

### Design

As outlined above, all table writing happens in the TableWriter operator. The TableWriter forwards the write to the HiveDataSink, which the HiveConnector registers for it. The HiveDataSink already supported bucketed (and partitioned) tables, so all the logic for wiring bucket metadata and bucket computation already existed. The only missing piece was to handle file names for bucketed but not partitioned files in the writerIds, and to map the proper writerId to input rows when appending to the HiveDataSink. This PR fixes that.

Note: The Prestissimo changes are in prestodb/presto#22737

Pull Request resolved: #9740
Reviewed By: kewang1024
Differential Revision: D57748876
Pulled By: xiaoxmeng
fbshipit-source-id: 33bb77c6fce4d2519f3214e2fb93891f1f910716
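The row-to-writer mapping the PR adds can be sketched as follows. This is illustrative only: the hash function and file-naming helper are stand-ins, not Velox's actual implementation (real Hive bucketing uses a specific bucket hash), but it shows how each row maps to one of `bucket_count` writers even without any partition directory.

```python
def bucket_id(row, bucketed_by, bucket_count):
    """Map a row to a bucket by hashing its bucketing columns."""
    key = tuple(row[c] for c in bucketed_by)
    return hash(key) % bucket_count

def writer_file_name(bucket, query_id):
    """Bucketed-but-unpartitioned files sit at the table root, named by bucket
    (cf. the 000000_0_<query-id> listing above)."""
    return f"{bucket:06d}_0_{query_id}"

bucket_count = 10
rows = [{"orderkey": k} for k in range(100)]
files = {
    writer_file_name(bucket_id(r, ["orderkey"], bucket_count),
                     "20240507_221727_00018_73r2r")
    for r in rows
}
```

However many rows arrive, at most `bucket_count` files are produced, which is exactly the writerId bookkeeping the PR wires up for the unpartitioned case.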
Ref - trinodb/trino#5878