Add "Complete Table Scan" blog post by Orri Erling #31
|
Masha, can you let me know whether the formatting of the SQL and the TPC-H table/column names looks correct to you? Some of the time they have prefixes and suffixes and refer to our internal stuff like hive.tpch.lineitem_s. Not sure what is optimal for a blog post, but if it can all execute against Presto I think it's good enough. The original post is on Medium. |
|
The scale is 100G. The point is that this is small enough to run from memory and large enough not to be dominated by query setup costs. The point of 12345 is that it is at neither end of the 1 – 1M range of suppkey values. I think the text says a 1/1M selection, implying a 100G scale.
Here you want a value that is not in the top or bottom 1/10K of the values, where there would be a good chance of row group summaries skipping whole row groups. You can try this with a value of 1 and you’ll see that the proportions of the cost factors are off.
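To illustrate the row group point, here is a minimal sketch of min/max skipping for an equality filter. The `RowGroup` class, the function names, and the summary values are all made up for illustration; this is not the Presto or ORC reader code.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical row group summary: min and max suppkey seen in the group."""
    min_key: int
    max_key: int


def can_skip(group: RowGroup, key: int) -> bool:
    # An equality filter cannot match if the key lies outside [min, max].
    return key < group.min_key or key > group.max_key


def groups_to_read(groups: list[RowGroup], key: int) -> list[RowGroup]:
    # Keep only the groups whose summary does not rule the key out.
    return [g for g in groups if not can_skip(g, key)]


# Made-up summaries: with keys spread over 1..1M, each group's range is wide.
groups = [RowGroup(57, 999_912), RowGroup(1, 999_998), RowGroup(212, 1_000_000)]

mid_range = groups_to_read(groups, 12345)  # no group can be skipped
low_end = groups_to_read(groups, 1)        # only the group whose min is 1 remains
```

With a key like 12345 every group has to be scanned, so the measurement reflects the scan itself. With a key of 1 most groups are skipped on the summary alone, which shifts the cost proportions.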
From: Maria Basmanova <notifications@github.com>
Sent: Friday, June 28, 2019 5:04 PM
To: prestodb/prestodb.github.io <prestodb.github.io@noreply.github.com>
Cc: oerling <erling@xs4all.nl>; Mention <mention@noreply.github.com>
Subject: Re: [prestodb/prestodb.github.io] Add "Complete Table Scan" blog post by Orri Erling (#31)
@mbasmanova commented on this pull request.
_____
In website/blog/2019-07-15-complete-table-scan.md <#31 (comment)> :
+In the previous article we looked at the abstract problem statement and possibilities inherent in scanning tables. In this piece we look at the quantitative upside with Presto. We look at a number of queries and explain the findings.
+
+The initial impulse motivating this work is the observation that table scan is by far the #1 operator in Presto workloads I have seen. This is a little over half of all Presto CPU, with repartitioning a distant second, at around 1/10 of the total. The other half of the motivation is ready opportunity: Presto in its pre-Aria state does almost none of the things that are common in table scan.
+
+For easy reproducibility and staying away from proprietary material, we use a TPC-H 100G dataset running on a desktop machine with 2x4 Skylake cores at 3.5GHz. The data is compressed with Snappy and we are running with warm OS cache. The Presto is a modified 0.221 where the Aria functionality can be switched on and off. Not to worry, we will talk about disaggregated storage and IO in due time but the basics will come first.
+
+# Simple scan
+
+The base case for scan optimization is the simplest possible query:
+
+```sql
+SELECT SUM(l_extendedprice)
+FROM lineitem
+WHERE suppkey = 12345;
+```
+| Version | Wall time (seconds) | CPU time (seconds) |
@aweisberg <https://github.com/aweisberg>
I'd pivot this table and add a column for the ratio between Aria and baseline. I think this will be clearer.
|
|
You can use my LinkedIn public profile link.
Thanks
Orri
From: Joel Marcey <notifications@github.com>
Sent: Friday, June 28, 2019 5:42 PM
To: prestodb/prestodb.github.io <prestodb.github.io@noreply.github.com>
Cc: oerling <erling@xs4all.nl>; Mention <mention@noreply.github.com>
Subject: Re: [prestodb/prestodb.github.io] Add "Complete Table Scan" blog post by Orri Erling (#31)
@JoelMarcey commented on this pull request.
_____
In website/blog/2019-07-15-complete-table-scan.md <#31 (comment)> :
@@ -0,0 +1,169 @@
+---
+title: Complete Table Scan: A Quantitative Assessment
+author: Orri Erling
+authorURL: http://code.fb.com/
Don't you want a different URL than this?
|
b6e9f96 to
061e42c
|
I spent some time last week reproducing the benchmark. I was able to reproduce the results for every query, with some performance differences that we are ascribing to differences in hardware. See BENCHMARK.md. We also ran it again on Orri's workstation to make sure he still gets the same results, and he did. There was one query where Presto was getting lucky and putting the right filter first, so Orri added [this](https://github.com/aweisberg/presto/blob/aria-scan-prototype/BENCHMARK.md). @mbasmanova WDYT of this? I am hoping to post this on Monday. |
|
Any way to preview the rendered result? :) |
|
Yes, if you follow the instructions in website/README.md and get yarn installed, you can run "yarn start" and it will host the site live locally. |
mbasmanova
left a comment
Looks great modulo minor comments.
|
|
||
| The initial impulse motivating this work is the observation that table scan is by far the #1 operator in Presto workloads I have seen. This is a little over half of all Presto CPU, with repartitioning a distant second, at around 1/10 of the total. The other half of the motivation is ready opportunity: Presto in its pre-Aria state does almost none of the things that are common in table scan. | ||
|
|
||
| <!--truncate--> |
What does this do?
The blog has an index as well as individual posts. The index shows a snippet of each post, and this tag determines where the snippet ends.
|
|
||
| ## Mechanics of a scan | ||
|
|
||
| Baseline Presto does this as follows: The scan `OrcPageSource` produces consecutive `Page` instances that contain a `LazyBlock` for each column. This operation as such takes no time since the `LazyBlock` instances are just promises. The actual work takes place when evaluating the generated code for the comparison. This sees that the column is not loaded, loads all the values in the range of the `LazyBlock`, typically 1024 values and then does the operation and produces a set of passing row numbers. This set is empty for all but 1/100k of the cases. If this is empty, the `LazyBlock` for `extendedprice` is not touched. If there are hits, the `extendedprice` `LazyBlock` is loaded and the values for the selected rows are copied out. When this happens, 1024 values are decoded from the column and most often one of them is accessed. Loading a `LazyBlock` allocates memory for each value. In the present case this becomes garbage immediately after first use. The same applies to the values in extended price, of which only one is copied to a `Block` of output. This is handled by a special buffering stage that accumulates rows from multiple loaded `LazyBlock` instances until there is a minimum batch worth of rows to pass to the next operator. |
typo: The scan `OrcPageSource` -> The `OrcPageSource`
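For readers following the quoted paragraph, the baseline flow it describes can be sketched roughly as below. This is a simplified stand-in in Python, not the Presto Java code: the `LazyBlock` class here, its `loads` counter, and `scan_page` are hypothetical names used only to show when a column gets decoded.

```python
class LazyBlock:
    """Hypothetical stand-in for a lazily decoded column chunk."""

    def __init__(self, loader):
        self._loader = loader  # callable that decodes every value in the range
        self._values = None
        self.loads = 0         # counts how often the column was decoded

    def load(self):
        if self._values is None:
            self._values = self._loader()
            self.loads += 1
        return self._values


def scan_page(suppkey_block, price_block, key):
    # The filter column is decoded in full before the comparison runs.
    suppkeys = suppkey_block.load()
    hits = [row for row, value in enumerate(suppkeys) if value == key]
    if not hits:
        return []  # the price block is never touched
    # A single hit still decodes every price in the range.
    prices = price_block.load()
    return [prices[row] for row in hits]


PAGE = 1024  # typical number of values per LazyBlock, per the quoted text

miss_keys = LazyBlock(lambda: [7] * PAGE)
miss_prices = LazyBlock(lambda: [1.0] * PAGE)
miss_result = scan_page(miss_keys, miss_prices, 12345)

hit_keys = LazyBlock(lambda: [7] * (PAGE - 1) + [12345])
hit_prices = LazyBlock(lambda: [float(i) for i in range(PAGE)])
hit_result = scan_page(hit_keys, hit_prices, 12345)
```

In the miss case only the filter column is decoded. In the hit case all 1024 prices are decoded to copy out one value, which is the waste the paragraph points at.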
| | Aria | 4 | 44.2 | 1.0 | | ||
| | Baseline | 21 | 271 | 6.13 | | ||
|
|
||
| The filtered columns are of low cardinality and are encoded as dictionaries. This is an example of evaluating an expensive predicate on only distinct values. Baseline Presto misses the opportunity because all filters are generated into a monolithic code block. Aria generates filter expressions for each distinct set of required columns. In this case the filters are independent and reorderable. |
@oerling I didn't realize that complex filters also run on dictionaries. This is super cool. Do you have a pointer for me to check out how this is done?
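As a rough illustration of the idea only (not the Aria implementation, and every name here is an assumption), evaluating a predicate once per distinct dictionary entry and then mapping rows through the result could look like this:

```python
def filter_dictionary_column(dictionary, ids, predicate):
    """Return the row numbers whose dictionary-encoded value passes the predicate.

    dictionary: list of distinct values
    ids: per-row indexes into the dictionary
    """
    # Evaluate the (possibly expensive) predicate once per distinct value.
    passes = [predicate(value) for value in dictionary]
    # Per row, the filter is now just a table lookup.
    return [row for row, idx in enumerate(ids) if passes[idx]]


calls = 0


def starts_with_s(value):
    # Stand-in for an expensive predicate; counts how often it runs.
    global calls
    calls += 1
    return value.startswith("S")


dictionary = ["AIR", "RAIL", "SHIP"]
ids = [0, 2, 1, 0, 2, 2, 0]
selected_rows = filter_dictionary_column(dictionary, ids, starts_with_s)
```

Here the predicate runs three times for seven rows; with a low-cardinality column and a large row count the saving grows accordingly.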
| The ideas presented here are currently being integrated into mainline Presto. | ||
|
|
||
| # Conclusions and Next Steps | ||
| We have so far had a look at the low-hanging fruits for scanning flat tables. These techniques are widely known and once one considers the fundamentals these become just matters of common sense. |
"once one considers the fundamentals these become just matters of common sense"
Is there a way to soften this sentence?
aad68fb to
5ec218c
No description provided.