Prefilter join build side when it's too large #22667

kaikalur · 2024-05-03T23:43:36Z

Description

Optimize the build side of a join by prefiltering it with the distinct join keys from the left (probe) side when the right side (and the left side too) is large.

Motivation and Context

SELECT .. FROM T1 JOIN T2 USING(x)

can be very slow and memory-intensive when T2 is big (and T1 is also big). So the idea is to do something like a dynamic filter, except on the build side. The above query becomes:

SELECT ... FROM T1 LEFT JOIN (SELECT * FROM T2 WHERE x IN (SELECT DISTINCT x FROM T1)) T2 USING(x)

This has helped us tremendously in some of our production workloads. So making it an optimization.
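
As a minimal sketch of why the rewrite is safe, the snippet below runs both query shapes against an in-memory SQLite database and compares the results. The table contents are made up for illustration, and SQLite only stands in for Presto here; it says nothing about how the optimizer rule itself is implemented. The check covers both the inner join from the first query and the LEFT JOIN form of the rewritten query, since rows of T2 whose key never appears in T1 cannot contribute to either.

```python
import sqlite3

# Hypothetical sample data: T2 has keys (3, 4) that never appear in T1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (x INTEGER, a TEXT);
    CREATE TABLE T2 (x INTEGER, b TEXT);
    INSERT INTO T1 VALUES (1, 'a1'), (1, 'a1b'), (2, 'a2'), (5, 'a5');
    INSERT INTO T2 VALUES (1, 'b1'), (2, 'b2'), (2, 'b2b'), (3, 'b3'), (4, 'b4');
""")

# The prefiltered build side from the rewritten query.
PREFILTERED_T2 = "(SELECT * FROM T2 WHERE x IN (SELECT DISTINCT x FROM T1))"


def rows(sql):
    """Run a query and return its rows in a deterministic order."""
    return sorted(conn.execute(sql).fetchall(), key=repr)


inner = rows("SELECT x, a, b FROM T1 JOIN T2 USING (x)")
inner_pre = rows(f"SELECT x, a, b FROM T1 JOIN {PREFILTERED_T2} T2 USING (x)")

left = rows("SELECT x, a, b FROM T1 LEFT JOIN T2 USING (x)")
left_pre = rows(f"SELECT x, a, b FROM T1 LEFT JOIN {PREFILTERED_T2} T2 USING (x)")

# Number of rows that would have to go into the join's hash table.
build_before = conn.execute("SELECT COUNT(*) FROM T2").fetchone()[0]
build_after = conn.execute(f"SELECT COUNT(*) FROM {PREFILTERED_T2}").fetchone()[0]

print("same inner result:", inner == inner_pre)
print("same left result:", left == left_pre)
print("build rows:", build_before, "->", build_after)
```

The join results are unchanged while fewer rows reach the build side; the price is the extra scan of T1 needed to compute the distinct keys.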

Impact

Test Plan

Added tests

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

Added a new optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the ``join_prefilter_build_side`` session property. :pr:`22667`

     join_prefilter_build_side

agrawaldevesh · 2024-05-05T06:59:23Z

Awesome! Is this strictly opt-in, or can it be driven by HBO or CBO?

Is there also a way to fail the SELECT DISTINCT early if its cardinality is too big?

Finally, I did not follow: can it be applied to multiple equi-join keys too?

elharo

Great idea!

presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java

ClarenceThreepwood

Can you share some performance numbers that you see in your workloads? Maybe even add a SqlBenchmark that showcases this optimization?

IIUC - this optimization reduces the size of the hash table that is built out of T2. In order to do this it adds a second table scan on T1 and then builds a second hash table to compute the distinct join key from T1. I'm curious where the benefit comes from? Is it just the improved performance of the semijoin?

Any thoughts on how this can be used in practice? I ask because this is not a cost-based decision and can degrade performance in many use cases.

presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/JoinPrefilter.java

yingsu00

I'm curious: if we already know the distinct keys in T1, why not just make T1 the build side? There is no need to calculate the hash values; just use the distinct values as the hash values. This way there is no need to scan T1 twice.

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

steveburnett · 2024-05-09T14:25:21Z

Suggest a minor revision of the release notes entry:

== RELEASE NOTES ==

General Changes
* Add optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the ``join_prefilter_enabled`` session property. :pr:`22667`

kaikalur · 2024-05-09T14:43:13Z

> Can you share some performance numbers that you see in your workloads? Maybe even add a SqlBenchmark that showcases this optimization?
>
> IIUC - this optimization reduces the size of the hash table that is built out of T2. In order to do this it adds a second table scan on T1 and then builds a second hash table to compute the distinct join key from T1. I'm curious where the benefit comes from? Is it just the improved performance of the semijoin?
>
> Any thoughts on how this can be used in practice? I ask because this is not a cost-based decision and can degrade performance in many use cases.

Two potential cases: a) the build side is very big and only a few keys actually match, so we shuffle far fewer rows from the right side; b) after the semijoin, the build side becomes small enough to broadcast, which can eliminate shuffling the full left side, which could carry a lot of payload.
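
The two cases can be illustrated with a small sketch. Everything in it is hypothetical: the row counts, the `broadcast_limit_rows` threshold, and the helper names are invented for the example, and the distribution rule is a stand-in for the real planner decision, not a description of it.

```python
def prefilter_build(probe_keys, build_rows):
    """Keep only build rows whose join key appears on the probe side (a semijoin)."""
    distinct_keys = set(probe_keys)
    return [row for row in build_rows if row[0] in distinct_keys]


def choose_distribution(build_row_count, broadcast_limit_rows):
    """Hypothetical rule: broadcast the build side only if it is small enough."""
    return "broadcast" if build_row_count <= broadcast_limit_rows else "partitioned"


# Made-up workload: 1,000 probe rows over 10 distinct keys, 100,000 build rows.
probe_keys = [k % 10 for k in range(1000)]
build_rows = [(k, f"payload-{k}") for k in range(100_000)]
LIMIT = 1_000  # assumed broadcast threshold, in rows

filtered = prefilter_build(probe_keys, build_rows)

# Case a): far fewer build rows need to be shuffled.
print("build rows:", len(build_rows), "->", len(filtered))

# Case b): the filtered build side now fits under the broadcast threshold,
# so the probe side would not need to be repartitioned.
print("before:", choose_distribution(len(build_rows), LIMIT))
print("after: ", choose_distribution(len(filtered), LIMIT))
```

When most build keys do match the probe side, the filter removes little and the extra probe-side scan is pure overhead, which is the concern behind the cost-based question above.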

kaikalur · 2024-05-09T14:43:51Z

> I'm curious: if we already know the distinct keys in T1, why not just make T1 the build side? There is no need to calculate the hash values; just use the distinct values as the hash values. This way there is no need to scan T1 twice.

You need to get the rest of the fields!

kaikalur · 2024-05-09T15:07:46Z

> Can you share some performance numbers that you see in your workloads? Maybe even add a SqlBenchmark that showcases this optimization?
> IIUC - this optimization reduces the size of the hash table that is built out of T2. In order to do this it adds a second table scan on T1 and then builds a second hash table to compute the distinct join key from T1. I'm curious where the benefit comes from? Is it just the improved performance of the semijoin?
> Any thoughts on how this can be used in practice? I ask because this is not a cost-based decision and can degrade performance in many use cases.

Two potential cases: a) the build side is very big and only a few keys actually match, so we shuffle far fewer rows from the right side; b) after the semijoin, the build side becomes small enough to broadcast, which can eliminate shuffling the full left side, which could carry a lot of payload.

Added benchmark with:

select count(1) from part join lineitem using (partkey) where part.name like '%x%'

Original:
join_prefilter_build_side :: 2158.693 cpu ms :: 4.17MB peak memory :: in 75.2K, 0B, 34.8K/s, 0B/s :: out 381, 26.7KB, 176/s, 12.4KB/s

With optimization:
join_prefilter_build_side :: 2189.438 cpu ms :: 2.02MB peak memory :: in 90.2K, 0B, 41.2K/s, 0B/s :: out 381, 26.7KB, 174/s, 12.2KB/s

See the memory reduction.
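
To put the two benchmark lines side by side, here is a short calculation of the relative changes, using only the figures reported above. Treating the "in" row counts as thousands of rows is an assumption about the benchmark's output format.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the benchmark output above.
cpu = percent_change(2158.693, 2189.438)   # cpu ms
mem = percent_change(4.17, 2.02)           # peak memory, MB
rows_in = percent_change(75.2, 90.2)       # input rows, assumed to be in thousands

print(f"cpu: {cpu:+.1f}%  peak memory: {mem:+.1f}%  input rows: {rows_in:+.1f}%")
```

On this single query, peak memory drops by roughly half while CPU time rises by about 1.4% and input rows by about 20%; the extra input is presumably the second read of the probe side needed to compute the distinct keys.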

kaikalur · 2024-05-09T16:38:48Z

Added a task for making it cost-based; it needs tracking some new stats in HBO:

https://fburl.com/2aui8j1i

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

kaikalur · 2024-05-13T17:19:43Z

@ClarenceThreepwood - can you take a look when you get a chance?

ClarenceThreepwood

Please update the release note with the new name of the session property

presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/JoinPrefilter.java

ClarenceThreepwood · 2024-05-14T18:24:03Z

> Added a task for making it cost-based; it needs tracking some new stats in HBO:
>
> https://fburl.com/2aui8j1i

Is this Meta-internal only?

kaikalur · 2024-05-14T18:44:21Z

> > Added a task for making it cost-based; it needs tracking some new stats in HBO:
> > https://fburl.com/2aui8j1i
>
> Is this Meta-internal only?

Oops, sorry. Here is the correct link:

#22706

kaikalur · 2024-05-14T18:51:05Z

> Please update the release note with the new name of the session property

Done

kaikalur · 2024-05-14T18:53:04Z

addressed comments

ClarenceThreepwood · 2024-05-14T19:02:02Z

> > Please update the release note with the new name of the session property
>
> Done

It still has the old name here:
"Added a new optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the join_prefilter_enabled session property. :pr:22667"

kaikalur · 2024-05-14T19:12:09Z

> > > Please update the release note with the new name of the session property
> >
> > Done
>
> It still has the old name here: "Added a new optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the join_prefilter_enabled session property. :pr:22667"

OK for real this time lol - damn scrollbar!

kaikalur · 2024-05-14T20:18:59Z

OK all comments addressed (again)

yingsu00 · 2024-06-26T17:18:37Z

> > I'm curious: if we already know the distinct keys in T1, why not just make T1 the build side? There is no need to calculate the hash values; just use the distinct values as the hash values. This way there is no need to scan T1 twice.
>
> You need to get the rest of the fields!

@kaikalur Could you explain a bit more? If we just make T1 the build side, we can still get the rest of the fields, can't we? For example, take:

select part.partkey, lineitem.quantity from part join lineitem using (partkey) where part.name like '%x%'

If we make part the build side, the dynamic filter on partkey can be propagated to lineitem, and part.partkey and lineitem.quantity can be output as well. It's still an inner equijoin, so the semantics should not change. Does the SQL standard say T2 must be the build side in T1 join T2?

kaikalur · 2024-06-27T21:44:43Z

> > > I'm curious: if we already know the distinct keys in T1, why not just make T1 the build side? There is no need to calculate the hash values; just use the distinct values as the hash values. This way there is no need to scan T1 twice.
> >
> > You need to get the rest of the fields!
>
> @kaikalur Could you explain a bit more? If we just make T1 the build side, we can still get the rest of the fields, can't we? For example, take: select part.partkey, lineitem.quantity from part join lineitem using (partkey) where part.name like '%x%'. If we make part the build side, the dynamic filter on partkey can be propagated to lineitem, and part.partkey and lineitem.quantity can be output as well. It's still an inner equijoin, so the semantics should not change. Does the SQL standard say T2 must be the build side in T1 join T2?

No, but that's a join reordering problem?

yingsu00 · 2024-06-28T13:30:16Z

> No, but that's a join reordering problem?

Yes, that's what I was asking. Can't we just reorder the join? Or does the SQL standard say the table on the right side of the JOIN keyword has to be the build side? It seems to me the join order is a decision of the engine, isn't it?

feilong-liu · 2024-06-28T22:33:16Z

> > No, but that's a join reordering problem?
>
> Yes, that's what I was asking. Can't we just reorder the join? Or does the SQL standard say the table on the right side of the JOIN keyword has to be the build side? It seems to me the join order is a decision of the engine, isn't it?

My understanding is that it works when the probe side has a large payload to output. For example:
select T1.k, T1.map, T1.row, T1.array, T2.k, T2.v from T1 Join T2 using(k)
Here, T1.map, T1.row, and T1.array are large columns; if we put T1 on the build side, we pay a large memory cost to store these payloads, which can cause an OOM.
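
A back-of-the-envelope sketch of this argument follows. The cost model (rows times key-plus-payload bytes) is a deliberate simplification that ignores hash table overhead, and all row counts and byte sizes are invented; it is not Presto's cost model.

```python
from dataclasses import dataclass


@dataclass
class Side:
    """One join input, described by hypothetical size statistics."""
    name: str
    rows: int
    payload_bytes_per_row: int  # average size of the non-key output columns
    key_bytes: int = 8

    def build_bytes(self):
        """Rough memory needed to hold this side in the join hash table."""
        return self.rows * (self.key_bytes + self.payload_bytes_per_row)


def pick_build_side(a, b):
    """Choose the side whose hash table (keys plus payload) is smaller."""
    return a if a.build_bytes() <= b.build_bytes() else b


# T1 has fewer rows but wide map/row/array columns; T2 is long but narrow.
t1 = Side("T1", rows=1_000_000, payload_bytes_per_row=4_096)
t2 = Side("T2", rows=50_000_000, payload_bytes_per_row=16)
# Assumed effect of the prefilter: only 2M rows of T2 have a key present in T1.
t2_prefiltered = Side("T2 prefiltered", rows=2_000_000, payload_bytes_per_row=16)

for side in (t1, t2, t2_prefiltered):
    print(f"{side.name}: {side.build_bytes() / 1e9:.3f} GB")
print("build side without prefilter:", pick_build_side(t1, t2).name)
```

Under these made-up numbers, building on T1 costs more memory than building on T2 despite T1 having 50 times fewer rows, and prefiltering T2 shrinks its hash table further without moving the wide columns to the build side.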

yingsu00 · 2024-06-29T22:01:42Z

> My understanding is that it works when the probe side has a large payload to output. For example:
> select T1.k, T1.map, T1.row, T1.array, T2.k, T2.v from T1 Join T2 using(k)
> Here, T1.map, T1.row, and T1.array are large columns; if we put T1 on the build side, we pay a large memory cost to store these payloads, which can cause an OOM.

@feilong-liu Thanks for your message. It makes sense. Maybe we can keep pointers instead of copying all these fields to the hash table in that case.

feilong-liu · 2024-06-29T22:23:50Z

> > My understanding is that it works when the probe side has a large payload to output. For example:
> > select T1.k, T1.map, T1.row, T1.array, T2.k, T2.v from T1 Join T2 using(k)
> > Here, T1.map, T1.row, and T1.array are large columns; if we put T1 on the build side, we pay a large memory cost to store these payloads, which can cause an OOM.
>
> @feilong-liu Thanks for your message. It makes sense. Maybe we can keep pointers instead of copying all these fields to the hash table in that case.

Keeping pointers sounds like an interesting idea. Can you elaborate? One challenge I can think of is that, even if we keep pointers, we still need to read these payloads and attach them to the result later, which still requires loading the payloads from memory.

yingsu00 · 2024-07-03T14:41:05Z

The payload is already in memory from the incoming pages. But you're right: if both sides are too big to be held in memory, then the side that produces the smaller hash table (including the output payload) should be the build side.

kaikalur requested review from jaystarshot, feilong-liu and a team as code owners May 3, 2024 23:43

kaikalur requested a review from presto-oss May 3, 2024 23:43

kaikalur force-pushed the prefilter branch 6 times, most recently from 0eea294 to 6b56668 Compare May 5, 2024 03:22

jaystarshot requested a review from ClarenceThreepwood May 5, 2024 03:28

elharo reviewed May 5, 2024

kaikalur force-pushed the prefilter branch 4 times, most recently from 0c67ec3 to 8cf9cdc Compare May 5, 2024 22:09

ClarenceThreepwood requested changes May 8, 2024

yingsu00 reviewed May 8, 2024

feilong-liu reviewed May 8, 2024

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java (outdated)

kaikalur force-pushed the prefilter branch from 8cf9cdc to a3e46b8 Compare May 9, 2024 14:39

kaikalur force-pushed the prefilter branch 2 times, most recently from 2db1b6e to 74cf802 Compare May 9, 2024 15:06

kaikalur force-pushed the prefilter branch from 74cf802 to 572b908 Compare May 9, 2024 15:41

kaikalur requested a review from ClarenceThreepwood May 9, 2024 15:44

jaystarshot reviewed May 9, 2024

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

feilong-liu approved these changes May 9, 2024

ClarenceThreepwood reviewed May 14, 2024

presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/JoinPrefilter.java

ClarenceThreepwood approved these changes May 14, 2024

kaikalur requested a review from pranjalssh May 15, 2024 00:03

pranjalssh approved these changes May 15, 2024

kaikalur merged commit 20f6640 into prestodb:master May 15, 2024
56 checks passed

wanglinsong mentioned this pull request Jun 25, 2024

Add release notes for 0.288 #23079

Merged

36 tasks

feilong-liu mentioned this pull request Oct 18, 2024

Extend join prefilter optimizer to use hash for filter #23858

Merged

6 tasks
