Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non equality predicates in outer join (v3) #5088

Conversation

losipiuk
Copy link
Contributor

Supersedes #4950

@martint
Copy link
Contributor

martint commented Apr 26, 2016

Have you benchmarked this? Without compiling the join expressions, it may be a big performance hit for queries that push the filter into the join.

@losipiuk
Copy link
Contributor Author

No - I did not. I expect that performance will not be great. Though the penalty will only apply to queries which would not execute at all before this change.
Before adding support for join-filter calling to generated classes I wanted to pass this patch through you to validate overall approach.

@losipiuk
Copy link
Contributor Author

losipiuk commented May 5, 2016

@martint. Friendly ping. Please take a look at interpreted versions so we can proceed with making changes to class generators.

@martint
Copy link
Contributor

martint commented May 9, 2016

@losipiuk, the general approach is fine. I'm a bit concerned about potential performance projects. For instance, in InMemoryJoinHash.getNextJoinPositionFrom (https://github.com/prestodb/presto/pull/5088/files#diff-cc58d9f8668f9f278dae5330c9e4d7a6R137), if that code path is hit from executions that have a filter function and ones that don't, the branch prediction may not work well and affect performance for regular joins.

@losipiuk
Copy link
Contributor Author

@martint
For this exact case we could possibly have two distinct LookupSource implementations:
InMemoryHashJoin and FilteringInMemoryHashJoin. And select implementation in PagesIndex#createLookupSource(). There is a chance that it would behave better than branch in InMemoryHashJoin.

It would be nice to benchmark that. This seems a bit hard to me though without actually running query set on real cluster.

@losipiuk losipiuk force-pushed the non-equality-predicates-in-outer-join_v3 branch from 8fa487a to 6460857 Compare May 17, 2016 09:52
@losipiuk
Copy link
Contributor Author

@martint I separated the implementations. (commit 195a56a).

This patch uses inheritance.
Other approach would be to have single implementation of InMemoryJoinHash but inject it the implementation of interface which would be used in place of getNextJoinPositionFrom.
IMO neither is very nice. Let me know if you would rather see composition here.

@martint martint assigned martint and unassigned losipiuk May 20, 2016
Add support for non-equi conjuncts in join condition for outer joins.
Supported case is when conjunct is resolvable as a whole using symbols
of inner join table. In such case conjunct is pushed down as filter to
inner source relation.
Previous message was misleading because classification, if conjunct is
supported or not, is more complext that checking if it is equality
comparison or not.

There are cases when specific equality conjuncts are not supported
and cases when different types of conjuncts are supported.
This patch adds more generic support for any conjuncts in join
condition in OUTER joins.

Limitations:
 - no support in generated ProbeSource/PagesHashStrategy
 - there is requirement that at least one conjunct is equality based
@losipiuk losipiuk force-pushed the non-equality-predicates-in-outer-join_v3 branch from 6460857 to 1a57590 Compare May 23, 2016 10:04
@losipiuk
Copy link
Contributor Author

I rebased this onto master.

@martint
Copy link
Contributor

martint commented May 23, 2016

@losipiuk, what parts changed compared to the last version? Are the changes mostly in "Separate filtering/non-filtering LookupSource implementations"?

@martint
Copy link
Contributor

martint commented May 23, 2016

Can you remove the hotspot_pidXXXX.log files from this commit: edfdfa7? Github can't handle them and it refuses to show the diff.

@losipiuk losipiuk force-pushed the non-equality-predicates-in-outer-join_v3 branch from 1a57590 to 34675f9 Compare May 23, 2016 21:32
@losipiuk
Copy link
Contributor Author

@martint. Apart from simple conflict resolutions the changes are in "Separate filtering/non-filtering LookupSource implementations" commit. It addresses your concern of branching within InMemoryJoinHash. This commit could probably be merged with parent. I just left it as separate for easier reviewing.

I removed the hotspot logs. Sorry about that :).

@martint
Copy link
Contributor

martint commented May 24, 2016

Looking at this a bit closer, we may not actually need to split the two implementations. In JoinCompiler, we load the InMemoryJoinHash in an isolated classloader for each query, so profile pollution shouldn't happen.

In any case, if we weren't doing that, this implementation wouldn't help much -- the "loop" is not part of the LookupSource, so the call would get virtualized for multiple implementations of that interface.

Sorry about the churn.

@losipiuk
Copy link
Contributor Author

losipiuk commented May 24, 2016

@martint thanks for clarification. Please let me know if I am getting this right.

Was IsolatedClass mechanism for InMemoryJoinHash in JoinCompiler originally introduced to avoid polymorphic virtual calls to methods of LookupSource?
Does it play well with fact that later in the code path the LookupSource is wrapped (or not) with OuterLookupSource/PartitionedLookupSource...?

Or is the latter problem solved with some other classloader mangling?

Or the IsolateClass mechanism has nothing to do with virtual calls?

Sorry for possibly dumb questions but I wanted to get better understanding on that.

@losipiuk losipiuk force-pushed the non-equality-predicates-in-outer-join_v3 branch from 34675f9 to 6d274df Compare May 24, 2016 10:19
@losipiuk
Copy link
Contributor Author

I removed the commit splitting the implementations. Is that good to be merged?

@martint
Copy link
Contributor

martint commented May 24, 2016

Yes, the IsolatedClass mechanism is specifically to avoid virtual calls and profile pollution. It's a simple way to generate a class by copying the bytecodes from an existing class.

Does it play well with fact that later in the code path the LookupSource is wrapped (or not) with OuterLookupSource/PartitionedLookupSource...?

It doesn't matter. The entire join operator is generated in the same way, so the loops should be specialized for each query.

@losipiuk
Copy link
Contributor Author

Cool - thanks :)

@losipiuk
Copy link
Contributor Author

Btw @martint. We discovered a bug in presto (I think introduced by me - but I did not check git log ;) ).

Queries like this one:

 select * from test1 a left join test2 b on a.col1=b.col1 and b.col2=b.col2

should be failing.
But are executed currently - producing wrong results.
This patch fixes that.

@martint
Copy link
Contributor

martint commented May 24, 2016

Merged, thanks!

@martint martint closed this May 24, 2016
@losipiuk losipiuk deleted the non-equality-predicates-in-outer-join_v3 branch June 8, 2016 21:12
@buremba
Copy link
Contributor

buremba commented Aug 2, 2016

Is there any plan to improve support for non-equity predicates in outer join? I tested it and the performance is actually worse than equity joins.

I simply put a false expression in JOIN clause and it took x2 time to execute the query.

> select count(*) from tweet left join (select 1 as id) mapping on (false and tweet.id = mapping.id);

Query 20160802_133540_00030_kwbyd, FINISHED, 1 node
Splits: 131 total, 131 done (100.00%)
0:29 [39.9M rows, 173MB] [1.36M rows/s, 5.88MB/s]
> select count(*) from tweet left join (select 1 as id) mapping on (tweet.id = mapping.id);

Query 20160802_133406_00027_kwbyd, FINISHED, 1 node
Splits: 131 total, 131 done (100.00%)
0:15 [39.9M rows, 316MB] [2.61M rows/s, 20.7MB/s]

Also, even though the right side of the JOIN only has one row, JOIN has a big performance impact on the query:

> select count(*) from tweet where id is not null;

Query 20160802_134119_00035_kwbyd, FINISHED, 1 node
Splits: 128 total, 128 done (100.00%)
0:02 [39.9M rows, 316MB] [20M rows/s, 159MB/s]

@kbajda
Copy link

kbajda commented Aug 2, 2016

Yes, our next step is to optimize performance of such predicates. Will create issues/PRs soon.
CC: @sopel39 @losipiuk

@buremba
Copy link
Contributor

buremba commented Aug 2, 2016

Thanks, looking forward to it.

@losipiuk
Copy link
Contributor Author

losipiuk commented Aug 8, 2016

@buremba: here is the PR #5810. It fixes the performance for case where there is at least one equality based conjunct in join codition.

@buremba
Copy link
Contributor

buremba commented Aug 8, 2016

@losipiuk thanks, I'll look into it. Hope that it can be merged soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants