Add data columns to join hash table payload #387
Conversation
Hi @ienkovich, thank you for the patch. This is something we have thought about quite a bit in the past; there is a case to be made for it, and the benchmarks prove it. As a default policy, though, it has at least two drawbacks:
I don't think we'd be fine with taking any regression, so the right approach would be to collect runtime statistics about the workload and decide whether doing this would be advantageous. In other words, we'd have this as an additional policy which kicks in selectively, based on workload. On the other hand, I see no obvious blocker to merging this behind the flag you've added, as long as all tests pass with the flag set. The smart policy which activates it can be follow-up work; there's no requirement, as far as I'm concerned, to bundle it with this work. We'll still need to review this carefully since it's a rather big change, but as long as the changes are well isolated we should be able to take it. This has great potential, thanks!
I understand the initial version is not good enough to become a default strategy. For now it can be a tool to find cases where the payload really helps. Hopefully we will come up with some heuristics to enable it in the most profitable cases. I agree that cache reuse becomes an issue. This first implementation simply checks that the payload is the same. This should definitely be relaxed to allow better reuse (probably even when the required payload is only partly cached). In any case, the payload is basically an additional index to speed up queries, and it has to come with additional maintenance cost. I'm not sure the fused payload can be as profitable on GPU as on CPU, so I enabled it for CPU only.
Hi @ienkovich I am having some problems building this with
Were you able to build with CUDA enabled?
The patch was tested on a machine without CUDA enabled. I'll fix the CUDA build.
If you have the CUDA drivers but no GPUs, you should still be able to build -- we need to make some modifications in our unit test framework to pass on a CUDA build with no GPUs available. If you don't want to pollute your env,
The CUDA build should be fixed now, but I found another issue: in some cases the wrong payload is loaded into a hash table. I haven't found the reason yet. Will look into it.
Fixed the issue with the wrong payload being loaded.
This looks good. I've merged all the commits from our internal repo over, and this repo is now fully up to date. Do you mind rebasing and fixing the conflicts? After that we can get it merged.
Thanks for the review! I fixed the conflicts.
@alexbaden Let's check that the IR is unchanged (if the flag is not set) before merging.
I compared the generated IRs (using the IR log channel) for the TPC-H queries and found no difference.
Analyzing profiles for TPC-H queries, I found that there are many cache misses when we access columns with scan index > 0, and this is the main performance issue for join queries. I tried to fight those LLC misses by storing the required columns in join hash tables as a payload. This basically gives us a row representation for joined tables, which can significantly reduce the number of LLC misses. When a join produces few or no matches, this patch can hurt performance, because the stored payload is not used and so the increased hash table size and payload build time are not paid off.
The patch adds payload support only for JoinHashTable, which is used by most TPC-H queries. Here are the performance results (SF=30):
As expected, we get significant gains when the stored payload is frequently used, and some performance loss, due to the increased hash table size, when the payload is not used much.
The current implementation still has some work to do:
Still, I believe this patch is enough to start a discussion of the approach and to collect some performance data. I put the feature under a flag to simplify performance measurements. For now it is disabled by default, but I hope we can eventually enable it by default.