
Add data columns to join hash table payload #387

Closed
wants to merge 4 commits into from

Conversation

ienkovich
Copy link
Contributor

Analyzing profiles for TPC-H queries, I found that there are many cache misses when we access columns with scan index > 0, and this is the main performance issue for join queries. I tried to fight those LLC misses by storing the required columns in join hash tables as a payload. This essentially gives us a row representation for the joined tables, which can significantly reduce the number of LLC misses. When a join produces few matches, or none at all, this patch can hurt performance, because the stored payload is not used and therefore the increased hash table size and payload build time are not paid off.
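To illustrate the idea, here is a rough sketch contrasting the two probe paths. The types and names are hypothetical and much simpler than the actual JoinHashTable; the point is only that a fused payload turns two dependent memory accesses into one:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Baseline: the hash table stores only row ids. Every probe hit must then
// dereference back into the build-side column -- a second, cache-unfriendly
// memory access, which is where the LLC misses come from.
struct RowIdTable {
  std::unordered_map<int64_t, int32_t> key_to_row;  // join key -> row id
  const std::vector<int64_t>* column;               // build-side column
  int64_t probe(int64_t key) const {
    auto it = key_to_row.find(key);
    return it == key_to_row.end() ? -1 : (*column)[it->second];
  }
};

// Patched idea: the needed column value is fused into the entry as payload,
// so a probe hit returns it directly with no second access. The table is
// bigger, which only pays off when matches are actually consumed.
struct PayloadTable {
  std::unordered_map<int64_t, int64_t> key_to_payload;  // key -> column value
  int64_t probe(int64_t key) const {
    auto it = key_to_payload.find(key);
    return it == key_to_payload.end() ? -1 : it->second;
  }
};
```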

The patch adds payload support only for JoinHashTable, which is used by most of the TPC-H queries. Here are the performance results (SF=30):

| TPC-H | Ref (ms) | Patched (ms) | Gain |
|-------|---------:|-------------:|-----:|
| Q1    | 914      | 920          | 0.99347826 |
| Q5    | 4029     | 2992         | 1.34659091 |
| Q8    | 453      | 455          | 0.9956044  |
| Q7    | 4552     | 3029         | 1.50280621 |
| Q8    | 8124     | 5929         | 1.3702142  |
| Q9    | 21860    | 15748        | 1.38811278 |
| Q10   | 7342     | 5849         | 1.25525731 |
| Q11   | 818      | 830          | 0.98554217 |
| Q12   | 428      | 455          | 0.94065934 |
| Q18   | 24406    | 25404        | 0.96071485 |
| Q19   | 18278    | 9407         | 1.94302115 |

As expected, we get a significant gain when the stored payload is frequently used, and some performance loss due to the increased hash table size when the payload is not used much.

The current implementation still has some work to do:

  • Row ids can actually be completely removed from the table if all required columns are stored as a payload. For one-to-one tables this requires a new way to detect empty hash entries; if the payload has a NOT NULL column, a NULL value can be used for that. This should decrease the table size and hopefully reduce performance losses in the bad scenarios.
  • The hash table cache can be improved to better re-use tables in case of a partial table key match (e.g. we could re-use only a subset of columns from a cached table).

Still, I believe this patch is enough to start a discussion of the approach and to collect some performance data. I put the feature under a flag to simplify performance measurements. For now it is disabled by default, but I hope we can eventually enable it by default.

@asuhan
Copy link
Contributor

asuhan commented Sep 11, 2019

Hi @ienkovich, thank you for the patch. This is something we've thought about quite a bit in the past; there is a case to be made for it, and the benchmarks prove it. As a default policy, however, it has at least two drawbacks:

  1. It decreases the cache hit rate. In the current world, we can reuse the same hash table with different payloads. If we fuse the payload into the hash table, that's no longer true.
  2. In certain cases (joins on dictionary strings with different encodings) we build the hash table on the host and use it on the GPU. With this approach, we force the payload to be fetched on the host, which happens at a lower bandwidth than on the GPU.

I don't think we'd be fine with taking any regression, so the right approach would be to collect runtime statistics about the workload and decide whether doing this would be advantageous. In other words, we'd have this as an additional policy which kicks in selectively, based on the workload.
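Such a selective policy might look something like the toy sketch below. All names and the threshold are invented for illustration; the real statistics and decision logic would need to be designed as part of the follow-up work:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-join runtime statistics gathered from past executions.
struct JoinStats {
  uint64_t probes;   // total probe attempts against the hash table
  uint64_t matches;  // probes that found a match (payload actually used)
};

// Enable payload fusion only when a large enough fraction of probes hit,
// so the extra table size and build time are likely to pay off. The 0.5
// threshold is an arbitrary placeholder, not a tuned value.
bool should_fuse_payload(const JoinStats& s, double min_match_rate = 0.5) {
  if (s.probes == 0) return false;  // no history: keep the default (off)
  return static_cast<double>(s.matches) / s.probes >= min_match_rate;
}
```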

On the other hand, I see no obvious blocker to merging this behind the flag you've added, as long as all tests pass with the flag set. The smart policy that activates it can be follow-up work; there's no requirement, as far as I'm concerned, to bundle it with this change. We'll still need to review this carefully, since it's a rather big change, but as long as the changes are well isolated we should be able to take it. This has great potential, thanks!

@ienkovich
Copy link
Contributor Author

I understand the initial version is not good enough to become the default strategy. For now it can be a tool to find cases where the payload really helps. Hopefully we will come up with some heuristics to enable it in the most profitable cases.

I agree that cache reuse becomes an issue. This first implementation simply checks that the payload is the same. This should definitely be relaxed to allow better reuse (probably even when the required payload is only partly cached). But in any case, the payload is basically an additional index to speed up queries, and it has to come with additional maintenance cost.
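The relaxed cache check could look roughly like this. The key structure and `can_reuse` are hypothetical, simplified stand-ins for the real hash table cache key; the idea is only to allow a subset match on payload columns instead of requiring an exact match:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cache key: which join condition the table was built for,
// and which payload columns were fused into it.
struct CachedTableKey {
  int64_t join_key_hash;             // identifies the join condition
  std::vector<int> payload_col_ids;  // columns stored as payload
};

// A cached table is reusable if the join condition matches and every
// requested payload column is already present in it -- a subset match,
// rather than the exact-match check the first implementation does.
bool can_reuse(const CachedTableKey& cached, int64_t key_hash,
               const std::vector<int>& requested_cols) {
  if (cached.join_key_hash != key_hash) return false;
  return std::all_of(requested_cols.begin(), requested_cols.end(),
                     [&](int c) {
                       const auto& ids = cached.payload_col_ids;
                       return std::find(ids.begin(), ids.end(), c) != ids.end();
                     });
}
```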

I'm not sure a fused payload can be as profitable on GPU as on CPU, so I enabled it for CPU only.

@alexbaden
Copy link
Contributor

Hi @ienkovich

I am having some problems building this with ENABLE_CUDA=on:

```
[ 47%] Linking CXX static library libthrift_handler.a
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntime.cpp(124): error: identifier "fixed_width_int_decode_noinline" is undefined
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(40): error: no suitable conversion function from "const JoinColumn" to "int32_t" exists
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(41): error: no suitable user-defined conversion from "const JoinColumnTypeInfo" to "JoinColumn" exists
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(42): error: no suitable constructor exists to convert from "long" to "JoinColumnTypeInfo"
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(44): warning: integer conversion resulted in a change of sign
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(45): error: argument of type "int" is incompatible with parameter of type "const void *"
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(46): error: argument of type "int64_t" is incompatible with parameter of type "const void *"
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(46): error: too few arguments in function call
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(140): error: too few arguments in function call
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(230): error: identifier "fill_row_ids_gpu" is undefined
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntimeGpu.cu(274): error: identifier "fill_row_ids_bucketized_gpu" is undefined
[ 47%] Built target thrift_handler
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntime.cpp(838): error: identifier "payload" is undefined
/home/alexb/Projects/omniscidb/QueryEngine/HashJoinRuntime.cpp(838): error: identifier "payload_count" is undefined
12 errors detected in the compilation of "/tmp/tmpxft_000441b4_00000000-6_HashJoinRuntimeGpu.cpp1.ii".
make[2]: *** [QueryEngine/CMakeFiles/QueryEngine.dir/build.make:96: QueryEngine/HashJoinRuntimeGpu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 48%] Linking CXX static library libChunk.a
[ 48%] Built target Chunk
make[1]: *** [CMakeFiles/Makefile2:1379: QueryEngine/CMakeFiles/QueryEngine.dir/all] Error 2
make: *** [Makefile:163: all] Error 2
```

Were you able to build with CUDA enabled?

@ienkovich
Copy link
Contributor Author

The patch was tested on a machine without CUDA enabled. I'll fix the CUDA build.

@alexbaden
Copy link
Contributor

If you have the CUDA drivers but no GPUs, you should still be able to build -- we need to make some modifications to our unit test framework to pass on a CUDA build with no GPUs available. If you don't want to pollute your env, nvidia-docker is a good option -- we have pre-built containers here: https://hub.docker.com/r/omnisci/core-build-centos7-cuda10.0

@ienkovich
Copy link
Contributor Author

The CUDA build should be fixed now, but I found another issue: in some cases the wrong payload is loaded into a hash table. I haven't found the reason yet. Will look into it.

@ienkovich
Copy link
Contributor Author

Fixed the issue with the wrong payload being loaded.

@alexbaden
Copy link
Contributor

This looks good. I've merged all the commits from our internal repo over, and this repo is now fully up to date. Do you mind rebasing and fixing the conflicts? After that we can get it merged.

@ienkovich
Copy link
Contributor Author

Thanks for the review! I fixed the conflicts.

@asuhan
Copy link
Contributor

asuhan commented Sep 25, 2019

@alexbaden Let's check that the IR is unchanged (if the flag is not set) before merging.

@ienkovich
Copy link
Contributor Author

I compared the generated IRs (using the IR log channel) for the TPC-H queries and found no difference.

@intel-ai intel-ai closed this by deleting the head repository Apr 1, 2024