[DYOD] Partial Hash Index #2386

Closed
wants to merge 49 commits into from

bengelhaupt
Contributor

Description

This PR implements a partial hash index (PartialHashIndex) and integrates it into the index join.

This new index differs from the existing indexes because it is created over multiple chunks of a table rather than per chunk. It can be constructed for a set of chunks and later modified by adding further chunks or removing already indexed ones.
The existing index implementations all provide the methods lower_bound and upper_bound, which are used for range queries. The PartialHashIndex is based on an unsorted hash map and therefore cannot provide such functionality efficiently. Instead, it provides the functions equals and not_equals, which rely on hash lookups that take constant time on average.
This is achieved with a hash map that maps each value to the list of RowIDs at which the value occurs. The TableIndexIterator iterates over these lists of RowIDs.
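To make the data structure concrete, here is a minimal sketch of a value-to-positions hash index spanning several chunks. It is not the PR's implementation: the names SimpleHashIndex, insert_chunk, and remove_chunk are hypothetical, RowID is a simplified stand-in, and a chunk is reduced to a plain vector of values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a row position: chunk id plus offset inside the chunk.
struct RowID {
  uint32_t chunk_id;
  uint32_t chunk_offset;
};

// Sketch of a multi-chunk hash index. A "chunk" is simplified to a vector of values.
template <typename T>
class SimpleHashIndex {
 public:
  // Adds every value of one chunk to the index.
  void insert_chunk(uint32_t chunk_id, const std::vector<T>& values) {
    for (uint32_t offset = 0; offset < values.size(); ++offset) {
      _positions[values[offset]].push_back(RowID{chunk_id, offset});
    }
  }

  // Removes all positions that point into the given chunk.
  void remove_chunk(uint32_t chunk_id) {
    for (auto it = _positions.begin(); it != _positions.end();) {
      auto& rows = it->second;
      rows.erase(std::remove_if(rows.begin(), rows.end(),
                                [chunk_id](const RowID& row) { return row.chunk_id == chunk_id; }),
                 rows.end());
      // Drop keys whose position list became empty.
      it = rows.empty() ? _positions.erase(it) : std::next(it);
    }
  }

  // One hash lookup: expected constant time to find the position list.
  std::vector<RowID> equals(const T& value) const {
    const auto it = _positions.find(value);
    if (it == _positions.end()) return {};
    return it->second;
  }

  // Visits every other key, so the cost grows with the number of indexed rows.
  std::vector<RowID> not_equals(const T& value) const {
    std::vector<RowID> result;
    for (const auto& entry : _positions) {
      if (entry.first != value) result.insert(result.end(), entry.second.begin(), entry.second.end());
    }
    return result;
  }

 private:
  std::unordered_map<T, std::vector<RowID>> _positions;
};

// Usage: index two chunks, query, then remove one chunk again.
inline void example_usage() {
  SimpleHashIndex<std::string> index;
  index.insert_chunk(0, {"a", "b", "a"});
  index.insert_chunk(1, {"b", "c"});
  assert(index.equals("a").size() == 2);
  assert(index.not_equals("a").size() == 3);
  index.remove_chunk(0);
  assert(index.equals("a").empty());
  assert(index.equals("b").size() == 1);
}
```

The sketch returns copies of the position lists for simplicity. An iterator over the stored lists, as described above, would avoid those copies.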

We made the following design decisions:

  • The PartialHashIndex neither inherits from AbstractIndex nor shares a common superclass with, e.g., the GroupKeyIndex, because of an iterator inheritance problem. All indexes that currently inherit from AbstractIndex happen to share its iterator type std::vector<ChunkOffset>::const_iterator, which is what makes inheritance work. The iterator of the PartialHashIndex does not match that type, nor is it covariant with it. Resolving this would require making the iterator type of AbstractIndex polymorphic, which in turn would require passing around pointers or references to it. We felt that this change was too big to justify. Instead, AbstractTableIndex has been introduced as a superclass for the PartialHashIndex and future hash-based table indexes.
  • The templated class PartialHashIndexImpl has been introduced to allow type-safe use of the map keys (which have the data type of the segment/column). The actual PartialHashIndex class holds such an object and forwards the respective calls to it. When a PartialHashIndex is created from an empty list of chunks, this Impl cannot be created because the data type cannot be derived from the given chunks. A mock instance of BasePartialHashIndexImpl is then used until a chunk is added and an Impl can be constructed.
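The second design decision can be illustrated with a small sketch: a non-templated facade holds a base-class Impl that starts out as a no-op mock and is swapped for a typed Impl once the data type is known. All names here (Value, BaseImpl, TypedImpl, FacadeIndex) are hypothetical, and the sketch learns the type from the first inserted value rather than from a chunk, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical type-erased value covering two example column types.
using Value = std::variant<int, std::string>;

// Base class that doubles as the mock used while the data type is unknown.
class BaseImpl {
 public:
  virtual ~BaseImpl() = default;
  virtual void insert(const Value& /*value*/, std::size_t /*row*/) {}
  virtual std::vector<std::size_t> equals(const Value& /*value*/) const { return {}; }
};

// Typed implementation: the map key has the concrete column type T.
// std::get<T> throws std::bad_variant_access if a value of another type is passed.
template <typename T>
class TypedImpl : public BaseImpl {
 public:
  void insert(const Value& value, std::size_t row) override { _map[std::get<T>(value)].push_back(row); }

  std::vector<std::size_t> equals(const Value& value) const override {
    const auto it = _map.find(std::get<T>(value));
    if (it == _map.end()) return {};
    return it->second;
  }

 private:
  std::unordered_map<T, std::vector<std::size_t>> _map;
};

// Non-templated facade that forwards to whichever Impl it currently holds.
class FacadeIndex {
 public:
  FacadeIndex() : _impl(std::make_unique<BaseImpl>()) {}

  void insert(const Value& value, std::size_t row) {
    if (!_typed) {
      // The first value reveals the data type: replace the mock with a typed Impl.
      std::visit(
          [this](const auto& typed_value) {
            using T = std::decay_t<decltype(typed_value)>;
            _impl = std::make_unique<TypedImpl<T>>();
          },
          value);
      _typed = true;
    }
    _impl->insert(value, row);
  }

  std::vector<std::size_t> equals(const Value& value) const { return _impl->equals(value); }

  bool is_typed() const { return _typed; }

 private:
  std::unique_ptr<BaseImpl> _impl;
  bool _typed = false;
};
```

With this layout, callers only ever see the non-templated facade. An empty index answers every lookup with an empty result until the first insertion fixes the key type.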

TPC-H Benchmark Comparisons

Standalone: ./hyriseBenchmarkTPCH -s 1

Chunk Indexes: ./hyriseBenchmarkTPCH -s 1 --indexes

PartialHashIndex: ./hyriseBenchmarkTPCH -s 1 --table_indexes

Standalone vs Table Indexes

+----------++----------+---------+--------++----------+----------+--------+---------+
| Item     || Latency (ms/iter)  | Change || Throughput (iter/s) | Change | p-value |
|          ||      old |     new |        ||      old |      new |        |         |
+----------++----------+---------+--------++----------+----------+--------+---------+
| TPC-H 01 ||  30190.7 | 30894.3 |   +2%  ||     0.03 |     0.03 |   -2%  |       ˅ |
| TPC-H 02 ||    417.3 |   244.3 |  -41%  ||     2.40 |     4.09 |  +71%  |  0.0000 |
| TPC-H 03 ||   4113.8 |  2989.1 |  -27%  ||     0.24 |     0.33 |  +38%  |  0.0000 |
| TPC-H 04 ||   2806.5 |  1615.6 |  -42%  ||     0.36 |     0.62 |  +74%  |  0.0000 |
| TPC-H 05 ||   6010.2 |  4859.3 |  -19%  ||     0.17 |     0.21 |  +24%  |  0.0001 |
| TPC-H 06 ||   2102.3 |  1971.0 |   -6%  ||     0.48 |     0.51 |   +7%  |  0.0746 |
| TPC-H 07 ||   3389.4 |  1053.5 |  -69%  ||     0.30 |     0.95 | +222%  |  0.0000 |
| TPC-H 08 ||   2532.4 |  1066.4 |  -58%  ||     0.39 |     0.94 | +137%  |  0.0000 |
| TPC-H 09 ||   9663.7 | 10461.9 |   +8%  ||     0.10 |     0.10 |   -8%  |       ˅ |
| TPC-H 10 ||   3785.8 |  2719.3 |  -28%  ||     0.26 |     0.37 |  +39%  |  0.0000 |
| TPC-H 11 ||    386.6 |   225.9 |  -42%  ||     2.59 |     4.43 |  +71%  |  0.0000 |
| TPC-H 12 ||   3667.2 |  3605.2 |   -2%  ||     0.27 |     0.28 |   +2%  |  0.6429 |
| TPC-H 13 ||   6792.8 |  7052.3 |   +4%  ||     0.15 |     0.14 |   -4%  |       ˅ |
| TPC-H 14 ||   1904.5 |  1772.9 |   -7%  ||     0.53 |     0.56 |   +7%  |  0.1214 |
| TPC-H 15 ||   1766.7 |  1679.6 |   -5%  ||     0.57 |     0.60 |   +5%  |  0.1395 |
| TPC-H 16 ||   2077.2 |  2391.6 |  +15%  ||     0.48 |     0.42 |  -13%  |  0.0004 |
| TPC-H 17 ||   1616.9 |   226.0 |  -86%  ||     0.62 |     4.42 | +615%  |  0.0000 |
| TPC-H 18 ||   7141.9 |  6936.1 |   -3%  ||     0.14 |     0.14 |   +3%  |       ˅ |
| TPC-H 19 ||   1985.5 |   467.0 |  -76%  ||     0.50 |     2.14 | +325%  |  0.0000 |
| TPC-H 20 ||   2079.0 |   540.4 |  -74%  ||     0.48 |     1.85 | +285%  |  0.0000 |
| TPC-H 21 ||   9002.3 |  1139.9 |  -87%  ||     0.11 |     0.88 | +690%  |       ˅ |
| TPC-H 22 ||   1112.9 |  1143.6 |   +3%  ||     0.90 |     0.87 |   -3%  |  0.3735 |
+----------++----------+---------+--------++----------+----------+--------+---------+
| Sum      || 104545.4 | 85055.2 |  -19%  ||          |          |        |         |
| Geomean  ||          |         |        ||          |          |  +67%  |         |
+----------++----------+---------+--------++----------+----------+--------+---------+
|          || ˅ Insufficient number of runs for p-value calculation                 |
+----------++----------+---------+--------++----------+----------+--------+---------+

Chunk Indexes vs PartialHashIndex

+----------++-----------+---------+--------++----------+----------+-----------+---------+
| Item     || Latency (ms/iter)   | Change || Throughput (iter/s) |    Change | p-value |
|          ||       old |     new |        ||      old |      new |           |         |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
| TPC-H 01 ||   29441.3 | 30894.3 |   +5%  ||     0.03 |     0.03 |      -5%  |       ˅ |
| TPC-H 02 ||     351.9 |   244.3 |  -31%  ||     2.84 |     4.09 |     +44%  |  0.0000 |
| TPC-H 03 ||  153752.8 |  2989.1 |  -98%  ||     0.01 |     0.33 |   +5044%  |       ˅ |
| TPC-H 04 ||   60310.5 |  1615.6 |  -97%  ||     0.02 |     0.62 |   +3633%  |       ˅ |
| TPC-H 05 ||  230521.4 |  4859.3 |  -98%  ||     0.00 |     0.21 |   +4644%  |       ˅ |
| TPC-H 06 ||    2003.0 |  1971.0 |   -2%  ||     0.50 |     0.51 |      +2%  |  0.5395 |
| TPC-H 07 || 1218764.9 |  1053.5 | -100%  ||     0.00 |     0.95 | +115580%  |       ˅ |
| TPC-H 08 ||    6421.3 |  1066.4 |  -83%  ||     0.16 |     0.94 |    +502%  |  0.0000 |
| TPC-H 09 ||   66230.5 | 10461.9 |  -84%  ||     0.02 |     0.10 |    +533%  |       ˅ |
| TPC-H 10 ||   58751.8 |  2719.3 |  -95%  ||     0.02 |     0.37 |   +2061%  |       ˅ |
| TPC-H 11 ||     237.5 |   225.9 |   -5%  ||     4.21 |     4.43 |      +5%  |  0.0096 |
| TPC-H 12 ||    6574.0 |  3605.2 |  -45%  ||     0.15 |     0.28 |     +82%  |  0.0000 |
| TPC-H 13 ||    6703.1 |  7052.3 |   +5%  ||     0.15 |     0.14 |      -5%  |       ˅ |
| TPC-H 14 ||    1744.6 |  1772.9 |   +2%  ||     0.57 |     0.56 |      -2%  |  0.6070 |
| TPC-H 15 ||    1676.2 |  1679.6 |   +0%  ||     0.60 |     0.60 |      -0%  |  0.9349 |
| TPC-H 16 ||    6535.1 |  2391.6 |  -63%  ||     0.15 |     0.42 |    +173%  |  0.0000 |
| TPC-H 17 ||     511.5 |   226.0 |  -56%  ||     1.95 |     4.42 |    +126%  |  0.0000 |
| TPC-H 18 ||    6506.0 |  6936.1 |   +7%  ||     0.15 |     0.14 |      -6%  |       ˅ |
| TPC-H 19 ||    1045.4 |   467.0 |  -55%  ||     0.96 |     2.14 |    +124%  |  0.0000 |
| TPC-H 20 ||    8259.3 |   540.4 |  -93%  ||     0.12 |     1.85 |   +1428%  |       ˅ |
| TPC-H 21 ||  802745.0 |  1139.9 | -100%  ||     0.00 |     0.88 |  +70321%  |       ˅ |
| TPC-H 22 ||    1123.8 |  1143.6 |   +2%  ||     0.89 |     0.87 |      -2%  |  0.5893 |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
| Sum      || 2670210.7 | 85055.2 |  -97%  ||          |          |           |         |
| Geomean  ||           |         |        ||          |          |    +461%  |         |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
|          || ˅ Insufficient number of runs for p-value calculation                     |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
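For readers who want to check the "Change" columns, the following sketch recomputes them under the assumption that each entry is the relative change from old to new, rounded to a whole percent. The helper name percent_change is hypothetical, and the benchmark comparison script may compute its values from unrounded measurements.

```cpp
#include <cassert>
#include <cmath>

// Relative change from old to new in percent, rounded to the nearest integer.
inline long percent_change(double old_value, double new_value) {
  return std::lround((new_value / old_value - 1.0) * 100.0);
}

// Checks against the latency columns of the two tables above.
inline void check_sum_rows() {
  assert(percent_change(104545.4, 85055.2) == -19);   // Sum, standalone vs table indexes
  assert(percent_change(2670210.7, 85055.2) == -97);  // Sum, chunk indexes vs PartialHashIndex
  assert(percent_change(1616.9, 226.0) == -86);       // TPC-H 17, standalone vs table indexes
}
```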

Query Plan Example: TPC-H Q7

Standalone

[Figure: physical query plan for TPC-H Q7, standalone (TPC-H_07-PQP)]

Chunk Indexes

[Figure: physical query plan for TPC-H Q7, chunk indexes (TPC-H_07-PQP (1))]

PartialHashIndex

[Figure: physical query plan for TPC-H Q7, PartialHashIndex (TPC-H_07-PQP (2))]

bengelhaupt and others added 30 commits July 21, 2021 01:42
Co-authored-by: Jasperhino <33397387+Jasperhino@users.noreply.github.com>
@mweisgut
Collaborator

Thank you for your contribution. In addition to the above comments, please make sure that the CI run finishes successfully.

@bengelhaupt
Contributor Author

Thanks a lot for the useful feedback on this PR!
One small remark: if this is to be merged, we think the Wiki also needs to be updated to reflect the changes to the indexes.

@Bouncner
Collaborator

Concerning the TPC-H benchmarks: on which columns did you create the indexes?

@vxrahn
Contributor

vxrahn commented Aug 25, 2021

> Concerning the TPC-H benchmarks: on which columns did you create the indexes?

We used -- similar to the creation of chunk indexes -- the columns defined in TPCHTableGenerator::_indexes_by_table.

@mweisgut
Collaborator

Since this work is currently not continued, I'm going to close this PR for now.

@mweisgut mweisgut closed this Sep 21, 2021
@mweisgut mweisgut mentioned this pull request Apr 8, 2022
27 tasks
@mweisgut mweisgut mentioned this pull request Nov 24, 2022
mweisgut added a commit that referenced this pull request Feb 16, 2023
This PR implements a partial hash index (`PartialHashIndex`), i.e., a
single-column index structure that can be created for one or more
segments of the corresponding column without the need to index all
segments. This PR is based on the work started by @bengelhaupt,
@Jasperhino, and @vxrahn in #2386.

This `PartialHashIndex` index differs from the already implemented
indexes because it is created over multiple chunks of a table instead of
being on a per-chunk basis. It can be constructed for a set of chunks
and can later be modified by adding additional chunks or removing
already indexed chunks.
The already existing index implementations also share the property of
having the methods `lower_bound` and `upper_bound` used for range
queries. The `PartialHashIndex` is based on an unsorted hash map and,
therefore, cannot provide such functionality efficiently. Instead, it
provides the functions `equals` and `not_equals`, which allow lookups in
constant time.
This is achieved by using a hash map that maps a given value to a vector
of positions (i.e., `RowID`s) where this value is located. The
`FlatMapIterator` allows iterating over these lists of `RowID`s.

Co-authored-by: Tobias Jordan <tobias.jordan@student.hpi.uni-potsdam.de>
Co-authored-by: Ben-Noah Engelhaupt <code@bengelhaupt.com>
Co-authored-by: Vincent Xeno Rahn <vinni@akv.i24.cc>
Co-authored-by: Jasperhino <blum.jasper@gmail.com>
@tjjordan tjjordan mentioned this pull request Mar 21, 2023
16 tasks
Labels
FullCI Run all CI tests (slow, but required for merge)

6 participants