[DYOD] Partial Hash Index #2386

bengelhaupt · 2021-07-21T00:03:02Z

Description

This PR implements a partial hash index (PartialHashIndex) and integrates it into the index join.

Unlike the indexes already implemented, this index is created over multiple chunks of a table rather than per chunk. It can be constructed for a set of chunks and later modified by adding further chunks or removing already indexed ones.
The existing index implementations all provide the methods lower_bound and upper_bound, which are used for range queries. The PartialHashIndex is based on an unsorted hash map and therefore cannot provide such functionality efficiently. Instead, it provides the functions equals and not_equals; an equals lookup takes constant time on average.
Internally, a hash map maps each value to the list of RowIDs at which the value occurs. The TableIndexIterator iterates over these lists of RowIDs.
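To illustrate the idea, here is a minimal sketch (C++17) of a hash map from values to RowID lists that spans several chunks. It is not the code from this PR: ToyPartialHashIndex, insert_chunk, remove_chunk, and the simplified RowID struct are hypothetical names chosen for the example, and a chunk is reduced to a plain vector of ints.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real types, for illustration only.
using ChunkID = std::uint32_t;
using ChunkOffset = std::uint32_t;

struct RowID {
  ChunkID chunk_id;
  ChunkOffset chunk_offset;
};

// Toy table index over one int column: value -> positions of that value.
class ToyPartialHashIndex {
 public:
  // Index all values of one chunk (a chunk is just a vector of ints here).
  void insert_chunk(ChunkID chunk_id, const std::vector<int>& values) {
    for (ChunkOffset offset = 0; offset < values.size(); ++offset) {
      _positions[values[offset]].push_back(RowID{chunk_id, offset});
    }
  }

  // Drop every entry that belongs to the given chunk; remove emptied buckets.
  void remove_chunk(ChunkID chunk_id) {
    for (auto it = _positions.begin(); it != _positions.end();) {
      auto& rows = it->second;
      rows.erase(std::remove_if(rows.begin(), rows.end(),
                                [&](const RowID& row) { return row.chunk_id == chunk_id; }),
                 rows.end());
      it = rows.empty() ? _positions.erase(it) : std::next(it);
    }
  }

  // Finds the bucket in O(1) on average, then copies the matching positions.
  std::vector<RowID> equals(int value) const {
    const auto it = _positions.find(value);
    return it == _positions.end() ? std::vector<RowID>{} : it->second;
  }

  // Collects the positions of all other values; this visits every bucket.
  std::vector<RowID> not_equals(int value) const {
    std::vector<RowID> result;
    for (const auto& [key, rows] : _positions) {
      if (key != value) result.insert(result.end(), rows.begin(), rows.end());
    }
    return result;
  }

 private:
  std::unordered_map<int, std::vector<RowID>> _positions;
};

// Small usage example: index two chunks, then remove one of them again.
inline void example_usage() {
  ToyPartialHashIndex index;
  index.insert_chunk(0, {4, 7, 4});
  index.insert_chunk(2, {7, 9});
  assert(index.equals(4).size() == 2);
  index.remove_chunk(0);
  assert(index.equals(4).empty());
}
```

The sketch copies result vectors for simplicity; an iterator-based interface, as described above, avoids those copies.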

We made the following design decisions:

The PartialHashIndex neither inherits from AbstractIndex nor shares a common superclass with, e.g., the GroupKeyIndex, because of an iterator inheritance problem. All indexes that currently inherit from AbstractIndex happen to share its iterator type std::vector<ChunkOffset>::const_iterator, which is what makes inheritance work. The iterator of the PartialHashIndex does not match that type, nor is it covariant with it. To resolve this, the iterator type of AbstractIndex would have to be made polymorphic, which in turn would require passing pointers or references to it around. We considered that change too big to justify. Instead, AbstractTableIndex has been introduced as a superclass for the PartialHashIndex and future hash-based table indexes.
The templated class PartialHashIndexImpl has been introduced to allow type-safe use of the map keys (which have the data type of the segment/column). The actual PartialHashIndex class holds such an object and forwards the respective calls to it. When a PartialHashIndex is created with an empty list of chunks, this Impl cannot be created because the data type cannot be derived from the given chunks. A mock instance of BasePartialHashIndexImpl is then used until a chunk is added and an Impl can be constructed.
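The forwarding design with a placeholder implementation can be sketched roughly as follows (C++17). This is an assumption-laden illustration, not the PR's code: BaseImpl, TypedImpl, ToyIndex, and the int-or-string Value type are hypothetical, and the typed implementation is chosen when the first value arrives rather than when the first chunk is added.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical simplifications: a value is an int or a string,
// and a position is a plain index instead of a RowID.
using Value = std::variant<int, std::string>;
using Position = std::size_t;

// Type-erased interface. Its default behavior (ignore inserts, find nothing)
// plays the role of the mock used while no data type is known yet.
class BaseImpl {
 public:
  virtual ~BaseImpl() = default;
  virtual void insert(const Value&, Position) {}
  virtual std::vector<Position> equals(const Value&) const { return {}; }
};

// Typed implementation: the map key has the column's data type T.
// std::get<T> throws std::bad_variant_access if a value of another type is passed.
template <typename T>
class TypedImpl : public BaseImpl {
 public:
  void insert(const Value& value, Position position) override {
    _map[std::get<T>(value)].push_back(position);
  }

  std::vector<Position> equals(const Value& value) const override {
    const auto it = _map.find(std::get<T>(value));
    return it == _map.end() ? std::vector<Position>{} : it->second;
  }

 private:
  std::unordered_map<T, std::vector<Position>> _map;
};

// Non-templated outer class that forwards all calls to the current Impl.
class ToyIndex {
 public:
  ToyIndex() : _impl(std::make_unique<BaseImpl>()) {}

  void insert(const Value& value, Position position) {
    if (!_typed) {
      // First value seen: replace the mock with an Impl of the matching type.
      std::visit(
          [&](const auto& typed_value) {
            using T = std::decay_t<decltype(typed_value)>;
            _impl = std::make_unique<TypedImpl<T>>();
          },
          value);
      _typed = true;
    }
    _impl->insert(value, position);
  }

  std::vector<Position> equals(const Value& value) const { return _impl->equals(value); }

 private:
  std::unique_ptr<BaseImpl> _impl;
  bool _typed = false;
};

// Usage example: lookups on an empty index are answered by the mock.
inline void example_usage() {
  ToyIndex index;
  assert(index.equals(Value{1}).empty());
  index.insert(Value{7}, 0);
  assert(index.equals(Value{7}).size() == 1);
}
```

Keeping the outer class non-templated means callers such as a join operator do not need to know the column's data type; only the Impl does.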

TPC-H Benchmark Comparisons

Standalone: ./hyriseBenchmarkTPCH -s 1

Chunk Indexes: ./hyriseBenchmarkTPCH -s 1 --indexes

PartialHashIndex: ./hyriseBenchmarkTPCH -s 1 --table_indexes

Standalone vs Table Indexes

+----------++----------+---------+--------++----------+----------+--------+---------+
| Item     || Latency (ms/iter)  | Change || Throughput (iter/s) | Change | p-value |
|          ||      old |     new |        ||      old |      new |        |         |
+----------++----------+---------+--------++----------+----------+--------+---------+
| TPC-H 01 ||  30190.7 | 30894.3 |   +2%  ||     0.03 |     0.03 |   -2%  |       ˅ |
| TPC-H 02 ||    417.3 |   244.3 |  -41%  ||     2.40 |     4.09 |  +71%  |  0.0000 |
| TPC-H 03 ||   4113.8 |  2989.1 |  -27%  ||     0.24 |     0.33 |  +38%  |  0.0000 |
| TPC-H 04 ||   2806.5 |  1615.6 |  -42%  ||     0.36 |     0.62 |  +74%  |  0.0000 |
| TPC-H 05 ||   6010.2 |  4859.3 |  -19%  ||     0.17 |     0.21 |  +24%  |  0.0001 |
| TPC-H 06 ||   2102.3 |  1971.0 |   -6%  ||     0.48 |     0.51 |   +7%  |  0.0746 |
| TPC-H 07 ||   3389.4 |  1053.5 |  -69%  ||     0.30 |     0.95 | +222%  |  0.0000 |
| TPC-H 08 ||   2532.4 |  1066.4 |  -58%  ||     0.39 |     0.94 | +137%  |  0.0000 |
| TPC-H 09 ||   9663.7 | 10461.9 |   +8%  ||     0.10 |     0.10 |   -8%  |       ˅ |
| TPC-H 10 ||   3785.8 |  2719.3 |  -28%  ||     0.26 |     0.37 |  +39%  |  0.0000 |
| TPC-H 11 ||    386.6 |   225.9 |  -42%  ||     2.59 |     4.43 |  +71%  |  0.0000 |
| TPC-H 12 ||   3667.2 |  3605.2 |   -2%  ||     0.27 |     0.28 |   +2%  |  0.6429 |
| TPC-H 13 ||   6792.8 |  7052.3 |   +4%  ||     0.15 |     0.14 |   -4%  |       ˅ |
| TPC-H 14 ||   1904.5 |  1772.9 |   -7%  ||     0.53 |     0.56 |   +7%  |  0.1214 |
| TPC-H 15 ||   1766.7 |  1679.6 |   -5%  ||     0.57 |     0.60 |   +5%  |  0.1395 |
| TPC-H 16 ||   2077.2 |  2391.6 |  +15%  ||     0.48 |     0.42 |  -13%  |  0.0004 |
| TPC-H 17 ||   1616.9 |   226.0 |  -86%  ||     0.62 |     4.42 | +615%  |  0.0000 |
| TPC-H 18 ||   7141.9 |  6936.1 |   -3%  ||     0.14 |     0.14 |   +3%  |       ˅ |
| TPC-H 19 ||   1985.5 |   467.0 |  -76%  ||     0.50 |     2.14 | +325%  |  0.0000 |
| TPC-H 20 ||   2079.0 |   540.4 |  -74%  ||     0.48 |     1.85 | +285%  |  0.0000 |
| TPC-H 21 ||   9002.3 |  1139.9 |  -87%  ||     0.11 |     0.88 | +690%  |       ˅ |
| TPC-H 22 ||   1112.9 |  1143.6 |   +3%  ||     0.90 |     0.87 |   -3%  |  0.3735 |
+----------++----------+---------+--------++----------+----------+--------+---------+
| Sum      || 104545.4 | 85055.2 |  -19%  ||          |          |        |         |
| Geomean  ||          |         |        ||          |          |  +67%  |         |
+----------++----------+---------+--------++----------+----------+--------+---------+
|          || ˅ Insufficient number of runs for p-value calculation                 |
+----------++----------+---------+--------++----------+----------+--------+---------+

Chunk Indexes vs PartialHashIndex

+----------++-----------+---------+--------++----------+----------+-----------+---------+
| Item     || Latency (ms/iter)   | Change || Throughput (iter/s) |    Change | p-value |
|          ||       old |     new |        ||      old |      new |           |         |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
| TPC-H 01 ||   29441.3 | 30894.3 |   +5%  ||     0.03 |     0.03 |      -5%  |       ˅ |
| TPC-H 02 ||     351.9 |   244.3 |  -31%  ||     2.84 |     4.09 |     +44%  |  0.0000 |
| TPC-H 03 ||  153752.8 |  2989.1 |  -98%  ||     0.01 |     0.33 |   +5044%  |       ˅ |
| TPC-H 04 ||   60310.5 |  1615.6 |  -97%  ||     0.02 |     0.62 |   +3633%  |       ˅ |
| TPC-H 05 ||  230521.4 |  4859.3 |  -98%  ||     0.00 |     0.21 |   +4644%  |       ˅ |
| TPC-H 06 ||    2003.0 |  1971.0 |   -2%  ||     0.50 |     0.51 |      +2%  |  0.5395 |
| TPC-H 07 || 1218764.9 |  1053.5 | -100%  ||     0.00 |     0.95 | +115580%  |       ˅ |
| TPC-H 08 ||    6421.3 |  1066.4 |  -83%  ||     0.16 |     0.94 |    +502%  |  0.0000 |
| TPC-H 09 ||   66230.5 | 10461.9 |  -84%  ||     0.02 |     0.10 |    +533%  |       ˅ |
| TPC-H 10 ||   58751.8 |  2719.3 |  -95%  ||     0.02 |     0.37 |   +2061%  |       ˅ |
| TPC-H 11 ||     237.5 |   225.9 |   -5%  ||     4.21 |     4.43 |      +5%  |  0.0096 |
| TPC-H 12 ||    6574.0 |  3605.2 |  -45%  ||     0.15 |     0.28 |     +82%  |  0.0000 |
| TPC-H 13 ||    6703.1 |  7052.3 |   +5%  ||     0.15 |     0.14 |      -5%  |       ˅ |
| TPC-H 14 ||    1744.6 |  1772.9 |   +2%  ||     0.57 |     0.56 |      -2%  |  0.6070 |
| TPC-H 15 ||    1676.2 |  1679.6 |   +0%  ||     0.60 |     0.60 |      -0%  |  0.9349 |
| TPC-H 16 ||    6535.1 |  2391.6 |  -63%  ||     0.15 |     0.42 |    +173%  |  0.0000 |
| TPC-H 17 ||     511.5 |   226.0 |  -56%  ||     1.95 |     4.42 |    +126%  |  0.0000 |
| TPC-H 18 ||    6506.0 |  6936.1 |   +7%  ||     0.15 |     0.14 |      -6%  |       ˅ |
| TPC-H 19 ||    1045.4 |   467.0 |  -55%  ||     0.96 |     2.14 |    +124%  |  0.0000 |
| TPC-H 20 ||    8259.3 |   540.4 |  -93%  ||     0.12 |     1.85 |   +1428%  |       ˅ |
| TPC-H 21 ||  802745.0 |  1139.9 | -100%  ||     0.00 |     0.88 |  +70321%  |       ˅ |
| TPC-H 22 ||    1123.8 |  1143.6 |   +2%  ||     0.89 |     0.87 |      -2%  |  0.5893 |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
| Sum      || 2670210.7 | 85055.2 |  -97%  ||          |          |           |         |
| Geomean  ||           |         |        ||          |          |    +461%  |         |
+----------++-----------+---------+--------++----------+----------+-----------+---------+
|          || ˅ Insufficient number of runs for p-value calculation                     |
+----------++-----------+---------+--------++----------+----------+-----------+---------+

Query Plan Example: TPC-H Q7

[Figures: query plans for Standalone, Chunk Indexes, and PartialHashIndex]

src/lib/storage/index/partial_hash/partial_hash_index_impl.hpp

src/lib/storage/index/partial_hash/partial_hash_index_impl.cpp

src/lib/storage/table.hpp

src/lib/storage/index/segment_index_type.hpp

src/test/lib/operators/join_index_test.cpp

src/test/lib/operators/join_test_runner.cpp

mweisgut · 2021-08-19T14:29:04Z

Thank you for your contribution. In addition to the above comments, please make sure that the CI run finishes successfully.

bengelhaupt · 2021-08-24T10:33:42Z

Thanks a lot for the useful feedback on this PR!
A little remark: If this is to be merged, we think the Wiki also has to be updated to reflect the changes for the indexes.

Bouncner · 2021-08-25T13:36:29Z

Concerning the TPC-H benchmarks: on which columns did you create the indexes?

vxrahn · 2021-08-25T13:59:27Z

> Concerning the TPC-H benchmarks: on which columns did you create the indexes?

As with the creation of chunk indexes, we used the columns defined in TPCHTableGenerator::_indexes_by_table.

src/benchmarklib/abstract_table_generator.cpp

src/lib/storage/index/abstract_chunk_index.cpp

src/lib/storage/index/abstract_table_index.cpp

src/lib/storage/index/chunk_index_type.hpp

src/lib/storage/index/partial_hash/partial_hash_index.cpp

src/lib/storage/index/partial_hash/partial_hash_index.hpp

src/lib/operators/join_index.cpp

src/lib/storage/index/partial_hash/partial_hash_index_impl.hpp

mweisgut · 2021-09-21T11:39:09Z

Since this work is not being continued at the moment, I'm going to close this PR for now.

@bengelhaupt

This PR implements a partial hash index (`PartialHashIndex`), i.e., a single-column index structure that can be created for one or more segments of the corresponding column without the need to index all segments. This PR is based on the work started by @bengelhaupt, @Jasperhino, and @vxrahn in #2386.

The `PartialHashIndex` differs from the already implemented indexes because it is created over multiple chunks of a table instead of on a per-chunk basis. It can be constructed for a set of chunks and can later be modified by adding additional chunks or removing already indexed chunks. The existing index implementations also share the methods `lower_bound` and `upper_bound`, which are used for range queries. The `PartialHashIndex` is based on an unsorted hash map and, therefore, cannot provide such functionality efficiently. Instead, it provides the functions `equals` and `not_equals`; an `equals` lookup takes constant time on average. This is achieved by using a hash map that maps a given value to a vector of positions (i.e., `RowID`s) where this value is located. The `FlatMapIterator` allows iterating over these vectors of `RowID`s.

Co-authored-by: Tobias Jordan <tobias.jordan@student.hpi.uni-potsdam.de>
Co-authored-by: Ben-Noah Engelhaupt <code@bengelhaupt.com>
Co-authored-by: Vincent Xeno Rahn <vinni@akv.i24.cc>
Co-authored-by: Jasperhino <blum.jasper@gmail.com>
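Iterating over a map of position vectors as one flat sequence can be sketched as below. This is only an illustration of the general technique under simplifying assumptions, not the `FlatMapIterator` from the commit: ToyFlatMapIterator, count_positions, and the use of a plain index as position type are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Position = std::size_t;  // hypothetical stand-in for RowID
using Map = std::unordered_map<int, std::vector<Position>>;

// Walks the positions of all buckets as if they formed one flat sequence.
class ToyFlatMapIterator {
 public:
  ToyFlatMapIterator(Map::const_iterator bucket, Map::const_iterator end)
      : _bucket(bucket), _end(end) {
    _skip_empty_buckets();
  }

  const Position& operator*() const { return _bucket->second[_offset]; }

  // Advance within the current vector; move to the next non-empty bucket
  // once the current vector is exhausted.
  ToyFlatMapIterator& operator++() {
    ++_offset;
    if (_offset == _bucket->second.size()) {
      ++_bucket;
      _offset = 0;
      _skip_empty_buckets();
    }
    return *this;
  }

  bool operator==(const ToyFlatMapIterator& other) const {
    return _bucket == other._bucket && _offset == other._offset;
  }
  bool operator!=(const ToyFlatMapIterator& other) const { return !(*this == other); }

 private:
  void _skip_empty_buckets() {
    while (_bucket != _end && _bucket->second.empty()) ++_bucket;
  }

  Map::const_iterator _bucket;
  Map::const_iterator _end;
  std::size_t _offset = 0;
};

// Usage example: count all positions stored in the map.
inline std::size_t count_positions(const Map& map) {
  std::size_t count = 0;
  const ToyFlatMapIterator end(map.cend(), map.cend());
  for (ToyFlatMapIterator it(map.cbegin(), map.cend()); it != end; ++it) ++count;
  return count;
}
```

The end iterator is represented as (map end, offset 0), so incrementing past the last position of the last bucket compares equal to it.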

bengelhaupt and others added 30 commits July 21, 2021 01:42

introduce additional abstraction layer AbstractRangeIndex and start implementation of PartialHashIndex

299c428

add PartialHashIndexTest and make AbstractIndex generic by its position entry type (e.g. RowID, ChunkOffset)

d28f883

add iterator test in PartialHashIndexTest

49fd944

implement equals() on PartialHashIndex and rename AbstractRangeIndex to AbstractOrderedIndex

6ca40a4

add ToDos

24a3f45

introduce AbstractTableIndex and implement create_table_index get_table_indexes on Table

feb52cd

change PartialHashIndex parameters and start temporarily modifying join_index_test

708ba91

implement table index joining

ca0a16e

Enable remaining index join types

e734418

implement fallback NLJ for index join + adapt JoinTestRunner

8cd4702

fix JoinTestRunner

cb91524

Integrate into TPCH benchmarks

af3bfc9

Add IndexJoin to LQP Translator

a728b0e

Implement templated PartialHashIndexImpl

edc5d34

Fix tests and Segfault

67aabc7

Fix architecture

0e21ef4

Add copy constructor and assignment on IteratorWrapper and add not_equals implementation

7b3f535

implement adding and removing chunks from PartialHashIndex

7d6a373

integrate notequals predicate and revert renaming of AbstractIndex

d70eb74

Resolve some ToDos

62639db

Fix return type of equals

9e85c53

Parametrize JoinIndexTest, revert configuration changes in JoinTestRunner

c677687

Add PHI memory consumption tests

d99cdc5

Fix OperatorsJoinIndexTest parametrization

2caebc2

Write tests for PHI memory consumption

0f7408b

Fix minor issues

8334fb6

Extend documentation strings

4b0bbad

Co-authored-by: Jasperhino <33397387+Jasperhino@users.noreply.github.com>

Remove PartialIndexStatistics

d67fa24

Restructure table-based joining in IndexJoin

b487cfe

Support multiple table indexes per column, fix chunks_joined_with_index metric

9fde53f

mweisgut requested changes Aug 19, 2021

Refactor AbstractIndex to AbstractChunkIndex, apply first feedback on… (

8d6d813

#3) Incorporate PR Feedback

vxrahn and others added 5 commits August 24, 2021 12:33

Re-trigger CI

1a8af0f

Re-trigger CI

64e4fa8

fix CI errors (missing include, wrong TPCH argument)

3e1d4c6

fix CI errors (wrong TPCC argument)

3bd7787

Re-trigger CI

dc2b6d9