[DYOD] Partial Hash Index #2386
…mplementation of PartialHashIndex
…on entry type (e.g. RowID, ChunkOffset)
…to AbstractOrderedIndex
…le_indexes on Table
…uals implementation
Co-authored-by: Jasperhino <33397387+Jasperhino@users.noreply.github.com>
Thank you for your contribution. In addition to the above comments, please make sure that the CI run finishes successfully.
#3) Incorporate PR Feedback
Thanks a lot for the useful feedback on this PR!
Concerning the TPC-H benchmarks: on which columns did you create the indexes?
We used -- similar to the creation of chunk indexes -- the columns defined in
Since this work is currently not continued, I'm going to close this PR for now.
This PR implements a partial hash index (`PartialHashIndex`), i.e., a single-column index structure that can be created for one or more segments of the corresponding column without the need to index all segments. This PR is based on the work started by @bengelhaupt, @Jasperhino, and @vxrahn in #2386.
This `PartialHashIndex` index differs from the already implemented indexes because it is created over multiple chunks of a table instead of being on a per-chunk basis. It can be constructed for a set of chunks and can later be modified by adding additional chunks or removing already indexed chunks.
The already existing index implementations also share the property of having the methods `lower_bound` and `upper_bound` used for range queries. The `PartialHashIndex` is based on an unsorted hash map and, therefore, cannot provide such functionality efficiently. Instead, it provides the functions `equals` and `not_equals`, which allow lookups in constant time. This is achieved by using a hash map that maps a given value to a vector of positions (i.e., `RowID`s) where this value is located. The `FlatMapIterator` allows iterating over these lists of `RowID`s.
Co-authored-by: Tobias Jordan <tobias.jordan@student.hpi.uni-potsdam.de>
Co-authored-by: Ben-Noah Engelhaupt <code@bengelhaupt.com>
Co-authored-by: Vincent Xeno Rahn <vinni@akv.i24.cc>
Co-authored-by: Jasperhino <blum.jasper@gmail.com>
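The lookup structure described in the commit message above can be illustrated with a minimal sketch. This is not the code from the PR: `ToyPartialHashIndex`, the simplified `RowID` struct, the fixed `int` column type, and the `add_chunk`/`remove_chunk` signatures are all assumptions made for illustration. It only shows the idea of a hash map from a value to the positions where it occurs, maintained across an arbitrary subset of chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified stand-in for a row position (assumption: not Hyrise's actual RowID type).
struct RowID {
  uint32_t chunk_id;
  uint32_t chunk_offset;
  bool operator==(const RowID& other) const {
    return chunk_id == other.chunk_id && chunk_offset == other.chunk_offset;
  }
};

// Hypothetical toy index over a single int column, covering only the chunks added to it.
class ToyPartialHashIndex {
 public:
  // Indexes one chunk; returns false if the chunk is already indexed.
  bool add_chunk(uint32_t chunk_id, const std::vector<int>& values) {
    if (!_indexed_chunks.insert(chunk_id).second) return false;
    for (uint32_t offset = 0; offset < values.size(); ++offset) {
      _positions[values[offset]].push_back(RowID{chunk_id, offset});
    }
    return true;
  }

  // Drops all positions of one chunk; returns false if the chunk was not indexed.
  bool remove_chunk(uint32_t chunk_id) {
    if (_indexed_chunks.erase(chunk_id) == 0) return false;
    for (auto it = _positions.begin(); it != _positions.end();) {
      auto& rows = it->second;
      rows.erase(std::remove_if(rows.begin(), rows.end(),
                                [chunk_id](const RowID& row) { return row.chunk_id == chunk_id; }),
                 rows.end());
      // Remove map entries whose position list became empty.
      it = rows.empty() ? _positions.erase(it) : std::next(it);
    }
    return true;
  }

  // Single hash probe; the sketch copies the positions instead of returning iterators.
  std::vector<RowID> equals(int value) const {
    const auto it = _positions.find(value);
    return it == _positions.end() ? std::vector<RowID>{} : it->second;
  }

  // Collects the positions of every indexed value other than `value`.
  std::vector<RowID> not_equals(int value) const {
    std::vector<RowID> result;
    for (const auto& entry : _positions) {
      if (entry.first != value) result.insert(result.end(), entry.second.begin(), entry.second.end());
    }
    return result;
  }

 private:
  std::unordered_map<int, std::vector<RowID>> _positions;
  std::unordered_set<uint32_t> _indexed_chunks;
};

// Usage: index chunks 0 and 2 only, then look up and un-index.
inline void usage_example() {
  ToyPartialHashIndex index;
  index.add_chunk(0, {7, 3, 7});
  index.add_chunk(2, {3});
  assert(index.equals(7).size() == 2);
  index.remove_chunk(0);
  assert(index.equals(7).empty());
}
```

The sketch returns copied vectors for brevity, whereas the commit message describes iterating over the stored position lists through an iterator (`FlatMapIterator`), which avoids materializing results. Because the map is unordered, nothing like `lower_bound` or `upper_bound` can be answered without scanning all keys.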
Description
This PR implements a partial hash index (`PartialHashIndex`) and integrates it into the index join.
This new index differs from the indexes already implemented because it is created over multiple chunks of a table instead of on a per-chunk basis. It can be constructed for a set of chunks and can later be modified by adding additional chunks or removing already indexed chunks.
The existing index implementations also share the methods `lower_bound` and `upper_bound`, which are used for range queries. The `PartialHashIndex` is based on an unsorted hash map and therefore cannot provide such functionality efficiently. Instead, it provides the functions `equals` and `not_equals`, which allow lookups in constant time.
This is achieved by using a hash map that maps a given value to a list of `RowID`s where this value is located. The `TableIndexIterator` iterates over these lists of `RowID`s.
We made the following design decisions:
- `PartialHashIndex` does not inherit from `AbstractIndex`, nor does it share a common superclass with, e.g., the `GroupKeyIndex`, because of an iterator inheritance problem. All indexes that currently inherit from `AbstractIndex` coincidentally share its iterator type `std::vector<ChunkOffset>::const_iterator`, which makes inheritance work. The iterator of the `PartialHashIndex` does not match that type, nor is it covariant. To resolve this, the iterator type of `AbstractIndex` would need to be made polymorphic, which would further require passing around pointers or references to it. We felt that change was too big to justify. Instead, `AbstractTableIndex` has been introduced to act as a superclass for the `PartialHashIndex` and other future hash-based table indexes.
- `PartialHashIndexImpl` has been introduced to allow for type-safe usage of the map keys (which are of the segment/column data type). Such an object is held by the actual `PartialHashIndex` class, which forwards the respective calls. When a `PartialHashIndex` is created using an empty list of chunks, this `Impl` cannot be created because the data type cannot be extracted from the given chunks. A mock instance of `BasePartialHashIndexImpl` is then used until a chunk is added and an `Impl` can be constructed.
TPC-H Benchmark Comparisons
Standalone:
./hyriseBenchmarkTPCH -s 1
Chunk Indexes:
./hyriseBenchmarkTPCH -s 1 --indexes
PartialHashIndex:
./hyriseBenchmarkTPCH -s 1 --table_indexes
Standalone vs Table Indexes
Chunk Indexes vs PartialHashIndex
Query Plan Example: TPC-H Q7
Standalone
Chunk Indexes
PartialHashIndex