Indexes and Filters
Hyrise provides several secondary data structures. We classify them into two categories: indexes and filters. These can be used, e.g., to speed up query execution by avoiding data accesses (on both, chunk- and tuple-level), but also for cardinality estimation during query optimization. In general, indexes work on a more fine granular level but will also consume more memory.
Indexes are used to directly map searched value to the positions at which they can be found in a table. Thus, for example, full column scans are no longer necessary and the scan or join performance is greatly improved. On the downside, the index itself consumes additional memory and might require maintenance. However, the immutability of Chunks makes index maintenance in most cases unnecessary.
For Hyrise, there are currently three index implementations available:
They all implement the BaseIndex-interface.
A more detailed description of each index and the BaseIndex can be found on their respective wiki page.
General Design Decisions
All current index-implementations as well as the BaseIndex-interface were built under certain constraints and assumptions that specifically reflect the architecture of Hyrise.
Indexes are created per Chunk: An index is not created for a whole table but always ob a per-chunk basis, since all operations in Hyrise are conducted per chunk.
Indexes are immutable: Once created, indexes are not updated. This greatly simplifies the index implementations because no maintenance is necessary. Furthermore, this assumption aligns well with the first decision, since completed chunks in Hyrise are by design immutable as well. This means that the indexes can currently only be created on immutable chunks.
Indexes are only built on encoded segments: This constraint only applies to the current index implementations to simplify the development, but this could possibly change in the future. The BaseIndex interface is generalized to potentially also support unencoded (mutable) chunk segments at a later point in time.
The characteristic of the three index implementations are outlined below for a quick overview:
|+ Fastest creation||- Slow creation||- Slow creation|
|- Only works on a single column||+ Works on multiple columns||- Currently only works on a single column|
|- Only works on DictCols||- Only works on DictCols||+ Easy to adapt to also handle ValueSegments|
Filters are (possibly probabilistic) auxiliary data structures that hold aggregated/statistical information about a single segment of a chunk. This information can be used during query optimization for two purposes: (i) to get more accurate results for cardinality estimations or (ii) to prune chunks which means to exclude them from processing to avoid unnecessary data accesses. In contrast to indexes, filters are supposed to be significantly smaller and are not used for the retrieval of position lists or tuples.
At the moment, Hyrise offers min/max and range filters. We currently work on an implementation of Counting Quotient Filters as well as pruning histograms for Hyrise.
All filters implement the AbstractFilter-interface.
Range filters store a certain number of value ranges. Each range represents a spread of values that is contained within the bounds. These ranges can be used to check whether a certain value exists in a segment. If it does not exist, this chunk can be pruned.