Statistics Component
The statistics component is used by the query optimizer in Hyrise.
The statistics for a table stored in the storage manager can be accessed via a getter function on the corresponding table. The storage manager attaches a table statistics object to a table when the table is added to the storage manager. Each table statistics object holds the column statistics for all columns of the table. Column statistics are created lazily, on first access.
Column statistics currently track only the minimum, maximum, and distinct count of a column. This information is gathered when needed with the help of the aggregate operator.
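The lazy gathering described above can be sketched as follows. This is a minimal illustration, not the Hyrise implementation: the class name `LazyColumnStatistics` is hypothetical, the column is modeled as a plain `std::vector<int>`, and standard library algorithms stand in for the aggregate operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a column's statistics (not a Hyrise class).
// Each value is computed on first access and cached afterwards.
class LazyColumnStatistics {
 public:
  explicit LazyColumnStatistics(std::vector<int> values) : _values(std::move(values)) {
    assert(!_values.empty());  // assumption of this sketch: the column is not empty
  }

  int min() {
    if (!_min) _min = *std::min_element(_values.begin(), _values.end());
    return *_min;
  }

  int max() {
    if (!_max) _max = *std::max_element(_values.begin(), _values.end());
    return *_max;
  }

  std::size_t distinct_count() {
    // In Hyrise this information comes from the aggregate operator;
    // a std::set is used here only to keep the sketch self-contained.
    if (!_distinct_count) _distinct_count = std::set<int>(_values.begin(), _values.end()).size();
    return *_distinct_count;
  }

 private:
  std::vector<int> _values;
  std::optional<int> _min;
  std::optional<int> _max;
  std::optional<std::size_t> _distinct_count;
};
```

Nothing is computed when the object is constructed; the first call to `min()`, `max()`, or `distinct_count()` triggers the corresponding pass over the data.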
In the long run, the statistics component should offer a prediction of the result row count for each operator in Hyrise. The idea is that the statistics component offers the same interface for predicting the result size of an operator as the operator itself. Where an operator takes other operators as table inputs, the statistics take other table statistics instead. The statistics therefore need to be nested in the same way as the operators in order to predict the output size of the final operator.
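The nesting idea can be illustrated with a small sketch. The class `TableStatisticsSketch` and its method signature are assumptions made for this example and are not taken from table_statistics.hpp; the selectivity is passed in directly instead of being derived from column statistics.

```cpp
#include <cassert>
#include <memory>

// Hypothetical sketch of nested statistics (not the Hyrise class).
class TableStatisticsSketch {
 public:
  explicit TableStatisticsSketch(double row_count) : _row_count(row_count) {}

  double row_count() const { return _row_count; }

  // Mirrors a table scan: the input is the statistics object itself rather than
  // an input operator, and the result is a new statistics object describing the
  // predicted output. The input statistics are left unchanged.
  std::shared_ptr<TableStatisticsSketch> predicate_statistics(double selectivity) const {
    assert(selectivity >= 0.0 && selectivity <= 1.0);
    return std::make_shared<TableStatisticsSketch>(_row_count * selectivity);
  }

 private:
  double _row_count;
};
```

Chaining two calls, for example `base->predicate_statistics(0.5)->predicate_statistics(0.1)`, corresponds to two stacked table scans and yields the predicted row count of the final one.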
When looking into the code, start off at the table statistics header: table_statistics.hpp
Assumptions within statistics component
- Uniform value distribution is assumed within a column.
- No dependencies between different columns.
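To show what these two assumptions imply for selectivity estimates, here is a minimal sketch. The struct `ColumnSummary` and the function names are hypothetical, and the formulas are the textbook estimates that follow from uniformity and independence; they are not necessarily the exact formulas used in Hyrise.

```cpp
#include <cassert>

// Hypothetical summary of a numeric column: the three values the column statistics track.
struct ColumnSummary {
  double min;
  double max;
  double distinct_count;
};

// column == value: under a uniform distribution every distinct value is equally likely.
double equals_selectivity(const ColumnSummary& column) {
  assert(column.distinct_count > 0.0);
  return 1.0 / column.distinct_count;
}

// column < value: the share of the [min, max] range that lies below the value.
// This treats the value range as continuous, which is an assumption of this sketch.
double less_than_selectivity(const ColumnSummary& column, double value) {
  if (value <= column.min) return 0.0;
  if (value > column.max) return 1.0;
  return (value - column.min) / (column.max - column.min);
}

// Two predicates on different columns: without dependencies between columns,
// the combined selectivity is the product of the individual selectivities.
double conjunction_selectivity(double first, double second) {
  return first * second;
}
```

For a column with values between 0 and 100, a scan `column < 25` is estimated to keep a quarter of the rows; combining it with an independent predicate of selectivity 0.5 gives 0.125.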
Currently, only the prediction for predicates (table scans) is implemented.
Column data type | Scan type | AllParameterVariant type | Supported |
---|---|---|---|
int, float, double | ==, !=, <, <=, >, >=, between | AllTypeVariant | [x] |
string | ==, != | AllTypeVariant | [x] |
int, float, double | ==, !=, <, <=, >, >=, between | ValuePlaceholder | [x] |
string | ==, !=, <, <=, >, >=, between | ValuePlaceholder | [x] |
int, float, double | ==, !=, <, <=, >, >= | ColumnName | [x] |
string | | ColumnName | [ ] |
Between is not supported for ColumnName because the table scan does not support it. If a combination is not supported, a selectivity of 1 is assumed for the operation. The statistics component can therefore handle all requests and does not fail on unimplemented types.
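The fallback behavior can be sketched as follows. The enum `ScanTypeSketch` and both function names are hypothetical, and which scan types count as supported is chosen arbitrarily for the example; only the fallback to a selectivity of 1 reflects the behavior described above.

```cpp
#include <cassert>
#include <optional>

// Hypothetical scan types for this sketch (not the Hyrise enum).
enum class ScanTypeSketch { Equals, NotEquals, LessThan, Between };

// Returns an estimate for the combinations this sketch implements,
// and std::nullopt for everything else.
std::optional<double> supported_selectivity(ScanTypeSketch scan_type, double distinct_count) {
  assert(distinct_count > 0.0);
  switch (scan_type) {
    case ScanTypeSketch::Equals:
      return 1.0 / distinct_count;
    case ScanTypeSketch::NotEquals:
      return 1.0 - 1.0 / distinct_count;
    default:
      return std::nullopt;  // treated as not implemented in this sketch
  }
}

// Never fails: unsupported combinations fall back to a selectivity of 1,
// meaning the predicted row count stays unchanged.
double selectivity_or_fallback(ScanTypeSketch scan_type, double distinct_count) {
  return supported_selectivity(scan_type, distinct_count).value_or(1.0);
}
```

A selectivity of 1 is a conservative choice: the predicted output is never smaller than the real one, at the cost of overestimating the result size for unsupported predicates.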
Statistics are currently not updated after inserts, updates or deletes.