potential_statistics
As of now, the value is calculated by summing up the size of all chunks in a given table. Retrieving the size of a chunk has constant complexity (std::vector::size()), therefore the complexity of calculating it for the whole table is linear in the number of chunks in the table.
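The summation described above can be sketched as follows; Chunk and Table are simplified stand-ins for Opossum's actual classes, not its real interfaces:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for Opossum's Chunk and Table classes.
struct Chunk {
  std::size_t size;  // number of rows in this chunk, retrievable in O(1)
};

struct Table {
  std::vector<Chunk> chunks;
};

// Summing the chunk sizes: O(1) per chunk, linear in the number of chunks.
std::size_t row_count(const Table& table) {
  std::size_t count = 0;
  for (const auto& chunk : table.chunks) {
    count += chunk.size;
  }
  return count;
}
```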
For intermediate results the optimizer needs some way to annotate the estimated result size of a temporary table. This can be done in an internal data structure.
The number of tuples in a table is probably both the most basic and most important statistic to gather. We believe it to be required for virtually all estimations in the optimizer.
Once Opossum can natively handle NULL values #13, statistics about how many NULL values there are in a column can be helpful.
There are two options for having the exact number of NULL values in a column. The first option would be to scan the table on-demand. The second option would be to update a counter on every modifying query, i.e. inserts, updates, and deletes. While the complexity of the first option is prohibitive, the second option also comes with significant overhead.
Alternatively, the count could be estimated. One way would be to count only on a sample of modifying queries, e.g., every n-th query. Another way would be to scan only a part of the table. These values could then be used to extrapolate the count.
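The partial-scan approach can be sketched as follows, using std::optional to model NULL values (a simplification for illustration; Opossum's actual NULL representation is the subject of #13). Scanning a contiguous prefix is only for brevity; a real implementation would sample random offsets to avoid bias:

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Estimate the NULL count by scanning only the first sample_size values
// and extrapolating the observed NULL ratio to the whole column.
std::size_t estimate_null_count(const std::vector<std::optional<int>>& column,
                                std::size_t sample_size) {
  if (column.empty() || sample_size == 0) return 0;
  sample_size = std::min(sample_size, column.size());
  std::size_t nulls_in_sample = 0;
  for (std::size_t i = 0; i < sample_size; ++i) {
    if (!column[i].has_value()) ++nulls_in_sample;
  }
  // Linear extrapolation of the sampled ratio to the full column size.
  return nulls_in_sample * column.size() / sample_size;
}
```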
The count of NULL values is most prominently subtracted from the cardinality of the table, to only estimate cardinality based on actual values in the column.
In sorted dictionaries, as used by our dictionary columns, the minimum and maximum values can easily be retrieved from the dictionary in constant time.
In value columns these values can change with every modifying operation. While inserting a value only requires comparing it to the current min and max, both deleting and updating might require a table scan to check for a new minimum or maximum, if the current min or max is deleted or updated.
Again, sampling can be used for estimations of the two values.
If the values are reliable (always up-to-date), the optimizer and/or operators can use them to prune parts of the tree.
If, however, they are merely estimates, they can only be used for selectivity estimates in range queries.
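The cheap insert path can be sketched like this; MinMaxStatistic is a hypothetical structure, not an existing Opossum class, and the rescan needed after a delete or update of the current min or max is deliberately left out:

```cpp
#include <algorithm>
#include <limits>

// Hypothetical min/max statistic for a value column. Inserts cost two
// comparisons; a delete or update hitting the current min or max would
// still require a table scan, which this sketch does not cover.
struct MinMaxStatistic {
  int min = std::numeric_limits<int>::max();
  int max = std::numeric_limits<int>::lowest();

  void on_insert(int value) {
    min = std::min(min, value);
    max = std::max(max, value);
  }
};
```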
The next statistic is the number of distinct values in a column, which can be estimated or inferred.
As of now, we do not have a concept of column uniqueness in Opossum. If such a concept is introduced at some point, unique columns would not need this statistic, as the optimizer could infer that the number of distinct values is equal to the cardinality of the table minus the number of NULL values in that column.
The number of distinct values in dictionary columns is equal to the size of the dictionary, which can be determined in constant time (std::vector::size()).
In the case of value columns, the challenges are similar to those for estimating the number of NULL values. Accurate statistics could only be maintained by updating a set of all values of a column on every modifying query. Additionally, in that case a delete operation would imply an additional table scan to check whether the deleted value is still present in another tuple. Alternatively, the statistics could be kept in the form of a complete histogram, which would not require a table scan on delete statements. It would, however, be more expensive both in terms of memory consumption and CPU cost.
This statistic can be used to estimate the selectivity of equality scans. Typically, the optimizer assumes a uniform distribution over the existing values. Therefore, with D being the set of distinct values, the estimated selectivity for any value would be 1/|D|.
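As a minimal sketch of this estimate, with the distinct count obtained elsewhere (e.g., from the dictionary size):

```cpp
#include <cstddef>

// Selectivity of an equality predicate under the uniform-distribution
// assumption: 1 / |D|, where |D| is the number of distinct values.
double equality_selectivity(std::size_t distinct_count) {
  if (distinct_count == 0) return 0.0;  // empty column: nothing matches
  return 1.0 / static_cast<double>(distinct_count);
}
```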
A very helpful statistic would be to keep track of every distinct value of a column and how often it is present.
A less memory-consuming version of the histogram is to choose a fixed number of buckets of (nearly) equal size. For each of these buckets only the bounds are stored.
Unfortunately, both statistics are very expensive to keep track of, as they require updating/inserting a value on every modifying statement to always be up-to-date. Sampling can be used as described above to estimate the frequencies instead.
While complete histograms can be used for both equality and range queries, partial histograms are most useful for range queries. The estimate is determined by calculating the number of completely covered buckets and the covered share of partially covered buckets. Let's assume we have k = 5 buckets and n = 200 tuples in the relation, and the bucket bounds are [0, 100, 180, 310, 450, 800]. Each bucket holds n/k = 40 tuples. That means that there are 40 values between 0 and 100, 40 values between 101 and 180, and so on. Imagine a predicate selecting values greater than or equal to 250 and smaller than 500. The estimated cardinality, assuming a uniform distribution within each bucket, would be as follows:

([# completely included buckets] + [% partial bucket low] + [% partial bucket high]) * (n/k)
= (1 + (310-250)/(310-181) + (500-451)/(800-451)) * 40
≈ 64
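The calculation above can be sketched as follows. This is a simplified equi-height histogram; bucket i is assumed to cover the integer values from bounds[i] + 1 to bounds[i + 1], matching the example's arithmetic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Estimated cardinality of a predicate low <= v < high over an
// equi-height histogram with k = bounds.size() - 1 buckets holding
// n / k tuples each, assuming uniform distribution within each bucket.
double estimate_cardinality(const std::vector<int>& bounds, std::size_t n,
                            int low, int high) {
  const std::size_t k = bounds.size() - 1;
  const double bucket_size = static_cast<double>(n) / k;
  double covered_buckets = 0.0;
  for (std::size_t i = 0; i < k; ++i) {
    const double b_low = bounds[i] + 1;   // lowest value in bucket i
    const double b_high = bounds[i + 1];  // highest value in bucket i
    if (high <= b_low || low > b_high) continue;  // bucket not touched
    // Fraction of the bucket's value range covered by the predicate.
    const double from = std::max<double>(low, b_low);
    const double to = std::min<double>(high, b_high);
    covered_buckets += (to - from) / (b_high - b_low);
  }
  return covered_buckets * bucket_size;
}
```

With bounds = [0, 100, 180, 310, 450, 800], n = 200, and the predicate 250 <= v < 500, this reproduces the roughly 64 tuples derived above.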
If the distribution of values in a column is highly skewed, the optimizer can benefit from statistics about the most frequent values to improve cost estimates for these values. This statistic keeps track, for each of these values, of the share of tuples in the relation with that value.
Again, ensuring accurate statistics at all times would essentially lead to a complete histogram, as even the less frequent values would have to be tracked. However, sampling should work quite well here.
The statistic can be used for equality filters. If the queried value is in fact in the list of most frequent values, the selectivity can simply be read from the mapping. Otherwise, the optimizer can exploit the fact that it is not: it subtracts the summed shares of the most frequent values from 1 and assumes a uniform distribution over the remaining values.
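This lookup-with-fallback can be sketched as follows; the mapping from value to its share of tuples is a hypothetical structure for illustration:

```cpp
#include <cstddef>
#include <unordered_map>

// Selectivity estimate from a most-common-values statistic. Values in
// the map return their recorded share directly; all other values share
// the remaining probability mass uniformly.
double mcv_selectivity(const std::unordered_map<int, double>& mcv_shares,
                       std::size_t distinct_count, int value) {
  const auto it = mcv_shares.find(value);
  if (it != mcv_shares.end()) return it->second;  // exact share known
  double mcv_total = 0.0;
  for (const auto& entry : mcv_shares) mcv_total += entry.second;
  const std::size_t remaining = distinct_count - mcv_shares.size();
  if (remaining == 0) return 0.0;  // every distinct value is in the MCV list
  return (1.0 - mcv_total) / static_cast<double>(remaining);
}
```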
SQL Server offers a way to restrict statistics to rows matching a filter predicate, effectively modeling functional dependencies between two or more columns.
It is very complex to automatically identify dependencies between two columns. However, we could offer a way to manually create statistics. Keeping track of these would have the same implications as the ones above.
Cross-column statistics can be very helpful for queries that select on multiple columns. Depending on the type of filter, different kinds of statistics are useful.