potential_statistics
As of now, the value is calculated by summing up the size of all chunks in a given table. Retrieving the size of a chunk has constant complexity (std::vector::size()), therefore the complexity of calculating it for the whole table is linear in the number of chunks in the table.
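The summation described above can be sketched as follows; Chunk and Table are simplified stand-ins for Opossum's actual classes, not its real interfaces:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for Opossum's Chunk and Table classes.
struct Chunk {
  std::size_t size;  // number of rows in this chunk, retrievable in O(1)
};

struct Table {
  std::vector<Chunk> chunks;
};

// Summing the chunk sizes: O(1) per chunk, linear in the number of chunks.
std::size_t row_count(const Table& table) {
  std::size_t count = 0;
  for (const auto& chunk : table.chunks) {
    count += chunk.size;
  }
  return count;
}
```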
For intermediate results the optimizer needs some way to annotate the estimated result size of a temporary table. This can be done in an internal data structure.
The number of tuples in a table is probably both the most basic and most important statistic to gather. We believe it to be required for virtually all estimations in the optimizer.
Once Opossum can natively handle NULL values #13, statistics about how many NULL values there are in a column can be helpful.
There are two options for having the exact number of NULL values in a column. The first option would be to scan the table on-demand. The second option would be to update a counter on every modifying query, i.e. inserts, updates, and deletes. While the complexity of the first option is prohibitive, the second option also comes with significant overhead.
Alternatively, the count could be estimated. One way would be to count only on a sample of modifying queries, e.g., every n-th query. Another way would be to scan only a part of the table. These values could then be used to extrapolate the count.
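The partial-scan approach can be sketched as follows, using std::optional to model NULL values (a simplification for illustration; Opossum's actual NULL representation is the subject of #13). Scanning a contiguous prefix is only for brevity; a real implementation would sample random offsets to avoid bias:

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Estimate the NULL count by scanning only the first sample_size values
// and extrapolating the observed NULL ratio to the whole column.
std::size_t estimate_null_count(const std::vector<std::optional<int>>& column,
                                std::size_t sample_size) {
  if (column.empty() || sample_size == 0) return 0;
  sample_size = std::min(sample_size, column.size());
  std::size_t nulls_in_sample = 0;
  for (std::size_t i = 0; i < sample_size; ++i) {
    if (!column[i].has_value()) ++nulls_in_sample;
  }
  // Linear extrapolation of the sampled ratio to the full column size.
  return nulls_in_sample * column.size() / sample_size;
}
```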
The count of NULL values is most prominently subtracted from the cardinality of the table, to only estimate cardinality based on actual values in the column.
In sorted dictionaries, as used by our dictionary columns, the minimum and maximum values can easily be retrieved from the dictionary in constant time.
In value columns these values can change with every modifying operation. While inserting a value only requires comparing it to the current min and max, both deleting and updating might require a table scan to check for a new minimum or maximum, if the current min or max is deleted or updated.
Again, sampling can be used for estimations of the two values.
If the values are reliable (always up-to-date), the optimizer and/or operators can use them to prune parts of the tree.
If, however, they are merely estimates, they can only be used for selectivity estimates in range queries.
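The cheap insert path can be sketched like this; MinMaxStatistic is a hypothetical structure, not an existing Opossum class, and the rescan needed after a delete or update of the current min or max is deliberately left out:

```cpp
#include <algorithm>
#include <limits>

// Hypothetical min/max statistic for a value column. Inserts cost two
// comparisons; a delete or update hitting the current min or max would
// still require a table scan, which this sketch does not cover.
struct MinMaxStatistic {
  int min = std::numeric_limits<int>::max();
  int max = std::numeric_limits<int>::lowest();

  void on_insert(int value) {
    min = std::min(min, value);
    max = std::max(max, value);
  }
};
```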
The next statistic is the number of distinct values in a column, which can be estimated or inferred.
As of now, we do not have a concept of column uniqueness in Opossum. If such a concept is introduced at some point, unique columns would not need this statistic, as the optimizer could infer that the number of distinct values is equal to the cardinality of the table minus the number of NULL values in that column.
The number of distinct values in dictionary columns is equal to the size of the dictionary, which can be determined in constant time (std::vector::size()).
In the case of value columns, the challenges are similar to those for estimating the number of NULL values. Accurate statistics could only be maintained by updating a set of all values of a column on every modifying query. Additionally, in that case a delete operation would imply an additional table scan to check whether the deleted value is still present in another tuple. Alternatively, the statistics could be kept in the form of a complete histogram, which would not require a table scan on delete statements. It would, however, be more expensive both in terms of memory consumption and CPU cost.
This statistic can be used to estimate the selectivity of equality scans. Typically, the optimizer assumes a uniform distribution over the existing values. Therefore, with D being the set of distinct values, the estimated selectivity for any value would be 1/|D|.
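As a minimal sketch of this estimate, with the distinct count obtained elsewhere (e.g., from the dictionary size):

```cpp
#include <cstddef>

// Selectivity of an equality predicate under the uniform-distribution
// assumption: 1 / |D|, where |D| is the number of distinct values.
double equality_selectivity(std::size_t distinct_count) {
  if (distinct_count == 0) return 0.0;  // empty column: nothing matches
  return 1.0 / static_cast<double>(distinct_count);
}
```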
A very helpful statistic would be to keep track of every distinct value of a column and how often it is present.
A less memory-consuming version of the histogram is to choose a fixed number of buckets of (nearly) equal size. For each of these buckets only the bounds are stored.
Unfortunately, both statistics are very expensive to keep track of, as they require updating/inserting a value on every modifying statement to always be up-to-date. Sampling can be used as described above to estimate the frequencies instead.
While complete histograms can be used for both equality and range queries, partial histograms are most useful for range queries. The estimate is determined by calculating the number of completely covered buckets and the covered share of partially covered buckets. Let's assume we have k = 5 buckets and n = 200 tuples in the relation, and the bucket bounds are [0, 100, 180, 310, 450, 800]. Each bucket holds n/k = 40 tuples. That means that there are 40 values between 0 and 100, 40 values between 101 and 180, and so on. Imagine a predicate selecting values greater than or equal to 250 and smaller than 500. The estimated cardinality, assuming a uniform distribution within each bucket, would be as follows:

([# completely included buckets] + [% partial bucket low] + [% partial bucket high]) * (n/k)
= (1 + (310-250)/(310-181) + (500-451)/(800-451)) * 40
≈ 64
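The calculation above can be sketched as follows. This is a simplified equi-height histogram; bucket i is assumed to cover the integer values from bounds[i] + 1 to bounds[i + 1], matching the example's arithmetic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Estimated cardinality of a predicate low <= v < high over an
// equi-height histogram with k = bounds.size() - 1 buckets holding
// n / k tuples each, assuming uniform distribution within each bucket.
double estimate_cardinality(const std::vector<int>& bounds, std::size_t n,
                            int low, int high) {
  const std::size_t k = bounds.size() - 1;
  const double bucket_size = static_cast<double>(n) / k;
  double covered_buckets = 0.0;
  for (std::size_t i = 0; i < k; ++i) {
    const double b_low = bounds[i] + 1;   // lowest value in bucket i
    const double b_high = bounds[i + 1];  // highest value in bucket i
    if (high <= b_low || low > b_high) continue;  // bucket not touched
    // Fraction of the bucket's value range covered by the predicate.
    const double from = std::max<double>(low, b_low);
    const double to = std::min<double>(high, b_high);
    covered_buckets += (to - from) / (b_high - b_low);
  }
  return covered_buckets * bucket_size;
}
```

With bounds = [0, 100, 180, 310, 450, 800], n = 200, and the predicate 250 <= v < 500, this reproduces the roughly 64 tuples derived above.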
If the distribution of values in a column is highly skewed, the optimizer can benefit from statistics about the most frequent values to improve cost estimates for these values. This statistic keeps track, for each of these values, of the share of tuples in the relation with that value.
Again, ensuring accurate statistics at all times would essentially lead to a complete histogram, as even the less frequent values would have to be tracked. However, sampling should work quite well here.
The statistic can be used for equality filters. If the queried value is in fact in the list of most frequent values, the selectivity can simply be read from the mapping. Otherwise, the optimizer can exploit the fact that it is not: it subtracts the summed shares of the most frequent values from 1 and assumes a uniform distribution over the remaining values.
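This lookup-with-fallback can be sketched as follows; the mapping from value to its share of tuples is a hypothetical structure for illustration:

```cpp
#include <cstddef>
#include <unordered_map>

// Selectivity estimate from a most-common-values statistic. Values in
// the map return their recorded share directly; all other values share
// the remaining probability mass uniformly.
double mcv_selectivity(const std::unordered_map<int, double>& mcv_shares,
                       std::size_t distinct_count, int value) {
  const auto it = mcv_shares.find(value);
  if (it != mcv_shares.end()) return it->second;  // exact share known
  double mcv_total = 0.0;
  for (const auto& entry : mcv_shares) mcv_total += entry.second;
  const std::size_t remaining = distinct_count - mcv_shares.size();
  if (remaining == 0) return 0.0;  // every distinct value is in the MCV list
  return (1.0 - mcv_total) / static_cast<double>(remaining);
}
```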
SQL Server offers a way to restrict statistics to rows matching a filter predicate, effectively modeling functional dependencies between two or more columns.
It is very complex to automatically identify dependencies between two columns. However, we could offer a way to manually create statistics. Keeping track of these would have the same implications as the ones above.
Cross-column statistics can be very helpful for queries that select on multiple columns. Depending on the type of filter, different kinds of statistics are useful.