understanding-tidb: add introduction for statistics #70
Signed-off-by: Yiding Cui <winoros@gmail.com>
If we don't know the size, we construct the histogram in the following way. We initialize the bucket depth to 1 for each bucket and insert the data as before. Once a bucket exceeds the needed depth, we double the bucket depth and combine every two adjacent buckets into one bucket.
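The doubling strategy described above can be sketched roughly as follows. This is an illustrative Python sketch, not TiDB's Go implementation: the function name `build_streaming`, the `max_buckets` parameter, and the `[lower, upper, count]` bucket layout are assumptions made for the example, and the input is assumed to arrive already sorted.

```python
def build_streaming(sorted_stream, max_buckets):
    """Build a histogram from sorted values without knowing the total size.

    Hypothetical helper: each bucket is [lower, upper, count].
    """
    depth = 1      # current per-bucket depth, starts at 1
    buckets = []
    for v in sorted_stream:
        last_is_full = not buckets or buckets[-1][2] >= depth
        if last_is_full and len(buckets) == max_buckets:
            # No room for a new bucket: merge adjacent pairs and double the depth.
            merged = []
            for i in range(0, len(buckets), 2):
                pair = buckets[i:i + 2]
                merged.append([pair[0][0], pair[-1][1], sum(b[2] for b in pair)])
            buckets = merged
            depth *= 2
        if buckets and buckets[-1][2] < depth:
            # The last bucket still has room: extend it.
            buckets[-1][1] = v
            buckets[-1][2] += 1
        else:
            # Otherwise open a new bucket for this value.
            buckets.append([v, v, 1])
    return buckets, depth


values = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
buckets, depth = build_streaming(values, max_buckets=4)
```

With at most 4 buckets, the 12 values above trigger two merges, so the sketch ends with three buckets of depth 4.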
### Count-Min Sketch (Legacy in TiDB)
Why did it become legacy?
... and where is the related code about count-min sketch?
@xxchan The explanation is written at the place where we introduce the top-n.
@winoros so you can mention here briefly that it is superseded by top-n?
Thanks for your contribution @winoros ! Comments inline.
@@ -1 +1,88 @@
# Table Statistics
# TiDB Statistics
why change the title?
A histogram splits the data into many buckets and describes each bucket with a few simple values, such as how many records are in the bucket. It's widely used in many RDBMSs for range estimation. We have two different types of histogram depending on the bucketing strategy: equal-depth histogram and equal-width histogram.

We choose the equal-depth histogram according to the paper [Accurate estimation of the number of tuples satisfying a condition](https://dl.acm.org/citation.cfm?id=602294). Compared with the equal-width histogram, the equal-depth histogram has a better guarantee on the error rate in the worst case. The so-called equal-depth histogram means that the number of values falling into each bucket is as equal as possible. For example, suppose we want to split the record set `1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5` into 4 buckets. The final result is `[1.6, 1.9], [2.0, 2.6], [2.7, 2.8], [2.9, 3.5]`; the depth of each bucket is 3, i.e. the number of records in each bucket is 3. The graph is shown below.
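The bucketing in this example can be reproduced with a short sketch. It is a minimal Python illustration, not TiDB's actual code: `build_equal_depth` and the dictionary layout are hypothetical names chosen for the example, the input is assumed to be sorted already, and the sketch ignores the question of duplicate values straddling a bucket boundary.

```python
def build_equal_depth(sorted_values, num_buckets):
    """Split sorted values into buckets holding (nearly) the same number of values."""
    n = len(sorted_values)
    depth = -(-n // num_buckets)  # ceiling division: values per bucket
    buckets = []
    for start in range(0, n, depth):
        chunk = sorted_values[start:start + depth]
        buckets.append({"lower": chunk[0], "upper": chunk[-1], "count": len(chunk)})
    return buckets


values = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
buckets = build_equal_depth(values, 4)
bounds = [(b["lower"], b["upper"]) for b in buckets]
# bounds == [(1.6, 1.9), (2.0, 2.6), (2.7, 2.8), (2.9, 3.5)], each with count 3
```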
It's better to describe what "equal-width" means here, though it is straightforward.
I think the deeper information can be found in the paper. Also, if we introduce the equal-width case, we need to tell the reader why we choose the equal-depth one.
So you can write "the deeper information can be found in the paper, in conclusion, we choose the equal-depth histogram..." or so.
Because a reader who gets here will wonder: you wrote "We have two different type of histogram depending on the bucketing strategy: equal-depth histogram and equal-width histogram." above, but only talk about one of them. Why?
Your comment here would be better placed in the content than in a PR comment.
This way, when querying how many times a value appears, the `d` hash functions are used again to find the position the value maps to in each row, and the minimum of these `d` counts is used as the estimate.

### Top-N Value (Most Frequent Value)
Where is the related code about top-n value?
For dynamic updating of histograms, the industry generally has two approaches.

- For each addition or deletion, update the corresponding bucket depth. When a bucket's depth becomes too high, it is generally split by dividing the width of the bucket equally, although this makes it difficult to determine the splitting point accurately and causes errors.
- Use the actual row counts obtained from queries as feedback to adjust the histogram. This approach assumes that the error contributed by all buckets is uniform, and uses the continuous value assumption to adjust all the buckets involved. However, the assumption of uniform error often causes problems. For example, when a newly inserted value is larger than the maximum value of the histogram, the error caused by that value is spread across the histogram's buckets, thus causing errors.
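The first approach can be sketched as below to show where the error comes from. This is a hypothetical Python illustration, not TiDB's implementation: `insert_value`, `max_depth`, and the half-open `[lower, upper, count]` bucket layout are assumptions, and so is the choice to give each half of a split bucket half of the count.

```python
def insert_value(buckets, value, max_depth):
    """Add one value to the bucket covering it; split the bucket if it gets too deep.

    buckets: sorted list of [lower, upper, count] covering half-open ranges [lower, upper).
    Returns False if no bucket covers the value.
    """
    for i, (lower, upper, count) in enumerate(buckets):
        if lower <= value < upper:
            count += 1
            if count > max_depth:
                # Split by width, not by data: the true split point is unknown,
                # so the counts assigned to each half are only a guess.
                mid = (lower + upper) / 2
                left = count // 2
                buckets[i:i + 1] = [[lower, mid, left], [mid, upper, count - left]]
            else:
                buckets[i][2] = count
            return True
    return False


buckets = [[0.0, 10.0, 3], [10.0, 20.0, 1]]
inserted = insert_value(buckets, 2.0, max_depth=3)
# buckets is now [[0.0, 5.0, 2], [5.0, 10.0, 2], [10.0, 20.0, 1]]
```

Even if all four values actually lie below 5.0, the split assigns two of them to `[5.0, 10.0)`, which is the kind of error the bullet above refers to.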
Why not move these paragraphs to the histogram section?
If so, what should be placed here?
@dcalvin I'd appreciate it if you could schedule a review of this PR.
Co-authored-by: tison <wander4096@gmail.com>
IMHO, we should simplify the description about the principles of statistics and row count estimation, and illustrate more on the entry functions or code layout of this component, since it is a dev guide.
nice guide
it's very helpful for people who have little background in the database area.
@winoros thanks for the update. I'll review this PR today. You can ping me for a review next time.
Comments inline. I will review the remaining parts later.
The Count-Min Sketch (CM sketch) is a data structure used to estimate query cardinality for equality predicates, joins, etc., and provides strong accuracy guarantees. Since its introduction in 2003 in the paper [An improved data stream summary: The count-min sketch and its applications](http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf), it has gained widespread use given its simplicity of construction and use.

CM sketch maintains an array of `d*w` counts, i.e. `d` rows of `w` counters each. For each value, it maps the value to a column in each row using `d` separate hash functions and modifies the count value at those `d` positions. This is shown in the following figure.
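A minimal sketch of this structure, covering both the update and the minimum-based query, might look like the following. It is written in Python for illustration and is not TiDB's code: the class name, the `depth`/`width` parameter names (standing for `d` and `w`), and the use of salted BLAKE2b as the `d` hash functions are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth, width):
        self.depth = depth  # d: number of rows, one hash function per row
        self.width = width  # w: number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, value):
        # Derive one hash function per row by salting the hash with the row index.
        digest = hashlib.blake2b(
            repr(value).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def insert(self, value, count=1):
        # Update the d positions the value maps to, one in each row.
        for row in range(self.depth):
            self.table[row][self._column(row, value)] += count

    def query(self, value):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.table[row][self._column(row, value)] for row in range(self.depth))


sketch = CountMinSketch(depth=4, width=64)
for v in ["a"] * 5 + ["b"] * 2:
    sketch.insert(v)
estimate_a = sketch.query("a")  # at least 5; larger only if every row has a collision
```

Because counters are only ever incremented, the estimate never falls below the true count; it can only overestimate.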
What are `d` and `w`? Could you briefly define them here?
Finished a review cycle. It is a great post and almost ready to be merged. Please take a look at the comments.
Co-authored-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
LGTM. Merging...
@winoros I pushed one more commit to fix grammar complaints. You can turn on spell check when writing.
@all-contributors please add @winoros for content
I've put up a pull request to add @winoros! 🎉
Signed-off-by: Yiding Cui winoros@gmail.com
What issue does this PR solve?
What is changed:
Init the first version