understanding-tidb: add introduction for statistics #70

Merged
merged 8 commits into from Oct 12, 2021

@winoros winoros commented Aug 8, 2021

Signed-off-by: Yiding Cui winoros@gmail.com

What issue does this PR solve?

What is changed:

Init the first version

Signed-off-by: Yiding Cui <winoros@gmail.com>

If we don't know the size in advance, we construct the histogram in the following way. We initialize the depth of each bucket to 1 and insert the data as before. Once a bucket exceeds the required depth, we double the bucket depth and merge every two adjacent buckets into one.
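
A minimal sketch of one reading of this paragraph, not the TiDB implementation. It assumes the values arrive in sorted order and that the number of buckets is capped at a hypothetical `max_buckets`; the merge is triggered when the last bucket is full and no new bucket may be opened. Duplicate values spanning a bucket boundary are not handled.

```python
def build_streaming(sorted_values, max_buckets):
    """Build a roughly equal-depth histogram without knowing the input size.

    Each bucket is [lower_bound, upper_bound, count]. Names are hypothetical.
    """
    depth = 1        # current per-bucket capacity, starts at 1
    buckets = []
    for v in sorted_values:
        # All buckets are in use and the last one is full:
        # merge adjacent pairs and double the allowed depth.
        if len(buckets) == max_buckets and buckets[-1][2] >= depth:
            merged = []
            for i in range(0, len(buckets), 2):
                pair = buckets[i:i + 2]  # the final "pair" may hold one bucket
                merged.append([pair[0][0], pair[-1][1], sum(b[2] for b in pair)])
            buckets = merged
            depth *= 2
        if buckets and buckets[-1][2] < depth:
            # Room left in the last bucket: extend its upper bound.
            buckets[-1][1] = v
            buckets[-1][2] += 1
        else:
            buckets.append([v, v, 1])
    return buckets, depth
```

With eight sorted values and at most two buckets, the depth doubles twice (1, 2, 4) and the result is two buckets of four values each.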

### Count-Min Sketch(Legacy in TiDB)
Contributor

Why did it become legacy?

Contributor

... and where is the related code about count-min sketch?

Member Author

@xxchan The explanation is written at the place where we introduce the top-n.

Contributor

@winoros so you can mention here briefly that it is superseded by top-n?

@yudongusa yudongusa self-requested a review August 15, 2021 21:53
Contributor

@tisonkun tisonkun left a comment

Thanks for your contribution @winoros ! Comments inline.

@@ -1 +1,88 @@
-# Table Statistics
+# TiDB Statistics
Contributor

@tisonkun tisonkun Aug 16, 2021

why change the title?

A histogram splits the data into many buckets and describes each bucket with a few simple values, such as how many records fall into it. It is widely used in many RDBMSs for range estimation. There are two types of histogram, depending on the bucketing strategy: the equal-depth histogram and the equal-width histogram.

We choose the equal-depth histogram according to the paper [Accurate estimation of the number of tuples satisfying a condition](https://dl.acm.org/citation.cfm?id=602294). Compared with the equal-width histogram, the equal-depth histogram has a better guarantee on the error rate in the worst case. In an equal-depth histogram, the number of values falling into each bucket is as equal as possible. For example, suppose we want to split the record set `1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5` into 4 buckets. The result is `[1.6, 1.9], [2.0, 2.6], [2.7, 2.8], [2.9, 3.5]`, and the depth of each bucket is 3, i.e. the number of records in each bucket is 3. The graph is shown below.
Contributor

It would be better to describe what "equal-width" means here, though it is straightforward.

Member Author

I think the deeper information can be found in the paper. Also, if we introduce the equal-width case, we need to tell the reader why we choose the equal-depth one.

Contributor

So you can write "the deeper information can be found in the paper, in conclusion, we choose the equal-depth histogram..." or so.

Contributor

@tisonkun tisonkun Aug 24, 2021

Because a reader who gets here will think: you wrote "We have two different type of histogram depending on the bucketing strategy: equal-depth histogram and equal-width histogram." above, but you talk about only one of them. Why?

Your comment here would be better placed in the content itself rather than in a PR comment.
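
To make the distinction discussed in this thread concrete, here is a small sketch that buckets the example record set both ways. The function names are hypothetical and are not taken from the TiDB code base; the equal-width variant simply divides the value range into intervals of the same width.

```python
def equal_depth(sorted_values, num_buckets):
    """Split sorted values so each bucket holds (about) the same number of records."""
    depth = -(-len(sorted_values) // num_buckets)  # ceiling division
    chunks = [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]
    return [(c[0], c[-1], len(c)) for c in chunks]  # (lower, upper, count)


def equal_width(sorted_values, num_buckets):
    """Split the value range into intervals of the same width and count records in each."""
    lo, hi = sorted_values[0], sorted_values[-1]
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in sorted_values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


records = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
depth_buckets = equal_depth(records, 4)
width_counts = equal_width(records, 4)
```

On this data the equal-depth buckets each hold 3 records, while the equal-width buckets hold 4, 1, 5 and 2 records, which illustrates how uneven equal-width buckets can become on skewed data.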

This way, when querying how many times a value appears, the d hash functions are still used to find the position mapped to in each row, and the minimum of these d values is used as the estimate.

### Top-N value(Most Frequent Value)
Contributor

Where is the related code about top-n value?

Comment on lines 83 to 86
For dynamic updating of histograms, the industry generally has two approaches.

- For each insertion or deletion, update the depth of the corresponding bucket. When a bucket becomes too deep, it is split, generally by dividing its width equally, although this makes it difficult to determine the split point accurately and introduces errors.
- Use the actual row count obtained from query execution as feedback to adjust the histogram. This approach assumes that all buckets contribute equally to the error and uses the continuous value assumption to adjust all the buckets involved. However, the assumption of uniform error often causes problems. For example, when a newly inserted value is larger than the maximum value of the histogram, the error it causes is spread across the histogram's existing buckets.
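
The first approach can be sketched as follows. This is only an illustration under stated assumptions: buckets are `[lower, upper, count]` lists with numeric bounds, `max_depth` is a hypothetical threshold, and the split assumes records are spread uniformly inside the bucket, which is exactly where the error mentioned above comes from.

```python
def insert_value(buckets, value, max_depth):
    """Increment the depth of the bucket covering `value`; split it if too deep."""
    for i, (lo, hi, _count) in enumerate(buckets):
        if lo <= value <= hi:
            buckets[i][2] += 1
            if buckets[i][2] > max_depth:
                # Split by width: the true split point is unknown, so the
                # records are assumed to be evenly spread over [lo, hi].
                mid = (lo + hi) / 2
                left = buckets[i][2] // 2
                right = buckets[i][2] - left
                buckets[i:i + 1] = [[lo, mid, left], [mid, hi, right]]
            return buckets
    raise ValueError("value is outside the histogram range")


hist = insert_value([[0, 10, 3], [10, 20, 3]], 4, max_depth=3)
```

If all four records in the first bucket were actually below 5, the 2/2 split would already misplace two of them.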
Contributor

why not move these paragraphs to the section of histogram

Member Author

If so, what should be placed here?

@tisonkun
Contributor

@dcalvin I'd appreciate it if you could schedule a review of this PR.

Co-authored-by: tison <wander4096@gmail.com>
Contributor

@eurekaka eurekaka left a comment

IMHO, we should simplify the description about the principles of statistics and row count estimation, and illustrate more on the entry functions or code layout of this component, since it is a dev guide.

@feitian124
Contributor

nice guide

@feitian124
Contributor

feitian124 commented Sep 17, 2021

> IMHO, we should simplify the description about the principles of statistics and row count estimation, and illustrate more on the entry functions or code layout of this component, since it is a dev guide.

It's very helpful for people who have little background in the database area.
Once you understand the theory, the entry functions and code layout take only a few words to explain.

@tisonkun
Contributor

tisonkun commented Oct 5, 2021

@winoros thanks for the update. I'll review this PR today. You can ping me for a review next time.

Contributor

@tisonkun tisonkun left a comment

Comments inline. I will review the remaining parts later.

The Count-Min Sketch (CM sketch) is a data structure used for cardinality estimation of equality predicates, joins, and similar operations, and it provides strong accuracy guarantees. Since its introduction in 2003 in the paper [An improved data stream summary: The count-min sketch and its applications](http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf), it has gained widespread use because it is simple to construct and use.

CM sketch maintains an array of `d*w` counts, and for each value, maps it to a column in each row using `d` separate hash functions and modifies the count value at those `d` positions. This is shown in the following figure.
Contributor

What is d and w? Could you briefly define them here?
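
In the standard formulation of the CM sketch, `d` is the number of rows (one hash function per row) and `w` is the number of counters in each row. A minimal sketch of the update and the query follows. The class name and the use of salted `blake2b` as the `d` hash functions are illustrative assumptions, not what TiDB does.

```python
import hashlib


class CountMinSketch:
    def __init__(self, d, w):
        self.d = d  # number of rows, one hash function per row
        self.w = w  # number of counters per row
        self.table = [[0] * w for _ in range(d)]

    def _column(self, row, value):
        # A different salt per row stands in for d independent hash functions.
        h = hashlib.blake2b(repr(value).encode(), digest_size=8,
                            salt=row.to_bytes(8, "little"))
        return int.from_bytes(h.digest(), "little") % self.w

    def add(self, value, count=1):
        # Update one counter in each of the d rows.
        for row in range(self.d):
            self.table[row][self._column(row, value)] += count

    def estimate(self, value):
        # Collisions only inflate counters, so the minimum over the d rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._column(row, value)]
                   for row in range(self.d))


cms = CountMinSketch(d=4, w=1024)
for _ in range(5):
    cms.add("a")
cms.add("b", 2)
```

The estimate never falls below the true count; it can only overestimate when another value collides with the queried one in every row.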

Contributor

@tisonkun tisonkun left a comment

Finished a review cycle. This is a great post and almost ready to be merged. Please take a look at the comments.

winoros and others added 3 commits October 12, 2021 14:01
Co-authored-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Contributor

@tisonkun tisonkun left a comment

LGTM. Merging...

@winoros I pushed one more commit to fix grammar complaints. You can turn on spell check when writing.

@tisonkun tisonkun merged commit 561b187 into pingcap:master Oct 12, 2021
@tisonkun
Contributor

@all-contributors please add @winoros for content

@allcontributors
Contributor

@tisonkun

I've put up a pull request to add @winoros! 🎉

Development

Successfully merging this pull request may close these issues.

Write down the first version of Table Statistics section
6 participants