understanding-tidb: add introduction for statistics #70

winoros · 2021-08-08T18:51:07Z

Signed-off-by: Yiding Cui <winoros@gmail.com>

What issue does this PR solve?

close Write down the first version of Table Statistics section #58

What is changed:

Init the first version

Signed-off-by: Yiding Cui <winoros@gmail.com>

xxchan · 2021-08-11T06:10:19Z

src/understand-tidb/table-statistics.md

+
+If we don't know the size, we construct the histogram in the following way. We initialize the bucket depth to 1 for each bucket and insert the data as before. Once one bucket exceeds the needed depth, we double the depth of the bucket and combine two adjacent buckets into one bucket.
+
+### Count-Min Sketch (Legacy in TiDB)


Why did it become legacy?

... and where is the related code about count-min sketch?

@xxchan The explanation is given in the place where we introduce top-n.

@winoros so you can mention here briefly that it is superseded by top-n?
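For readers of this thread, the quoted paragraph about building a histogram when the data size is unknown can be illustrated with a minimal Python sketch. It is not TiDB's code. It assumes one reading of the paragraph: values arrive in sorted order, and when the last bucket is full and the bucket limit is already reached, adjacent buckets are merged pairwise and the per-bucket depth is doubled. The function name `build_histogram_unknown_size` and the parameter `max_buckets` are hypothetical.

```python
def build_histogram_unknown_size(sorted_values, max_buckets):
    """Build a histogram from sorted values without knowing their count up front.

    Each bucket is [lower, upper, count]. This is an illustrative reading of the
    quoted paragraph, not the TiDB implementation.
    """
    depth = 1      # current capacity of every bucket, starts at 1
    buckets = []
    for v in sorted_values:
        last_full = bool(buckets) and buckets[-1][2] >= depth
        if last_full and len(buckets) == max_buckets:
            # No room for a new bucket: merge adjacent buckets pairwise
            # and double the allowed depth.
            buckets = [
                [
                    buckets[i][0],
                    buckets[min(i + 1, len(buckets) - 1)][1],
                    sum(b[2] for b in buckets[i:i + 2]),
                ]
                for i in range(0, len(buckets), 2)
            ]
            depth *= 2
        if buckets and buckets[-1][2] < depth:
            # The last bucket still has room: extend it.
            buckets[-1][1] = v
            buckets[-1][2] += 1
        else:
            # Start a new bucket holding just this value.
            buckets.append([v, v, 1])
    return buckets


values = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
histogram = build_histogram_unknown_size(values, max_buckets=4)
```

The sketch never holds more than `max_buckets` buckets, at the price of a final depth that is a power of two rather than exactly `len(values) / max_buckets`.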

tisonkun

Thanks for your contribution @winoros ! Comments inline.

tisonkun · 2021-08-16T06:19:18Z

src/understand-tidb/table-statistics.md

@@ -1 +1,88 @@
-# Table Statistics
+# TiDB Statistics


why change the title?


tisonkun · 2021-08-16T10:09:03Z

src/understand-tidb/table-statistics.md

+
+A histogram splits the data into many buckets and describes each bucket with a few simple values, such as how many records fall into it. Histograms are widely used in RDBMSs for range estimation. There are two types of histogram, depending on the bucketing strategy: the equal-depth histogram and the equal-width histogram.
+
+We choose the equal-depth histogram according to the paper [Accurate estimation of the number of tuples satisfying a condition](https://dl.acm.org/citation.cfm?id=602294). Compared with the equal-width histogram, the equal-depth histogram has a better worst-case guarantee on the error rate. In an equal-depth histogram, the number of values falling into each bucket is as equal as possible. For example, suppose we want to split the record set `1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5` into 4 buckets. The result is `[1.6, 1.9], [2.0, 2.6], [2.7, 2.8], [2.9, 3.5]`: the depth of each bucket, i.e. the number of records in it, is 3. The graph is shown below.


It would be better to describe what "equal-width" means here, though it is straightforward.

I think the deeper information can be found in the paper. Also, if we introduce the equal-width case, we need to tell the reader why we choose the equal-depth one.

So you can write "the deeper information can be found in the paper, in conclusion, we choose the equal-depth histogram..." or so.

Because a reader who gets this far will wonder: you wrote "We have two different type of histogram depending on the bucketing strategy: equal-depth histogram and equal-width histogram." above, but you discuss only one of them. Why?

Your comment here would be better placed in the content itself than in a PR comment.
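To make the distinction discussed in this thread concrete, here is a minimal Python sketch of both bucketing strategies applied to the record set from the quoted paragraph. The function names are hypothetical and the code is only an illustration, not how TiDB builds its histograms. It assumes the input is already sorted and, for the equal-width case, that the values are not all identical.

```python
def equal_depth_buckets(sorted_values, num_buckets):
    """Return (lower, upper, count) buckets holding nearly equal numbers of values."""
    n = len(sorted_values)
    buckets = []
    for i in range(num_buckets):
        chunk = sorted_values[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets


def equal_width_buckets(sorted_values, num_buckets):
    """Return per-bucket counts when the value range is cut into equal intervals."""
    low, high = sorted_values[0], sorted_values[-1]
    width = (high - low) / num_buckets  # assumes high > low
    counts = [0] * num_buckets
    for v in sorted_values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - low) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


values = [1.6, 1.9, 1.9, 2.0, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.4, 3.5]
depth_hist = equal_depth_buckets(values, 4)
width_hist = equal_width_buckets(values, 4)
```

`depth_hist` reproduces the four buckets of depth 3 from the quoted text, while `width_hist` shows that equal-width buckets over the same data hold very uneven counts, which is where their weaker worst-case error bound comes from.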


tisonkun · 2021-08-16T10:14:43Z

src/understand-tidb/table-statistics.md

+
+If we don't know the size, we construct the histogram in the following way. We initialize the bucket depth to 1 for each bucket and insert the data as before. Once one bucket exceeds the needed depth, we double the depth of the bucket and combine two adjacent buckets into one bucket.
+
+### Count-Min Sketch (Legacy in TiDB)


... and where is the related code about count-min sketch?


tisonkun · 2021-08-16T10:17:00Z

src/understand-tidb/table-statistics.md

+
+This way, when querying how many times a value appears, the `d` hash functions are used again to find the position the value maps to in each row, and the minimum of these `d` counts is used as the estimate.
+
+### Top-N Value (Most Frequent Value)


Where is the related code about top-n value?

tisonkun · 2021-08-16T10:17:53Z

src/understand-tidb/table-statistics.md

+For dynamic updating of histograms, the industry generally has two approaches.
+
+- For each insertion or deletion, update the depth of the corresponding bucket. When a bucket becomes too deep, it is generally split by dividing its width equally, although this makes it difficult to determine the splitting point accurately and introduces errors.
+- Use the actual row count obtained from a query as feedback to adjust the histogram. This approach assumes that all buckets contribute uniformly to the error and uses the continuous value assumption to adjust all the buckets involved. However, the assumption of uniform errors often causes problems: for example, when a newly inserted value is larger than the maximum value of the histogram, the error caused by that value is spread across the histogram's buckets.


Why not move these paragraphs to the histogram section?

If so, what should be placed here?
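The first approach in the quoted list (update the bucket depth on every change, and split an overly deep bucket by halving its width) can be sketched as follows. This is an illustration only. The function name `insert_value`, the `[lower, upper, count]` bucket layout, and the `max_depth` parameter are hypothetical, and the even division of the count between the two halves is an assumption made because the sketch does not track individual values.

```python
def insert_value(buckets, value, max_depth):
    """Record one inserted value in a list of [lower, upper, count] buckets.

    Returns True if some bucket covered the value, False otherwise.
    """
    for i, (lower, upper, count) in enumerate(buckets):
        if lower <= value <= upper:
            count += 1
            if count <= max_depth:
                buckets[i][2] = count
            else:
                # The bucket is too deep: cut its width in half. We do not know
                # where the values really sit, so the count is divided evenly,
                # which is the source of error mentioned in the quoted text.
                mid = (lower + upper) / 2
                left = [lower, mid, count // 2]
                right = [mid, upper, count - count // 2]
                buckets[i:i + 1] = [left, right]
            return True
    return False  # the value lies outside every bucket


buckets = [[0.0, 10.0, 3], [11.0, 20.0, 1]]
covered = insert_value(buckets, 2.0, max_depth=3)
outside = insert_value(buckets, 25.0, max_depth=3)
```

After the first call the bucket `[0.0, 10.0]` is split at 5.0 even if every stored value is actually below 5.0, which shows why the splitting point is hard to determine accurately. The second call returns `False`, the case where a new value falls beyond the histogram's maximum.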

tisonkun · 2021-08-16T10:24:34Z

@dcalvin I'd appreciate it if you could schedule a review of this PR.

Co-authored-by: tison <wander4096@gmail.com>

eurekaka

IMHO, we should simplify the description about the principles of statistics and row count estimation, and illustrate more on the entry functions or code layout of this component, since it is a dev guide.

feitian124 · 2021-09-17T15:12:46Z

nice guide

feitian124 · 2021-09-17T15:17:24Z

> IMHO, we should simplify the description about the principles of statistics and row count estimation, and illustrate more on the entry functions or code layout of this component, since it is a dev guide.

It's very helpful for people who have little background in the database area.
As long as you understand the theory, the entry functions and code layout are just a matter of a few words.

tisonkun · 2021-10-05T04:17:08Z

@winoros thanks for the update. I'll review this PR today. You can ping me for review next time.

tisonkun

Comments inline. I will review the remaining parts later.


tisonkun · 2021-10-06T04:05:55Z

src/understand-tidb/table-statistics.md

+
+The Count-Min Sketch (CM sketch) is a data structure used to estimate query cardinality for equality predicates, joins, and so on, and it provides strong accuracy guarantees. Since its introduction in 2003 in the paper [An improved data stream summary: The count-min sketch and its applications](http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf), it has gained widespread use given its simplicity of construction and use.
+
+CM sketch maintains an array of `d*w` counts, and for each value, maps it to a column in each row using `d` separate hash functions and modifies the count value at those `d` positions. This is shown in the following figure.


What are `d` and `w`? Could you briefly define them here?
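As a reading aid for this question, here is a minimal Python sketch of a CM sketch in which `d` is the number of rows (one hash function per row) and `w` is the number of counters per row. The class name, the method names, and the choice of salted SHA-256 as the hash family are assumptions made for illustration. This is not the TiDB implementation.

```python
import hashlib


class CountMinSketch:
    """A d-by-w table of counters with one hash function per row."""

    def __init__(self, depth, width):
        self.depth = depth  # d: number of rows / hash functions
        self.width = width  # w: number of counters in each row
        self.table = [[0] * width for _ in range(depth)]

    def _position(self, row, value):
        # Derive a distinct hash function per row by mixing in the row index.
        digest = hashlib.sha256(f"{row}:{value!r}".encode()).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def insert(self, value, count=1):
        # Update one counter in every row.
        for row in range(self.depth):
            self.table[row][self._position(row, value)] += count

    def query(self, value):
        # Collisions can only inflate a counter, so the minimum over the
        # d rows is the tightest available estimate.
        return min(
            self.table[row][self._position(row, value)]
            for row in range(self.depth)
        )


cms = CountMinSketch(depth=4, width=64)
for _ in range(5):
    cms.insert("a")
cms.insert("b", 2)
estimate = cms.query("a")
```

The estimate never undercounts: `cms.query("a")` is at least 5, and it exceeds 5 only if `"a"` collides with another inserted value in every one of the `d` rows.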


tisonkun

Finished a review cycle. This is a great post and almost ready to be merged. Please take a look at the comments.


Co-authored-by: tison <wander4096@gmail.com>

Signed-off-by: tison <wander4096@gmail.com>

tisonkun

LGTM. Merging...

@winoros I pushed one more commit to fix grammar complaints. You can turn on spell check when writing.

tisonkun · 2021-10-12T11:33:20Z

@all-contributors please add @winoros for content

allcontributors · 2021-10-12T11:33:28Z

@tisonkun

I've put up a pull request to add @winoros! 🎉

understanding-tidb: add introduction for statistics

17a4229

Signed-off-by: Yiding Cui <winoros@gmail.com>

xxchan reviewed Aug 11, 2021


Merge branch 'master' into stats

2670550

yudongusa self-requested a review August 15, 2021 21:53

tisonkun reviewed Aug 16, 2021


feitian124 mentioned this pull request Aug 23, 2021

migrate test-infra to testify for statistics pingcap/tidb#27141

Closed

16 tasks

Apply suggestions from code review

83bbfc1

Co-authored-by: tison <wander4096@gmail.com>

yudongusa requested a review from eurekaka August 28, 2021 20:05

eurekaka reviewed Sep 2, 2021


winoros added 2 commits September 28, 2021 17:42

modify

39ef104

Merge branch 'master' into stats

d7f4a0f


tisonkun reviewed Oct 6, 2021


tisonkun reviewed Oct 8, 2021


winoros and others added 3 commits October 12, 2021 14:01

Apply suggestions from code review

7b104a5

Co-authored-by: tison <wander4096@gmail.com>

add appendix

362cfbc

grammar polish

e27f160

Signed-off-by: tison <wander4096@gmail.com>

tisonkun approved these changes Oct 12, 2021


tisonkun merged commit 561b187 into pingcap:master Oct 12, 2021

allcontributors bot mentioned this pull request Oct 12, 2021

docs: add winoros as a contributor for content #169

Merged




