Conversation

@lutter lutter commented Mar 19, 2025

When setting up a VidBatcher we have both accurate values for the range of vids and Postgres' estimate of the bounds for a histogram with roughly the same number of entries in each bucket.

As an example, say the actual min and max are 1 and 100, and the histogram bounds are [5, 50, 96]. We used to add min and max to these bounds, resulting in an ogive over [1, 5, 50, 96, 100]. With that, it seems that the bucket [1, 5] contains just as many entries as the bucket [5, 50], which is not what the Postgres statistics indicate. Using this ogive causes, e.g., pruning to increase the batch size quickly as it tries to get out of the [1, 5] bucket, resulting in a batch size that is far too big for the next bucket and a batch that can take a very long time.

The first and last entries of the bounds are Postgres' estimates of the min and max. We now simply replace the first and last bound with our known min and max, resulting in an ogive over [1, 50, 100], which reflects the statistics much more accurately and avoids impossibly short buckets.

@lutter lutter requested a review from zorancv March 19, 2025 11:50
@zorancv zorancv left a comment

Nice!

@lutter lutter merged commit 2038f1c into master Mar 21, 2025
6 checks passed
@lutter lutter deleted the lutter/prune branch March 21, 2025 13:47