Conversation

@lutter lutter commented Mar 19, 2025

When setting up a VidBatcher we have both accurate values for the range of vids and Postgres' estimate of the bounds for a histogram with roughly the same number of entries in each bucket.

As an example, say the actual min and max are 1 and 100, and the histogram bounds are [5, 50, 96]. We used to add min and max to these bounds, resulting in an ogive over [1, 5, 50, 96, 100]. With that, it seems that the bucket [1, 5] contains just as many entries as the bucket [5, 50], which is not what the Postgres statistics indicate. Using this ogive causes, e.g., pruning to increase the batch size quickly as it tries to get out of the [1, 5] bucket, resulting in a batch size that is far too big for the next bucket and a batch that can take a very long time.

The first and last entries of the bounds are Postgres' estimates of the min and max. We now simply replace the first and last bound with our known min and max, resulting in an ogive over [1, 50, 100], which reflects the statistics much more accurately and avoids impossibly short buckets.

@lutter lutter requested a review from zorancv March 19, 2025 11:50
@zorancv zorancv left a comment

Nice!

@lutter lutter merged commit 2038f1c into master Mar 21, 2025
6 checks passed
@lutter lutter deleted the lutter/prune branch March 21, 2025 13:47