
ApproxCDF bugfix #14765

Merged: hail-ci-robot merged 2 commits into main from ps-12-12-ApproxCDF_bugfix on Dec 13, 2024

@patrick-schultz (Member) commented Dec 12, 2024

Change Description

Fixes a bug which was reported in https://discuss.hail.is/t/arrayindexoutofboundsexception-using-cdf-combine/4008/6.

The bug was caused by an unenforced invariant in the ApproxCDF aggregator state. Roughly, the state consists of a number of "levels" (stored flattened into a single array), where each level contains a number of samples from the data. Each level also has a "capacity". The data structure assumes that whenever the array containing the flattened levels becomes full, at least one level must be at or above capacity and can therefore be compacted to free space. This is guaranteed as long as the size of the array is at least the sum of the capacities of the present levels: a full array then holds at least as many samples as the levels' total capacity, so some level must hold at least its own capacity.
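To make the invariant concrete, here is a minimal, hypothetical Python model of the state described above. It is not Hail's Scala implementation: the names `ToyCDFState`, `level_ends`, and `capacities`, the front-packed layout, and the rule that a level is compactable once it holds at least its capacity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCDFState:
    """Hypothetical, simplified model of the ApproxCDF aggregator state.

    items:      one flat array holding the samples of every level; levels are
                packed from the front and any unused slots sit at the end
    level_ends: level i occupies items[start:level_ends[i]], where start is
                0 for level 0 and level_ends[i - 1] otherwise
    capacities: capacities[i] is the (assumed) capacity of level i
    """
    items: List[float]
    level_ends: List[int]
    capacities: List[int]

    def level_size(self, i: int) -> int:
        start = self.level_ends[i - 1] if i > 0 else 0
        return self.level_ends[i] - start

    def is_full(self) -> bool:
        # Every slot of the flat array is occupied by some level's sample.
        return self.level_ends[-1] == len(self.items)

    def invariant_holds(self) -> bool:
        # The array must be at least as large as the total capacity.
        return len(self.items) >= sum(self.capacities)

    def find_compactable_level(self) -> Optional[int]:
        # Return the first level holding at least its capacity, if any.
        for i, cap in enumerate(self.capacities):
            if self.level_size(i) >= cap:
                return i
        return None


# Invariant holds: the array is full, and level 0 can be compacted.
ok = ToyCDFState(items=[0.0] * 6, level_ends=[4, 6], capacities=[4, 2])

# Invariant violated: the array is full, yet no level has reached capacity,
# so there is nothing to compact and no room for a new sample.
bad = ToyCDFState(items=[0.0] * 3, level_ends=[2, 3], capacities=[4, 2])
```

For `ok`, `is_full()` and `invariant_holds()` are both `True` and `find_compactable_level()` returns `0`. For `bad`, `is_full()` is `True` but `invariant_holds()` is `False` and `find_compactable_level()` returns `None`: the array is full and there is no level left to compact.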

In #13935, we changed the ApproxCDF aggregator to return the raw internal state and moved the computation of the CDF from that internal state into Python. The problem is that, before returning the states as Hail values, we "compress" them by shrinking the array of samples to hold only the samples actually present. When the exposed "combine" function later recreates ApproxCDF aggregator states from those returned values, the array of samples is therefore no longer large enough to satisfy the invariant.
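The failure mode can be sketched in the same spirit. Below, a hypothetical state is a plain `(items, level_ends, capacities)` tuple with levels packed from the front of `items`, and `compress` is an illustrative stand-in for the compression described above, not Hail's actual code. A small dataset leaves the levels well below capacity, so trimming the array down to the present samples drops it below the total capacity.

```python
from typing import List, Tuple

# Hypothetical state layout: (items, level_ends, capacities), with levels
# packed from the front of `items` and unused slots at the end.
State = Tuple[List[float], List[int], List[int]]


def invariant_holds(state: State) -> bool:
    items, _, capacities = state
    return len(items) >= sum(capacities)


def compress(state: State) -> State:
    """Keep only the slots that hold samples, discarding spare capacity."""
    items, level_ends, capacities = state
    used = level_ends[-1] if level_ends else 0
    return items[:used], list(level_ends), list(capacities)


# Two levels with capacities 4 and 2, but only three samples present
# (two in level 0, one in level 1), as can happen with a very small dataset.
before = ([1.0, 2.0, 3.0, 0.0, 0.0, 0.0], [2, 3], [4, 2])
after = compress(before)
```

`invariant_holds(before)` is `True` (6 slots for a total capacity of 6), while `after` keeps only 3 slots, so `invariant_holds(after)` is `False`. A state rebuilt directly from `after` starts out already violating the invariant.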

In this PR, I

  • add a Python test case that runs approx_cdf over a very small dataset (it fails on main)
  • fix ApproxCDFStateManager.fromData to "uncompress" the aggregator state
  • add a check for the violated invariant in the ApproxCDFStateManager constructor. (This isn't a perfect check, as the underlying state can be changed after construction, but I didn't see a way to make a more complete check without significant refactoring, and this would have caught the current bug.)
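Both code changes can be sketched with a hypothetical class: `ToyStateManager` stands in for `ApproxCDFStateManager`, and `from_data` for `ApproxCDFStateManager.fromData`. Padding the sample array with unused slots up to the total capacity is an assumed way to "uncompress"; the actual Scala change may differ in detail, and capacities are passed in explicitly here only for simplicity.

```python
from typing import List


class ToyStateManager:
    """Hypothetical stand-in for ApproxCDFStateManager (not Hail's API)."""

    def __init__(self, items: List[float], level_ends: List[int],
                 capacities: List[int]) -> None:
        # Constructor-time check of the invariant. As in the PR, this only
        # validates the state on construction, not after later mutation.
        if len(items) < sum(capacities):
            raise ValueError(
                f"sample array has {len(items)} slots but the levels need "
                f"{sum(capacities)}"
            )
        self.items = items
        self.level_ends = level_ends
        self.capacities = capacities

    @classmethod
    def from_data(cls, items: List[float], level_ends: List[int],
                  capacities: List[int]) -> "ToyStateManager":
        # "Uncompress": pad the compressed sample array with unused slots
        # until it is large enough for the total capacity again.
        needed = sum(capacities)
        padded = list(items) + [0.0] * max(0, needed - len(items))
        return cls(padded, list(level_ends), list(capacities))


compressed = [1.0, 2.0, 3.0]  # three samples, as returned after compression
restored = ToyStateManager.from_data(compressed, [2, 3], [4, 2])
```

`restored.items` has 6 slots, so the check passes. Calling `ToyStateManager(compressed, [2, 3], [4, 2])` directly raises `ValueError`, mirroring how a constructor check surfaces the violated invariant immediately instead of as an out-of-bounds index later on.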

Security Assessment

  • This change has no security impact

Impact Description

Low-level refactoring of non-security code

@patrick-schultz (Member, Author) commented:

This stack of pull requests is managed by Graphite.

@patrick-schultz marked this pull request as ready for review on December 12, 2024 at 17:42

@chrisvittal (Collaborator) left a comment:

Looks great! I have one suggestion to simplify the parameters you added to the fixture.

Two review comment threads on hail/python/test/hail/expr/test_expr.py (both now outdated)

@patrick-schultz force-pushed the ps-12-12-ApproxCDF_bugfix branch from e831a77 to 0e05095 on December 12, 2024 at 18:06

@patrick-schultz (Member, Author) commented:

Done!

@ehigham (Member) left a comment:

Well done for finding and fixing this so quickly

@hail-ci-robot merged commit 0f8d35b into main on Dec 13, 2024
@hail-ci-robot deleted the ps-12-12-ApproxCDF_bugfix branch on December 13, 2024 at 14:23
grohli pushed a commit to grohli/hail that referenced this pull request on Jan 27, 2025