[query] add less biased unsmoothed pdf to ggplot #13608

Merged
merged 12 commits into from Sep 21, 2023

Conversation

patrick-schultz
Collaborator

… and make the default

Examples
Histogram without setting min/max. Requires two passes over the data and has low resolution over the interesting part of the distribution:
Screenshot 2023-09-12 at 8 26 54 AM

Histogram with manual min/max. Most accurate, but different choices of number of bins cause different artifacts. Requires knowledge of distribution beforehand.
Screenshot 2023-09-12 at 8 28 19 AM

Smoothed `approx_cdf`-based pdf (old default). Smoothing causes large distortion.
Screenshot 2023-09-12 at 8 30 16 AM

New unsmoothed `approx_cdf`-based pdf. Single pass, works well for any distribution, needs no tuning or foreknowledge of the distribution.
Screenshot 2023-09-12 at 8 31 20 AM
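
For readers unfamiliar with the idea of reading a density off a CDF without smoothing, here is a minimal sketch. It is not the algorithm in this PR: the review fragments further down suggest the PR also takes error bounds on the ranks into account, which this sketch ignores. The function name `density_from_cdf` and its inputs (a list of strictly increasing x values and the CDF evaluated at them) are assumptions made for illustration.

```python
def density_from_cdf(xs, cdf):
    """Piecewise-constant density implied by CDF knots.

    xs  : strictly increasing x positions (hypothetical input format)
    cdf : CDF value at each x, nondecreasing, same length as xs
    Returns one density per interval [xs[i], xs[i + 1]].
    """
    if len(xs) != len(cdf) or len(xs) < 2:
        raise ValueError("need at least two knots, with matching lengths")
    densities = []
    for i in range(len(xs) - 1):
        width = xs[i + 1] - xs[i]
        if width <= 0:
            raise ValueError("xs must be strictly increasing")
        # The density on an interval is the slope of the CDF across it.
        densities.append((cdf[i + 1] - cdf[i]) / width)
    return densities


xs = [0.0, 1.0, 3.0]
cdf = [0.0, 0.5, 1.0]
densities = density_from_cdf(xs, cdf)
# Total mass is the sum of density * interval width.
total_mass = sum(d * (xs[i + 1] - xs[i]) for i, d in enumerate(densities))
```

Because each density is just a CDF slope, narrow intervals in dense regions give fine resolution there without choosing a bin count or range in advance.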

Collaborator

@ehigham ehigham left a comment

Thanks for making such an effort to make this more readable. I have a couple of small suggestions then we can merge this.

p = update_grid_size(p)
return p

def compute_single_error(s, failure_prob=failure_prob):
Collaborator

Do you intend to pull these inner functions out or can we just get failure_prob from the enclosing scope (ie remove the parameter)?

Collaborator Author

The parameter is there because it gets called with different values. This function computes the error bound on an estimated rank for a single value, with a given probability of exceeding the bound. We sometimes want an error bound that applies to all possible values simultaneously, with a given probability of any rank estimate exceeding the bound. Computing that involves computing an error bound for a single value with an appropriately smaller failure probability.
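
A minimal sketch of the relationship described above, for illustration only. The single-value bound here is a Hoeffding-style bound on an empirical CDF, used as a stand-in; the actual bound computed by `compute_single_error` in the PR is not shown in this thread and will differ. The names `single_error` and `simultaneous_error`, and their parameters, are hypothetical.

```python
import math


def single_error(n, failure_prob):
    """Stand-in error bound for the estimated CDF at ONE fixed value.

    Hoeffding: P(|F_n(x) - F(x)| > eps) <= 2 * exp(-2 * n * eps**2),
    so eps = sqrt(ln(2 / failure_prob) / (2 * n)).
    """
    return math.sqrt(math.log(2.0 / failure_prob) / (2.0 * n))


def simultaneous_error(n, num_values, failure_prob):
    """Bound that holds for ALL num_values values at once.

    Union bound: if each value fails with probability at most
    failure_prob / num_values, the chance that any of them fails
    is at most failure_prob.
    """
    return single_error(n, failure_prob / num_values)


one = single_error(1000, 0.05)
many = simultaneous_error(1000, 100, 0.05)
```

This is why the inner function takes `failure_prob` as a parameter rather than reading it from the enclosing scope: the simultaneous bound calls the same routine with a smaller probability.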

xi, yi = point_on_bound(i, upper)
return (yi - fy) / (xi - fx)

def update_min_max_slopes():
Collaborator

@ehigham ehigham Sep 18, 2023

I think it would be clearer to return a tuple and unpack that at the call site rather than use side effects:

min_slope, max_slope = min_max_slopes()
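
To make the suggestion concrete, here is a sketch of a helper that returns both slopes instead of assigning them through side effects. The body is a guess at what such a helper might compute (the steepest and shallowest line from a fixed point that stays between lower and upper bound points), and the explicit parameters `fixed`, `lower_pts` and `upper_pts` are hypothetical; the PR's version is a closure whose body is not shown here.

```python
def min_max_slopes(fixed, lower_pts, upper_pts):
    """Return (min_slope, max_slope) for lines through `fixed`.

    A line from `fixed` stays on or above every lower point only if
    its slope is at least the largest slope to a lower point, and on
    or below every upper point only if its slope is at most the
    smallest slope to an upper point. All points are assumed to lie
    strictly to the right of `fixed`.
    """
    fx, fy = fixed
    min_slope = max((y - fy) / (x - fx) for x, y in lower_pts)
    max_slope = min((y - fy) / (x - fx) for x, y in upper_pts)
    return min_slope, max_slope


# The caller unpacks the result; nothing outside the function is mutated.
min_slope, max_slope = min_max_slopes(
    fixed=(0.0, 0.0),
    lower_pts=[(1.0, 0.2), (2.0, 0.6)],
    upper_pts=[(1.0, 0.5), (2.0, 0.8)],
)
```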

max_slope = slope_from_fixed(ui, upper=True)

def fix_point_on_result(i, upper):
    nonlocal fx, fy, new_y, keep
Collaborator

same here. I guess I'm not a fan of this pattern.

Collaborator Author

Normally I'm not either, but here it felt like the best way to abstract out repeated steps of the algorithm. Maybe it would be clearer if this were a class, and the mutable state fields on the class?

I found this more readable, chunking up common updates to the state of the algorithm rather than repeating more low level changes, but readability is subjective and I'm happy to inline these if you think that's clearer.

Collaborator

As discussed, there's no need to change this one. You raised a good point that if this were a class and these variables were attributes of the class, then perhaps I wouldn't object, which is certainly true. I don't think it's worth rewriting this as a class. I think this is fine as you're indexing into and mutating state variables; just please define the variables before you use them in this function so it's clear what you're referencing.
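
A small sketch of the pattern agreed on here: shared state is defined at the top of the enclosing function, before the inner helper that updates it. The function `keep_increasing` and its logic are hypothetical and unrelated to the PR's algorithm; only the shape (a closure that mutates a list in place and rebinds a scalar via `nonlocal`) mirrors the discussion.

```python
def keep_increasing(ys):
    """Mark the values in ys that set a new running maximum."""
    # Shared state, defined before the helper that references it.
    keep = [False] * len(ys)
    last_y = float("-inf")

    def fix_point(i):
        # Rebinding a name from the enclosing scope requires nonlocal.
        nonlocal last_y
        # Item assignment mutates the list in place; no nonlocal needed.
        keep[i] = True
        last_y = ys[i]

    for i, y in enumerate(ys):
        if y > last_y:
            fix_point(i)
    return keep


mask = keep_increasing([1, 3, 2, 5, 5])
```

Reading top to bottom, every name the helper touches has already been introduced, which is the readability point raised in the review.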

@danking danking merged commit dae37d7 into hail-is:main Sep 21, 2023
8 checks passed

3 participants