Add nanquantile function #166

Merged: 7 commits, Oct 19, 2023


max-sixty
Collaborator

An attempt to solve #161

But:

  • I can't get it to broadcast with guvectorize while taking multiple quantiles. Somewhat understandably, it has no way of knowing that quantiles is a fixed list of values rather than an array to broadcast over.
  • It doesn't seem any faster! (Forgive the bad output; scroll right for timings. It's about the same for every array size except small arrays, where it's much slower.)
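
A minimal sketch of the broadcasting problem in the first bullet. It uses np.vectorize with a signature as a pure-NumPy stand-in for guvectorize (an assumption for illustration; the kernel in this PR is a numba gufunc), and the names single and multi are hypothetical.

```python
import numpy as np

# Core signature "(n),()->()": one 1-D slice and ONE scalar quantile per call.
single = np.vectorize(
    lambda a, q: np.nanquantile(a, q),
    signature="(n),()->()",
)

arr = np.arange(20, dtype=float).reshape(4, 5)
arr[0, 0] = np.nan

# A scalar quantile broadcasts fine: one result per row.
print(single(arr, 0.5).shape)  # (4,)

# An array of quantiles is treated as something to broadcast against the
# loop dimension (4,), not as a fixed set of values, so (4,) and (3,) clash.
try:
    single(arr, np.array([0.25, 0.5, 0.75]))
except ValueError as err:
    print("broadcast error:", err)

# Declaring the quantiles as their own core dimension, "(n),(m)->(m)",
# tells the machinery they are a fixed list consumed by every slice.
multi = np.vectorize(
    lambda a, qs: np.nanquantile(a, qs),
    signature="(n),(m)->(m)",
)
print(multi(arr, np.array([0.25, 0.5, 0.75])).shape)  # (4, 3)
```

With the second signature the quantile dimension lands last in the output, so a wrapper would still need to move it to wherever callers expect it.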

I haven't benchmarked the version at https://krstn.eu/np.nanpercentile()-there-has-to-be-a-faster-way/, or xclim's version.

If anyone wants to pick this up and test whether those versions are indeed faster, that would be helpful.

                       ok
[33.33%] ··· ========================================================================================================================= ======== ==========
                                                                        func                                                              n
             ------------------------------------------------------------------------------------------------------------------------- -------- ----------
              (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)     10     246±0μs
              (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)    1000    4.79±0ms
              (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)   100000   453±0ms
             ========================================================================================================================= ======== ==========

[66.67%] ··· benchmarks.Funcs.time_numpy                                                                                                                                       ok
[66.67%] ··· ========================================================================================================================= ======== ==========
                                                                        func                                                              n
             ------------------------------------------------------------------------------------------------------------------------- -------- ----------
              (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)     10     55.9±0μs
              (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)    1000    4.71±0ms
              (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)   100000   475±0ms
             ========================================================================================================================= ======== ==========

[100.00%] ··· benchmarks.Funcs.time_pandas                                                                                                                                      ok
[100.00%] ··· ========================================================================================================================= ======== ==========
                                                                         func                                                              n
              ------------------------------------------------------------------------------------------------------------------------- -------- ----------
               (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)     10     128±0μs
               (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)    1000    4.74±0ms
               (<numba._GUFunc 'nanquantile_single'>, <function Funcs.<lambda> at 0x15d24a310>, <function nanquantile at 0x1026a0870>)   100000   448±0ms
              ========================================================================================================================= ======== ==========


max-sixty commented Oct 18, 2023

OK, this is actually better news: np.nanquantile is only slow when it's reducing along an axis of a multi-dimensional array, e.g. axis=0, because the slow part is how it vectorizes over the remaining dimensions.

So the current benchmarks have, for a 300x100000 array:

  • numbagg.nanquantile at 578ms
  • np.nanquantile at 2800ms
  • pandas .quantile function at 451ms — faster than both
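
A rough way to see the effect described above, as a sketch rather than the benchmark suite used in this PR. The array is a hypothetical, much smaller stand-in for the 300x100000 case so that it runs quickly, and best_time is a helper made up for this example. As far as I understand, np.nanquantile handles NaNs one slice at a time along the reduced axis, which is what the second timing exercises.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shape, chosen only to keep the run short.
arr = rng.normal(size=(300, 2_000))
arr[rng.random(arr.shape) < 0.1] = np.nan


def best_time(fn, repeat=3):
    """Best-of-`repeat` wall time in seconds for a single call of `fn`."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


# One reduction over every element of the array.
t_flat = best_time(lambda: np.nanquantile(arr, 0.5))
# One reduction per column: 2,000 independent 300-element slices.
t_axis0 = best_time(lambda: np.nanquantile(arr, 0.5, axis=0))

print(f"flat:   {t_flat * 1e3:.1f} ms")
print(f"axis=0: {t_axis0 * 1e3:.1f} ms")
```

The absolute numbers depend on the machine and NumPy version, so only the ratio between the two lines is worth reading.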

So adding this to numbagg probably does make sense. Unfortunately, as long as we stay with gufuncs (which let this run over any number of dimensions), we'll probably only be able to compute a single quantile per call. Very open to ideas for fixing that...

I need to clean up the code a lot.

max-sixty changed the title from "WIP: nanquantile" to "Add nanquantile function" on Oct 19, 2023

max-sixty commented Oct 19, 2023

OK, a couple of hours of work showed me that computing multiple quantiles is in fact very possible, by managing the axes parameter passed to the gufunc.
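
A sketch of what managing the axes could look like, under stated assumptions: _core is an np.vectorize stand-in for the compiled kernel, nanquantile_sketch is a hypothetical wrapper, and the explicit np.moveaxis calls play the role that the axes keyword plays for a real gufunc. It is not the code merged here.

```python
import numpy as np

# Stand-in for the compiled kernel, with core signature "(n),(m)->(m)":
# one 1-D slice in, one value per requested quantile out.
_core = np.vectorize(
    lambda a, qs: np.nanquantile(a, qs),
    signature="(n),(m)->(m)",
)


def nanquantile_sketch(arr, quantiles, axis=-1):
    """Hypothetical wrapper: reduce `axis`, quantile dimension first."""
    arr = np.asarray(arr, dtype=float)
    qs = np.atleast_1d(np.asarray(quantiles, dtype=float))
    # Move the reduced axis last so it lines up with core dimension n.
    out = _core(np.moveaxis(arr, axis, -1), qs)
    # The core leaves the quantile dimension m last; move it to the
    # front, which is the layout np.nanquantile uses.
    out = np.moveaxis(out, -1, 0)
    # A scalar quantile should not leave a length-1 leading axis behind.
    return out[0] if np.ndim(quantiles) == 0 else out


arr = np.random.default_rng(0).normal(size=(6, 4, 5))
arr[0, 0, 0] = np.nan
qs = [0.25, 0.5, 0.75]

result = nanquantile_sketch(arr, qs, axis=1)
print(result.shape)  # (3, 6, 5)
print(np.allclose(result, np.nanquantile(arr, qs, axis=1)))  # True
```

With a real gufunc of that signature, the two moveaxis calls should correspond to passing something like axes=[(axis,), (0,), (0,)], which maps each operand's core dimension to an array axis. That mapping is my reading of NumPy's gufunc axes keyword, not a description of the merged implementation.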

So this works, and is about 4x faster than np.nanquantile. It's about 10% slower than np.quantile. pandas is surprisingly fast here: we're ~20% slower than its .quantile. (Though pandas' .quantile doesn't have the benefits of a gufunc, such as vectorizing over any combination of axes.) I think that's probably because numbagg is slower at sorting, whereas for standard loops it's extremely fast.
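
To make the role of sorting concrete, here is a sketch of a NaN-ignoring quantile built on a single np.sort call followed by cheap indexing. The function name nanquantile_via_sort is hypothetical, it only covers linear interpolation for a scalar q, and it is neither the kernel in this PR nor the krstn.eu or xclim version. In an approach like this, almost all of the work is the sort.

```python
import numpy as np


def nanquantile_via_sort(arr, q, axis=-1):
    """NaN-ignoring quantile (linear interpolation) via one sort along `axis`."""
    # np.sort places NaNs at the end, so valid values fill the front of each slice.
    a = np.sort(np.moveaxis(np.asarray(arr, dtype=float), axis, -1), axis=-1)
    n_valid = np.sum(~np.isnan(a), axis=-1, keepdims=True)
    # Fractional rank among the valid values; an all-NaN slice ends up
    # reading index 0, which is NaN, so the result is NaN as well.
    pos = q * np.maximum(n_valid - 1, 0)
    lo = np.floor(pos).astype(np.intp)
    hi = np.ceil(pos).astype(np.intp)
    lo_val = np.take_along_axis(a, lo, axis=-1)
    hi_val = np.take_along_axis(a, hi, axis=-1)
    # Interpolate between the two neighbouring order statistics.
    return (lo_val + (hi_val - lo_val) * (pos - lo))[..., 0]


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 40))
data[rng.random(data.shape) < 0.1] = np.nan

print(np.allclose(nanquantile_via_sort(data, 0.3, axis=0),
                  np.nanquantile(data, 0.3, axis=0)))  # True
```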

I need to clean up the benchmarks, but I'll merge for now.

This has mostly been me, but if anyone has substantive feedback on the code, that would be very welcome.

max-sixty merged commit 280e6c5 into numbagg:main on Oct 19, 2023
6 checks passed
max-sixty deleted the wip-nanpercentile branch on Oct 19, 2023, 00:51