Combining extensions in `stats._boost` into one #16583

rgommers · 2022-07-12T05:55:58Z

After gh-15770 I noticed another wheel size increase. With the current codegen it's easy to add new functions, but it looks like a separate new Python extension per function isn't all that sustainable:

$ ls -l build/scipy/stats/_boost/
...
-rwxr-xr-x  1 rgommers  staff  403276 Jul 12 07:13 beta_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  358557 Jul 12 07:13 binom_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  267553 Jul 12 07:13 hypergeom_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  366350 Jul 12 07:13 nbinom_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  337707 Jul 12 07:13 ncf_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  375308 Jul 12 07:13 ncx2_ufunc.cpython-39-darwin.so
# Note: directories removed from output

That's ~350kb per function, a lot of which I expect is Cython overhead. @mckib2 what do you think about putting them all in a single _ufuncs extension?
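As a rough check of the ~350kb figure, here is a small sketch that averages the sizes of the six files shown in the listing. The sizes are copied by hand from the `ls -l` output; the `...` line means other files were elided, so this covers only the visible entries.

```python
# Sizes in bytes, copied from the `ls -l` listing above (visible entries only).
sizes = {
    "beta_ufunc": 403276,
    "binom_ufunc": 358557,
    "hypergeom_ufunc": 267553,
    "nbinom_ufunc": 366350,
    "ncf_ufunc": 337707,
    "ncx2_ufunc": 375308,
}

total_bytes = sum(sizes.values())
mean_bytes = total_bytes / len(sizes)

print(f"total: {total_bytes} bytes over {len(sizes)} extensions")
print(f"mean:  {mean_bytes / 1000:.0f} kB per extension")
```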


mckib2 · 2022-07-12T18:03:44Z

> That's ~350kb per function

Per distribution (9 functions: pdf, sf, kurtosis, etc.), but I understand the concern. It should be smaller than that.

That would probably work -- we could also rewrite the code gen to use the numpy ufunc C API instead of the Cython API. Thinking quickly about it, that would reduce the amount of code gen necessary to deal with a variable number of arguments. I may have time to look at this later this week.

rgommers · 2022-07-12T19:39:45Z

That sounds like a good option too. Thanks Nicholas!

rgommers · 2024-03-28T09:13:21Z

As @mckib2 suggests in mckib2#33 (comment), it's probably better to get rid of the separate wrapping of Boost functionality in scipy.stats, and get everything needed from scipy.special.

czgdp1807 · 2024-03-29T10:40:35Z

In general, I think this is worth doing. I can take it up. I will have to look into scipy.special first. :-)

czgdp1807 · 2024-03-29T20:28:55Z

As per #20208 (comment) (with some decisions to be made on #20208 (comment)) I think this is worth doing, and I will start on it from Monday onwards.

czgdp1807 · 2024-04-01T10:32:41Z

I have a question. For the following, why don't we use the direct results from here? The mean, skewness, kurtosis and variance have closed-form expressions for the beta distribution, so why call into _boost? Can we return the closed-form expressions directly?

scipy/scipy/stats/_continuous_distns.py

Lines 702 to 707 in 8b22ba9

    
    def _stats(self, a, b):
        return (
            _boost._beta_mean(a, b),
            _boost._beta_variance(a, b),
            _boost._beta_skewness(a, b),
            _boost._beta_kurtosis_excess(a, b))

The only reason I can think of is that _boost does not use closed forms; instead we call into Boost's generic APIs, which accept the distribution and compute the metrics (maybe from the PDF). Whether closed forms are better depends on the distribution under consideration. Design-wise, at the SciPy level, I think using closed forms (where available) is equivalent to calling into the Boost APIs: closed forms have to be written using NumPy APIs, and calling into Boost requires writing wrappers for it, so both need specialised treatment. Maybe calling into Boost gives better precision or error handling? Depending on the distribution, closed forms can sometimes give better precision.
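As a minimal sketch of the closed-form route (not SciPy's actual implementation; the function name `beta_moments_closed_form` is hypothetical), the four beta-distribution quantities could be written with NumPy using the standard textbook formulas:

```python
import numpy as np


def beta_moments_closed_form(a, b):
    """Closed-form mean, variance, skewness and excess kurtosis of Beta(a, b).

    Hypothetical helper for illustration; assumes a > 0 and b > 0 and does
    no input validation or special handling of extreme parameter values.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    s = a + b

    mean = a / s
    variance = a * b / (s**2 * (s + 1))
    skewness = 2 * (b - a) * np.sqrt(s + 1) / ((s + 2) * np.sqrt(a * b))
    kurtosis_excess = (
        6 * ((a - b) ** 2 * (s + 1) - a * b * (s + 2))
        / (a * b * (s + 2) * (s + 3))
    )
    return mean, variance, skewness, kurtosis_excess


# Broadcasts over arrays like a ufunc-backed implementation would.
print(beta_moments_closed_form([1.0, 2.0], [1.0, 2.0]))
```

One way to compare this against the Boost-backed path would be to check it against `scipy.stats.beta.stats(a, b, moments="mvsk")` over a range of parameters, including very small and very large `a` and `b`, where precision differences are most likely to show up.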

Anyway, for now I am using closed forms just to check what needs to be done to remove stats._boost. Not everything needed for stats is present in scipy.special. We might need to add a bunch of APIs (like CDF and PDF calls into Boost) in https://github.com/scipy/scipy/blob/main/scipy/special/boost_special_functions.h

scipy/scipy/stats/_boost/include/func_defs.hpp

Lines 117 to 142 in 8b22ba9

    
    template<template <typename, typename> class Dst, class RealType, class...Args>
    RealType
    boost_mean(const Args ... args)
    {
        return boost::math::mean(Dst<RealType, Policy>(args...));
    }

    template<template <typename, typename> class Dst, class RealType, class...Args>
    RealType
    boost_variance(const Args ... args) {
        return boost::math::variance(Dst<RealType, Policy>(args...));
    }

    template<template <typename, typename> class Dst, class RealType, class...Args>
    RealType
    boost_skewness(const Args ... args) {
        return boost::math::skewness(Dst<RealType, Policy>(args...));
    }

    template<template <typename, typename> class Dst, class RealType, class...Args>
    RealType
    boost_kurtosis_excess(const Args ... args) {
        return boost::math::kurtosis_excess(Dst<RealType, Policy>(args...));
    }

    #endif // CLASS_DEF_HPP

I will open a PR with the beta distribution updated by tonight.

rgommers added the scipy.stats and enhancement labels Jul 12, 2022

mckib2 mentioned this issue Jul 28, 2022

WIP: BLD: generic boost codegen to handle stats+special functions mckib2/scipy#33

Open

4 tasks

mdhaber mentioned this issue Feb 1, 2023

BLD: Boost.Math standalone submodule #17432

Merged

rgommers mentioned this issue Mar 28, 2024

BUG: Test failures due to invalid value encountered in _beta_ppf on M2 mac #20208

Closed

rgommers mentioned this issue Mar 31, 2024

RFC: Adoption of std::mdspan like structures in C++ code #20334

Closed

This was referenced Apr 1, 2024

Remove stats._boost usage by following the same pattern as scipy.special._ufuncs and using scipy.special APIs #20371

Closed

MAINT/BLD: Remove stats._boost and add the distribution related functions to scipy.special._ufuncs #20393

Merged

rgommers added this to the 1.14.0 milestone Apr 10, 2024

steppi closed this as completed in #20393 Apr 10, 2024
