Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combining extensions in stats._boost into one #16583

Closed
rgommers opened this issue Jul 12, 2022 · 6 comments · Fixed by #20393 · May be fixed by mckib2/scipy#33
Closed

Combining extensions in stats._boost into one #16583

rgommers opened this issue Jul 12, 2022 · 6 comments · Fixed by #20393 · May be fixed by mckib2/scipy#33
Labels
enhancement A new feature or improvement scipy.stats
Milestone

Comments

@rgommers
Copy link
Member

rgommers commented Jul 12, 2022

After gh-15770 I noticed another wheel size increase. With the current codegen it's easy to add new functions, but it looks like a separate new Python extension per function isn't all that sustainable:

$ ls -l build/scipy/stats/_boost/
...
-rwxr-xr-x  1 rgommers  staff  403276 Jul 12 07:13 beta_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  358557 Jul 12 07:13 binom_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  267553 Jul 12 07:13 hypergeom_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  366350 Jul 12 07:13 nbinom_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  337707 Jul 12 07:13 ncf_ufunc.cpython-39-darwin.so
-rwxr-xr-x  1 rgommers  staff  375308 Jul 12 07:13 ncx2_ufunc.cpython-39-darwin.so
# Note; removed directories from output

That's ~350kb per function. A lot of which will be Cython overhead I expect. @mckib2 what do you think about putting them all in a single _ufuncs extension?

@rgommers rgommers added scipy.stats enhancement A new feature or improvement labels Jul 12, 2022
@mckib2
Copy link
Contributor

mckib2 commented Jul 12, 2022

That's ~350kb per function

Per distribution (9 functions: pdf, sf, kurtosis, etc.), but I understand the concern. It should be smaller than that.

That would probably work -- we could also rewrite the code gen to use the numpy ufunc C API instead of the Cython API. Thinking quickly about it, that would reduce the amount of code gen necessary to deal with variable number of arguments. I may have time to look at this later this week.

@rgommers
Copy link
Member Author

That sounds like a good option too. Thanks Nicholas!

@rgommers
Copy link
Member Author

As @mckib2 suggests in mckib2#33 (comment), it's probably better to get rid of the separate wrapping of Boost functionality in scipy.stats, and get everything needed from scipy.special.

@czgdp1807
Copy link
Member

In general, I think this is worth doing. I can take it up. I will have to look into scipy.special first. :-)

@czgdp1807
Copy link
Member

As per #20208 (comment) (with some decisions to be made on #20208 (comment)) I think this is worth doing and I will be start doing this from Monday onwards.

@czgdp1807
Copy link
Member

czgdp1807 commented Apr 1, 2024

I have a question. For the following, why don't we use direct results from here. Mean, skewness, kurtosis and variance have closed form expressions for beta distribution, so why calling into _boost? Can we return the closed form expressions directly?

def _stats(self, a, b):
return (
_boost._beta_mean(a, b),
_boost._beta_variance(a, b),
_boost._beta_skewness(a, b),
_boost._beta_kurtosis_excess(a, b))

The only reason that I can think of is that _boost is not using closed forms but we are calling into boost's generic APIs which accepts the distribution and computes the metrics using PDF (may be). Using closed forms is better or not depends on the distribution under consideration. Design wise at SciPy level I think using closed forms (if available) for a distribution is same as calling into boost APIs. Using closed forms needs us to write it using NumPy APIs. And calling into boost needs us to write wrappers for it. Both need specialised treatment. May be calling into boost gives better precision or error handling? Depending on the distribution sometimes closed forms can give better precision.

Anyways, for now I am using closed forms just to check what needs to be done to remove _stats/boost. Everything needed for stats is not present in scipy.special. We might need to add a bunch of APIs (like CDF, PDF calls into boost) in https://github.com/scipy/scipy/blob/main/scipy/special/boost_special_functions.h

template<template <typename, typename> class Dst, class RealType, class...Args>
RealType
boost_mean(const Args ... args)
{
return boost::math::mean(Dst<RealType, Policy>(args...));
}
template<template <typename, typename> class Dst, class RealType, class...Args>
RealType
boost_variance(const Args ... args) {
return boost::math::variance(Dst<RealType, Policy>(args...));
}
template<template <typename, typename> class Dst, class RealType, class...Args>
RealType
boost_skewness(const Args ... args) {
return boost::math::skewness(Dst<RealType, Policy>(args...));
}
template<template <typename, typename> class Dst, class RealType, class...Args>
RealType
boost_kurtosis_excess(const Args ... args) {
return boost::math::kurtosis_excess(Dst<RealType, Policy>(args...));
}
#endif // CLASS_DEF_HPP

I will open a PR with beta distribution updated by tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment