Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: reductions like np.sum can error on objects that handle ufuncs fine #21387

Closed
machow opened this issue Apr 24, 2022 · 5 comments
Closed
Labels

Comments

@machow
Copy link

machow commented Apr 24, 2022

Describe the issue:

Hey--I'm adding support for ufuncs to siuba, which allows users to specify what they want to do, before executing on a backend (e.g. pandas, sql).

Currently, I'm able to support ufuncs (in machow/siuba#415), but noticed that some functions that ultimately call ufuncs--like np.sum--don't make it that far.

It looks like this is because calls like np.sum(some_obj) follow this logic:

  • does hasattr(some_obj, "sum")? if so, use that method.
  • otherwise, call np.add.reduce(some_obj, ...)

This leads to a tricky situation where np.add.reduce works fine, but np.sum raises an error (because it goes through the first path, not __array_ufunc__).

Reproduce the code example:

Below are two examples: one in siuba, and one in polars

Siuba example

# from the PR...

from siuba import _
import numpy as np

# dispatches to _.__array_ufunc__(...)
np.add.reduce(_)

Outputs siuba's abstract syntax tree (AST) for the ufunc call:

█─'__call__'
├─█─'__custom_func__'
│ └─<function array_ufunc at 0x1078731f0>
├─_
├─<ufunc 'add'>
├─'reduce'
└─_

However, this code produces the wrong AST, because it does not dispatch to __array_ufunc__:

# incorrectly calls _.sum(...), since it exists
np.sum(_)

Polars example

It looks like this also causes issues with https://github.com/pola-rs/polars (cc @ritchie46):

import polars as pl

df = pl.read_csv("https://j.mp/iriscsv")

# okay
np.sqrt(df.sepal_length)

# error:TypeError: sum() got an unexpected keyword argument 'axis'
np.sum(df.sepal_length)

Error message:

Here's the full stacktrace for polars (siuba does not error, but erroneously has the .sum() method called, rather than __array_ufunc__.

TypeError                                 Traceback (most recent call last)
<ipython-input-16-3baf701c091e> in <module>
      6 np.sqrt(df.sepal_length)
      7 # error
----> 8 np.sum(df.sepal_length)

~/.virtualenvs/siuba/lib/python3.8/site-packages/numpy/core/overrides.py in sum(*args, **kwargs)

~/.virtualenvs/siuba/lib/python3.8/site-packages/numpy/core/fromnumeric.py in sum(a, axis, dtype, out, keepdims, initial, where)
   2294         return res
   2295
-> 2296     return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
   2297                           initial=initial, where=where)
   2298

~/.virtualenvs/siuba/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     82                 return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
     83             else:
---> 84                 return reduction(axis=axis, out=out, **passkwargs)
     85
     86     return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

TypeError: sum() got an unexpected keyword argument 'axis'

NumPy/Python version information:

 import sys, numpy; print(numpy.__version__, sys.version)
1.22.3 3.8.12 (default, Feb  1 2022, 11:29:20)
[Clang 13.0.0 (clang-1300.0.29.30)]
@seberg
Copy link
Member

seberg commented Apr 24, 2022

@machow removing the method path seems like a pretty big change. Do you have any idea for what NumPy could do to help you?

@machow
Copy link
Author

machow commented Apr 24, 2022

@seberg thanks for you quick reply! Do you mind giving some context on what makes the change feel big?

It makes sense that there's history / downstream libraries to consider, and I'd be curious to hear the important jobs this first path in wrapreduction is doing (It would be surprising for objects to need this, and also not have __array_ufunc__ implemented, but I am probably missing something important).

If _wrapreduction could prioritize calling an object's __array_ufunc__ method (if possible) over object's e.g. .sum() method, it would correctly dispatch the handling to siuba and polars (who could then handle or raise errors as needed!).

@seberg
Copy link
Member

seberg commented Apr 24, 2022

Calling the methods is much older than __array_ufunc__. __array_function__ is already prioritized over the method, maybe that can help you?
It is simply a backwards compatible break. Yes, checking for __array_ufunc__ first may well be an alternative at least for sum. However, what about functions like np.mean which have similar logic? Don't you run into the same issue there, and it isn't a ufunc at all?

@machow
Copy link
Author

machow commented Apr 24, 2022

__array_function__ is already prioritized over the method, maybe that can help you?

Ah, thanks for the pointer! If it allows dispatching more numpy methods (and is the future), it seems like the way to go!

Don't you run into the same issue there [np.mean], and it isn't a ufunc at all?

Yeah, same issue. I've noticed libraries like pandas are not using __array_function__, but still support things like np.mean (due to the first path?), so part of the trick might be handling expections of users who do not have a version of numpy with __array_function__ enabled, but are still used to calling np.mean(some_series).

I guess if I raise an error in __array_ufunc__ it will at least let siuba steer users with older versions of numpy away from using numpy funcs?. And then, if I implement __array_function__, people who could get functions like np.mean to work can use numpy funcs?

In any event, thanks for the context and nudge to the future :). I tested out __array_function__, and it seems to be working!

@machow machow closed this as completed Apr 24, 2022
@machow
Copy link
Author

machow commented Apr 30, 2022

Okay--just a note for posterity.. after reading through the __array_ufunc__, __array_function__ docs and some source code, it looks like this is the case:

  • functions like np.add are ufuncs, and not supported via __array_function__
  • functions like np.sum are not ufuncs, but might dispatch to a ufunc, but also might not (e.g. if you have a .sum method)
  • you can override the behavior of functions like np.sum using __array_function__.

So if I understand correctly, siuba needs to implement both __array_ufunc__ and __array_function__ (which is not crazy, I just misunderstood and thought one was meant to eventually replace the other).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants