ENH: Implementing NEP 18's __array_function__ #26380

Open
jakirkham opened this issue May 14, 2019 · 11 comments
Labels
Compat pandas objects compatibility with Numpy or Python functions Enhancement

Comments

@jakirkham
Contributor

It would be useful to have __array_function__ support as described in NEP 18 implemented for Pandas objects. This would allow users to run NumPy functions on Pandas objects while deferring to Pandas on how those operations should run.
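As a rough illustration of the protocol being requested, here is a minimal sketch. `WrappedArray` is a hypothetical container invented for this example, not a pandas class; only `np.concatenate` and the `__array_function__` signature come from NumPy and NEP 18.

```python
import numpy as np


class WrappedArray:
    """Hypothetical container used only for illustration (not a pandas class)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy hands over the public function plus the original arguments,
        # so the container decides how the operation should run.
        if func is np.concatenate:
            arrays = [a._data if isinstance(a, WrappedArray) else a for a in args[0]]
            return WrappedArray(np.concatenate(arrays, *args[1:], **kwargs))
        # Anything else is declined; NumPy then raises TypeError.
        return NotImplemented


# The NumPy function is called as usual, but the result keeps the wrapper type.
result = np.concatenate([WrappedArray([1, 2]), WrappedArray([3])])
```

The point of the sketch is only that the NumPy call site stays unchanged while the object controls the result type.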

@TomAugspurger
Contributor

This would be interesting to explore (in addition to Series.__array_ufunc__: #23293).

@jorisvandenbossche
Member

One thing we might think about: do we want to keep this working exactly the same as the current methods, or would we want to take the opportunity to make it more compatible with numpy?

Not sure there are more things, but what I am thinking of is the axis handling in reductions on a DataFrame (a rather specific case, maybe not relevant for many of the functions covered by __array_function__):

In [19]: df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'] ) 

In [20]: df.sum() 
Out[20]: 
a   -0.823846
b    6.850160
c    0.696525
dtype: float64

In [21]: np.sum(df)
Out[21]: 
a   -0.823846
b    6.850160
c    0.696525
dtype: float64

In [22]: np.sum(df.values) 
Out[22]: 6.7228383609003615

In [24]: np.sum(df.values, axis=0) 
Out[24]: array([-0.82384625,  6.85015992,  0.69652469])

In numpy, the default is axis=None, which reduces over all dimensions. In pandas we don't have this functionality, and (somewhat unfortunately IMO) axis=None means the default of 0 in practice. I think it would be nice to add this axis=None behaviour to pandas (optional of course; the default would stay the same). But if we do that, the question is whether we want to "respect" the default axis of np.sum (though that would probably need to go through a deprecation cycle anyway).

@TomAugspurger
Contributor

TomAugspurger commented Jun 19, 2019

@jorisvandenbossche note that for any / all, we do interpret axis=None as reduce all dimensions.

In [25]: df = pd.DataFrame({"A": [True, False], "B": [True, True]})

In [26]: df.all(axis=None)
Out[26]: False

IIRC, that was necessary for compatibility with a change in NumPy. I may be wrong, but I think we wanted to expand that interpretation of None to all the reduction methods.

edit: with a change in the default to axis=0, to maintain compatibility.
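A minimal sketch of that idea, assuming a hypothetical `TinyFrame` class that stands in for a DataFrame (it is not pandas code): the explicit default is axis=0, and axis=None is reserved for reducing over every dimension.

```python
import numpy as np


class TinyFrame:
    """Hypothetical 2-D container standing in for a DataFrame."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def sum(self, axis=0):
        # axis=0 is the explicit default, so existing calls keep their result;
        # axis=None is passed through and reduces over all dimensions.
        return self.values.sum(axis=axis)


frame = TinyFrame([[1, 2], [3, 4]])
per_column = frame.sum()       # column-wise, as before
total = frame.sum(axis=None)   # a single scalar over all values
```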

@jorisvandenbossche
Member

Yes, I think it would be good to add that option to all reduction methods. But apart from that, there is still the question of what np.sum(df) should ideally do by default: follow numpy's default axis=None, or pandas' default axis=0. Since numpy is calling, I personally would find it logical to follow numpy's default.
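To make the "follow numpy's default" option concrete, here is a sketch under stated assumptions: `TinyFrame` is a hypothetical stand-in for a DataFrame, and the dispatch logic is only one possible design, not what pandas does.

```python
import numpy as np


class TinyFrame:
    """Hypothetical 2-D container standing in for a DataFrame."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def sum(self, axis=0):
        # The method keeps the pandas-style default of axis=0.
        return self.values.sum(axis=axis)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.sum:
            # When called through np.sum, a missing axis means numpy's
            # default (axis=None), not the method's default of 0.
            axis = args[1] if len(args) > 1 else kwargs.get("axis", None)
            return self.sum(axis=axis)
        return NotImplemented


frame = TinyFrame([[1, 2], [3, 4]])
via_method = frame.sum()               # pandas-style default: per column
via_numpy = np.sum(frame)              # numpy-style default: everything
via_numpy_axis0 = np.sum(frame, axis=0)
```

In this sketch the two entry points deliberately disagree on the default, which is exactly the trade-off being discussed.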

@TomAugspurger
Contributor

TomAugspurger commented Jun 19, 2019 via email

@TomAugspurger
Contributor

I think we'll want to implement this for our arrays first (e.g. IntegerArray).

@jbrockmendel
Member

I've got a branch that implements __array_function__ for NDArrayBackedExtensionArray and is now passing the tests. The difficult part is that apparently there is no nice way to implement it for just a handful of np.foo functions (say just np.delete and np.repeat) without breaking every other np.foo function, many of which are called on EAs in our tests.
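The breakage can be reproduced with a small sketch. `PlainArray` and `PartialArray` are hypothetical classes made up for this example (they are not the branch's code): the first only defines `__array__`, the second adds `__array_function__` for `np.repeat` alone.

```python
import numpy as np


class PlainArray:
    """Hypothetical array-like that only supports coercion via __array__."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        return self._data if dtype is None else self._data.astype(dtype)


class PartialArray(PlainArray):
    """Same class, but with __array_function__ for np.repeat only."""

    def __array_function__(self, func, types, args, kwargs):
        if func is np.repeat:
            # Assumes the array is passed positionally, as in np.repeat(arr, 2).
            arr, *rest = args
            return PartialArray(np.repeat(arr._data, *rest, **kwargs))
        return NotImplemented


# Without __array_function__, np.unique silently coerces through __array__.
coerced = np.unique(PlainArray([1, 1, 2]))

# With it, np.repeat is dispatched to the class ...
repeated = np.repeat(PartialArray([1, 2]), 2)

# ... but every function that is not handled now raises instead of coercing.
try:
    np.unique(PartialArray([1, 1, 2]))
    raised = False
except TypeError:
    raised = True
```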

@TomAugspurger
Contributor

Dask handles that with a warning and a fallback:
https://github.com/dask/dask/blob/0ca77043bbbe015dcb69378ece54419332734f40/dask/array/core.py#L1423-L1444.

If we want something similar, we could cast to an ndarray as a fallback.
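A minimal sketch of a warn-and-fall-back approach, assuming a hypothetical `FallbackArray` class. This is not the Dask implementation linked above and not pandas code; it only unwraps top-level arguments.

```python
import warnings

import numpy as np


class FallbackArray:
    """Hypothetical array-like; not a pandas or Dask class."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.repeat:
            # Handled natively: the wrapper type is preserved.
            arr, *rest = args
            return FallbackArray(np.repeat(arr._data, *rest, **kwargs))
        warnings.warn(
            f"{func.__name__} is not implemented for FallbackArray; "
            "falling back to a plain ndarray.",
            RuntimeWarning,
        )
        # Only top-level arguments are unwrapped in this sketch.
        args = tuple(a._data if isinstance(a, FallbackArray) else a for a in args)
        kwargs = {
            k: v._data if isinstance(v, FallbackArray) else v
            for k, v in kwargs.items()
        }
        # Re-invoke the public function, now on plain ndarrays.
        return func(*args, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    uniques = np.unique(FallbackArray([1, 1, 2]))
```

The fallback returns a plain ndarray, so the wrapper type is lost for every function that is not handled explicitly.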

@shoyer
Member

shoyer commented Mar 20, 2021

My two cents:

  1. This would definitely make sense for pandas array objects. These objects have the semantics of 1D NumPy arrays.
  2. I'm not sure it makes sense for pandas.Series or pandas.DataFrame. These objects don't work like NumPy arrays, so implementing NumPy functions on them seems a little funny.

@jbrockmendel
Member

> If we want something similar, we could cast to an ndarray as a fallback.

The part I'm having trouble with is reliably identifying where self is in the args/kwargs when it could be e.g. hidden inside a tuple somewhere. I tracked the dask implementation back to base.compute before getting lost.
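One possible way to approach the nesting problem is a recursive walk over the arguments. The helper `replace_nested` and the `Box` class below are hypothetical names for this sketch; it only descends into plain tuples, lists and dicts, so other containers would need extra handling.

```python
import numpy as np


class Box:
    """Hypothetical array-like used to demonstrate the helper."""

    def __init__(self, data):
        self._data = np.asarray(data)


def replace_nested(obj, cls, convert):
    """Return a copy of obj with every instance of cls replaced by convert(instance).

    Only tuples, lists and dicts are searched; anything else is returned as-is.
    """
    if isinstance(obj, cls):
        return convert(obj)
    if isinstance(obj, tuple):
        return tuple(replace_nested(item, cls, convert) for item in obj)
    if isinstance(obj, list):
        return [replace_nested(item, cls, convert) for item in obj]
    if isinstance(obj, dict):
        return {key: replace_nested(val, cls, convert) for key, val in obj.items()}
    return obj


# args as NumPy would pass them for np.concatenate([box_a, box_b]):
# the objects are hidden inside a list inside the args tuple.
args = ([Box([1]), Box([2, 3])],)
plain_args = replace_nested(args, Box, lambda box: box._data)
joined = np.concatenate(*plain_args)
```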

@jakirkham
Contributor Author

cc @pentschev @rgommers (for vis)
