A catch-all protocol for numpy-like duck arrays #11129

Closed
mrocklin opened this issue May 21, 2018 · 11 comments

5 participants
@mrocklin
Contributor

commented May 21, 2018

There are several functions for which I would like to see protocols constructed. I've raised issues #11074 and #11128, but these are just special cases of a much larger issue that includes many operations. The sense I've gotten is that the process to change numpy takes a while, so I'm inclined to find a catch-all solution that can serve as a stopgap while things evolve.

To that end I propose that duck-arrays include a method that returns a module that mimics the numpy namespace:

class ndarray:
    def __array_module__(self):
        import numpy as np
        return np    

class DaskArray:
    def __array_module__(self):
        import dask.array as da
        return da
        
class CuPyArray:
    def __array_module__(self):
        import cupy as cp
        return cp

class SparseArray:
    def __array_module__(self):
        import sparse
        return sparse
...

Then, in various functions like stack or concatenate, we check for these modules:

import numpy

def stack(args, **kwargs):
    # Collect the distinct modules advertised by the arguments
    modules = {arg.__array_module__() for arg in args}
    if len(modules) == 1:
        (module,) = modules
        if module is not numpy:
            # All arguments share one non-NumPy module: defer to it
            return module.stack(args, **kwargs)
    ...

There are likely several things wrong with the implementation above, but my hope is that it gets the general point across: we'll dispatch wholesale to the module of the provided duck arrays.
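A minimal, self-contained sketch of the dispatch described above, using only the standard library. `ToyArray`, the `toy` module object, and the error behaviour for mixed arguments are assumptions made for illustration; none of them come from NumPy or from this proposal.

```python
import types


class ToyArray:
    """Hypothetical duck array that wraps a list and names its namespace."""

    def __init__(self, data):
        self.data = list(data)

    def __array_module__(self):
        return toy  # the stand-in module defined below


def _toy_stack(arrays, **kwargs):
    # Stand-in for a library-specific stack(): collect the payloads.
    return ToyArray([a.data for a in arrays])


# Hypothetical stand-in for a module such as dask.array or sparse.
toy = types.ModuleType("toy")
toy.stack = _toy_stack


def stack(arrays, **kwargs):
    """Defer to the single module shared by all arguments."""
    arrays = list(arrays)
    modules = set()
    for a in arrays:
        get_module = getattr(a, "__array_module__", None)
        if get_module is None:
            raise TypeError(f"{type(a).__name__} has no __array_module__")
        modules.add(get_module())
    if len(modules) != 1:
        # Assumption: mixed or empty inputs are rejected in this sketch.
        raise TypeError("arguments do not share a single array module")
    (module,) = modules
    return module.stack(arrays, **kwargs)
```

Calling `stack([ToyArray([1, 2]), ToyArray([3, 4])])` is forwarded to `toy.stack`, while mixing a `ToyArray` with a plain list raises `TypeError` in this sketch.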

cc @shoyer @hameerabbasi @njsmith @ericmjl

@ngoldbaum

commented May 21, 2018

This is interesting in that it's a bit more general than the __array_concatenate__ proposal. See e.g. the discussion in #4164. I don't think that proposal ever made it into a full NEP.

@mrocklin

Contributor Author

commented May 21, 2018

Right. My hope would be that this would be a placeholder for the common case while more specific protocols, like __array_concatenate__, evolve to support other situations, such as arrays of different types.

@hameerabbasi

Contributor

commented May 21, 2018

My hope would be that this would be a placeholder for the common case while more specific protocols, like __array_concatenate__ evolve to support other situations, like where there are arrays of different types.

Why restrict this module protocol to certain types at all? We can follow the same algorithm as, for example, __array_ufunc__. Here's some example code:

(Apologies for the long post)

sandbox.py

def variable_dispatch(name, args, **kwargs):
    for arg in args:
        if hasattr(arg, '__array_module__'):
            module = arg.__array_module__()

            if hasattr(module, name):
                retval = getattr(module, name)(args, **kwargs)

                if retval is not NotImplemented:
                    return retval

    raise TypeError('This operation is not possible with the supplied types.')


def dispatch(name, *args, **kwargs):
    for arg in args:
        if hasattr(arg, '__array_module__'):
            module = arg.__array_module__()

            if hasattr(module, name):
                retval = getattr(module, name)(*args, **kwargs)

                if retval is not NotImplemented:
                    return retval

    raise TypeError('This operation is not possible with the supplied types.')


def where(*args, **kwargs):
    return dispatch('where', *args, **kwargs)


def stack(args, **kwargs):
    return variable_dispatch('stack', args, **kwargs)

In[2]: import numpy as np
In[3]: import sparse
In[4]: import sandbox
In[5]: class PotatoArray(sparse.COO):
  ...:     def __array_module__(self):
  ...:         return sparse
  ...:     
In[6]: x = PotatoArray(np.eye(5))
In[7]: y = PotatoArray(np.zeros((5, 5)))
In[8]: condition = PotatoArray(np.ones((5, 5), dtype=np.bool_))
In[9]: result = sandbox.where(condition, x, y)
In[10]: sandbox.where(condition, x, y)
Out[10]: <COO: shape=(5, 5), dtype=float64, nnz=5>
In[11]: sandbox.where(condition, x, y).todense()
Out[11]: 
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
In[12]: sandbox.stack([x, y], axis=0)
Out[12]: <COO: shape=(2, 5, 5), dtype=float64, nnz=5>
In[13]: sandbox.stack([x, y], axis=0).todense()
Out[13]: 
array([[[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]],

       [[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]])
In[14]: class A():
   ...:     pass
   ...: 
In[15]: sandbox.where(A(), x, y)
Traceback (most recent call last):
  File "/anaconda3/envs/sparse/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-775f69206c32>", line 1, in <module>
    sandbox.where(A(), x, y)
  File "/Users/hameerabbasi/PycharmProjects/sparse/sandbox.py", line 30, in where
    return dispatch('where', *args, **kwargs)
  File "/Users/hameerabbasi/PycharmProjects/sparse/sandbox.py", line 26, in dispatch
    raise TypeError('This operation is not possible with the supplied types.')
TypeError: This operation is not possible with the supplied types.

@shoyer

Member

commented May 22, 2018

My main concern with this approach is that top-level functions should be raising TypeError rather than returning NotImplemented.

For example, consider Python arithmetic (on which __array_ufunc__ was modeled) between two custom types that implement the appropriate special methods (__add__ and __radd__) but don't know about each other:

def _not_implemented(*args, **kwargs):
  return NotImplemented

class A:
  __add__ = __radd__ = _not_implemented

class B:
  __add__ = __radd__ = _not_implemented

a = A()
b = B()
a.__add__(b)  # NotImplemented
a.__radd__(b)  # NotImplemented
a + b  # TypeError: unsupported operand type(s) for +: 'A' and 'B'

However, I do like the idea of a generic method for NumPy functions that aren't ufuncs. I would still make this a method on array objects, though, e.g., __array_function__. NumPy's implementation of func would call arg.__array_function__(func, *args, **kwargs) in turn on each array argument to a function, and return the first result that is not NotImplemented.
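The calling side of this idea can be sketched as follows. This is only an illustration of the loop described above, using the method signature as written in this comment; `dispatch_array_function`, `concatenate_like`, `Declines`, and `Handles` are hypothetical names, not NumPy APIs.

```python
def dispatch_array_function(func, *args, **kwargs):
    """Try each argument's __array_function__ in turn."""
    for arg in args:
        handler = getattr(type(arg), "__array_function__", None)
        if handler is None:
            continue  # this argument does not take part in the protocol
        result = handler(arg, func, *args, **kwargs)
        if result is not NotImplemented:
            return result
    # The top-level function raises instead of leaking NotImplemented.
    raise TypeError(
        f"no implementation of {func.__name__!r} for the supplied types"
    )


def concatenate_like(*arrays):
    """Hypothetical public function that only dispatches."""
    return dispatch_array_function(concatenate_like, *arrays)


class Declines:
    def __array_function__(self, func, *args, **kwargs):
        return NotImplemented


class Handles:
    def __array_function__(self, func, *args, **kwargs):
        return f"{func.__name__} handled by Handles"
```

With these toy classes, `concatenate_like(Declines(), Handles())` falls through to `Handles`, and `concatenate_like(Declines(), Declines())` raises `TypeError`, mirroring how `a + b` behaves in the arithmetic example.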

In most cases, you could write something like the following:

import dask.array as da

class DaskArray:
    def __array_function__(self, func, *args, **kwargs):
        if (not hasattr(da, func.__name__) or
                not all(isinstance(arg, HANDLED_TYPES) for arg in args)):
            return NotImplemented
        return getattr(da, func.__name__)(*args, **kwargs)

@mattip

Member

commented May 22, 2018

I don't think that proposal ever made it into a full NEP

There is a NEP PR, #10706, which has yet to be merged.

@shoyer

Member

commented May 22, 2018

@njsmith and I want to revisit #10706 in a follow-on NEP. The conclusion of our in-person discussion a few months ago (see our notes) was that it would be better to introduce a protocol for a duck-array equivalent of asarray(). But that still leaves the separate problem (addressed here) of what the dispatch mechanism for generic array functions should look like.

From our notes:

Focus on protocols

Historically, numpy has had lots of success at interoperating with third-party objects by defining protocols, like __array__ (asks an arbitrary object to convert itself into an array), __array_interface__ (a precursor to Python’s buffer protocol), and __array_ufunc__ (allows third-party objects to support ufuncs like np.exp).

NEP 16 took a different approach: we need a duck-array equivalent of asarray, and it proposed to do this by defining a version of asarray that would let through objects which implemented a new AbstractArray ABC. As noted above, we now think that trying to define an ABC is a bad idea for other reasons. But when this NEP was discussed on the mailing list, we realized that even on its own merits, this idea is not so great. A better approach is to define a method that can be called on an arbitrary object to ask it to convert itself into a duck array, and then define a version of asarray that calls this method.

This is strictly more powerful: if an object is already a duck array, it can simply return self. It allows more correct semantics: NEP 16 assumed that asarray(obj, dtype=X) is the same as asarray(obj).astype(X), but this isn’t true. And it supports more use cases: if h5py supported sparse arrays, it might want to provide an object which is not itself a sparse array, but which can be automatically converted into a sparse array. See NEP <XX, to be written> for full details.

The protocol approach is also more consistent with core Python conventions: for example, see the __iter__ method for coercing objects to iterators, or the __index__ protocol for safe integer coercion. And finally, focusing on protocols leaves the door open for partial duck arrays, which can pick and choose which subset of the protocols they want to participate in, each of which has well-defined semantics.

Conclusion: protocols are one honking great idea – let’s do more of those.
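The conversion protocol described in these notes might look roughly like the sketch below. The method name `__duckarray__`, the helper `as_duck_array`, and both toy classes are placeholders chosen for illustration; the notes do not name the method, and raising `TypeError` for unsupported objects is an assumption of this sketch.

```python
class DuckArray:
    """Toy duck array: already a duck array, so it returns itself."""

    def __init__(self, values):
        self.values = list(values)

    def __duckarray__(self):
        return self


class LazyDataset:
    """Not a duck array itself, but convertible into one on request."""

    def __init__(self, values):
        self._values = values

    def __duckarray__(self):
        return DuckArray(self._values)


def as_duck_array(obj):
    """Ask an arbitrary object to convert itself into a duck array."""
    method = getattr(type(obj), "__duckarray__", None)
    if method is None:
        # Assumption: objects without the method are rejected here.
        raise TypeError(
            f"{type(obj).__name__} cannot be converted to a duck array"
        )
    return method(obj)
```

`LazyDataset` plays the role of the h5py-style object from the notes: it is not a duck array, yet `as_duck_array` can still obtain one from it.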

@hameerabbasi

Contributor

commented May 23, 2018

I'm neutral between @shoyer's and @mrocklin's proposals, but I do see a couple of issues with @shoyer's:

  • If something is in a submodule (like np.random), then it won't work. Obviously, __array_function__ could be modified to pass the submodule along.
  • Maybe this is just me, but I'd like to create a small differentiation between concatenate/stack-like functions (list of duck array arguments) and ones like where (relatively fixed number of arguments) without having to specifically match the names myself.
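One possible reading of the first point is that the implementing side would receive a dotted path such as `random.normal` and look it up inside its own namespace. A small sketch under that assumption, where `resolve` and the `duck` module object are hypothetical and built only for this example:

```python
import types


def resolve(module, dotted_name):
    """Look up a name such as 'random.normal' inside a duck-array module.

    Returns None when any component of the path is missing.
    """
    obj = module
    for part in dotted_name.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj


# Hypothetical duck-array namespace with one submodule.
duck = types.ModuleType("duck")
duck.random = types.ModuleType("duck.random")
duck.random.normal = lambda size: [0.0] * size
duck.stack = lambda arrays: list(arrays)
```

Here `resolve(duck, "random.normal")` finds the submodule function, while `resolve(duck, "linalg.svd")` returns `None`, which a caller could translate into `NotImplemented`.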

On the other hand, @mrocklin's has certain issues as well:

  • We have to mirror the np module, which could be restrictive.

As a side note, sparse.where returning NotImplemented here is a side effect of it reusing the same function as __array_ufunc__ without interpreting the result and raising an error.

All-in-all, I'm definitely for a catch-all protocol vs many separate protocols. It may be a bit "quick-and-dirty", but I'm rather fine with that.

@mattip

Member

commented May 29, 2018

This discussion seems to have evolved into PR #11189

@mattip

Member

commented Oct 3, 2018

Can we close this? The ideas all seem to be in NEP 18 (__array_function__), and implementation is in progress in #12028.

@hameerabbasi

Contributor

commented Oct 3, 2018

I suggest adding a GitHub magic comment to #12028.

Edit: My bad, I thought that was a PR. Yes, this can be closed as a duplicate, leaving #12028 as the canonical issue.

@mattip

Member

commented Oct 15, 2018

Closing. Please open a new issue if there are discussion points not covered in NEPs 18 and 16.

@mattip mattip closed this Oct 15, 2018
