
Implement gufunc(pyfunc, ...) similar to numpy.vectorize but without the vectorize part #10526

Open
magonser opened this issue Feb 5, 2018 · 8 comments

magonser commented Feb 5, 2018

This issue was suggested in dask/dask#3109

My current understanding is that numpy.vectorize provides a way to

  1. provide a Python function,
  2. assign it a signature with information about core dimensions,
  3. bind input data to it, where two things happen:
    • __array_ufunc__ is called, if present, and/or
    • the loop dimensions (according to the signature) of the input arrays are broadcast against each other,
  4. iteratively call the Python function over all loop-dimension entries.

I would like to suggest implementing an additional wrapper, just like numpy.vectorize, which does all of the steps above except step 4. That is, a Python function could be wrapped as a gufunc and given a signature, and when input data is bound, the same step 3 is applied.
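
For illustration, here is a minimal sketch of what numpy.vectorize with a signature does today, and (in the comments) of the kind of wrapper being proposed; the np.gufunc name is purely hypothetical:

```python
import numpy as np

# Steps 1-4 as numpy.vectorize does them today: the inner function sees one
# core-dimension slice at a time, and step 4 loops it over every entry of the
# broadcast loop dimensions.
inner = lambda v: v.sum()                        # operates on a single (n,) vector
vec = np.vectorize(inner, signature="(n)->()")
a = np.random.rand(3, 4, 5)                      # loop dims (3, 4), core dim n=5
print(vec(a).shape)                              # (3, 4) -- inner is called 12 times

# The proposal: the same signature handling and input binding (steps 1-3), but
# the wrapped function is assumed to be vectorized already and is called once:
#   gf = np.gufunc(lambda x: x.sum(axis=-1), signature="(n)->()")   # hypothetical API
#   gf(a)                                        # single call, no Python-level loop
```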

The benefit is that the same data-binding methodology and interface are used if the user already provides a vectorized implementation of the Python function. This becomes especially important for interoperability with other libraries, e.g. dask.

hpaulj commented Feb 12, 2018

np.vectorize is Python code, so you should be able to customize your own version. A working example would be easier to understand.

np.vectorize creates a callable object. The init setup doesn't do much; most of the work is done when called with the arguments. That includes broadcasting (based on the signature and array dimensions), the ndindex iteration, and massaging the results to fit the signature. I don't see how you can separate out step 4.

In step 3, I don't see any use or reference to __array_ufunc__.
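
To make that concrete, here is a minimal hand-rolled sketch of such a customized version (a hypothetical gufunc_like helper, single input, one group of core dimensions): it does the setup of steps 1-3 but skips the ndindex loop of step 4.

```python
import numpy as np

# Hypothetical helper: mimic the setup np.vectorize does, but hand the whole
# array to the user's (already vectorized) function instead of looping with
# np.ndindex over the loop dimensions.
def gufunc_like(pyfunc, core_ndim=1):
    def wrapper(x):
        x = np.asanyarray(x)
        loop_shape = x.shape[:x.ndim - core_ndim]      # everything except the core dims
        out = pyfunc(x)                                # single call -- no step 4 loop
        # sanity check: the leading output dimensions should match the loop shape
        assert np.shape(out)[:len(loop_shape)] == loop_shape
        return out
    return wrapper

row_norm = gufunc_like(lambda x: np.sqrt((x ** 2).sum(axis=-1)))
print(row_norm(np.ones((2, 3, 4))).shape)              # (2, 3)
```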

shoyer commented Feb 27, 2018

It seems like there are at least two options here:

  1. A standard way to write "duck ufuncs" that aren't actual ufuncs, but which could safely make use of __array_ufunc__. This is a little tricky, because not every part of the ufunc API would necessarily make sense for duck ufuncs.
  2. An interface for turning "duck ufuncs" into actual ufuncs.

I think option (2) would be preferred, but it's not clear to me that it's possible to do with the current ufunc API (admittedly, I don't understand it well).
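
As a rough, hypothetical sketch of option (1), with the caveat that existing __array_ufunc__ implementations may refuse a first argument that is not a real np.ufunc (which is exactly the tricky part):

```python
import numpy as np

# Hypothetical "duck ufunc": offer the inputs a chance to take over via
# __array_ufunc__ before falling back to the plain Python implementation.
class DuckUfunc:
    def __init__(self, pyfunc, signature):
        self.pyfunc = pyfunc
        self.signature = signature

    def __call__(self, *inputs, **kwargs):
        for x in inputs:
            handler = getattr(type(x), "__array_ufunc__", None)
            if handler is not None and not isinstance(x, np.ndarray):
                # pass self where a real np.ufunc would normally go
                result = handler(x, self, "__call__", *inputs, **kwargs)
                if result is not NotImplemented:
                    return result
        # fallback: assume pyfunc is already vectorized over the loop dimensions
        return self.pyfunc(*inputs, **kwargs)
```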

mrocklin commented Jul 8, 2018

As a concrete example, consider Scikit-Learn's Estimator.predict method. This typically takes an (n, m)-shaped array and produces an n-shaped array, but it generally just broadcasts along the first dimension, so its signature is probably something like (m)->().

It would be useful for Scikit-Learn to have some mechanism to say "this function can be broadcast in the following way" and to defer to objects that can do that sort of broadcasting (using the __array_ufunc__ protocol) when they are provided as inputs.

Concretely, it would be nice for downstream projects to be able to say the following:

```python
class Estimator:
    @numpy.broadcastable('(m)->()')
    def predict(self, X):
        ...
        return y
```

Then, if a user provides something like a dask array:

```python
estimator = Estimator(...)
estimator.fit(...)

estimator.predict(my_dask_array)
```

Ideally, the decorated predict method would then go through the normal checks for __array_ufunc__ and hand control over to the dask array object.
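
For what it's worth, a decorator along these lines could already dispatch on the input type without new NumPy machinery; this is only a hypothetical sketch, assumes dask.array.apply_gufunc is available, and uses a crude duck-type check rather than the __array_ufunc__ protocol itself:

```python
import numpy as np

def broadcastable(signature):
    """Hypothetical decorator: defer to dask for dask inputs, call directly otherwise."""
    def decorate(func):
        def wrapper(self, X):
            if hasattr(X, "dask"):                      # crude check for a dask array
                import dask.array as da
                # apply_gufunc expects a function that is already vectorized
                # over the loop dimensions, which predict is
                return da.apply_gufunc(lambda x: func(self, x), signature, X)
            return func(self, np.asarray(X))            # plain arrays: call as-is
        return wrapper
    return decorate
```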

mattip commented Jul 8, 2018

See a possible implementation in PR #11061, which was rejected for matmul.

@eric-wieser

> but it generally just broadcasts along the first dimension

A signature of (m)->() broadcasts along all but the last dimension; in einsum notation, it is `...j->...`. It sounds like predict doesn't actually broadcast, and is either `ij->i` or `j->`, with no higher-dimensional versions.
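
To make the distinction concrete, a (m)->() signature loops over every leading axis, not just the first:

```python
import numpy as np

# A (m)->() signature treats *all* leading axes as loop dimensions.
f = np.vectorize(lambda row: row.sum(), signature="(m)->()")
x = np.ones((2, 3, 4))
print(f(x).shape)    # (2, 3): every axis except the last (core) one is looped over
```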

mrocklin commented Jul 10, 2018 via email

mhvk commented Jul 10, 2018

We do have np.frompyfunc, which creates a proper ufunc from a Python function (with object dtype), so I guess ideally one would have an equivalent that creates a gufunc. But this is a bit tricky to implement, since it would need the ndarray implementation to actually extract sub-arrays on each iteration instead of just passing on pointers and strides.
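
For reference, the scalar case that np.frompyfunc already covers:

```python
import numpy as np

# np.frompyfunc builds a real (object-dtype) ufunc from a Python scalar function,
# so it participates in __array_ufunc__ dispatch; the missing piece is a gufunc
# equivalent that would receive core-dimension sub-arrays instead of scalars.
py_add = np.frompyfunc(lambda a, b: a + b, 2, 1)    # nin=2, nout=1
print(py_add(np.arange(3), 10))                     # [10 11 12] with object dtype
print(isinstance(py_add, np.ufunc))                 # True
```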

mattip commented Jul 10, 2018

If we design such a wrapper, it should handle these better than the current gufunc mechanism does:

  • memory overlap requirements and temporary output buffer allocation
  • specifying requirements for contiguous memory layout
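
For example, real gufuncs already buffer when input and output memory overlap; a Python-level wrapper would need the same care (illustrative sketch using np.shares_memory):

```python
import numpy as np

# Overlapping input/output memory is the kind of case a wrapper must handle:
# here the input view b aliases the output array a.
a = np.arange(6.0)
b = a[::-1]                        # reversed view sharing memory with a
print(np.shares_memory(a, b))      # True -> naive in-place writes would corrupt b
np.add(a, b, out=a)                # NumPy detects the overlap and buffers internally
print(a)                           # [5. 5. 5. 5. 5. 5.]
```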
