Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expected behaviour regarding new copy keyword in __array__ implementations #25941

Closed
jorisvandenbossche opened this issue Mar 5, 2024 · 15 comments

Comments

@jorisvandenbossche
Copy link
Contributor

Moving from #25922, which added this to the docs:

NumPy will pass copy to the __array__ special method in situations where
it would be set to a non-default value (e.g. in a call to
np.asarray(some_object, copy=False)). Currently, if an
unexpected keyword argument error is raised after this, NumPy will print a
warning and re-try without the copy keyword argument. Implementations of
objects implementing the __array__ protocol should accept a copy keyword
argument with the same meaning as when passed to numpy.array or
numpy.asarray.

Thinking through how to update some __array__ implementations, I am wondering: what are the expectations from numpy how an __array__ implementations handles this keyword fully or partially?

When copy=True is being passed, does numpy assume that the __array__ implementation already made that copy? But what if the implementation defers dtype casting to numpy (which is currently possible), then it might be a waste to already copy up front?

Originally posted by @jorisvandenbossche in #25922 (comment)

@rgommers
Copy link
Member

rgommers commented Mar 5, 2024

Deprecated though, right? The signature should already have included the dtype keyword, even though it was still working to not have that.

then it might be a waste to already copy up front?

What we want to achieve I think is that only a single copy is ever made. We can have two now in many cases, because the other library may copy from its internal implementation into an array, and then numpy will make another copy. The only way to get it right is for the other library to do the copy if copy=True, and for numpy not to do so if the method call including copy=True succeeded.

The old behavior is:

  • the only contract of __array__ is that it must return a numpy.ndarray instance,
  • the implementer of __array__ may or may not make a copy in the process of producing an array
  • NumPy may make a copy of the returned array if needed when copy=False, and must make a copy if copy=True (it cannot know otherwise if it has exclusive ownership of the underlying data)
  • end result for copy=True is: either 1 or 2 copies made (NumPy cannot know which)

The intended new behavior is:

  • copy is always passed on to the __array__ implementer, with the same semantics as for asarray (False: "never copy", None: "copy-if-needed", True: ensure data is a semantic copy / unique to returned array)
  • that implementer must ensure that the data can not be used anymore by anything other than the numpy array it's returning (that usually means "copy the data in memory" but it can also mean something like "invalidate any other reference to this data")
  • end result:
    • copy=False: guarantee that no data is copied in memory
    • copy=None: 0 or 1 copies made, data may be a view
    • copy=True: 0 or 1 copies made, data is guaranteed to not be a view

It looks like we need to do two things:

  1. Make the expectations on implementers of __array__ clearer in our docs,
  2. Ensure we actually do not make a second copy if copy=True if passing along the copy keyword in the call to __array__ succeeded.

This change of __array__ kinda came in at the end of a long PR/discussion about the asarray/array behavior, so I didn't realize until now that this is actually also a performance improvement, potentially avoiding an unnecessary copy.

@jorisvandenbossche
Copy link
Contributor Author

One additional aspect of the old behaviour is illustrated as follows:

class MyArray:
    def __init__(self, arr):
        self._ndarray = arr

    def __array__(self, dtype=None):
        return self._ndarray

In [7]: arr = MyArray(np.array([1, 2, 3]))

In [8]: np.asarray(arr, dtype="float64")
Out[8]: array([1., 2., 3.])

So when __array__ returns an np.ndarray that doesn't match the requested dtype, then numpy does the casting. In all of pandas / geopandas / shapely, I know we have some __array__ implementations that do this.

When a copy keyword is added, I assume we can't keep such a simple implementation if we want to correctly honor the keyword. Numpy will now assume the library did correctly honor the keyword? For example, with the above code, when doing np.asarray(arr, dtype="int64", copy=True), i.e. with a matching dtype, numpy doesn't need to do a cast and thus not need to copy, and then will assume that copy already happened?

@jorisvandenbossche
Copy link
Contributor Author

It seems that on current main, numpy still does a copy afterwards, so np.asarray(arr, dtype="int64", copy=True) actually gives a copy even with the __array__ implementation above. But I assume numpy will want to change that to avoid the additional copy?

@ngoldbaum
Copy link
Member

But I assume numpy will want to change that to avoid the additional copy?

Right now it's actually not possible to get to the code path in numpy that calls __array__ with copy=True, but in the future we'll want to restructure things to avoid the extra copy.

@ngoldbaum
Copy link
Member

ngoldbaum commented Mar 20, 2024

This should probably be fixed:

import numpy as np

class HasArray:
    def __array__(self, values=None, dtype=None, copy=None):
        print(copy)
        return

arr = HasArray()

np.array(arr, copy=True)

Right now the print(copy) will print None where it really should print True.

We also need to update the docs to cover how implementers should handle this.

@mattip
Copy link
Member

mattip commented Apr 3, 2024

Closing, thanks.

@mattip mattip closed this as completed Apr 3, 2024
oscarbenjamin pushed a commit to oscarbenjamin/sympy that referenced this issue Apr 4, 2024
Currently copy=False is commented out due to a bug
in numpy. see:
numpy/numpy#25941 (comment)
@jorisvandenbossche
Copy link
Contributor Author

I have an additional question here that is not fully covered in the migration guide note that closed this issue:

For any __array__ method on a non-NumPy array-like object, a copy=None keyword can be added to the signature - this will work with older NumPy versions as well. If copy keyword is considered in the __array__ method implementation, then for copy=True always return a new copy.

I don't know what "if copy keyword is considered" means exactly (only when handling the copy keyword in your actual implementation, or whenever adding the copy keyword to the signature?), but it seems that adding copy=None to your __array__ implementation actually does impact the behaviour.

# no copy keyword -> raises an error (and depr warning) when specifying copy=False

class MyArray:
    def __array__(self, dtype=None):
        return np.array([1,2,3], dtype=int)

>>> np.array(MyArray(), copy=False)
DeprecationWarning: __array__ should implement the 'dtype' and 'copy' keyword argument
ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

# adding copy keyword just to signature -> no error

class MyArray:
    def __array__(self, dtype=None, copy=None):
        return np.array([1,2,3], dtype=int)

>>> np.array(MyArray(), copy=False)
array([1, 2, 3])

So just adding copy=None to the signature but further ignoring it (to get rid of the deprecation warning) results in np.array(.., copy=False) to no longer raise an error, while there is no guarantee that the result is not copied?

And this is what for example pandas did (pandas-dev/pandas#57739, and I just wanted to do the same for pyarrow).

Is the behaviour difference expected? (also in the case of copy=False that is now raising, the __array__ method might actually be returning something that is not a copy, and thus numpy can be raising an incorrect error)
And should the recommendation then be to only add copy=None to your __array__ signature when actually honoring the keyword, also when it is passed copy=False? (the current doc note only mentions that it should honor copy=True)

@ngoldbaum
Copy link
Member

So just adding copy=None to the signature but further ignoring it (to get rid of the deprecation warning) results in np.array(.., copy=False) to no longer raise an error, while there is no guarantee that the result is not copied?

This is working as intended. We only raise the warning if the function doesn't support the keyword at all. As soon as you add the keyword (even if you ignore it), numpy has to assume the implementation has done what the user expects when copy=False is passed. It would be nice if we could do something fancier and actually check if a copy happened or not, but I don't think there's a way to tell from inside numpy?

(also in the case of copy=False that is now raising, the array method might actually be returning something that is not a copy, and thus numpy can be raising an incorrect error)

Hmm, maybe it would be better to reword the error message to say something like "numpy tried passing copy=False, but the __array__ implementation doesn't accept a copy keyword, so numpy can't guarantee that a copy won't be made".

Ping @mtsokol, can we get some minor updates to docs and error messages to alleviate the confusion @jorisvandenbossche ran into?

@jorisvandenbossche
Copy link
Contributor Author

Hmm, maybe it would be better to reword the error message to say something like "numpy tried passing copy=False, but the __array__ implementation doesn't accept a copy keyword, so numpy can't guarantee that a copy won't be made".

That message would have helped me as implementor, but is probably not super useful for users? (they don't control the implementation of __array__ and might not even be aware that this is how the conversion happens)

Update the migration guide / __array__ docs specifying in more detail how to handle the copy keyword would be useful!

@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Apr 8, 2024

And https://numpy.org/devdocs/user/basics.interoperability.html#the-array-method already mentions the expected behaviour.
Although the example that is referenced there (https://numpy.org/devdocs/user/basics.dispatch.html#basics-dispatch) has a "wrong" implementation of __array__ in the sense that it just ignores the keyword.

It might be good to add a link to that from https://numpy.org/doc/stable/reference/generated/numpy.ndarray.__array__.html

@seberg
Copy link
Member

seberg commented Apr 8, 2024

If we touch the docs, it would be nice to use "must" rather than "should". Should can (and when it comes to such things does) convey that downstream has a choice: they don't (i.e. is more like a strong recommendation)! So the correct terminology is "must".

@jorisvandenbossche
Copy link
Contributor Author

jorisvandenbossche commented Apr 9, 2024

Another point of feedback: for implementing the copy=False case correctly, it is unfortunate (for this specific case, not necessarily for general usage!) that the copy keyword of astype is not strict.
Because we have a (typical?) usage of something like:

class MyArray:

    def __array__(self, dtype=None):
            if dtype is None:
                return self._ndarray
            return self._ndarray.astype(dtype)

To update this to have a copy=None keyword ànd honor the copy=False case, it's not sufficient to add astype(dtype, copy=False) as that still doesn't guarantee no copy was made. Once would have the return value of astype to check if it returned the calling ndarray as is, to detect if a copy happened or not?

EDIT: on second thought, I suppose this can be done much simpler by actually relying on the new behaviour of np.array and its copy keyword that is now strict, by just doing return np.array(self._ndarray, dtype=dtype, copy=False)?

@ngoldbaum
Copy link
Member

The copy keyword for astype is mentioned in NEP 56, where disallowing copies with copy=False is ruled out because astype implies a user explicitly requests a different dtype and a copy has to happen if the dtype changes. See also #25925.

From your edit, it sounds like you figured out an appropriate workaround.

@mtsokol
Copy link
Member

mtsokol commented Apr 9, 2024

@jorisvandenbossche I think this could be the shortest way to support both dtype and copy:

def __array__(self, dtype=None, copy=None):
    dtype = self._ndarray.dtype if dtype is None else dtype
    return np.array(self._ndarray, dtype=dtype, copy=copy)

mleila1312 pushed a commit to mleila1312/sympy that referenced this issue Apr 10, 2024
Currently copy=False is commented out due to a bug
in numpy. see:
numpy/numpy#25941 (comment)
pitrou pushed a commit to apache/arrow that referenced this issue Apr 15, 2024
… compatibility (#41071)

### Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

### What changes are included in this PR?

Add a `copy=None` to the signatures of our `__array__` methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No (compared to usage with numpy<2)
* GitHub Issue: #39532
* GitHub Issue: #41098

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
raulcd pushed a commit to apache/arrow that referenced this issue Apr 15, 2024
… compatibility (#41071)

### Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

### What changes are included in this PR?

Add a `copy=None` to the signatures of our `__array__` methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No (compared to usage with numpy<2)
* GitHub Issue: #39532
* GitHub Issue: #41098

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@seberg
Copy link
Member

seberg commented Apr 17, 2024

This has been explained, and a new PR is open to change the behavior, but it will be in 2.1 with a deprecation. Closing this.

@seberg seberg closed this as completed Apr 17, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…y 2.0+ compatibility (apache#41071)

### Rationale for this change

Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208)

### What changes are included in this PR?

Add a `copy=None` to the signatures of our `__array__` methods.

This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment)

### Are these changes tested?

Yes

### Are there any user-facing changes?

No (compared to usage with numpy<2)
* GitHub Issue: apache#39532
* GitHub Issue: apache#41098

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants