Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double copy in Series/Index.to_numpy #24345

Closed
TomAugspurger opened this issue Dec 18, 2018 · 1 comment · Fixed by #50506
Closed

Avoid double copy in Series/Index.to_numpy #24345

TomAugspurger opened this issue Dec 18, 2018 · 1 comment · Fixed by #50506
Assignees
Labels
Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance

Comments

@TomAugspurger
Copy link
Contributor

#24341 added the copy parameter to to_numpy().

In some cases, that will copy data twice, when the asarray(self._values, dtype=dtype) makes a copy, Series / Index doesn't know about it.

To avoid that, we could add a method to the ExtensionArray developer API

def _to_numpy(self, dtype=None copy=False): -> Union[ndarray, bool]
    # returns ndarray, did_copy

The second item in the return tuple indicates whether the returned ndarray is already a copy, or whether it's a zero-copy view on some other data.

Then, back in Series.to_numpy we do

arr, copied = self.array._to_numpy(dtype, copy)
if copy and not copied:
    arr = arr.copy()

return arr

This is too much complexity to rush into 0.24

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Dec 18, 2018
@mroeschke mroeschke added Copy / view semantics Performance Memory or execution speed performance labels Jun 25, 2021
@guitargeek
Copy link

take

guitargeek added a commit to guitargeek/pandas that referenced this issue Jan 4, 2022
As reported in pandas-dev#24345, there is a problem with possible double copied in
the different `to_numpy` functions implemented for some pandas objects.

In particular, we are talking about the `to_numpy` functions that use
`np.asarray` here, because `np.asarray` might have already copied the
object if the input was not a numpy array or the requested dtype didn't
match.

In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray`
with a custom function that reports whether a copy happened. However, we
can easily check this also when using `np.asarray` with `arr_out is
arr_in`, as shown in the numpy documentation [1]. This commit implements
these checks to avoid the possible double copy.

This commit also includes a small bugfix. The `arr_out is arr_in` check
was actually used already in `pandas/core/arrays/numpy_.py`, but at the
wrong place: if a copy already happened, the optional NaN value filling
was skipped. This commit fixes that.

Finally, the intendation level of some of the NaN-filling code is
changed to not redundantly check `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
guitargeek added a commit to guitargeek/pandas that referenced this issue Jan 4, 2022
As reported in pandas-dev#24345, there is a problem with possible double copying
in the different `to_numpy` functions implemented for some pandas
objects.

In particular, we are talking about the `to_numpy` functions that use
`np.asarray` here, because `np.asarray` might have already copied the
object if the input was not a numpy array or the requested dtype didn't
match.

In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray`
with a custom function that reports whether a copy happened. However, we
can easily check this also when using `np.asarray` by checking object
equality (`arr_out is arr_in`), as shown in the numpy documentation [1].
This commit implements these checks to avoid the possible double copy.

In the case of the ExtensionArray implemented in
`pandas/core/arrays/base.py`, the object equality is only meaningful to
check if the ExtensionArray implements a custom array container [2], and
then we need to compare the array object returned by `np.asarray` to the
object returned by `__array__()`.

This commit also includes a small bugfix. The `arr_out is arr_in` check
was actually used already in `pandas/core/arrays/numpy_.py`, but at the
wrong place: if a copy already happened, the optional NaN value filling
was skipped. This commit fixes that.

Finally, the intendation level of some of the NaN-filling code is
changed to not redundantly check `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
[2] https://numpy.org/devdocs/user/basics.dispatch.html
guitargeek added a commit to guitargeek/pandas that referenced this issue Feb 1, 2022
As reported in pandas-dev#24345, there is a problem with possible double copying
in the different `to_numpy` functions implemented for some pandas
objects.

In particular, we are talking about the `to_numpy` functions that use
`np.asarray` here, because `np.asarray` might have already copied the
object if the input was not a numpy array or the requested dtype didn't
match.

In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray`
with a custom function that reports whether a copy happened. However, we
can easily check this also when using `np.asarray` by checking object
equality (`arr_out is arr_in`), as shown in the numpy documentation [1].
This commit implements these checks to avoid the possible double copy.

In the case of the ExtensionArray implemented in
`pandas/core/arrays/base.py`, the object equality is only meaningful to
check if the ExtensionArray implements a custom array container [2], and
then we need to compare the array object returned by `np.asarray` to the
object returned by `__array__()`.

This commit also includes a small bugfix. The `arr_out is arr_in` check
was actually used already in `pandas/core/arrays/numpy_.py`, but at the
wrong place: if a copy already happened, the optional NaN value filling
was skipped. This commit fixes that.

Finally, the intendation level of some of the NaN-filling code is
changed to not redundantly check `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
[2] https://numpy.org/devdocs/user/basics.dispatch.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance
Projects
None yet
3 participants