Avoid double copy in Series/Index.to_numpy #24345

TomAugspurger · 2018-12-18T22:39:22Z

#24341 added the copy parameter to to_numpy().

In some cases, that will copy data twice: when asarray(self._values, dtype=dtype) already makes a copy, Series / Index doesn't know about it and copies again.
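To make the problem concrete, here is a minimal sketch of the double copy using plain NumPy. The function `naive_to_numpy` is a hypothetical simplification for illustration, not the actual pandas implementation.

```python
import numpy as np

values = np.array([1, 2, 3], dtype="int64")

# dtype already matches: asarray hands back the very same object, no copy.
same = np.asarray(values, dtype="int64")

# dtype differs: asarray has to allocate a new buffer, i.e. it copies.
converted = np.asarray(values, dtype="float64")


def naive_to_numpy(values, dtype=None, copy=False):
    """Hypothetical simplification of a to_numpy that may copy twice."""
    result = np.asarray(values, dtype=dtype)  # may already be a fresh copy
    if copy:
        # Unconditional copy: wasteful whenever asarray copied above.
        result = result.copy()
    return result
```

With `dtype="float64"` and `copy=True`, `naive_to_numpy` allocates two new buffers even though the first one was already independent of `values`.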

To avoid that, we could add a method to the ExtensionArray developer API:

def _to_numpy(self, dtype=None, copy=False) -> Tuple[ndarray, bool]:
    # returns ndarray, did_copy
    ...

The second item in the return tuple indicates whether the returned ndarray is already a copy, or whether it's a zero-copy view on some other data.

Then, back in Series.to_numpy we do:

arr, copied = self.array._to_numpy(dtype, copy)
if copy and not copied:
    arr = arr.copy()

return arr
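The proposal above can be sketched end to end as follows. `ToyArray` and the module-level `to_numpy` are hypothetical stand-ins for an ExtensionArray and for `Series.to_numpy`; the sketch assumes the array is backed by a plain ndarray, so an identity check is enough to detect a copy.

```python
import numpy as np


class ToyArray:
    """Hypothetical stand-in for an ExtensionArray backed by an ndarray."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def _to_numpy(self, dtype=None, copy=False):
        # `copy` is accepted to mirror the proposed signature; in this
        # sketch the caller performs the copy if one is still needed.
        result = np.asarray(self._data, dtype=dtype)
        # asarray returns the input object itself when no conversion is
        # needed, so a different object means new memory was allocated.
        did_copy = result is not self._data
        return result, did_copy


def to_numpy(array, dtype=None, copy=False):
    """Hypothetical caller, playing the role of Series.to_numpy."""
    arr, copied = array._to_numpy(dtype, copy)
    if copy and not copied:
        arr = arr.copy()
    return arr
```

The caller copies only when the array reports a zero-copy result, so at most one new buffer is allocated per call.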

This is too much complexity to rush into 0.24.


guitargeek · 2022-01-04T17:05:42Z

take

As reported in pandas-dev#24345, the `to_numpy` functions implemented for some pandas objects can copy data twice. This affects the `to_numpy` functions that use `np.asarray`, because `np.asarray` may already have copied the data if the input was not a numpy array or the requested dtype didn't match.

In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray` with a custom function that reports whether a copy happened. However, the same information is available when using `np.asarray` by checking `arr_out is arr_in`, as shown in the numpy documentation [1]. This commit implements these checks to avoid the possible double copy.

This commit also includes a small bugfix. The `arr_out is arr_in` check was already used in `pandas/core/arrays/numpy_.py`, but in the wrong place: if a copy had already happened, the optional NaN value filling was skipped. This commit fixes that.

Finally, the indentation level of some of the NaN-filling code is changed to avoid redundantly checking `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html

As reported in pandas-dev#24345, the `to_numpy` functions implemented for some pandas objects can copy data twice. This affects the `to_numpy` functions that use `np.asarray`, because `np.asarray` may already have copied the data if the input was not a numpy array or the requested dtype didn't match.

In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray` with a custom function that reports whether a copy happened. However, the same information is available when using `np.asarray` by checking object identity (`arr_out is arr_in`), as shown in the numpy documentation [1]. This commit implements these checks to avoid the possible double copy.

In the case of the ExtensionArray implemented in `pandas/core/arrays/base.py`, the identity check is only meaningful if the ExtensionArray implements a custom array container [2]. In that case, the array object returned by `np.asarray` has to be compared to the object returned by `__array__()`.

This commit also includes a small bugfix. The `arr_out is arr_in` check was already used in `pandas/core/arrays/numpy_.py`, but in the wrong place: if a copy had already happened, the optional NaN value filling was skipped. This commit fixes that.

Finally, the indentation level of some of the NaN-filling code is changed to avoid redundantly checking `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
[2] https://numpy.org/devdocs/user/basics.dispatch.html
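The identity check and the NaN-filling order described in the commit message might look roughly like the sketch below. It is not the code from the commit: `to_numpy_checked` and the `_no_default` sentinel (standing in for `lib.no_default`) are hypothetical, and the sketch assumes float data so that `np.isnan` applies.

```python
import numpy as np

_no_default = object()  # hypothetical sentinel standing in for lib.no_default


def to_numpy_checked(values, dtype=None, copy=False, na_value=_no_default):
    """Hypothetical sketch: copy at most once, always honor na_value."""
    result = np.asarray(values, dtype=dtype)

    # A private buffer is needed if the caller asked for a copy, or if
    # na_value is about to be written into the result.
    needs_own_buffer = copy or na_value is not _no_default

    # asarray returns the input itself when no conversion was required;
    # only then is an explicit copy still necessary.
    if needs_own_buffer and result is values:
        result = result.copy()

    # The filling runs whether or not asarray already copied, which is
    # the ordering issue the commit message describes.
    if na_value is not _no_default:
        result[np.isnan(result)] = na_value
    return result
```

Keeping the fill step outside the copy branch means a dtype conversion that already copied the data no longer causes the `na_value` replacement to be skipped.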

TomAugspurger added the ExtensionArray (Extending pandas with custom dtypes or arrays) label Dec 18, 2018

mroeschke added the Copy / view semantics and Performance (Memory or execution speed performance) labels Jun 25, 2021

github-actions bot assigned guitargeek Jan 4, 2022

guitargeek mentioned this issue Jan 4, 2022

PERF: avoid double copy in to_numpy functions if np.asarray is used #45188

Closed

4 tasks

phofl mentioned this issue Dec 30, 2022

BUG: to_numpy not respecting na_value before converting to array #50506

Merged

6 tasks

mroeschke closed this as completed in #50506 Jan 5, 2023
