Avoid double copy in Series/Index.to_numpy #24345
Labels: Copy / view semantics, ExtensionArray (Extending pandas with custom dtypes or arrays), Performance (Memory or execution speed performance)
TomAugspurger added the ExtensionArray (Extending pandas with custom dtypes or arrays) label on Dec 18, 2018
mroeschke added the Copy / view semantics and Performance (Memory or execution speed performance) labels on Jun 25, 2021
take
guitargeek added a commit to guitargeek/pandas that referenced this issue on Jan 4, 2022:
As reported in pandas-dev#24345, there is a problem with possible double copying in the different `to_numpy` functions implemented for some pandas objects. In particular, this concerns the `to_numpy` functions that use `np.asarray`, because `np.asarray` might have already copied the object if the input was not a numpy array or the requested dtype didn't match. In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray` with a custom function that reports whether a copy happened. However, we can easily check this also when using `np.asarray` with `arr_out is arr_in`, as shown in the numpy documentation [1]. This commit implements these checks to avoid the possible double copy.

This commit also includes a small bugfix. The `arr_out is arr_in` check was actually used already in `pandas/core/arrays/numpy_.py`, but at the wrong place: if a copy had already happened, the optional NaN value filling was skipped. This commit fixes that. Finally, the indentation level of some of the NaN-filling code is changed to not redundantly check `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
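The `arr_out is arr_in` identity check described in the commit message can be sketched as follows (a minimal illustration with an invented helper name, not the actual pandas implementation):

```python
import numpy as np


def to_numpy_no_double_copy(arr_in, dtype=None, copy=False):
    """Convert to an ndarray, copying at most once.

    np.asarray only copies when it has to (e.g. on a dtype
    mismatch). If the result is the same object as the input,
    no copy has happened yet and we must honor copy=True
    ourselves; otherwise a copy was already made and a second
    one would be redundant.
    """
    arr_out = np.asarray(arr_in, dtype=dtype)
    if copy and arr_out is arr_in:
        # np.asarray returned the input unchanged: copy exactly once.
        arr_out = arr_out.copy()
    return arr_out
```

When the requested dtype differs from the input's, `np.asarray` already copied, so the identity check fails and no second copy is made.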
guitargeek added a commit to guitargeek/pandas that referenced this issue on Jan 4, 2022:
As reported in pandas-dev#24345, there is a problem with possible double copying in the different `to_numpy` functions implemented for some pandas objects. In particular, this concerns the `to_numpy` functions that use `np.asarray`, because `np.asarray` might have already copied the object if the input was not a numpy array or the requested dtype didn't match. In pandas-dev#24345, a possible solution was suggested: replacing `np.asarray` with a custom function that reports whether a copy happened. However, we can easily check this also when using `np.asarray` by checking object identity (`arr_out is arr_in`), as shown in the numpy documentation [1]. This commit implements these checks to avoid the possible double copy.

In the case of the ExtensionArray implemented in `pandas/core/arrays/base.py`, the object identity is only meaningful to check if the ExtensionArray implements a custom array container [2], and then we need to compare the array object returned by `np.asarray` to the object returned by `__array__()`.

This commit also includes a small bugfix. The `arr_out is arr_in` check was actually used already in `pandas/core/arrays/numpy_.py`, but at the wrong place: if a copy had already happened, the optional NaN value filling was skipped. This commit fixes that. Finally, the indentation level of some of the NaN-filling code is changed to not redundantly check `na_value is not lib.no_default`.

[1] https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
[2] https://numpy.org/devdocs/user/basics.dispatch.html
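For a custom array container, the commit notes that the identity check must compare against the ndarray returned by `__array__()`, since the container itself is never an ndarray. A minimal sketch with a toy container (the class and helper names are invented for illustration, not pandas code):

```python
import numpy as np


class MyContainer:
    """Toy array container wrapping an ndarray (see the NumPy
    custom-container docs referenced in the commit message)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # np.asarray(container) goes through this method.
        arr = self._data if dtype is None else self._data.astype(dtype)
        if copy:
            arr = arr.copy()
        return arr


def container_to_numpy(container, dtype=None, copy=False):
    arr_out = np.asarray(container, dtype=dtype)
    if copy and arr_out is container.__array__():
        # np.asarray handed back the wrapped ndarray unchanged,
        # so we still owe the caller a copy -- but only one.
        arr_out = arr_out.copy()
    return arr_out
```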
guitargeek added a commit to guitargeek/pandas that referenced this issue on Feb 1, 2022.
#24341 added the `copy` parameter to `to_numpy()`. In some cases, that will copy data twice: when the `asarray(self._values, dtype=dtype)` makes a copy, Series / Index doesn't know about it.

To avoid that, we could add a method to the ExtensionArray developer API. The second item in its return tuple indicates whether the returned ndarray is already a copy, or whether it's a zero-copy view on some other data. Then, back in `Series.to_numpy`, we use that flag to decide whether another copy is still needed.

This is too much complexity to rush into 0.24.
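The proposed developer-API hook could look roughly like this (all names are hypothetical; the sketch only illustrates the `(ndarray, is_copy)` return-tuple idea, not an API pandas actually ships):

```python
import numpy as np


class ExtensionArraySketch:
    """Toy stand-in for a pandas ExtensionArray (hypothetical)."""

    def __init__(self, data):
        self._ndarray = np.asarray(data)

    def _values_for_to_numpy(self, dtype=None):
        """Hypothetical developer-API hook returning (ndarray, is_copy).

        The second item tells the caller whether the ndarray is
        already a fresh copy or a zero-copy view of internal data.
        """
        arr = np.asarray(self._ndarray, dtype=dtype)
        return arr, arr is not self._ndarray


def to_numpy(ea, dtype=None, copy=False):
    # Sketch of what Series.to_numpy could then do: only copy
    # when the hook reports that no copy has happened yet.
    arr, is_copy = ea._values_for_to_numpy(dtype=dtype)
    if copy and not is_copy:
        arr = arr.copy()
    return arr
```

With this shape, the dtype-conversion copy inside the array and the `copy=True` request never stack: at most one copy is ever made.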