
API: Public data for Series and Index: .array and .to_numpy() #23623

Merged
merged 28 commits into from Nov 29, 2018

Conversation

6 participants
@TomAugspurger
Contributor

commented Nov 11, 2018

Closes #19954.

TODO:

  • update references to .values in the docs to use .array or .to_numpy()
  • cross ref between .values and the rest

TomAugspurger added some commits Oct 30, 2018

API: Public data attributes for EA-backed containers
This adds two new methods for working with EA-backed Series / Index.

- `.array -> Union[ExtensionArray, ndarray]`: the actual backing array
- `.to_numpy() -> ndarray`: A NumPy representation of the data

`.array` is always a reference to the actual data stored in the container.
Updating it in place (not recommended) will be reflected in the Series (or
the Index, which is why it is really not recommended).

`to_numpy()` may (or may not) require data copying / coercion.

Closes #19954
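A minimal sketch of the two accessors described in this commit message, run against a pandas release that includes them (the example data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))

backing = s.array      # the actual backing array: here an ExtensionArray (Categorical)
as_np = s.to_numpy()   # a NumPy representation: coerces the categories to object dtype

assert isinstance(backing, pd.Categorical)
assert isinstance(as_np, np.ndarray) and as_np.dtype == object
```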

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 11, 2018

@pep8speaks

commented Nov 11, 2018

Hello @TomAugspurger! Thanks for submitting the PR.

@jbrockmendel
Member

commented Nov 11, 2018

Not having read the diff, first thought is that I’d like to wait for DatetimeArray to see what we can simplify vis-a-vis values/_values

@TomAugspurger
Contributor Author

commented Nov 11, 2018

@jbrockmendel
Member

commented Nov 11, 2018

Not at all a well thought out opinion, just gut reaction.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Nov 11, 2018

BUG: Ensure that Index._data is an ndarray
Split from pandas-dev#23623, where it was
causing issues with infer_dtype.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Nov 11, 2018

BUG: Ensure that Index._data is an ndarray
BUG: Ensure that Index._data is an ndarray

Split from pandas-dev#23623, where it was
causing issues with infer_dtype.

TomAugspurger added some commits Nov 13, 2018

Squashed commit of the following:
commit e4b21f6
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 16:09:58 2018 -0600

    TST: Change rops tests

commit e903550
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 09:31:38 2018 -0600

    Add note

    [ci skip]

    ***NO CI***

commit fa8934a
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 06:16:53 2018 -0600

    update errors

commit 505970e
Merge: a30bc02 3592a46
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Mon Nov 12 05:55:31 2018 -0600

    Merge remote-tracking branch 'upstream/master' into index-ndarray-data

commit a30bc02
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Sun Nov 11 15:14:46 2018 -0600

    remove assert

commit 1f23ebc
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Sun Nov 11 15:01:13 2018 -0600

    BUG: Ensure that Index._data is an ndarray

    BUG: Ensure that Index._data is an ndarray

    Split from #23623, where it was
    causing issues with infer_dtype.
@codecov


commented Nov 13, 2018

Codecov Report

Merging #23623 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #23623      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         161      161              
  Lines       51513    51526      +13     
==========================================
+ Hits        47554    47567      +13     
  Misses       3959     3959
Flag Coverage Δ
#multiple 90.71% <100%> (ø) ⬆️
#single 42.48% <50%> (+0.05%) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 93.68% <ø> (ø) ⬆️
pandas/core/generic.py 96.84% <ø> (ø) ⬆️
pandas/core/indexes/multi.py 95.51% <100%> (+0.01%) ⬆️
pandas/core/base.py 97.64% <100%> (+0.03%) ⬆️
pandas/core/frame.py 97.03% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.48% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 43b2dab...f9eee65.

Resolved review threads: doc/source/basics.rst, doc/source/dsintro.rst (two threads, outdated), doc/source/whatsnew/v0.24.0.rst, pandas/core/indexes/base.py (outdated)
@@ -1269,3 +1269,54 @@ def test_ndarray_values(array, expected):
r_values = pd.Index(array)._ndarray_values
tm.assert_numpy_array_equal(l_values, r_values)
tm.assert_numpy_array_equal(l_values, expected)


@pytest.mark.parametrize("array, attr", [

@jreback Nov 21, 2018 (Contributor)

maybe put in pandas/tests/arrays/test_arrays.py?

@TomAugspurger Nov 21, 2018 (Author, Contributor)

Doesn't exist yet :), though my pd.array PR is creating it.

This seemed a bit more appropriate since it's next to our tests for ndarray_values.

@jreback
Contributor

commented Nov 27, 2018

Series[datetime64[ns]].array: DatetimeArray
Series[datetime64[ns, tz]].to_numpy: object-dtype ndarray of Timestamps?

so agree (.array should always be an Array if we can, and here we can).
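A sketch of the mapping under discussion. This matches what recent pandas releases ended up doing, but at the time of this thread it was still an open question, so treat it as illustrative:

```python
import pandas as pd

naive = pd.Series(pd.date_range("2018-01-01", periods=2))
aware = pd.Series(pd.date_range("2018-01-01", periods=2, tz="US/Eastern"))

# Series[datetime64[ns]].array -> DatetimeArray
print(type(naive.array).__name__)

# Series[datetime64[ns, tz]].to_numpy() -> object-dtype ndarray of Timestamps
print(aware.to_numpy().dtype)
assert isinstance(aware.to_numpy()[0], pd.Timestamp)
```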

@shoyer
Member

commented Nov 27, 2018

Do time-zone naive datetimes use extension arrays?

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.
@TomAugspurger
Contributor Author

commented Nov 27, 2018

Do time-zone naive datetimes use extension arrays?

That's being worked on right now. It's unclear to me how things will end up, but it's possible there will be an ExtensionArray between Series and the actual ndarray for tz-naive data. This will certainly be the case for DatetimeIndex.

However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about Series[timedelta64[ns]].array, but I presume it should follow the behavior of datetime64[ns]).

I'd love to have a clear rule for .array:

  • It's an ndarray for builtin types supported by numpy.
  • It's an extension array for pandas extension arrays.

I like this rule.

@shoyer
Member

commented Nov 27, 2018

However, I don't think this fact needs to be exposed to the user. It's primarily for code reuse between datetime-tz, datetime, and timedelta. (We haven't really talked about Series[timedelta64[ns]].array, but I presume it should follow the behavior of datetime64[ns]).

In that case I suppose it mostly comes down to:

  1. which choice is most useful to users -- will they want to manipulate these extension array objects directly or would they rather have base numpy arrays?
  2. which choice is most future proof -- are we going to be happy sticking with this choice for the long term? We definitely don't want upheaval with .array like we've had with .values.
@jorisvandenbossche
Member

commented Nov 27, 2018

which choice is most useful to users -- will they want to manipulate these extension array objects directly or would they rather have base numpy arrays?

"Ideally", if a user wants a numpy array, they use to_numpy or np.array(..), and they should only use .array if they don't really care about the distinction between both, but want to do a certain operation on the values (eg an operation without alignment and then put it back in a Series).
But of course, it is difficult to control what users use it for .. (there will be people starting to use .array to get a numpy array, the only way to avoid this is to raise an error in such a case and only return ExtensionArrays, but that of course defeats the general usecase of the property).
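The "if you want a numpy array, ask for one explicitly" point can be made concrete with a small sketch (the data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])

# Explicit requests for a NumPy array: both are equivalent here.
a = s.to_numpy()
b = np.asarray(s)
assert np.array_equal(a, b)

# .array is for when you want the backing values themselves, e.g. to do an
# operation without alignment and put the result back in a Series.
result = pd.Series(s.array[::-1], index=s.index)
assert result.iloc[0] == 3.5
```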

which choice is most future proof -- are we going to be happy sticking with this choice for the long term? We definitely don't want upheaval with .array like we've had with .values

I suppose that returning DatetimeArray instead of ndarray[datetime64[ns]] will be more future proof. But, that also relates to what I commented above (#23623 (comment)) related the back-compat guarantees we make. If we want to keep following the rule in the future, I have the feeling that we will need to change the return value at some point.

I'd love to have a clear rule for .array:

   It's an ndarray for builtin types supported by numpy.
   It's an extension array for pandas extension arrays.

Some other possibilities that would go further than the above rule:

  • It's an extension array if the round-trip (eg Series -> array -> Series) with ndarray would not be completely information-preserving and completely cheap.

    • In practice this might be the same as the above rule at the moment, but eg a possible future StringArray (wrapping an object array of strings but guaranteeing every element is a string) could be converted rather faithfully to an object array, but the round-trip to ndarray would lose some information (the fact that every element is a string). For DatetimeArray you could say it loses the freq attribute, but since this is an attribute on the Array, and not metadata of the dtype, I would say this is less of a problem.
  • It's an extension array if an ndarray would not support the same operations as a Series holding the data.

    • For eg DatetimeArray vs ndarray[datetime64], not all arithmetic and other operations will yield the same results, and one could expect Series[datetime] and Series[datetime].array to behave the same.

The rule might also be whether it is implemented under the hood as an ExtensionArray or not. But this is of course not necessarily clear to the user (it might be hidden somewhat), and @shoyer I assume that your proposal for a rule would be to have a clear expectation for the user? (so all the above rules that I mention would be more complicated).

@shoyer
Member

commented Nov 27, 2018

Would it make sense to consider having .array always return an ExtensionArray object?

I imagine we could pretty quickly whip up an ExtensionArray for each NumPy dtype that simply defers to the underlying NumPy array for every operation. Internally, we could recognize these NumpyExtensionArray objects and use base numpy operations.

I guess we could also just clearly document: .array means you are going into pandas' internals and is not guaranteed to be stable. We may change the return value from .array in the future.
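A sketch of what this proposal implies for a plain NumPy-backed Series. This is the behavior pandas eventually shipped; the wrapper class is called NumpyExtensionArray in recent releases (PandasArray in older ones), so treat the class name as incidental:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Even for a builtin NumPy dtype, .array returns an ExtensionArray wrapper
# that defers to the underlying ndarray for its operations.
arr = s.array
assert isinstance(arr, pd.api.extensions.ExtensionArray)
print(type(arr).__name__)
```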

@jorisvandenbossche
Member

left a comment

OK, had the time to go through the full diff, and added some comments (and thanks a lot for the PR Tom!)

Additional comments:

  • to what extent do we also want to mention np.(as)array(..) as alternative to .to_numpy() in the docs?
  • I think we need to keep in some places the explanation about values, since it is not yet going away and users will encounter it in code (or at least mention something about it existing for historical reasons and DataFrame.values being equivalent to to_numpy, but not recommended anymore)
  • I would update the Series.values docstring as well, to add a note about its recommendation status, and to refer to array/to_numpy (similar as what you did for DataFrame/Index.values)
casting every value to a Python object.

For ``df``, our :class:`DataFrame` of all floating-point values,
:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.

@jorisvandenbossche Nov 27, 2018 (Member)

Reading this, should we have a copy keyword to be able to force a copy? (can be added later)

@TomAugspurger Nov 28, 2018 (Author, Contributor)

This is a good idea. Don't care whether we do it here or later.

I think we'll also want (type-specific?) keywords for controlling how the conversion is done (ndarray of Timestamps vs. datetime64[ns] for example). I'm not sure what the eventual signature should be.

@jorisvandenbossche Nov 28, 2018 (Member)

Yeah, if we decide to go for object array of Timestamps for datetimetz as default, it would be good to have the option to return datetime64

Regarding copy, would it actually make sense to have copy=True the default? Then you have at least a consistent default (it is never a view on the data)

@TomAugspurger Nov 28, 2018 (Author, Contributor)

Yes, I think copy=True is a good default since it's the only one that can be ensured for all cases.
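Both keywords discussed here did land in later releases; a sketch of how they behave there (illustrative, not part of this PR's diff):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# copy=True guarantees a fresh ndarray, never a view on the Series data.
a = s.to_numpy(copy=True)
a[0] = 99.0
assert s.iloc[0] == 1.0  # the Series is untouched

# dtype= controls how the conversion is done.
b = s.to_numpy(dtype="int64")
assert b.dtype == np.dtype("int64")
```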

Resolved review threads: doc/source/basics.rst (three threads, one outdated)
period (time spans) :class:`PeriodDtype` :class:`Period` :class:`arrays.PeriodArray` :ref:`timeseries.periods`
sparse :class:`SparseDtype` (none) :class:`arrays.SparseArray` :ref:`sparse`
intervals :class:`IntervalDtype` :class:`Interval` :class:`arrays.IntervalArray` :ref:`advanced.intervalindex`
nullable integer :class:`Int64Dtype`, ... (none) :class:`arrays.IntegerArray` :ref:`integer_na`

@jorisvandenbossche Nov 27, 2018 (Member)

where does this 'integer_na' point to? (I don't seem to find it in the docs)

@TomAugspurger Nov 28, 2018 (Author, Contributor)

#23617. I'm aiming for eventual consistency on the docs :)

Resolved review threads: pandas/core/frame.py (outdated), pandas/core/generic.py, pandas/core/indexes/base.py (outdated), pandas/tests/test_base.py
pytest.skip("No index type for {}".format(array.dtype))

result = thing.to_numpy()
tm.assert_numpy_array_equal(result, expected)

@jorisvandenbossche Nov 27, 2018 (Member)

Should we also test for the case where it is not a copy?

@TomAugspurger Nov 28, 2018 (Author, Contributor)

What do you mean here? (in case you missed it, the first case is a regular ndarray, so that won't be a copy. Though perhaps you're saying I should assert this for that case?)

@jorisvandenbossche Nov 28, 2018 (Member)

Yes, that's what I meant. If we return a view, and people can rely on it, we should test it.
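One way to write the "it is not a copy" assertion being discussed is with np.shares_memory. A sketch only: whether to_numpy() may return a view is version-dependent, so the first assertion expresses the property under discussion, not a guarantee:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3))

# For a plain numpy-backed Series, to_numpy() can return the data without copying:
assert np.shares_memory(s.to_numpy(), s.values)

# ...while copy=True must always produce a fresh buffer:
assert not np.shares_memory(s.to_numpy(copy=True), s.values)
```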

@jorisvandenbossche
Member

commented Nov 27, 2018

@shoyer I think the main problem is that that would basically already get rid of the block-based internals? And I think that was more a change we were contemplating for 2.0 instead of 1.0 (apart from the additional work it would mean in the short term).
(unless we use such a NumpyExtensionArray only for wrapping the array when returning it in .array)

@shoyer
Member

commented Nov 27, 2018

(unless we use such a NumpyExtensionArray only for wrapping the array when returning it in .array)

This is all I was thinking of.

@TomAugspurger
Contributor Author

commented Nov 28, 2018

Hmm, having .array always be an ExtensionArray is an interesting proposal... It kind of makes ".array is the actual array stored in the Series" a lie, but maybe users don't care about that? I assume they care more about things like zero-copy and inplace modification, than they do about how pandas chooses to handle a particular dtype.

@TomAugspurger
Contributor Author

commented Nov 29, 2018

What do people think about doing the remaining items as followup?

  1. Determine the signature for .to_numpy() (copy=True is uncontroversial; the rest I'm not sure about, but we could figure it out and do it here.)
  2. Finalize DatetimeArray vs. ndarray for tz aware and naive
  3. Explore .array always being an Array.

I think 2 and 3 will be easier to think about once we have a DatetimeArray.

@jreback
Contributor

commented Nov 29, 2018

#23623 (comment)

certainly fine as a followup. I am not sure 3 is actually a blocker for 0.24.0 (though 2 is), and 1 as-is is fine for now.

@jreback
Contributor

commented Nov 29, 2018

ping on green.

@TomAugspurger
Contributor Author

commented Nov 29, 2018

All green.

@jreback jreback merged commit 0a4f40c into pandas-dev:master Nov 29, 2018

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed
pandas-dev.pandas: Build #20181129.36 succeeded
@jreback
Contributor

commented Nov 29, 2018

thanks @TomAugspurger very nice!

@TomAugspurger TomAugspurger deleted the TomAugspurger:public-data branch Nov 29, 2018

@TomAugspurger TomAugspurger referenced this pull request Nov 29, 2018

Closed

Public Data Followups #23995

2 of 3 tasks complete

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
