.values on ExtensionArray-backed containers #19954

Closed
TomAugspurger opened this issue Mar 1, 2018 · 54 comments

@TomAugspurger
Contributor

commented Mar 1, 2018

Discussed briefly on the call today, but we should go through things formally.

What should the return type of Series[extension_array].values and Index[extension_array].values be? I believe the two options are

  1. Return the ExtensionArray backing it (e.g. like what Categorical does)
  2. Return an ndarray with some information loss / performance cost
    • e.g. like Series[datetimeTZ].values -> datetime64ns at UTC
    • e.g. Series[period].values -> ndarray[Period objects]

Current State

Not sure how much weight we should put on the current behavior, but for reference:

| type        | `Series.values`            | `Index.values`             |
|-------------|----------------------------|----------------------------|
| datetime    | datetime64ns               | datetime64ns               |
| datetime-tz | datetime64ns (UTC & naive) | datetime64ns (UTC & naive) |
| categorical | Categorical                | Categorical                |
| period      | NA                         | ndarray[Period objects]    |
| interval    | NA                         | ndarray[Interval objects]  |

In [5]: pd.Series(pd.date_range('2017', periods=1)).values
Out[5]: array(['2017-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [6]: pd.Series(pd.date_range('2017', periods=1, tz='US/Eastern')).values
Out[6]: array(['2017-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

In [7]: pd.Series(pd.Categorical([1])).values
Out[7]:
[1]
Categories (1, int64): [1]

In [8]: pd.Series(pd.SparseArray([1])).values
Out[8]:
[1]
Fill: 0
IntIndex
Indices: array([0], dtype=int32)

In [9]: pd.date_range('2017', periods=1).values
Out[9]: array(['2017-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

In [10]: pd.date_range('2017', periods=1, tz='US/Central').values
Out[10]: array(['2017-01-01T06:00:00.000000000'], dtype='datetime64[ns]')

In [11]: pd.period_range('2017', periods=1, freq='D').values
Out[11]: array([Period('2017-01-01', 'D')], dtype=object)

In [12]: pd.interval_range(start=0, periods=1).values
Out[12]: array([Interval(0, 1, closed='right')], dtype=object)

In [13]: pd.CategoricalIndex([1]).values
Out[13]:
[1]
Categories (1, int64): [1]

If we decide to have the return values be ExtensionArrays, we'll need to discuss
to what extent they're part of the public API.

Regardless of the choice for .values, we'll probably want to support the other
use case (maybe just by documenting "call np.asarray on it"). Internally, we
have ._values (the "best" array, ndarray or EA) and ._ndarray_values (always an
ndarray).
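
A minimal sketch of the "call np.asarray on it" route, using a categorical Series as the canonical EA-backed example:

```python
import numpy as np
import pandas as pd

# An ExtensionArray-backed Series (Categorical is the canonical example).
ser = pd.Series(pd.Categorical(["a", "b", "a"]))

# np.asarray always hands back an ndarray, at the cost of losing the
# categorical dtype: we get an object-dtype array of the category values.
arr = np.asarray(ser)
```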

cc @jreback @jorisvandenbossche @jschendel @jbrockmendel @shoyer @chris-b1

@TomAugspurger

Contributor Author

commented Mar 1, 2018

Not to derail the discussion from the get-go, but there's a closely related topic of how many "nuisance" ndarray methods we throw on the ExtensionArray base class to make writing code that's ndarray vs. extensionarray agnostic easier. For example tolist, which just calls list(self). .reshape, which just returns self. The list goes on (and on).

@shoyer

Member

commented Mar 1, 2018

I was pretty happy with always returning an ndarray from .values, but we broke that a long time ago with Categorical. For consistency we should probably clean this up (even for existing extension arrays), but unfortunately this will be a painful compatibility break.

I think my preferred option would be to make obj.values always return the underlying ExtensionArray, guaranteed to be the same data without any copies (so it can be modified in place). For converting to NumPy, we should probably add a obj.to_numpy() method on Series/DataFrame that is basically sugar for np.asarray(obj)

datetime64ns (UTC & naive)

NumPy's datetime64 is now always timezone naive, not UTC. I guess you meant the times are converted into UTC before being passed into NumPy?

there's a closely related topic of how many "nuisance" ndarray methods we throw on the ExtensionArray base class to make writing code that's ndarray vs. extensionarray agnostic easier

My preference would be to make the base class as abstract as possible, only implementing methods that are actually used by pandas. This will minimize the effort required to author new ExtensionArray classes, rather than requiring that authors override pre-existing methods. Extension arrays should be a low-level interface -- we already have the high-level interface for extension arrays in Series/DataFrame.

@chris-b1

Contributor

commented Mar 1, 2018

I think my preferred option would be to make obj.values always return the underlying ExtensionArray, guaranteed to be the same data without any copies (so it can be modified in place). For converting to NumPy, we should probably add a obj.to_numpy() method on Series/DataFrame that is basically sugar for np.asarray(obj)

+1 on this. I think the break in compatibility is less painful than the inconsistency and current situation of, e.g. PeriodIndex.values having very surprising performance costs.

xref to #15750 - was about changing the return type of tz-aware datetimes to boxed scalars.

@TomAugspurger

Contributor Author

commented Mar 1, 2018

I guess you meant the times are converted into UTC before being passed into NumPy?

Yep

I think I share the preference for .values being the actual array backing the data (even though that's not always possible with DataFrame.values).

Also +1 for a to_numpy method on Series / Index / Frame, with the same copy vs. view behavior as asarray.

@TomAugspurger

Contributor Author

commented Apr 4, 2018

Are people OK with Series[datetime64ns].values being an ndarray, but Series[datetime64ns, tz].values an ExtensionArray? That seems potentially confusing.

@jreback

Contributor

commented Apr 4, 2018

it has been this way for a long time
we should change the former to be honest - but last time there were some objections

@TomAugspurger

Contributor Author

commented Apr 4, 2018

it has been this way for a long time

What do you mean by that?

we should change the former to be honest - but last time there were some objections

Why would we change tz-naive to return anything other than a datetime64[ns] ndarray?

@chris-b1

Contributor

commented Apr 4, 2018

As stated above I'm OK with the inconsistency, but see where it could be surprising.

Another option would be to deprecate .values in favor of an explicit to_numpy() and another accessor that always returns the backing array, numpy or not, something like Series.array?

@TomAugspurger

Contributor Author

commented Apr 4, 2018

As stated above I'm OK with the inconsistency, but see where it could be surprising.

👍 . I wanted to get some explicit thoughts on that one, since I didn't initially comprehend that two seemingly similar types could have different outcomes.

another accessor that always returns the backing array

Yes. We'll want something like that. Internally, we've used ._values and ._ndarray_values. I like the name array.

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018

@TomAugspurger

Contributor Author

commented Oct 26, 2018

Let's start with goals and work backwards to an API

  1. I want the actual array that's backing this Series / Index.
    I cannot afford a copy.
  2. I want an ndarray that holds the same values as the backing array.
    This ndarray may not be the actual one backing the Series / Index,
    and getting it may involve some data copying / coercion. I'm willing
    to lose some information (categorical types, timezone, etc.), but I
    need to be able to unambiguously identify a value (i.e. I want the
    category 'a', not the code 0. Not sure about tzinfo here...).
  3. I want some kind of array that represents the data backing the array.
    I'd really like to avoid data copying / coercion (think Categorical.codes,
    PeriodArray.asi8, etc). I'm willing to lose information to avoid copying
    data unnecessarily.
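
The three goals can be sketched with a categorical Series (._values is a private attribute at this point, used here only to illustrate goal 1):

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.Categorical(["a", "b", "a"]))

# Goal 1: the actual array backing the container, no copy. Internally this
# is ._values (private; shown only for illustration).
backing = ser._values
assert isinstance(backing, pd.Categorical)

# Goal 2: an ndarray with unambiguous values: the category labels, not the
# codes. This coerces to object dtype and may copy.
lossy = np.asarray(ser)

# Goal 3: a cheap raw representation, losing information to avoid a copy
# (think Categorical.codes, PeriodArray.asi8).
codes = backing.codes
```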

I think getting 1 and 2 right are necessary. 3 would be a nice thing to have, but I wouldn't
consider it a blocker for 0.24.0.

I also think that "fixing" .values isn't something we can do at this point. The
current behavior is inconsistent between EAs, and adding new EAs would force us to
break API or break consistency with Categorical.

So, I propose avoiding that fight. Let's use new names to achieve the behavior we want.
This seems important enough to warrant the API bloat, and I think we can remove mentions
of .values in the docs, so we're net +2 properties here.


Here's a concrete proposal. I still need to go through all our EAs to figure out an
exact set of behavior (datetime64[ns, tz] may be particularly troublesome).

@property
def array(self) -> Union[ndarray, ExtensionArray]:
    """Return the array backing this Series or Index.

    Examples
    --------
    >>> arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')
    >>> ser = pd.Series(arr)
    >>> ser.array
    <PeriodArray>
    ['2000', '2001']
    Length: 2, dtype: period[A-DEC]
    """
    return self._values


@property
def numpy_array(self) -> ndarray:
    """Return a NumPy array of this object's values.

    This may or may not require a copy or coercion of values.
    For dtypes that can be represented by NumPy, this will be a view on
    the actual values. For ExtensionArrays, this will likely be an
    object-dtype ndarray that losslessly represents the values.

    Examples
    --------
    >>> arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')
    >>> ser = pd.Series(arr)
    >>> ser.numpy_array
    array([Period('2000', 'A-DEC'), Period('2001', 'A-DEC')], dtype=object)
    """
    return np.asarray(self._values)


@property
def ndarray_values(self) -> ndarray:
    """Return a NumPy array representing the values in this object.

    This should be faster to compute than ``self.numpy_array``, but will
    require additional context to interpret.

    Examples
    --------
    >>> arr = pd.core.arrays.period_array(['2000', '2001'], freq='A')
    >>> ser = pd.Series(arr)
    >>> ser.ndarray_values
    array([30, 31])
    """
    return self._ndarray_values

@shoyer

Member

commented Oct 26, 2018

@TomAugspurger I like this general idea, but let me suggest two refinements:

  • numpy_array should be a to_numpy() method (because the NumPy array is sometimes created on the fly)
  • Instead of ndarray_values, call it something slightly more descriptive -- maybe raw_array or base_array? This property also might not be defined on all extension arrays, e.g., if they are written on top of arrow.
@jreback

Contributor

commented Oct 26, 2018

why would this not be on top of an arrow-backed array? the point of EA is to completely hide this detail

@TomAugspurger

Contributor Author

commented Oct 26, 2018

numpy_array should be a to_numpy() method (because the NumPy array is sometimes created on the fly)

I was viewing this as a .values replacement, which is why I went with a property initially. But if I were designing this from scratch I would certainly make it a method... I'm not sure how to balance these two. I think right now a method would be preferable here, but I could go either way.

Instead of ndarray_values, call it something slightly more descriptive -- maybe raw_array or base_array?

Mmm, yes that's a good point... I would need to look at this. In several places, like in our indexing engines, I think we really need an ndarray right now.

why would this not be on top of an arrow-backed array? the point of EA is to completely hide this detail

I'm not sure what you mean. Were you responding to my proposal or Stephan's comments?

@jreback

Contributor

commented Oct 26, 2018

I was responding to Stephan

an arrow-backed extension array is nothing special as far as pandas is concerned

@shoyer

Member

commented Oct 26, 2018

an arrow-backed extension array is nothing special as far as pandas is concerned

You can turn an arrow-backed extension array into a NumPy array (possibly with a copy) or you can access it as an extension array, but there isn't necessarily any equivalent of a "raw numpy array" underlying the values that you can directly mutate to change the values in a pandas.Series.

@jreback

Contributor

commented Oct 26, 2018

@shoyer of course, but how does this actually matter? again, the actual implementation of the EA is irrelevant, except when we ask it to give us a raw set of values, and even that is transparent

@TomAugspurger

Contributor Author

commented Oct 26, 2018

I think whether or not this has to be an ndarray depends on how it's intended to be used. Is the primary motivation

  1. a fast (but maybe not zero copy) representation of the data as an ndarray, so that I can do ndarray-like things to it? Or,
  2. A fast array-like representation of the data, no copies allowed.

I was proposing the first, hence the name ndarray_values. But perhaps the second is what people want? This one is kind of ambiguous.

Right now, we seem to use IndexOpsMixin._ndarray_values a lot in the indexing engines. I suspect that needs to be an ndarray, and not an Arrow array (or buffer), but I haven't looked.

@shoyer

Member

commented Oct 26, 2018

  1. A fast array-like representation of the data, no copies allowed.

This is definitely the use-case I had in mind. This can be convenient, e.g., for cases where you want to modify the NumPy array in place.

Note that this is definitely possible for some but not all arrow-based arrays. For example, if you have integers or floats without any missing values.

  1. a fast (but maybe not zero copy) representation of the data as an ndarray, so that I can do ndarray-like things to it?

I think this is covered by to_numpy()?

@TomAugspurger

Contributor Author

commented Oct 26, 2018

  1. a fast (but maybe not zero copy) representation of the data as an ndarray, so that I can do ndarray-like things to it?

I think this is covered by to_numpy()?

Not the "fast" requirement though. For e.g. PeriodIndex, PeriodIndex.to_numpy() would be an ndarray of objects, which is expensive to create.

My last argument in favor of ndarray_values -> ndarray over a raw_array -> ArrayLike is that the .raw_array use-case isn't clear to me. The return type of .raw_array would be deliberately broad (since ExtensionArray makes no claims about the actual physical storage). So to use it, you would have to know something about what the actual type is in your case. And if you're doing something specific to your array type, why not do series.array.<type-specific-thing>?

The usage for ndarray_values is a bit clearer to me. If you have some algorithm that has to operate on ndarrays, you may be able to use the algorithm on ndarray_values and then reconstruct the real result using additional information that you've held onto separately (think factorize, sorting).

As a final note, I think this ndarray_values / raw_array is the fuzziest concept out of the 3, and should not necessarily block progression on .to_numpy() (method or property) and .array.
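
The "operate on the raw ndarray, then reconstruct" pattern can be sketched with Categorical codes (a stable argsort standing in for any ndarray-only algorithm):

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["a", "b"])

# Run an ndarray-only algorithm (here: a stable argsort) on the raw codes...
order = np.argsort(cat.codes, kind="stable")

# ...then reconstruct the real result from information held onto separately
# (the categories).
sorted_cat = pd.Categorical.from_codes(cat.codes[order],
                                       categories=cat.categories)
```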

@shoyer

Member

commented Oct 26, 2018

My last argument in favor of ndarray_values -> ndarray over a raw_array -> ArrayLike is that the .raw_array use-case isn't clear to me. The return type of .raw_array would be deliberately broad (since ExtensionArray makes no claims about the actual physical storage). So to use it, you would have to know something about what the actual type is in your case. And if you're doing something specific to your array type, why not do series.array.<type-specific-thing>?

I totally agree here, I was just proposing a different name for the same use-case:

@property
def raw_numpy_array_values(self) -> np.ndarray:
    # raises an error if the extension array doesn't support it
    ...
@TomAugspurger

Contributor Author

commented Nov 5, 2018

One last question here... I've implicitly assumed that ndarray_values / to_numpy are 1-D ndarrays. Is that a safe assumption? I could see a case for e.g. IntervalArray returning a 2-D NumPy array, or a structured array with left and right fields.

I don't think that allowing this to be 2-D will be that useful in practice. I suspect that places using these arrays are likely to expect 1-D arrays. But I figured I'd throw this out there so we can at least reject it, and document that they should be 1-D.

@jorisvandenbossche

Member

commented Nov 6, 2018

Personally I would leave the ndarray_values or raw_numpy_array_values idea out of the public API on the containers. This concept will inherently be inconsistent between EAs, as it will only be possible for some of the EAs. As @shoyer noted this concept might not be possible for certain arrow-backed EAs, but another typical example is our own IntervalArray which is backed by two ndarrays (left/right).

Of course, as @shoyer mentions, we can raise an error for this attribute if there is no clear no-copy option available. But then a user would need to be aware of the specific data type/EA that is being used, so then why not use EA-specific API to do this?
For example for this mentioned use case:

If you have some algorithm that has to operate on ndarrays, you may be able to use the algorithm on ndarray_values and then reconstruct the real result using additional information that you've held onto separately (think factorize, sorting).

and if you have a PeriodArray: you know how a PeriodArray works, it consists of integers and a freq, so you can use the public PeriodArray.asi8 to access those integers, do something with them, and then reconstruct a PeriodArray from the integers and the freq.

I would say that each ExtensionArray that wants to enable such manipulation can provide its own access to the underlying data, because this will differ for each array anyhow? (eg for PeriodArray, you already need to know you need to keep track of freq to reconstruct it, so then you can also use a specific attribute to access the data?)
(and of course, having a somewhat consistent API eg across all datetime-like arrays is certainly nice, but that is something we can ensure for our own EAs)
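
That PeriodArray round-trip can be sketched as follows (using PeriodIndex, whose .asi8 already exposes the ordinals, and Period(ordinal=...) for the reconstruction):

```python
import pandas as pd

pi = pd.period_range("2000-01-01", periods=3, freq="D")

# Access the underlying integer ordinals without boxing each Period.
ordinals = pi.asi8                 # plain int64 ndarray

# Do some ndarray-only work, e.g. shift everything forward one day...
shifted = ordinals + 1

# ...then reconstruct, using the freq we kept track of separately.
result = pd.PeriodIndex([pd.Period(ordinal=int(o), freq=pi.freqstr)
                         for o in shifted])
```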

@TomAugspurger

Contributor Author

commented Nov 6, 2018

Fair point about TimedeltaArray. I forgot about freq (and the future proofing is nice).

For MultiIndex.to_numpy(), I was thinking a 1D array of tuples

In [10]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2)])

In [11]: idx.to_numpy()
Out[11]: array([('a', 1), ('a', 2)], dtype=object)

But a 2d array is also reasonable (depending on the expected use case)...

@jbrockmendel

Member

commented Nov 7, 2018

I think all those questions about _ndarray_values are important to discuss, but not fully relevant for this thread about the public API.

Fair enough. FWIW I was referring to downstream use cases, not pandas internal usage.

@TomAugspurger

Contributor Author

commented Nov 8, 2018

To update the table at
#19954 (comment), the proposal is now

| dtype           | array          | to_numpy                        | np.array(...)                  |
|-----------------|----------------|---------------------------------|--------------------------------|
| datetime64[ns]  | DatetimeArray  | ndarray[datetime64[ns]] (view)  | ndarray[datetime64[ns]] (view) |
| DatetimeTZDtype | DatetimeArray  | ndarray[object] (Timestamp)     | ndarray[datetime64[ns]] (view) |
| timedelta64[ns] | TimedeltaArray | ndarray[timedelta64[ns]] (view) | ndarray[timedelta64[ns]] (view)|

which is pleasingly symmetric. OK by everyone? And if we allow kwargs for to_numpy, the to_numpy() for datetime64[ns, tz] could be toggled between an ndarray of timestamps (object) and an ndarray converted to UTC and made tz-naive (datetime64[ns]).

I suppose we should reconsider what the default is... Should Series[datetime64[ns, tz]].to_numpy() be an ndarray of timestamps, or an ndarray[datetime64[ns]], where we convert to UTC and drop the tz? We have .array for lossless, dtype-preserving stuff, so why slow down to_numpy() with the object-dtype conversion?
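
The two candidate defaults can be illustrated with conversions that already exist (to_numpy() itself is still hypothetical at this point in the thread):

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2017", periods=1, tz="US/Eastern"))

# Candidate 1: lossless but slow, an object ndarray of tz-aware Timestamps.
as_objects = np.asarray(ser, dtype=object)

# Candidate 2: fast but lossy, convert to UTC and drop the tz, giving a
# plain datetime64[ns] ndarray.
utc_naive = ser.dt.tz_convert("UTC").dt.tz_localize(None)
as_m8 = np.asarray(utc_naive)
```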

@jorisvandenbossche

Member

commented Nov 8, 2018

I think all those questions about _ndarray_values are important to discuss, but not fully relevant for this thread about the public API.

Fair enough. FWIW I was referring to downstream use cases, not pandas internal usage.

It's a private attribute, so in principle there should be no downstream use case right now, but you are right, there could of course be use cases, which are good to think about when deciding whether to make it public or not.
It's for the internal usage that we still need to define it better though, even if we don't make it public (discussed here), so for that I will try to make a new issue.

@jreback

Contributor

commented Nov 8, 2018

+1 on #19954 (comment)

a MultiIndex.to_numpy() should be an ndarray of tuples, similar to what .values does now. If you, for example, wanted an expand=True kwarg to make that return a 2-D ndarray, I wouldn't object.

I think the discussion for _ndarray_values is quite relevant here. There are too many things going on. What exactly are the guarantees that ._ndarray_values has that are not replicable by ._values?

a bit -0 on .array (property), why not .to_array(), which gives way more flexibility going forward.

@jorisvandenbossche

Member

commented Nov 8, 2018

We indeed need to decide on the default behaviour of to_numpy. The "maximally information-retaining" one (so objects for tz-aware datetimes)?
In that sense, it is different from np.array(..) (i.e. __array__, which could also be the default behaviour of to_numpy), so it might be good to add that to the table.

Having the default to_numpy() without kwargs be identical to np.array(..) gives some consistency which might be nice, but of course puts the distinct behaviour (eg objects, the case for which you might want to use to_numpy) behind a keyword.

which is pleasingly symmetric.
[from the table] datetime64[ns, tz]

To be clear this is not a numpy dtype (I would make that clearer in the formatting), so in that sense the table is not fully symmetric.

@TomAugspurger

Contributor Author

commented Nov 8, 2018

There are too many things going on.

That's part of why I want to ignore it for now :)

What exactly are the guarantees that ._ndarray_values has that is not replicable by ._values?

._values and ._ndarray_values are quite different. For e.g. Series[category], ._ndarray_values is an ndarray of codes; ._values (now being named .array) is the Categorical.

a bit -0 on .array (property),

People seem to like the property. And if .array is supposed to be "the actual array backing this Series" I don't think any keywords make sense. We're just handing you the array, nothing else.

@TomAugspurger

Contributor Author

commented Nov 8, 2018

To be clear this is not a numpy dtype (I would make that clearer in the formatting), so in that sense the table is not fully symmetric.

Changed to DatetimeTZDtype

@jorisvandenbossche

Member

commented Nov 8, 2018

I think the discussion for _ndarray_values is quite relevant here. There are too many things going on. What exactly are the guarantees that ._ndarray_values has that are not replicable by ._values?

Again, this discussion is for sure very relevant, and one we need to have. But I think it is orthogonal enough to the discussion about public attributes to keep it separate and not further complicate this already complex issue thread.

a bit -0 on .array (property), why not .to_array(), which gives way more flexibility going forward.

The idea is that .array always simply gives the underlying array-like (ndarray or EA) as it is actually stored in the container (series or index). So that feels like a property to me, and there is not really a reason to have it as a method?

@TomAugspurger

Contributor Author

commented Nov 8, 2018

I added np.array(...) to the table, which makes my earlier proposal for Series[datetime64[ns, tz]] stand out a bit. np.array on that returns the array converted to UTC and tzinfo dropped. We can't change that for backwards compatibility, so should we make .to_numpy() match it?

@jorisvandenbossche

Member

commented Nov 8, 2018

Opened #23565 for the discussion on _ndarray_values

@pandas-dev pandas-dev deleted a comment from TomAugspurger Nov 8, 2018

@jreback

Contributor

commented Nov 8, 2018

The idea is that .array always simply gives the underlying array-like (ndarray or EA) as it is actually stored in the container (series or index). So that feels like a property to me, and there is not really a reason to have it as a method?

a property just boxes you in as you can never add arguments. .to_array() is much more friendly.

@jreback

Contributor

commented Nov 8, 2018

why is np.array() any different than .to_numpy()? adding more and more options is just going to guarantee confusion. we need to have the simplest possible public API, sometimes even making conversions expensive.

@jorisvandenbossche

Member

commented Nov 8, 2018

a property just boxes you in as you can never add arguments. .to_array() is much more friendly.

But why would you want an argument for it? Can you give a use case? There is only one array stored in the series/index container, and it simply returns that one (without making a copy, without any modification)

is np.array() any different than .to_numpy()?

If we do the last change that Tom mentioned above (#19954 (comment)), there might be no difference in the default behaviour, but to_numpy would have the flexibility of having custom keywords to eg keep timezone information.
Of course, if that is our only use case, you can also get that with np.array(.., dtype=object) ..

Do we have other use cases where multiple ndarrays are possible (except for the datetime tz case) ?

@jreback

Contributor

commented Nov 8, 2018

But why would you want an argument for it? Can you give a use case? There is only one array stored in the series/index container, and it simply returns that one (without making a copy, without any modification)

this is exactly the point! I don't know now. .values boxed us in for so many years, you want to repeat this???

@jorisvandenbossche

Member

commented Nov 8, 2018

.values boxed us in for so many years, you want to repeat this???

Can you explain what you mean with "boxed us in" ?

@TomAugspurger

Contributor Author

commented Nov 8, 2018

I think part of the idea is to not have arguments. .array is to provide the thing boxed by the series / index, full stop.

why is np.array() any different than .to_numpy()?

I think we're moving to it matching by default.

Do we have other use cases where multiple ndarrays are possible (except for the datetime tz case) ?

Not saying this is a good idea, but potentially a 2-D array from an IntervalArray.

@TomAugspurger

Contributor Author

commented Nov 8, 2018

Maybe we can pause this conversation until this afternoon :)

@jreback

Contributor

commented Nov 8, 2018

If we do the last change that Tom mentioned above (#19954 (comment)), there might be no difference in the default behaviour, but to_numpy would have the flexibility of having custom keywords to eg keep timezone information.
Of course, if that is our only use case, you can also get that with np.array(.., dtype=object) ..

This again offers too many options.

@jreback

Contributor

commented Nov 8, 2018

I think we're moving to it matching by default.

great

@jreback

Contributor

commented Nov 8, 2018

Can you explain what you mean with "boxed us in" ?

why do you think we have a problem converting Datetime w/tz now? because we can't touch .values in any way. This is boxed in. We have NO choice but to return a single representation no matter what. Imagine if .values(tz=True) or whatever was available; you would have a lot more flexibility.

A method offers possibilities, a property precludes them.

@jbrockmendel

Member

commented Nov 8, 2018

.array is to provide the thing boxed by the series / index, full stop.

Suppose we were to implement RangeArray to back RangeIndex (or even range-like DatetimeArray/Index). Then there is nothing being boxed. I don't have a real opinion here, just sayin'

@jorisvandenbossche

Member

commented Nov 8, 2018

Suppose we were to implement RangeArray to back RangeIndex. Then there is nothing being boxed.

Then it is the RangeArray that is being boxed, and thus what would be returned?

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Nov 11, 2018

API: Public data attributes for EA-backed containers
This adds two new methods for working with EA-backed Series / Index.

- `.array -> Union[ExtensionArray, ndarray]`: the actual backing array
- `.to_numpy() -> ndarray`: A NumPy representation of the data

`.array` is always a reference to the actual data stored in the container.
Updating it inplace (not recommended) will be reflected in the Series (or
Index for that matter, so really not recommended).

`to_numpy()` may (or may not) require data copying / coercion.

Closes pandas-dev#19954

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 27, 2018
