Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review/Overhaul __array__ and view methods #23569

Closed
jbrockmendel opened this issue Nov 8, 2018 · 12 comments

Comments

Projects
None yet
5 participants
@jbrockmendel
Copy link
Member

commented Nov 8, 2018

#23524 fixes a problem with DatetimeIndex.__array__; we should do a thorough check of the __array__ methods on other classes to make sure (and test) that they make sense

@jorisvandenbossche jorisvandenbossche changed the title Review/Overhauld __array__ and view methods Review/Overhaul __array__ and view methods Nov 8, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 9, 2018

So question is what do we want to support in __array__ (for the datetime-like EAs):

  • default np.array(EA) will return an array of M8/m8 dtype, compatible with what it does for Series/Index (on master this currenlty gives object dtype, due to using the default iteration), and object array for PeriodArray
  • np.array(EA, dtype=object) converts to object array of pandas Timestamp/Timedelta scalars.

But do we also want to support np.array(EA, dtype=int) to convert to i8/ordinals ?

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Nov 9, 2018

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Nov 9, 2018

But do we also want to support np.array(EA, dtype=int) to convert to i8/ordinals ?

For datetime/timedelta, this will happen automatically if we do nothing. So the only other option is to raise an error, but that does not seem necessary I think.
The question is still whether to explicitly enable it for PeriodArray.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2018

Sorry I missed this earlier. I'm concerned about the default behavior on DatetimeIndex.__array__ and Series[datetime64[ns, tz]] with tz-aware data. Both of these convert to UTC, drop the tz, and return an M8[ns]-dtype array

In [7]: idx = pd.date_range('2000', periods=4, tz='US/Central')

In [8]: ser = pd.Series(idx)

In [9]: np.asarray(ser)
Out[9]:
array(['2000-01-01T06:00:00.000000000', '2000-01-02T06:00:00.000000000',
       '2000-01-03T06:00:00.000000000', '2000-01-04T06:00:00.000000000'],
      dtype='datetime64[ns]')

In [10]: np.asarray(idx)
Out[10]:
array(['2000-01-01T06:00:00.000000000', '2000-01-02T06:00:00.000000000',
       '2000-01-03T06:00:00.000000000', '2000-01-04T06:00:00.000000000'],
      dtype='datetime64[ns]')

This unfortunately conflicts with __iter__.

In [11]: list(ser)
Out[11]:
[Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')]

In [12]: list(idx)
Out[12]:
[Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')]

Do we care about that (I do)? Should we deprecate the current behavior?

diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py
index e731dd33f..0f0577b1a 100644
--- a/pandas/core/arrays/datetimes.py
+++ b/pandas/core/arrays/datetimes.py
@@ -404,6 +404,9 @@ class DatetimeArrayMixin(dtl.DatetimeLikeArrayMixin,
     # Array-Like / EA-Interface Methods
 
     def __array__(self, dtype=None):
+        if dtype is None:
+            warnings.warn("bad", FutureWarning)
+            dtype = object
         if is_object_dtype(dtype):
             return np.array(list(self), dtype=object)
         elif is_int64_dtype(dtype):

usage:

In [6]: np.asarray(idx)
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/datetimes.py:408: FutureWarning: bad
  warnings.warn("bad", FutureWarning)
Out[6]:
array([Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')],
      dtype=object)

In [7]: np.asarray(ser)
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/datetimes.py:408: FutureWarning: bad
  warnings.warn("bad", FutureWarning)
Out[7]:
array([Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')],
      dtype=object)
@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2018

As an example of the kind of strange knock-on effects this has, see https://github.com/pandas-dev/pandas/pull/23601/files#diff-425b4da47d01dc33d86c5c697e196b70R1128.

Ideally we'd just write array = np.asarray(values) there, but we had to special case categorical there, since it dispatches to .categories.__array__, which did the "wrong" thing for tz-aware datetime data.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 4, 2018

I agree here. np.asarray() should preserve dtype if possible. so ok with making this consistent.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Dec 8, 2018

Is changing DatetimeIndex__array__ and `Series[datetime64[ns, tz]] to be warn a blocker for 0.24.0 (I think it should be).

I have a branch started. Just need to go through and fix up all the warnings.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jan 2, 2019

Picking this up now. I'm it's an annoyingly large diff to get things through.

I'm going with @jorisvandenbossche's suggestion to

  1. Make DTA.__array__ have the future behavior now (tz-aware -> object-dtype ndarray)
  2. Implement the deprecation in DatetimeIndex.__array__ and Series.__array__.

This proved to be much easier that doing the deprecation that DatetimeArray.__array__, and modifying the callers to pass dtype= to avoid the warning / avoid changing behavior. In some cases, it was hard to determine what dtype should be passed.

Unfortunately, Series.__array__ uses Series.get_values, so I need to plumb the dtype keyword all the way through to internals. I think I had to add the dtype keyword to get_values and to_dense as well. It's unavoidable I think. Will try to have a PR later today so people can take a look.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jan 3, 2019

Unfortunately, Series.array uses Series.get_values

In light of deprecating Series.get_values, we can directly use the internals here (or, do np.array(Series.array) ?)

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2019

I didn't realize Series.get_values was being deprecated.

But that's a decent idea I'll try exploring. IIUC the main purpose of get_values was special handling for Categorical and Sparse at the block level, which should no longer be necessary.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jan 3, 2019

I didn't realize Series.get_values was being deprecated.

I don't think there is a PR, but it certainly is on the wishlist :)

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Jan 3, 2019

See #19617. I think the discussion there is mainly resolved as we now have the explicit .array and np.array/.to_numpy() to cover the two use cases.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2019

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 3, 2019

DEPR: __array__ for tz-aware Series/Index
This deprecates the current behvior when converting tz-aware Series
or Index to an ndarray. Previously, we converted to M8[ns], throwing
away the timezone information. In the future, we will return an
object-dtype array filled with Timestamps, each of which has the correct
tz.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [3]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
        To accept the future behavior, pass 'dtype=object'.
        To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[3]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```

xref pandas-dev#23569

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 3, 2019

DEPR: __array__ for tz-aware Series/Index
This deprecates the current behvior when converting tz-aware Series
or Index to an ndarray. Previously, we converted to M8[ns], throwing
away the timezone information. In the future, we will return an
object-dtype array filled with Timestamps, each of which has the correct
tz.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [3]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
        To accept the future behavior, pass 'dtype=object'.
        To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[3]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```

xref pandas-dev#23569

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 3, 2019

DEPR: __array__ for tz-aware Series/Index
This deprecates the current behvior when converting tz-aware Series
or Index to an ndarray. Previously, we converted to M8[ns], throwing
away the timezone information. In the future, we will return an
object-dtype array filled with Timestamps, each of which has the correct
tz.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [3]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
        To accept the future behavior, pass 'dtype=object'.
        To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[3]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```

xref pandas-dev#23569
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.