Support NDFrame.shift with EAs #22387

TomAugspurger · 2018-08-16T16:06:27Z

Uses take internally.

One question: do we want to add shift to the interface, and dispatch to that? This just does the right thing at the block level, by writing a shift that'll work with an ExtensionArray.

CategoricalBlock still overrides .shift to call Categorical.shift, which uses a slightly faster implementation based around the codes and np.roll (~1.5-2x faster based on microbenchmarks)`.

We could add move ExtensionBlock.shift to ExtensionArray.shift, and just call that from ExtensionBlock. Then we can delete CategoricalBlock.shift.

I'm 50/50 on whether it's worth adding to the interface.

Uses take internally. Closes pandas-dev#22386

jreback · 2018-08-16T17:06:42Z

i think it’s worth adding to the interface itself

jorisvandenbossche · 2018-08-16T19:12:16Z

Wouldn't it be possible to do this without take, but with slicing?

Something like (very crude implementation for positive shifts):

def shift(a, p):
    a_sliced = a[:-p]
    a_NA = a._from_sequence([a.dtype.na_value]*p, dtype=a.dtype)
    return a._concat_same_type([a_NA, a_sliced])

Speed wise I would expect this to be very similar to the current Categorical.shift (which only uses the np.roll to basically do this instead of manually concatenating both parts)

codecov · 2018-08-16T19:25:03Z

Codecov Report

Merging #22387 into master will decrease coverage by <.01%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master   #22387      +/-   ##
==========================================
- Coverage   92.05%   92.05%   -0.01%     
==========================================
  Files         169      169              
  Lines       50733    50743      +10     
==========================================
+ Hits        46702    46711       +9     
- Misses       4031     4032       +1

Flag	Coverage Δ
#multiple	`90.46% <91.66%> (-0.01%)`	⬇️
#single	`42.24% <16.66%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/internals/blocks.py	`93.84% <100%> (ø)`	⬆️
pandas/core/arrays/base.py	`88% <90%> (+0.14%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68273a7...c5b556d. Read the comment docs.

jorisvandenbossche · 2018-08-16T19:28:58Z

Speed wise I would expect this to be very similar to the current Categorical.shift

Categorical was apparently a very bad example as there it is like 5x slower than the existing Categorical.shift, but that is more because generating the missing values (_from_sequence) and _concat_same_type are very slow for Categorical (and could certainly be optimized), for IntegerArray it was as fast as I expected.

TomAugspurger · 2018-08-16T19:38:18Z

That seems preferable to me, as it avoids allocating the indexer array.

TomAugspurger · 2018-08-16T19:42:26Z

pandas/core/arrays/base.py

+        shifted : ExtensionArray
+        """
+        if periods == 0:
+            return self.copy()


Thoughts on this? non-zero periods will necessarily be a copy.

Although Series.shift(0) doesn't make a copy, so let's follow that?

I personally find that strange behaviour to not copy then (it would be more logical to know that the result is always a copy, regardless the value of periods IMO)

Opened #22397 for that. Will leave this as is then.

TomAugspurger · 2018-08-17T11:44:21Z

Just as a note, I couldn't quite use this shift implementation for SparseArray.

When ExtensionArray.shiftcreates the empty SparseArray, it ends up with NaN for the fill value. When shifting a SparseArray with, say, 0 as the fill value, you end up concatenating the empty (with fill of NaN) and the shifted (with a fill of 0) and that ends up with a fill of NaN.

The root issue is that SparseArray stores fill_value on the array, rather than on the dtype. I haven't looked into whether it should go on the dtype or not, it's not clear to me.

Regardless, I think this is good to go, if we're happy with the slice and concat approach.

jorisvandenbossche · 2018-08-17T11:47:17Z

From that it seems to makes sense to have this on the dtype then (but not that familiar with the rest of the code)

TomAugspurger · 2018-08-17T11:51:37Z

Indeed... On the other hand, it complicates things like equality a bit...

I'll push it onto the dtype and see what breaks.

TomAugspurger · 2018-08-20T11:04:05Z

Any objections to merging?

jreback

shouldn't we be able to change the impl of interval / categorical shift for this?

jreback · 2018-08-20T11:09:26Z

pandas/core/arrays/base.py

@@ -400,6 +400,36 @@ def dropna(self):

        return self[~self.isna()]

+    def shift(self, periods=1):


I think update in the ExtensionArray doc-string?

Currently, in the class docstring we only mention the methods that either needs to implemented (because they raise AbstractMethodError otherwise) or either have a suboptimal implementation because it does the object ndarray roundtrip.
This is not the case here (which is not saying we couldn't also list other methods that can be overriden for specific reasons)

jorisvandenbossche · 2018-08-20T11:27:06Z

shouldn't we be able to change the impl of interval / categorical shift for this?

As I mentioned above (#22387 (comment)), if we implement Categorical.shift as is done here (eg by simply inheriting this base method), it gets much slower because the creating of the NaNs and concatting same typed-categoricals is very slow (at least slower as it should be).
IntervalArray did not yet have any shift implementation.

jorisvandenbossche · 2018-08-20T11:34:20Z

pandas/tests/extension/base/methods.py

+        (0, [0, 1, 2, 3, 4]),
+        (2, [-1, -1, 0, 1, 2]),
+    ])
+    def test_container_shift_negative(self, data, frame, periods, indices):


Very minor comment, but is there a reason this is called test_container_shift_negative? (because it does not only test negative periods, and 'container' seems a bit superfluous). I would simply do test_shift

Leftover from a merger of multiple tests. Will fix the name.

I do test_container_shift, rather than just test_shift, as this test goes through Series(array).shift, rather than directly doing array.shift.

TomAugspurger · 2018-08-20T12:30:13Z

It seems that IntervalArray may need some updates,

In [22]: idx = IntervalArray.from_breaks(pd.date_range('2017', periods=10))

In [23]: idx.shift()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    313             msg = ('Cannot convert Float64Index to dtype {dtype}; integer '
    314                    'values are required for conversion').format(dtype=dtype)
--> 315             raise TypeError(msg)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149

TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion

In [24]: idx = IntervalArray.from_breaks(range(10))

In [25]: idx.shift()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149
--> 318             raise ValueError('Cannot convert NA to integer')
    319         return super(Float64Index, self).astype(dtype, copy=copy)
    320

ValueError: Cannot convert NA to integer

The basic issue is

    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)

The container (ndarray[self.dtype]) can't actually hold self.dtype.na_value (nan).

Does that change our opinion on this implementation?

jorisvandenbossche · 2018-08-20T12:37:59Z

I think that is rather a bug in IntervalArray? (or maybe better a current limitation)

Anyway, if the point is that it (currently) cannot hold NaNs, it can also not implement shift, as it requires NaNs (so don't know if that changes anything about the shift implementation?)

TomAugspurger · 2018-08-20T12:40:33Z

Well, it can implement .shift, as long as it's allowed to change dtypes (upcast from int to float). The current implementation assumes / requires that the dtype not change.

jorisvandenbossche · 2018-08-20T12:42:40Z

Basically, the problem is that for IntervalArray, the data is stored in left and right arrays, and so it depends on the dtype of those arrays whether IntervalArray can have NaNs or not?
So eg float and datetime64 can hold missing values, ints not.

For the datetime error above, that is actually solvable with the current model, because if the data is not all-NaN it works:

In [75]: IntervalArray([pd.Interval(pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-02'))], dtype='interval[datetime64[ns]]')
Out[75]: 
IntervalArray([(2017-01-01, 2017-01-02]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [76]: IntervalArray([pd.Interval(pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-02')), np.nan], dtype='interval[datetime64[ns]]')
Out[76]: 
IntervalArray([(2017-01-01, 2017-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [78]: IntervalArray([np.nan], dtype='interval[datetime64[ns]]').left
...
TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion

jorisvandenbossche · 2018-08-20T12:44:44Z

Well, it can implement .shift, as long as it's allowed to change dtypes

Ah, yes, the same happens with Series([int]).shift() of course.

I think ideally, we should improve the missing values handling on the IntervalArray level (since it's a composite array, we have the ability to do it properly, not requiring such upcasts), but that's certainly out of scope for this PR :-)

jorisvandenbossche · 2018-08-20T12:50:05Z

So using the take implementation, that does already do the upcasting for Interval, so the shift would then work for intervals.

I don't really care much to whether to switch back to that or leave the implementation now as is (with the consequence shift doesn't work for Interval; although given the take precedent, upcasting the dtype seems somewhat consistent.
But I do think we should discuss the missing value handling for IntervalArray independently from this (eg using a mask instead).

TomAugspurger · 2018-08-20T12:58:55Z

I think ideally, we should improve the missing values handling on the IntervalArray level

That's my preference. Opened #22428 for that.

TomAugspurger · 2018-08-22T12:22:01Z

CI is passing now (last one was a random failure).

jreback

lgtm. minor doc-comments.

jreback · 2018-08-22T12:24:28Z

pandas/core/arrays/base.py

+        Shift values by desired number.
+
+        Newly introduced missing values are filled with
+        ``self.dtype.na_value``.


can you add a versionadded tag

jreback · 2018-08-22T12:24:44Z

pandas/core/internals/blocks.py

@@ -2068,6 +2068,12 @@ def interpolate(self, method='pad', axis=0, inplace=False, limit=None,
                                 limit=limit),
            placement=self.mgr_locs)

+    def shift(self, periods, axis=0, mgr=None):
+        # type: (int, Optional[BlockPlacement]) -> List[ExtensionBlock]


can you add a doc-string

[ci skip]

jreback

lgtm. minor comments. merge away

jreback · 2018-08-22T12:33:42Z

pandas/core/internals/blocks.py

+
+        Dispatches to underlying ExtensionArray and re-boxes in an
+        ExtensionBlock.
+        """


bonus points for the Parameters :> (future PR ok too)

jreback · 2018-08-23T10:38:34Z

thanks @TomAugspurger

Support NDFrame.shift with EAs

b29dfc6

Uses take internally. Closes pandas-dev#22386

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 16, 2018

TomAugspurger added this to the 0.24.0 milestone Aug 16, 2018

Updated

c980035

Slice based

c4b0b97

TomAugspurger commented Aug 16, 2018

View reviewed changes

jreback requested changes Aug 20, 2018

View reviewed changes

jorisvandenbossche approved these changes Aug 20, 2018

View reviewed changes

update name

8d404bc

TomAugspurger mentioned this pull request Aug 20, 2018

IntervalArray.shift and missing value handling #22428

Closed

TomAugspurger added 2 commits August 20, 2018 08:02

Doc notes

64f51f4

Merge remote-tracking branch 'upstream/master' into ea-shift

ab901a6

jreback requested changes Aug 22, 2018

View reviewed changes

Added versionadded [ci skip]

c5b556d

[ci skip]

jreback approved these changes Aug 22, 2018

View reviewed changes

jreback merged commit e7fca91 into pandas-dev:master Aug 23, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Support NDFrame.shift with EAs (pandas-dev#22387)

e4718c7

jorisvandenbossche mentioned this pull request Dec 10, 2018

Follow-up - Refactor cythonize geometry series operations geopandas/geopandas#473

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support NDFrame.shift with EAs #22387

Support NDFrame.shift with EAs #22387

TomAugspurger commented Aug 16, 2018

jreback commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

codecov bot commented Aug 16, 2018 •

edited

Loading

jorisvandenbossche commented Aug 16, 2018

TomAugspurger commented Aug 16, 2018

TomAugspurger Aug 16, 2018

TomAugspurger Aug 16, 2018

jorisvandenbossche Aug 16, 2018

TomAugspurger Aug 17, 2018

TomAugspurger commented Aug 17, 2018 •

edited

Loading

jorisvandenbossche commented Aug 17, 2018

TomAugspurger commented Aug 17, 2018

TomAugspurger commented Aug 20, 2018

jreback left a comment

jreback Aug 20, 2018

jorisvandenbossche Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018 •

edited

Loading

jorisvandenbossche Aug 20, 2018

TomAugspurger Aug 20, 2018

TomAugspurger commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

TomAugspurger commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

TomAugspurger commented Aug 20, 2018

TomAugspurger commented Aug 22, 2018

jreback left a comment

jreback Aug 22, 2018

jreback Aug 22, 2018

jreback left a comment

jreback Aug 22, 2018

jreback commented Aug 23, 2018

		@@ -400,6 +400,36 @@ def dropna(self):

		return self[~self.isna()]

		def shift(self, periods=1):

Support NDFrame.shift with EAs #22387

Support NDFrame.shift with EAs #22387

Conversation

TomAugspurger commented Aug 16, 2018

jreback commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

codecov bot commented Aug 16, 2018 • edited Loading

Codecov Report

jorisvandenbossche commented Aug 16, 2018

TomAugspurger commented Aug 16, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Aug 17, 2018 • edited Loading

jorisvandenbossche commented Aug 17, 2018

TomAugspurger commented Aug 17, 2018

TomAugspurger commented Aug 20, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 20, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

TomAugspurger commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

jorisvandenbossche commented Aug 20, 2018

TomAugspurger commented Aug 20, 2018

TomAugspurger commented Aug 22, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Aug 23, 2018

codecov bot commented Aug 16, 2018 •

edited

Loading

TomAugspurger commented Aug 17, 2018 •

edited

Loading

jorisvandenbossche commented Aug 20, 2018 •

edited

Loading