Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NDFrame.shift with EAs #22387

Merged
merged 7 commits into from
Aug 23, 2018
Merged

Conversation

TomAugspurger
Copy link
Contributor

Uses take internally.

Closes #22386

One question: do we want to add shift to the interface, and dispatch to that? This just does the right thing at the block level, by writing a shift that'll work with an ExtensionArray.

CategoricalBlock still overrides .shift to call Categorical.shift, which uses a slightly faster implementation based around the codes and np.roll (~1.5-2x faster based on microbenchmarks)`.

We could add move ExtensionBlock.shift to ExtensionArray.shift, and just call that from ExtensionBlock. Then we can delete CategoricalBlock.shift.

I'm 50/50 on whether it's worth adding to the interface.

Uses take internally.

Closes pandas-dev#22386
@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 16, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Aug 16, 2018
@jreback
Copy link
Contributor

jreback commented Aug 16, 2018

i think it’s worth adding to the interface itself

@jorisvandenbossche
Copy link
Member

Wouldn't it be possible to do this without take, but with slicing?

Something like (very crude implementation for positive shifts):

def shift(a, p):
    a_sliced = a[:-p]
    a_NA = a._from_sequence([a.dtype.na_value]*p, dtype=a.dtype)
    return a._concat_same_type([a_NA, a_sliced])

Speed wise I would expect this to be very similar to the current Categorical.shift (which only uses the np.roll to basically do this instead of manually concatenating both parts)

@codecov
Copy link

codecov bot commented Aug 16, 2018

Codecov Report

Merging #22387 into master will decrease coverage by <.01%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22387      +/-   ##
==========================================
- Coverage   92.05%   92.05%   -0.01%     
==========================================
  Files         169      169              
  Lines       50733    50743      +10     
==========================================
+ Hits        46702    46711       +9     
- Misses       4031     4032       +1
Flag Coverage Δ
#multiple 90.46% <91.66%> (-0.01%) ⬇️
#single 42.24% <16.66%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 93.84% <100%> (ø) ⬆️
pandas/core/arrays/base.py 88% <90%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68273a7...c5b556d. Read the comment docs.

@jorisvandenbossche
Copy link
Member

Speed wise I would expect this to be very similar to the current Categorical.shift

Categorical was apparently a very bad example as there it is like 5x slower than the existing Categorical.shift, but that is more because generating the missing values (_from_sequence) and _concat_same_type are very slow for Categorical (and could certainly be optimized), for IntegerArray it was as fast as I expected.

@TomAugspurger
Copy link
Contributor Author

That seems preferable to me, as it avoids allocating the indexer array.

shifted : ExtensionArray
"""
if periods == 0:
return self.copy()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on this? non-zero periods will necessarily be a copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although Series.shift(0) doesn't make a copy, so let's follow that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find that strange behaviour to not copy then (it would be more logical to know that the result is always a copy, regardless the value of periods IMO)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #22397 for that. Will leave this as is then.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Aug 17, 2018

Just as a note, I couldn't quite use this shift implementation for SparseArray.

When ExtensionArray.shiftcreates the empty SparseArray, it ends up with NaN for the fill value. When shifting a SparseArray with, say, 0 as the fill value, you end up concatenating the empty (with fill of NaN) and the shifted (with a fill of 0) and that ends up with a fill of NaN.

The root issue is that SparseArray stores fill_value on the array, rather than on the dtype. I haven't looked into whether it should go on the dtype or not, it's not clear to me.

Regardless, I think this is good to go, if we're happy with the slice and concat approach.

@jorisvandenbossche
Copy link
Member

From that it seems to makes sense to have this on the dtype then (but not that familiar with the rest of the code)

@TomAugspurger
Copy link
Contributor Author

Indeed... On the other hand, it complicates things like equality a bit...

I'll push it onto the dtype and see what breaks.

@TomAugspurger
Copy link
Contributor Author

Any objections to merging?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we be able to change the impl of interval / categorical shift for this?

@@ -400,6 +400,36 @@ def dropna(self):

return self[~self.isna()]

def shift(self, periods=1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think update in the ExtensionArray doc-string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, in the class docstring we only mention the methods that either needs to implemented (because they raise AbstractMethodError otherwise) or either have a suboptimal implementation because it does the object ndarray roundtrip.
This is not the case here (which is not saying we couldn't also list other methods that can be overriden for specific reasons)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 20, 2018

shouldn't we be able to change the impl of interval / categorical shift for this?

As I mentioned above (#22387 (comment)), if we implement Categorical.shift as is done here (eg by simply inheriting this base method), it gets much slower because the creating of the NaNs and concatting same typed-categoricals is very slow (at least slower as it should be).
IntervalArray did not yet have any shift implementation.

(0, [0, 1, 2, 3, 4]),
(2, [-1, -1, 0, 1, 2]),
])
def test_container_shift_negative(self, data, frame, periods, indices):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor comment, but is there a reason this is called test_container_shift_negative? (because it does not only test negative periods, and 'container' seems a bit superfluous). I would simply do test_shift

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from a merger of multiple tests. Will fix the name.

I do test_container_shift, rather than just test_shift, as this test goes through Series(array).shift, rather than directly doing array.shift.

@TomAugspurger
Copy link
Contributor Author

It seems that IntervalArray may need some updates,

In [22]: idx = IntervalArray.from_breaks(pd.date_range('2017', periods=10))

In [23]: idx.shift()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    313             msg = ('Cannot convert Float64Index to dtype {dtype}; integer '
    314                    'values are required for conversion').format(dtype=dtype)
--> 315             raise TypeError(msg)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149

TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion

In [24]: idx = IntervalArray.from_breaks(range(10))

In [25]: idx.shift()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149
--> 318             raise ValueError('Cannot convert NA to integer')
    319         return super(Float64Index, self).astype(dtype, copy=copy)
    320

ValueError: Cannot convert NA to integer

The basic issue is

    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)

The container (ndarray[self.dtype]) can't actually hold self.dtype.na_value (nan).

Does that change our opinion on this implementation?

@jorisvandenbossche
Copy link
Member

I think that is rather a bug in IntervalArray? (or maybe better a current limitation)

Anyway, if the point is that it (currently) cannot hold NaNs, it can also not implement shift, as it requires NaNs (so don't know if that changes anything about the shift implementation?)

@TomAugspurger
Copy link
Contributor Author

Well, it can implement .shift, as long as it's allowed to change dtypes (upcast from int to float). The current implementation assumes / requires that the dtype not change.

@jorisvandenbossche
Copy link
Member

Basically, the problem is that for IntervalArray, the data is stored in left and right arrays, and so it depends on the dtype of those arrays whether IntervalArray can have NaNs or not?
So eg float and datetime64 can hold missing values, ints not.

For the datetime error above, that is actually solvable with the current model, because if the data is not all-NaN it works:

In [75]: IntervalArray([pd.Interval(pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-02'))], dtype='interval[datetime64[ns]]')
Out[75]: 
IntervalArray([(2017-01-01, 2017-01-02]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [76]: IntervalArray([pd.Interval(pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-02')), np.nan], dtype='interval[datetime64[ns]]')
Out[76]: 
IntervalArray([(2017-01-01, 2017-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [78]: IntervalArray([np.nan], dtype='interval[datetime64[ns]]').left
...
TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion

@jorisvandenbossche
Copy link
Member

Well, it can implement .shift, as long as it's allowed to change dtypes

Ah, yes, the same happens with Series([int]).shift() of course.

I think ideally, we should improve the missing values handling on the IntervalArray level (since it's a composite array, we have the ability to do it properly, not requiring such upcasts), but that's certainly out of scope for this PR :-)

@jorisvandenbossche
Copy link
Member

So using the take implementation, that does already do the upcasting for Interval, so the shift would then work for intervals.

I don't really care much to whether to switch back to that or leave the implementation now as is (with the consequence shift doesn't work for Interval; although given the take precedent, upcasting the dtype seems somewhat consistent.
But I do think we should discuss the missing value handling for IntervalArray independently from this (eg using a mask instead).

@TomAugspurger
Copy link
Contributor Author

I think ideally, we should improve the missing values handling on the IntervalArray level

That's my preference. Opened #22428 for that.

@TomAugspurger
Copy link
Contributor Author

CI is passing now (last one was a random failure).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor doc-comments.

Shift values by desired number.

Newly introduced missing values are filled with
``self.dtype.na_value``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a versionadded tag

@@ -2068,6 +2068,12 @@ def interpolate(self, method='pad', axis=0, inplace=False, limit=None,
limit=limit),
placement=self.mgr_locs)

def shift(self, periods, axis=0, mgr=None):
# type: (int, Optional[BlockPlacement]) -> List[ExtensionBlock]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comments. merge away


Dispatches to underlying ExtensionArray and re-boxes in an
ExtensionBlock.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bonus points for the Parameters :> (future PR ok too)

@jreback jreback merged commit e7fca91 into pandas-dev:master Aug 23, 2018
@jreback
Copy link
Contributor

jreback commented Aug 23, 2018

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants