Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NDFrame.shift with EAs #22387

Merged
merged 7 commits into from Aug 23, 2018

Conversation

Projects
None yet
3 participants
@TomAugspurger
Copy link
Contributor

commented Aug 16, 2018

Uses take internally.

Closes #22386

One question: do we want to add shift to the interface, and dispatch to that? This just does the right thing at the block level, by writing a shift that'll work with an ExtensionArray.

CategoricalBlock still overrides .shift to call Categorical.shift, which uses a slightly faster implementation based around the codes and np.roll (~1.5-2x faster based on microbenchmarks)`.

We could add move ExtensionBlock.shift to ExtensionArray.shift, and just call that from ExtensionBlock. Then we can delete CategoricalBlock.shift.

I'm 50/50 on whether it's worth adding to the interface.

Support NDFrame.shift with EAs
Uses take internally.

Closes #22386

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Aug 16, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 16, 2018

i think it’s worth adding to the interface itself

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 16, 2018

Wouldn't it be possible to do this without take, but with slicing?

Something like (very crude implementation for positive shifts):

def shift(a, p):
    a_sliced = a[:-p]
    a_NA = a._from_sequence([a.dtype.na_value]*p, dtype=a.dtype)
    return a._concat_same_type([a_NA, a_sliced])

Speed wise I would expect this to be very similar to the current Categorical.shift (which only uses the np.roll to basically do this instead of manually concatenating both parts)

@codecov

This comment has been minimized.

Copy link

commented Aug 16, 2018

Codecov Report

Merging #22387 into master will decrease coverage by <.01%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22387      +/-   ##
==========================================
- Coverage   92.05%   92.05%   -0.01%     
==========================================
  Files         169      169              
  Lines       50733    50743      +10     
==========================================
+ Hits        46702    46711       +9     
- Misses       4031     4032       +1
Flag Coverage Δ
#multiple 90.46% <91.66%> (-0.01%) ⬇️
#single 42.24% <16.66%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 93.84% <100%> (ø) ⬆️
pandas/core/arrays/base.py 88% <90%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68273a7...c5b556d. Read the comment docs.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 16, 2018

Speed wise I would expect this to be very similar to the current Categorical.shift

Categorical was apparently a very bad example as there it is like 5x slower than the existing Categorical.shift, but that is more because generating the missing values (_from_sequence) and _concat_same_type are very slow for Categorical (and could certainly be optimized), for IntegerArray it was as fast as I expected.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 16, 2018

That seems preferable to me, as it avoids allocating the indexer array.

shifted : ExtensionArray
"""
if periods == 0:
return self.copy()

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Aug 16, 2018

Author Contributor

Thoughts on this? non-zero periods will necessarily be a copy.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Aug 16, 2018

Author Contributor

Although Series.shift(0) doesn't make a copy, so let's follow that?

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Aug 16, 2018

Member

I personally find that strange behaviour to not copy then (it would be more logical to know that the result is always a copy, regardless the value of periods IMO)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Aug 17, 2018

Author Contributor

Opened #22397 for that. Will leave this as is then.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 17, 2018

Just as a note, I couldn't quite use this shift implementation for SparseArray.

When ExtensionArray.shiftcreates the empty SparseArray, it ends up with NaN for the fill value. When shifting a SparseArray with, say, 0 as the fill value, you end up concatenating the empty (with fill of NaN) and the shifted (with a fill of 0) and that ends up with a fill of NaN.

The root issue is that SparseArray stores fill_value on the array, rather than on the dtype. I haven't looked into whether it should go on the dtype or not, it's not clear to me.

Regardless, I think this is good to go, if we're happy with the slice and concat approach.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 17, 2018

From that it seems to makes sense to have this on the dtype then (but not that familiar with the rest of the code)

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 17, 2018

Indeed... On the other hand, it complicates things like equality a bit...

I'll push it onto the dtype and see what breaks.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

Any objections to merging?

@jreback
Copy link
Contributor

left a comment

shouldn't we be able to change the impl of interval / categorical shift for this?

@@ -400,6 +400,36 @@ def dropna(self):

return self[~self.isna()]

def shift(self, periods=1):

This comment has been minimized.

Copy link
@jreback

jreback Aug 20, 2018

Contributor

I think update in the ExtensionArray doc-string?

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Aug 20, 2018

Member

Currently, in the class docstring we only mention the methods that either needs to implemented (because they raise AbstractMethodError otherwise) or either have a suboptimal implementation because it does the object ndarray roundtrip.
This is not the case here (which is not saying we couldn't also list other methods that can be overriden for specific reasons)

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 20, 2018

shouldn't we be able to change the impl of interval / categorical shift for this?

As I mentioned above (#22387 (comment)), if we implement Categorical.shift as is done here (eg by simply inheriting this base method), it gets much slower because the creating of the NaNs and concatting same typed-categoricals is very slow (at least slower as it should be).
IntervalArray did not yet have any shift implementation.

(0, [0, 1, 2, 3, 4]),
(2, [-1, -1, 0, 1, 2]),
])
def test_container_shift_negative(self, data, frame, periods, indices):

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Aug 20, 2018

Member

Very minor comment, but is there a reason this is called test_container_shift_negative? (because it does not only test negative periods, and 'container' seems a bit superfluous). I would simply do test_shift

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Aug 20, 2018

Author Contributor

Leftover from a merger of multiple tests. Will fix the name.

I do test_container_shift, rather than just test_shift, as this test goes through Series(array).shift, rather than directly doing array.shift.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

It seems that IntervalArray may need some updates,

In [22]: idx = IntervalArray.from_breaks(pd.date_range('2017', periods=10))

In [23]: idx.shift()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    313             msg = ('Cannot convert Float64Index to dtype {dtype}; integer '
    314                    'values are required for conversion').format(dtype=dtype)
--> 315             raise TypeError(msg)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149

TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion

In [24]: idx = IntervalArray.from_breaks(range(10))

In [25]: idx.shift()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-1b2c2192e1e6> in <module>()
----> 1 idx.shift()

~/sandbox/pandas/pandas/core/arrays/base.py in shift(self, periods)
    422             return self.copy()
    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)
    425         if periods > 0:
    426             a = empty

~/sandbox/pandas/pandas/core/arrays/interval.py in _from_sequence(cls, scalars, dtype, copy)
    193     @classmethod
    194     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 195         return cls(scalars, dtype=dtype, copy=copy)
    196
    197     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in __new__(cls, data, closed, dtype, copy, fastpath, verify_integrity)
    138
    139         return cls._simple_new(left, right, closed, copy=copy, dtype=dtype,
--> 140                                verify_integrity=verify_integrity)
    141
    142     @classmethod

~/sandbox/pandas/pandas/core/arrays/interval.py in _simple_new(cls, left, right, closed, copy, dtype, verify_integrity)
    156                 raise TypeError(msg.format(dtype=dtype))
    157             elif dtype.subtype is not None:
--> 158                 left = left.astype(dtype.subtype)
    159                 right = right.astype(dtype.subtype)
    160

~/sandbox/pandas/pandas/core/indexes/numeric.py in astype(self, dtype, copy)
    316         elif is_integer_dtype(dtype) and self.hasnans:
    317             # GH 13149
--> 318             raise ValueError('Cannot convert NA to integer')
    319         return super(Float64Index, self).astype(dtype, copy=copy)
    320

ValueError: Cannot convert NA to integer

The basic issue is

    423         empty = self._from_sequence([self.dtype.na_value] * abs(periods),
--> 424                                     dtype=self.dtype)

The container (ndarray[self.dtype]) can't actually hold self.dtype.na_value (nan).

Does that change our opinion on this implementation?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 20, 2018

I think that is rather a bug in IntervalArray? (or maybe better a current limitation)

Anyway, if the point is that it (currently) cannot hold NaNs, it can also not implement shift, as it requires NaNs (so don't know if that changes anything about the shift implementation?)

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

Well, it can implement .shift, as long as it's allowed to change dtypes (upcast from int to float). The current implementation assumes / requires that the dtype not change.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 20, 2018

Basically, the problem is that for IntervalArray, the data is stored in left and right arrays, and so it depends on the dtype of those arrays whether IntervalArray can have NaNs or not?
So eg float and datetime64 can hold missing values, ints not.

For the datetime error above, that is actually solvable with the current model, because if the data is not all-NaN it works:

In [75]: IntervalArray([pd.Interval(pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-02'))], dtype='interval[datetime64[ns]]')
Out[75]: 
IntervalArray([(2017-01-01, 2017-01-02]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [76]: IntervalArray([pd.Interval(pd.Timestamp('2017-01-01'), pd.Timestamp('2017-01-02')), np.nan], dtype='interval[datetime64[ns]]')
Out[76]: 
IntervalArray([(2017-01-01, 2017-01-02], nan],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [78]: IntervalArray([np.nan], dtype='interval[datetime64[ns]]').left
...
TypeError: Cannot convert Float64Index to dtype datetime64[ns]; integer values are required for conversion
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 20, 2018

Well, it can implement .shift, as long as it's allowed to change dtypes

Ah, yes, the same happens with Series([int]).shift() of course.

I think ideally, we should improve the missing values handling on the IntervalArray level (since it's a composite array, we have the ability to do it properly, not requiring such upcasts), but that's certainly out of scope for this PR :-)

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 20, 2018

So using the take implementation, that does already do the upcasting for Interval, so the shift would then work for intervals.

I don't really care much to whether to switch back to that or leave the implementation now as is (with the consequence shift doesn't work for Interval; although given the take precedent, upcasting the dtype seems somewhat consistent.
But I do think we should discuss the missing value handling for IntervalArray independently from this (eg using a mask instead).

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 20, 2018

I think ideally, we should improve the missing values handling on the IntervalArray level

That's my preference. Opened #22428 for that.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 22, 2018

CI is passing now (last one was a random failure).

@jreback
Copy link
Contributor

left a comment

lgtm. minor doc-comments.

Shift values by desired number.
Newly introduced missing values are filled with
``self.dtype.na_value``.

This comment has been minimized.

Copy link
@jreback

jreback Aug 22, 2018

Contributor

can you add a versionadded tag

@@ -2068,6 +2068,12 @@ def interpolate(self, method='pad', axis=0, inplace=False, limit=None,
limit=limit),
placement=self.mgr_locs)

def shift(self, periods, axis=0, mgr=None):
# type: (int, Optional[BlockPlacement]) -> List[ExtensionBlock]

This comment has been minimized.

Copy link
@jreback

jreback Aug 22, 2018

Contributor

can you add a doc-string

@jreback
Copy link
Contributor

left a comment

lgtm. minor comments. merge away

Dispatches to underlying ExtensionArray and re-boxes in an
ExtensionBlock.
"""

This comment has been minimized.

Copy link
@jreback

jreback Aug 22, 2018

Contributor

bonus points for the Parameters :> (future PR ok too)

@jreback jreback merged commit e7fca91 into pandas-dev:master Aug 23, 2018

4 checks passed

ci/circleci: py27_compat Your tests passed on CircleCI!
Details
ci/circleci: py35_ascii Your tests passed on CircleCI!
Details
ci/circleci: py36_locale Your tests passed on CircleCI!
Details
ci/circleci: py36_locale_slow Your tests passed on CircleCI!
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.