New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: rolling.apply will pass Series to function #20584

Merged
merged 4 commits into from Apr 16, 2018

Conversation

Projects
None yet
5 participants
@jreback
Contributor

jreback commented Apr 2, 2018

closes #5071

@jreback jreback added this to the 0.23.0 milestone Apr 2, 2018

self._check_expanding(expanding_mean, np.mean)
# TODO(jreback), needed to add preserve_nan=False

This comment has been minimized.

@jreback

jreback Apr 2, 2018

Contributor

if anyone is interested to see why this doesn't work, would be hepful, @WillAyd (you have some xp in this area)

This comment has been minimized.

@WillAyd

WillAyd Apr 2, 2018

Member

It looks like rows [20:39] of the series in use here contains NA values. preserve_nan is assuming that those will still be NA after the function application, but isn't it by design with expanding functions that the NA values do not get preserved? Ex:

>>> ser = pd.Series(range(5))
>>> ser[2] = np.nan
>>> ser.expanding().sum()
0    0.0
1    1.0
2    1.0
3    4.0
4    8.0
dtype: float64

So I think this keyword is necessary by design of the expanding function

This comment has been minimized.

@jreback

jreback Apr 2, 2018

Contributor

yeah exactly - so wondering if the current impl (eg master) is actually wrong

This comment has been minimized.

@WillAyd

WillAyd Apr 2, 2018

Member

I haven't looked at the implementation in detail but I'm assuming this ties back to the default argument of min_periods=1 that the Expanding class uses - I think that would make it almost impossible for NA data to appear while expanding.

I'd buy the argument that the default here should match the Rolling impl (i.e. 0) and Expanding should deal with "expected" NAs (i.e. forward looking values in the window that just haven't been reached yet) vs "unexpected" NA (data from a previous entry in the window) in a more robust fashion.

It would be a breaking change; happy to open the ticket if you agree

This comment has been minimized.

@jreback

jreback Apr 2, 2018

Contributor

yeah this might be a bit of an unexplored corner

yeah if i can show a simple example would be great to open an issue

after all expanding is just rolling with the window len of the len of the obj

This comment has been minimized.

@WillAyd

WillAyd Apr 2, 2018

Member

Maybe it shouldn't be though? I.e. should the window size of expanding be i for i in range(N) instead of N for i in range(N)? The difference can be illustrated looking at the third element below (elements 4 and 5 are expected to differ from window size)

>>> ser = pd.Series([0, np.nan, 2, 3, 4])
In [127]: ser.rolling(window=2, min_periods=2).sum()
Out[127]: 
0    NaN
1    NaN
2    NaN  # Doesn't populate here; there aren't 2 non-NA entries at this point
3    5.0
4    7.0
dtype: float64

In [128]: ser.rolling(window=len(ser), min_periods=2).sum()
Out[128]: 
0    NaN
1    NaN
2    2.0  # Populates here, even though this is the first non-NA record and min_periods is 2
3    5.0
4    9.0
dtype: float64
@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Apr 2, 2018

Thoughts on the default being ndarray with a warning that the default will change to raw=False in the future? Then this isn't a breaking change.

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 2, 2018

yeah i could make the keyword default (of None) and then warn that it’s going to change

that’s back compat, though would have to flip it at some point anyhow

@codecov

This comment has been minimized.

codecov bot commented Apr 2, 2018

Codecov Report

Merging #20584 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20584      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         153      153              
  Lines       49279    49295      +16     
==========================================
+ Hits        45259    45272      +13     
- Misses       4020     4023       +3
Flag Coverage Δ
#multiple 90.23% <100%> (-0.01%) ⬇️
#single 41.89% <23.07%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 95.94% <ø> (ø) ⬆️
pandas/core/window.py 96.29% <100%> (+0.02%) ⬆️
pandas/core/arrays/categorical.py 95.78% <0%> (-0.41%) ⬇️
pandas/util/testing.py 84.38% <0%> (-0.21%) ⬇️
pandas/io/pytables.py 92.41% <0%> (-0.05%) ⬇️
pandas/core/internals.py 95.53% <0%> (-0.01%) ⬇️
pandas/core/groupby/groupby.py 92.55% <0%> (ø) ⬆️
pandas/core/frame.py 97.16% <0%> (ø) ⬆️
pandas/core/indexing.py 93.08% <0%> (+0.05%) ⬆️
pandas/core/dtypes/missing.py 92.85% <0%> (+1.78%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d04b746...ab46058. Read the comment docs.

The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=False`` parameter.
This is similar to :func:`DataFame.apply`. This parameter, ``False`` by default allows one to send a ``np.ndarray`` to the applied function, rather than the default of a ``Series``.
This is a change from prior versions, when the applied function would *always* received an ndarray. This allow one to use pandas operations generically. (:issue:`5071`)

This comment has been minimized.

@WillAyd

WillAyd Apr 3, 2018

Member

Minor typo on this line - either "would always receive" or "always received"

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 10, 2018

this is all fixed up. now back-compat (with a warning). if anyone wants to look

cc @WillAyd
@TomAugspurger

@WillAyd

This comment has been minimized.

Member

WillAyd commented Apr 11, 2018

lgtm

@jorisvandenbossche

This is really nice to get it more consistent with DataFrame.apply!

Rolling/Expanding.apply() accepts a ``raw`` keyword to pass a ``Series`` to the function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :func`Series.rolling`, :func:`DataFrame.rolling`, :func`Series.expanding`, :func:`DataFrame.expanding` methods when used with ``.apply()`` have gained a ``raw=None`` parameter.

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

You could actually link directly to the apply method here (which is more correct, as it is not rolling that gained a keyword). I think something like :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>` might do it

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

Can you also add a one-line notice to the Deprecations section?

objects instead.
If you are just applying a NumPy reduction function this will
achieve much better performance.
.. versionadded:: 0.23.0

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

blank line above this one

* ``False`` : passes each row or column as a Series to the
function.
* ``True`` or ``None`` : the passed function will receive ndarray
objects instead.

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

can you indicate something about the deprecation / changing default?

"for .apply().\nIn the future, this will default to "
"False, meaning a Series will be passed to the "
"applied function. Not passing raw, defaults "
"raw=True, meaning a ndarray is passed to the "

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

I find this a bit confusing, as in the future this will not be the default. I would try to rephrase a bit like "Now does this, in future will change to this. Pass this to keep current behaviour, pass this to silence the warning"

@@ -314,7 +314,7 @@ def _center_window(self, result, window):
def aggregate(self, arg, *args, **kwargs):
result, how = self._aggregate(arg, *args, **kwargs)
if result is None:
return self.apply(arg, args=args, kwargs=kwargs)
return self.apply(arg, raw=False, args=args, kwargs=kwargs)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

Will this have any consequences for the user?
(but, on the other hand, I don't think it is worth to add a raw keyword just to be able to deprecate the behaviour and change it more slowly)

This comment has been minimized.

@jreback

jreback Apr 11, 2018

Contributor

no, this is a very special case

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

It's not that a special case, it happens when the user passes a custom function to r.agg(..), which eg is done in the documentation about it: http://pandas.pydata.org/pandas-docs/stable/computation.html#aggregation

So just as with apply, if the user has a function that behaves differently for a Series than an array, this will break.

@@ -955,22 +955,48 @@ def count(self):
----------
func : function
Must produce a single value from an ndarray input

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

Can you update the "ndarray input" here?

# #1850
@pytest.mark.parametrize(
'method', [lambda x: x.rolling(window=2), lambda x: x.expanding()])
def test_apply_future_warning(self, method):

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

parametrize for Series/DataFrame ? (although they maybe take fully the same code path?)

result = df.rolling(2).apply(f, raw=False)
expected = df[1:].reindex_like(df)
tm.assert_frame_equal(result, expected)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

Can you add the case with raw=True and assert that an AttributeError is raised ? (or pass a function that asserts it gets a numpy array)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 11, 2018

Member

And can you do this test (with such a function that requires a pandas object) also with a freq window?

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 11, 2018

While trying this out, I also noticed that rolling apply only seems to work on numeric (int/float) data. Is this a known limitation/bug? (didn't directly find an issue about it):

In [69]: s = pd.Series(list('abcd'))

In [71]: s.rolling(2).apply(lambda x: x+1, raw=True)   
Out[71]: 
0    a         #<-------- for string data, just returns the original series, whathever the function is
1    b
2    c
3    d
dtype: object

In [72]: s = pd.Series(pd.Categorical(list('abcd')))

In [73]: s.rolling(2).apply(lambda x: x+1, raw=True)
Out[73]: 
[a, b, c, d]     #<-------- for categorical data, just returns the original Categorical (not even series)
Categories (4, object): [a, b, c, d]

In [74]: s = pd.Series(pd.timedelta_range("1 days", periods=4))

In [75]: s.rolling(2).apply(lambda x: x+1, raw=True)   #<----- for timedelta/datetime, raises informative error
...
NotImplementedError: ops for Rolling for this dtype timedelta64[ns] are not implemented

(this is a separate issue as this PR, it's already like this in released version)

@jsexauer jsexauer referenced this pull request Apr 11, 2018

Open

DEPR: deprecations from prior versions #6581

0 of 85 tasks complete
@jreback

This comment has been minimized.

Contributor

jreback commented Apr 15, 2018

any comments.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 16, 2018

@jreback can you please work with additional commits, that makes a second review round really a lot easier for me

@jorisvandenbossche

I think we still need to decide about the rolling().agg case (https://github.com/pandas-dev/pandas/pull/20584/files#r180718011). It got (in certain cases of passing a custom function) a ndarray before, which is now changed to a Series.
I don't think it is worth adding a raw keyword here as well (people who need it can use apply), and I am fine with deciding that it is also not worth to deprecate, but then we should do that explicitly (thus also document this API change).

Apart from that, looks good to me, added only some doc comments

@@ -843,6 +872,9 @@ Deprecations
- ``Index.summary()`` is deprecated and will be removed in a future version (:issue:`18217`)
- ``NDFrame.get_ftype_counts()`` is deprecated and will be removed in a future version (:issue:`18243`)
- The ``convert_datetime64`` parameter in :func:`DataFrame.to_records` has been deprecated and will be removed in a future version. The NumPy bug motivating this parameter has been resolved. The default value for this parameter has also changed from ``True`` to ``None`` (:issue:`18160`).
- :func:`Series.rolling().apply() <pandas.core.window.Rolling.apply>`, :func:`DataFrame.rolling().apply() <pandas.core.window.Rolling.apply>`,
:func:`Series.expanding().apply() <pandas.core.window.Expanding.apply>`, and :func:`DataFrame.expanding().apply() <pandas.core.window.Expanding.apply>` will show a ``FutureWarning`` if the new
``raw`` parameter is not explicity passed (:issue:`20584`)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 16, 2018

Member

I would rather say what has been deprecated: passing an array by default (and then indeed raw can be used to keep this behaviour or not)

achieve much better performance.
The raw parameter is required and will show a FutureWarning if
not passed. In the futures raw will default to False.

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 16, 2018

Member

futures -> future, raw -> `raw`

@@ -954,23 +954,53 @@ def count(self):
Parameters
----------
func : function
Must produce a single value from an ndarray input
\*args and \*\*kwargs are passed to the function""")
Must produce a single value from an ndarray input if raw=True

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 16, 2018

Member

quote raw=True as a code snippet (``raw=True``), same on line below for raw=False

# change to False in the future
if raw is None:
warnings.warn(
"raw defaults to False "

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 16, 2018

Member

Can you put single quotes around raw? (like "'raw' defaults to False ...") And two lines below as well. I think that will make it read better (clearer that it is a keyword)

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Apr 16, 2018

Member

Also, I think it is the other way around? The warning says "default is False, in future will change to True", but isn't it the other way around?

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 16, 2018

updated, I adjusted the .aggregate doc-string & put in an api-note. lmk if any further comments.

jreback and others added some commits Apr 16, 2018

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 16, 2018

bombs away

@jreback jreback merged commit 4a34497 into pandas-dev:master Apr 16, 2018

2 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 16, 2018

Thanks!

@H0bbitBaron

This comment has been minimized.

H0bbitBaron commented Aug 2, 2018

@jreback hello! I googled insanely, but now I'm even more confused. Is there a way to pass 2D with few columns instead of 1D series to apply(function)? There was a function rolling_apply, but it's depricated. Ended up writing my own function with cycle inside, which emulates rolling window with some cross-column calculations, but it's rather slow. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment