Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: pct_change method/limit keyword #53491

Closed
jbrockmendel opened this issue Jun 1, 2023 · 29 comments · Fixed by #53520
Closed

DEPR: pct_change method/limit keyword #53491

jbrockmendel opened this issue Jun 1, 2023 · 29 comments · Fixed by #53520
Assignees
Labels
Blocker Blocking issue or pull request for an upcoming release Deprecate Functionality to remove in pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Transformations e.g. cumsum, diff, rank
Milestone

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Jun 1, 2023

We should deprecate these and have users do obj.fillna(...).pct_change() instead.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 1, 2023
@mroeschke mroeschke added Deprecate Functionality to remove in pandas and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 1, 2023
@rhshadrach
Copy link
Member

+1

@rhshadrach
Copy link
Member

The default for method is pad. I would think we need to change the default to None before we can deprecate this outright.

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Transformations e.g. cumsum, diff, rank labels Jun 4, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jun 8, 2023
@aalyousfi
Copy link

For 2.1.0, how to achieve the behavior of pct_change(fill_method=None) from the previous versions?

About this new change, the What's New page says to explicitly call ffill or bfill before calling pct_change. But my question is when I don't want to fill in any value (i.e. fill_method=None in previous versions).

@jbrockmendel @rhshadrach

@rhshadrach rhshadrach reopened this Sep 20, 2023
@rhshadrach
Copy link
Member

Thanks @aalyousfi; agreed we should allow the operation without warning. I propose we do not warn when fill_method=None is specified but otherwise continue with the deprecation. In 3.0, we will only allow None (raise on anything else) and then deprecate the fill_method keyword entirely, finally removing it in 4.0.

cc @jbrockmendel

@jbrockmendel
Copy link
Member Author

is there an option that doesn't require a 2-step deprecation cycle? thats kind of heavy for fairly small method

@rhshadrach
Copy link
Member

I don't think so. I'm curious why you describe this as heavy. I would only think of it as being heavy if it's holding up other improvements - is that the case here? Agreed it can be painfully slow, but if it's not holding up other work then I think we should put user experience at the forefront.

@jbrockmendel
Copy link
Member Author

I'm curious why you describe this as heavy

Poor choice of words. I just don't like the 2-step deprecation pattern and would prefer to avoid it if at all possible.

@Charlie-XIAO
Copy link
Contributor

Charlie-XIAO commented Sep 22, 2023

Since I worked on deprecating this keyword, should I now make another PR implementing the first stage of the 2-step deprecation?

@hirolau
Copy link

hirolau commented Oct 2, 2023

The warning message of this is really hard to get rid of since the series warns if any value is null and I have logic where I am expecting some series to be only NaNs, or where values at the very start of series are NaN.

@rhshadrach
Copy link
Member

@Charlie-XIAO - I'm +1 on doing so.

@Charlie-XIAO
Copy link
Contributor

@rhshadrach @jbrockmendel I'm thinking that obj.pct_change(fill_method=..., limit=...) should be replaced by obj.ffill/bfill(limit=...).pct_change(fill_method=None), is it? The things is, pct_change() will warn and will perform fill_method="pad", so if one uses obj.ffill().pct_change() he will get a deprecation warning (resolved in #54981), and if one uses obj.bfill().pct_change() he will get a deprecation warning and even incorrect result. Also @phofl who resolved #54981.

Also, I think we will need extra specific warning messages since this deprecation seems to complicated. I will open a PR for this once the above is confirmed.

@Charlie-XIAO
Copy link
Contributor

take

@yellow1912
Copy link

i have just upgraded and now I get tons of warnings. Even when I'm sure my data is already ffilled I still get the warnings. How do I turn off these warnings?

@rhshadrach
Copy link
Member

@yellow1912 - https://docs.python.org/3/library/warnings.html. If you believe pandas is not behaving correctly, can you open a new issue.

@Charlie-XIAO
Copy link
Contributor

@rhshadrach I believe what @yellow1912 said was exactly related to this issue. We are raising warnings for pct_change with default parameters, so that happens even after ffill/bfill. Please see #53491 (comment) or #55527 for details.

@rhshadrach
Copy link
Member

rhshadrach commented Oct 15, 2023

We are raising warnings for pct_change with default parameters, so that happens even after ffill/bfill.

Can you post a reproducible example?

This issue is about the deprecation itself. If there are incorrect behaviors in the implementation of that deprecation they should go in a separate issue.

@Charlie-XIAO
Copy link
Contributor

Charlie-XIAO commented Oct 15, 2023

@rhshadrach This is a reproducible example (main branch).

>>> ser = pd.Series([np.nan, 1, 2, 3, np.nan])
>>> ser
0    NaN
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64
>>> ser.ffill().pct_change()
0    NaN
1    NaN
2    1.0
3    0.5
4    0.0
dtype: float64
>>> ser.bfill().pct_change()
<stdin>:1: FutureWarning: The default fill_method='pad' in Series.pct_change is deprecated and will be removed in a future version. Call ffill before calling pct_change to retain current behavior and silence this warning.
0    NaN
1    0.0
2    1.0
3    0.5
4    0.0
dtype: float64

ffill did not raise because it was fixed in #54981 by checking for NA values. However, I think this is adding too much complexity. Since now we consider not deprecating fill_method=None, I propose that we should prompt in the deprecation message to let users use obj.ffill/bfill().pct_change(fill_method=None) but not obj.ffill/bfill().pct_change().

@CaioLC
Copy link

CaioLC commented Dec 21, 2023

quick comment, the warning message is confusing when using pct_change with a GroupBySeries, as the fill() method returns a regular series and the pct_change is calculated incorrectly afterwards:

df.groupby('columnA')['values'].pct_change(fill_method='pad') != df.groupby('columnA')['values'].ffill().pct_change()

an example:

import numpy as np
import pandas as pd
ex = pd.DataFrame({
    'a': range(1,11),
    'b': range(1,11),
    'c': range(1,11),
})
ex.iloc[5] = np.NAN
ex = ex.melt()
ex['pct'] = ex.groupby('variable')['value'].pct_change()
ex['pct2'] = ex.groupby('variable')['value'].ffill().pct_change()
assert (ex.pct == ex.pct2).all()

perhaps we can retain the original behaviour for groupby objects?

@Charlie-XIAO
Copy link
Contributor

A possible workaround might be:

import numpy as np
import pandas as pd
from pandas.testing import assert_series_equal

df = pd.DataFrame(
    {
        "a": range(1, 11),
        "b": range(1, 11),
        "c": range(1, 11),
    }
)
df.iloc[5] = np.nan
df = df.melt()

df["pct"] = df.groupby("variable")["value"].pct_change()
df["pct2"] = df.groupby("variable")["value"].transform(lambda x: x.ffill().pct_change())
assert_series_equal(df.pct, df.pct2, check_names=False)   # This should pass

Maybe there are better workarounds but at least the one above doesn't seem very clean... @jbrockmendel @rhshadrach What do maintainers think here? Should the deprecation on groupby objects be reverted, or should we just update the warning message?

BTW @CaioLC I think you cannot use assert (ex.pct == ex.pct2).all() because there are nan values which will make this always False.

@CaioLC
Copy link

CaioLC commented Dec 22, 2023

BTW @CaioLC I think you cannot use assert (ex.pct == ex.pct2).all() because there are nan values which will make this always False.

That's true, thank you for pointing that out :)

Lambda functions is a valid workaround. My only concern is that the warning message states that simply adding ffill() before pct_change() will suffice and that's not entirely true. And a wrong pct_change is hard to debug, because it will compute without faillure most of the time

@phofl
Copy link
Member

phofl commented Dec 22, 2023

Yeah we shouldn't require lambdas for this use case,

@rhshadrach
Copy link
Member

ex = pd.DataFrame({
    'a': range(1,11),
    'b': range(1,11),
    'c': range(1,11),
})
ex.iloc[5] = np.NAN
ex = ex.melt()
ex['pct'] = ex.groupby('variable')['value'].pct_change()
ex['pct2'] = ex.groupby('variable')['value'].ffill().groupby(ex['variable']).pct_change()
pd.testing.assert_series_equal(ex['pct'], ex['pct2'], check_names=False)

@phofl
Copy link
Member

phofl commented Dec 22, 2023

Requires this calculating the grouper twice?

@CaioLC
Copy link

CaioLC commented Dec 22, 2023

ex['pct2'] = ex.groupby('variable')['value'].ffill().groupby(ex['variable']).pct_change() 

Having to call groupby twice for a single operation, and for a fairly common func, seems unintuitive and a worse experience overall.

The ideal would be GroupBySeries.fill() to return a GroupBySeries object instead of a regular Series, then concatenation should be straight forward. But I understand this could be a major break and possibly not that trivial to implement...

@rhshadrach
Copy link
Member

Requires this calculating the grouper twice?

Because you can modify the DataFrame in place, you can actually get away with using the same grouper. As long as you're just adding columns, this is safe - but other modifications (e.g. removing rows) may not be.

ex = pd.DataFrame({
    'a': range(1,11),
    'b': range(1,11),
    'c': range(1,11),
})
ex.iloc[5] = np.NAN
ex = ex.melt()
gb = ex.groupby('variable')
ex['pct'] = gb['value'].pct_change()
ex['value_filled'] = gb['value'].ffill()
ex['pct2'] = gb['value_filled'].pct_change()
pd.testing.assert_series_equal(ex['pct'], ex['pct2'], check_names=False)

Having to call groupby twice for a single operation, and for a fairly common func, seems unintuitive and a worse experience overall.

I would not describe this as a single operation.

The ideal would be GroupBySeries.fill() to return a GroupBySeries object instead of a regular Series

And then for use cases that just want to fill?

@lithomas1 lithomas1 modified the milestones: 2.1, 2.2 Dec 27, 2023
@lithomas1 lithomas1 added the Blocker Blocking issue or pull request for an upcoming release label Dec 27, 2023
@phofl
Copy link
Member

phofl commented Dec 28, 2023

Yeah I think the your solution seems fine to me, so the deprecation is fine as well

@CaioLC
Copy link

CaioLC commented Dec 28, 2023

Maybe just a small adjustment to the warning message then? so that users don't try to inadvertently chain function calls coming from a groupby object

@lithomas1
Copy link
Member

@rhshadrach

Anything actionable here?

@rhshadrach
Copy link
Member

@CaioLC - the warning message seems okay to me, it merely tells users to fill values rather than telling users an incorrect way of filling the values. But if you have an alternative suggestion to the message, we can certainly consider it. If you'd like, add it here and we can reopen.

Closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Deprecate Functionality to remove in pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Transformations e.g. cumsum, diff, rank
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants