BUG: Rolling min_periods not working on groupby object #36040

Closed
3 tasks done
justinessert opened this issue Sep 1, 2020 · 5 comments · Fixed by #37035
Labels
Bug Groupby Window rolling, ewma, expanding
Comments

@justinessert
Contributor

justinessert commented Sep 1, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

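import pandas as pd

df = pd.DataFrame({
    'segment': 'A',
    'data': range(10)
})

# Plain rolling max: returns the expected values
df.rolling(5, center=True, min_periods=1).max()

# Same window applied through groupby: the last two values come back as NaN
df.groupby('segment').rolling(5, center=True, min_periods=1).max().reset_index(drop=True)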

Problem description

For the DataFrame above, with a single segment 'A', the result of df.rolling(5, center=True, min_periods=1).max() should be identical to that of df.groupby('segment').rolling(5, center=True, min_periods=1).max().reset_index(drop=True). Instead, the latter operation has NaNs in the last two positions of the data column.

Expected Output

Both operations should return the sequence [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0, 9.0] in the data column. Instead, df.groupby('segment').rolling(5, center=True, min_periods=1).max().reset_index(drop=True) returns [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, NaN, NaN].
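The comparison described above can be written as a self-checking snippet. This is only a sketch: it selects the data column explicitly (an assumption, so that the string column is never aggregated) and uses the expected sequence quoted in this report. On an affected version the second flag should come out False.

```python
import pandas as pd

df = pd.DataFrame({"segment": ["A"] * 10, "data": range(10)})

# Values the report expects from both calls.
expected = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0, 9.0]

# Select the numeric column explicitly so the string column is not aggregated.
plain = df["data"].rolling(5, center=True, min_periods=1).max()
grouped = (
    df.groupby("segment")["data"]
    .rolling(5, center=True, min_periods=1)
    .max()
    .reset_index(drop=True)
)

plain_ok = plain.tolist() == expected
# NaN never compares equal, so this is False when trailing NaNs appear.
grouped_ok = grouped.tolist() == expected
print(plain_ok, grouped_ok)
```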

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.7.7.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 20.1.1
setuptools : 47.3.0.post20200616
Cython : None
pytest : 6.0.0
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.0
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

justinessert added the Bug and Needs Triage labels Sep 1, 2020
@justinessert
Contributor Author

justinessert commented Sep 1, 2020

[Edited]

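Additionally, I have found that if there are two segments in the DataFrame, the rolling window does not respect the group boundaries: the windows at the end of the first segment pick up values from the second segment, and the NaNs appear only at the end of the last segment.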

df = pd.DataFrame({
    'segment': ['A']*10 + ['B']*10,
    'data': range(20)
})
df.groupby('segment').rolling(5, center=True, min_periods=1).max()

Here, the expected and actual results of df.groupby('segment').rolling(5, center=True, min_periods=1).max() are:
for segment 'A', the expected result is [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0, 9.0] but the actual result is [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
for segment 'B', the expected result is [12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 19.0, 19.0] but the actual result is [12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, NaN, NaN]
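One way to get the per-segment expected values without relying on the grouped rolling path is to roll each group on its own and stitch the pieces back together. A rough sketch follows; rolling_max_per_group is a hypothetical helper written for this illustration, not a pandas function, and the data column is selected explicitly.

```python
import pandas as pd

df = pd.DataFrame({"segment": ["A"] * 10 + ["B"] * 10, "data": range(20)})


def rolling_max_per_group(frame, key, col, window):
    """Centered rolling max computed on each group separately (hypothetical helper)."""
    parts = {
        name: group[col].rolling(window, center=True, min_periods=1).max()
        for name, group in frame.groupby(key)
    }
    # The dict keys become the outer index level, like groupby(...).rolling(...).
    return pd.concat(parts)


reference = rolling_max_per_group(df, "segment", "data", 5)
result = df.groupby("segment")["data"].rolling(5, center=True, min_periods=1).max()

ref_a = reference.loc["A"].tolist()
ref_b = reference.loc["B"].tolist()

# Positions where the grouped result deviates from the per-group reference.
# NaN != NaN is True, so stray NaNs are reported as mismatches as well.
mismatches = [
    i for i, (r, g) in enumerate(zip(reference.tolist(), result.tolist())) if r != g
]
print(mismatches)
```

Because each group is rolled in isolation, the window for the last rows of 'A' cannot reach into 'B'. An empty mismatches list means the grouped call agrees with the reference.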

@jreback
Contributor

jreback commented Sep 2, 2020

cc @mroeschke

TomAugspurger removed the Needs Triage label Sep 4, 2020
@wfvining

Seeing what I think is the same problem with the following example.

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
x.groupby(x % 2).rolling(window=3, min_periods=1, center=True).sum()

I expect to see

0  1     6.0
   3    12.0
   5    18.0
   7    14.0
1  0     4.0
   2     9.0
   4    15.0
   6    12.0
dtype: float64

But instead I get

0  1     6.0
   3    12.0
   5    18.0
   7     1.0
1  0     4.0
   2     9.0
   4    15.0
   6     NaN
dtype: float64

If center or min_periods are not specified then I get the expected behavior.
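A possible workaround for this example is to run the rolling sum inside transform, so that each window is built from one group's values only. This is a sketch: it assumes that the transform path does not go through the grouped rolling code exercised above, and it has not been checked against the affected release.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# transform calls the lambda once per group, so each rolling window only
# ever sees the values of a single group.
per_group = x.groupby(x % 2).transform(
    lambda s: s.rolling(window=3, min_periods=1, center=True).sum()
)
print(per_group.tolist())
```

Note that the result keeps the original index order (0 through 7) rather than the (group, index) MultiIndex shown above, so the values are the same as the expected ones but interleaved.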

@justinessert
Contributor Author

Issue resolved with PR 36567

@mroeschke
Member

We'll officially close this issue with your PR in #37035
