offset-based rolling window, multiple issues with closed='left' #26005

Closed
pshargo opened this issue Apr 5, 2019 · 1 comment

@pshargo

commented Apr 5, 2019

Code Sample

import pandas as pd

# Case 1: single row
df1 = pd.DataFrame({'B': [0]}, index=[pd.Timestamp('20130101 09:00:00')])
df1.rolling('1s', closed='left').median()  # raises 'MemoryError: skiplist_init failed'

# Case 2: multiple rows, with entries separated by more than the specified window
df2 = pd.DataFrame(
    {'B': [0, 1]},
    index=[pd.Timestamp('20130101 09:00:00'), pd.Timestamp('20130101 09:00:02')],
)
df2.rolling('1s', closed='left').median()  # raises 'MemoryError: skiplist_init failed'
df2.rolling('1s', closed='left').max()  # no error, but second entry seems incorrect

# Case 3: as long as at least one row has other entries in its window, it runs
# without an exception, but the values are suspect
df3 = pd.DataFrame(
    {'B': [1, 2, 3]},
    index=[
        pd.Timestamp('20130101 09:00:00'),
        pd.Timestamp('20130101 09:00:02'),
        pd.Timestamp('20130101 09:00:03'),
    ],
)
df3.rolling('1s', closed='left').median()  # no exception, but the values seem incorrect
df3.rolling('1s', closed='left').max()  # ditto
df3.rolling('2s', closed='left').median()  # ditto (note longer window)

Problem description

The exceptions are the most serious problem and should be fixed. The cases that do not raise return results that are inconsistent with other aggregations (such as mean and sum), which do appear to work correctly. Using closed='right' or closed='both' gives results consistent with my expectations, while closed='neither' shows problems similar to closed='left'. The common factor therefore seems to be whether each input row is excluded from its own rolling window.
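
To make the "common factor" concrete, here is a minimal standard-library sketch of window membership under each `closed` option. The helper `in_window` is hypothetical and written only for illustration. It follows the documented interval semantics for offset windows and is not how pandas implements them.

```python
from datetime import datetime, timedelta


def in_window(candidate, current, window, closed):
    """Return True if `candidate` lies in the offset window ending at `current`.

    Hypothetical helper: `closed` selects which endpoints of the interval
    from `current - window` to `current` are included.
    """
    start = current - window
    left_ok = candidate >= start if closed in ("left", "both") else candidate > start
    right_ok = candidate <= current if closed in ("right", "both") else candidate < current
    return left_ok and right_ok


t = datetime(2013, 1, 1, 9, 0, 0)
one_second = timedelta(seconds=1)

# Does a row fall inside its own window?
own_row_included = {
    closed: in_window(t, t, one_second, closed)
    for closed in ("right", "both", "left", "neither")
}
print(own_row_included)
```

Only 'right' and 'both' include the row itself, which matches the split between the options that behave as expected and those that do not.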

Expected Output

Case 1: since there are no other entries in the input row's window, I would expect the median aggregation to return NaN. (This would be consistent with mean, max, etc. for this case.)

                      B
2013-01-01 09:00:00 NaN

Case 2: since neither input row has any other entries in its window, I would expect the median and max results to all be NaN. (This would be consistent with what the mean aggregation returns for this case.)

                     B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  NaN

Case 3a and 3b (1s window): since neither of the first two input rows has any other entries in its window, I would expect their median and max results to be NaN. Since the last row does have an entry in its window (the second row), I would expect both its median and max to be 2.0. (This would be consistent with what the mean aggregation returns for this case.)

                     B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  NaN
2013-01-01 09:00:03  2.0

Case 3c (2s window): since the first row has no entries in its window, I would expect the first output row to be NaN. The second row has the first entry in its window, so I would expect its output to be 1.0. Similarly, the last row has the second entry in its window, so I would expect its output to be 2.0.

                     B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
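
The expected values above can be cross-checked without pandas. The sketch below is a brute-force reference, assuming closed='left' means the half-open interval [t - window, t) and that an empty window yields NaN. `expected_rolling` and `at` are hypothetical names used only for this illustration.

```python
import math
from datetime import datetime, timedelta
from statistics import median


def expected_rolling(times, values, window, agg):
    """Apply `agg` to the values whose timestamps fall in [t - window, t)."""
    out = []
    for t in times:
        members = [v for ts, v in zip(times, values) if t - window <= ts < t]
        # An empty window has nothing to aggregate, so report NaN.
        out.append(float(agg(members)) if members else math.nan)
    return out


def at(second):
    """Timestamp on 2013-01-01 at 09:00 plus the given number of seconds."""
    return datetime(2013, 1, 1, 9, 0, second)


one_second = timedelta(seconds=1)
two_seconds = timedelta(seconds=2)
times3 = [at(0), at(2), at(3)]

case1 = expected_rolling([at(0)], [0], one_second, median)
case2 = expected_rolling([at(0), at(2)], [0, 1], one_second, max)
case3a = expected_rolling(times3, [1, 2, 3], one_second, median)
case3b = expected_rolling(times3, [1, 2, 3], one_second, max)
case3c = expected_rolling(times3, [1, 2, 3], two_seconds, median)

for name, result in [("1", case1), ("2", case2), ("3a", case3a),
                     ("3b", case3b), ("3c", case3c)]:
    print(name, result)
```

The brute-force scan is quadratic in the number of rows, which is fine for checking a handful of rows but is not meant as a performance model.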

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 19.0.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.0
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd

Member

commented Apr 9, 2019

Thanks for the report. This might be an off-by-one error with closed='left' for these. Investigation and PRs would certainly be welcome.
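
As a possible starting point for that investigation, the sketch below computes the positional (start, end) bounds each row's window would have under closed='left', for the Case 3 timestamps from the report. It is an illustrative helper built on the standard library's `bisect`, not pandas' internal indexer, and `left_closed_bounds` is a hypothetical name. It assumes the timestamps are sorted.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def left_closed_bounds(times, window):
    """Return (start, end) index pairs for windows [t - window, t).

    `times` must be sorted. Row i aggregates times[start:end], so
    start == end means the window is empty.
    """
    return [
        (bisect_left(times, t - window), bisect_left(times, t))
        for t in times
    ]


times = [datetime(2013, 1, 1, 9, 0, s) for s in (0, 2, 3)]

bounds_1s = left_closed_bounds(times, timedelta(seconds=1))
bounds_2s = left_closed_bounds(times, timedelta(seconds=2))

print(bounds_1s)  # [(0, 0), (1, 1), (1, 2)]
print(bounds_2s)  # [(0, 0), (0, 1), (1, 2)]
```

With closed='left' the first row's window is always empty (start == end), so any aggregation has to cope with zero-length windows. Whether that is what triggers the reported errors is an assumption that would need checking against the actual implementation.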
