BUG: rolling window functions don't support custom indexers #32865

AlexKirko · 2020-03-20T13:45:34Z

Code Sample, a copy-pastable example if possible

class ForwardIndexer(BaseIndexer):
    
    def get_window_bounds(self, num_values, min_periods, center, closed):
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        for i in range(num_values):
            if i + min_periods <= num_values:
                start[i] = i
                end[i] = min(i + self.window_size, num_values)
            else:
                start[i] = i
                end[i] = i + 1
        return start, end

x = pd.DataFrame({"a": [1,2,3,4,5,6,7,8,9]})

rolling = x["a"].rolling(ForwardIndexer(window_size=3), min_periods=2)

result = rolling.min()
result

OUT:
0    0.0
1    0.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    7.0
8    NaN
Name: a, dtype: float64

IN:
expected = rolling.apply(lambda x: min(x))
expected

OUT:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
Name: a, dtype: float64

Problem description

We state here that we support supplying a custom Indexer when building a pandas.DataFrame.rolling object. While the object does get built, and it returns the correct windows, it doesn't support many rolling window functions. The problem is that our implementations of these aggregation functions expect a standard backward-looking window and we support centered windows via a bit of a crutch.

For example, rolling.min eventually falls through to _roll_min_max_variable in aggregations.pyx, which uses this bit of code to record the output:

        for i in range(endi[0], endi[N-1]):
            if not Q.empty() and curr_win_size > 0:
                output[i-1+close_offset] = calc_mm(
                    minp, nobs, values[Q.front()])
            else:
                output[i-1+close_offset] = NaN

This indexing of output means that the window minimum gets written near the end of the window, even if the window is forward-looking. I've investigated a bit, and there is a similar issue in rolling.std - it also isn't adapted to more flexible rolling windows.

While it's not possible to make rolling window aggregation functions completely universal without loss of efficiency, it's possible to adapt them to most useful cases: forward-looking, smoothly contracting and expanding. We'd still have to think on how we would check that we support a custom Indexer, and whether we would check at all. It might be possible to just specify the supported kinds in the docs and throw a warning or do something similar.

If we choose this path, I'd be happy to deal with the problem over a series of PRs or share the load with someone. Looks like a fair bit of work, but the pandemic freed up a lot of time.

Expected Output

OUT:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
Name: a, dtype: float64

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : d308712
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None

pandas : 0.26.0.dev0+2635.gd308712c8
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.47.0

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2020-03-20T14:43:12Z

cc @mroeschke.

mroeschke · 2020-03-22T06:46:06Z

Nice investigation @AlexKirko. The BaseIndexers are definitely designed more for apply (i.e. "treat each window as independent and apply an agg func on each window"), as the implementation of the other aggregation algorithms may reference another window.

Do you mind investigating which other rolling aggregations return different results?

AlexKirko · 2020-03-23T10:25:12Z

Hello, @mroeschke. No problem, see below.

Broken:

min
max (same implementation as min)
std
var
count
skew (indexing looks fine, but breaks with window = 3, min_periods = 2, while the vanilla window doesn't throw an error)
cov
corr

Working correctly:

mean
sum
median
kurt
quantile

Yeah, supporting arbitrary BaseIndexers would require the algorithms to go O(N²), which, I think, is unacceptable (we'd be running everything with apply speed). However, we can adjust the algorithms (without complexity increase) to support the most important use cases, like forward-looking windows (and windows with constant offsets) and smoothly expanding or contracting windows. I've taken a look at the algorithms and the implementations we currently use, and the only difficulty I see is the amount of work it would require.

A lazier approach would be to throw a warning when creating a rolling window object based on a custom indexer, saying that only apply is supported.

This entire thing came up when I and a colleague were working on a problem on Kaggle, and we needed forward-looking windows to incorporate information about the future into our estimates. From that very limited user base I can say that at least some people expect this stuff to just work without a noticeable drop in speed, because it's unclear why it shouldn't.

mroeschke · 2020-03-25T21:17:38Z

Thanks for looking into this @AlexKirko!

For now I'll put in a PR to raise a NotImplementedError for the ops that don't work as expected. Better to tell the user the results are incorrect rather than silently give incorrect results.

For the next step, if there's a way to modify the current algorithms with minimal performance impacts that'd be fantastic! It's not immediately apparently to me what those modifications would be, and I won't have a lot of time to revisit it in the short term. If there's a general fix for each algorithm I might be able to help out where I can.

WhistleWhileYouWork · 2020-03-29T03:45:07Z

Hi, I came here to report this issue as well. I wanted to chime in on a couple things.

First thanks for pandas and this very useful new feature!

I think that at the very least there should be support of window functions for forward-looking windows. Forward-looking windows is an oft-requested feature (here in pandas issues) and rolling Indexer support was a huge step in the right direction.

Without that support, the use of apply is required (as stated above), but the performance hit is orders of magnitude too large. My df using apply(max) takes 6:46 minutes while builtin max() takes a mere 5 seconds. To work around the performance hit, I have to use apply with numba to get it to just over 7 seconds.

Also, I don't believe agg can be used with numba since numba requires the numpy backing array. So, we also gain the simplicity of agg if window functions are supported.

For now, might I suggest a mention and example in the guide docs at Custom window rolling to use apply as a workaround and optionally numba for performance.

Here is a complete numba example for max:

import numba
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

@numba.jit
def numba_max(x):
    return max(x)

class ForwardIndexer(BaseIndexer):
    ''' Custom `Indexer` for use in forward-looking `rolling` windows '''
    def get_window_bounds(self, num_values, min_periods, center, closed):
        ''' Set up forward looking windows '''
        start = np.arange(num_values, dtype=np.int64)
        end = start.copy() + self.window_size
        #---- Clip to `num_values`
        end[end > num_values] = num_values
        return start, end

df = pd.DataFrame({"a": [1,2,3,4,5,6,7,8,9]})
df_max = df.rolling(window=ForwardIndexer(window_size=3), min_periods=1).apply(numba_max, raw=True)
df_max

Thanks again.

AlexKirko · 2020-03-30T12:56:32Z

@mroeschke , thank you for implementing the error-raising behavior!
I'll work on the functions one by one, starting tomorrow (one PR per function), and we'll see how it goes.
@WhistleWhileYouWork , thanks for taking interest in this! Another possible workaround to get efficient forward-looking computation right now is to pad the Series appropriately, shift, use normal backward-looking windows, then shift the results back. Hopefully, fixing the problem proves to be within my capabilities, and the question of workaround efficiency becomes moot.

AlexKirko · 2020-03-31T13:36:53Z

take

AlexKirko · 2020-04-04T21:40:54Z

I'll also take a closer look at the median. It's acting funny during benchmarks.
Was fine during smaller scale tests, but now I see a bunch of NaNs with a 10 ** 5 array of random numbers. Will check.

AlexKirko · 2020-04-09T09:52:50Z

@mroeschke
So I came across something very weird while working on fixing std and var. When calculating variance in a window, we use the unbiased sample variance formula (sum(squared_diff) / N - 1) instead of the population variance formula (sum(squared_diff) / N ) :

IN:
values = np.arange(10)
values[5] = 100.0
rolling2 = pd.Series(values).rolling(window=3, min_periods=2)
rolling2.var()
# This is sample variance
OUT:
0            NaN
1       0.500000
2       1.000000
3       1.000000
4       1.000000
5    3104.333333
6    3009.333333
7    2914.333333
8       1.000000
9       1.000000
dtype: float64

IN:
rolling2.apply(lambda x: np.var(x))
# This is population variance
OUT:
0            NaN
1       0.250000
2       0.666667
3       0.666667
4       0.666667
5    2069.555556
6    2006.222222
7    1942.888889
8       0.666667
9       0.666667
dtype: float64

As far as I know, we should use sample variance when we draw a limited number of data observations from the total population. In that case, when we average across many draws, the expectation of the average will match the ground truth population variance.

When we slide a window across a data series, the data in a particular window isn't drawn from anything, it's all there is. For our purposes it is the population, and we should divide the sum of squared differences by window_size, not by window_size - 1 as we do now.

Here is the formula from our current code:

result = ssqdm_x / (nobs - <float64_t>ddof)

Where ddof is the number of degrees of freedom and equals 1.

Am I missing something statitstics-wise, or should I open a new issue and solve it with a PR there (we do this stuff both for variable and fixed windows)?

P.S. There are no other problems with std and var. The output is already properly anchored to the input. If we divide by N, forward-looking variance will match apply output without changes to performance.

mroeschke · 2020-04-09T18:19:19Z

@AlexKirko across pandas I think sample variance is what we calculate in general for std and var (we intentionally diverge from numpy) so I think this behavior is expected

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html

AlexKirko · 2020-04-09T20:18:20Z

@mroeschke Thanks!

I agree that sample variance makes total sense when we calculate variance for an entire DataFrame or Series, because it's probable that the data we have is a sample from some larger general population, and normalizing by N - 1 makes tbe estimate unbiased.

However, when we slide a window across a series, this does not make sense to me. As far as I can see, the sample variance argument is illogical here, because there is no drawing from a larger population. We have a window, let's say ten elements, and we are interested in variance for these ten elements. In my opinion, they are the population, and the variance should be estimated via the population formula. When the window moves, we are interested in the variance for different ten elements, and we make no assumptions about them being from some population of ten element windows with stable statistical properties or whatever (we use a rolling window precisely because we want to calculate a different variance for each window).

Is my argument making sense?

AlexKirko · 2020-04-09T20:23:21Z

I must add that in practice we rarely use small windows, plus data science methods that we apply to the estimation results can usually adapt to scaling, so this is more about correct statistical methodology than practical impact on the user experience.

If my argument is wrong, then I'll just add std and var to the whitelist.

mroeschke · 2020-04-09T21:00:18Z

Your conclusion from a data scientist practitioner point of view makes sense, but I think the user will need to recognize that ddof may need to be adjusted (can't make assumptions of the users dataset or why they're calculating the statistic) in this situation.

From a software perspective, it's a nice to have consistency across the library how std and var are calculated, and having exposed the ddof parameter is adequate enough.

AlexKirko · 2020-04-09T21:43:20Z

@mroeschke
Thanks for the clarification. I'll just whitelist these two functions in a PR tomorrow then: they are working as intended.

AlexKirko · 2020-04-18T05:58:07Z

median will need fixing too. It pruduces NaNs if min_periods isn't supplied. Since I missed it in the first pass, I'll tackle it today.

mroeschke mentioned this issue Mar 27, 2020

ERR: Raise NotImplementedError with BaseIndexer and certain rolling operations #33057

Merged

5 tasks

github-actions bot assigned AlexKirko Mar 31, 2020

This was referenced Mar 31, 2020

BUG: support min/max functions for rolling windows with custom BaseIndexer #33180

Merged

ENH: provide standard BaseIndexers in pandas.api.indexers #33201

Closed

AlexKirko mentioned this issue Apr 10, 2020

Whitelist std and var for use with custom rolling windows #33448

Merged

5 tasks

AlexKirko mentioned this issue Apr 17, 2020

BUG: support count function for custom BaseIndexer rolling windows #33605

Merged

5 tasks

This was referenced Apr 18, 2020

BUG: support median function for custom BaseIndexer rolling windows #33626

Merged

BUG: support skew function for custom BaseIndexer rolling windows #33745

Merged

AlexKirko mentioned this issue Apr 26, 2020

BUG: support corr and cov functions for custom BaseIndexer rolling windows #33804

Merged

6 tasks

jreback added the Window rolling, ewma, expanding label Apr 26, 2020

jreback added this to the 1.1 milestone Apr 26, 2020

jreback closed this as completed in #33804 Apr 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: rolling window functions don't support custom indexers #32865

BUG: rolling window functions don't support custom indexers #32865

AlexKirko commented Mar 20, 2020

INSTALLED VERSIONS

TomAugspurger commented Mar 20, 2020

mroeschke commented Mar 22, 2020

AlexKirko commented Mar 23, 2020 •

edited

Loading

mroeschke commented Mar 25, 2020

WhistleWhileYouWork commented Mar 29, 2020

AlexKirko commented Mar 30, 2020

AlexKirko commented Mar 31, 2020

AlexKirko commented Apr 4, 2020

AlexKirko commented Apr 9, 2020 •

edited

Loading

mroeschke commented Apr 9, 2020

AlexKirko commented Apr 9, 2020 •

edited

Loading

AlexKirko commented Apr 9, 2020 •

edited

Loading

mroeschke commented Apr 9, 2020

AlexKirko commented Apr 9, 2020

AlexKirko commented Apr 18, 2020

BUG: rolling window functions don't support custom indexers #32865

BUG: rolling window functions don't support custom indexers #32865

Comments

AlexKirko commented Mar 20, 2020

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Mar 20, 2020

mroeschke commented Mar 22, 2020

AlexKirko commented Mar 23, 2020 • edited Loading

mroeschke commented Mar 25, 2020

WhistleWhileYouWork commented Mar 29, 2020

AlexKirko commented Mar 30, 2020

AlexKirko commented Mar 31, 2020

AlexKirko commented Apr 4, 2020

AlexKirko commented Apr 9, 2020 • edited Loading

mroeschke commented Apr 9, 2020

AlexKirko commented Apr 9, 2020 • edited Loading

AlexKirko commented Apr 9, 2020 • edited Loading

mroeschke commented Apr 9, 2020

AlexKirko commented Apr 9, 2020

AlexKirko commented Apr 18, 2020

Output of `pd.show_versions()`

AlexKirko commented Mar 23, 2020 •

edited

Loading

AlexKirko commented Apr 9, 2020 •

edited

Loading

AlexKirko commented Apr 9, 2020 •

edited

Loading

AlexKirko commented Apr 9, 2020 •

edited

Loading