
REF: Separate window bounds calculation from aggregation functions #29428

Merged (42 commits) on Nov 21, 2019

Conversation

@mroeschke (Member) commented Nov 6, 2019

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Pre-req for #28987

Currently many of the aggregation functions in window.pyx follow the form:

def roll_func(values, window, minp, N, closed):
    # calculate window bounds _and_ validate arguments
    start, end, ... = get_window_bounds(values, window, minp, N, ...)
    for i in range(len(values)):
        s = start[i]
        ....

This PR refactors the window bound calculation and validation out into window_indexer.pyx, so the aggregation functions can take the form:

def roll_func(values, start, end, minp):
    for i in range(len(values)):
        s = start[i]
        ....

The methods in rolling.py therefore now follow this pattern:

  1. Fetch the correct cython aggregation function (whether the window is fixed or variable), and prep it with kwargs if needed
  2. Compute the start and end window bounds from functionality in window_indexer.pyx
  3. Pass the values, start, end, and min periods into the aggregation function.
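As an illustration, the separation described above can be sketched in pure Python (hypothetical names and list-based code; the real implementation is Cython in window_indexer.pyx and window.pyx):

```python
# Step 2: compute start/end bounds independently of any aggregation.
def get_fixed_window_bounds(num_values, window):
    end = [min(i + 1, num_values) for i in range(num_values)]
    start = [max(0, e - window) for e in end]
    return start, end

# Step 3: the aggregation only consumes precomputed bounds.
def roll_sum(values, start, end, minp):
    out = []
    for s, e in zip(start, end):
        if (e - s) >= minp:
            out.append(sum(values[s:e]))
        else:
            out.append(float("nan"))
    return out

start, end = get_fixed_window_bounds(5, window=3)
result = roll_sum([1, 2, 3, 4, 5], start, end, minp=1)
# result: [1, 3, 6, 9, 12]
```

Because the bounds are plain arrays, the same `roll_sum` works unchanged for fixed and variable windows; only the bounds function differs.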

@mroeschke added the Refactor (Internal refactoring of code) and Window (rolling, ewma, expanding) labels on Nov 6, 2019

return self._apply(f, func, args=args, kwargs=kwargs, center=False, raw=raw)
# Why do we always pass center=False?
Member: TODO?

return start, end

def get_window_bounds(self):
return self.start, self.end
Member: newline


# TODO: Maybe will need to use this?
# max window size
#self.win = (self.end - self.start).max()
Member: remove?

# max window size
#self.win = (self.end - self.start).max()

def build(self, const int64_t[:] index, int64_t win, bint left_closed,
Member: looks like the function doesn't use self?

"""
def __init__(self, ndarray values, int64_t win, object closed, object index=None):
cdef:
ndarray start_s, start_e, end_s, end_e
Member: ndarray[int64_t, ndim=1]?

@jbrockmendel (Member): are there more window/rolling-centric .pyx files on the horizon? If so, would it make sense to make a _libs/window/ directory?

@mroeschke (Member Author): @jbrockmendel Should just be window.pyx and window_indexer.pyx for now, but I think those two files are enough to split into their own directory as you suggested. Will tackle that reorg step once I get all the tests passing.

@jbrockmendel (Member): sounds good. I think skiplist may belong in there too.

If the intra-pandas dependencies of _libs/window/ can be tightly locked down (e.g. "only _libs.util"), that'd be great.

@mroeschke (Member Author): @jreback @jbrockmendel tests are passing locally now. Since this PR is already bulky, the follow-up PR will:

  1. Create a new pandas/_libs/window directory with indexers.pyx and aggregations.pyx
  2. Add BaseIndexer base class and make it passable as a window.

@mroeschke changed the title from "WIP: REF: Separate window bounds calculation from aggregation functions" to "REF: Separate window bounds calculation from aggregation functions" on Nov 20, 2019
@jreback (Contributor) left a comment: lgtm. followups noted. ping on green.

@TomAugspurger @jorisvandenbossche @jbrockmendel if any comments

end_e = start_e + win
self.end = np.concatenate([end_s, end_e])

def get_window_bounds(self):
Contributor: hmm interesting, though I think it can still be typed. Anyhow, one to try in a followup.

@TomAugspurger (Contributor): Happy to defer to others here. Things seem nice based on a quick skim.

@@ -442,80 +182,75 @@ cdef inline void remove_sum(float64_t val, int64_t *nobs, float64_t *sum_x) nogi
sum_x[0] = sum_x[0] - val


def roll_sum(ndarray[float64_t] values, int64_t win, int64_t minp,
object index, object closed):
def roll_sum_variable(ndarray[float64_t] values, ndarray[int64_t] start,
Member: does ndarray[type_t] vs type_t[:] make a difference here?

Member Author: Not entirely noticeable?

# np buffer
In [1]: N = 1_000_000
   ...: s = pd.Series(range(N), index=pd.date_range('2019', periods=N, freq='s'))
   ...: roll = s.rolling('1H')
   ...: %timeit roll.sum()
28.8 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# memoryview
In [1]: N = 1_000_000
   ...: s = pd.Series(range(N), index=pd.date_range('2019', periods=N, freq='s'))
   ...: roll = s.rolling('1H')
   ...: %timeit roll.sum()
28.8 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# fixed window
output = np.empty(N, dtype=float)
Member: does float vs np.float64 matter?

Member Author: Not sure in a cython context, but:

In [2]: %timeit np.empty(N, dtype=float)
127 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit np.empty(N, dtype=np.float64)
127 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: N
Out[4]: 1000000
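A quick (non-Cython) check of the point being made here: as a NumPy dtype argument, Python's builtin float is interpreted as float64, so the two spellings produce arrays with identical dtypes.

```python
import numpy as np

# dtype=float and dtype=np.float64 resolve to the same NumPy dtype.
a = np.empty(10, dtype=float)
b = np.empty(10, dtype=np.float64)
assert a.dtype == b.dtype == np.float64
```

Whether this equivalence also holds for the C-level typing inside a Cython module is the part the reviewer is unsure about; at the Python/NumPy level there is no difference.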

with nogil:
for i in range(minp - 1):
val = values[i]
add_skew(val, &nobs, &x, &xx, &xxx)
Member: can x, xx, and xxx have more informative names? (not a blocker, as it's the same as the status quo)

Member Author: Yeah, copied verbatim, but can address in a followup.


Parameters
----------
values: numpy array
Member: space after colon; in these files we usually specify the dtype as if it were a cython annotation, so np.ndarray[np.float64]

minp, index,
closed,
floor=0)
counts = roll_sum_fixed(np.concatenate([np.isfinite(arr).astype(float),
Member: if it is perf-relevant, I expect there is a cnp version of isfinite.

Same question as before about float vs float64.

Member Author: Was getting TypeError: only size-1 arrays can be converted to Python scalars from the test suite when trying to use cnp.math.isfinite here. I can try to get it working later, but I am fairly confident that it won't be a performance bottleneck here.
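The counting trick in the diff context above can be sketched in pure Python (illustrative names, not the Cython code): rolling-summing a 0/1 finiteness mask yields the per-window count of usable observations.

```python
import math

def rolling_finite_count(values, window):
    # 1.0 for finite values, 0.0 for nan/inf; a rolling sum of this
    # mask counts the finite observations in each window.
    mask = [1.0 if math.isfinite(v) else 0.0 for v in values]
    out = []
    for i in range(len(mask)):
        s = max(0, i + 1 - window)
        out.append(sum(mask[s:i + 1]))
    return out

rolling_finite_count([1.0, float("nan"), 3.0, 4.0], window=2)
# → [1.0, 1.0, 1.0, 2.0]
```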

if n == 0:
return obj

arr = np.asarray(obj)
Member: can we be confident we won't get here with e.g. datetime64tz?

Member Author: Each block of data is first attempted to be cast to float:

            try:
                values = self._prep_values(b.values)

            except (TypeError, NotImplementedError):
                if isinstance(obj, ABCDataFrame):
                    exclude.extend(b.columns)
                    del block_list[i]
                    continue
                else:
                    raise DataError("No numeric types to aggregate")

Therefore np.asarray(obj) should always be valid here. (Also copied verbatim from the refactor)

@@ -1414,8 +1454,14 @@ def skew(self, **kwargs):
)

def kurt(self, **kwargs):
window_func = self._get_cython_func_type("roll_kurt")
kwargs.pop("require_min_periods", None)
Member: what happens here if the user passes a weird value for require_min_periods?

Member Author: require_min_periods is effectively an internal variable and shouldn't be expected from an external API. I need to pop it here because of kwargs passed from other super calls.


import numpy as np
from numpy cimport ndarray, int64_t

Member: is there anything in this module that we can/should test independently of the rest of the implementation?

@mroeschke (Member Author) commented Nov 21, 2019: In a 2nd follow-up PR, I am planning on allowing users to create their own "window indexers" to be passed into rolling(...). In that PR I can add tests for these existing indexers. They have been effectively smoke tested since they get hit with every rolling test.
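A hypothetical sketch of that "custom window indexer" idea: an object whose only job is to produce start/end bounds, which the rolling aggregations then consume. Names here are illustrative, not the eventual pandas API.

```python
class ExpandingIndexer:
    """Bounds-only object: every window starts at 0 and grows by one."""

    def __init__(self, num_values):
        self.start = [0] * num_values
        self.end = list(range(1, num_values + 1))

    def get_window_bounds(self):
        # Mirrors the get_window_bounds pattern in the diff above.
        return self.start, self.end

start, end = ExpandingIndexer(4).get_window_bounds()
# start == [0, 0, 0, 0], end == [1, 2, 3, 4]
```

Any aggregation written against (values, start, end, minp) would work with these bounds without knowing how they were produced.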

@jreback (Contributor) left a comment: minor follow up comments, thanks @mroeschke

output[:] = NaN
return output

win = (end - start).max()
Contributor: can create win above (followup ok)

end_e = start_e + win
self.end = np.concatenate([end_s, end_e])

def get_window_bounds(self):
Contributor: it's optimizing readability :->

@@ -96,280 +96,20 @@ def _check_minp(win, minp, N, floor=None) -> int:
# Physical description: 366 p.
# Series: Prentice-Hall Series in Automatic Computation

# ----------------------------------------------------------------------
Contributor: I don't think _check_minp is used above? It can likely be moved to indexer.pyx anyhow (next pass).

Contributor: also, the references above are misplaced; not sure where they should go.

kwargs=kwargs,
raw=raw,
offset=offset,
func=func,
Member: mypy error: "partial" gets multiple values for keyword argument "func"

Member Author: What version of mypy raises this? I don't get this with 0.740:

(pandas-dev) matthewroeschke:pandas-mroeschke matthewroeschke$ mypy pandas
pandas/core/indexes/frozen.py:112: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as overloaded function)
pandas/core/indexes/frozen.py:112: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Union[int, slice]], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], int], Any]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Any], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Iterable[Any]], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], DefaultNamedArg(Optional[Callable[[Any], Any]], 'key'), DefaultNamedArg(bool, 'reverse')], None]")
Found 6 errors in 1 file (checked 807 source files)

Member: I'm getting that error on 0.740 with --check-untyped-defs (on #28339).

The problem is that the required argument for partial is named func, so I assume you can't also pass func as a keyword argument.

functools.partial(func, /, *args, **keywords)

EDIT: 0.730 -> 0.740

Member: Looking into this further, I think this is a false positive from mypy.

The __new__ of class partial seems to be able to handle this use case; testing with a minimal example doesn't break, so it appears to be a typeshed issue.
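A minimal runtime check of that claim (hypothetical function, not the actual rolling.py call site): functools.partial stores its wrapped callable positionally, so a keyword argument literally named "func" lands in partial.keywords and is forwarded without conflict.

```python
from functools import partial

def apply(func=None):
    return func

# "func" goes into p.keywords, not into partial's own first argument.
p = partial(apply, func=42)
print(p())          # 42
print(p.keywords)   # {'func': 42}
```

This works at runtime because partial's first parameter is positional-only in the C implementation, which is exactly the detail the typeshed stubs of the time got wrong.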
