
REF: Separate window bounds calculation from aggregation functions #29428

Merged (42 commits) on Nov 21, 2019

Conversation

@mroeschke (Member) commented Nov 6, 2019

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Pre-req for #28987

Currently many of the aggregation functions in window.pyx follow the form:

def roll_func(values, window, minp, N, closed):
    # calculate window bounds _and_ validate arguments
    start, end, ... = get_window_bounds(values, window, minp, N, ...)
    for i in range(len(values)):
        s = start[i]
        ....

This PR refactors the window bound calculation and validation out into window_indexer.pyx, so the aggregation functions can take the form:

def roll_func(values, start, end, minp):
    for i in range(len(values)):
        s = start[i]
        ....

The methods in rolling.py therefore now follow this pattern:

  1. Fetch the correct cython aggregation function (whether the window is fixed or variable), and prep it with kwargs if needed
  2. Compute the start and end window bounds from functionality in window_indexer.pyx
  3. Pass the values, start, end, and min periods into the aggregation function.
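As an illustration, the separation described above can be sketched in pure Python (hypothetical names and list-based code; the real implementation is Cython in window_indexer.pyx and window.pyx):

```python
# Step 2: compute start/end bounds independently of any aggregation.
def get_fixed_window_bounds(num_values, window):
    end = [min(i + 1, num_values) for i in range(num_values)]
    start = [max(0, e - window) for e in end]
    return start, end

# Step 3: the aggregation only consumes precomputed bounds.
def roll_sum(values, start, end, minp):
    out = []
    for s, e in zip(start, end):
        if (e - s) >= minp:
            out.append(sum(values[s:e]))
        else:
            out.append(float("nan"))
    return out

start, end = get_fixed_window_bounds(5, window=3)
result = roll_sum([1, 2, 3, 4, 5], start, end, minp=1)
# result: [1, 3, 6, 9, 12]
```

Because the bounds are plain arrays, the same `roll_sum` works unchanged for fixed and variable windows; only the bounds function differs.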

@mroeschke added the Refactor (Internal refactoring of code) and Window (rolling, ewma, expanding) labels on Nov 6, 2019

return self._apply(f, func, args=args, kwargs=kwargs, center=False, raw=raw)
# Why do we always pass center=False?
Member: TODO?

return start, end

def get_window_bounds(self):
return self.start, self.end
Member: newline


# TODO: Maybe will need to use this?
# max window size
#self.win = (self.end - self.start).max()
Member: remove?

# max window size
#self.win = (self.end - self.start).max()

def build(self, const int64_t[:] index, int64_t win, bint left_closed,
Member: looks like the function doesn't use self?

"""
def __init__(self, ndarray values, int64_t win, object closed, object index=None):
cdef:
ndarray start_s, start_e, end_s, end_e
Member: ndarray[int64_t, ndim=1]?

@jbrockmendel (Member): are there more window/rolling-centric .pyx files on the horizon? If so, would it make sense to make a _libs/window/ directory?

@mroeschke (Member Author): @jbrockmendel Should just be window.pyx and window_indexer.pyx for now, but I think those two files are enough to split into their own directory as you suggested. Will tackle that reorg step once I get all the tests passing.

@jbrockmendel (Member): sounds good. I think skiplist may belong in there too.

If the intra-pandas dependencies of _libs/window/ can be tightly locked down (e.g. "only _libs.util"), that'd be great.

@mroeschke (Member Author): @jreback @jbrockmendel tests are passing locally now. Since this PR is already bulky, the follow-up PR will:

  1. Create a new pandas/_libs/window directory with indexers.pyx and aggregations.pyx
  2. Add BaseIndexer base class and make it passable as a window.

@mroeschke changed the title from "WIP: REF: Separate window bounds calculation from aggregation functions" to "REF: Separate window bounds calculation from aggregation functions" on Nov 20, 2019
@jreback (Contributor) left a comment: lgtm. followups noted. ping on green.

@TomAugspurger @jorisvandenbossche @jbrockmendel if any comments

end_e = start_e + win
self.end = np.concatenate([end_s, end_e])

def get_window_bounds(self):
Contributor: hmm interesting, though I think it can still be typed. Anyhow, one to try in a followup.

@TomAugspurger (Contributor): Happy to defer to others here. Things seem nice based on a quick skim.

@@ -442,80 +182,75 @@ cdef inline void remove_sum(float64_t val, int64_t *nobs, float64_t *sum_x) nogi
sum_x[0] = sum_x[0] - val


def roll_sum(ndarray[float64_t] values, int64_t win, int64_t minp,
object index, object closed):
def roll_sum_variable(ndarray[float64_t] values, ndarray[int64_t] start,
Member: does ndarray[type_t] vs type_t[:] make a difference here?

Member Author: Not entirely noticeable?

# np buffer
In [1]: N = 1_000_000
   ...: s = pd.Series(range(N), index=pd.date_range('2019', periods=N, freq='s'))
   ...: roll = s.rolling('1H')
   ...: %timeit roll.sum()
28.8 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# memoryview
In [1]: N = 1_000_000
   ...: s = pd.Series(range(N), index=pd.date_range('2019', periods=N, freq='s'))
   ...: roll = s.rolling('1H')
   ...: %timeit roll.sum()
28.8 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# fixed window
output = np.empty(N, dtype=float)
Member: does float vs np.float64 matter?

Member Author: Not sure in a cython context, but:

In [2]: %timeit np.empty(N, dtype=float)
127 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit np.empty(N, dtype=np.float64)
127 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: N
Out[4]: 1000000
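A quick (non-Cython) check of the point being made here: as a NumPy dtype argument, Python's builtin float is interpreted as float64, so the two spellings produce arrays with identical dtypes.

```python
import numpy as np

# dtype=float and dtype=np.float64 resolve to the same NumPy dtype.
a = np.empty(10, dtype=float)
b = np.empty(10, dtype=np.float64)
assert a.dtype == b.dtype == np.float64
```

Whether this equivalence also holds for the C-level typing inside a Cython module is the part the reviewer is unsure about; at the Python/NumPy level there is no difference.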

with nogil:
for i in range(minp - 1):
val = values[i]
add_skew(val, &nobs, &x, &xx, &xxx)
Member: can x, xx, and xxx have more informative names? (not a blocker, as it's the same as the status quo)

Member Author: Yeah, copied verbatim, but can address in a followup.


Parameters
----------
values: numpy array
Member: space after colon; in these files we usually specify the dtype as if it were a cython annotation, so np.ndarray[np.float64]

minp, index,
closed,
floor=0)
counts = roll_sum_fixed(np.concatenate([np.isfinite(arr).astype(float),
Member: if it is perf-relevant, I expect there is a cnp version of isfinite.

Same question as before about float vs float64.

Member Author: Was getting TypeError: only size-1 arrays can be converted to Python scalars from the test suite when trying to use cnp.math.isfinite here. I can try to get it working later, but I am fairly confident that it won't be a performance bottleneck here.
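The counting trick in the diff context above can be sketched in pure Python (illustrative names, not the Cython code): rolling-summing a 0/1 finiteness mask yields the per-window count of usable observations.

```python
import math

def rolling_finite_count(values, window):
    # 1.0 for finite values, 0.0 for nan/inf; a rolling sum of this
    # mask counts the finite observations in each window.
    mask = [1.0 if math.isfinite(v) else 0.0 for v in values]
    out = []
    for i in range(len(mask)):
        s = max(0, i + 1 - window)
        out.append(sum(mask[s:i + 1]))
    return out

rolling_finite_count([1.0, float("nan"), 3.0, 4.0], window=2)
# → [1.0, 1.0, 1.0, 2.0]
```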

if n == 0:
return obj

arr = np.asarray(obj)
Member: can we be confident we won't get here with e.g. datetime64tz?

Member Author: Each block of data is first attempted to be cast to float:

            try:
                values = self._prep_values(b.values)

            except (TypeError, NotImplementedError):
                if isinstance(obj, ABCDataFrame):
                    exclude.extend(b.columns)
                    del block_list[i]
                    continue
                else:
                    raise DataError("No numeric types to aggregate")

Therefore np.asarray(obj) should always be valid here. (Also copied verbatim from the refactor)

@@ -1414,8 +1454,14 @@ def skew(self, **kwargs):
)

def kurt(self, **kwargs):
window_func = self._get_cython_func_type("roll_kurt")
kwargs.pop("require_min_periods", None)
Member: what happens here if the user passes a weird value for require_min_periods?

Member Author: require_min_periods is effectively an internal variable and shouldn't be expected from an external API. I need to pop it here because of kwargs passed from other super calls.


import numpy as np
from numpy cimport ndarray, int64_t

Member: is there anything in this module that we can/should test independently of the rest of the implementation?

@mroeschke (Member Author) commented Nov 21, 2019: In a 2nd follow-up PR, I am planning on allowing users to create their own "window indexers" to be passed into rolling(...). In that PR I can add tests for these existing indexers. They have been effectively smoke tested since they get hit with every rolling test.
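A hypothetical sketch of that "custom window indexer" idea: an object whose only job is to produce start/end bounds, which the rolling aggregations then consume. Names here are illustrative, not the eventual pandas API.

```python
class ExpandingIndexer:
    """Bounds-only object: every window starts at 0 and grows by one."""

    def __init__(self, num_values):
        self.start = [0] * num_values
        self.end = list(range(1, num_values + 1))

    def get_window_bounds(self):
        # Mirrors the get_window_bounds pattern in the diff above.
        return self.start, self.end

start, end = ExpandingIndexer(4).get_window_bounds()
# start == [0, 0, 0, 0], end == [1, 2, 3, 4]
```

Any aggregation written against (values, start, end, minp) would work with these bounds without knowing how they were produced.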

@jreback (Contributor) left a comment: minor follow up comments, thanks @mroeschke

output[:] = NaN
return output

win = (end - start).max()
Contributor: can create win above (followup ok)

end_e = start_e + win
self.end = np.concatenate([end_s, end_e])

def get_window_bounds(self):
Contributor: it's optimizing readability :->

@@ -96,280 +96,20 @@ def _check_minp(win, minp, N, floor=None) -> int:
# Physical description: 366 p.
# Series: Prentice-Hall Series in Automatic Computation

# ----------------------------------------------------------------------
Contributor: I don't think _check_minp is used above? It can likely be moved to indexer.pyx anyhow (next pass).

Contributor: also, the references above are misplaced; not sure where they should go.

kwargs=kwargs,
raw=raw,
offset=offset,
func=func,
Member: mypy error: "partial" gets multiple values for keyword argument "func"

Member Author: What version of mypy raises this? I don't get this with 0.740:

(pandas-dev) matthewroeschke:pandas-mroeschke matthewroeschke$ mypy pandas
pandas/core/indexes/frozen.py:112: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as overloaded function)
pandas/core/indexes/frozen.py:112: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Union[int, slice]], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], int], Any]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Any], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], Iterable[Any]], None]")
pandas/core/indexes/frozen.py:113: error: Incompatible types in assignment (expression has type "Callable[[FrozenList, VarArg(Any), KwArg(Any)], Any]", base class "list" defined the type as "Callable[[List[Any], DefaultNamedArg(Optional[Callable[[Any], Any]], 'key'), DefaultNamedArg(bool, 'reverse')], None]")
Found 6 errors in 1 file (checked 807 source files)

Member: I'm getting that error on 0.740 with --check-untyped-defs (on #28339).

The problem is that the required argument for partial is named func, so I assume you can't also pass func as a keyword argument.

functools.partial(func, /, *args, **keywords)

EDIT: 0.730 -> 0.740

Member: Looking into this further, I think this is a false positive from mypy.

The __new__ of class partial seems to be able to handle this use case; testing with a minimal example doesn't break, so it appears to be a typeshed issue.
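A minimal runtime check of that claim (hypothetical function, not the actual rolling.py call site): functools.partial stores its wrapped callable positionally, so a keyword argument literally named "func" lands in partial.keywords and is forwarded without conflict.

```python
from functools import partial

def apply(func=None):
    return func

# "func" goes into p.keywords, not into partial's own first argument.
p = partial(apply, func=42)
print(p())          # 42
print(p.keywords)   # {'func': 42}
```

This works at runtime because partial's first parameter is positional-only in the C implementation, which is exactly the detail the typeshed stubs of the time got wrong.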
