-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support multi args window UDF for pandas backend #2035
ENH: Support multi args window UDF for pandas backend #2035
Conversation
ibis/pandas/aggcontext.py
Outdated
| with warnings.catch_warnings(): | ||
| warnings.filterwarnings( | ||
| "ignore", message=".+raw=True.+", category=FutureWarning | ||
| # get the DataFrame from which the operand originated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the original logic for built-in aggregation functions.
ibis/pandas/tests/test_udf.py
Outdated
| def test_udaf_window_multi_params(): | ||
| @udf.reduction(['double', 'double'], 'double') | ||
| def my_wm(v, w): | ||
| print("v") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove these
ibis/pandas/udf.py
Outdated
| @@ -530,25 +531,33 @@ def execute_udaf_node_groupby(op, *args, **kwargs): | |||
| # | |||
| # If the argument is not a SeriesGroupBy then keep | |||
| # repeating it until all groups are exhausted. | |||
|
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove white space
ibis/pandas/aggcontext.py
Outdated
| @@ -361,6 +364,7 @@ def agg(self, grouped_data, function, *args, **kwargs): | |||
| else: | |||
| # do mostly the same thing as if we did NOT have a grouping key, | |||
| # but don't call the callable just yet. See below where we call it. | |||
|
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove white space
| @@ -52,15 +52,8 @@ def test_array_collect(t, df): | |||
| tm.assert_frame_equal(result, expected) | |||
|
|
|||
|
|
|||
| @pytest.mark.xfail( | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now supported as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments
ibis/pandas/aggcontext.py
Outdated
| import pandas as pd | ||
| from pandas import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob can just use of.Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
ibis/pandas/aggcontext.py
Outdated
| # create a generator for each input series | ||
| # the generator will yield a slice of the | ||
| # input series for each valid window | ||
| data = getattr(grouped_series, 'obj', grouped_series).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shouldn’t use .values here a that coerces to a ndarray
rather leave as a Series and use .iloc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I choose to do ndarray here because:
- This preserve the same udf API as before.
- Passing Series is about 15x slower than ndarray
I don't want to introduce API change and performance regression in this PR. I think we can have a separate chat whether window UDF should take ndarray or Series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shouldn’t show any perf issues must be something odd going on
does it currently take ndarray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it currently take ndarray (By using raw=True I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you create an issue to fix this, meaning to use iloc. .values is not type preserving and generally a bad idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created: #2045
ibis/pandas/aggcontext.py
Outdated
| inputs = args if len(args) > 0 else [grouped_data] | ||
|
|
||
| input_gens = list( | ||
| create_input_gen(arg, window_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you trying to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am creating generators for each inputs so later I can just call next(input). This hides the details of how next is implemented and unifies how we send data inputs and arg inputs to the user function.
ibis/pandas/aggcontext.py
Outdated
| result[mask] = valid_result | ||
| result.index = obj.index | ||
| else: | ||
| with warnings.catch_warnings(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would move these out to 2 module level functions rather than nesting like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is not needed anymore because we don't use raw=True - it was used for UDF and now UDF is handled separately.
ibis/pandas/tests/test_udf.py
Outdated
| @@ -271,6 +272,87 @@ def my_mean(series): | |||
| tm.assert_frame_equal(result, expected) | |||
|
|
|||
|
|
|||
| def test_udaf_window_interval(): | |||
| @udf.reduction(['double'], 'double') | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn’t this defined above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
ibis/pandas/udf.py
Outdated
| # the custom rolling logic. | ||
| result = aggcontext.agg(args[0], func, *args, **kwargs) | ||
| else: | ||
| iters = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would move this to a module level function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
700a22a
to
ee64a17
Compare
ibis/pandas/aggcontext.py
Outdated
| # create a generator for each input series | ||
| # the generator will yield a slice of the | ||
| # input series for each valid window | ||
| data = getattr(grouped_series, 'obj', grouped_series).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok can you create an issue to fix this, meaning to use iloc. .values is not type preserving and generally a bad idea.
ibis/pandas/aggcontext.py
Outdated
| raw_window_size = windowed.apply(len, raw=True).reset_index( | ||
| drop=True | ||
| ) | ||
| mask = ~(raw_window_size.isna()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls make this a separate function, this is too hard to grok inline here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| @@ -271,6 +274,88 @@ def my_mean(series): | |||
| tm.assert_frame_equal(result, expected) | |||
|
|
|||
|
|
|||
| def test_udaf_window_interval(): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are all of the window udf tests here or in test_window?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all winndow udf tests are here
ibis/pandas/udf.py
Outdated
| @@ -100,6 +101,29 @@ def arguments_from_signature(signature, *args, **kwargs): | |||
| return args, new_kwargs | |||
|
|
|||
|
|
|||
| def create_gens_from_args_groupby(args): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
ibis/pandas/udf.py
Outdated
| """ Create generators for each args for groupby udaf. | ||
|
|
||
| If the arg is SeriesGroupBy, return a generator that outputs each group. | ||
| If the arg is not SeriesGroupBy, return a generator that repeats the arg. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what else could this be? can you type it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved docstring
ibis/pandas/aggcontext.py
Outdated
| ) | ||
|
|
||
| valid_result = pd.Series(valid_result) | ||
| valid_result.index = window_size.index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you testing the output indexes are correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes there are tests (test_udaf_window_interval) that cover out of order indices and make sure the output is correct
ibis/pandas/aggcontext.py
Outdated
|
|
||
| valid_result = pd.Series(valid_result) | ||
| valid_result.index = window_size.index | ||
| result = pd.Series(np.repeat(None, len(obj))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really strange to do, what are you trying?
ibis/pandas/aggcontext.py
Outdated
| result = pd.Series(np.repeat(None, len(obj))) | ||
| result[mask] = valid_result | ||
| result.index = obj.index | ||
| else: | ||
| result = method(windowed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some commments on what is this case (again would be much better to split these out to free functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have separated this into separate functions
|
if you'd rebase can look again |
cccfa5c
to
de81e5a
Compare
|
@jreback Haven't addressed all comments. Will ping again when it's done. |
ibis/pandas/aggcontext.py
Outdated
| @@ -326,6 +327,83 @@ def __init__(self, kind, *args, **kwargs): | |||
| ) | |||
| self.construct_window = operator.methodcaller(kind, *args, **kwargs) | |||
|
|
|||
| def _agg_built_in(self, frame, windowed, function, *args, **kwargs): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make both of these module level functions (don't pass self, I think you can just pass max_lookback as as kord arg).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that these methods are very particular to Window aggregation context and therefore probably belongs in the class. I curious why module level functions are better here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Moved to module level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. can you add a release note (likely point to a new doc section) & a doc section (can be a followup PR).
ibis/pandas/aggcontext.py
Outdated
| @@ -313,6 +314,92 @@ def compute_window_spec_interval(_, expr): | |||
| return pd.tseries.frequencies.to_offset(value) | |||
|
|
|||
|
|
|||
| def _window_agg_built_in( | |||
| frame, windowed, function, max_lookback, *args, **kwargs | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally if you can type these arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ibis/pandas/aggcontext.py
Outdated
|
|
||
|
|
||
| def _window_agg_udf( | ||
| grouped_data, windowed, function, dtype, max_lookback, *args, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ibis/pandas/aggcontext.py
Outdated
| ): | ||
| """Apply window aggregation with UDFs. | ||
| """ | ||
| # Use custom logic to computing rolling window UDF instead of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include this in the doc-string under Notes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ibis/pandas/aggcontext.py
Outdated
| # This is because pandas's rolling function doesn't support | ||
| # multi param UDFs. | ||
|
|
||
| def create_input_gen(grouped_series, window_size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ibis/pandas/aggcontext.py
Outdated
| obj = getattr(grouped_data, 'obj', grouped_data) | ||
| name = obj.name | ||
| if frame[name] is not obj: | ||
| name = "{}_{}".format(name, ibis.util.guid()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use an fstring here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Donne
ibis/pandas/aggcontext.py
Outdated
| # TODO: see if we can do this in the caller, when the context | ||
| # is constructed rather than pulling out the data | ||
| columns = group_by + order_by + [name] | ||
| indexed_by_ordering = frame.loc[:, columns].set_index(order_by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use frame[columns].set_index(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
|
|
||
| def test_udaf_window_multi_args(): | ||
| @udf.reduction(['double', 'double'], 'double') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment (or this PR number)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
ibis/pandas/aggcontext.py
Outdated
| name = obj.name | ||
| if frame[name] is not obj: | ||
| name = f"{name}_{ibis.util.guid()}" | ||
| frame[name] = obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do this like
frame = frame.assign(name=obj)
to avoid mutating the input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you are just copying original code (but this should change anyhow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ibis/pandas/tests/test_udf.py
Outdated
| } | ||
| ) | ||
| con = ibis.pandas.connect({'df': df}) | ||
| t = con.table('df') | ||
| window = ibis.trailing_window(2, order_by='a', group_by='key') | ||
| expr = t.mutate(rolled=my_mean(t.b).over(window)) | ||
| expr = t.mutate( | ||
| wm_b=my_wm(t.b, t.d).over(window), wm_c=my_wm(t.c, t.d).over(window) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this will work if we use different windows? can you add a test for that, or if it doesn't work can you test the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| @@ -225,49 +250,107 @@ def my_corr2(lhs, **kwargs): | |||
| pass | |||
|
|
|||
|
|
|||
| def test_compose_udfs(): | |||
| def test_compose_udfs(t2, df2): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if the udaf raises an exception? are these caught in a reasonable way? just asking for a test (not even sure what this should do)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test_udf_error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comment & can you add a note in release.rst? ping on green.
| **kwargs, | ||
| ) | ||
| try: | ||
| return result.astype(self.dtype, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may want to add a comment on when this can fail
7fb3065
to
e652fb7
Compare
|
thanks @icexelloss very nice patch! |
This PR addresses issue #1998
Currently, when using UDAF with rolling window, pandas backend will throw an exception:
With this PR, it will support this use case.
The main idea of this PR is that instead of using
groupby().rolling().apply(func)to compute the result, we usegroupby().rolling().apply(len, raw=True)to get the size of each window, and then manually applyfuncto each window in a Python for-loop. This way, we work around issue thatgroupby().rolling().apply(func)can only take function that apply on a single series.Benchmarks
Before the change:
After the change: