-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix analytic ops over ungrouped and unordered windows on Pandas backend #2376
BUG: Fix analytic ops over ungrouped and unordered windows on Pandas backend #2376
Conversation
|
Going to need to investigate a bit into why the tests are failing on the Spark backend |
|
There are some differences in how backends view a window spec like this: Backends that treat this as a completely unbounded window (applies the function over ALL rows in the table):
Backends that treats this as an expanding window (apply the function over all rows up until the current row):
This is sort of orthogonal to this PR (this is hit by a test I added in this PR for completeness but not for testing the change being done) so this could be addressed in a separate PR. But to confirm, behavior across backends should be consistent regardless of how the backend itself treats windows in its non-Ibis API? If not, (if we want to accept the difference) I can just change the tests here to reflect the difference. |
|
Can you please add a release note? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. Some minor comments.
|
About the behavior difference I mentioned above: Created #2378 to track making ungrouped, ordered windows consistent across backends. I will x-fail that test for Impala and Spark for this PR. |
|
Still have a couple of tests I'm still troubleshooting: I need to confirm this is a problem with the Spark backend and not something that should be changed with the test (strangely, Spark inherits from Impala for this, yet Impala seems to pass for these tests) |
|
I think I know the reason behind the failing tests on the Spark backend, but I haven't been able to figure out why the Impala backend doesn't fail in the same way, given that the Spark backend inherits from the Impala backend. I'm going to stop investigating further since this is orthogonal to the main part of the PR anyways. I opened #2381 to look into this further |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1
ibis/pandas/execution/window.py
Outdated
| @@ -30,12 +30,19 @@ | |||
| from ibis.pandas.execution import util | |||
|
|
|||
|
|
|||
| def _post_process_empty(scalar, parent, order_by, group_by): | |||
| def _post_process_empty(result, parent, order_by, group_by): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type these args (and return type of this function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ibis/pandas/execution/window.py
Outdated
| result.index = index | ||
| return result | ||
| if isinstance(result, pd.Series): | ||
| # `result` is a Series e.g. when an analytic operation is being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elabort when this happens (e.g. what casuses this result to be a Series vs a scalar). Is this better handled in a dispatch rule though? e.g. to differentiate these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make the comments a little clearer
I think if all the _post_process_xxxx functions could be made into a dispatch rule that would be pretty nice, but some of them exactly share the same types of input parameters
In terms of a dispatch rule for just _post_process_empty, I don't think it'd be any clearer because I think the dispatch rule for the case that result is a scalar would have to be @_post_process_empty.register(Any), which doesn't clearly indicate that that is the scalar case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK this ^ has been done
| @@ -323,25 +329,120 @@ def test_bounded_preceding_windows(backend, alltypes, df, con, window_fn): | |||
| ), | |||
| ], | |||
| ) | |||
| @pytest.mark.parametrize( | |||
| ('ordered'), [param(True, id='orderered'), param(False, id='unordered')], | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these ids different for a reason? why are you using param here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this actually doesn't need param or ids. I'll change it to just [True, False]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this makes the test output a bit nicer, are you OK with keeping this as is?
# Without ids
XFAIL ibis/tests/all/test_window.py::test_unbounded_window_grouped[Pandas-True-mean]
vs
# With ids
XFAIL ibis/tests/all/test_window.py::test_unbounded_window_grouped[Pandas-orderered-mean]
|
thanks @timothydijamco |
Overview
Using analytic ops (like
lead,lag,first,last, or analytic UDFs) over a window where nogroup_byand noorder_byare defined produces an incorrect result.Example
Analytic op
lagover an ungrouped+unordered windowBefore this PR
The result of the analytic op is a
Series. ThatSeriesresult is strangely copiedntimes, wherenis the number of rows of the original column.After this PR
The
Seriesresult of the analytic op is treated as the end result of using the analytic op over the window.As additional verification that this is the correct result: the above table is what we get when we use the exact same code but with
window = ibis.window(order_by=[table.id])(both before and after this PR).Testing
New tests in
test_window.pyto exercise the four kinds of windows:group_byandorder_bygroup_byonlyorder_byonlygroup_bynororder_byNew test in
test_vectorized_udf.pyto test using an analytic op and adding the result to a table usingmutate. This implicitly executes the analytic op over a window. (Also a few additional general UDF tests for completeness)