BUG: Fix same window op with different window size on table lead to incorrect results for pyspark backend #2414
Conversation
ibis/pyspark/compiler.py
Outdated
# For NotAll and NotAny, negation must be applied after .over(window)
# Here we rewrite node to be its negation, and negate it back after
# translation and window operation
negated = False
Let's have cleaner branching like this:
if isinstance(op, (ops.NotAll, ops.NotAny)):
return ...
elif isinstance(op, (ops.MinRank, ops.DenseRank, ops.RowNumber)):
return ...
else:
return ...
That's indeed more readable. However, I argue that this may introduce code duplication:
result = t.translate(operand, scope, timecontext, context=context).over(
pyspark_window
)
will repeat 3 times.
The motivation for the current structure is pre_processing, translation, post_processing.
I revised it as you suggested for better readability.
Looks good in general; a few minor comments.
ibis/pyspark/compiler.py
Outdated
elif isinstance(res_op, (ops.MinRank, ops.DenseRank, ops.RowNumber)):
    # result must be cast to long type for rank / rownumber
    return (
        t.translate(operand, scope, timecontext, context=context)
Factor this out into a variable:
t.translate(operand, scope, timecontext, context=context)
lgtm ping on green
@jreback Green now
Reviewed another round.
LGTM
thanks @LeeTZ |
Overview
This PR fixes #2412, where the same window op with different window sizes on a table leads to incorrect results for the pyspark backend.
A refactor of the `window` param in pyspark translation is proposed.
Example
Let's say we create two windows with different sizes to calculate `Mean()` on the same column of a table. Currently, this will fail: the compiled result in `result_pd` is wrong, and the two columns with different window sizes come out identical.
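The PR's original ibis snippet was not captured in this page. As a hypothetical stand-in (plain Python, no ibis involved), here is what correct behavior looks like: the same mean computed under two window sizes must yield two different result columns.

```python
# Hypothetical illustration (not the PR's original example): a trailing
# rolling mean over one column, computed with two different window sizes.

def rolling_mean(values, window):
    """Trailing rolling mean: each position averages up to `window` values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

col = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_small = rolling_mean(col, window=2)  # stand-in for a 1h window
mean_large = rolling_mean(col, window=4)  # stand-in for a 2h window

# Correct behavior: the two columns differ.  The bug in #2412 was that
# the pyspark backend returned the same column for both windows.
assert mean_small != mean_large
```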
The Problem
The reason these two columns are the same is that the value we stored in `scope` is wrong: the second window is never compiled.
Here we are calculating `Mean()` on 2 different windows, 1h and 2h. Note that in `compile_window_op` we call the dispatched compile function (e.g. `compile_aggregator`), passing `window` as a param, so the window is applied inside the dispatched compile process.

This is actually wrong. Since we pass `window` as a param to `compile_aggregator`, `Mean()` will be compiled to `Col<... over 1h window>`. When we translate the second window (2h), `translate()` will first look up `Mean()` in `scope`, and `Mean()` is already compiled to `Col<... over 1h window>`. There is no window information in the scope key for `Mean()`, so the 2h window will not be translated again, and we get an incorrect result: two windows of the same size.

The Change
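A minimal, hypothetical sketch of the caching bug described above, with a plain Python dict standing in for ibis's `scope` (this is not the actual compiler code):

```python
# Buggy scheme: the cache key is the op alone, ignoring the window.
buggy_scope = {}

def translate_buggy(op, window):
    if op in buggy_scope:        # cache hit on the bare op...
        return buggy_scope[op]   # ...returns the 1h result for the 2h window too
    result = f"Col<{op} over {window} window>"
    buggy_scope[op] = result
    return result

assert translate_buggy("Mean()", "1h") == translate_buggy("Mean()", "2h")  # bug!

# Fix direction taken by this PR: cache only the window-less translation,
# and apply the window outside, in compile_window_op.
fixed_scope = {}

def translate_fixed(op):
    if op not in fixed_scope:
        fixed_scope[op] = f"Col<{op}>"
    return fixed_scope[op]

def compile_window_op(op, window):
    # .over(window) is applied here, so the cached value stays window-free.
    return f"{translate_fixed(op)}.over({window})"

assert compile_window_op("Mean()", "1h") != compile_window_op("Mean()", "2h")
```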
To make our scope cache correct and reliable, we should lift the logic that runs `.over(window)` into `compile_window_op`. This is a little complicated:
There are operations like `Lead`, `Lag`, `NotAny`, `NotAll`, `Rank`, etc. These operations take `window` as a param, and they run `.over(window)` inside the dispatch function that compiles them. If we move `.over(window)` outside of these ops, there will be several issues:

In `compile_rank` and `compile_dense_rank`, the result is cast to long type. So if we move `.over(window)` to `compile_window_op`, we will have to do some post-processing for the typecast.

In `compile_notany` and `compile_notall`, the result is negated.
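As a hedged sketch of how the negation case can be handled with pre-processing and post-processing around the window application (mock string-building helpers, not ibis's real API):

```python
# Hypothetical mock of the rewrite idea: NotAny/NotAll are rewritten to
# Any/All before translation, the window is applied, and the negation is
# re-applied afterwards.

def translate(op):
    return f"spark_col({op})"

def over(col, window):
    return f"{col}.over({window})"

def negate(col):
    return f"~({col})"

def compile_window_op(op, window):
    # pre-processing: rewrite NotAny/NotAll to their un-negated forms
    negated = op in ("NotAny", "NotAll")
    if negated:
        op = {"NotAny": "Any", "NotAll": "All"}[op]
    # translation + window application happen on the rewritten op
    result = over(translate(op), window)
    # post-processing: negation must come after .over(window)
    return negate(result) if negated else result

print(compile_window_op("NotAny", "w"))  # ~(spark_col(Any).over(w))
```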
.over(window)and then negate it. The solution we propose here is a rewrite in pre-translation to passAnyandAllop to translation, and negate them back after window operation.How is this tested
For window operations, cases are covered in `ibis/tests/all/test_window.py`. For multi-window, tests are added in `ibis/pyspark/tests/test_window.py` (actually a re-enable of a test previously marked as a failure).