FEAT: Support Ibis interval for window in pyspark backend #2409
Conversation
322f13d to 7a73fa1
ibis/pyspark/compiler.py
Outdated
```python
ordering_keys = [
    key.to_expr().get_name()
    F.col(key.to_expr().get_name()).cast('long')
```
This looks really weird. Why are we getting the op from the expr and then using to_expr to get the expr back from the op again?
```python
ordering_keys = [
    F.col(expr.get_name()).cast('long')
    if isinstance(expr, types.TimestampColumn)
    else expr.get_name()
    for expr in order_by
]
```
Does this work?
This doesn't work, but I get your point that we don't need to call op and to_expr at the same time. I refactored a little.
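A rough sketch of the shape the refactor takes (assumed, not the exact committed code; `op.order_by`, `F`, and `types` are taken from the snippets above): convert each ordering-key op to an expression once, then cast only timestamp keys to long so Spark can range-compare them.

```python
# Hedged sketch: op.order_by is assumed to hold ops whose expressions carry column names.
order_by_exprs = [key.to_expr() for key in op.order_by]
ordering_keys = [
    F.col(expr.get_name()).cast('long')      # timestamp keys become seconds-comparable longs
    if isinstance(expr, types.TimestampColumn)
    else expr.get_name()
    for expr in order_by_exprs
]
```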
ibis/pyspark/compiler.py
Outdated
```python
if window.following is None:
    end = Window.unboundedFollowing
elif isinstance(window.following, ir.IntervalScalar):
    end = int(execute(window.following).value / 1e9)
```
This duplicates the logic for window.preceding. Can you refactor?
Sure, I added a function to canonicalize these intervals.
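A minimal sketch of what such a canonicalization helper could look like (the function name and exact error wording are assumptions; the branches mirror the hunks quoted elsewhere in this review):

```python
def canonicalize_interval(interval):
    """Convert an Ibis window bound into what pyspark's rangeBetween expects.

    Sketch only: IntervalScalar values are executed to a pandas Timedelta
    (nanoseconds) and converted to seconds; plain ints pass through unchanged.
    """
    if interval is None:
        return None
    if isinstance(interval, ir.IntervalScalar):
        return int(execute(interval).value / 1e9)  # nanoseconds -> seconds
    if isinstance(interval, int):
        return interval
    raise com.UnsupportedOperationError(
        f'type {type(interval)} is not supported in preceding/following in window'
    )
```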
```python
def test_time_indexed_window(client):
    table = client.table('time_indexed_table')
    window1 = ibis.trailing_window(
```
Can you add tests for a future window as well?
Sure, I added one to test following and compared the result with a df generated by a pyspark window.
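A hedged sketch of such a test (the table fixture, column names, and helpers like `Window`, `F`, and `tm` for pandas testing are assumptions, not the committed test): build the window once with Ibis interval bounds and once directly with pyspark, then compare.

```python
def test_time_indexed_window_following(client):
    table = client.table('time_indexed_table')   # illustrative fixture/table names
    window = ibis.range_window(
        preceding=ibis.interval(hours=1),
        following=ibis.interval(hours=1),
        order_by='time',
        group_by='key',
    )
    result = table.mutate(mean=table['value'].mean().over(window)).execute()

    # Reference result built directly with a pyspark window; bounds are seconds.
    spark_df = client.table('time_indexed_table').compile()
    spark_window = (
        Window.partitionBy('key')
        .orderBy(F.col('time').cast('long'))
        .rangeBetween(-3600, 3600)
    )
    expected = spark_df.withColumn(
        'mean', F.mean(spark_df['value']).over(spark_window)
    ).toPandas()

    tm.assert_frame_equal(result, expected)
```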
High level looks good. Left some comments.
ibis/pyspark/compiler.py
Outdated
```python
)


# These ops need type cast after running over window
```
As we agreed, .over(window) should be lifted to this function instead of being passed via kwargs.
This for sure works for compile_aggregator, but there are other ops that are dispatched on this line, for example lead, lag, row_number, first, last, etc. We used to pass window as a param to them.
I removed the window param from these functions, but some ops, like the three I listed here, do a type cast after running over the window. Let me know if this change makes sense to you, or whether there is a better way to do it. @icexelloss
I see. This is probably why over(window) was not in compile_window_op originally.
I don't think what you have here is actually much better than the original implementation, and I think we should roll back to it.
I agree that branching too much isn't a good idea. I reverted and commented below for thoughts.
ibis/pyspark/compiler.py
Outdated
```diff
-def compile_notany(
-    t, expr, scope, timecontext, *, context=None, window=None, **kwargs
-):
+def compile_notany(t, expr, scope, timecontext, *, context=None, **kwargs):
```
https://github.com/ibis-project/ibis/pull/2409/checks?check_run_id=1178201082#step:4:1284
We removed window here and it seems tests are failing for this and for compile_notall. Given the comments here, could you give some suggestions on how we should refactor? @icexelloss
This function looks fine and I don't think we need to change this. What exactly failed and how?
The stacktrace is:
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and 'functional_alltypes.`index`' is not an aggregate function. Wrap '(min((functional_alltypes.`double_col` = CAST(0 AS DOUBLE))) AS `_w0`)' in windowing function(s) or wrap 'functional_alltypes.`index`'
I debugged into this issue. For col_in_selection_order in compile_selection_op, the correct value should be:
['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month', Column<b'(NOT min((double_col = 0)) OVER (PARTITION BY string_col ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS `val`'>]
while the wrong value is
['index', 'Unnamed: 0', 'id', 'bool_col', 'tinyint_col', 'smallint_col', 'int_col', 'bigint_col', 'float_col', 'double_col', 'date_string_col', 'string_col', 'timestamp_col', 'year', 'month', Column<b'(NOT min((double_col = 0))) OVER (PARTITION BY string_col ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `val`'>]
Basically we need to call .over(window) inside the F.col calls, and note that there are negate operations we also have to take care of. The fix we proposed, lifting .over(window) to compile_window_op, breaks this behavior.
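To illustrate the difference (a sketch of the two orderings, not taken from the codebase; `window` is an assumed pyspark WindowSpec): the negation has to wrap the already-windowed aggregate, not the other way around.

```python
cond = F.col('double_col') == 0

# Correct: apply the window to the aggregate first, then negate.
correct = (~F.min(cond).over(window)).alias('val')

# Wrong: negate the aggregate and only then apply the window, which compiles to
# "(NOT min(...)) OVER (...)" and triggers the AnalysisException above.
wrong = (~F.min(cond)).over(window).alias('val')
```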
As per discussion with @icexelloss, NotAny must be compiled to ~compile(Any), otherwise table aggregation breaks. We also have to keep NotAny calling compile_aggregator because it can be used in a table aggregation, e.g.
table.groupby('key').aggregate(new_col=table['bool'].not_any())
So we cannot easily change the rule for compiling NotAny to reorder where .over(window) is inserted.
Let me just revert the change that removes window and address this in follow-ups.
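Roughly, the constraint looks like this (a simplified sketch with assumed signatures, not the actual compiler code): NotAny has to stay a negation of the compiled Any aggregator, with the window applied inside the negation.

```python
# Sketch only: compile_any / compile_notany signatures are simplified assumptions.
def compile_notany(t, expr, scope, timecontext, *, context=None, window=None, **kwargs):
    # NotAny is the negation of Any; the window (if any) must be applied to the
    # aggregate *before* negating, so the SQL reads NOT(agg(...) OVER (...)).
    return ~compile_any(
        t, expr, scope, timecontext, context=context, window=window, **kwargs
    )
```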
171637e to 41e2a39
Overview
I reverted the refactoring that removes window [...].

The Problem
The problem of [...]. A perfect fix would be removing [...].

So if we move [...]. The negation [...]. So we cannot easily change the rule for compiling NotAny [...].

The Change
Given the complexity of removing [...].

Follow-ups
We should definitely think about how we could refactor all [...].
…:LeeTZ/ibis into tiezheng/fix_pyspark_time_indexed_window
As per offline discussion with Li, I am keeping this PR simple, just supporting Ibis interval for the pyspark window; let me create another PR to do the refactor.

Sorry, I am -1 on this idea. I think if we cannot trust the cache then the fundamental invariant of the compile algorithm breaks (an op uniquely defines its compilation result). Per discussion, we agree the best solution going forward is to lift the pyspark .over(window) call. This increases the complexity of compile_window_op, but at least the complexity is local to that function. "Local complexity" is better than the "global complexity" that the flag solution would introduce.
ibis/pyspark/compiler.py
Outdated
| """ | ||
| if isinstance(interval, ir.IntervalScalar): | ||
| # value is in nanosecond | ||
| return int(execute(interval).value / 1e9) |
Do we have a translation rule for IntervalScalar? Perhaps we can add a rule for IntervalScalar and do
```python
@compiles(IntervalScalar)
def compile_interval_scalar(interval):
    return execute(interval)

timedelta = t.translate(interval, ...)  # This is a pd.Timedelta
second_since_epoch = int(timedelta.value / 1e9)
```
IntervalScalar is not of type op (operations), while all the other @compiles rules compile ops. I find it a little strange to do so.
op is Literal
Hmm, I agree with @icexelloss here, shouldn't we just be able to translate?
or should this be a common method with other backends?
I added this to compile(Literal).
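A hedged sketch of what handling interval values inside the Literal rule could look like (the dispatch target, signature, and attribute names are assumptions based on the discussion above, not the committed code):

```python
@compiles(ops.Literal)
def compile_literal(t, expr, scope, timecontext, **kwargs):
    op = expr.op()
    if isinstance(op.dtype, dtypes.Interval):
        # Interval literals execute to a pandas Timedelta; the window-bound code
        # can then turn .value (nanoseconds) into seconds since epoch.
        return execute(expr)
    # ... existing handling for other literal types ...
```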
ibis/pyspark/tests/test_window.py
Outdated
```python
pytestmark = pytest.mark.pyspark


def test_time_indexed_window(client):
```
Can you parameterize this test by the window bounds?
(trailing_window(ibis.interval(hours=1), ...), Window.partitionBy(...).rangeBetween(...))
and
(ibis.range_window(..., ...),
Window.partitionBy(...).rangeBetween(...))
We cannot simply put a Window object in the params because that might trigger the error 'NoneType' object has no attribute '_jvm' when the Spark JVM is not initialized yet. Instead, I put the range in the params.
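A sketch of that parameterization (param and fixture names are assumptions): the Ibis window and the raw rangeBetween bounds go into the params, and the pyspark Window object is only built inside the test body, after the Spark session and its JVM exist.

```python
@pytest.mark.parametrize(
    ('ibis_window', 'spark_range'),
    [
        # trailing one-hour window -> rangeBetween(-3600, 0)
        (ibis.trailing_window(ibis.interval(hours=1), order_by='time'), (-3600, 0)),
        # one hour preceding and following -> rangeBetween(-3600, 3600)
        (
            ibis.range_window(
                preceding=ibis.interval(hours=1),
                following=ibis.interval(hours=1),
                order_by='time',
            ),
            (-3600, 3600),
        ),
    ],
)
def test_time_indexed_window(client, ibis_window, spark_range):
    table = client.table('time_indexed_table')
    # The pyspark Window is built here, once the JVM is up.
    spark_window = Window.orderBy(F.col('time').cast('long')).rangeBetween(*spark_range)
    ...
```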
Reviewed again.
ibis/pyspark/compiler.py
Outdated
```python
if isinstance(interval, ir.IntervalScalar):
    # value is in nanosecond
    return int(execute(interval).value / 1e9)
```
Is there a concern if this is too big here? E.g., I think the resolution on Spark is only ms?
Here, for window bounds, Spark uses seconds since epoch; execute gives us a Timedelta whose value is in nanoseconds.
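For concreteness (plain pandas, nothing ibis-specific): a Timedelta's .value is in nanoseconds, so dividing by 1e9 gives the number of seconds that a range-based Spark window bound expects.

```python
import pandas as pd

td = pd.Timedelta(hours=1)
assert td.value == 3_600_000_000_000    # .value is in nanoseconds
assert int(td.value / 1e9) == 3600      # seconds, usable as a rangeBetween bound
```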
ibis/pyspark/compiler.py
Outdated
```python
    return int(execute(interval).value / 1e9)
elif isinstance(interval, int):
    return interval
else:
```
just raise
Sure, I removed the else here.
ibis/pyspark/compiler.py
Outdated
```python
    return interval
else:
    raise com.UnsupportedOperationError(
        'type {} is not supported in preceding/following in window'.format(
```
use f-strings
+1
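The f-string version would look something like this (assuming the value being formatted is the interval's type):

```python
raise com.UnsupportedOperationError(
    f'type {type(interval)} is not supported in preceding/following in window'
)
```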
ibis/pyspark/tests/test_window.py
Outdated
```python
# TODO: multi windows don't update scope correctly
@pytest.mark.xfail
```
Can you add this in the reason for the xfail and create an issue for it (and list it here)?
+1
np, this is just the next issue I am going to fix, #2412
lgtm, minor comment in tests & can you add a release note with this PR number? ping on green.
ibis/pyspark/tests/test_window.py
Outdated
```python
# TODO: multi windows don't update scope correctly
@pytest.mark.xfail(reason='Issue #2412', strict=True)
```
Can you add a description as well here?
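For example, folding the TODO into the reason (sketch only; the exact wording is up to the author):

```python
@pytest.mark.xfail(
    reason="Issue #2412: multiple windows don't update scope correctly",
    strict=True,
)
```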
@icexelloss @jreback CI is green for tests now but failed for impala/clickhouse and docbuild. Could we ignore that and proceed, or should I run it again?

yep this is fine
LGTM
Overview
Currently, window size defined by ibis.interval is broken for the pyspark backend. This PR aims to fix this issue.

The Problem
In the pyspark backend, for a table generated by the following df [...], calculating a mean with a trailing_window ordered by the time column on this table [...] will result in an error [...].

The Fix
- Cast TimeStampColumn ordering keys to long type (seconds since epoch); this is required for Spark to filter and compare time.
- Canonicalize preceding and following; now None, IntervalScalar, and int are supported for preceding and following in a pyspark window.

Tests
Tests are added in ibis/pyspark/tests/test_window.py.

Follow-ups
Once this is fixed, the next step is to add logic for context adjustment in the pyspark backend. Again, to make things easier to review, context adjustment will be done in follow-up PRs.