-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix time context trimming error for multi column udfs in pandas backend #2712
BUG: Fix time context trimming error for multi column udfs in pandas backend #2712
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this PR number to the prior release note for changes in time context
|
|
||
| """ | ||
| # noop if timecontext is None | ||
| if not timecontext: | ||
| return data | ||
| assert isinstance(data, pd.Series) or isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert(isinstance, (pd.Series, pd.DataFrame))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| name = data.name if data.name else 0 | ||
| index_columns = list(subset.columns.difference([name])) | ||
| else: | ||
| name = data.columns.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need the .tolist() here just pass name directly to .difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops forgot to remove this, +1
| input_type=[dt.double, dt.double], | ||
| output_type=dt.Struct(['demean', 'demean_weight'], [dt.double, dt.double]), | ||
| ) | ||
| def demean_struct(v, w): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we import these UDFs instead of duplicating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I will just import from vectorized_udf tests
| @@ -361,15 +371,16 @@ def execute_window_op( | |||
| clients=clients, | |||
| **kwargs, | |||
| ) | |||
| series = post_process( | |||
| computed = post_process( | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calling this window_result is a little better I think
Also fine to reuse result as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will keep using result
| @@ -179,26 +179,32 @@ def get_aggcontext_window( | |||
| return aggcontext | |||
|
|
|||
|
|
|||
| def trim_window_series(data, timecontext: Optional[TimeContext]): | |||
| def trim_window_result(data, timecontext: Optional[TimeContext]): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate the type of "data"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that would be helpful, added
| index_columns = list(subset.columns.difference([name])) | ||
| else: | ||
| name = data.columns | ||
| index_columns = list(subset.columns.difference(name)) | ||
|
|
||
| # set the correct index for return Seires |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series -> Series/DataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
Minor comments. Otherwise LGTM. |
…bis into tiezheng/multi-col-udf-nogroupby
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
thanks @LeeTZ ping on green. |
Overview
When specifying time context, multi-column udf execution for pandas backend may end up with an exception. This PR aims to fix the issue.
Example
For the following example:
To compute a multi column udf over a table, the execution may fail with exception:
The reason of failing is that, in Window execution for pandas backend, we assume the computed result is always a
pd.Series. However in such cases of multi-column udf, the result of opAnalyticVectorizedUDFis apd.DataFrame.For the fix, we add logic to deal with both
pd.Seriesandpd.DataFramein trimming the computed result for a pandasWindowOpTest
Test case to cover this multi-column udf case is added in
ibis/backends/tests/test_timecontext.py