-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT: Support context adjustment for udfs for pandas backend #2646
FEAT: Support context adjustment for udfs for pandas backend #2646
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a release note as well
ibis/expr/timecontext.py
Outdated
| @@ -127,6 +127,40 @@ def canonicalize_context( | |||
| return begin, end | |||
|
|
|||
|
|
|||
| def construct_multi_index_series( | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a more informative name for this. yes its constructing a MI series, but its doing it for time_context.
maybe
construct_time_context_aware_series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I changed the name to be construct_time_context_aware_series
ibis/expr/timecontext.py
Outdated
| def construct_multi_index_series( | ||
| series: pd.Series, frame: pd.DataFrame | ||
| ) -> pd.Series: | ||
| """ Construct a pd.MultiIndex of IntIndex and 'time' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why IntIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the IntIndex here presenting the groupby key? What about the case that groupby key is not integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrase this a little bit, it meant to add 'time' into the original index, whatever type it is.
ibis/expr/timecontext.py
Outdated
| if TIME_COL == frame.index.name: | ||
| time_index = frame.index | ||
| elif TIME_COL in frame: | ||
| time_index = frame.set_index(TIME_COL).index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably question for Jeff: Is this the best way to construct an index from existing column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no ,do this
pd.Index(frame[TIME_COL])
ibis/expr/timecontext.py
Outdated
|
|
||
| Parameters | ||
| ---------- | ||
| series: pd.Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe what is the series and what is the frame here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as comments above, added description, as well as examples.
ibis/expr/timecontext.py
Outdated
| @@ -127,6 +127,40 @@ def canonicalize_context( | |||
| return begin, end | |||
|
|
|||
|
|
|||
| def construct_multi_index_series( | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also feels to me what the function is an implementation of the pandas backend and therefore, probably don't belong in ibis/expr/timecontext.py module. Is there a place you can put this in the pandas module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this. This could be a util function in general, also putting this in pandas/execution/timecontext.py seems to cause circular dependency. I could revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it have a circular dependency?
| @@ -338,6 +341,8 @@ def execute_window_op( | |||
| ) | |||
| series = post_process(result, data, ordering_keys, grouping_keys) | |||
|
|
|||
| if timecontext: | |||
| series = construct_multi_index_series(series, data) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the series and data matching in row ordering here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. I think the series is ordered in group_by ordering, and the data is ordered in original ordering (probably ordered by time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point. I made some modifications here. Time index should be injected to series prior to groupby. I add logics in aggcontext.py for this, row order is well tracked in agg phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
|
Update with a change in |
| @@ -53,13 +62,18 @@ def _post_process_group_by( | |||
| parent: pd.DataFrame, | |||
| order_by: List[str], | |||
| group_by: List[str], | |||
| **kwargs, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be timecontext instead of kwargs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be, but timecontext is currently not used in this function so I didn't make it as an explicit param
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good pattern to specify all param explicitly even some of them are not used. kwargs brings flexibility which I don't think you need here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Explicitness over implicitness)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @icexelloss here, we don't want kwargs at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx, I removed kwargs
| return series.mean() | ||
|
|
||
|
|
||
| def test_context_adjustment_window_udf_nogroupby_noorderby( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move these to the backend test and test this with pyspark backend as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
ibis/expr/timecontext.py
Outdated
| 1 2017-01-03 01:02:03.234 2 2.2 | ||
| 2 2017-01-04 01:02:03.234 3 3.3 | ||
|
|
||
| For a series of: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some index here for better readability ?
-
For a series:
...
The result series will be -
For a series:
...
The result will be
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, combining with Jeff's comment I update this to be runnable code / ipython output
| @@ -53,13 +62,18 @@ def _post_process_group_by( | |||
| parent: pd.DataFrame, | |||
| order_by: List[str], | |||
| group_by: List[str], | |||
| **kwargs, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with @icexelloss here, we don't want kwargs at all.
ibis/expr/timecontext.py
Outdated
| Examples: | ||
| Assume frame is: | ||
| time id value | ||
| 0 2017-01-02 01:02:03.234 1 1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you actually programatically construct this, then show it.
also use
Examples
-----------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOW do this in ipython and basically copy it here (or alternatly do it in a python terminal and use >>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea is that this code can be copy pasted to a terminal for execution directly; so include imports too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point thx
ibis/expr/timecontext.py
Outdated
| Name: value, dtype: float64 | ||
|
|
||
| The result series will be: | ||
| time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move all examples after Parameters
ibis/expr/timecontext.py
Outdated
| if TIME_COL == frame.index.name: | ||
| time_index = frame.index | ||
| elif TIME_COL in frame: | ||
| time_index = frame.set_index(TIME_COL).index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no ,do this
pd.Index(frame[TIME_COL])
2b971fd
to
85d05a2
Compare
…is into tiezheng/udf_context_adjustment
85d05a2
to
3dc41d7
Compare
|
Move tests for timecontext for udfs to be under |
| @@ -920,6 +920,7 @@ def execute_database_table_client( | |||
| df = client.dictionary[op.name] | |||
| if timecontext: | |||
| begin, end = timecontext | |||
| TIME_COL = get_time_col() | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks funky, why are we setting a constant here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably I should make this not capitalized to avoid misunderstading here. TIME_COL is not a constant anymore, this just means, read from config for the name of time col and assign the value to a variable, for later execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yeah, upper case is confusing because it looks like you are reassigning value to a constant.
…is into tiezheng/udf_context_adjustment
0ad5eeb
to
b7c0521
Compare
| # indexes are retturned. | ||
| if TIME_COL not in df: | ||
| # indexes are returned. | ||
| TIME_COL = get_time_col() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here (as you describe above)
ibis/expr/timecontext.py
Outdated
| pd.Series | ||
|
|
||
| Examples | ||
| ----------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
underline should be the same length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
ibis/expr/timecontext.py
Outdated
| Name: value, dtype: float64 | ||
| The result is unchanged for a series already has 'time' as its index. | ||
| """ | ||
| TIME_COL = get_time_col() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above (make this lowercase)
a7cf598
to
b7c0521
Compare
5ee762a
to
b7c0521
Compare
e77c8ff
to
b7c0521
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| @@ -33,6 +33,13 @@ | |||
| validator=ibis.config.is_bool, | |||
| ) | |||
| ibis.config.register_option('default_backend', None) | |||
| ibis.config.register_option( | |||
| 'time_col', | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be a follow up. We should properly namespace this to be
context_adjustment.time_col
|
thanks @LeeTZ very nice! |
Overview
This PR add context adjustment for udfs for pandas backend.
Previously, window context adjustment is tested with built-in aggregations. Context adjustment will fail when user put udf in window aggregations, for example:
resultwill look like:Extra rows are added since the result Series of udf is not trimmed by timecontext.
We did implement rules to trim result Series in window execution. However it is based on a fact that the Series contains a time col, so we could extract time info and trim according to time. The result Series of udf doesn't include a time index in current implementation. This PR adds 'time' col (if present) as index for the result of udf execution. To match the result Series of built-in aggregations.
How is this tested
Test is added in
ibis/backends/pandas/tests/execution/test_timecontext.pyfor udf use cases, including unit test for the core methodconstruct_time_context_aware_series, and integration test for udfs on different windows (when there is no groupby nor orderby, both groupby and orderby , only orderby etc.)