Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Support context adjustment for udfs for pandas backend #2646

Merged
merged 15 commits into from
Mar 10, 2021

Conversation

LeeTZ
Copy link
Contributor

@LeeTZ LeeTZ commented Feb 24, 2021

Overview

This PR add context adjustment for udfs for pandas backend.
Previously, window context adjustment is tested with built-in aggregations. Context adjustment will fail when user put udf in window aggregations, for example:

@udf.reduction(['double'], 'double')
def my_mean(series):
    return series.mean()

window = ibis.trailing_window(ibis.interval(days=3), order_by='time')
expr = time_table.mutate(v1=my_mean(time_table['value']).over(window))
result = expr.execute(timecontext=context)

result will look like:

                     time   id  value    v1
0 2017-01-05 01:02:03.234  4.0    4.4  1.00
1 2017-01-06 01:02:03.234  5.0    5.5  1.50
2 2017-01-07 01:02:03.234  6.0    6.6  1.65
3 2017-01-08 01:02:03.234  7.0    7.7  2.75
4 2017-01-09 01:02:03.234  8.0    8.8  3.85
5                     NaT  NaN    NaN  4.95
6                     NaT  NaN    NaN  6.05
7                     NaT  NaN    NaN  7.15

Extra rows are added since the result Series of udf is not trimmed by timecontext.
We did implement rules to trim result Series in window execution. However it is based on a fact that the Series contains a time col, so we could extract time info and trim according to time. The result Series of udf doesn't include a time index in current implementation. This PR adds 'time' col (if present) as index for the result of udf execution. To match the result Series of built-in aggregations.

How is this tested

Test is added in ibis/backends/pandas/tests/execution/test_timecontext.py for udf use cases, including unit test for the core methodconstruct_time_context_aware_series, and integration test for udfs on different windows (when there is no groupby nor orderby, both groupby and orderby , only orderby etc.)

@jreback jreback added the expressions Issues or PRs related to the expression API label Feb 24, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a release note as well

ibis/expr/timecontext.py Show resolved Hide resolved
@@ -127,6 +127,40 @@ def canonicalize_context(
return begin, end


def construct_multi_index_series(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a more informative name for this. yes its constructing a MI series, but its doing it for time_context.
maybe

construct_time_context_aware_series

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I changed the name to be construct_time_context_aware_series

def construct_multi_index_series(
series: pd.Series, frame: pd.DataFrame
) -> pd.Series:
""" Construct a pd.MultiIndex of IntIndex and 'time'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why IntIndex?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the IntIndex here presenting the groupby key? What about the case that groupby key is not integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrase this a little bit, it meant to add 'time' into the original index, whatever type it is.

if TIME_COL == frame.index.name:
time_index = frame.index
elif TIME_COL in frame:
time_index = frame.set_index(TIME_COL).index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably question for Jeff: Is this the best way to construct an index from existing column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ,do this
pd.Index(frame[TIME_COL])


Parameters
----------
series: pd.Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what is the series and what is the frame here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as comments above, added description, as well as examples.

@@ -127,6 +127,40 @@ def canonicalize_context(
return begin, end


def construct_multi_index_series(
Copy link
Contributor

@icexelloss icexelloss Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also feels to me what the function is an implementation of the pandas backend and therefore, probably don't belong in ibis/expr/timecontext.py module. Is there a place you can put this in the pandas module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this. This could be a util function in general, also putting this in pandas/execution/timecontext.py seems to cause circular dependency. I could revisit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it have a circular dependency?

@@ -338,6 +341,8 @@ def execute_window_op(
)
series = post_process(result, data, ordering_keys, grouping_keys)

if timecontext:
series = construct_multi_index_series(series, data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the series and data matching in row ordering here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. I think the series is ordered in group_by ordering, and the data is ordered in original ordering (probably ordered by time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. I made some modifications here. Time index should be injected to series prior to groupby. I add logics in aggcontext.py for this, row order is well tracked in agg phase.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

@LeeTZ
Copy link
Contributor Author

LeeTZ commented Feb 25, 2021

Update with a change in window.py and aggcontext.py to address row ordering for the series and parent DataFrame, when groupby is defined.
Also added more tests, including unit test for onstruct_time_context_aware_series, and integration test for udfs when there is no groupby nor orderby, both groupby and orderby , and only orderby
A relase note is also added.

@@ -53,13 +62,18 @@ def _post_process_group_by(
parent: pd.DataFrame,
order_by: List[str],
group_by: List[str],
**kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be timecontext instead of kwargs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be, but timecontext is currently not used in this function so I didn't make it as an explicit param

Copy link
Contributor

@icexelloss icexelloss Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good pattern to specify all param explicitly even some of them are not used. kwargs brings flexibility which I don't think you need here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Explicitness over implicitness)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @icexelloss here, we don't want kwargs at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, I removed kwargs

return series.mean()


def test_context_adjustment_window_udf_nogroupby_noorderby(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these to the backend test and test this with pyspark backend as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

1 2017-01-03 01:02:03.234 2 2.2
2 2017-01-04 01:02:03.234 3 3.3

For a series of:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some index here for better readability ?

  1. For a series:
    ...
    The result series will be

  2. For a series:
    ...
    The result will be
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, combining with Jeff's comment I update this to be runnable code / ipython output

@jreback jreback added this to the Next release milestone Feb 25, 2021
@@ -53,13 +62,18 @@ def _post_process_group_by(
parent: pd.DataFrame,
order_by: List[str],
group_by: List[str],
**kwargs,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with @icexelloss here, we don't want kwargs at all.

Examples:
Assume frame is:
time id value
0 2017-01-02 01:02:03.234 1 1.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you actually programatically construct this, then show it.

also use

Examples
-----------

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW do this in ipython and basically copy it here (or alternatly do it in a python terminal and use >>>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea is that this code can be copy pasted to a terminal for execution directly; so include imports too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point thx

Name: value, dtype: float64

The result series will be:
time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move all examples after Parameters

if TIME_COL == frame.index.name:
time_index = frame.index
elif TIME_COL in frame:
time_index = frame.set_index(TIME_COL).index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ,do this
pd.Index(frame[TIME_COL])

@icexelloss icexelloss added the pandas The pandas backend label Feb 26, 2021
@LeeTZ LeeTZ force-pushed the tiezheng/udf_context_adjustment branch from 2b971fd to 85d05a2 Compare February 26, 2021 19:51
@LeeTZ LeeTZ force-pushed the tiezheng/udf_context_adjustment branch from 85d05a2 to 3dc41d7 Compare March 4, 2021 15:33
@LeeTZ
Copy link
Contributor Author

LeeTZ commented Mar 9, 2021

Move tests for timecontext for udfs to be under ibis/backends/test to test for all backends that support this feature (pandas & pyspark in this case). Tests are parameterized and use alltypes as the data source. Note that in order to make this work, I also move TIME_COL from a constant in timecontext.py to be an ibis config that is configurable on test runs.

@@ -920,6 +920,7 @@ def execute_database_table_client(
df = client.dictionary[op.name]
if timecontext:
begin, end = timecontext
TIME_COL = get_time_col()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks funky, why are we setting a constant here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I should make this not capitalized to avoid misunderstading here. TIME_COL is not a constant anymore, this just means, read from config for the name of time col and assign the value to a variable, for later execution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yeah, upper case is confusing because it looks like you are reassigning value to a constant.

@LeeTZ LeeTZ force-pushed the tiezheng/udf_context_adjustment branch from 0ad5eeb to b7c0521 Compare March 9, 2021 16:19
# indexes are retturned.
if TIME_COL not in df:
# indexes are returned.
TIME_COL = get_time_col()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here (as you describe above)

ibis/backends/tests/test_timecontext.py Show resolved Hide resolved
ibis/backends/tests/test_timecontext.py Show resolved Hide resolved
pd.Series

Examples
-----------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underline should be the same length

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Name: value, dtype: float64
The result is unchanged for a series already has 'time' as its index.
"""
TIME_COL = get_time_col()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above (make this lowercase)

@LeeTZ LeeTZ force-pushed the tiezheng/udf_context_adjustment branch 4 times, most recently from a7cf598 to b7c0521 Compare March 10, 2021 16:40
@LeeTZ LeeTZ force-pushed the tiezheng/udf_context_adjustment branch from 5ee762a to b7c0521 Compare March 10, 2021 17:00
@LeeTZ LeeTZ force-pushed the tiezheng/udf_context_adjustment branch from e77c8ff to b7c0521 Compare March 10, 2021 17:03
Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -33,6 +33,13 @@
validator=ibis.config.is_bool,
)
ibis.config.register_option('default_backend', None)
ibis.config.register_option(
'time_col',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be a follow up. We should properly namespace this to be
context_adjustment.time_col

@jreback jreback merged commit 3e250a8 into ibis-project:master Mar 10, 2021
@jreback
Copy link
Contributor

jreback commented Mar 10, 2021

thanks @LeeTZ very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API pandas The pandas backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants