ENH: Support multi args window UDF for pandas backend #2035

Merged
merged 16 commits into from Dec 10, 2019

@icexelloss
Collaborator

icexelloss commented Nov 19, 2019

This PR addresses issue #1998

Currently, when using a UDAF with a rolling window, the pandas backend throws an exception:

import ibis
import pandas as pd
import numpy as np

from ibis.pandas.udf import udf
import ibis.expr.datatypes as dt

client = ibis.pandas.connect({'table': pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'key': ['a', 'a', 'a']})})
t = client.table('table')
w = ibis.trailing_window(preceding=1, order_by='key', group_by='key')
#w = ibis.window(group_by='key')


@udf.reduction(input_type=[dt.double, dt.double], output_type=dt.double)
def my_average(v, w):
    return np.average(v, weights=w)

t = t.mutate(new_col=my_average(t.a, t.b).over(w))

t.execute()

With this PR, the pandas backend supports this use case.

The main idea of this PR is that instead of using groupby().rolling().apply(func) to compute the result, we use groupby().rolling().apply(len, raw=True) to get the size of each window, and then manually apply func to each window in a Python for-loop. This works around the limitation that groupby().rolling().apply(func) only accepts a function that operates on a single series.
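
As a rough illustration of the approach described above, here is a minimal pandas-only sketch. It is not the code from this PR: `rolling_apply_multi` is a hypothetical helper, it only handles ungrouped, fixed row-count trailing windows, and it assumes all inputs are numeric and share the same index.

```python
import numpy as np
import pandas as pd


def rolling_apply_multi(func, window, *columns):
    """Apply func(*arrays) over trailing windows of several aligned Series.

    Hypothetical helper for illustration only.
    """
    first = columns[0]
    # Let pandas compute the size of each window. Rows whose window is
    # not yet full come back as NaN (min_periods defaults to `window`).
    sizes = first.rolling(window).apply(len, raw=True).to_numpy()
    arrays = [c.to_numpy() for c in columns]

    out = np.full(len(first), np.nan)
    for i, size in enumerate(sizes):
        if np.isnan(size):
            continue  # incomplete window: leave the result as NaN
        start = i + 1 - int(size)
        # Slice every input with the same bounds and call the UDF once.
        out[i] = func(*(a[start:i + 1] for a in arrays))
    return pd.Series(out, index=first.index)


a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([4.0, 5.0, 6.0])
weighted = rolling_apply_multi(
    lambda v, w: np.average(v, weights=w), 2, a, b
)
```

The Python-level loop is what makes multi-argument functions possible here, at the cost of one function call per window.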

Benchmarks

Before the change:

In [4]: %time pandas.time_low_card_window_analytics_udf()                                                                                                     
CPU times: user 898 ms, sys: 248 ms, total: 1.15 s
Wall time: 1.15 s

In [5]: %time pandas.time_high_card_window_analytics_udf()                                                                                                    
CPU times: user 12.4 s, sys: 324 ms, total: 12.7 s
Wall time: 12.8 s

In [6]: %time pandas.time_low_card_grouped_rolling()                                                                                                          
CPU times: user 7.99 s, sys: 1.95 s, total: 9.94 s
Wall time: 10.1 s

In [7]: %time pandas.time_high_card_grouped_rolling()                                                                                                         
CPU times: user 15.3 s, sys: 1.83 s, total: 17.1 s
Wall time: 17.3 s

In [8]: %time pandas.time_low_card_grouped_rolling_udf()                                                                                                      
CPU times: user 2min 1s, sys: 1.96 s, total: 2min 3s
Wall time: 2min 4s

In [9]: %time pandas.time_high_card_grouped_rolling_udf()                                                                                                     
CPU times: user 1min 41s, sys: 2.17 s, total: 1min 43s
Wall time: 1min 43s

After the change:

%time pandas.time_low_card_window_analytics_udf()
CPU times: user 856 ms, sys: 235 ms, total: 1.09 s
Wall time: 1.08 s

%time pandas.time_high_card_window_analytics_udf()
CPU times: user 12.7 s, sys: 316 ms, total: 13 s
Wall time: 13.1 s

%time pandas.time_low_card_grouped_rolling()
CPU times: user 9.3 s, sys: 3.15 s, total: 12.4 s
Wall time: 13 s

%time pandas.time_high_card_grouped_rolling()
CPU times: user 15.5 s, sys: 1.9 s, total: 17.4 s
Wall time: 17.5 s

%time pandas.time_low_card_grouped_rolling_udf()
CPU times: user 1min 26s, sys: 1.74 s, total: 1min 27s
Wall time: 1min 28s

%time pandas.time_high_card_grouped_rolling_udf()
CPU times: user 1min 4s, sys: 1.79 s, total: 1min 6s
Wall time: 1min 6s

@icexelloss

Collaborator Author

icexelloss commented Nov 19, 2019

with warnings.catch_warnings():
warnings.filterwarnings(
"ignore", message=".+raw=True.+", category=FutureWarning
# get the DataFrame from which the operand originated

@icexelloss

icexelloss Nov 19, 2019

Author Collaborator

This is the original logic for built-in aggregation functions.

def test_udaf_window_multi_params():
@udf.reduction(['double', 'double'], 'double')
def my_wm(v, w):
print("v")

@icexelloss

icexelloss Nov 19, 2019

Author Collaborator

Remove these

@@ -530,25 +531,33 @@ def execute_udaf_node_groupby(op, *args, **kwargs):
#
# If the argument is not a SeriesGroupBy then keep
# repeating it until all groups are exhausted.

@icexelloss

icexelloss Nov 19, 2019

Author Collaborator

Remove white space

@@ -361,6 +364,7 @@ def agg(self, grouped_data, function, *args, **kwargs):
else:
# do mostly the same thing as if we did NOT have a grouping key,
# but don't call the callable just yet. See below where we call it.

@icexelloss

icexelloss Nov 19, 2019

Author Collaborator

Remove white space

@@ -52,15 +52,8 @@ def test_array_collect(t, df):
tm.assert_frame_equal(result, expected)


@pytest.mark.xfail(

@icexelloss

icexelloss Nov 19, 2019

Author Collaborator

This is now supported as well

Contributor

jreback left a comment

some comments

import pandas as pd
from pandas import Series

@jreback

jreback Nov 22, 2019

Contributor

prob can just use pd.Series

@icexelloss

icexelloss Nov 25, 2019

Author Collaborator

Removed

# create a generator for each input series
# the generator will yield a slice of the
# input series for each valid window
data = getattr(grouped_series, 'obj', grouped_series).values

@jreback

jreback Nov 22, 2019

Contributor

you shouldn’t use .values here, as that coerces to an ndarray

rather, leave it as a Series and use .iloc

@icexelloss

icexelloss Nov 22, 2019

Author Collaborator

I chose ndarray here because:

  • This preserves the same UDF API as before.
  • Passing a Series is about 15x slower than passing an ndarray.

I don't want to introduce an API change and a performance regression in this PR. I think we can have a separate discussion about whether window UDFs should take an ndarray or a Series.

@jreback

jreback Nov 22, 2019

Contributor

you shouldn’t see any perf issues; something odd must be going on

does it currently take ndarray?

@icexelloss

icexelloss Nov 22, 2019

Author Collaborator

Yes, it currently takes an ndarray (by using raw=True, I think)

@jreback

jreback Nov 27, 2019

Contributor

ok can you create an issue to fix this, meaning to use iloc. .values is not type preserving and generally a bad idea.

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

Created: #2045
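
To make the ".values is not type preserving" point in this thread concrete, here is a small pandas-only sketch. It is not taken from the PR. The timezone-aware column is just one convenient example of a dtype that NumPy cannot represent directly.

```python
import numpy as np
import pandas as pd

# A timezone-aware datetime Series (illustrative data only).
s = pd.Series(pd.date_range("2019-11-22", periods=3, tz="US/Eastern"))

# .values hands back a plain NumPy datetime64 array: the timezone is gone.
raw = s.values

# .iloc keeps the slice as a Series with its tz-aware dtype intact.
window = s.iloc[0:2]

print(type(raw).__name__, raw.dtype)
print(type(window).__name__, window.dtype)
```

For plain float columns the two forms carry the same information, which is why the ndarray route is tempting when speed matters.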

inputs = args if len(args) > 0 else [grouped_data]

input_gens = list(
create_input_gen(arg, window_size)

@jreback

jreback Nov 22, 2019

Contributor

what are you trying to do here?

@icexelloss

icexelloss Nov 22, 2019

Author Collaborator

Here I am creating a generator for each input so that later I can just call next(input). This hides the details of how next is implemented and unifies how we send data inputs and arg inputs to the user function.
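
A minimal sketch of the generator idea described in this reply, assuming trailing windows and a known per-row window size. `window_slices` is a hypothetical name and simplifies whatever `create_input_gen` actually does in the PR.

```python
import numpy as np


def window_slices(values, window_sizes):
    """Yield the trailing slice of `values` for each row.

    `window_sizes[i]` is the number of rows in the window ending at row i.
    """
    for end, size in enumerate(window_sizes, start=1):
        yield values[end - size:end]


v = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([4.0, 3.0, 2.0, 1.0])
sizes = [1, 2, 2, 2]  # illustrative window sizes

# One generator per UDF argument; the caller only ever calls next().
inputs = [window_slices(v, sizes), window_slices(w, sizes)]


def weighted_mean(values, weights):
    return np.average(values, weights=weights)


results = [weighted_mean(*(next(gen) for gen in inputs)) for _ in sizes]
```

Because every argument is wrapped the same way, the loop that calls the user function does not need to know how each slice is produced.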

result[mask] = valid_result
result.index = obj.index
else:
with warnings.catch_warnings():

@jreback

jreback Nov 22, 2019

Contributor

i would move these out to 2 module level functions rather than nesting like this

@icexelloss

icexelloss Nov 25, 2019

Author Collaborator

Actually this is not needed anymore because we don't use raw=True - it was used for UDF and now UDF is handled separately.

@@ -271,6 +272,87 @@ def my_mean(series):
tm.assert_frame_equal(result, expected)


def test_udaf_window_interval():
@udf.reduction(['double'], 'double')

@jreback

jreback Nov 22, 2019

Contributor

isn’t this defined above?

@icexelloss

icexelloss Nov 25, 2019

Author Collaborator

Removed

# the custom rolling logic.
result = aggcontext.agg(args[0], func, *args, **kwargs)
else:
iters = (

@jreback

jreback Nov 22, 2019

Contributor

i would move this to a module level function

@icexelloss

icexelloss Nov 25, 2019

Author Collaborator

Done

@icexelloss icexelloss force-pushed the icexelloss:pandas-backend-multi-args-udf branch 2 times, most recently from 700a22a to ee64a17 Nov 25, 2019
@jreback jreback added the enhancement label Nov 27, 2019
@jreback jreback added this to the Next Feature Release milestone Nov 27, 2019
raw_window_size = windowed.apply(len, raw=True).reset_index(
drop=True
)
mask = ~(raw_window_size.isna())

@jreback

jreback Nov 27, 2019

Contributor

pls make this a separate function, this is too hard to grok inline here.

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

Done

@@ -271,6 +274,88 @@ def my_mean(series):
tm.assert_frame_equal(result, expected)


def test_udaf_window_interval():

@jreback

jreback Nov 27, 2019

Contributor

are all of the window udf tests here or in test_window?

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

all window udf tests are here

@@ -100,6 +101,29 @@ def arguments_from_signature(signature, *args, **kwargs):
return args, new_kwargs


def create_gens_from_args_groupby(args):

@jreback

jreback Nov 27, 2019

Contributor

can you type args

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

Added

""" Create generators for each args for groupby udaf.
If the arg is SeriesGroupBy, return a generator that outputs each group.
If the arg is not SeriesGroupBy, return a generator that repeats the arg.

@jreback

jreback Nov 27, 2019

Contributor

what else could this be? can you type it

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

Improved docstring
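
The behaviour described in the docstring under review (one generator per argument, yielding groups for a SeriesGroupBy and repeating anything else) could look roughly like this. This is a sketch, not the PR's `create_gens_from_args_groupby`: the name `gens_from_args` and the type hints are assumptions made for illustration.

```python
import itertools
from typing import Any, Iterator, List, Sequence

import pandas as pd
from pandas.core.groupby import SeriesGroupBy


def gens_from_args(args: Sequence[Any]) -> List[Iterator[Any]]:
    """Return one iterator per argument.

    SeriesGroupBy arguments yield one Series per group; any other
    argument (for example a scalar) is repeated indefinitely.
    """
    return [
        (group for _, group in arg)
        if isinstance(arg, SeriesGroupBy)
        else itertools.repeat(arg)
        for arg in args
    ]


df = pd.DataFrame({"key": ["a", "a", "b"], "v": [1, 2, 3]})
grouped_gen, scalar_gen = gens_from_args([df.groupby("key")["v"], 10])

first_group = next(grouped_gen)    # rows with key == "a"
first_scalar = next(scalar_gen)    # 10
second_group = next(grouped_gen)   # rows with key == "b"
second_scalar = next(scalar_gen)   # 10 again
```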

)

valid_result = pd.Series(valid_result)
valid_result.index = window_size.index

@jreback

jreback Nov 27, 2019

Contributor

are you testing the output indexes are correct?

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

yes there are tests (test_udaf_window_interval) that cover out of order indices and make sure the output is correct


valid_result = pd.Series(valid_result)
valid_result.index = window_size.index
result = pd.Series(np.repeat(None, len(obj)))

@jreback

jreback Nov 27, 2019

Contributor

this is really strange to do, what are you trying?

result = pd.Series(np.repeat(None, len(obj)))
result[mask] = valid_result
result.index = obj.index
else:
result = method(windowed)

@jreback

jreback Nov 27, 2019

Contributor

can you add some comments on what this case is (again, it would be much better to split these out into free functions)

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

I have separated this into separate functions
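
For readers puzzling over the `result[mask] = valid_result` lines quoted in this thread: the snippet appears to scatter the results of the valid windows back into a full-length result, leaving the other rows empty. The sketch below shows that idea with a float NaN placeholder, which is a swapped-in simplification rather than the `np.repeat(None, ...)` object array from the diff. `scatter_valid` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def scatter_valid(valid_values, mask, index):
    """Place `valid_values` at the True positions of `mask`.

    Positions where `mask` is False stay NaN. The result is labelled
    with `index`, which may be in any order.
    """
    mask = np.asarray(mask, dtype=bool)
    out = np.full(len(mask), np.nan)
    out[mask] = np.asarray(valid_values, dtype=float)
    return pd.Series(out, index=index)


# Row 0 had an incomplete window; rows 1 and 2 produced results.
scattered = scatter_valid([1.5, 2.5], [False, True, True], index=[10, 20, 30])
```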

@jreback

Contributor

jreback commented Dec 3, 2019

if you'd rebase can look again

@icexelloss icexelloss force-pushed the icexelloss:pandas-backend-multi-args-udf branch from cccfa5c to de81e5a Dec 3, 2019
@icexelloss

Collaborator Author

icexelloss commented Dec 4, 2019

@jreback Haven't addressed all comments. Will ping again when it's done.

@icexelloss icexelloss changed the title from "ENH: Support multi param window UDF for pandas backend" to "ENH: Support multi arg window UDF for pandas backend" Dec 4, 2019
@icexelloss icexelloss changed the title from "ENH: Support multi arg window UDF for pandas backend" to "ENH: Support multi args window UDF for pandas backend" Dec 4, 2019
@@ -326,6 +327,83 @@ def __init__(self, kind, *args, **kwargs):
)
self.construct_window = operator.methodcaller(kind, *args, **kwargs)

def _agg_built_in(self, frame, windowed, function, *args, **kwargs):

@jreback

jreback Dec 4, 2019

Contributor

I would make both of these module level functions (don't pass self; I think you can just pass max_lookback as a keyword arg).

@icexelloss

icexelloss Dec 4, 2019

Author Collaborator

I feel that these methods are very particular to the Window aggregation context and therefore probably belong in the class. I'm curious why module level functions are better here?

@icexelloss

icexelloss Dec 5, 2019

Author Collaborator

Discussed offline. Moved to module level.

@icexelloss icexelloss added the pandas label Dec 4, 2019
Contributor

jreback left a comment

looks good. can you add a release note (likely point to a new doc section) & a doc section (can be a followup PR).

@@ -313,6 +314,92 @@ def compute_window_spec_interval(_, expr):
return pd.tseries.frequencies.to_offset(value)


def _window_agg_built_in(
frame, windowed, function, max_lookback, *args, **kwargs

@jreback

jreback Dec 6, 2019

Contributor

ideally if you can type these arguments

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Done



def _window_agg_udf(
grouped_data, windowed, function, dtype, max_lookback, *args, **kwargs

@jreback

jreback Dec 6, 2019

Contributor

can you type this

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Done

):
"""Apply window aggregation with UDFs.
"""
# Use custom logic to computing rolling window UDF instead of

@jreback

jreback Dec 6, 2019

Contributor

include this in the doc-string under Notes

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Done

# This is because pandas's rolling function doesn't support
# multi param UDFs.

def create_input_gen(grouped_series, window_size):

@jreback

jreback Dec 6, 2019

Contributor

can you type

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Done

obj = getattr(grouped_data, 'obj', grouped_data)
name = obj.name
if frame[name] is not obj:
name = "{}_{}".format(name, ibis.util.guid())

@jreback

jreback Dec 6, 2019

Contributor

can use an fstring here

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Done

# TODO: see if we can do this in the caller, when the context
# is constructed rather than pulling out the data
columns = group_by + order_by + [name]
indexed_by_ordering = frame.loc[:, columns].set_index(order_by)

@jreback

jreback Dec 6, 2019

Contributor

use frame[columns].set_index(...)

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Done



def test_udaf_window_multi_args():
@udf.reduction(['double', 'double'], 'double')

@jreback

jreback Dec 6, 2019

Contributor

can you add the issue number as a comment (or this PR number)

@icexelloss

icexelloss Dec 9, 2019

Author Collaborator

Added

name = obj.name
if frame[name] is not obj:
name = f"{name}_{ibis.util.guid()}"
frame[name] = obj

@jreback

jreback Dec 9, 2019

Contributor

do this like

frame = frame.assign(name=obj)

to avoid mutating the input

@jreback

jreback Dec 9, 2019

Contributor

I see you are just copying original code (but this should change anyhow)

@icexelloss

icexelloss Dec 10, 2019

Author Collaborator

Done
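
One detail worth spelling out for the `assign` suggestion in this thread: because the column name is held in a variable, `frame.assign(name=obj)` would create a column literally called "name". A small sketch of the non-mutating form with a dynamic name follows. The data and the `a_tmp` column name are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3]})
obj = pd.Series([10, 20, 30])
name = "a_tmp"  # hypothetical generated column name

# Unpack a dict so the *value* of `name` becomes the column label.
# assign returns a new DataFrame and leaves `frame` untouched.
new_frame = frame.assign(**{name: obj})

print(list(frame.columns))      # the input is unchanged
print(list(new_frame.columns))  # the copy has the extra column
```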

}
)
con = ibis.pandas.connect({'df': df})
t = con.table('df')
window = ibis.trailing_window(2, order_by='a', group_by='key')
expr = t.mutate(rolled=my_mean(t.b).over(window))
expr = t.mutate(
wm_b=my_wm(t.b, t.d).over(window), wm_c=my_wm(t.c, t.d).over(window)

@jreback

jreback Dec 9, 2019

Contributor

I assume that this will work if we use different windows? can you add a test for that, or if it doesn't work can you test the exception.

@icexelloss

icexelloss Dec 10, 2019

Author Collaborator

Added

@@ -225,49 +250,107 @@ def my_corr2(lhs, **kwargs):
pass


-def test_compose_udfs():
+def test_compose_udfs(t2, df2):

@jreback

jreback Dec 9, 2019

Contributor

what happens if the udaf raises an exception? are these caught in a reasonable way? just asking for a test (not even sure what this should do)

@icexelloss

icexelloss Dec 10, 2019

Author Collaborator

Added test_udf_error

Contributor

jreback left a comment

lgtm. minor comment & can you add a note in release.rst? ping on green.

**kwargs,
)
try:
return result.astype(self.dtype, copy=False)

@jreback

jreback Dec 10, 2019

Contributor

may want to add a comment on when this can fail
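
One case where such a cast fails in plain pandas is a float result containing NaN (for example from incomplete windows) being cast to an integer dtype. The sketch below only illustrates that failure mode. `cast_or_keep` and the exception types it catches are assumptions, not the handling used in the PR.

```python
import numpy as np
import pandas as pd


def cast_or_keep(result, dtype):
    """Try to cast `result` to `dtype`; fall back to the original on failure.

    Hypothetical helper: the real code may handle the failure differently.
    """
    try:
        return result.astype(dtype)
    except (TypeError, ValueError):
        # e.g. NaN cannot be represented in an int64 column
        return result


with_nan = cast_or_keep(pd.Series([1.0, np.nan]), "int64")
without_nan = cast_or_keep(pd.Series([1.0, 2.0]), "int64")
```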

@icexelloss icexelloss force-pushed the icexelloss:pandas-backend-multi-args-udf branch from 7fb3065 to e652fb7 Dec 10, 2019
@jreback jreback merged commit a95e32f into ibis-project:master Dec 10, 2019
8 checks passed
ibis-project.ibis #20191210.3 succeeded
ibis-project.ibis (LinuxBenchmark) LinuxBenchmark succeeded
ibis-project.ibis (LinuxBuildConda) LinuxBuildConda succeeded
ibis-project.ibis (LinuxBuildDocs) LinuxBuildDocs succeeded
ibis-project.ibis (LinuxTest py36) LinuxTest py36 succeeded
ibis-project.ibis (LinuxTest py37) LinuxTest py37 succeeded
ibis-project.ibis (WindowsTest py36) WindowsTest py36 succeeded
ibis-project.ibis (WindowsTest py37) WindowsTest py37 succeeded
@jreback

Contributor

jreback commented Dec 10, 2019

thanks @icexelloss very nice patch!
