Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix time context trimming error for multi column udfs in pandas backend #2712

Merged
merged 7 commits into from
Mar 26, 2021

Conversation

LeeTZ
Copy link
Contributor

@LeeTZ LeeTZ commented Mar 26, 2021

Overview

When specifying time context, multi-column udf execution for pandas backend may end up with an exception. This PR aims to fix the issue.

Example

For the following example:

@analytic(
    input_type=[dt.double, dt.double],
    output_type=dt.Struct(['demean', 'demean_weight'], [dt.double, dt.double]),
)
def demean_struct(v, w):
    return v - v.mean(), w - w.mean()

w = ibis.window(preceding=None, following=None)
result = expr.mutate(
    demean_struct(expr['double_col'], expr['int_col'])
    .over(w)
    .destructure()
).execute(timecontext=context)

To compute a multi column udf over a table, the execution may fail with exception:

ibis/backends/pandas/execution/window.py:375: in execute_window_op
    computed = trim_window_result(computed, timecontext)
ibis/backends/pandas/execution/window.py:219: in trim_window_result
    name = data.name if data.name else 0

self =                             demean  demean_weight
   timestamp_col                                 
0  2009-01-05 00:4...5.25            2.5
58 2009-01-10 01:38:04.330   35.35            3.5
59 2009-01-10 01:39:04.410   45.45            4.5
name = 'name'

    def __getattr__(self, name: str):
        """After regular attribute access, try looking up the name
        This allows simpler access to columns for interactive use.
        """
    
        # Note: obj.x will always call obj.__getattribute__('x') prior to
        # calling obj.__getattr__('x').
    
        if (
            name in self._internal_names_set
            or name in self._metadata
            or name in self._accessors
        ):
            return object.__getattribute__(self, name)
        else:
            if self._info_axis._can_hold_identifiers_and_holds_name(name):
                return self[name]
>           return object.__getattribute__(self, name)
E           AttributeError: 'DataFrame' object has no attribute 'name'

The reason of failing is that, in Window execution for pandas backend, we assume the computed result is always a pd.Series. However in such cases of multi-column udf, the result of op AnalyticVectorizedUDF is a pd.DataFrame.

For the fix, we add logic to deal with both pd.Series and pd.DataFrame in trimming the computed result for a pandas WindowOp

Test

Test case to cover this multi-column udf case is added in ibis/backends/tests/test_timecontext.py

@LeeTZ LeeTZ changed the title BUG: Fix time context trimming error fro multi column udfs BUG: Fix time context trimming error for multi column udfs in pandas backend Mar 26, 2021
@jreback jreback added the udf Issues related to user-defined functions label Mar 26, 2021
@jreback jreback added this to the Next release milestone Mar 26, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this PR number to the prior release note for changes in time context


"""
# noop if timecontext is None
if not timecontext:
return data
assert isinstance(data, pd.Series) or isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(isinstance, (pd.Series, pd.DataFrame))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

name = data.name if data.name else 0
index_columns = list(subset.columns.difference([name]))
else:
name = data.columns.tolist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the .tolist() here just pass name directly to .difference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot to remove this, +1

input_type=[dt.double, dt.double],
output_type=dt.Struct(['demean', 'demean_weight'], [dt.double, dt.double]),
)
def demean_struct(v, w):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we import these UDFs instead of duplicating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I will just import from vectorized_udf tests

@@ -361,15 +371,16 @@ def execute_window_op(
clients=clients,
**kwargs,
)
series = post_process(
computed = post_process(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling this window_result is a little better I think

Also fine to reuse result as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will keep using result

@@ -179,26 +179,32 @@ def get_aggcontext_window(
return aggcontext


def trim_window_series(data, timecontext: Optional[TimeContext]):
def trim_window_result(data, timecontext: Optional[TimeContext]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate the type of "data"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be helpful, added

index_columns = list(subset.columns.difference([name]))
else:
name = data.columns
index_columns = list(subset.columns.difference(name))

# set the correct index for return Seires
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series -> Series/DataFrame

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@icexelloss
Copy link
Contributor

Minor comments. Otherwise LGTM.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jreback
Copy link
Contributor

jreback commented Mar 26, 2021

thanks @LeeTZ ping on green.

@jreback jreback merged commit f6910dc into ibis-project:master Mar 26, 2021
@cpcloud cpcloud removed this from the Next release milestone Jan 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pandas The pandas backend udf Issues related to user-defined functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants