BUG: Fix Summarize not always producing a scalar value (Pandas backend) #2410

Merged: 7 commits, Oct 5, 2020

@timothydijamco timothydijamco commented Sep 29, 2020

This PR fixes a bug in the Summarize aggregation context class in the Pandas backend.

Problem

Currently, Summarize.agg relies on pandas.Series.agg to produce its result, with very little logic of its own.

However, pandas.Series.agg has a quirk: when it is passed a single function, it first behaves exactly like pandas.Series.apply (in fact, it calls pandas.Series.apply) and produces a Series result. Only if an exception is caught while doing this does it treat the function as an aggregation function and produce a scalar result.

Because of this quirk, Summarize.agg can produce unexpected results (i.e., it sometimes produces a Series rather than a scalar value), depending only on whether the function passed to Summarize.agg happens not to raise an error when passed to pandas.Series.apply.

Solution

Before this PR, Summarize.agg already wrapped the function passed to it. This PR adds logic to that wrapper to detect when the function is being called via pandas.Series.apply and to raise a TypeError in that case, which forces pandas.Series.agg to stop behaving like pandas.Series.apply.
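
A minimal sketch of such a wrapper, assuming the detection is a plain isinstance check on the incoming data. The name wrap_for_agg_sketch and the error message are illustrative and are not the code in this PR:

```python
from typing import Any, Callable, Dict, Optional, Tuple

import pandas as pd


def wrap_for_agg_sketch(
    function: Callable,
    args: Tuple[Any, ...] = (),
    kwargs: Optional[Dict[str, Any]] = None,
) -> Callable:
    """Hypothetical wrapper: only run `function` on a whole Series."""
    assert callable(function), f'function {function} is not callable'
    kwargs = kwargs or {}

    def wrapped(data: Any) -> Any:
        if isinstance(data, pd.Series):
            # Called once with the whole Series: the aggregation case.
            return function(data, *args, **kwargs)
        # Called with a single element, i.e. from an element-wise `apply`.
        # Raising here is what pushes `agg` onto its aggregation path.
        raise TypeError('expected a pandas Series, got a single element')

    return wrapped


df = pd.DataFrame(
    {'v1': [1.0, 2.0, 3.0, 4.0], 'v2': [10.0, 20.0, 30.0, 40.0]}
)


def some_udf(v1, v2):
    # Ignores v1 and reduces v2, like the UDF in the example below.
    return v2.mean()


# The second column is bound as an extra positional argument.
result = df['v1'].agg(wrap_for_agg_sketch(some_udf, args=(df['v2'],)))
```

With the wrapper in place, `result` should be a single scalar whether or not the installed pandas version tries the element-wise path first.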

Example

Code

import pandas as pd
import ibis
from ibis.pandas.aggcontext import Summarize

df = pd.DataFrame(
    {
        'id': [1, 2, 1, 2],
        'v1': [1.0, 2.0, 3.0, 4.0],
        'v2': [10.0, 20.0, 30.0, 40.0],
    }
)

aggcontext = Summarize()

# Note that this function takes two columns, but only does a reduction operation on the second column!
# This means that this function can technically be used with Pandas `apply` with no issues.
def some_udf(v1, v2):
    return v2.mean()

args = [df['v1'], df['v2']]

aggcontext.agg(args[0], some_udf, *args[1:])

Output

Before

0    25.0
1    25.0
2    25.0
3    25.0
Name: v1, dtype: float64

After

25.0

Testing

Created a few tests for Summarize in ibis/pandas/tests/test_aggcontext.py:

  • test_summarize_single_series
  • test_summarize_single_seriesgroupby
  • test_summarize_multiple_series <-- one parametrization of this test fails before this PR

@timothydijamco

Here is a rundown of the pandas.Series.agg logic to explain the "quirk":

  1. If the "function" passed to agg is a dict, str, or list (or similar), call _aggregate and return that result. (I don't think we ever hit this case in Ibis Summarize) (L4013)
  2. Otherwise, the "function" is just a single function. Try calling apply, and return that result if it succeeds. (L4030)
  3. If an exception is caught, call the function as-is and return the result. (L4032)

In Ibis we use pandas.Series.agg in a somewhat unusual way to implement aggregations across multiple columns, which is why this implementation of pandas.Series.agg sometimes leads to unexpected results in Ibis.
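
The three steps above can be modelled without pandas. The sketch below is a toy stand-in, not the pandas implementation: toy_agg, mean, and constant are hypothetical names, a plain list stands in for a Series, and catching only TypeError is a simplification.

```python
from typing import Any, List


def toy_agg(values: List[float], func: Any) -> Any:
    """Toy model of the dispatch described above (not pandas code)."""
    # Step 1: dict, str, or list specs take a separate path (not modelled).
    if isinstance(func, (dict, str, list)):
        raise NotImplementedError('non-callable specs are out of scope here')
    # Step 2: try element-wise application first, as `apply` would.
    try:
        return [func(v) for v in values]
    except TypeError:
        # Step 3: fall back to calling the function once on the whole input.
        return func(values)


def mean(xs):
    # Fails on a single float, because `sum` needs an iterable.
    return sum(xs) / len(xs)


def constant(_):
    # Accepts anything, so the element-wise attempt never fails.
    return 25.0


data = [1.0, 2.0, 3.0, 4.0]
reduced = toy_agg(data, mean)        # falls back to a single reduction
broadcast = toy_agg(data, constant)  # one result per element
```

A function that fails on a single element ends up reducing the whole input, while one that happens to accept a single element is applied per element. The second case is the one that produced a Series instead of a scalar.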

@jreback added the pandas and window functions labels Sep 30, 2020
@jreback added this to the Next Bugfix Release milestone Oct 1, 2020
@icexelloss icexelloss left a comment

LGTM. +1

@@ -252,19 +252,52 @@ def agg(self, grouped_data, function, *args, **kwargs):
pass


-def make_applied_function(function, args=None, kwargs=None):
+def wrap_for_apply(function, args, kwargs):

can you type these as much as possible (same for wrap_for_agg)

2) Treat the function as a N->1 aggregation function (i.e. calls the
function once on the entire Series)
Pandas `agg` will use behavior #1 unless an error is raised when doing so.

can you add a Parameters section

raises a TypeError otherwise. When Pandas `agg` is attempting to use
behavior #1 but sees the TypeError, it will fall back to behavior #2.
"""
assert callable(function), 'function {} is not callable'.format(function)

use f-strings

@jreback jreback merged commit 07826d3 into ibis-project:master Oct 5, 2020