Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: aggregations were getting overwritten if they had the same name #30858

Merged
merged 34 commits into from
Jul 14, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Jan 9, 2020

xref #30092

@MarcoGorelli MarcoGorelli changed the title 🐛 aggregations were getting overwritten if they had the same name [BUG] aggregations were getting overwritten if they had the same name Jan 9, 2020
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added the Groupby label Jan 9, 2020
@charlesdong1991
Copy link
Member

nice, thanks for the PR @MarcoGorelli

I think this PR deserves a new issue other than #30092 , so would suggest to xref it.

@pep8speaks
Copy link

pep8speaks commented Jan 10, 2020

Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-14 18:13:33 UTC

@MarcoGorelli MarcoGorelli force-pushed the multiple-aggregations branch 2 times, most recently from 83398f8 to f84483b Compare January 10, 2020 17:11
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job looks pretty good

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
@MarcoGorelli MarcoGorelli force-pushed the multiple-aggregations branch 3 times, most recently from 42e9571 to 33c57a2 Compare January 21, 2020 11:29
@MarcoGorelli
Copy link
Member Author

@jreback @WillAyd thanks for your reviews, have updated accordingly

@MarcoGorelli MarcoGorelli changed the title [BUG] aggregations were getting overwritten if they had the same name BUG: aggregations were getting overwritten if they had the same name Jan 24, 2020
pandas/core/groupby/generic.py Show resolved Hide resolved
doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
@MarcoGorelli MarcoGorelli force-pushed the multiple-aggregations branch 2 times, most recently from 44e562c to 521bc1d Compare February 3, 2020 11:51
@MarcoGorelli
Copy link
Member Author

Hi @WillAyd - sorry to chase you up, just wanted to ask if there's anything else that needs doing here or if it's alright (or indeed if it's the wrong fix altogether :) )

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved

return DataFrame(results, columns=columns)
return {key.label: value for key, value in results.items()}
return DataFrame(self._wrap_aggregated_output(results), columns=columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the DataFrame constructor still required here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test

pytest pandas/tests/groupby/aggregate/test_aggregate.py::test_aggregate_item_by_item

when we get here we have

(Pdb) results
{OutputKey(label='<lambda>', position=0): A
bar    3
foo    5
Name: B, dtype: int64}
(Pdb) self._wrap_aggregated_output(results)
A
bar    3
foo    5
Name: <lambda>, dtype: int64
(Pdb) type(self._wrap_aggregated_output(results))
<class 'pandas.core.series.Series'>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd have updated with a call to .to_frame (if necessary)

@MarcoGorelli MarcoGorelli mentioned this pull request May 26, 2020
@MarcoGorelli
Copy link
Member Author

Before this is merged, I should add a test case using pd.NamedAgg, xref #34380

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented May 27, 2020

Before this is merged, I should add a test case using pd.NamedAgg, xref #34380

@jreback, have added a test which uses pd.NamedAgg, as in #34380, it's green now

EDIT: this is no longer allowed as of #34435, so have removed that extra test

@MarcoGorelli
Copy link
Member Author

friendly ping :)


if any(isinstance(x, DataFrame) for x in results.values()):
# let higher level handle
return results

return self.obj._constructor_expanddim(results, columns=columns)
if not results:
return DataFrame()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm is this correct? do we have tests that hit this. I would think we would have somthing e.g. columns even if this is empty

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also why is this not just handled in wrap_aggregated_output?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a test that hits it: pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises

When trying to move this to wrap_aggregated_output I ran into #34977, so I'll try to address that first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still is quite fishy . if you pass en empty result to self._wrap_aggregated_output what do you get as output? I really don't like special cases like this which inevitably hide errors and make groking code way more complex.

So prefer to have _wrap_aggregated_output handle this correctly. you may not even need L333, its possible to pass columns to _wrap_aggregated_output

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback if we pass {} to _wrap_aggregated_output we get a KeyError.

Here's the traceback:

============================= test session starts ==============================
platform linux -- Python 3.8.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
rootdir: /home/marco/pandas-dev, inifile: setup.cfg
plugins: xdist-1.32.0, cov-2.10.0, asyncio-0.12.0, hypothesis-5.16.1, forked-1.1.2
collected 1 item

pandas/tests/groupby/aggregate/test_aggregate.py F                       [100%]

=================================== FAILURES ===================================
________________ TestNamedAggregationSeries.test_no_args_raises ________________

self = <pandas.tests.groupby.aggregate.test_aggregate.TestNamedAggregationSeries object at 0x7f8835d975b0>

    def test_no_args_raises(self):
        gr = pd.Series([1, 2]).groupby([0, 1])
        with pytest.raises(TypeError, match="Must provide"):
            gr.agg()
    
        # but we do allow this
>       result = gr.agg([])

pandas/tests/groupby/aggregate/test_aggregate.py:555: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/groupby/generic.py:247: in aggregate
    ret = self._aggregate_multiple_funcs(func)
pandas/core/groupby/generic.py:328: in _aggregate_multiple_funcs
    output = self._wrap_aggregated_output(results)
pandas/core/groupby/generic.py:387: in _wrap_aggregated_output
    result = self._wrap_series_output(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f8835d97ac0>
output = {}, index = Int64Index([0, 1], dtype='int64')

    def _wrap_series_output(
        self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]], index: Index,
    ) -> Union[Series, DataFrame]:
        """
        Wraps the output of a SeriesGroupBy operation into the expected result.
    
        Parameters
        ----------
        output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
            Data to wrap.
        index : pd.Index
            Index to apply to the output.
    
        Returns
        -------
        Series or DataFrame
    
        Notes
        -----
        In the vast majority of cases output and columns will only contain one
        element. The exception is operations that expand dimensions, like ohlc.
        """
        indexed_output = {key.position: val for key, val in output.items()}
        columns = Index(key.label for key in output)
    
        result: Union[Series, DataFrame]
        if len(output) > 1:
            result = self.obj._constructor_expanddim(indexed_output, index=index)
            result.columns = columns
        else:
            result = self.obj._constructor(
>               indexed_output[0], index=index, name=columns[0]
            )
E           KeyError: 0

pandas/core/groupby/generic.py:362: KeyError
-------------- generated xml file: /tmp/tmp-31663hvopHHCRFEu.xml ---------------
=========================== short test summary info ============================
FAILED pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises
============================== 1 failed in 0.22s ===============================

The problem is this line which access [0] on an empty object

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would just fix this, need to check if len(indexed_output)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So,

elif len(indexed_output):
    result = self.obj._constructor(
        indexed_output[0], index=index, name=columns[0]
    )
else:
    result = self.obj._constructor()

?

I can do that, but then I'll still have to address #34977 when the output of _wrap_aggregated_output is passed to self.obj._constructor_expanddim(results, columns=columns).

its possible to pass columns to _wrap_aggregated_output

Are you sure? It seems to only take on argument (other than self)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback would

elif len(indexed_output):
    result = self.obj._constructor(
        indexed_output[0], index=index, name=columns[0]
    )
else:
    return None

be an acceptable solution?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually,

        elif not columns.empty:
            result = self.obj._constructor(
                indexed_output[0], index=index, name=columns[0]
            )
        else:
            result = self.obj._constructor_expanddim()

works, because

pd.DataFrame(pd.DataFrame(), columns=[])

is allowed.

No need to modify the return types like this :)


if any(isinstance(x, DataFrame) for x in results.values()):
# let higher level handle
return results

return self.obj._constructor_expanddim(results, columns=columns)
if not results:
return DataFrame()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still is quite fishy . if you pass en empty result to self._wrap_aggregated_output what do you get as output? I really don't like special cases like this which inevitably hide errors and make groking code way more complex.

So prefer to have _wrap_aggregated_output handle this correctly. you may not even need L333, its possible to pass columns to _wrap_aggregated_output

@jreback
Copy link
Contributor

jreback commented Jul 9, 2020

this needs to go after #34998 which substantially refactors things

@MarcoGorelli
Copy link
Member Author

OK, thanks for letting me know, will try rebasing locally to get this ready

@jreback
Copy link
Contributor

jreback commented Jul 14, 2020

@MarcoGorelli hmm we are going to be refactoring #34998 / delaying it, so let's see if we can get this one in. pls rebase and see if you can address above comments.

@MarcoGorelli
Copy link
Member Author

@jreback have merged master

Regarding above comments, seems the only outstanding issue is handling when we pass an empty result to self._wrap_aggregated_output, which I've handled by returning self.obj._constructor_expanddim(). If that's not the right solution, any hints would be appreciated

@jreback
Copy link
Contributor

jreback commented Jul 14, 2020

@TomAugspurger if you'd have a quick look.

@@ -1101,6 +1101,7 @@ Reshaping
- Bug in :func:`crosstab` when inputs are two Series and have tuple names, the output will keep dummy MultiIndex as columns. (:issue:`18321`)
- :meth:`DataFrame.pivot` can now take lists for ``index`` and ``columns`` arguments (:issue:`21425`)
- Bug in :func:`concat` where the resulting indices are not copied when ``copy=True`` (:issue:`29879`)
- Bug in :meth:`SeriesGroupBy.aggregate` was resulting in aggregations being overwritten when they shared the same name (:issue:`30880`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: the link to this method won't render, since SeriesGroupBy isn't in the pands namespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that - will make sure the build the whatsnew file in the future to check

@TomAugspurger TomAugspurger merged commit b6222ec into pandas-dev:master Jul 14, 2020
@TomAugspurger
Copy link
Contributor

Thanks @MarcoGorelli!

@MarcoGorelli MarcoGorelli deleted the multiple-aggregations branch July 15, 2020 07:12
@MarcoGorelli
Copy link
Member Author

This one took me a while (I opened it in January!) so thanks for bearing with me while I worked on it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple aggregations with the same name get overwritten
6 participants