BUG: aggregations were getting overwritten if they had the same name #30858

MarcoGorelli · 2020-01-09T18:22:21Z

closes Multiple aggregations with the same name get overwritten #30880
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

xref #30092
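
To make the failure mode in the title concrete, here is a minimal sketch in plain Python rather than pandas. It assumes the results are collected in a dict keyed only by each function's reported name. `callable_name`, `nth_smallest` and `aggregate_by_name` are hypothetical helpers written for this illustration, not pandas functions.

```python
from functools import partial


def callable_name(func):
    # Hypothetical helper: report a display name, unwrapping functools.partial.
    while isinstance(func, partial):
        func = func.func
    return func.__name__


def nth_smallest(values, n):
    # Toy aggregation: the n-th smallest value of a list.
    return sorted(values)[n]


def aggregate_by_name(values, funcs):
    # Keying on the name alone means a later function with the same
    # name silently replaces the earlier result.
    results = {}
    for func in funcs:
        results[callable_name(func)] = func(values)
    return results


# Two different aggregations that happen to share one name.
funcs = [partial(nth_smallest, n=0), partial(nth_smallest, n=2)]
overwritten = aggregate_by_name([5, 1, 3], funcs)
```

Two aggregations go in, but `overwritten` holds a single entry, because the second result replaced the first under the shared key.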

pandas/core/groupby/generic.py

charlesdong1991 · 2020-01-09T20:42:02Z

nice, thanks for the PR @MarcoGorelli

I think this PR deserves a new issue separate from #30092, so I'd suggest creating one and xref-ing it.

pep8speaks · 2020-01-10T09:58:46Z

Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-14 18:13:33 UTC

WillAyd

Nice job, looks pretty good

pandas/core/groupby/generic.py

pandas/tests/groupby/aggregate/test_aggregate.py

pandas/core/groupby/generic.py

doc/source/whatsnew/v1.0.0.rst

MarcoGorelli · 2020-01-21T11:58:10Z

@jreback @WillAyd thanks for your reviews, have updated accordingly

pandas/core/groupby/generic.py

doc/source/whatsnew/v1.1.0.rst

MarcoGorelli · 2020-03-03T22:23:58Z

Hi @WillAyd - sorry to chase you up, just wanted to ask if there's anything else that needs doing here or if it's alright (or indeed if it's the wrong fix altogether :) )

pandas/core/groupby/generic.py

WillAyd · 2020-03-03T22:40:53Z

pandas/core/groupby/generic.py

-
-        return DataFrame(results, columns=columns)
+            return {key.label: value for key, value in results.items()}
+        return DataFrame(self._wrap_aggregated_output(results), columns=columns)


Is the DataFrame constructor still required here?

In the test

pytest pandas/tests/groupby/aggregate/test_aggregate.py::test_aggregate_item_by_item

when we get here we have

    (Pdb) results
    {OutputKey(label='<lambda>', position=0): A
    bar    3
    foo    5
    Name: B, dtype: int64}
    (Pdb) self._wrap_aggregated_output(results)
    A
    bar    3
    foo    5
    Name: <lambda>, dtype: int64
    (Pdb) type(self._wrap_aggregated_output(results))
    <class 'pandas.core.series.Series'>

@WillAyd have updated with a call to .to_frame (if necessary)
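
The `OutputKey(label=..., position=...)` keys visible in the pdb session suggest how collisions are avoided: each result is keyed on its label plus its position. The following is a rough sketch of that idea in plain Python. The `OutputKey` here is a stand-in namedtuple defined locally (not imported from pandas), and `collect` and `split_labels` are hypothetical helpers for illustration.

```python
from collections import namedtuple

# Stand-in for the OutputKey shown in the pdb session; not the pandas class.
OutputKey = namedtuple("OutputKey", ["label", "position"])


def collect(values, named_funcs):
    # Key each result on (label, position) so equal labels cannot collide.
    results = {}
    for position, (label, func) in enumerate(named_funcs):
        results[OutputKey(label=label, position=position)] = func(values)
    return results


def split_labels(results):
    # Store the data by position and keep the labels as a separate,
    # possibly duplicated, list of column names.
    indexed_output = {key.position: value for key, value in results.items()}
    columns = [key.label for key in results]
    return indexed_output, columns


results = collect([5, 1, 3], [("agg", min), ("agg", max)])
indexed_output, columns = split_labels(results)
```

Both results survive even though the labels are identical, and the duplicated labels are only attached at the end as column names.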

MarcoGorelli · 2020-05-27T08:00:00Z

Before this is merged, I should add a test case using pd.NamedAgg, xref #34380

MarcoGorelli · 2020-05-27T20:41:32Z

Before this is merged, I should add a test case using pd.NamedAgg, xref #34380

@jreback, have added a test which uses pd.NamedAgg, as in #34380, it's green now

EDIT: this is no longer allowed as of #34435, so have removed that extra test

MarcoGorelli · 2020-06-24T13:02:03Z

friendly ping :)

jreback · 2020-06-24T14:51:54Z

pandas/core/groupby/generic.py


        if any(isinstance(x, DataFrame) for x in results.values()):
            # let higher level handle
            return results

-        return self.obj._constructor_expanddim(results, columns=columns)
+        if not results:
+            return DataFrame()


hmm is this correct? do we have tests that hit this? I would think we would have something, e.g. columns, even if this is empty

also why is this not just handled in wrap_aggregated_output?

Here's a test that hits it: pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises

When trying to move this to wrap_aggregated_output I ran into #34977, so I'll try to address that first

this is still quite fishy. if you pass an empty result to self._wrap_aggregated_output what do you get as output? I really don't like special cases like this, which inevitably hide errors and make grokking code way more complex.

So prefer to have _wrap_aggregated_output handle this correctly. you may not even need L333, it's possible to pass columns to _wrap_aggregated_output

@jreback if we pass {} to _wrap_aggregated_output we get a KeyError.

Here's the traceback:

    ============================= test session starts ==============================
    platform linux -- Python 3.8.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
    rootdir: /home/marco/pandas-dev, inifile: setup.cfg
    plugins: xdist-1.32.0, cov-2.10.0, asyncio-0.12.0, hypothesis-5.16.1, forked-1.1.2
    collected 1 item

    pandas/tests/groupby/aggregate/test_aggregate.py F                       [100%]

    =================================== FAILURES ===================================
    ________________ TestNamedAggregationSeries.test_no_args_raises ________________

    self = <pandas.tests.groupby.aggregate.test_aggregate.TestNamedAggregationSeries object at 0x7f8835d975b0>

        def test_no_args_raises(self):
            gr = pd.Series([1, 2]).groupby([0, 1])
            with pytest.raises(TypeError, match="Must provide"):
                gr.agg()

            # but we do allow this
    >       result = gr.agg([])

    pandas/tests/groupby/aggregate/test_aggregate.py:555:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    pandas/core/groupby/generic.py:247: in aggregate
        ret = self._aggregate_multiple_funcs(func)
    pandas/core/groupby/generic.py:328: in _aggregate_multiple_funcs
        output = self._wrap_aggregated_output(results)
    pandas/core/groupby/generic.py:387: in _wrap_aggregated_output
        result = self._wrap_series_output(
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    self = <pandas.core.groupby.generic.SeriesGroupBy object at 0x7f8835d97ac0>
    output = {}, index = Int64Index([0, 1], dtype='int64')

        def _wrap_series_output(
            self, output: Mapping[base.OutputKey, Union[Series, np.ndarray]], index: Index,
        ) -> Union[Series, DataFrame]:
            """
            Wraps the output of a SeriesGroupBy operation into the expected result.

            Parameters
            ----------
            output : Mapping[base.OutputKey, Union[Series, np.ndarray]]
                Data to wrap.
            index : pd.Index
                Index to apply to the output.

            Returns
            -------
            Series or DataFrame

            Notes
            -----
            In the vast majority of cases output and columns will only contain one
            element. The exception is operations that expand dimensions, like ohlc.
            """
            indexed_output = {key.position: val for key, val in output.items()}
            columns = Index(key.label for key in output)

            result: Union[Series, DataFrame]
            if len(output) > 1:
                result = self.obj._constructor_expanddim(indexed_output, index=index)
                result.columns = columns
            else:
                result = self.obj._constructor(
    >               indexed_output[0], index=index, name=columns[0]
                )
    E           KeyError: 0

    pandas/core/groupby/generic.py:362: KeyError
    -------------- generated xml file: /tmp/tmp-31663hvopHHCRFEu.xml ---------------
    =========================== short test summary info ============================
    FAILED pandas/tests/groupby/aggregate/test_aggregate.py::TestNamedAggregationSeries::test_no_args_raises
    ============================== 1 failed in 0.22s ===============================

The problem is this line, which accesses [0] on an empty object

i would just fix this; you need to check len(indexed_output) first

So,

    elif len(indexed_output):
        result = self.obj._constructor(
            indexed_output[0], index=index, name=columns[0]
        )
    else:
        result = self.obj._constructor()

?

I can do that, but then I'll still have to address #34977 when the output of _wrap_aggregated_output is passed to self.obj._constructor_expanddim(results, columns=columns).

it's possible to pass columns to _wrap_aggregated_output

Are you sure? It seems to only take one argument (other than self)

@jreback would

    elif len(indexed_output):
        result = self.obj._constructor(
            indexed_output[0], index=index, name=columns[0]
        )
    else:
        return None

be an acceptable solution?

Actually,

    elif not columns.empty:
        result = self.obj._constructor(
            indexed_output[0], index=index, name=columns[0]
        )
    else:
        result = self.obj._constructor_expanddim()

works, because

pd.DataFrame(pd.DataFrame(), columns=[])

is allowed.

No need to modify the return types like this :)
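
The branching discussed above can be sketched without pandas. This is only an illustration of the control flow: `wrap_output` and `wrap_output_unguarded` are hypothetical functions that return tagged tuples instead of Series or DataFrame objects, and the column labels are assumed to be a plain list rather than an Index.

```python
def wrap_output_unguarded(indexed_output, columns):
    # Shape of the logic in the traceback: the else branch assumes
    # that exactly one result is present.
    if len(indexed_output) > 1:
        return ("frame", list(columns), dict(indexed_output))
    return ("series", indexed_output[0], columns[0])


def wrap_output(indexed_output, columns):
    # Guarded version: check for an empty result before reading position 0.
    if len(indexed_output) > 1:
        return ("frame", list(columns), dict(indexed_output))
    elif columns:
        # Exactly one result, so position 0 exists.
        return ("series", indexed_output[0], columns[0])
    else:
        # No results: return an empty frame-like value of the usual kind
        # rather than None, so the return type stays consistent.
        return ("frame", [], {})


# With an empty mapping the unguarded version fails on indexed_output[0].
try:
    wrap_output_unguarded({}, [])
    failure = None
except KeyError as exc:
    failure = type(exc).__name__
```

The guarded version keeps the empty case inside the wrapping function itself, which is the direction requested in the review, instead of special-casing it in the caller.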

jreback · 2020-06-24T22:31:55Z

pandas/core/groupby/generic.py


        if any(isinstance(x, DataFrame) for x in results.values()):
            # let higher level handle
            return results

-        return self.obj._constructor_expanddim(results, columns=columns)
+        if not results:
+            return DataFrame()


this is still quite fishy. if you pass an empty result to self._wrap_aggregated_output what do you get as output? I really don't like special cases like this, which inevitably hide errors and make grokking code way more complex.

So prefer to have _wrap_aggregated_output handle this correctly. you may not even need L333, it's possible to pass columns to _wrap_aggregated_output

jreback · 2020-07-09T23:38:45Z

this needs to go after #34998 which substantially refactors things

MarcoGorelli · 2020-07-10T07:54:33Z

OK, thanks for letting me know, will try rebasing locally to get this ready

jreback · 2020-07-14T17:19:21Z

@MarcoGorelli hmm we are going to be refactoring #34998 / delaying it, so let's see if we can get this one in. pls rebase and see if you can address above comments.

MarcoGorelli · 2020-07-14T19:01:54Z

@jreback have merged master

Regarding above comments, seems the only outstanding issue is handling when we pass an empty result to self._wrap_aggregated_output, which I've handled by returning self.obj._constructor_expanddim(). If that's not the right solution, any hints would be appreciated

jreback · 2020-07-14T20:24:18Z

@TomAugspurger if you'd have a quick look.

TomAugspurger · 2020-07-14T20:32:51Z

doc/source/whatsnew/v1.1.0.rst

@@ -1101,6 +1101,7 @@ Reshaping
 - Bug in :func:`crosstab` when inputs are two Series and have tuple names, the output will keep dummy MultiIndex as columns. (:issue:`18321`)
 - :meth:`DataFrame.pivot` can now take lists for ``index`` and ``columns`` arguments (:issue:`21425`)
 - Bug in :func:`concat` where the resulting indices are not copied when ``copy=True`` (:issue:`29879`)
+- Bug in :meth:`SeriesGroupBy.aggregate` was resulting in aggregations being overwritten when they shared the same name (:issue:`30880`)


FYI: the link to this method won't render, since SeriesGroupBy isn't in the pandas namespace.

Sorry about that - will make sure to build the whatsnew file in the future to check

pandas/tests/groupby/aggregate/test_aggregate.py

TomAugspurger · 2020-07-14T20:35:00Z

Thanks @MarcoGorelli!

MarcoGorelli · 2020-07-15T07:15:33Z

This one took me a while (I opened it in January!) so thanks for bearing with me while I worked on it!

MarcoGorelli changed the title ~~🐛 aggregations were getting overwritten if they had the same name~~ [BUG] aggregations were getting overwritten if they had the same name Jan 9, 2020

WillAyd requested changes Jan 9, 2020

WillAyd added the Groupby label Jan 9, 2020

MarcoGorelli force-pushed the multiple-aggregations branch from 65ae0c6 to 5e9fe4e Compare January 10, 2020 09:58

MarcoGorelli force-pushed the multiple-aggregations branch 2 times, most recently from 83398f8 to f84483b Compare January 10, 2020 17:11

WillAyd requested changes Jan 20, 2020

jreback requested changes Jan 20, 2020

MarcoGorelli force-pushed the multiple-aggregations branch 3 times, most recently from 42e9571 to 33c57a2 Compare January 21, 2020 11:29

MarcoGorelli force-pushed the multiple-aggregations branch from 33c57a2 to 3e648e1 Compare January 23, 2020 15:13

MarcoGorelli changed the title ~~[BUG] aggregations were getting overwritten if they had the same name~~ BUG: aggregations were getting overwritten if they had the same name Jan 24, 2020

WillAyd requested changes Jan 31, 2020

MarcoGorelli force-pushed the multiple-aggregations branch 2 times, most recently from 44e562c to 521bc1d Compare February 3, 2020 11:51

Marco Gorelli and others added 7 commits February 3, 2020 11:54

🐛 aggregations were getting overwritten if they had the same name

20049c1

🎨 shorten test for the sake of legibility

ab685fd

🎨 handle empty in , make whatsnewentry public-facing

e38e450

📝 move whatsnew entry to v1.1.0

cb849a2

remove accidentally added whatsnewentry

521bc1d

Merge branch 'master' into multiple-aggregations

ec93c4f

Update v1.1.0.rst

6f9aac8

WillAyd requested changes Mar 3, 2020

Marco Gorelli added 2 commits March 4, 2020 11:55

remove dataframe constructor

a8e9121

Dict instead of Mapping

b857c6d

MarcoGorelli mentioned this pull request May 26, 2020

BUG: #34380

Closed

MarcoGorelli added 3 commits May 27, 2020 19:00

add test with namedtuple

aa988a4

better layout

7a62f5f

better layout

d80ddc5

jreback requested changes Jun 24, 2020

MarcoGorelli mentioned this pull request Jun 24, 2020

ERR: Can't initialise DataFrame using empty Series and empty columns #34977

Open

3 tasks

jreback requested changes Jun 24, 2020

MarcoGorelli added 2 commits June 27, 2020 10:42

Merge remote-tracking branch 'upstream/master' into multiple-aggregations

4f954d4

dont special case empty output

62d91d1

MarcoGorelli requested a review from jreback June 27, 2020 10:26

Merge remote-tracking branch 'upstream/master' into multiple-aggregations

fb3ba5c

jreback approved these changes Jul 14, 2020

jreback requested a review from TomAugspurger July 14, 2020 20:24

TomAugspurger approved these changes Jul 14, 2020

TomAugspurger merged commit b6222ec into pandas-dev:master Jul 14, 2020

MarcoGorelli deleted the multiple-aggregations branch July 15, 2020 07:12

fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jul 16, 2020

BUG: aggregations were getting overwritten if they had the same name (pandas-dev#30858)

935be95

* 🐛 aggregations were getting overwritten if they had the same name

simonjayhawkins mentioned this pull request Jul 31, 2020

BUG: DataFrame.agg with multiple cum functions creates wrong result #35490

Closed

3 tasks

villebro mentioned this pull request Sep 23, 2020

chore: bump pandas to latest stable version apache/superset#11018

Merged

6 tasks

MarcoGorelli mentioned this pull request Dec 7, 2021

read_csv/pl.Dataframe.rename broken?! pola-rs/polars#2004

Closed
