REGR: Fixed AssertionError in groupby #31616

TomAugspurger · 2020-02-03T14:13:46Z

cc @jbrockmendel. Just raising a TypeError when that assert failed didn't work. The finally still runs, which raised an assertion error.
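The behaviour described here is plain Python semantics rather than anything specific to pandas. A minimal sketch, using a hypothetical `process` function as a stand-in for one loop iteration (it is not the pandas code), shows an assertion in `finally` replacing a `TypeError` raised in `try`:

```python
def process(result_ndim):
    """Hypothetical stand-in for one loop iteration; not the pandas code."""
    try:
        if result_ndim != 2:
            # The attempted fix: signal the unsupported case explicitly.
            raise TypeError("unsupported result shape")
    finally:
        # This still runs while the TypeError is propagating, and its
        # own failure replaces the TypeError as the active exception.
        assert result_ndim == 2, "finally block assertion"


def exception_seen(result_ndim):
    """Return the names of the raised exception and its implicit context."""
    try:
        process(result_ndim)
    except Exception as err:
        # The original TypeError survives only as the implicit context.
        return type(err).__name__, type(err.__context__).__name__
    return None


print(exception_seen(1))  # ('AssertionError', 'TypeError')
print(exception_seen(2))  # None
```

The caller sees the `AssertionError`; the `TypeError` is only reachable through `__context__`.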

It seemed easier to just support this case. IIUC, it only occurs when a (P, n_rows) input block gets split into P result blocks. I believe that:

- The result blocks should all have the same dtype.
- The input block must not have been an extension block, since it's 2D.

So it should be safe to just cast the result values into an ndarray. Hopefully...
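As a rough illustration of that cast, assuming the P result pieces are plain NumPy arrays of shape (1, n_rows) (the `pieces` list below is made up for the example), stacking them recovers a single 2D array as long as the dtypes agree:

```python
import numpy as np

# Hypothetical result pieces: P = 2 blocks, each of shape (1, n_rows).
pieces = [
    np.array([[1.0, 2.0, 3.0]]),
    np.array([[4.0, 5.0, 6.0]]),
]

# Guard the "same dtype" assumption before stacking.
dtypes = {piece.dtype for piece in pieces}
assert len(dtypes) == 1, "pieces with different dtypes cannot share a 2D block"

stacked = np.vstack(pieces)
print(stacked.shape, stacked.dtype)  # (2, 3) float64
```

If the pieces had different dtypes, the guard would trip, which is the situation the same-dtype assumption is meant to rule out.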

Are there any edge cases I'm not considering? Some kind of agg that returns a result that can't be put in a 2D block? Even something like .agg(lambda x: pd.Period()) won't hit this, since it has to be a Cython function.

Closes pandas-dev#31522

jbrockmendel · 2020-02-03T16:18:55Z

> The result blocks should all have the same dtype

I'm not clear on how we end up with multiple blocks if they all have the same dtype. If so, could we just consolidate?

cc @WillAyd you put a lot of time into this code late last year.

TomAugspurger · 2020-02-03T16:43:27Z

Pushed up a change to consolidate if needed (and fixed a bug in my test; I had transposed some values by accident).
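For readers unfamiliar with the term, consolidation here means merging blocks that share a dtype. A minimal sketch of the idea, using a hypothetical `consolidate` helper over (values, locs) pairs rather than pandas' internal block manager, which is more involved:

```python
from collections import defaultdict

import numpy as np


def consolidate(blocks):
    """Merge (values, locs) pairs that share a dtype.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for values, locs in blocks:
        grouped[values.dtype].append((values, locs))

    merged = []
    for members in grouped.values():
        values = np.vstack([v for v, _ in members])
        locs = np.concatenate([l for _, l in members])
        merged.append((values, locs))
    return merged


blocks = [
    (np.array([[1, 2]]), np.array([0])),
    (np.array([[3, 4]]), np.array([2])),
    (np.array([["a", "b"]], dtype=object), np.array([1])),
]
merged = consolidate(blocks)
for values, locs in merged:
    print(values.dtype, values.shape, locs.tolist())
```

The two integer blocks collapse into one (2, 2) block covering positions 0 and 2, while the object block stays on its own.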

WillAyd · 2020-02-04T00:48:09Z

Hmm, seems like the actual issue is somewhere else; do we know where we are getting multiple blocks in the first place? I think the consolidate call should be coupled more tightly with whatever that function is.

TomAugspurger · 2020-02-04T12:12:30Z

The split happens as a result of result._convert(datetime=True) after the aggregation:

pandas/pandas/core/groupby/generic.py

Line 984 in 73ea6ca

return result._convert(datetime=True)

I'm not exactly sure why that's called, but it seems necessary to split object columns, since only some columns within a block might be converted? Indeed, that reveals a problem even with this patched version:

In [44]: df = pd.DataFrame({"A": pd.date_range("2000", periods=4), "B": ['a', 'b', 'c', 'd']}).astype(object)

In [45]: df.groupby([0, 0, 0, 1]).min()
ValueError: Wrong number of items passed 1, placement implies 2
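The per-column nature of that conversion can be seen with the public `DataFrame.infer_objects`, used here as an assumed stand-in for the private `_convert` call (the two are not claimed to be identical). A minimal sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": pd.date_range("2000", periods=4), "B": ["a", "b", "c", "d"]}
).astype(object)

# Both columns start out as object, so they can share one 2D block.
before = df.dtypes

# Soft conversion is decided per column: "A" holds only Timestamps and
# becomes datetime64, while "B" holds strings and is not datetime-like.
after = df.infer_objects().dtypes

print(before, after, sep="\n\n")
```

Once the two columns stop sharing a dtype they can no longer live in one block, which is where the 1:1 assumption breaks.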

TomAugspurger · 2020-02-04T16:05:41Z

@jbrockmendel @WillAyd can you take another look when you get a chance? The basic problem is described in #31616 (comment), but the tldr is that we deliberately split the result of object blocks in aggregate. The test at https://github.com/pandas-dev/pandas/pull/31616/files#diff-98307885a4959790d371ce5c886d6039R404 demonstrates a case where this is probably useful.

Regardless, all the code in DataFrameGroupBy._cython_agg_blocks is assuming that operation is 1:1 in terms of the number of blocks. I've chosen to handle split blocks after the fact, similar to deleted_blocks. I think it's slightly less ugly than trying to handle them inside the main for loop...
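The "handle it after the fact" pattern can be sketched in plain Python. Everything below (the `aggregate_blocks` name, the list-based blocks, and detecting a split by mixed result types) is a simplification made up for illustration, not the code in this PR:

```python
def aggregate_blocks(blocks, agg):
    """Aggregate (locs, columns) blocks, deferring the ones that split."""
    new_items, agg_results = [], []
    split_items, split_results = [], []

    for locs, columns in blocks:
        result = [agg(col) for col in columns]
        if len({type(r) for r in result}) > 1:
            # Mixed result types: this block no longer maps 1:1,
            # so set it aside instead of complicating the main loop.
            split_items.append(locs)
            split_results.append(result)
            continue
        new_items.append(locs)
        agg_results.append(result)

    # Deferred handling: emit one single-column entry per split result.
    for locs, result in zip(split_items, split_results):
        assert len(locs) == len(result)
        for loc, value in zip(locs, result):
            new_items.append([loc])
            agg_results.append([value])

    return new_items, agg_results


blocks = [
    ([0, 2], [[1, 2, 3], [4, 5, 6]]),    # two int columns
    ([1, 3], [[2.5, 1.5], ["b", "a"]]),  # a float column and a str column
]
items, results = aggregate_blocks(blocks, min)
print(items)    # [[0, 2], [1], [3]]
print(results)  # [[1, 4], [1.5], ['a']]
```

The main loop keeps its 1:1 shape, and the split pieces are appended afterwards with their own single-element locs.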

TomAugspurger · 2020-02-04T16:09:28Z

Oh, whoops, that last commit is going to fail.

Is there a reason the block at https://github.com/pandas-dev/pandas/pull/31616/files#diff-bfee1ba9e7cb79839776fac1a57ed940R1082-R1084 needs to be in a finally? That makes that section run even when we continue. I would think just dedenting it would achieve the same effect.
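The difference being pointed at is standard Python behaviour: a `finally` clause runs even when the `try` body exits through `continue`. A minimal sketch with a hypothetical `run` loop (not the pandas code) compares the two layouts:

```python
def run(values, use_finally):
    """Hypothetical loop; returns which values reached the bookkeeping step."""
    recorded = []
    for value in values:
        if use_finally:
            try:
                if value < 0:
                    continue  # skip this item...
            finally:
                recorded.append(value)  # ...but this still runs
        else:
            if value < 0:
                continue
            recorded.append(value)  # dedented: skipped items never get here
    return recorded


print(run([1, -2, 3], use_finally=True))   # [1, -2, 3]
print(run([1, -2, 3], use_finally=False))  # [1, 3]
```

With the dedented version, the bookkeeping only happens for items that were not skipped.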

jbrockmendel · 2020-02-04T18:23:09Z

pandas/core/groupby/generic.py

+                assert len(locs) == result.shape[1]
+                for i, loc in enumerate(locs):
+                    new_items.append(np.array([loc], dtype=locs.dtype))
+                    agg_blocks.append(result.iloc[:, [i]]._data.blocks[0])


Could we avoid some of this by changing the agg_blocks.append to agg_blocks.extend, and constructing these separate blocks up in 1069-1076?

TomAugspurger · 2020-02-04

I tried that, but it didn't look promising, so I abandoned it. Several things work against it:

- The conversion to ndarray. We need to avoid that since we have mixed types.
- The construction of the new block with block.make_block would need to be handled specially, since we have multiple blocks and the original block's locs aren't correct anymore once it's been split.

jbrockmendel · 2020-02-05

Hmm, this might be out of scope, but I think if we used block.apply it would handle both the make_block and the potential splitting.
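The per-column split in the diff under discussion can be approximated with public API only. In this sketch, `result` and `locs` are made-up inputs, and the single-column frames stand in for the internal blocks that the real code pulls out via `_data.blocks[0]`:

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated result whose columns ended up with different dtypes.
result = pd.DataFrame(
    {"A": pd.to_datetime(["2000-01-01", "2000-01-04"]), "B": ["a", "d"]}
)
# Positions of these columns in the original frame (assumed for illustration).
locs = np.array([0, 1])

new_items = []
pieces = []
assert len(locs) == result.shape[1]
for i, loc in enumerate(locs):
    new_items.append(np.array([loc], dtype=locs.dtype))
    # Slicing with a list keeps a 2D, single-column frame of one dtype.
    pieces.append(result.iloc[:, [i]])

for item, piece in zip(new_items, pieces):
    print(item, piece.shape, piece.dtypes.iloc[0])
```

Each piece is 2D with exactly one column, so it can be treated as a block of a single dtype with its own one-element locs.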

jreback · 2020-02-05T00:35:20Z

I guess this is ok; the whole groupby operating on blocks has been a mess for quite a while. @TomAugspurger pls rebase.

jorisvandenbossche · 2020-02-05T09:22:58Z

Rebased

TomAugspurger · 2020-02-05T13:24:23Z

CI is passing now.

TomAugspurger · 2020-02-05T13:58:08Z

Will give @jreback and @WillAyd another hour or so for feedback, but I think this is OKish, at least for 1.0.1.

TomAugspurger · 2020-02-05T14:55:18Z

OK, merging. Apologies for rushing things along, but I think this is an improvement on master.

Happy to work through followups if people have suggestions.

I'll start with #31616 (comment), though I don't immediately see where block.apply would be used. In place of s.aggregate?

REGR: Fixed AssertionError in groupby

f868874

Closes pandas-dev#31522

TomAugspurger added the Groupby and Regression (functionality that used to work in a prior pandas version) labels Feb 3, 2020

TomAugspurger added this to the 1.0.1 milestone Feb 3, 2020

TomAugspurger added 2 commits February 3, 2020 10:31

Merge remote-tracking branch 'upstream/master' into 31522-groupby-assertion

e2fa8f5

consolidate if needed

70608cf

jreback requested changes Feb 4, 2020

pandas/core/groupby/generic.py

Merge remote-tracking branch 'upstream/master' into 31522-groupby-assertion

4da6bff

TomAugspurger added 3 commits February 4, 2020 07:35

add test

6eeda42

dedent

04d2c72

fixup

8a5db12

all split

6eb1cfd

TomAugspurger commented Feb 4, 2020

pandas/core/groupby/generic.py

jbrockmendel reviewed Feb 4, 2020

Merge remote-tracking branch 'upstream/master' into 31522-groupby-assertion

b4554be

TomAugspurger merged commit 2bf618f into pandas-dev:master Feb 5, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 5, 2020

Backport PR pandas-dev#31616: REGR: Fixed AssertionError in groupby

64f41d6

meeseeksmachine mentioned this pull request Feb 5, 2020

Backport PR #31616 on branch 1.0.x (REGR: Fixed AssertionError in groupby) #31703

Merged

TomAugspurger deleted the 31522-groupby-assertion branch February 5, 2020 14:56

TomAugspurger added a commit that referenced this pull request Feb 5, 2020

Backport PR #31616: REGR: Fixed AssertionError in groupby (#31703)

32990d5

Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>

This was referenced Feb 6, 2020

BUG: groupby _cython_agg_blocks implicitly assumes unique columns #31735

Closed

REF: _cython_agg_blocks #31752

Closed

jbrockmendel mentioned this pull request Aug 21, 2020

REF: simplify _cython_agg_blocks #35841

Merged
