Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Fixed AssertionError in groupby #31616

Merged

Conversation

TomAugspurger
Copy link
Contributor

Closes #31522

cc @jbrockmendel. Just raising a TypeError when that assert failed didn't work. The finally still runs, which raised an assertion error.

It seemed easier to try to just support this case. IIUC, it only occurs when an (P, n_rows) input block gets split into P result blocks. I believe that

  1. The result blocks should all have the same dtype
  2. The input block must not have been an extension block, since it's 2d

So it should be safe to just cast the result values into an ndarray. Hopefully...

Are there any edge cases I'm not considering? Some kind of agg that returns a result that can't be put in a 2D block? Even something like .agg(lambda x: pd.Period())
won't hit this, since it has to be a Cython function.

@TomAugspurger TomAugspurger added Groupby Regression Functionality that used to work in a prior pandas version labels Feb 3, 2020
@TomAugspurger TomAugspurger added this to the 1.0.1 milestone Feb 3, 2020
@jbrockmendel
Copy link
Member

The result blocks should all have the same dtype

I'm not clear on how we end up with multiple blocks if they all have the same dtype. if so, could we just consolidate?

cc @WillAyd you put a lot of time into this code late last year.

@TomAugspurger
Copy link
Contributor Author

Pushed up a change to consolidate if needed (and fixed a bug in my test. Had transposed some values on accident).

@WillAyd
Copy link
Member

WillAyd commented Feb 4, 2020

Hmm seems like the actual issue is somewhere else; do we know where we are getting multiple blocks in the first place? I think the consolidate call should be coupled tighter with whatever that function is

@TomAugspurger
Copy link
Contributor Author

The split happens as a result of result._convert(datetime=True) after the aggregation:

return result._convert(datetime=True)

I'm not exactly sure why that's called, but it seems necessary to split object columns, since only some columns within a block might be converted? Indeed, that reveals a problem even with this patched version

In [44]: df = pd.DataFrame({"A": pd.date_range("2000", periods=4), "B": ['a', 'b', 'c', 'd']}).astype(object)

In [45]: df.groupby([0, 0, 0, 1]).min()
ValueError: Wrong number of items passed 1, placement implies 2

@TomAugspurger
Copy link
Contributor Author

@jbrockmendel @WillAyd can you take another look when you get a chance? The basic problem is described in #31616 (comment), but the tldr is that we deliberately split the result of object blocks in aggregate. The test at https://github.com/pandas-dev/pandas/pull/31616/files#diff-98307885a4959790d371ce5c886d6039R404 demonstrates a case where this is probably useful.

Regardless, all the code in DataFrameGroupBy._cython_agg_blocks is assuming that operation is 1:1 in terms of the number of blocks. I've chosen to handle split blocks after the fact, similar to deleted_blocks. I think it's slightly less ugly than trying to handle them inside the main for loop...

@TomAugspurger
Copy link
Contributor Author

Oh, whoops, that last commit is going to fail.

Is there a reason the block at https://github.com/pandas-dev/pandas/pull/31616/files#diff-bfee1ba9e7cb79839776fac1a57ed940R1082-R1084 needs to be in a finally? That makes that section run even when we continue. I would think just dedenting it would achieve the same effect.

assert len(locs) == result.shape[1]
for i, loc in enumerate(locs):
new_items.append(np.array([loc], dtype=locs.dtype))
agg_blocks.append(result.iloc[:, [i]]._data.blocks[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we avoid some of this by changing the agg_blocks.append to agg_blocks.extend? and construct these separate blocks up in 1069-1076?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that but it didn't look promising so I abandoned it. Several things work against that

  1. The conversion to ndarray. We need to avoid that since we have mixed types
  2. The construction of the new block with block.make_block would need to be handle specially, since we have multiple blocks and the origin block's locs aren't correct anymore, since it's been split.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this might be out of scope, but i think if we used block.apply it would handle both the make_block and potential splitting

@jreback
Copy link
Contributor

jreback commented Feb 5, 2020

I guess this is ok; the whole groupby operating on blocks has been a mess for quite a while. @TomAugspurger pls rebase.

@jorisvandenbossche
Copy link
Member

Rebased

@TomAugspurger
Copy link
Contributor Author

CI is passing now.

@TomAugspurger
Copy link
Contributor Author

Will give @jreback and @WillAyd another hour or so for feedback, but I think this is OKish, at least for 1.0.1.

@TomAugspurger
Copy link
Contributor Author

OK, merging. Apologies for rushing things along, but I think this is an improvement on master.

Happy to work through followups if people have suggestions.

I'll start with #31616 (comment), though immediate see where block.apply would be used. In place of s.aggregate?

@TomAugspurger TomAugspurger merged commit 2bf618f into pandas-dev:master Feb 5, 2020
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 5, 2020
@TomAugspurger TomAugspurger deleted the 31522-groupby-assertion branch February 5, 2020 14:56
TomAugspurger added a commit that referenced this pull request Feb 5, 2020
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AssertionError when grouping with max/min as aggregation functions (pandas-1.0.0)
5 participants