REGR: Fixed AssertionError in groupby #31616
Conversation
I'm not clear on how we end up with multiple blocks if they all have the same dtype. If so, could we just consolidate? cc @WillAyd, you put a lot of time into this code late last year.
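For context on what "consolidate" means here: inserting a column leaves a DataFrame with multiple internal blocks even when the dtypes match, and consolidation merges same-dtype blocks back together. A minimal sketch, relying on the private `_mgr` / `_consolidate` internals (which can change between pandas versions):

```python
import pandas as pd

# Build a frame, then insert a column of the same dtype; pandas stores
# the new column as a separate block rather than consolidating eagerly.
df = pd.DataFrame({"a": [1, 2, 3]})
df["b"] = [4, 5, 6]
print(df._mgr.nblocks)  # two int64 blocks, despite identical dtypes

# _consolidate() returns a copy whose same-dtype blocks are merged.
consolidated = df._consolidate()
print(consolidated._mgr.nblocks)  # back to a single int64 block
```

This lazy behavior is why an aggregation path that assumes "one dtype, one block" can be surprised by multiple same-dtype blocks.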
Pushed up a change to consolidate if needed (and fixed a bug in my test; I had transposed some values by accident).
Hmm, seems like the actual issue is somewhere else; do we know where we are getting multiple blocks in the first place? I think the consolidate call should be coupled more tightly with whatever that function is.
The split happens as a result of `pandas/core/groupby/generic.py`, line 984 (at 73ea6ca).
I'm not exactly sure why that's called, but it seems necessary to split object columns, since only some columns within a block might be converted. Indeed, that reveals a problem even with this patched version:

```python
In [44]: df = pd.DataFrame({"A": pd.date_range("2000", periods=4), "B": ['a', 'b', 'c', 'd']}).astype(object)

In [45]: df.groupby([0, 0, 0, 1]).min()
ValueError: Wrong number of items passed 1, placement implies 2
```
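For reference, the same reproducer as an ordinary script; on a pandas release with this fix applied, the groupby aggregates each object column independently instead of raising (public API only):

```python
import pandas as pd

# Two object columns backed by a single object block: Timestamps in
# "A", strings in "B".
df = pd.DataFrame(
    {"A": pd.date_range("2000", periods=4), "B": ["a", "b", "c", "d"]}
).astype(object)

# Rows 0-2 form group 0, row 3 forms group 1; min() is computed per
# column within each group, so the result has two rows and two columns.
result = df.groupby([0, 0, 0, 1]).min()
print(result.shape)  # (2, 2)
```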
@jbrockmendel @WillAyd can you take another look when you get a chance? The basic problem is described in #31616 (comment), but the tl;dr is that we deliberately split the result of object blocks in […]. Regardless, all the code in […]
Oh, whoops, that last commit is going to fail. Is there a reason the block at https://github.com/pandas-dev/pandas/pull/31616/files#diff-bfee1ba9e7cb79839776fac1a57ed940R1082-R1084 needs to be in a […]?
```python
assert len(locs) == result.shape[1]
for i, loc in enumerate(locs):
    new_items.append(np.array([loc], dtype=locs.dtype))
    agg_blocks.append(result.iloc[:, [i]]._data.blocks[0])
```
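The loop above can be mimicked with public API to see what it does: each column of the aggregated result is sliced off as its own single-column piece, and `new_items` records the original column placement of each piece. A sketch with plain lists standing in for the real block-manager structures (`pieces` here plays the role of `agg_blocks`):

```python
import numpy as np
import pandas as pd

# Stand-in for the aggregation result of one two-column input block.
result = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})
locs = np.array([0, 1])  # original column positions of that block

new_items = []  # one single-element placement array per piece
pieces = []     # single-column frames, one per split-off result block
for i, loc in enumerate(locs):
    new_items.append(np.array([loc], dtype=locs.dtype))
    pieces.append(result.iloc[:, [i]])

print([p.shape for p in pieces])        # [(2, 1), (2, 1)]
print([n.tolist() for n in new_items])  # [[0], [1]]
```

The placement bookkeeping is what later lets the split pieces be reassembled into the output frame in the right column order.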
Could we avoid some of this by changing the `agg_blocks.append` to `agg_blocks.extend`, and constructing these separate blocks up in 1069-1076?
I tried that, but it didn't look promising so I abandoned it. Several things work against it:

- The conversion to ndarray. We need to avoid that, since we have mixed types.
- The construction of the new block with `block.make_block` would need to be handled specially, since we have multiple blocks and the origin `block`'s locs aren't correct anymore, since it's been split.
Hmm, this might be out of scope, but I think if we used `block.apply` it would handle both the `make_block` and the potential splitting.
I guess this is OK; the whole groupby-operating-on-blocks approach has been a mess for quite a while. @TomAugspurger pls rebase.
Rebased.
CI is passing now. |
OK, merging. Apologies for rushing things along, but I think this is an improvement on master. Happy to work through followups if people have suggestions. I'll start with #31616 (comment), though I don't immediately see where `block.apply` would be used in place of […]
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Closes #31522
cc @jbrockmendel. Just raising a `TypeError` when that assert failed didn't work: the `finally` still runs, which raised an assertion error. It seemed easier to just support this case. IIUC, it only occurs when an `(P, n_rows)` input block gets split into `P` result blocks. I believe that […] So it should be safe to just cast the result values into an ndarray. Hopefully...

Are there any edge cases I'm not considering? Some kind of `agg` that returns a result that can't be put in a 2D block? Even something like `.agg(lambda x: pd.Period())` won't hit this, since it has to be a Cython function.
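The "cast the result values into an ndarray" step is safe precisely because each split-off piece holds a single dtype; a mixed-dtype frame would instead be upcast to object when converted. A small sketch of that distinction, using only public API:

```python
import numpy as np
import pandas as pd

# Single-dtype piece: the ndarray keeps the real dtype and 2-D shape.
piece = pd.DataFrame({"A": [1.0, 2.0]})
arr = np.asarray(piece)
print(arr.dtype, arr.shape)  # float64 (2, 1)

# Mixed-dtype frame: conversion has to fall back to object dtype,
# which is exactly what the split avoids.
mixed = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})
print(np.asarray(mixed).dtype)  # object
```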