DataFrameGroupBy.boxplot crashes if any group contains duplicate index #30772

Closed
xuancong84 opened this issue Jan 7, 2020 · 3 comments
Labels: Duplicate Report (Duplicate issue or pull request), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@xuancong84

With DataFrameGroupBy, boxplot raises a ValueError if any group contains duplicate index values. The code below illustrates this: setting crash=True puts a duplicate index value (2018-01-06) in group 1 (the second group), which makes boxplot fail.

import pandas as pd
import numpy.random as rnd

# Set crash=True to build an index in which 2018-01-06 appears twice.
crash = True
if crash:
    index = pd.date_range(start='1/1/2018', end='1/6/2018').append(pd.date_range(start='1/6/2018', end='1/15/2018'))
else:
    index = pd.date_range(start='1/1/2018', end='1/16/2018')

# 16 rows, split into 4 groups of 4 consecutive rows each
df = pd.DataFrame(data={'value':rnd.randn(16), 'group':[i for i in range(4) for j in range(4)]}, index=index)
dfg = df.groupby('group')
dfg.boxplot(subplots=False)  # raises ValueError when crash=True

The error stack trace looks like the following:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-a9576feccdf0> in <module>
     11 df = pd.DataFrame(data={'value':rnd.randn(16), 'group':[i for i in range(4) for j in range(4)]}, index=index)
     12 dfg = df.groupby('group')
---> 13 dfg.boxplot(subplots=False)
     14 dfg

~/anaconda3/lib/python3.7/site-packages/pandas/plotting/_core.py in boxplot_frame_groupby(grouped, subplots, column, fontsize, rot, grid, ax, figsize, layout, sharex, sharey, **kwds)
    498         sharex=sharex,
    499         sharey=sharey,
--> 500         **kwds
    501     )
    502 

~/anaconda3/lib/python3.7/site-packages/pandas/plotting/_matplotlib/boxplot.py in boxplot_frame_groupby(grouped, subplots, column, fontsize, rot, grid, ax, figsize, layout, sharex, sharey, **kwds)
    398         keys, frames = zip(*grouped)
    399         if grouped.axis == 0:
--> 400             df = pd.concat(frames, keys=keys, axis=1)
    401         else:
    402             if len(frames) > 1:

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    256     )
    257 
--> 258     return op.get_result()
    259 
    260 

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in get_result(self)
    471 
    472             new_data = concatenate_block_managers(
--> 473                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
    474             )
    475             if not self.copy:

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   2057         blocks.append(b)
   2058 
-> 2059     return BlockManager(blocks, axes)

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    141 
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 
    145         self._consolidate_check()

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in _verify_integrity(self)
    343         for block in self.blocks:
    344             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 345                 construction_error(tot_items, block.shape[1:], self.axes)
    346         if len(self.items) != tot_items:
    347             raise AssertionError(

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1717         raise ValueError("Empty data passed with indices specified.")
   1718     raise ValueError(
-> 1719         "Shape of passed values is {0}, indices imply {1}".format(passed, implied)
   1720     )
   1721 

ValueError: Shape of passed values is (18, 8), indices imply (16, 8)
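
To confirm which group carries the repeated timestamp, the index can be inspected per group before plotting. This is only a diagnostic sketch: it rebuilds the frame from the crash=True branch above, and the names dup_by_group and repeated are illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

# Same index as the crash=True branch of the report: 2018-01-06 appears twice.
index = pd.date_range(start="1/1/2018", end="1/6/2018").append(
    pd.date_range(start="1/6/2018", end="1/15/2018")
)
df = pd.DataFrame(
    {"value": np.random.randn(16), "group": np.repeat(np.arange(4), 4)},
    index=index,
)

# Map each group key to whether that group's index holds repeated labels.
dup_by_group = {
    int(key): frame.index.has_duplicates for key, frame in df.groupby("group")
}
print(dup_by_group)

# Labels that occur more than once anywhere in the frame.
repeated = df.index[df.index.duplicated(keep=False)].unique()
print(list(repeated))
```

Only group 1 is flagged, and the single repeated label is 2018-01-06, which matches the description above.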

From a practical point of view, a box plot does not depend on the index, so users should not have to ensure that the index is free of duplicates. boxplot should work whether or not the index contains duplicates. Interestingly, DataFrame.boxplot does not crash when the index contains duplicates.
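
Until the underlying behaviour changes, two workarounds seem plausible. The sketch below is not a fix from the pandas maintainers. It assumes the index labels are not needed for the plot, so they can be discarded with reset_index(drop=True), and it relies on the observation above that DataFrame.boxplot tolerates duplicate labels. The Agg backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import numpy as np
import pandas as pd

# Frame from the report, with 2018-01-06 duplicated in the index.
index = pd.date_range(start="1/1/2018", end="1/6/2018").append(
    pd.date_range(start="1/6/2018", end="1/15/2018")
)
df = pd.DataFrame(
    {"value": np.random.randn(16), "group": np.repeat(np.arange(4), 4)},
    index=index,
)

# Workaround 1 (assumes the labels are irrelevant to the plot):
# replace the index with a unique RangeIndex before grouping.
ax_grouped = df.reset_index(drop=True).groupby("group").boxplot(subplots=False)

# Workaround 2: let DataFrame.boxplot do the grouping via its `by` argument.
ax_by = df.boxplot(column="value", by="group")
```

The two calls do not draw identical figures: the first plots every column of every group side by side, while the second draws one box per group for the value column only.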

@TomAugspurger
Contributor

Thanks for the report. The problem looks to be a bit deeper than boxplot.

In [21]: df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])

In [22]: df2 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['b', 'b', 'c'])

In [23]: pd.concat([df1, df2], keys=['a', 'b'], axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-6f9592337c0e> in <module>
----> 1 pd.concat([df1, df2], keys=['a', 'b'], axis=1)

~/sandbox/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    282     )
    283
--> 284     return op.get_result()
    285
    286

~/sandbox/pandas/pandas/core/reshape/concat.py in get_result(self)
    495
    496             new_data = concatenate_block_managers(
--> 497                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
    498             )
    499             if not self.copy:

~/sandbox/pandas/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   2026         blocks.append(b)
   2027
-> 2028     return BlockManager(blocks, axes)

~/sandbox/pandas/pandas/core/internals/managers.py in __init__(self, blocks, axes, do_integrity_check)
    138
    139         if do_integrity_check:
--> 140             self._verify_integrity()
    141
    142         self._consolidate_check()

~/sandbox/pandas/pandas/core/internals/managers.py in _verify_integrity(self)
    333         for block in self.blocks:
    334             if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
--> 335                 construction_error(tot_items, block.shape[1:], self.axes)
    336         if len(self.items) != tot_items:
    337             raise AssertionError(

~/sandbox/pandas/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1693     if block_shape[0] == 0:
   1694         raise ValueError("Empty data passed with indices specified.")
-> 1695     raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
   1696
   1697

ValueError: Shape of passed values is (4, 4), indices imply (3, 4)

I haven't looked to see what the expected output of that concat is. Are you interested in investigating further @xuancong84?
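
For anyone investigating, here is a minimal sketch of the same concat with the failure caught, followed by one possible way to make it succeed. The expected output of the original call is left open above, so the positional alignment used here (dropping the labels with reset_index) is an assumption, not the intended semantics. The exception type is caught broadly because it may differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=["a", "a", "b"])
df2 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=["b", "b", "c"])

# Both inputs carry repeated labels, and their indexes differ from each other.
print(df1.index.has_duplicates, df2.index.has_duplicates)

try:
    pd.concat([df1, df2], keys=["a", "b"], axis=1)
except Exception as exc:  # exact exception class may vary by pandas version
    print(type(exc).__name__, exc)

# Assumption: aligning rows by position is acceptable, so labels are dropped.
aligned = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    keys=["a", "b"],
    axis=1,
)
print(aligned)
```

With unique indexes the call returns a 3x4 frame whose columns are a two-level MultiIndex built from the keys and the original column names.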

@TomAugspurger added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label on Jan 7, 2020
@TomAugspurger
Contributor

Actually, this is a duplicate of #6963. Post over there if you're interested in investigating.

@TomAugspurger added the Duplicate Report (Duplicate issue or pull request) label on Jan 7, 2020
@TomAugspurger added this to the No action milestone on Jan 7, 2020
@xuancong84
Author

@TomAugspurger I have proposed and posted a solution for #6963, but I am not sure whether that will fix the bug in this post.
