GroupBySeries Quantile fails when there are 3 or more categories #28312

josesho · 2019-09-06T09:59:21Z

Code Sample

Using the latest version of pandas (v0.25.1), the following works:

import numpy as np
import pandas as pd

np.random.seed(12345)  # fixed seed so the values are reproducible

# Two categories, four rows each
df1 = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A',
                 'B', 'B', 'B', 'B'],
    'value': np.random.randint(1, 10, 8)
})

df1.groupby("category").value.quantile([0.25, 0.75])

produces

category      
A         0.25    2.75
          0.75    5.25
B         0.25    2.75
          0.75    6.25
Name: value, dtype: float64

as expected. However, running this

np.random.seed(12345)  # same seed as above

# Three categories, four rows each
df2 = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A',
                 'B', 'B', 'B', 'B',
                 'C', 'C', 'C', 'C'],
    'value': np.random.randint(1, 10, 12)
})

df2.groupby("category").value.quantile([0.25, 0.75])

produces this error instead:

IndexError                                Traceback (most recent call last)
<ipython-input-60-12c4dbb665fc> in <module>
      8 })
      9 
---> 10 df2.groupby("category").value.quantile([0.25, 0.75])

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   1951             indices = np.concatenate(arrays)
   1952             assert len(indices) == len(result)
-> 1953             return result.take(indices)
   1954 
   1955     @Substitution(name="groupby")

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/series.py in take(self, indices, axis, is_copy, **kwargs)
   4430 
   4431         indices = ensure_platform_int(indices)
-> 4432         new_index = self.index.take(indices)
   4433 
   4434         if is_categorical_dtype(self):

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
   2030             allow_fill=allow_fill,
   2031             fill_value=fill_value,
-> 2032             na_value=-1,
   2033         )
   2034         return MultiIndex(

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _assert_take_fillable(self, values, indices, allow_fill, fill_value, na_value)
   2058                 taken = masked
   2059         else:
-> 2060             taken = [lab.take(indices) for lab in self.codes]
   2061         return taken
   2062 

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
   2058                 taken = masked
   2059         else:
-> 2060             taken = [lab.take(indices) for lab in self.codes]
   2061         return taken
   2062 

IndexError: index 6 is out of bounds for size 6

The expected output is produced with pandas 0.24:

df2.groupby("category").value.quantile([0.25, 0.75])

category      
A         0.25    2.75
          0.75    5.25
B         0.25    2.75
          0.75    6.25
C         0.25    1.75
          0.75    7.25

I'm not sure how to work around this.

I understand a related bug was patched in #28285 and #27526.
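
One possible workaround, offered as a sketch rather than a confirmed fix: compute the quantiles per group with `Series.quantile` through `apply`, which sidesteps `GroupBy.quantile` (the method that raises in the traceback above). The helper name `groupwise_quantiles` is hypothetical and not part of pandas, and this path is likely slower than the built-in method on large frames.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)

# Same three-category frame as in the report
df2 = pd.DataFrame({
    "category": list("AAAABBBBCCCC"),
    "value": np.random.randint(1, 10, 12),
})


def groupwise_quantiles(frame, by, column, q):
    """Hypothetical helper: per-group quantiles via Series.quantile.

    Each group is passed to Series.quantile separately, so
    GroupBy.quantile is never called.
    """
    return frame.groupby(by)[column].apply(lambda s: s.quantile(q))


result = groupwise_quantiles(df2, "category", "value", [0.25, 0.75])
print(result)
```

The result is a Series indexed by category and quantile, the same shape as the pandas 0.24 output shown above.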

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None

pandas : 0.25.1
numpy : 1.16.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 4.3.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.2.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None


TomAugspurger · 2019-09-06T16:00:09Z

I think this is fixed in #28113 (0.25.2).
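
For anyone who needs to detect the affected releases at runtime, here is a minimal sketch of a version check. The affected range is an assumption pieced together from this thread: 0.24 works, 0.25.1 fails, the linked DABEST report mentions 0.25.0, and the fix is believed to land in 0.25.2. Both function names are hypothetical.

```python
import re

import pandas as pd


def version_tuple(version):
    """Parse the leading 'major.minor.patch' of a version string into ints."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # ignore suffixes such as 'rc1'
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def quantile_bug_suspected(version=None):
    """Return True if the version falls in the range assumed to be affected.

    Assumption (not confirmed here): 0.25.0 <= version < 0.25.2.
    """
    v = version_tuple(pd.__version__ if version is None else version)
    return (0, 25, 0) <= v < (0, 25, 2)


print(pd.__version__, quantile_bug_suspected())
```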

This was referenced Sep 6, 2019

groupby/quantile breaks #28307

Closed

plotting fails in pandas==0.25.0 ACCLAB/DABEST-python#52

Closed

TomAugspurger closed this as completed Sep 6, 2019

TomAugspurger added the Duplicate Report label Sep 6, 2019
