GroupBySeries Quantile fails when there are 3 or more categories #28312

Closed
josesho opened this issue Sep 6, 2019 · 1 comment
Labels
Duplicate Report (duplicate issue or pull request)

josesho commented Sep 6, 2019

Code Sample

Using the latest version of pandas (v0.25.1):

import numpy as np
import pandas as pd

np.random.seed(12345)

df1 = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A', 
                 'B', 'B', 'B', 'B', 
                 ],
    'value': np.random.randint(1, 10, 8)
})

df1.groupby("category").value.quantile([0.25, 0.75])

produces

category      
A         0.25    2.75
          0.75    5.25
B         0.25    2.75
          0.75    6.25
Name: value, dtype: float64

as expected. However, running this

np.random.seed(12345)

df2 = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A', 
                 'B', 'B', 'B', 'B', 
                 'C', 'C', 'C', 'C', 
                 ],
    'value': np.random.randint(1, 10, 12)
})

df2.groupby("category").value.quantile([0.25, 0.75])

produces this error instead:

IndexError                                Traceback (most recent call last)
<ipython-input-60-12c4dbb665fc> in <module>
      8 })
      9 
---> 10 df2.groupby("category").value.quantile([0.25, 0.75])

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   1951             indices = np.concatenate(arrays)
   1952             assert len(indices) == len(result)
-> 1953             return result.take(indices)
   1954 
   1955     @Substitution(name="groupby")

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/series.py in take(self, indices, axis, is_copy, **kwargs)
   4430 
   4431         indices = ensure_platform_int(indices)
-> 4432         new_index = self.index.take(indices)
   4433 
   4434         if is_categorical_dtype(self):

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
   2030             allow_fill=allow_fill,
   2031             fill_value=fill_value,
-> 2032             na_value=-1,
   2033         )
   2034         return MultiIndex(

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _assert_take_fillable(self, values, indices, allow_fill, fill_value, na_value)
   2058                 taken = masked
   2059         else:
-> 2060             taken = [lab.take(indices) for lab in self.codes]
   2061         return taken
   2062 

~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
   2058                 taken = masked
   2059         else:
-> 2060             taken = [lab.take(indices) for lab in self.codes]
   2061         return taken
   2062 

IndexError: index 6 is out of bounds for size 6

The expected output is produced with pandas=0.24:

df2.groupby("category").value.quantile([0.25, 0.75])
category      
A         0.25    2.75
          0.75    5.25
B         0.25    2.75
          0.75    6.25
C         0.25    1.75
          0.75    7.25

I'm not sure how to work around this.

I understand a related bug was patched with #28285 and #27526.
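
One possible workaround, offered as a sketch rather than a confirmed fix: compute the quantiles per group with `apply` and `Series.quantile`, so that `GroupBy.quantile` (where the traceback above originates) is never called. This assumes `apply` is not affected by the same indexing problem, and it will likely be slower on large frames. The frame below rebuilds `df2` from the report.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)

# Same data as df2 in the report: three categories, four values each.
df2 = pd.DataFrame({
    "category": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "value": np.random.randint(1, 10, 12),
})

# Workaround sketch: call Series.quantile once per group via apply,
# instead of going through GroupBy.quantile with a list of quantiles.
result = df2.groupby("category")["value"].apply(
    lambda s: s.quantile([0.25, 0.75])
)
print(result)
```

The result should be a Series indexed by (category, quantile), the same shape as the expected output shown above.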

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 0.25.1
numpy : 1.16.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 4.3.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.2.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@TomAugspurger
Contributor

I think this was fixed in #28113 (0.25.2).
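
If the fix did ship in 0.25.2 as suggested, a quick way to tell whether an environment should already have it is to compare version strings. The helpers below are hypothetical names written for illustration (they are not part of pandas), and the `0.25.2` threshold is taken from the comment above, not verified independently.

```python
import re


def version_tuple(version):
    """Return up to the first three numeric components of a version string."""
    # Drop any local suffix such as "+12.gabc123" before extracting digits.
    parts = re.findall(r"\d+", version.split("+")[0])[:3]
    return tuple(int(p) for p in parts)


def at_least(version, minimum="0.25.2"):
    """True if `version` is at or above `minimum` (assumed fix release)."""
    return version_tuple(version) >= version_tuple(minimum)


print(at_least("0.25.1"))  # False: the version from this report
print(at_least("0.25.2"))  # True: the release the fix is believed to be in
```

In practice you would pass `pd.__version__` as the first argument. Note that this only checks the threshold; it says nothing about 0.24, which the report shows is not affected.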

TomAugspurger added the Duplicate Report label Sep 6, 2019