Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

Closed
kpflugshaupt opened this issue Mar 25, 2019 · 7 comments · Fixed by kpflugshaupt/pandas#1 or #25908
Labels
Duplicate Report Duplicate issue or pull request
Milestone

Comments

@kpflugshaupt
Copy link
Contributor

Code Sample:

import pandas as pd

# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
            dict(cat = pd.Series([3, 1, 2, 1, 3, 2], 
                                 dtype=pd.CategoricalDtype(
                                              categories=[1, 2, 3, 4], 
                                              ordered=True)
                                ), 
                 val = pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0])
            )
)

Including unobserved categories gives correct groups:

# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')
cat
1    1.0
2    2.0
3    3.0
4    0.0
Name: val, dtype: float64

Excluding unobserved categories changes the order, groups are wrong:

# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')
cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

Problem description:

The sample code shows that grouping with an ordered factor does not respect the factor's order when 'observed=True' is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical's order.
Thus, the result is wrong.

Related issues: #25167
There, the Categorical was unordered, and the sort=True parameter did not work as expected in combination with observed=True. In my case, sort makes no difference:

df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

both give the same, wrong result as shown above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2019

Unless I am overlooking something this looks to be a duplicate of the issue you've linked to. As such closing this and let's stick to that one

@WillAyd WillAyd closed this as completed Mar 26, 2019
@WillAyd WillAyd added the Duplicate Report Duplicate issue or pull request label Mar 26, 2019
@kpflugshaupt
Copy link
Contributor Author

Fair enough.
The scenario is not identical, though (ordered Categorical vs. ‚sort=True‘).
Looking at the code in grouper.py, it certainly seems to stem from the same place, but the ‚ordered‘ attribute and the ‚sort‘ parameter are separate concepts there, as well.
Maybe make this another unit Test?

@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2019

A good test should be able to parametrize all of those parameters - if you'd like to take a look at the other issue and post a PR that would certainly be welcome!

@kpflugshaupt
Copy link
Contributor Author

Right, I will try. Paid work keeps intruding, though...

@kpflugshaupt
Copy link
Contributor Author

kpflugshaupt commented Mar 26, 2019

I have digged further, and found that this issue is not a duplicate of #25167:

The fix for #25167 does not fix this one completely

The PR #25173 fixes #25167. So, I applied the two code changes from that PR to my local installation (files grouper.py and categorical.py in pandas/core/groupby/).
This fixes my test case, but only if the 'sort' parameter is set to True. If sort=False, the bug persists (same result as shown above):

df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

still returns

cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

after installing PR #25173.

If I modify it a bit, PR #25173 also fixes my problem

The relevant change from PR #25173 that fixes my sort=True case are these two new lines in grouper.py (inserted after line 302):

                    if sort:
                        codes = np.sort(codes)

If I extend the condition and also check for the grouper being ordered, the sort=False case also works:

                    if sort or self.grouper.ordered:
                        codes = np.sort(codes)

(This implies that grouping by an ordered Categorical and setting sort=True should result in the same ordering of codes. Also, that grouping by an ordered Categorical always sorts.)

With this modification, my test case seems OK both for sort=Trueand sort=False.

How to proceed

If my modification is OK (as in, breaks no tests and fits into Things), I propose to

@WillAyd WillAyd reopened this Mar 26, 2019
@kpflugshaupt
Copy link
Contributor Author

Thanks, @WillAyd. Will you do the change, or is this expected from me?
When I clone the repository, I somehow do not see the changed sources of PR #25173, even though it was closed 7 days ago. It's probably me doing something wrong, but it makes me wary of trying this myself.

(Besides, as mentioned, Paid Work)

@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2019

If you could push a PR I can take a look on the review side. Be sure to check out the contributing guide if you have trouble and you an ask specific development questions on Gitter:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

(Besides, as mentioned, Paid Work)

Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome!

kpflugshaupt added a commit to kpflugshaupt/pandas that referenced this issue Mar 26, 2019
As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. 
Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.
The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.
kpflugshaupt added a commit to kpflugshaupt/pandas that referenced this issue Mar 28, 2019
* BUG: Fix groupby on ordered Categoricals (GH25871)

As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. 
Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.
The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871)

* new test

* Fix groupby on ordered Categoricals (GH25871)

Testing all combinations of:
- ordered vs. unordered grouping column
- 'observed' True vs. False
- 'sort' True vs. False
In all cases, result group ordering must be correct. 
The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)
kpflugshaupt added a commit to kpflugshaupt/pandas that referenced this issue Mar 28, 2019
* BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1)

* BUG: Fix groupby on ordered Categoricals (GH25871)

As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. 
Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.
The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871)

* new test

* Fix groupby on ordered Categoricals (GH25871)

Testing all combinations of:
- ordered vs. unordered grouping column
- 'observed' True vs. False
- 'sort' True vs. False
In all cases, result group ordering must be correct. 
The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)

* Revert "BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1)"

This reverts commit b265349.
kpflugshaupt added a commit to kpflugshaupt/pandas that referenced this issue Mar 28, 2019
This test had an adjustment for column order when 'observed=True' is set. This hid the fact that, with that parameter set, the data columns were not actually reordered -- it was just the column group labels (analogous to index labels in pandas-dev#25871), leaving the data columns in place and out of sync. (This was not visible as the data consisted only of ones).

I've made the test more sensitive (unsyncing of data columns will be caught now) and removed the special case for 'observed=True'. As there are no unobserved categories in this case, the result should not be influenced by this parameter.
@jreback jreback added this to the 0.25.0 milestone Apr 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment