groupby aggregation on ordered Categorical with 'observed=True' breaks order #25871
import pandas as pd

# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
    dict(
        cat=pd.Series(
            [3, 1, 2, 1, 3, 2],
            dtype=pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True),
        ),
        val=pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0]),
    )
)
Including unobserved categories gives correct groups:
# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')
Excluding unobserved categories changes the order, so the groups are wrong:
# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')
The sample code shows that grouping with an ordered factor does not respect the factor's order when
Related issues: #25167
df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')
both give the same, wrong result as shown above.
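Until a fix lands, one possible workaround is to sort the aggregated result explicitly. The sketch below is only an illustration, not part of the original report: it rebuilds the example frame and calls `sort_index()` on the result, relying on the fact that sorting an ordered categorical index follows the category order. The variable names `cat_dtype`, `group_labels` and `group_sums` are made up for this example.

```python
import pandas as pd

# Same data as in the report: ordered categories 1..4, category 4 never observed.
cat_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True)
df = pd.DataFrame(
    {
        "cat": pd.Series([3, 1, 2, 1, 3, 2], dtype=cat_dtype),
        "val": [1.5, 0.5, 1.0, 0.5, 1.5, 1.0],
    }
)

# Aggregate over observed categories only, then restore the category order
# explicitly instead of relying on the order groupby returns.
result = df.groupby("cat", observed=True)["val"].agg("sum").sort_index()

group_labels = list(result.index)  # labels in category order
group_sums = list(result)          # sums aligned with those labels
print(result)
```

Because the sort happens after the aggregation, this does not depend on whether the installed pandas version already returns the groups in category order.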
I have dug further and found that this issue is not a duplicate of #25167:
The fix for #25167 does not fix this one completely
The PR #25173 fixes #25167. So, I applied the two code changes from that PR to my local installation (files grouper.py and categorical.py in pandas/core/groupby/).
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')
after installing PR #25173.
If I modify it a bit, PR #25173 also fixes my problem
The relevant change from PR #25173 that fixes my
if sort:
    codes = np.sort(codes)
If I extend the condition and also check for the grouper being ordered, the
if sort or self.grouper.ordered:
    codes = np.sort(codes)
(This implies that grouping by an ordered Categorical and setting
With this modification, my test case seems OK both for
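To see why sorting the codes restores the category order, here is a small standalone sketch. It does not reproduce the pandas internals touched by PR #25173; it only assumes that the observed codes are collected in order of first appearance, which `pd.unique` emulates here. The names `observed_codes`, `sorted_codes`, `unsorted_groups` and `sorted_groups` are illustrative.

```python
import numpy as np
import pandas as pd

# The grouping column from the report as a bare ordered Categorical.
cat = pd.Categorical([3, 1, 2, 1, 3, 2], categories=[1, 2, 3, 4], ordered=True)

# Each value is stored as an integer code, i.e. its position in `categories`.
# Collecting the observed codes in order of first appearance gives 2, 0, 1.
observed_codes = pd.unique(cat.codes)

# Sorting the codes puts them back into category order: 0, 1, 2.
sorted_codes = np.sort(observed_codes)

# Map both code sequences back to category labels.
unsorted_groups = list(cat.categories.take(observed_codes))
sorted_groups = list(cat.categories.take(sorted_codes))

print(unsorted_groups)  # order of first appearance in the data
print(sorted_groups)    # order defined by the categories
```

Since a code is just the position of a category in the ordered category list, sorting codes and sorting by category order are the same operation, which is why the extra condition on the grouper being ordered is enough in this test case.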
How to proceed
If my modification is OK (as in, breaks no tests and fits into Things), I propose to
Thanks, @WillAyd. Will you do the change, or is this expected from me?
(Besides, as mentioned, Paid Work)
If you could push a PR, I can take a look on the review side. Be sure to check out the contributing guide if you have trouble, and you can ask specific development questions on Gitter:
Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome!