BUG: Categorical.unique() preserves categories #15439

kernc · 2017-02-17T16:06:45Z

closes BUG: groupby upon categorical and sort=False triggers ValueError #13179
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2017-02-17T16:19:33Z

you are changing Categorical.unique which is a no-no. This should only be done in .groupby

jreback · 2017-02-17T16:20:12Z

pandas/tests/test_categorical.py

        self.assert_index_equal(res.categories, exp)
-        tm.assert_categorical_equal(res, Categorical(exp))


nothing should be changing in this file

jreback · 2017-02-17T16:20:35Z

pandas/tests/groupby/test_categorical.py

+        tm.assert_index_equal(df.groupby('A', sort=False).first().index, index)
+
+        # ordered=False
+        df = DataFrame({'A': pd.Categorical(list('ba'),


IIRC from the issue there are 4 cases? (product of sort=True/False, ordered=True/False)

groupby above and below is called twice (sort=True/False) for each case of ordered. There are four assertions.

jreback · 2017-02-17T16:21:56Z

have a look here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L2314

kernc · 2017-02-17T17:04:49Z

pandas/tests/groupby/test_categorical.py

+        tm.assert_index_equal(df.groupby('A', sort=False).first().index, index)
+
+        # ordered=False
+        df = DataFrame({'A': pd.Categorical(list('ba'),


groupby above and below is called twice (sort=True/False) for each case of ordered. There are four assertions.

kernc · 2017-02-17T17:05:26Z

pandas/core/groupby.py

@@ -2315,6 +2315,9 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
                # groupby
                else:
                    cat = self.grouper.unique()
+                    cat.add_categories([c for c in self.grouper.categories
+                                        if c not in cat.categories],


How strongly would self.grouper.categories[~self.grouper.categories.isin(cat.categories)] be preferred?

i like that better! (also vectorized).

as an aside, prob should add an asv for categorical grouping.

Kind of like this? I don't know what I'm doing. 8) flake8 diff surely doesn't pass now.

you can do:

git diff master | flake8 --diff for flake8 checking locally :>

yes, but the asv additions cannot.

jreback · 2017-02-17T17:56:45Z

doc/source/whatsnew/v0.20.0.txt

@@ -539,6 +539,7 @@ Bug Fixes

 - Bug in ``resample``, where a non-string ```loffset`` argument would not be applied when resampling a timeseries (:issue:`13218`)

+- Bug in ``.groupby`` where ```.groupby(categorical, sort=False)`` would raise ``ValueError`` due to non-matching categories (:issue:`13179`)


need a subsection for this

A whole subsection for an obvious bug? What would the title be?

how is this an obvious bug? a sub-section is not very long just showing what it did before and what it does now. Your whatsnew is so non-obvious now (and just changing wording is not going to help). An example is worth 1000 words.

I meant obvious as in the combination of sort=False and a missing category in the data causing:

ValueError: items in new_categories are not the same as in old categories

There is no other change in behavior except this now works as expected?

its not obvious at all. you can simply use the test example, show the old behavior and the new

👍 Is this, by chance, a subsection of Bug Fixes? It doesn't feel like api_breaking nor a particularly mentionable enhancement ...

How is this?

jreback · 2017-02-17T17:57:55Z

pandas/core/categorical.py

@@ -1851,7 +1851,7 @@ def unique(self):
        """

        # unlike np.unique, unique1d does not sort
-        unique_codes = unique1d(self.codes)
+        unique_codes = unique1d(self.codes).astype(self.codes.dtype)


use .astype(copy=False)

this is actually a bug (separate)

Needs a ticket?

actually you don't need this change. This is just for taking (which needs to be int64 anyhow; it will upcast it later if its not).

In case of groupby it will be upcast, but in general, .unique() should be type preserving?

Ah, never mind. :)

jreback · 2017-02-17T22:05:23Z

pandas/core/groupby.py

@@ -2315,6 +2315,11 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
                # groupby
                else:
                    cat = self.grouper.unique()


hmm I am wondering why we are doing any of this. isn't self.grouper.categories the end result that we want anyway?

Yes, but in case of sort=False, some expect the result categories sorted in the order of encounter.

ahh, ok for that case then (though try to use vectorized operations here). In fact probably create a method on Categorical, call it ._codes_for_grouping(sort=True/False) where you can handle this in a method (rather than clutter it up here).

This is vectorized, right?

self.grouper can also be a CategoricalIndex. ~~How do I get one from the other?~~

yep that is fine (though return a new object, never use inplace).

add the method to a CI as well (which just calls it on the .values), its a pass thru

codecov-io · 2017-02-17T23:42:04Z

Codecov Report

Merging #15439 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15439      +/-   ##
==========================================
+ Coverage   90.37%   90.37%   +<.01%     
==========================================
  Files         135      135              
  Lines       49476    49481       +5     
==========================================
+ Hits        44713    44720       +7     
+ Misses       4763     4761       -2

Impacted Files	Coverage Δ
pandas/core/groupby.py	`94.98% <100%> (+0.09%)`	✅
pandas/core/categorical.py	`96.87% <100%> (+0.03%)`	✅
pandas/indexes/category.py	`98.59% <100%> (ø)`	✅

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f62e8f2...55733b8. Read the comment docs.

jreback · 2017-02-18T16:52:02Z

pandas/core/categorical.py

@@ -602,6 +602,21 @@ def _get_categories(self):
    categories = property(fget=_get_categories, fset=_set_categories,
                          doc=_categories_doc)

+    def _codes_for_groupby(self, sort=False):
+        """ Return a Categorical adjusted for groupby """


if you can add a doc-string (Parameters, Returns), just like its a publick method.

jreback

looks pretty good
just like to give the docs another pass

jreback · 2017-02-18T18:17:44Z

pandas/core/categorical.py

+    def _codes_for_groupby(self, sort):
+        """
+        Return a Categorical adjusted for groupby
+


adjusted is not very clear ; can you make this a bit more clear (and the comments below)

pretend you have never seen this and see it from a new reader perspective

jorisvandenbossche

Looks good!

I only have one question: should we have different behaviour of sort=False depending on whether the categorical is ordered or not?
Because now the sort=False seems to be ignored in case of an ordered categorical?

jorisvandenbossche · 2017-02-21T21:40:44Z

doc/source/whatsnew/v0.20.0.txt

+GroupBy on Categoricals
+^^^^^^^^^^^^^^^^^^^^^^^
+
+In previous version, ``.groupby(..., sort=False)`` would fail with a ``ValueError`` when grouping on a categorical series with some categories not appearing in the data. (:issue:`13179`)


previous version -> previous versions

jreback · 2017-02-22T17:30:57Z

I only have one question: should we have different behaviour of sort=False depending on whether the categorical is ordered or not?
Because now the sort=False seems to be ignored in case of an ordered categorical?

hmm, @kernc can you take a look at the original issue (in the reference from the code) and see if you can divine what is happening?

kernc · 2017-02-22T17:42:13Z

Indeed, the order is retained by .unique(), most recently edited in 29f1f42, "BUG: Groupby(sort=False) with datetime-like Categorical raises ValueError" with several tests.

Semantically, I don't think the order should be lost if the categorical is ordered. The relationships 'low' < 'med' < 'high' don't simply disappear if the data contains no 'med'. Transitivity persists.
Remember that this whole debacle is due to some Gimli desiring the groups be in random order when groupby sort=False, while in reality, few could care less. 😆

This lgtm; please edit to your liking.

jreback · 2017-02-22T18:50:44Z

thanks @kernc nice PR!

keep em coming!

closes pandas-dev#13179 Author: Kernc <kerncece@gmail.com> Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits: 55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing 2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing c813146 [Kernc] PERF: add asv for categorical grouping 0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing

jreback added Categorical Categorical Data Type Groupby labels Feb 17, 2017

jreback reviewed Feb 17, 2017

View reviewed changes

kernc force-pushed the Categorical.unique-nostrip-unused branch from 0d0d0bf to 8aea45f Compare February 17, 2017 16:56

kernc commented Feb 17, 2017

View reviewed changes

kernc force-pushed the Categorical.unique-nostrip-unused branch from bafd99d to 81a2183 Compare February 17, 2017 17:54

jreback reviewed Feb 17, 2017

View reviewed changes

kernc added 2 commits February 18, 2017 16:36

BUG: Fix .groupby(categorical, sort=False) failing

0c550e6

PERF: add asv for categorical grouping

c813146

kernc force-pushed the Categorical.unique-nostrip-unused branch 2 times, most recently from 8f47bd9 to a7cd756 Compare February 18, 2017 16:48

jreback reviewed Feb 18, 2017

View reviewed changes

kernc force-pushed the Categorical.unique-nostrip-unused branch from a7cd756 to 52c1bc8 Compare February 18, 2017 18:05

jreback approved these changes Feb 18, 2017

View reviewed changes

fixup! BUG: Fix .groupby(categorical, sort=False) failing

2aec326

kernc force-pushed the Categorical.unique-nostrip-unused branch 2 times, most recently from 4842bc7 to 82ff3d4 Compare February 21, 2017 20:41

jorisvandenbossche reviewed Feb 21, 2017

View reviewed changes

fixup! BUG: Fix .groupby(categorical, sort=False) failing

55733b8

kernc force-pushed the Categorical.unique-nostrip-unused branch from 82ff3d4 to 55733b8 Compare February 22, 2017 00:01

jreback closed this in f638550 Feb 22, 2017

jreback added this to the 0.20.0 milestone Feb 22, 2017

jorisvandenbossche mentioned this pull request Nov 14, 2017

Any categorical dtype object's .unique() changes categories #18291

Closed

		self.assert_index_equal(res.categories, exp)
		tm.assert_categorical_equal(res, Categorical(exp))

		@@ -539,6 +539,7 @@ Bug Fixes

		- Bug in ``resample``, where a non-string ```loffset`` argument would not be applied when resampling a timeseries (:issue:`13218`)

		- Bug in ``.groupby`` where ```.groupby(categorical, sort=False)`` would raise ``ValueError`` due to non-matching categories (:issue:`13179`)

BUG: Categorical.unique() preserves categories #15439

BUG: Categorical.unique() preserves categories #15439

Conversation

kernc commented Feb 17, 2017 • edited Loading

jreback commented Feb 17, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Feb 17, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback Feb 18, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kernc Feb 18, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Feb 17, 2017 • edited Loading

Codecov Report

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Feb 22, 2017

kernc commented Feb 22, 2017 • edited Loading

jreback commented Feb 22, 2017

kernc commented Feb 17, 2017 •

edited

Loading

jreback Feb 18, 2017 •

edited

Loading

kernc Feb 18, 2017 •

edited

Loading

codecov-io commented Feb 17, 2017 •

edited

Loading

kernc commented Feb 22, 2017 •

edited

Loading