Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Categorical.unique() preserves categories #15439

Closed

Conversation

kernc
Copy link
Contributor

@kernc kernc commented Feb 17, 2017

@jreback
Copy link
Contributor

jreback commented Feb 17, 2017

you are changing Categorical.unique which is a no-no. This should only be done in .groupby

@jreback jreback added Categorical Categorical Data Type Groupby labels Feb 17, 2017
self.assert_index_equal(res.categories, exp)
tm.assert_categorical_equal(res, Categorical(exp))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing should be changing in this file

tm.assert_index_equal(df.groupby('A', sort=False).first().index, index)

# ordered=False
df = DataFrame({'A': pd.Categorical(list('ba'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC from the issue there are 4 cases? (product of sort=True/False, ordered=True/False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groupby above and below is called twice (sort=True/False) for each case of ordered. There are four assertions.

@jreback
Copy link
Contributor

jreback commented Feb 17, 2017

@kernc kernc force-pushed the Categorical.unique-nostrip-unused branch from 0d0d0bf to 8aea45f Compare February 17, 2017 16:56
tm.assert_index_equal(df.groupby('A', sort=False).first().index, index)

# ordered=False
df = DataFrame({'A': pd.Categorical(list('ba'),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groupby above and below is called twice (sort=True/False) for each case of ordered. There are four assertions.

@@ -2315,6 +2315,9 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
# groupby
else:
cat = self.grouper.unique()
cat.add_categories([c for c in self.grouper.categories
if c not in cat.categories],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How strongly would self.grouper.categories[~self.grouper.categories.isin(cat.categories)] be preferred?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like that better! (also vectorized).

as an aside, prob should add an asv for categorical grouping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of like this? I don't know what I'm doing. 8) flake8 diff surely doesn't pass now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can do:

git diff master | flake8 --diff for flake8 checking locally :>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but the asv additions cannot.

@kernc kernc force-pushed the Categorical.unique-nostrip-unused branch from bafd99d to 81a2183 Compare February 17, 2017 17:54
@@ -539,6 +539,7 @@ Bug Fixes

- Bug in ``resample``, where a non-string ```loffset`` argument would not be applied when resampling a timeseries (:issue:`13218`)

- Bug in ``.groupby`` where ```.groupby(categorical, sort=False)`` would raise ``ValueError`` due to non-matching categories (:issue:`13179`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a subsection for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A whole subsection for an obvious bug? What would the title be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this an obvious bug? a sub-section is not very long just showing what it did before and what it does now. Your whatsnew is so non-obvious now (and just changing wording is not going to help). An example is worth 1000 words.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant obvious as in the combination of sort=False and a missing category in the data causing:

ValueError: items in new_categories are not the same as in old categories

There is no other change in behavior except this now works as expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not obvious at all. you can simply use the test example, show the old behavior and the new

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Is this, by chance, a subsection of Bug Fixes? It doesn't feel like api_breaking nor a particularly mentionable enhancement ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this?

@@ -1851,7 +1851,7 @@ def unique(self):
"""

# unlike np.unique, unique1d does not sort
unique_codes = unique1d(self.codes)
unique_codes = unique1d(self.codes).astype(self.codes.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .astype(copy=False)

this is actually a bug (separate)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a ticket?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually you don't need this change. This is just for taking (which needs to be int64 anyhow; it will upcast it later if its not).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of groupby it will be upcast, but in general, .unique() should be type preserving?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, never mind. :)

@@ -2315,6 +2315,11 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
# groupby
else:
cat = self.grouper.unique()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I am wondering why we are doing any of this. isn't self.grouper.categories the end result that we want anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but in case of sort=False, some expect the result categories sorted in the order of encounter.

Copy link
Contributor

@jreback jreback Feb 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, ok for that case then (though try to use vectorized operations here). In fact probably create a method on Categorical, call it ._codes_for_grouping(sort=True/False) where you can handle this in a method (rather than clutter it up here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is vectorized, right?

Copy link
Contributor Author

@kernc kernc Feb 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.grouper can also be a CategoricalIndex. How do I get one from the other?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that is fine (though return a new object, never use inplace).

add the method to a CI as well (which just calls it on the .values), its a pass thru

@codecov-io
Copy link

codecov-io commented Feb 17, 2017

Codecov Report

Merging #15439 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15439      +/-   ##
==========================================
+ Coverage   90.37%   90.37%   +<.01%     
==========================================
  Files         135      135              
  Lines       49476    49481       +5     
==========================================
+ Hits        44713    44720       +7     
+ Misses       4763     4761       -2
Impacted Files Coverage Δ
pandas/core/groupby.py 94.98% <100%> (+0.09%)
pandas/core/categorical.py 96.87% <100%> (+0.03%)
pandas/indexes/category.py 98.59% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f62e8f2...55733b8. Read the comment docs.

@kernc kernc force-pushed the Categorical.unique-nostrip-unused branch 2 times, most recently from 8f47bd9 to a7cd756 Compare February 18, 2017 16:48
@@ -602,6 +602,21 @@ def _get_categories(self):
categories = property(fget=_get_categories, fset=_set_categories,
doc=_categories_doc)

def _codes_for_groupby(self, sort=False):
""" Return a Categorical adjusted for groupby """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can add a doc-string (Parameters, Returns), just like its a publick method.

@kernc kernc force-pushed the Categorical.unique-nostrip-unused branch from a7cd756 to 52c1bc8 Compare February 18, 2017 18:05
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good
just like to give the docs another pass

def _codes_for_groupby(self, sort):
"""
Return a Categorical adjusted for groupby

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted is not very clear ; can you make this a bit more clear (and the comments below)

pretend you have never seen this and see it from a new reader perspective

@kernc kernc force-pushed the Categorical.unique-nostrip-unused branch 2 times, most recently from 4842bc7 to 82ff3d4 Compare February 21, 2017 20:41
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I only have one question: should we have different behaviour of sort=False depending on whether the categorical is ordered or not?
Because now the sort=False seems to be ignored in case of an ordered categorical?

GroupBy on Categoricals
^^^^^^^^^^^^^^^^^^^^^^^

In previous version, ``.groupby(..., sort=False)`` would fail with a ``ValueError`` when grouping on a categorical series with some categories not appearing in the data. (:issue:`13179`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous version -> previous versions

@kernc kernc force-pushed the Categorical.unique-nostrip-unused branch from 82ff3d4 to 55733b8 Compare February 22, 2017 00:01
@jreback
Copy link
Contributor

jreback commented Feb 22, 2017

I only have one question: should we have different behaviour of sort=False depending on whether the categorical is ordered or not?
Because now the sort=False seems to be ignored in case of an ordered categorical?

hmm, @kernc can you take a look at the original issue (in the reference from the code) and see if you can divine what is happening?

@kernc
Copy link
Contributor Author

kernc commented Feb 22, 2017

Indeed, the order is retained by .unique(), most recently edited in 29f1f42, "BUG: Groupby(sort=False) with datetime-like Categorical raises ValueError" with several tests.

Semantically, I don't think the order should be lost if the categorical is ordered. The relationships 'low' < 'med' < 'high' don't simply disappear if the data contains no 'med'. Transitivity persists.
Remember that this whole debacle is due to some Gimli desiring the groups be in random order when groupby sort=False, while in reality, few could care less. 😆

This lgtm; please edit to your liking.

@jreback jreback closed this in f638550 Feb 22, 2017
@jreback
Copy link
Contributor

jreback commented Feb 22, 2017

thanks @kernc nice PR!

keep em coming!

@jreback jreback added this to the 0.20.0 milestone Feb 22, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#13179

Author: Kernc <kerncece@gmail.com>

Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits:

55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
c813146 [Kernc] PERF: add asv for categorical grouping
0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby upon categorical and sort=False triggers ValueError
4 participants