BUG: Fix Categorical comparsion with Series of dtype 'category' #16667

funnycrab · 2017-06-11T13:51:23Z

This is a fix attempt for issue #16659.

closes #xxxx
tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
whatsnew entry

jreback · 2017-06-11T15:46:15Z

pandas/core/categorical.py

        try:
-            return (self.categories.equals(other.categories) and
-                    self.ordered == other.ordered)


instead simply

other = Categorical(other)

jreback · 2017-06-11T15:46:42Z

pandas/tests/test_categorical.py

@@ -152,6 +152,11 @@ def test_is_equal_dtype(self):
            CategoricalIndex(c1, categories=list('cab'))))
        assert not c1.is_dtype_equal(CategoricalIndex(c1, ordered=True))

+        s1 = pd.Series(c1)


add the issue as a comment

jreback · 2017-06-11T15:47:42Z

doc/source/whatsnew/v0.21.0.txt

@@ -130,6 +130,8 @@ Numeric
 Categorical
 ^^^^^^^^^^^

+- Bug in ``Categorical.is_dtype_equal()`` where comparison with Series whose dtype is 'category' is not handled correctly (:issue:`16659`)


comparison with a Series with dtype='category'.

remove the 'is not handled correctly' (that's what the Bug indicates).

jreback · 2017-06-11T17:03:28Z

pandas/core/categorical.py

-                other_categorical = other.values
-            else:
-                other_categorical = other
+                other = Categorical(other)


you can always do this no if needed (pass copy=False)

jreback · 2017-06-11T17:04:20Z

pandas/core/categorical.py

+            from pandas.core.series import Series
+
+            if isinstance(other, Series):
+                other = Categorical(other)


or even better just do this if not Categorical or CategoricalIndex

Thanks for your guidance.

It seems that the following code

if not isinstance(other, Categorical) and not isinstance(CategoricalIndex): other = Categorical(other)

will lead to the failure of following test case

assert not c1.is_dtype_equal(Index(list('aabca')))

jreback · 2017-06-11T17:05:49Z

pandas/tests/test_categorical.py

+        # GH 16659
+        s1 = pd.Series(c1)
+        assert c1.is_dtype_equal(s1)
+        assert not c2.is_dtype_equal(s1)


can u add some tests for things that are not is_dryoe_equal (i don't remember if these are sufficiently covered)

e.g. pass in scalers, ndarray, Dataframe

and cycle thru all Indexes (except CI)

can you update for this

codecov · 2017-06-11T17:26:28Z

Codecov Report

Merging #16667 into master will decrease coverage by 0.42%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16667      +/-   ##
==========================================
- Coverage   91.43%   91.01%   -0.43%     
==========================================
  Files         163      161       -2     
  Lines       50091    49353     -738     
==========================================
- Hits        45800    44917     -883     
- Misses       4291     4436     +145

Flag	Coverage Δ
#multiple	`88.78% <100%> (-0.44%)`	⬇️
#single	`40.26% <66.66%> (-0.16%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/categorical.py	`95.48% <100%> (-0.32%)`	⬇️
pandas/io/s3.py	`0% <0%> (-85%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/tests/plotting/__init__.py	`77.77% <0%> (-22.23%)`	⬇️
pandas/core/tools/datetimes.py	`66.94% <0%> (-17.55%)`	⬇️
pandas/util/_decorators.py	`65.3% <0%> (-15.4%)`	⬇️
pandas/compat/pickle_compat.py	`69.51% <0%> (-6.1%)`	⬇️
pandas/core/generic.py	`92% <0%> (-3.72%)`	⬇️
pandas/core/dtypes/missing.py	`87.03% <0%> (-3.59%)`	⬇️
pandas/core/indexes/range.py	`92.81% <0%> (-2.85%)`	⬇️
... and 99 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96a5274...2771e57. Read the comment docs.

TomAugspurger · 2017-06-12T21:36:20Z

doc/source/whatsnew/v0.21.0.txt

@@ -130,6 +130,8 @@ Numeric
 Categorical
 ^^^^^^^^^^^

+- Bug in ``Categorical.is_dtype_equal()`` where comparison with a Series with dtype='category' (:issue:`16659`)


- Bug in ``Categorical.is_dtype_equal()`` where comparison with a Series with `'category'` dtype incorrectly returned False (:issue:`16659`)

TomAugspurger · 2017-06-12T21:40:34Z

pandas/core/categorical.py

        try:
+            from pandas.core.series import Series
+
+            if isinstance(other, Series):


how about

if is_categorical(other): other = Categorical(other)

Thank you for the advice. However, this seems not to work either.

It will at least make the following test case to fail,

assert c3.is_dtype_equal(c3)

One of the reason is that the new instance created by Categorical(c3) will not preserve the attribute ordered of c3. Not quite sure whether this is by design.

that doesn't sound right. an u show a self contained repro

Ah, that makes sense I guess.

In [5]: pd.Categorical(pd.Categorical([0, 1], ordered=True)) Out[5]: [0, 1] Categories (2, int64): [0, 1]

so not ordered. The default for the constructor is ordered=False.

I would say just explicitly check for the known cases of Categorical, Series w/ category dtype, and CategoricalIndex and just extract the categories and ordered off of them and build a new other = Categorical(categories, ordered=ordered).

Once I finish #16015 this will be much simpler (just use .dtype), but that won't be for a few weeks probably.

no this is a bug
could be fixed in his PR or independently before this PR

don't work around this pls put in place the correct patch

Doing Categorical(thing, ordered=False) and having it come out ordered would be very odd. If we do this, we would need to change the default to ordered=None.

yep I agree.

@jreback With what? With that Categorical(thing, ordered=False) should return ordered False? Or with changing the default to None ?

I don't find this necessarily a bug. Personally I would find keeping ordered=False as the default much clearer.

the bug is this.

This should take the ordered flag from the passed Categorical/CategoricalIndex

In [2]: c = pd.Categorical(list('bac'), ordered=True) In [3]: c Out[3]: [b, a, c] Categories (3, object): [a < b < c] In [4]: pd.Categorical(c) Out[4]: [b, a, c] Categories (3, object): [a, b, c]

so it needs to default to ordered=None. If its not specified it can take the attribute of a passed categorical (we do this already for categories). otherwise, will be False

jreback · 2017-06-30T20:50:12Z

can you update

funnycrab · 2017-07-01T16:24:54Z

I was thinking that an agreement was yet to achieve.

So, I will try to fix the two bugs mentioned above in this PR altogether.

jreback · 2017-07-19T10:32:07Z

can you rebase / update

#16667 (comment) could be done as a separate PR prior to this (or here).

funnycrab · 2017-07-20T17:11:55Z

Sorry for late reply.

So, by "rebase", do you mean that I can make a new branch from current topic branch and fix #16667 (comment) on that new branch and rebase its content to the master first? Something like,

git rebase --onto master current_topic_branch new_branch

Or just create a new separate PR to fix #16667 (comment), and have it merged into master first. Then rebase the topic branch in this PR after that commit (on master)?

Thanks for your guidance.

gfyoung · 2017-07-20T17:17:53Z

So, by "rebase", do you mean that I can make a new branch from current topic branch and fix #16667 (comment) on that new branch and rebase its content to the master first?

Rebase means the following:

Take your current branch and rewind your changes
Fast-forward the rewinded branch to the current state of master
Reapply your changes on the updated version of the branch

To do this, just to the following:

git rebase master

No new branch is needed, nor is there any need for a new PR

gfyoung · 2017-07-20T17:20:37Z

Or just create a new separate PR to fix #16667 (comment), and have it merged into master first. Then rebase the topic branch in this PR after that commit (on master)?

So this is an option that @jreback presented to you in terms of addressing his comment. You are also welcome to address the issue directly in this PR (so you wouldn't need to create a new one for it).

gfyoung · 2017-07-20T17:21:05Z

Sorry for late reply.

No worries! We've had to wait MUCH longer than this for PR's before 😄

funnycrab · 2017-07-22T04:09:39Z

@gfyoung

Thanks for you detailed explanation. I was overthinking about the rebase.

So what I have done is as follows.

Update my upstream/master with

git fetch upstream

Switch to local master branch and update it

git checkout master
git merge upstream/master

Switch back to local branch fix_bug_in_categorical_comparison

git checkout fix_bug_in_categorical_comparison

Rebase it to fresh master branch

git rebase master

Finally, push to origin/fix_bug_in_categorical_comparison with force

git push --force origin fix_bug_in_categorical_comparison

Since I am the only one working on this branch, I believe the force push here should not cause any trouble. Please let me know if this is not considered a good practice. Thank you.

gfyoung · 2017-07-22T04:14:00Z

@funnycrab : That should be just fine!

funnycrab · 2017-07-22T04:50:06Z

@gfyoung
OK. Thank you. I will then focus on fixing the real issue.

jreback · 2017-09-23T16:53:39Z

can you rebase

pep8speaks · 2017-09-25T16:30:27Z

Hello @funnycrab! Thanks for updating the PR.

In the file pandas/core/categorical.py, following are the PEP8 issues :

Line 2029:1: E112 expected an indented block
Line 2029:1: E305 expected 2 blank lines after class or function definition, found 0
Line 2029:3: E227 missing whitespace around bitwise or shift operator
Line 2029:7: E225 missing whitespace around operator
Line 2031:3: E225 missing whitespace around operator
Line 2031:5: E225 missing whitespace around operator
Line 2031:7: E225 missing whitespace around operator
Line 2032:13: E113 unexpected indentation
Line 2037:3: E227 missing whitespace around bitwise or shift operator
Line 2037:7: E225 missing whitespace around operator
Line 2038:13: E113 unexpected indentation
Line 2040:7: E225 missing whitespace around operator
Line 2041:3: E225 missing whitespace around operator
Line 2041:5: E225 missing whitespace around operator
Line 2041:7: E225 missing whitespace around operator
Line 2042:13: E113 unexpected indentation
Line 2044:7: E225 missing whitespace around operator
Line 2044:80: E501 line too long (114 > 79 characters)

funnycrab · 2017-09-25T16:37:11Z

@jreback

Sorry for the late reply. During the rebasing, there are some conflicts in the pandas/core/categorical.py. I just rebased without really resolving the conflicts. So the commit 0697610 is a broken commit. I will try to fix the issue as soon as possible.

TomAugspurger · 2017-09-25T16:52:23Z

Apologies for the conflicts @funnycrab. Let me know if you need help cleaning them up.

funnycrab · 2017-09-26T16:02:05Z

@TomAugspurger

It is OK. Actually, it is my bad for keeping this issue open for such a long time. Hopefully, I can fix this early October.

Will let you know when clarification is needed. Thanks in advance!

jreback · 2017-09-26T18:59:49Z

pandas/core/categorical.py

+            if isinstance(other, Series):
+                other = Categorical(other)
+
+<<<<<<< a581a743fe6740011e4fb0a7031ee92ce57b480b


merge issues

jreback · 2017-09-26T19:00:01Z

pandas/tests/test_categorical.py

+        # GH 16659
+        s1 = pd.Series(c1)
+        assert c1.is_dtype_equal(s1)
+        assert not c2.is_dtype_equal(s1)


can you update for this

jreback · 2017-11-10T20:20:29Z

can you rebase / update

This is a fix attempt for issue pandas-dev#16659.

…d tighten the wording in doc whatsnew

jreback · 2017-11-26T23:44:55Z

can you rebase

jreback · 2017-12-10T23:31:36Z

closing as stale, but if you want to continue working ping and we can reopen

jreback requested changes Jun 11, 2017

View reviewed changes

jreback reviewed Jun 11, 2017

View reviewed changes

jreback added Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Jun 11, 2017

jreback requested changes Jun 11, 2017

View reviewed changes

TomAugspurger reviewed Jun 12, 2017

View reviewed changes

TomAugspurger added this to the 0.21.0 milestone Jun 30, 2017

jreback removed this from the 0.21.0 milestone Jun 30, 2017

funnycrab force-pushed the fix_bug_in_categorical_comparison branch from 2581275 to 2fb416d Compare July 22, 2017 03:51

funnycrab force-pushed the fix_bug_in_categorical_comparison branch from 2fb416d to 0697610 Compare September 25, 2017 16:30

jreback requested changes Sep 26, 2017

View reviewed changes

BUG: Fix Categorical comparsion with Series of dtype 'category'

a1099a4

This is a fix attempt for issue pandas-dev#16659.

simplify the fix, add issue reference number to corresponding test an…

2771e57

…d tighten the wording in doc whatsnew

funnycrab force-pushed the fix_bug_in_categorical_comparison branch from 0697610 to 2771e57 Compare November 12, 2017 02:55

jreback closed this Dec 10, 2017

This pull request was closed.

BUG: Fix Categorical comparsion with Series of dtype 'category' #16667

BUG: Fix Categorical comparsion with Series of dtype 'category' #16667

Conversation

funnycrab commented Jun 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback Jun 11, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Jun 11, 2017 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jun 30, 2017

funnycrab commented Jul 1, 2017

jreback commented Jul 19, 2017

funnycrab commented Jul 20, 2017

gfyoung commented Jul 20, 2017

gfyoung commented Jul 20, 2017

gfyoung commented Jul 20, 2017

funnycrab commented Jul 22, 2017

gfyoung commented Jul 22, 2017

funnycrab commented Jul 22, 2017

jreback commented Sep 23, 2017

pep8speaks commented Sep 25, 2017

funnycrab commented Sep 25, 2017

TomAugspurger commented Sep 25, 2017

funnycrab commented Sep 26, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Nov 10, 2017

jreback commented Nov 26, 2017

jreback commented Dec 10, 2017

funnycrab commented Jun 11, 2017 •

edited

Loading

jreback Jun 11, 2017 •

edited

Loading

codecov bot commented Jun 11, 2017 •

edited

Loading