Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix Categorical comparsion with Series of dtype 'category' #16667

Closed

Conversation

funnycrab
Copy link
Contributor

@funnycrab funnycrab commented Jun 11, 2017

This is a fix attempt for issue #16659.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
  • whatsnew entry

try:
return (self.categories.equals(other.categories) and
self.ordered == other.ordered)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead simply

other = Categorical(other)

@@ -152,6 +152,11 @@ def test_is_equal_dtype(self):
CategoricalIndex(c1, categories=list('cab'))))
assert not c1.is_dtype_equal(CategoricalIndex(c1, ordered=True))

s1 = pd.Series(c1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue as a comment

@@ -130,6 +130,8 @@ Numeric
Categorical
^^^^^^^^^^^

- Bug in ``Categorical.is_dtype_equal()`` where comparison with Series whose dtype is 'category' is not handled correctly (:issue:`16659`)
Copy link
Contributor

@jreback jreback Jun 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparison with a Series with dtype='category'.

remove the 'is not handled correctly' (that's what the Bug indicates).

@jreback jreback added Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Jun 11, 2017
other_categorical = other.values
else:
other_categorical = other
other = Categorical(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can always do this no if needed (pass copy=False)

from pandas.core.series import Series

if isinstance(other, Series):
other = Categorical(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even better just do this if not Categorical or CategoricalIndex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your guidance.

It seems that the following code

if not isinstance(other, Categorical) and not isinstance(CategoricalIndex):
    other = Categorical(other)

will lead to the failure of following test case

assert not c1.is_dtype_equal(Index(list('aabca')))

# GH 16659
s1 = pd.Series(c1)
assert c1.is_dtype_equal(s1)
assert not c2.is_dtype_equal(s1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add some tests for things that are not is_dryoe_equal (i don't remember if these are sufficiently covered)

e.g. pass in scalers, ndarray, Dataframe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and cycle thru all Indexes (except CI)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update for this

@codecov
Copy link

codecov bot commented Jun 11, 2017

Codecov Report

Merging #16667 into master will decrease coverage by 0.42%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16667      +/-   ##
==========================================
- Coverage   91.43%   91.01%   -0.43%     
==========================================
  Files         163      161       -2     
  Lines       50091    49353     -738     
==========================================
- Hits        45800    44917     -883     
- Misses       4291     4436     +145
Flag Coverage Δ
#multiple 88.78% <100%> (-0.44%) ⬇️
#single 40.26% <66.66%> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.48% <100%> (-0.32%) ⬇️
pandas/io/s3.py 0% <0%> (-85%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/tests/plotting/__init__.py 77.77% <0%> (-22.23%) ⬇️
pandas/core/tools/datetimes.py 66.94% <0%> (-17.55%) ⬇️
pandas/util/_decorators.py 65.3% <0%> (-15.4%) ⬇️
pandas/compat/pickle_compat.py 69.51% <0%> (-6.1%) ⬇️
pandas/core/generic.py 92% <0%> (-3.72%) ⬇️
pandas/core/dtypes/missing.py 87.03% <0%> (-3.59%) ⬇️
pandas/core/indexes/range.py 92.81% <0%> (-2.85%) ⬇️
... and 99 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96a5274...2771e57. Read the comment docs.

@@ -130,6 +130,8 @@ Numeric
Categorical
^^^^^^^^^^^

- Bug in ``Categorical.is_dtype_equal()`` where comparison with a Series with dtype='category' (:issue:`16659`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- Bug in ``Categorical.is_dtype_equal()`` where comparison with a Series with `'category'` dtype incorrectly returned False (:issue:`16659`)

try:
from pandas.core.series import Series

if isinstance(other, Series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

if is_categorical(other):
    other = Categorical(other)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the advice. However, this seems not to work either.

It will at least make the following test case to fail,

assert c3.is_dtype_equal(c3)

One of the reason is that the new instance created by Categorical(c3) will not preserve the attribute ordered of c3. Not quite sure whether this is by design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't sound right. an u show a self contained repro

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense I guess.

In [5]: pd.Categorical(pd.Categorical([0, 1], ordered=True))
Out[5]:
[0, 1]
Categories (2, int64): [0, 1]

so not ordered. The default for the constructor is ordered=False.

I would say just explicitly check for the known cases of Categorical, Series w/ category dtype, and CategoricalIndex and just extract the categories and ordered off of them and build a new other = Categorical(categories, ordered=ordered).

Once I finish #16015 this will be much simpler (just use .dtype), but that won't be for a few weeks probably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is a bug
could be fixed in his PR or independently before this PR

don't work around this pls put in place the correct patch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing Categorical(thing, ordered=False) and having it come out ordered would be very odd. If we do this, we would need to change the default to ordered=None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep I agree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback With what? With that Categorical(thing, ordered=False) should return ordered False? Or with changing the default to None ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this necessarily a bug. Personally I would find keeping ordered=False as the default much clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the bug is this.

This should take the ordered flag from the passed Categorical/CategoricalIndex

In [2]: c = pd.Categorical(list('bac'), ordered=True)

In [3]: c
Out[3]: 
[b, a, c]
Categories (3, object): [a < b < c]

In [4]: pd.Categorical(c)
Out[4]: 
[b, a, c]
Categories (3, object): [a, b, c]

so it needs to default to ordered=None. If its not specified it can take the attribute of a passed categorical (we do this already for categories). otherwise, will be False

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Jun 30, 2017
@jreback jreback removed this from the 0.21.0 milestone Jun 30, 2017
@jreback
Copy link
Contributor

jreback commented Jun 30, 2017

can you update

@funnycrab
Copy link
Contributor Author

I was thinking that an agreement was yet to achieve.

So, I will try to fix the two bugs mentioned above in this PR altogether.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2017

can you rebase / update

#16667 (comment) could be done as a separate PR prior to this (or here).

@funnycrab
Copy link
Contributor Author

Sorry for late reply.

So, by "rebase", do you mean that I can make a new branch from current topic branch and fix #16667 (comment) on that new branch and rebase its content to the master first? Something like,

git rebase --onto master current_topic_branch new_branch

Or just create a new separate PR to fix #16667 (comment), and have it merged into master first. Then rebase the topic branch in this PR after that commit (on master)?

Thanks for your guidance.

@gfyoung
Copy link
Member

gfyoung commented Jul 20, 2017

So, by "rebase", do you mean that I can make a new branch from current topic branch and fix #16667 (comment) on that new branch and rebase its content to the master first?

Rebase means the following:

  1. Take your current branch and rewind your changes
  2. Fast-forward the rewinded branch to the current state of master
  3. Reapply your changes on the updated version of the branch

To do this, just to the following:

git rebase master

No new branch is needed, nor is there any need for a new PR

@gfyoung
Copy link
Member

gfyoung commented Jul 20, 2017

Or just create a new separate PR to fix #16667 (comment), and have it merged into master first. Then rebase the topic branch in this PR after that commit (on master)?

So this is an option that @jreback presented to you in terms of addressing his comment. You are also welcome to address the issue directly in this PR (so you wouldn't need to create a new one for it).

@gfyoung
Copy link
Member

gfyoung commented Jul 20, 2017

Sorry for late reply.

No worries! We've had to wait MUCH longer than this for PR's before 😄

@funnycrab funnycrab force-pushed the fix_bug_in_categorical_comparison branch from 2581275 to 2fb416d Compare July 22, 2017 03:51
@funnycrab
Copy link
Contributor Author

@gfyoung

Thanks for you detailed explanation. I was overthinking about the rebase.

So what I have done is as follows.

Update my upstream/master with

git fetch upstream

Switch to local master branch and update it

git checkout master
git merge upstream/master

Switch back to local branch fix_bug_in_categorical_comparison

git checkout fix_bug_in_categorical_comparison

Rebase it to fresh master branch

git rebase master

Finally, push to origin/fix_bug_in_categorical_comparison with force

git push --force origin fix_bug_in_categorical_comparison

Since I am the only one working on this branch, I believe the force push here should not cause any trouble. Please let me know if this is not considered a good practice. Thank you.

@gfyoung
Copy link
Member

gfyoung commented Jul 22, 2017

@funnycrab : That should be just fine!

@funnycrab
Copy link
Contributor Author

@gfyoung
OK. Thank you. I will then focus on fixing the real issue.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

can you rebase

@funnycrab funnycrab force-pushed the fix_bug_in_categorical_comparison branch from 2fb416d to 0697610 Compare September 25, 2017 16:30
@pep8speaks
Copy link

Hello @funnycrab! Thanks for updating the PR.

Line 2029:1: E112 expected an indented block
Line 2029:1: E305 expected 2 blank lines after class or function definition, found 0
Line 2029:3: E227 missing whitespace around bitwise or shift operator
Line 2029:7: E225 missing whitespace around operator
Line 2031:3: E225 missing whitespace around operator
Line 2031:5: E225 missing whitespace around operator
Line 2031:7: E225 missing whitespace around operator
Line 2032:13: E113 unexpected indentation
Line 2037:3: E227 missing whitespace around bitwise or shift operator
Line 2037:7: E225 missing whitespace around operator
Line 2038:13: E113 unexpected indentation
Line 2040:7: E225 missing whitespace around operator
Line 2041:3: E225 missing whitespace around operator
Line 2041:5: E225 missing whitespace around operator
Line 2041:7: E225 missing whitespace around operator
Line 2042:13: E113 unexpected indentation
Line 2044:7: E225 missing whitespace around operator
Line 2044:80: E501 line too long (114 > 79 characters)

@funnycrab
Copy link
Contributor Author

@jreback

Sorry for the late reply. During the rebasing, there are some conflicts in the pandas/core/categorical.py. I just rebased without really resolving the conflicts. So the commit 0697610 is a broken commit. I will try to fix the issue as soon as possible.

@TomAugspurger
Copy link
Contributor

Apologies for the conflicts @funnycrab. Let me know if you need help cleaning them up.

@funnycrab
Copy link
Contributor Author

@TomAugspurger

It is OK. Actually, it is my bad for keeping this issue open for such a long time. Hopefully, I can fix this early October.

Will let you know when clarification is needed. Thanks in advance!

if isinstance(other, Series):
other = Categorical(other)

<<<<<<< a581a743fe6740011e4fb0a7031ee92ce57b480b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge issues

# GH 16659
s1 = pd.Series(c1)
assert c1.is_dtype_equal(s1)
assert not c2.is_dtype_equal(s1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update for this

@jreback
Copy link
Contributor

jreback commented Nov 10, 2017

can you rebase / update

@funnycrab funnycrab force-pushed the fix_bug_in_categorical_comparison branch from 0697610 to 2771e57 Compare November 12, 2017 02:55
@jreback
Copy link
Contributor

jreback commented Nov 26, 2017

can you rebase

@jreback
Copy link
Contributor

jreback commented Dec 10, 2017

closing as stale, but if you want to continue working ping and we can reopen

@jreback jreback closed this Dec 10, 2017
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants