BUG/API: Unordered Categorical should ignore order in comparisons? #16014

Closed
TomAugspurger opened this Issue Apr 16, 2017 · 9 comments

Comments

Projects
None yet
4 participants
@TomAugspurger
Contributor

TomAugspurger commented Apr 16, 2017

I think that when comparing two unordered categorical-dtyped series with categories which differ only by ordered should compare equal

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd
c
In [2]: c1 = pd.Series(pd.Categorical(['a', 'b']))

In [3]: c2 = pd.Series(pd.Categorical(['a', 'b'], categories=['b', 'a']))

In [4]: c1
Out[4]:
0    a
1    b
dtype: category
Categories (2, object): [a, b]

In [5]: c2
Out[5]:
0    a
1    b
dtype: category
Categories (2, object): [b, a]

In [6]: c1 == c2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-d8d43a43a02a> in <module>()
----> 1 c1 == c2

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
    811                 msg = 'Can only compare identically-labeled Series objects'
    812                 raise ValueError(msg)
--> 813             return self._constructor(na_op(self.values, other.values),
    814                                      index=self.index, name=name)
    815         elif isinstance(other, pd.DataFrame):  # pragma: no cover

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
    752         # in either operand
    753         if is_categorical_dtype(x):
--> 754             return op(x, y)
    755         elif is_categorical_dtype(y) and not isscalar(y):
    756             return op(y, x)

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/categorical.py in f(self, other)
     55             if ((len(self.categories) != len(other.categories)) or
     56                     not ((self.categories == other.categories).all())):
---> 57                 raise TypeError("Categoricals can only be compared if "
     58                                 "'categories' are the same")
     59             if not (self.ordered == other.ordered):

TypeError: Categoricals can only be compared if 'categories' are the same

Expected Output

I think this should return True. Unordered categories shouldn't care about the order :)

cc @janschulz thoughts?

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Apr 16, 2017

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 16, 2017

Contributor

yes this seems reasonable. creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

if you change this how much breakage on tests? (IOW this is going to affect is_dtype_equal as well)

Contributor

jreback commented Apr 16, 2017

yes this seems reasonable. creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

if you change this how much breakage on tests? (IOW this is going to affect is_dtype_equal as well)

@jankatins

This comment has been minimized.

Show comment
Hide comment
@jankatins

jankatins Apr 16, 2017

Contributor

Once upon a time, categoricals were modeled after factor in R and there, levels (=categories) were sorted:

> levels(factor(c('b','a', 'c')) )
[1] "a" "b" "c"

In that case, both cases would result in the same categories.

I think the sorting was removed ("categories in order of apearance") and this was overlooked? [edit] None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation.[/]

Contributor

jankatins commented Apr 16, 2017

Once upon a time, categoricals were modeled after factor in R and there, levels (=categories) were sorted:

> levels(factor(c('b','a', 'c')) )
[1] "a" "b" "c"

In that case, both cases would result in the same categories.

I think the sorting was removed ("categories in order of apearance") and this was overlooked? [edit] None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation.[/]

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Apr 16, 2017

Member

creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

@jreback no, when no categories are specified, the found values are sorted:

In [1]: pd.Categorical(['b', 'a'])
Out[1]:
[b, a]
Categories (2, object): [a, b]

None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation

It just didn't happen AFAIK :-)

Member

jorisvandenbossche commented Apr 16, 2017

creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

@jreback no, when no categories are specified, the found values are sorted:

In [1]: pd.Categorical(['b', 'a'])
Out[1]:
[b, a]
Categories (2, object): [a, b]

None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation

It just didn't happen AFAIK :-)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 20, 2017

Contributor

right, forgot we sort :>

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

For ordered=False

  • I think leaving the passed categories as is is fine (e.g. no sorting)
  • a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

But these all should be equivalent
i

Contributor

jreback commented Apr 20, 2017

right, forgot we sort :>

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

For ordered=False

  • I think leaving the passed categories as is is fine (e.g. no sorting)
  • a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

But these all should be equivalent
i

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Apr 20, 2017

Contributor

I'll play around with it when I'm doing the CategoricalDtype stuff. For ease of implementation, we'll have to do something. Having an "ordering" function that deterministically orders a given set of categories is necessary so that the codes match up.

In [8]: pd.Categorical(['b', 'a', 1])
Out[8]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [9]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[9]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]
Contributor

TomAugspurger commented Apr 20, 2017

I'll play around with it when I'm doing the CategoricalDtype stuff. For ease of implementation, we'll have to do something. Having an "ordering" function that deterministically orders a given set of categories is necessary so that the codes match up.

In [8]: pd.Categorical(['b', 'a', 1])
Out[8]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [9]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[9]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]
@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Apr 20, 2017

Member

If we are comparing to R, they do not care about order of categories when they are not ordered:

> f1 <- factor(c('b','a', 'c'))
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'))
> f1
[1] b a c
Levels: a b c
> f2
[1] b a c
Levels: b a c
> f1 == f2
[1] TRUE TRUE TRUE

but ... they also don't care when they are ordered:

> f1 <- factor(c('b','a', 'c'), ordered = T)
> f1
[1] b a c
Levels: a < b < c
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'), ordered = T)
> f2
[1] b a c
Levels: b < a < c
> f1 == f2
[1] TRUE TRUE TRUE

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

I certainly think we should not change the current behaviour, which is have sorted categories when you do not specify them manually.

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

BTW, there is no difference between the ordered is False vs True here. In both cases, if you do not specify categories yourself, they are derived from the unique values and sorted:

In [3]: pd.Categorical(['b', 'a'])
Out[3]: 
[b, a]
Categories (2, object): [a, b]

In [4]: pd.Categorical(['b', 'a'], ordered=True)
Out[4]: 
[b, a]
Categories (2, object): [a < b]
Member

jorisvandenbossche commented Apr 20, 2017

If we are comparing to R, they do not care about order of categories when they are not ordered:

> f1 <- factor(c('b','a', 'c'))
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'))
> f1
[1] b a c
Levels: a b c
> f2
[1] b a c
Levels: b a c
> f1 == f2
[1] TRUE TRUE TRUE

but ... they also don't care when they are ordered:

> f1 <- factor(c('b','a', 'c'), ordered = T)
> f1
[1] b a c
Levels: a < b < c
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'), ordered = T)
> f2
[1] b a c
Levels: b < a < c
> f1 == f2
[1] TRUE TRUE TRUE

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

I certainly think we should not change the current behaviour, which is have sorted categories when you do not specify them manually.

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

BTW, there is no difference between the ordered is False vs True here. In both cases, if you do not specify categories yourself, they are derived from the unique values and sorted:

In [3]: pd.Categorical(['b', 'a'])
Out[3]: 
[b, a]
Categories (2, object): [a, b]

In [4]: pd.Categorical(['b', 'a'], ordered=True)
Out[4]: 
[b, a]
Categories (2, object): [a < b]
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 20, 2017

Contributor

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

this is all fine, but then they should compare equally. The point of ``ordered=False` is the don't care about the ordering (so its a set rather than a list comparison).

Contributor

jreback commented Apr 20, 2017

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

this is all fine, but then they should compare equally. The point of ``ordered=False` is the don't care about the ordering (so its a set rather than a list comparison).

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Apr 20, 2017

Contributor

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

Agreed

they are derived from the unique values and sorted:

Well... usually :)

In [1]: import pandas as pd

In [2]: pd.Categorical(['b', 'a', 1])
Out[2]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [3]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[3]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

Contributor

TomAugspurger commented Apr 20, 2017

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

Agreed

they are derived from the unique values and sorted:

Well... usually :)

In [1]: import pandas as pd

In [2]: pd.Categorical(['b', 'a', 1])
Out[2]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [3]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[3]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Apr 20, 2017

Member

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

Yep!

Your corner case is a bit strange in that in the first case it seems to be sorted and in the second not, while in both cases the values are not sortable (eg if you would put them in an index or series and sort, you get an error).

Member

jorisvandenbossche commented Apr 20, 2017

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

Yep!

Your corner case is a bit strange in that in the first case it seems to be sorted and in the second not, while in both cases the values are not sortable (eg if you would put them in an index or series and sort, you get an error).

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017

BUG: Categorical comparison with unordered
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017

BUG: Categorical comparison with unordered
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017

BUG: Categorical comparison with unordered
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

@jreback jreback modified the milestones: 0.20.2, Next Major Release May 12, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 15, 2017

BUG: Categorical comparison with unordered
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 18, 2017

BUG: Categorical comparison with unordered
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

TomAugspurger added a commit that referenced this issue May 18, 2017

BUG: Categorical comparison with unordered (#16339)
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

pcluo added a commit to pcluo/pandas that referenced this issue May 22, 2017

BUG: Categorical comparison with unordered (#16339)
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 29, 2017

BUG: Categorical comparison with unordered (#16339)
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014
(cherry picked from commit 91e9e52)

TomAugspurger added a commit that referenced this issue May 30, 2017

BUG: Categorical comparison with unordered (#16339)
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014
(cherry picked from commit 91e9e52)

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

BUG: Categorical comparison with unordered (#16339)
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment