API: relax categorical equality when comparing against object #8938

Closed
jreback opened this issue Nov 29, 2014 · 6 comments · Fixed by #8946
Labels
API Design Categorical Categorical Data Type
Milestone

Comments

@jreback
Contributor

jreback commented Nov 29, 2014

From a Stack Overflow question:

In [1]: a = pd.Series(['a','b','c'],dtype="category")
In [2]: b = pd.Series(['a','b','c'],dtype="object")
In [3]: c = pd.Series(['a','b','cc'],dtype="object")
In [5]: a==b
TypeError: Cannot compare a Categorical for op <built-in function eq> with type <type 'numpy.ndarray'>. If you want to 
compare values, use 'series <op> np.asarray(cat)'.
In [6]: A = pd.DataFrame({'A':a,'B':[1,2,3]})
In [7]: B = pd.DataFrame({'A':b,'C':[4,5,6]})

In [9]: A.merge(B,on='A') 
Out[9]: 
   A  B  C
0  a  1  4
1  b  2  5
2  c  3  6

In [10]: A.merge(B,on='A').dtypes
Out[10]: 
A    object
B     int64
C     int64
dtype: object

In [11]: A.dtypes
Out[11]: 
A    category
B       int64
dtype: object

In [12]: B.dtypes
Out[12]: 
A    object
C     int64
dtype: object
@jreback jreback added API Design Categorical Categorical Data Type labels Nov 29, 2014
@jreback jreback added this to the 0.16.0 milestone Nov 29, 2014
@jreback
Contributor Author

jreback commented Nov 29, 2014

cc @JanSchulz
@jorisvandenbossche
cc @shoyer
cc @immerrr

I think the merging is the correct behavior (@immerrr and I worked on this for a while to make it correct).

Merging a categorical with either an equal categorical (meaning one with the same categories) or an object array is OK, though you do lose the 'character' of the dtype, i.e. the result is no longer a Categorical. But this is like downgrading a type, e.g. going from integer to float when you merge with a float column. So I don't think this is that big of a deal.

Note that we could try to infer the categorical if possible after merging (which is possible if the object happens to not contain new categories).
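
A minimal sketch of the "infer the categorical after merging" idea, assuming a recent pandas version. The helper name `restore_categorical` is hypothetical and not part of pandas; it only casts the merged column back when every non-null value is already one of the original categories.

```python
import pandas as pd

a = pd.Series(["a", "b", "c"], dtype="category")
b = pd.Series(["a", "b", "c"], dtype="object")
A = pd.DataFrame({"A": a, "B": [1, 2, 3]})
B = pd.DataFrame({"A": b, "C": [4, 5, 6]})

merged = A.merge(B, on="A")


def restore_categorical(merged_col, original_col):
    """Hypothetical helper (not a pandas API).

    Cast merged_col back to the categorical dtype of original_col,
    but only if the merge introduced no values outside its categories.
    """
    categories = original_col.cat.categories
    if merged_col.dropna().isin(categories).all():
        return merged_col.astype(original_col.dtype)
    # New values appeared, so leave the column as it is.
    return merged_col


merged["A"] = restore_categorical(merged["A"], A["A"])
print(merged.dtypes)
```

The check is deliberately conservative: if the object side contributes a value that is not an existing category, the column is left untouched rather than silently turning that value into NaN.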

@shoyer
Member

shoyer commented Nov 30, 2014

Agreed, the first example should work.

@immerrr
Contributor

immerrr commented Nov 30, 2014

Yup, it seems natural to enable that.

@jankatins
Contributor

The problem arises when there is a custom ordering: an object array with "a"<"b"<"c" is not the same as a categorical with "a">"b">"c". So what should an "a" < "b" comparison do: use the object order or the categorical one?

In my head, I think of categorical elements as a new type. Under Python 3, different types are not comparable by default, and I went with that thought in the implementation.
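
To make the ordering concern concrete, here is a small sketch, assuming a recent pandas version. The variable names are illustrative only. The same values compared against the same scalar give different answers depending on whether the categorical order or the plain string order is used.

```python
import pandas as pd

# Categorical with a custom (reversed) ordering: c < b < a.
cat = pd.Series(
    pd.Categorical(["a", "b", "c"], categories=["c", "b", "a"], ordered=True)
)
# Plain object Series: strings compare lexically, so a < b < c.
obj = pd.Series(["a", "b", "c"], dtype="object")

by_category_order = cat < "b"  # uses the categorical order
by_lexical_order = obj < "b"   # uses ordinary string comparison

print(by_category_order.tolist())
print(by_lexical_order.tolist())
```

Because the two results disagree, there is no single obvious answer for `<` or `>` between a categorical and an object array, which is the ambiguity raised above. Equality does not depend on the ordering at all.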

@jreback
Contributor Author

jreback commented Nov 30, 2014

@JanSchulz I agree with you on the ordering case and < or >. Clearly they are not comparable.

However, merging and equality are about positional equality.

And we already support comparisons against a scalar:

In [2]: s = Series(list('abc'),dtype='category')

In [3]: s
Out[3]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

In [4]: s=='a'
Out[4]: 
0     True
1    False
2    False
dtype: bool

I'll put up a PR

@jankatins
Contributor

Ah, OK, now I understand: more or less, merge treats each value in the "to-be-merged" object array as a single value, so the scalar case applies.
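
A short sketch of what positional equality means here, assuming a pandas version in which equality between a categorical and an object Series is allowed (it raised `TypeError` in the version shown at the top of this issue). The `np.asarray` form is the workaround suggested by that error message.

```python
import numpy as np
import pandas as pd

a = pd.Series(["a", "b", "c"], dtype="category")
c = pd.Series(["a", "b", "cc"], dtype="object")

# Relaxed behavior: compare element by element, position by position.
relaxed = a == c

# Workaround hinted at by the original error message: compare plain values.
workaround = np.asarray(a) == np.asarray(c)

print(relaxed.tolist())
print(workaround.tolist())
```

Both forms only ask whether the value at each position matches, so the category ordering plays no role.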

4 participants