API: Set ops for CategoricalIndex #10186

Open
sinhrks opened this issue May 21, 2015 · 8 comments
Labels
Bug · Categorical (Categorical Data Type) · setops (union, intersection, difference, symmetric_difference)

Comments

@sinhrks
Member

sinhrks commented May 21, 2015

Derived from #10157. Would like to clarify what these results should be. Basically, I think:

  • The result must be CategoricalIndex
  • The result should have the same values as a normal Index built from the same original values would produce.
  • The result's categories should only include categories that actually appear in the result.

The following are the current results.

intersection

# for reference
pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index([2, 3, 4, 2, 3, 4]))
# Int64Index([2, 2, 3, 3], dtype='int64')

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# CategoricalIndex([2, 2, 3, 3], categories=[1, 2, 3], ordered=False, dtype='category')
# -> Is this OK, or should it have categories=[2, 3]?
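The open question above (should the result carry only the categories it actually uses?) can be sketched in plain Python. This is a hedged model, not pandas code: `intersect_with_pruned_categories` is a hypothetical helper, values are plain lists, and the sorted output simply mirrors the `[2, 2, 3, 3]` shape of the reference result.

```python
def intersect_with_pruned_categories(left, right):
    """Model of an intersection whose categories are pruned to used values.

    Hypothetical helper for illustration only; it does not call pandas.
    """
    right_values = set(right)
    # Keep every left-hand value that also occurs on the right, sorted.
    values = sorted(v for v in left if v in right_values)
    # Third criterion: only categories that the result actually contains.
    categories = sorted(set(values))
    return values, categories


values, categories = intersect_with_pruned_categories(
    [1, 2, 3, 1, 2, 3], [2, 3, 4, 2, 3, 4]
)
print(values)      # [2, 2, 3, 3]
print(categories)  # [2, 3]
```

Under this model the answer to the question would be `categories=[2, 3]`, since `1` never appears in the result.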

union

Doc says "Form the union of two Index objects and sorts if possible". I'm not sure whether the last clause means "raise an error if sorting is impossible" or "do not sort if sorting is impossible".

pd.Index([1, 2, 4]).union(pd.Index([2, 3, 4]))
# Int64Index([1, 2, 3, 4], dtype='int64')

pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
# CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 3, 4], ordered=False, dtype='category')
# -> Should be sorted?
pd.Index([1, 2, 3, 1, 2, 3]).union(pd.Index([2, 3, 4, 2, 3, 4]))
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
# -> Should this result in Index([1, 2, 3, 1, 2, 3, 4, 4])?

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).union(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# TypeError: type() takes 1 or 3 arguments
# -> should raise understandable error, or Int64Index shouldn't raise (and return unsorted result?)
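One of the two readings of "sorts if possible" is "fall back to an unsorted result when the values cannot be ordered". A minimal pure-Python sketch of that reading follows; `union_sort_if_possible` is a hypothetical name, the inputs are plain lists of hashable values, and nothing here reflects how pandas implements `union`.

```python
def union_sort_if_possible(left, right):
    """Union of unique values; sorted when the values are mutually orderable.

    Hypothetical model of the "do not sort if impossible" interpretation.
    """
    seen = set()
    combined = []
    # Collect unique values in order of first appearance.
    for value in list(left) + list(right):
        if value not in seen:
            seen.add(value)
            combined.append(value)
    try:
        return sorted(combined)
    except TypeError:
        # Mixed, unorderable types (e.g. int and str on Python 3):
        # return the order-of-appearance result instead of raising.
        return combined


print(union_sort_if_possible([1, 2, 4], [2, 3, 4]))    # [1, 2, 3, 4]
print(union_sort_if_possible([1, "a"], ["b", 2]))      # [1, 'a', 'b', 2]
```

The other reading would replace the `except` branch with a re-raise, which makes the failure explicit at the cost of rejecting mixed-type inputs.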

difference

pd.CategoricalIndex([1, 2, 4, 5]).difference(pd.CategoricalIndex([2, 3, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?

sym_diff

pd.CategoricalIndex([1, 2, 4, 5]).sym_diff(pd.CategoricalIndex([2, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?
@sinhrks sinhrks added API Design Categorical Categorical Data Type labels May 21, 2015
@sinhrks sinhrks added this to the 0.17.0 milestone May 21, 2015
@jreback
Contributor

jreback commented May 21, 2015

So I would operate on both values/categories:

  • intersection: take the intersection of the categories as well
  • union: take the union of the categories
  • difference: has to take the categories of the lhs
  • sym_diff: take the sym_diff of the categories (so make this a CI as well)
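The four category rules listed above can be written down as one small function. This is only a sketch under assumptions: `result_categories` is a hypothetical helper, categories are plain lists, and the order-of-appearance ordering of the output is my choice, not something the comment specifies.

```python
def result_categories(op, lhs, rhs):
    """Categories of the result for each set operation, per the rules above.

    Hypothetical helper; lhs and rhs are lists of categories.
    """
    if op == "intersection":
        return [c for c in lhs if c in rhs]
    if op == "union":
        return list(lhs) + [c for c in rhs if c not in lhs]
    if op == "difference":
        # The result keeps all categories of the left-hand side.
        return list(lhs)
    if op == "sym_diff":
        only_lhs = [c for c in lhs if c not in rhs]
        only_rhs = [c for c in rhs if c not in lhs]
        return only_lhs + only_rhs
    raise ValueError(f"unknown set operation: {op!r}")


lhs, rhs = [1, 2, 4, 5], [2, 3, 4]
for op in ("intersection", "union", "difference", "sym_diff"):
    print(op, result_categories(op, lhs, rhs))
```

Note that under the `difference` rule the result may carry unused categories (here `2` and `4`), which differs from the "only categories the result actually has" criterion in the opening comment.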

@jorisvandenbossche
Member

One option to consider is to (for now) only allow these operations with indexes with the same categories

@jreback
Contributor

jreback commented May 21, 2015

I think you simply call self._create_categorical on the rhs to coerce nicely

@jankatins
Contributor

IMO (and I think not all agree here), a Categorical defines a new type and is therefore similar to an int or a string (e.g. a number of type int can be one value of int_min..int_max, similar to a value in a Categorical, which can only be one of the categories in that Categorical).

Therefore, two CategoricalIndex objects should behave like two indexes of type int if the categories (and ordered) are the same, and like one int index and one string index if they have different categories / ordered.

So:

>>> pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index(["2", "3", "4", "2", "3", "4"]))
Index([], dtype='object')
>>> pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# because the underlying categoricals have different categories [1, 2, 3] and [2, 3, 4]
Index([], dtype='object') 

See also this:

>>> pd.Categorical([1,2,3], ordered=True) > pd.Categorical([2,3,4], ordered=True)
TypeError: Categoricals can only be compared if 'categories' are the same
>>> 1 > "2" # on py3, py2 is ...
TypeError: unorderable types: int() > str()

@sinhrks
Member Author

sinhrks commented May 23, 2015

For reference, R doesn't care about the order of categories and removes duplicated categories.

intersect(as.factor(c(1, 2, 3)), as.factor(c(2, 3, 4)))
# [1] "2" "3"
intersect(as.factor(c(1, 2, 3, 1, 2, 3)), as.factor(c(2, 3, 4, 2, 3, 4)))
# [1] "2" "3"

intersect(c(1, 2, 3, 1, 2, 3), c(2, 3, 4, 2, 3, 4))
# [1] 2 3

Let me summarize the current opinions and choices. If I have misunderstood anything, please let me know:

| | Category order is identical | Category order is different |
|---|---|---|
| Categories are identical | Perform set ops against values and categories. The resulting values should be identical to the normal Index's result. | Ignore order and perform set ops (1), return empty (2), or raise an error (3)? |
| Categories are different | - | Ignore order and perform set ops (1), return empty (2), or raise an error (3)? |
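To make the three options in the summary concrete, here is a hedged pure-Python sketch of an intersection with a selectable mismatch policy. The function name `intersection_with_policy` and the policy strings `"set_ops"`, `"empty"`, and `"raise"` are hypothetical labels for options (1), (2), and (3); none of them exist in pandas.

```python
def intersection_with_policy(left, left_cats, right, right_cats,
                             on_mismatch="set_ops"):
    """Intersection of two categorical-like inputs under a mismatch policy.

    Hypothetical model: values and categories are plain lists.
    """
    if on_mismatch not in ("set_ops", "empty", "raise"):
        raise ValueError(f"unknown policy: {on_mismatch!r}")
    if list(left_cats) != list(right_cats):
        if on_mismatch == "empty":
            # Option (2): differing categories behave like differing dtypes.
            return [], []
        if on_mismatch == "raise":
            # Option (3): refuse to combine.
            raise TypeError("categories (or their order) differ")
    # Option (1), or matching categories: plain set ops on both parts.
    right_values = set(right)
    values = sorted({v for v in left if v in right_values})
    categories = [c for c in left_cats if c in right_cats]
    return values, categories


args = ([1, 2, 3], [1, 2, 3], [2, 3, 4], [2, 3, 4])
print(intersection_with_policy(*args))                       # ([2, 3], [2, 3])
print(intersection_with_policy(*args, on_mismatch="empty"))  # ([], [])
```

With `on_mismatch="raise"` the same call raises `TypeError`, which matches how comparing Categoricals with different categories already behaves in the example quoted earlier in the thread.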

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@TomAugspurger
Contributor

Resurrecting this as part of my CategoricalDtype refactor. The semantics in union_categoricals are good for union I think:

  • ordered must match
  • when ordered, all categories must match

Currently CategoricalIndex.union(other) discards the .ordered, which isn't great.

In [22]: a = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)

In [23]: b = pd.CategoricalIndex(['b', 'c'], categories=['a', 'b', 'c'], ordered=True)

In [24]: a.union(b).ordered
Out[24]: False

I think we'll follow those rules on the categories for each of the set operations.
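The two rules above ("ordered must match" and "when ordered, all categories must match") can be modelled as follows. This is a sketch under assumptions, not the `union_categoricals` implementation: `union_category_rule` is a hypothetical helper, and the unordered branch (categories merged in order of appearance) is my assumption about a reasonable result.

```python
def union_category_rule(cats_a, ordered_a, cats_b, ordered_b):
    """Return (categories, ordered) for a union, or raise if incompatible.

    Hypothetical model of the two rules; categories are plain lists.
    """
    if ordered_a != ordered_b:
        raise TypeError("'ordered' must match on both sides")
    if ordered_a:
        if list(cats_a) != list(cats_b):
            raise TypeError("ordered categoricals need identical categories")
        # The ordered flag is preserved rather than discarded.
        return list(cats_a), True
    merged = list(cats_a) + [c for c in cats_b if c not in cats_a]
    return merged, False


cats = ["a", "b", "c"]
print(union_category_rule(cats, True, cats, True))
# (['a', 'b', 'c'], True)
```

Applied to the `a.union(b)` example above, this model would keep `ordered=True` instead of returning `False`.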

@TomAugspurger
Contributor

Actually, I think we can handle additional cases with union_categoricals when both are ordered. Currently we require that categories match exactly when ordered. We could easily support

  • x | y when x is a strict subset of y: {a < b} | {a < b < c} -> {a < b < c}
  • x | y when the elements of x - y are all greater than max(y), or all less than min(y).
    e.g. {a < b < c < d} | {a < b < c} -> {a < b < c < d}

We could even support union over categoricals with "gaps" like

{a < b < d < e} | {a < b < c < d < e} -> { a < b < c < d < e}

These rules should work for intersect, difference, and symmetric difference too.
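One way to read the cases above, including the "gaps" case, is as a merge of two orderings that succeeds only when the combined order is unambiguous. The sketch below uses a topological sort for this; that technique is my choice, not something proposed in the thread, and `merge_ordered_categories` is a hypothetical name.

```python
def merge_ordered_categories(x, y):
    """Merge two ordered category lists into one unambiguous total order.

    Hypothetical sketch: raises ValueError when the inputs conflict or
    when they do not determine a unique order.
    """
    x, y = list(x), list(y)
    nodes = list(dict.fromkeys(x + y))
    successors = {n: set() for n in nodes}
    indegree = {n: 0 for n in nodes}
    # Each adjacent pair "lo < hi" in either input becomes a constraint.
    for seq in (x, y):
        for lo, hi in zip(seq, seq[1:]):
            if hi not in successors[lo]:
                successors[lo].add(hi)
                indegree[hi] += 1
    merged = []
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        if len(ready) > 1:
            raise ValueError(f"ambiguous order between {ready}")
        current = ready.pop()
        merged.append(current)
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(merged) != len(nodes):
        raise ValueError("the two orderings conflict")
    return merged


print(merge_ordered_categories(["a", "b"], ["a", "b", "c"]))
# ['a', 'b', 'c']
print(merge_ordered_categories(["a", "b", "d", "e"], ["a", "b", "c", "d", "e"]))
# ['a', 'b', 'c', 'd', 'e']
```

Inputs such as `{a < b}` and `{a < c}` are rejected as ambiguous, because neither side says how `b` and `c` compare.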

@joseortiz3
Contributor

joseortiz3 commented Apr 20, 2019

You guys probably already know this, but in case not, FYI: this is the current (0.24.2) behavior for the union of categorical indices, which makes it difficult to do anything involving two slightly different categorical indices:

>>> pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
CategoricalIndex([1, 2, 4, nan], categories=[1, 2, 4], ordered=False, dtype='category')

not

CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 4, 3], ordered=False, dtype='category') 
# or something

@mroeschke mroeschke added the Bug label Jun 28, 2020
@jbrockmendel jbrockmendel added the setops union, intersection, difference, symmetric_difference label Jun 17, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022