API: Set ops for CategoricalIndex #10186

sinhrks · 2015-05-21T14:17:09Z

Derived from #10157. Would like to clarify what these results should be. Basically, I think:

The result must be a CategoricalIndex.
The result should have the same values as the result for a normal Index built from the same original values.
The result's categories should include only the categories that actually occur in the result.
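
As a rough illustration of the three points above (not a proposed implementation), the sketch below delegates the set operation to a plain Index and re-wraps the result. The helper name `categorical_setop` is hypothetical, and the sketch deliberately ignores the inputs' own categories and `ordered` flag.

```python
import pandas as pd


def categorical_setop(left, right, op="intersection"):
    """Hypothetical helper: run the set op on plain Index objects, then re-wrap."""
    plain = getattr(pd.Index(left), op)(pd.Index(right))
    # Building the CategoricalIndex from the result values infers the
    # categories from those values, so no unused category survives.
    return pd.CategoricalIndex(plain)


result = categorical_setop([1, 2, 3, 1, 2, 3], [2, 3, 4, 2, 3, 4])
print(result)
print(list(result.categories))
```

How duplicates are treated depends on what the plain Index does in the installed pandas version; the categories contain only 2 and 3 either way.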

The following are the current results.

intersection

# for reference
pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index([2, 3, 4, 2, 3, 4]))
# Int64Index([2, 2, 3, 3], dtype='int64')

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# CategoricalIndex([2, 2, 3, 3], categories=[1, 2, 3], ordered=False, dtype='category')
# -> Is this OK, or should it have categories=[2, 3]?

union

The docs say "Form the union of two Index objects and sorts if possible". I'm not sure whether the last part means "raise an error if sorting is impossible" or "don't sort if it is impossible".

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.union.html

pd.Index([1, 2, 4]).union(pd.Index([2, 3, 4]))
# Int64Index([1, 2, 3, 4], dtype='int64')

pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
# CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 3, 4], ordered=False, dtype='category')
# -> Should be sorted?

pd.Index([1, 2, 3, 1, 2, 3]).union(pd.Index([2, 3, 4, 2, 3, 4]))
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
# -> Should this result in Index([1, 2, 3, 1, 2, 3, 4, 4])?

pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).union(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# TypeError: type() takes 1 or 3 arguments
# -> This should raise an understandable error; or should Int64Index not raise (and return an unsorted result)?

difference

pd.CategoricalIndex([1, 2, 4, 5]).difference(pd.CategoricalIndex([2, 3, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?

sym_diff

pd.CategoricalIndex([1, 2, 4, 5]).sym_diff(pd.CategoricalIndex([2, 4]))
# Int64Index([1, 5], dtype='int64')
# -> should be CategoricalIndex?

jreback · 2015-05-21T14:27:28Z

So I would operate on both values and categories:

intersection: take the intersection of the categories as well
union: take the union of the categories
difference: has to take the categories of the lhs
sym_diff: take the sym_diff of the categories (so make this a CI as well)
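
A minimal pure-Python sketch of the category rules listed above. The function name `result_categories` is hypothetical, and preserving the left-hand order of categories is an assumption that the comment does not state.

```python
def result_categories(lhs_categories, rhs_categories, op):
    """Return the categories of the result for a given set operation."""
    lhs, rhs = list(lhs_categories), list(rhs_categories)
    if op == "intersection":
        # Categories present on both sides.
        return [c for c in lhs if c in rhs]
    if op == "union":
        # All lhs categories, then any rhs categories not seen yet.
        return lhs + [c for c in rhs if c not in lhs]
    if op == "difference":
        # The lhs categories are kept unchanged.
        return lhs
    if op == "sym_diff":
        # Categories present on exactly one side.
        return [c for c in lhs if c not in rhs] + [c for c in rhs if c not in lhs]
    raise ValueError(f"unknown set operation: {op!r}")


for op in ("intersection", "union", "difference", "sym_diff"):
    print(op, result_categories([1, 2, 3], [2, 3, 4], op))
# intersection [2, 3]
# union [1, 2, 3, 4]
# difference [1, 2, 3]
# sym_diff [1, 4]
```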

jorisvandenbossche · 2015-05-21T14:27:48Z

One option to consider is to (for now) only allow these operations between indexes with the same categories.

jreback · 2015-05-21T14:29:52Z

I think you simply call self._create_categorical on the rhs to coerce nicely

jankatins · 2015-05-23T12:21:28Z

IMO (and I think not all agree here), a Categorical defines a new type and is therefore similar to an int or a string (e.g. a number of type int can be one value of int_min..int_max, similar to a value in a Categorical, which can only be one of the categories in that Categorical).

Therefore two CategoricalIndexes should behave like two indexes of type int if the categories (and ordered) are the same, and like one int index and one string index if they have different categories or ordered.

So:

>>> pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index(["2", "3", "4", "2", "3", "4"]))
Index([], dtype='object')
>>> pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# because the underlying categoricals have different categories [1,2,3] and [2,3,4]
Index([], dtype='object')

See also this:

>>> pd.Categorical([1,2,3], ordered=True) > pd.Categorical([2,3,4], ordered=True)
TypeError: Categoricals can only be compared if 'categories' are the same
>>> 1 > "2" # on py3, py2 is ...
TypeError: unorderable types: int() > str()

sinhrks · 2015-05-23T14:44:50Z

For reference, R doesn't care about the order of categories and removes duplicated categories.

intersect(as.factor(c(1, 2, 3)), as.factor(c(2, 3, 4)))
# [1] "2" "3"
intersect(as.factor(c(1, 2, 3, 1, 2, 3)), as.factor(c(2, 3, 4, 2, 3, 4)))
# [1] "2" "3"

intersect(c(1, 2, 3, 1, 2, 3), c(2, 3, 4, 2, 3, 4))
# [1] 2 3

Let me summarize the current opinions and choices. If I have misunderstood, please let me know:

| | Category order is identical | Category order is different |
|---|---|---|
| Categories are identical | Perform set ops against values and categories. The resulting values should be identical to the normal Index's result. | (1) Ignore order and perform set ops, (2) return empty, or (3) raise an error? |
| Categories are different | - | (1) Ignore order and perform set ops, (2) return empty, or (3) raise an error? |

TomAugspurger · 2017-08-06T22:02:57Z

Resurrecting this as part of my CategoricalDtype refactor. The semantics in union_categoricals are good for union I think:

ordered must match
when ordered, all categories must match

Currently CategoricalIndex.union(other) discards the .ordered, which isn't great.

In [22]: a = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)

In [23]: b = pd.CategoricalIndex(['b', 'c'], categories=['a', 'b', 'c'], ordered=True)

In [24]: a.union(b).ordered
Out[24]: False

I think we'll follow those rules on the categories for each of the set operations.
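
For reference, a short sketch of how those rules look with `union_categoricals` itself (imported from `pandas.api.types`). The variable names are illustrative, and the exact error messages may differ between pandas versions, so they are only printed here.

```python
import pandas as pd
from pandas.api.types import union_categoricals

cats = ["a", "b", "c"]
a = pd.Categorical(["a", "b"], categories=cats, ordered=True)
b = pd.Categorical(["b", "c"], categories=cats, ordered=True)

# Both ordered with identical categories: allowed, and `ordered` is kept.
combined = union_categoricals([a, b])
print(combined.ordered)  # True

# Both ordered but with different categories: rejected.
other = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True)
try:
    union_categoricals([a, other])
except TypeError as exc:
    print("rejected:", exc)
```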

TomAugspurger · 2017-08-11T18:16:39Z

Actually, I think we can handle additional cases with union_categoricals when both are ordered. Currently we require that categories match exactly when ordered. We could easily support

x | y when x is a strict subset of y: {a < b} | {a < b < c} -> {a < b < c}
x | y when the elements of x - y are all greater than max(y), or all less than min(y),
e.g. {a < b < c < d} | {a < b < c} -> {a < b < c < d}

We could even support union over categoricals with "gaps" like

{a < b < d < e} | {a < b < c < d < e} -> {a < b < c < d < e}

These rules should work for intersect, difference, and symmetric difference too.
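
A pure-Python sketch of the simplest version of this idea: merge two ordered category lists only when one is an order-preserving subsequence of the other, which covers the three examples above. The function names are hypothetical, and the more general "all greater than max(y)" rule is not implemented here.

```python
def is_subsequence(short, long):
    """True if `short` appears within `long` in the same relative order."""
    remaining = iter(long)
    # `item in remaining` consumes the iterator up to the first match,
    # so later items can only match further to the right.
    return all(item in remaining for item in short)


def merge_ordered_categories(x, y):
    """Return the merged ordering, or raise if the orders cannot be reconciled."""
    if is_subsequence(x, y):
        return list(y)
    if is_subsequence(y, x):
        return list(x)
    raise ValueError(f"cannot reconcile orderings {x!r} and {y!r}")


print(merge_ordered_categories(["a", "b"], ["a", "b", "c"]))
# ['a', 'b', 'c']
print(merge_ordered_categories(["a", "b", "c", "d"], ["a", "b", "c"]))
# ['a', 'b', 'c', 'd']
print(merge_ordered_categories(["a", "b", "d", "e"], ["a", "b", "c", "d", "e"]))
# ['a', 'b', 'c', 'd', 'e']
```

Conflicting orders such as {a < b} and {b < a} raise a ValueError in this sketch rather than being merged silently.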

joseortiz3 · 2019-04-20T06:36:04Z

You guys probably already know this, but in case not, FYI: this is the current (0.24.2) behavior for the union of categorical indexes, which makes it difficult to do anything involving two slightly different categorical indexes:

>>> pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
CategoricalIndex([1, 2, 4, nan], categories=[1, 2, 4], ordered=False, dtype='category')

not

CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 4, 3], ordered=False, dtype='category') 
# or something
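
One possible workaround, sketched under the assumption that `union_categoricals` from `pandas.api.types` is available and accepts CategoricalIndex inputs, is to merge the categories first and deduplicate afterwards. This is only an illustration, not the behavior `CategoricalIndex.union` is expected to have.

```python
import pandas as pd
from pandas.api.types import union_categoricals

left = pd.CategoricalIndex([1, 2, 4])
right = pd.CategoricalIndex([2, 3, 4])

# union_categoricals concatenates the values and unions the categories,
# so the value 3 stays a valid category instead of turning into NaN.
combined = union_categoricals([left, right])

# Wrap in a CategoricalIndex and drop duplicate values to get set-like output.
result = pd.CategoricalIndex(combined).unique()
print(result)
```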

sinhrks added API Design Categorical Categorical Data Type labels May 21, 2015

sinhrks added this to the 0.17.0 milestone May 21, 2015

jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015

sinhrks mentioned this issue Jun 8, 2016

API/ENH: union Categorical #13361

Closed

Dr-Irv mentioned this issue Feb 28, 2018

BUG: names on union and intersection for Index were inconsistent (GH9943 GH9862) #19849

Merged

3 tasks

mroeschke added the Bug label Jun 28, 2020

mroeschke removed the API Design label Apr 18, 2021

jbrockmendel added the setops union, intersection, difference, symmetric_difference label Jun 17, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
