CategoricalIndex.equals incorrectly considers category order when unordered #16603
>>> import pandas as pd
>>> pd.CategoricalIndex(['a', 'b'], categories=['a', 'b']).equals(
...     pd.CategoricalIndex(['a', 'b'], categories=['b', 'a']))
False
I fixed this for regular Categoricals in #16339, but that didn't affect CategoricalIndex.equals. Do we want
Discovered during my CategoricalDtype refactor, which does fix it so that order is ignored when
Joining is broken because of this. It randomly broke a real-world script, depending on the input data. Real headache to troubleshoot!
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
@naure do you have time to submit a PR fixing this? I haven't had a chance, and won't for a little while yet.…
On Mon, Dec 4, 2017 at 8:46 AM, naure wrote:

Test case:

    # Same categories but in different order
    Xi = pd.Categorical(["A"], categories=["A", "B"])
    Yi = pd.Categorical(["B"], categories=["B", "A"])

    x = pd.DataFrame(1, columns=["x"], index=Xi)
    assert "B" not in x.index

    y = pd.DataFrame(2, columns=["y"], index=Yi)
    assert "B" in y.index

    xy = x.join(y, how="inner")
    print(x)
    print(y)
    print("Incorrect join:")
    print(xy)

    # It used the codes instead of the labels
    assert x.index.codes == y.index.codes == xy.index.codes ==

    assert len(xy) == 0, "Should be empty"  # Broken
    assert "B" not in xy.index  # Broken
    assert "B" not in xy  # Actually passes somehow

Output:

       x
    A  1
       y
    B  2
    Incorrect join:
       x  y
    B  1  2
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-690-c66e71d6d8ba> in <module>()
         18 assert x.index.codes == y.index.codes == xy.index.codes ==
         19
    ---> 20 assert len(xy) == 0, "Should be empty"  # Broken
         21 assert "B" not in xy.index  # Broken
         22 assert "B" not in xy  # Actually passes somehow

    AssertionError: Should be empty
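For anyone hitting the join problem above, one possible workaround is to join on plain labels instead of categorical codes. This is only a sketch and not the monkey-patch mentioned in this thread: `join_on_labels` is a hypothetical helper name, and it assumes that losing the categorical dtype on the index is acceptable.

```python
import pandas as pd

# Same categories, different order, as in the test case above
x = pd.DataFrame({"x": [1]}, index=pd.CategoricalIndex(["A"], categories=["A", "B"]))
y = pd.DataFrame({"y": [2]}, index=pd.CategoricalIndex(["B"], categories=["B", "A"]))

# Both indexes store code 0, but that code stands for a different label in each
same_codes = list(x.index.codes) == list(y.index.codes)
same_labels = list(x.index) == list(y.index)


def join_on_labels(left, right, how="inner"):
    """Hypothetical helper: join two frames on index labels, not category codes."""
    left = left.copy()
    right = right.copy()
    # Casting to object drops the categorical dtype, so only labels are compared
    left.index = left.index.astype(object)
    right.index = right.index.astype(object)
    return left.join(right, how=how)


xy = join_on_labels(x, y)
print(same_codes, same_labels, len(xy))
```

Because the labels "A" and "B" do not overlap, the inner join should come back empty here, whichever pandas version is installed.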
Thanks. CategoricalIndex is defined at https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/category.py. It inherits the
The basic idea in that branch was to override the set ops methods for
No, I think this is actually a regression in equality testing of Categorical, independent of the effort you mentioned. Commit. It's not clear whether they should ultimately compare equal or not, but if they do, that breaks a number of operations.
Here is a monkey-patch for anyone who needs it safely working regardless of version.
@TomAugspurger Would you like this as a PR?
Why do you say that?
What is "it" in this case that's being compared?
That equality you linked to is on the CategoricalDtype, whose equality semantics are a bit strange, especially around empty Categoricals.
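To make the dtype point concrete, here is a small sketch (the variable names are illustrative only) showing that, in pandas versions that have `CategoricalDtype`, two unordered Categoricals can have equal dtypes while their integer codes differ:

```python
import pandas as pd

a = pd.Categorical(["a", "b"], categories=["a", "b"])
b = pd.Categorical(["a", "b"], categories=["b", "a"])

# Unordered dtypes compare equal regardless of category order
dtypes_equal = a.dtype == b.dtype

# The same values are stored under different integer codes
codes_a = a.codes.tolist()  # [0, 1]
codes_b = b.codes.tolist()  # [1, 0]

print(dtypes_equal, codes_a, codes_b)
```

So comparing codes directly is only meaningful once the categories are known to be in the same order.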
Apologies for my slowness, I was looking at
So, as you say, Categorical.equals should be defined as:
The issue with the current definition is that it assumes dtype equality implies that the mapping between values and codes matches, which isn't always true. So I'd propose something like:
    if self.is_dtype_equal(other):
        if self.categories.equals(other.categories):
            # the fast case: categories are in the same order, so codes align
            return np.array_equal(self._codes, other._codes)
        else:
            # unordered categories in a different order: recode `other`
            # into this Categorical's category order before comparing
            codes2 = _recode_for_categories(other.codes, other.categories,
                                            self.categories)
            return np.array_equal(self._codes, codes2)
    return False
then we should be OK. I'll submit that as a PR later today, unless you beat me to it :)
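The proposal above can be tried out without touching pandas internals. The following is a rough standalone sketch using only public API: `categorical_equals` is a hypothetical function name, and `Index.get_indexer` stands in for the private `_recode_for_categories` helper, so this is an approximation of the idea rather than the actual patch.

```python
import numpy as np
import pandas as pd


def categorical_equals(left, right):
    """Hypothetical sketch: compare two Categoricals by value, not by raw code."""
    if left.dtype != right.dtype:
        return False
    if left.categories.equals(right.categories):
        # Fast case: same category order, so the codes align directly
        return np.array_equal(left.codes, right.codes)
    # Unordered categories in a different order: translate right's codes
    # into left's category positions before comparing
    position_in_left = left.categories.get_indexer(right.categories)
    right_codes = np.asarray(right.codes)
    # Code -1 marks a missing value and must stay -1 after recoding
    recoded = np.where(right_codes == -1, -1, position_in_left[right_codes])
    return np.array_equal(left.codes, recoded)


x = pd.Categorical(["a", "b"], categories=["a", "b"])
y = pd.Categorical(["a", "b"], categories=["b", "a"])  # same values as x
z = pd.Categorical(["b", "a"], categories=["b", "a"])  # same codes as x

print(categorical_equals(x, y))  # True
print(categorical_equals(x, z))  # False
```

Here `x` and `z` share the codes `[0, 1]` but hold different values, which is exactly the case a plain code comparison gets wrong.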