API: Index.append behaviour with categoricals #14586

jorisvandenbossche · 2016-11-04T14:28:37Z

Follow-up of #14545.

We had a long discussion on what the behaviour of concat should be when you have categorical data: #13767. In the end, for 0.19.0, we changed the behaviour of raising an error when categories didn't match to returning object dtyped data (only data with identical categories and ordered attributed gives a categorical as result). The table below is a summary of the changes between 0.18.1 and 0.19.0:

For categorical Series:

left	right	append/concat 0.18	append/concat 0.19.0
category	category (identical categories)	category	category
category	category (different categories)	error	object
category	not category	category	object
category	not category (different categories)	category with NaNs	object

However, we didn't change behaviour of append for Indexes (the above append is for series):

For CategoricalIndex:

left	right	append 0.18	append 0.19.0	append 0.19.1
category	category (identical categories)	category	category	category
category	category (different categories)	error	error	error
category	not category	category	category	category
category	not category (with other values)	error	error	error
not category	category (with other values)	object	error	object

The last line, i.e. the case where the calling Index is not a CategoricalIndex, changed by accident in 0.19.0, and it is this that I corrected for in PR #14545 for 0.19.1.

Questions:

Do we want the same behaviour for Index.append as we now have for Series.append with categorical data? This means that the column in the table above becomes 'object' apart from the first row.
Do we want to make an exception for the case where the values in the 'right' correspond with the categories? (so that pd.CategoricalIndex(['a', 'b', 'c']).append(pd.Index(['a']))keeps working)

Changing this to always return object dtype unless for categoricals with indentical categories is easy, but gives a few failures in our test suite. Namely, in some indexing tests (indexing a DataFrame with a CategoricalIndex) there are changes in behaviour because indexing with a non-existing value in the index was performed using CategoricalIndex.append(). But this we can workaround in the indexing code of course.

cc @JanSchulz @sinhrks

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2020-11-18T19:45:03Z

I'm running into this in a couple of places, mainly in setitem-with-expansion operations:

cat = pd.Categorical(["a", "b", "c", "d", "e"], ordered=True)

ci = pd.CategoricalIndex(cat[:-1])   # <-- "e" is an unused category

df = pd.DataFrame(np.random.randn(3, 4), columns=ci)

>>> df["e"] = 0
>>> is_dtype_equal(df.columns.dtype, cat.dtype)
True

>>> df["f"] = 1
TypeError: 'fill_value=f' is not present in this Categorical's categories

xref #37774 im inclined to cast rather than raise here. One option that this could look like is:

     def _concat(self, to_concat: List["Index"], name: Label) -> "CategoricalIndex":
         # if calling index is category, don't check dtype of others
-        codes = np.concatenate([self._is_dtype_compat(c).codes for c in to_concat])
-        cat = self._data._from_backing_data(codes)
-        return type(self)._simple_new(cat, name=name)
+        try:
+            codes = np.concatenate([self._is_dtype_compat(c).codes for c in to_concat])
+        except TypeError:
+            # not all to_concat elements are among our categories (or NA)
+            from pandas.core.dtypes.concat import concat_compat
+
+            res = concat_compat(to_concat)
+            return Index(res, name=name)
+        else:
+            cat = self._data._from_backing_data(codes)
+            return type(self)._simple_new(cat, name=name)

(another option would be to work this behavior into concat_compat; i havent thought about this much. I guess i do like this idea on the principle of having every concat-like operation behave the same way)

Making this change would also allow us to get rid of a couple of try/excepts in reshape.pivot, and get rid of _to_safe_for_reshape.

jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves API Design Categorical Categorical Data Type labels Nov 4, 2016

jorisvandenbossche added this to the 0.20.0 milestone Nov 4, 2016

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

toobaz added Index Related to the Index class or subclasses and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 28, 2019

mroeschke added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 5, 2020

jbrockmendel mentioned this issue Nov 26, 2020

API: CategoricalIndex.append fallback to concat_compat #38098

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.2 Nov 26, 2020

jreback closed this as completed in #38098 Nov 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Index.append behaviour with categoricals #14586

API: Index.append behaviour with categoricals #14586

jorisvandenbossche commented Nov 4, 2016

jbrockmendel commented Nov 18, 2020

API: Index.append behaviour with categoricals #14586

API: Index.append behaviour with categoricals #14586

Comments

jorisvandenbossche commented Nov 4, 2016

jbrockmendel commented Nov 18, 2020