ENH: concat and append now can handle unordered categories #13767

sinhrks · 2016-07-24T01:33:28Z

closes Appending Pandas dataframes in for loop results in ValueError #13524
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

on current master:

# different categoricals
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4], dtype='category')])
# ValueError: incompatible categories in categorical concat

# categorical + normal (values are contained in categories) -> object dtype
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([1, 2])])
#0    1
#1    2
#0    1
#1    2
# dtype: object

# categorical + normal (values are not contained in categories) -> object dtype
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4])])
#0    1
#1    2
#0    3
#1    4
# dtype: object

this PR (updated according to the discussion):

# different categoricals
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4], dtype='category')])
#0    1
#1    2
#0    3
#1    4
# dtype: int64

# specifying union_categoricals keeps category if possible
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4], dtype='category')], union_categoricals=True)
#0    1
#1    2
#0    3
#1    4
# dtype: category
# Categories (4, int64): [1, 2, 3, 4]

# categorical + normal (values are contained in categories) -> int dtype (dtype is inferred)
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([1, 2])])
#0    1
#1    2
#0    1
#1    2
# dtype: int64

# categorical + normal (values are not contained in categories) -> int dtype (keep original dtype)
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4])])
#0    1
#1    2
#0    3
#1    4
# dtype: int64

CC: @JanSchulz, @chris-b1

codecov-io · 2016-07-24T01:54:06Z

Current coverage is 85.25% (diff: 100%)

Merging #13767 into master will decrease coverage by <.01%

@@             master     #13767   diff @@
==========================================
  Files           139        139          
  Lines         50496      50491     -5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43049      43044     -5   
  Misses         7447       7447          
  Partials          0          0

Powered by Codecov. Last update 8023029...96a372e

jankatins · 2016-07-24T15:00:30Z

pandas/tests/test_categorical.py

@@ -3852,16 +3852,15 @@ def test_concat(self):
        res = pd.concat([df, df])
        tm.assert_frame_equal(exp, res)

-        # Concat should raise if the two categoricals do not have the same
-        # categories


IMO the old test case should be modified so that it still tests that "more unequal" cats fail the concat.

jankatins · 2016-07-24T15:15:41Z

What is the actual "problem" or use case here?

I'm opposed to this PR: IMO concat() should not change the categories. In my thinking (pd.Series([1, 2], dtype='category'), pd.Series([3, 4], dtype='category')) is as (un)similar as ([1,2], ["A", "B"]). The latter might cast to a type which can take both (object) but not to a new int_or_string type...

So, I find the master variant more in line with that thinking. What I'm not sure about is whether it should raise or (as it does now) convert to object. Other cat handling of two different cats (apart from the special union_categoricals) raises, but pandas in general casts to object. As cat handling deviates from normal pandas casting rules in other cases, IMO it should error here, too.

I can see a logic for "categorical + iterable of the same values as in the categories = new categorical with the same categories as the first one", as that can be thought of as similar to cat[x] = value.

sinhrks · 2016-07-25T22:37:24Z

@JanSchulz thx for sharing your opinion. Can you illustrate your intention covering all possible patterns? (I copied mine from #13524).

- concat 2 categories -> use the rule of union_categoricals
- concat category and other dtype (whose values are all in the category, including empty) -> category
  - this rule is applied regardless of order (if there is at least one category in the values being concatenated)
  - properties like ordered should be preserved.
- concat category and other dtype (whose values are not in the category) -> not category (dtype is inferred)

jankatins · 2016-07-26T16:19:58Z

I come from using categoricals for things like Likert scales: "Fully agree ... fully disagree" and such things. From that standpoint, I usually read data in, convert to cat and replace the categories with the full set. In that case the categories have a specific meaning (e.g. the Likert scale, or in other cases the 50 US states or similar things which are fixed).

From that comes the rule that you shouldn't be able to set a value in the cat column to something not in categories (adding "Germany" to a cat encoding the 50 US states makes no sense). And from that the "error if combining two different categories" rule. So I'm fully happy that there isn't any "default" (i.e. anything in the cat accessor or in the default "non-categorical-specific" API like concat and similar functions) way to combine two different categories and trying to do it would error.

I see union_categorical as a way to use a categorical as a memory-saving string type and, as in the other issues where such additions/changes were proposed, I would love to see such a type instead of changing categorical to fit those needs.

The second usecase is as an internal helper for someone using it with reading in csv files before letting the user do some "meaningful" replacements.

sinhrks · 2016-07-26T23:15:44Z

Thanks for sharing the use case. Based on this, what are the "rules" you have in mind? Can you list the possible rules, like "category + category (same category) = category", covering the possible patterns so we can discuss the expected behavior?

jankatins · 2016-07-27T19:46:54Z

IMO these are the rules:

category + setting with values in the categories = fine
category + setting with values not in the categories = error (not converting to object/common "normal" type)
category + same category configuration ("order" + categories OR "unordered" + same categories (the unordered case could actually be less strict and only test if the set of categories is the same)) = fine
category + different category = error
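
To make the first two rules concrete, here is a minimal sketch of what "setting" means for a categorical Series. It assumes a reasonably recent pandas; the exception type raised for an unknown category has differed between versions, so the sketch catches both `TypeError` and `ValueError`. The example values are illustrative only.

```python
import pandas as pd

s = pd.Series(["agree", "disagree"], dtype="category")

# Rule 1: setting a value that is already one of the categories is fine.
s.iloc[0] = "disagree"

# Rule 2: setting a value outside the categories is rejected instead of
# silently upcasting the column to object.
try:
    s.iloc[1] = "Germany"
    raised = False
except (TypeError, ValueError):
    raised = True

print(raised)
print(s.dtype)
```

The column keeps its category dtype after the failed assignment, which is the strictness the rules above ask concat to mirror.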

jreback · 2016-07-29T10:45:49Z

I agree with @JanSchulz rules, except where error -> just coerce to the base type (of the categorical) if there is one, if multiple categoricals and different base types then must upcast (potentially to object).

The idea is that concat/append would always work, whether you add new categoricals or not, but the dtype will be preserved in a strict way.

I suppose that we could further add an argument to concat itself that just calls union_categoricals, IOW allows a non-strict interpretation. maybe union_categoricals=False as the default

jankatins · 2016-07-29T11:24:09Z

@jreback

category + setting with values not in the categories = error (not converting to object/common "normal" type)

except where error -> just coerce to the base type (of the categorical) if there is one, if multiple categoricals and different base types then must upcast (potentially to object).

If you also want to change setting as well, then this is a whole different thing... IMO the above rules are what currently applies.

I'm still not convinced: So what happens if I have a Categorical which is encoded as completely agree ... completely disagree (5 step Likert scale) and now I concat a column which is encoded as a categorical which encodes the 50 US states? It will now work (both are string dtypes), but the meaning is completely screwed. Or imagine things like two categoricals which have one category in common (50 US states + 50 state capitals -> washington is in both...): will that mean that the final categorical has x+y-1 categories? In my opinion, each categorical configuration is its own dtype (is_dtype_equal in Categorical actually tests just that: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L1791-L1809), so upcasting to a combined one is the same as removing all meaning from the categorical configuration = upcasting to object.
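
The "each categorical configuration is its own dtype" view can be illustrated with dtype equality. This is only a sketch: it uses `pandas.api.types.CategoricalDtype`, which is an assumption about the reader's pandas version (it is a later public API than the `is_dtype_equal` method linked above), and the category values are made up for illustration.

```python
from pandas.api.types import CategoricalDtype

likert = CategoricalDtype(["agree", "neutral", "disagree"], ordered=True)
likert_reversed = CategoricalDtype(["disagree", "neutral", "agree"], ordered=True)
states = CategoricalDtype(["Ohio", "Texas", "Washington"])
states_shuffled = CategoricalDtype(["Washington", "Ohio", "Texas"])

# Ordered categoricals: the order of the categories is part of the dtype.
same_ordered = likert == likert_reversed

# Unordered categoricals: only the set of categories matters.
same_unordered = states == states_shuffled

# Different categories (and different orderedness) are different dtypes.
same_across = likert == states

print(same_ordered, same_unordered, same_across)
```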

What is the actual use case here: internal csv parsing of different files to a single dataframe? If so, then in my eyes this looks (again :-)) like the "pd-String" use case: you encode it directly to save memory and then you have the problem of concatenating two dfs because you don't know the strings (= you have no prior knowledge of the expected categories).

If so, then there are multiple different ways to solve this:

add a pd.String dtype :-)
add a convenience method like pd.concat_nonstrict() which would basically do a union_categorical on the category columns.
add a nonstrict_category_checks=False to concat/append which would basically switch to pd.concat_nonstrict() behaviour above.
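
As a rough idea of what the second option could look like, here is a sketch of a `concat_nonstrict` helper. The function does not exist in pandas; its name comes from the suggestion above and its body is purely an assumption. It relies on `union_categoricals` and `CategoricalDtype` from `pandas.api.types` (the import location is also an assumption about the pandas version) and only handles unordered categoricals.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals


def concat_nonstrict(frames):
    """Hypothetical helper: concat DataFrames, unioning categorical columns."""
    frames = [f.copy() for f in frames]
    # Only columns present in every frame are candidates for unioning.
    shared = [c for c in frames[0].columns
              if all(c in f.columns for f in frames[1:])]
    for col in shared:
        cols = [f[col] for f in frames]
        if all(isinstance(c.dtype, CategoricalDtype) for c in cols):
            # Build one dtype holding every category seen in any frame.
            dtype = CategoricalDtype(union_categoricals(cols).categories)
            for f in frames:
                f[col] = f[col].astype(dtype)
    # All categorical columns now share a dtype, so concat keeps them.
    return pd.concat(frames, ignore_index=True)


a = pd.DataFrame({"state": pd.Categorical(["Ohio", "Texas"])})
b = pd.DataFrame({"state": pd.Categorical(["Texas", "Utah"])})
out = concat_nonstrict([a, b])
print(out["state"].dtype)
```

Keeping this in a separate function (or behind a keyword) leaves the strict default untouched, which is the point of the proposal.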

sinhrks · 2016-07-29T11:43:02Z

I agree that concat shouldn't fail, as users understand what they're doing. Considering all dtypes (not only category), coercing to the object dtype should be a basic rule (such as DatetimeTZ with different tz).

Based on this, I've organized the rules in the table below.

left	right	result
category	category (identical categories and ordered)	category
category	category (different categories or ordered)	object (dtype is inferred)
category	not category	object (dtype is inferred)

Also, adding the following options to concat should cover all cases:

union_categorical (default False): work as @jreback suggested. Apply union_categorical rule to concat Categorical.
strict (default False): raise when different dtype / different categories are being concatenated.
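
The three rows of the table can be checked with a short script. This is a sketch against plain `pd.concat` with no extra options; the exact inferred dtype in the second and third case may depend on the pandas version (the PR description shows `int64` for integer data), so the script prints the dtypes rather than asserting a specific one.

```python
import pandas as pd

cat12 = pd.Series([1, 2], dtype="category")
cat34 = pd.Series([3, 4], dtype="category")
plain = pd.Series([1, 2])

# Row 1: identical categories and ordered flag -> stays category.
same = pd.concat([cat12, cat12], ignore_index=True)

# Row 2: different categories -> no longer category, dtype is inferred.
different = pd.concat([cat12, cat34], ignore_index=True)

# Row 3: category + non-category -> no longer category, dtype is inferred.
mixed = pd.concat([cat12, plain], ignore_index=True)

print(same.dtype)
print(different.dtype)
print(mixed.dtype)
```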

jreback · 2016-08-01T10:27:51Z

doc/source/whatsnew/v0.19.0.txt

+Categorical Concatenation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- A function :func:`union_categorical` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`)


categoricals

sinhrks · 2016-08-01T12:58:11Z

Updated once based on the discussion. Note that the following fixes are not included:

union_categoricals with Index.append (being included in BUG: concat/append misc fixes #13660)
strict option (should be after BUG: concat/append misc fixes #13660)

jreback · 2016-08-04T10:42:23Z

pandas/tools/merge.py

@@ -1258,9 +1258,12 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
    join_axes : list of Index objects
        Specific indexes to use for the other n - 1 axes instead of performing
        inner/outer set logic
-    verify_integrity : boolean, default False


there is a doc-string in merge.rst, can you make sure to update that as well.

jreback · 2016-09-03T14:54:29Z

ok, let's take out union_categoricals arg to concat for now and put it in a separate PR, @sinhrks ?

wesm · 2016-09-03T15:06:36Z

@jreback that sounds good to me. Easier to add it later, and having the extra function is a stopgap for users who need it
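
For readers who need the non-strict behaviour without the keyword, the standalone function can be called directly. A minimal sketch, assuming `union_categoricals` is importable from `pandas.api.types` (its location has moved between pandas versions); the data reuses the overlapping states/capitals example from earlier in the thread.

```python
import pandas as pd
from pandas.api.types import union_categoricals

states = pd.Categorical(["Ohio", "Washington"])
capitals = pd.Categorical(["Columbus", "Washington"])

# The shared category is kept once, so 2 + 2 categories become 3.
combined = union_categoricals([states, capitals])

print(list(combined))
print(len(combined.categories))
```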

jreback · 2016-09-04T15:39:45Z

doc/source/categorical.rst

+
+.. _categorical.concat:
+
+Concatenation


can you add a link with a note in merge.rst to here

jreback · 2016-09-05T15:12:13Z

@sinhrks can you update

jreback · 2016-09-06T10:37:30Z

@sinhrks can you update?

jorisvandenbossche · 2016-09-06T22:07:19Z

@sinhrks I think the decision was to take out the kwarg for now? (see #13767 (comment) above).
(but maybe that is quite some work ..)

sinhrks · 2016-09-07T01:47:21Z

@jorisvandenbossche You're right. Updated to remove union_categoricals kw.

jreback · 2016-09-07T01:54:49Z

pandas/tools/tests/test_concat.py

+
+        # all category nan-likes => category
+        s1 = pd.Series([np.nan, np.nan], dtype='category')
+        s2 = pd.Series([np.nan, np.nan], dtype='category')


can you add similar for Datetimes and all NaT?

sure, will do today.

jorisvandenbossche · 2016-09-07T13:12:53Z

@sinhrks @jreback I am going to merge this, the final comments can be addressed in a follow-up PR (but they are not critical for the rc I think).


sinhrks added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Categorical Categorical Data Type labels Jul 24, 2016

sinhrks added this to the 0.19.0 milestone Jul 24, 2016

sinhrks changed the title ~~ENH: concat and append now can handleunordered categories~~ ENH: concat and append now can handle unordered categories Jul 24, 2016

jankatins reviewed Jul 24, 2016

sinhrks force-pushed the append_categorical branch 4 times, most recently from 693730c to 9402cb1 Compare August 1, 2016 10:03

jreback reviewed Aug 1, 2016

sinhrks force-pushed the append_categorical branch from 9402cb1 to 113418d Compare August 1, 2016 12:56

sinhrks force-pushed the append_categorical branch 4 times, most recently from b7c3b12 to 940a137 Compare August 4, 2016 00:14

jreback reviewed Aug 4, 2016

sinhrks force-pushed the append_categorical branch from a04ff74 to 561735a Compare September 3, 2016 03:03

sinhrks force-pushed the append_categorical branch from 561735a to 86cc682 Compare September 4, 2016 01:03

jreback reviewed Sep 4, 2016

sinhrks force-pushed the append_categorical branch from 86cc682 to 7f1e293 Compare September 6, 2016 22:01

ENH: concat and append now can handleunordered categories

589d88d

sinhrks force-pushed the append_categorical branch from 7f1e293 to 589d88d Compare September 6, 2016 22:04

reomove union_categoricals kw from concat

96a372e

sinhrks force-pushed the append_categorical branch from 11d47d2 to 96a372e Compare September 6, 2016 22:59

jreback reviewed Sep 7, 2016

jreback mentioned this pull request Sep 7, 2016

BUG: union_categoricals w/Series & CategoricalIndex #14173

Closed

jorisvandenbossche merged commit ab4bd36 into pandas-dev:master Sep 7, 2016

jorisvandenbossche mentioned this pull request Sep 7, 2016

API: union_categoricals in concat #14177

Open

jorisvandenbossche modified the milestones: 0.19.0rc, 0.19.0 Sep 7, 2016

This was referenced Oct 31, 2016

BUG/API: Index.append with mixed object/Categorical indices #14545

Merged

API: Index.append behaviour with categoricals #14586

Closed

sinhrks deleted the append_categorical branch January 8, 2017 06:23

ivirshup mentioned this pull request Aug 27, 2019

Incorrect docs for merging categoricals #28166

Closed
