ENH: concat and append now can handle unordered categories #13767

Merged
merged 2 commits into from
Sep 7, 2016

Conversation

sinhrks (Member) commented Jul 24, 2016

on current master:

# different categoricals
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4], dtype='category')])
# ValueError: incompatible categories in categorical concat

# categorical + normal (values are contained in categories) -> object dtype
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([1, 2])])
#0    1
#1    2
#0    1
#1    2
# dtype: object

# categorical + normal (values are not contained in categories) -> object dtype
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4])])
#0    1
#1    2
#0    3
#1    4
# dtype: object

this PR (updated according to the discussion):

# different categoricals
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4], dtype='category')])
#0    1
#1    2
#0    3
#1    4
# dtype: int64

# specifying union_categoricals keeps category if possible
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4], dtype='category')], union_categoricals=True)
#0    1
#1    2
#0    3
#1    4
# dtype: category
# Categories (4, int64): [1, 2, 3, 4]

# categorical + normal (values are contained in categories) -> int dtype (keep original dtype)
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([1, 2])])
#0    1
#1    2
#0    1
#1    2
# dtype: int64

# categorical + normal (values are not contained in categories) -> int dtype (keep original dtype)
pd.concat([pd.Series([1, 2], dtype='category'),
           pd.Series([3, 4])])
#0    1
#1    2
#0    3
#1    4
# dtype: int64

CC: @JanSchulz, @chris-b1

@sinhrks sinhrks added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Categorical Categorical Data Type labels Jul 24, 2016
@sinhrks sinhrks added this to the 0.19.0 milestone Jul 24, 2016
@sinhrks sinhrks changed the title ENH: concat and append now can handleunordered categories ENH: concat and append now can handle unordered categories Jul 24, 2016
codecov-io commented Jul 24, 2016

Current coverage is 85.25% (diff: 100%)

Merging #13767 into master will decrease coverage by <.01%

@@             master     #13767   diff @@
==========================================
  Files           139        139          
  Lines         50496      50491     -5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43049      43044     -5   
  Misses         7447       7447          
  Partials          0          0          

Powered by Codecov. Last update 8023029...96a372e

@@ -3852,16 +3852,15 @@ def test_concat(self):
res = pd.concat([df, df])
tm.assert_frame_equal(exp, res)

# Concat should raise if the two categoricals do not have the same
# categories

jankatins (Contributor) Jul 24, 2016

IMO the old test case should be modified so that it still tests the case that "more unequal" cats fail the concat.

jankatins (Contributor) commented Jul 24, 2016

What is the actual "problem" or use case here?

I'm opposed to this PR: IMO concat() should not change the categories. In my thinking, (pd.Series([1, 2], dtype='category'), pd.Series([3, 4], dtype='category')) is as (un)similar as ([1,2], ["A", "B"]). The latter might cast to a type which can take both (object) but not to a new int_or_string type...

So, I find the master variant more in line with that thinking. What I'm not sure about is whether it should raise or (as it does now) convert to object. Other cat handling of two different cats (apart from the special union_categoricals) raises, but pandas in general casts to object. As cat handling deviates in other cases from normal pandas casting rules, IMO it should error here, too.

I can see a logic for "categorical + iterable of the same values as in the categories = new categorical with the same categories as the first one", as that can be thought of as similar to cat[x] = value.

sinhrks (Member, Author) commented Jul 25, 2016

@JanSchulz thx for sharing your opinion. Can you illustrate your intention covering all possible patterns? (I copied mine from #13524).

  • concat 2 categories -> use the rule of union_categorical
  • concat category and other dtype (whose values are all in the category, including empty) -> category
    • this rule is applied regardless of order (if there is at least one category among the concatenated values)
    • properties such as ordered should be preserved.
  • concat category and other dtype (whose values are not in the category) -> not category (dtype is inferred)

jankatins (Contributor) commented Jul 26, 2016

I come from using categoricals for things like Likert scales: "Fully agree ... fully disagree" and such things. From that standpoint, I usually read data in, convert to cat and replace the categories with the full set. In that case the categories have a specific meaning (e.g. the Likert scale or in other cases the 50 US states or such things which are fixed).

From that comes the rule that you shouldn't be able to set a value in the cat column to something not in categories (adding "Germany" to a cat encoding the 50 US states makes no sense). And from that the "error if combining two different categories" rule. So I'm fully happy that there isn't any "default" (i.e. anything in the cat accessor or in the default "non-categorical-specific" API like concat and similar functions) way to combine two different categories and trying to do it would error.

I see union_categorical as a way to use a categorical as a memory-saving string type and, as in the other issues where such additions/changes were proposed, I would love to see such a type instead of changing categorical to fit those needs.

The second usecase is as an internal helper for someone using it with reading in csv files before letting the user do some "meaningful" replacements.

sinhrks (Member, Author) commented Jul 26, 2016

Thanks for sharing the use case. Based on this, what are the "rules" you have in mind? Can you list the possible rules, like: category + category (same category) = category, covering the possible patterns, so we can discuss the expected behavior?

jankatins (Contributor) commented Jul 27, 2016

IMO these are the rules:

  • category + setting with values in the categories = fine
  • category + setting with values not in the categories = error (not converting to object/common "normal" type)
  • category + same category configuration ("order" + categories OR "unordered" + same categories (the unordered case could actually be less strict and only test if the set of categories is the same)) = fine
  • category + different category = error
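
The four rules above can be sketched as a small compatibility check. This is only an illustration of the strict reading proposed in this comment, not pandas code: `CatConfig`, `same_configuration` and `strict_combine` are hypothetical names, and a categorical is reduced to its categories plus an `ordered` flag.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CatConfig:
    """Hypothetical stand-in for a categorical's configuration."""
    categories: Tuple[str, ...]
    ordered: bool = False


def same_configuration(left: CatConfig, right: CatConfig) -> bool:
    """Return True when two configurations count as the same under the rules above."""
    if left.ordered != right.ordered:
        return False
    if left.ordered:
        # Ordered: the category sequence itself must match.
        return left.categories == right.categories
    # Unordered: the relaxed variant only compares the set of categories.
    return set(left.categories) == set(right.categories)


def strict_combine(left: CatConfig, right: CatConfig) -> CatConfig:
    """Combine two categoricals, raising instead of coercing on a mismatch."""
    if not same_configuration(left, right):
        raise ValueError("incompatible categories")
    return left
```

Under this sketch, two unordered categoricals with categories ("low", "high") and ("high", "low") combine fine, while the same pair marked as ordered raises.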

jreback (Contributor) commented Jul 29, 2016

I agree with @JanSchulz's rules, except where error -> just coerce to the base type (of the categorical) if there is one; if there are multiple categoricals with different base types, then we must upcast (potentially to object).

The idea is that concat/append would always work, whether you add new categoricals or not, but the dtype will be preserved in a strict way.

I suppose that we could further add an argument to concat itself that just calls union_categoricals, IOW allows a non-strict interpretation. maybe union_categoricals=False as the default

jankatins (Contributor) commented Jul 29, 2016

@jreback

category + setting with values not in the categories = error (not converting to object/common "normal" type)

except where error -> just coerce to the base type (of the categorical) if there is one, if multiple categoricals and different base types then must upcast (potentially to object).

If you also want to change setting as well, then this is a whole different thing... IMO the above rules are what currently applies.

I'm still not convinced: So what happens if I have a Categorical which is encoded as completely agree ... completely disagree (5-step Likert scale) and now I concat a column which is encoded as a categorical which encodes the 50 US states? It will now work (both are string dtypes), but the meaning is completely screwed. Or imagine things like two categoricals which have one category in common (50 US states + 50 state capitals -> washington is in both...): will that mean that the final categorical has x+y-1 categories? In my opinion, each categorical configuration is its own dtype (is_dtype_equal in Categorical actually tests just that: https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L1791-L1809), so upcasting to a combined one is the same as removing all meaning from the categorical configuration = upcasting to object.

What is the actual use case here: internal csv parsing of different files to a single dataframe? If so, then in my eyes this looks (again :-)) like the "pd-String" use case: you encode it directly to save memory and then you have the problem of concatenating two dfs because you don't know the strings (= you have no prior knowledge about the expected categories).
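
The x+y-1 count in the comment above can be checked with a toy union of two category lists. The helper `union_categories` and the shortened sample lists are made up for illustration; they only mimic the counting, not how pandas stores categories.

```python
def union_categories(left, right):
    """Order-preserving union of two category lists (illustrative only)."""
    seen = set()
    merged = []
    for label in list(left) + list(right):
        if label not in seen:
            seen.add(label)
            merged.append(label)
    return merged


# Shortened, made-up samples standing in for "states" and "capitals".
states = ["Oregon", "Texas", "Washington"]
capitals = ["Salem", "Austin", "Washington"]

merged = union_categories(states, capitals)
# 3 + 3 - 1 = 5 categories, because "Washington" appears in both lists.
print(len(merged), merged)
```

The merged list is a valid set of labels, but nothing in it records that some labels were states and others capitals, which is the loss of meaning the comment is worried about.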

If so, then there are multiple different ways to solve this:

  • add a pd.String dtype :-)
  • add a convenience method like pd.concat_nonstrict() which would basically do a union_categorical on the category columns.
  • add a nonstrict_category_checks=False to concat/append which would basically switch to pd.concat_nonstrict() behaviour above.

sinhrks (Member, Author) commented Jul 29, 2016

I agree that concat shouldn't fail, as users understand what they're doing. Considering all dtypes (not only category), coercing to the object dtype should be a basic rule (such as DatetimeTZ with different tz).

Based on this, I've organized the rules as below table.

left     | right                                       | result
category | category (identical categories and ordered) | category
category | category (different categories or ordered)  | object (dtype is inferred)
category | not category                                | object (dtype is inferred)

Also, adding following options to concat should cover all cases:

  • union_categorical (default False): work as @jreback suggested. Apply union_categorical rule to concat Categorical.
  • strict (default False): raise when different dtype / different categories are being concatenated.
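
As a rough sketch, the table and the two proposed options could be modelled as follows. This is a toy decision function, not pandas' implementation: `concat_result_dtype` is a hypothetical name, the two keyword arguments mirror the options as proposed in this comment only, and the returned strings are just labels for the three outcomes.

```python
def concat_result_dtype(left, right, union_categorical=False, strict=False):
    """Toy model of the proposed table; not pandas' implementation.

    Each operand is None for a non-categorical input, or a
    (categories, ordered) tuple for a categorical one.
    """
    both_categorical = left is not None and right is not None
    if both_categorical and left == right:
        # Identical categories and ordered flag: the dtype is kept.
        return "category"
    if strict:
        # Proposed strict mode: refuse to mix dtypes or categories.
        raise TypeError("different dtypes or categories")
    if both_categorical and union_categorical:
        # Proposed opt-in: merge the categories instead of coercing.
        # (Mismatched ordered flags are ignored here for brevity.)
        return "category"
    return "object (or inferred)"


cat_ab = (("a", "b"), False)
cat_bc = (("b", "c"), False)
print(concat_result_dtype(cat_ab, cat_ab))
print(concat_result_dtype(cat_ab, cat_bc))
print(concat_result_dtype(cat_ab, cat_bc, union_categorical=True))
```

The default path never raises, which matches the stated goal that concat should always work; the two flags only change what happens on a mismatch.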

@sinhrks sinhrks force-pushed the append_categorical branch 4 times, most recently from 693730c to 9402cb1 Compare August 1, 2016 10:03
Categorical Concatenation
^^^^^^^^^^^^^^^^^^^^^^^^^

- A function :func:`union_categorical` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`)

Review comment (Contributor):

categoricals

sinhrks (Member, Author) commented Aug 1, 2016

Updated once based on the discussion. Note that the following fixes are not included:

@sinhrks sinhrks force-pushed the append_categorical branch 4 times, most recently from b7c3b12 to 940a137 Compare August 4, 2016 00:14
@@ -1258,9 +1258,12 @@ def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
join_axes : list of Index objects
Specific indexes to use for the other n - 1 axes instead of performing
inner/outer set logic
verify_integrity : boolean, default False

Review comment (Contributor):

there is a doc-string in merge.rst, can you make sure to update that as well.

jreback (Contributor) commented Sep 3, 2016

ok, let's take out union_categoricals arg to concat for now and put it in a separate PR, @sinhrks ?

wesm (Member) commented Sep 3, 2016

@jreback that sounds good to me. Easier to add it later, and having the extra function is a stopgap for users who need it
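
For readers following along, here is a minimal sketch of the "extra function" route next to a plain concat. It assumes a pandas version where `union_categoricals` is importable from `pandas.api.types` and accepts categorical Series; the exact dtype of the plain concat result may differ between versions, so no expected output is shown.

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series([1, 2], dtype="category")
s2 = pd.Series([3, 4], dtype="category")

# Plain concat: the categories differ, so the result is not categorical.
plain = pd.concat([s1, s2], ignore_index=True)

# Explicit union: combine the categoricals, then wrap the result in a Series.
combined = pd.Series(union_categoricals([s1, s2]))

print(plain.dtype)
print(combined.dtype)
print(list(combined.cat.categories))
```

Keeping the union explicit means a user has to opt in to merging two category sets, while the default concat falls back to a non-categorical dtype.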


.. _categorical.concat:

Concatenation

Review comment (Contributor):

can you add a link with a note in merge.rst to here

jreback (Contributor) commented Sep 5, 2016

@sinhrks can you update

jreback (Contributor) commented Sep 6, 2016

@sinhrks can you update?

jorisvandenbossche (Member) commented

@sinhrks I think the decision was to take out the kwarg for now? (see #13767 (comment) above).
(but maybe that is quite some work ..)

sinhrks (Member, Author) commented Sep 7, 2016

@jorisvandenbossche You're right. Updated to remove union_categoricals kw.


# all category nan-likes => category
s1 = pd.Series([np.nan, np.nan], dtype='category')
s2 = pd.Series([np.nan, np.nan], dtype='category')

Review comment (Contributor):

can u add similar for Datetimes and all NaT?

Reply (Member, Author):

sure, will do today.

jorisvandenbossche (Member) commented

@sinhrks @jreback I am going to merge this, the final comments can be addressed in a follow-up PR (but they are not critical for the rc I think).

@jorisvandenbossche jorisvandenbossche merged commit ab4bd36 into pandas-dev:master Sep 7, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0rc, 0.19.0 Sep 7, 2016
trbs added a commit to trbs/pandas that referenced this pull request Sep 12, 2016
…eter

* github.com:pydata/pandas: (554 commits)
  BUG: compat with Stata ver 111
  Fix: F999 dictionary key '2000q4' repeated with different values (pandas-dev#14198)
  BLD: Test for Python 3.5 with C locale
  BUG: DatetimeTZBlock can't assign values near dst boundary
  BUG: union_categorical with Series and cat idx
  BUG: fix str.contains for series containing only nan values
  BUG: Categorical constructor not idempotent with ext dtype
  TST: Make encoded sep check more locale sensitive (pandas-dev#14161)
  DOC: minor typo in 0.19.0 whatsnew file (pandas-dev#14185)
  BUG: fix tz-aware datetime convert to DatetimeIndex (GH 14088)
  BUG : bug in setting a slice of a Series with a np.timedelta64
  RLS: v0.19.0rc1
  DOC: clean-up 0.19.0 whatsnew file (pandas-dev#14176)
  DOC: cleanup build warnings (pandas-dev#14172)
  Add steps to run gbq integration testing to the contributing docs (pandas-dev#14144)
  ENH: concat and append now can handle unordered categories (pandas-dev#13767)
  DEPR: Deprecate pandas.core.datetools (pandas-dev#14105)
  API/DEPR: Remove +/- as setops for DatetimeIndex/PeriodIndex (GH9630) (pandas-dev#14164)
  Fix trivial typo in comment (pandas-dev#14174)
  API/DEPR: Remove +/- as setops for Index (GH8227) (pandas-dev#14127)
  ...
@sinhrks sinhrks deleted the append_categorical branch January 8, 2017 06:23
Labels
Categorical Categorical Data Type Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

Appending Pandas dataframes in for loop results in ValueError
9 participants