New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/ENH: union Categorical #13361

Closed
wants to merge 6 commits into
base: master
from

Conversation

Projects
None yet
9 participants
@chris-b1
Contributor

chris-b1 commented Jun 4, 2016

I was looking into #10153 (parsing Categoricals directly) and one thing that seems to be needed
for that is a good way to combine Categoricals. That part alone is complicated enough
so I decided to do a separate PR for it.

This adds a union_categoricals function that takes a list of (identical dtyped, unordered)
and combines them without doing a full factorization again, unioning the categories.

This seems like it might be generically useful, does it belong the public api somewhere and/or
as a method a Categorical? Maybe as on option on concat?

Might also be useful for dask? e.g. dask/dask#1040, cc @mrocklin

An example timing below. Obviously depends on the density of the categories,
but for relatively small numbers of categories, seems to generally be in 5-6x
speed up range vs concatting everything and re-categorizing.

from pandas.types.concat import union_categoricals

group1 = ['aaaaa', 'bbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']
group2 = group1[3:] + ['fffffff', 'gggggggg', 'ddddddddd']

a = np.random.choice(group1, 1000000).astype('object')
b = np.random.choice(group2, 1000000).astype('object')

a_cat, b_cat = pd.Categorical(a), pd.Categorical(b)

In [10]: %timeit np.concatenate([a,b,a,b,a])
10 loops, best of 3: 82.7 ms per loop

In [12]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values()]))
1 loops, best of 3: 344 ms per loop

In [14]: %timeit union_categoricals([a_cat,b_cat,a_cat,b_cat,a_cat])
10 loops, best of 3:  40.9 ms per loop
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

see #9927

Contributor

jreback commented Jun 4, 2016

see #9927

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

came out of this post: http://stackoverflow.com/questions/29709918/pandas-and-category-replacement

and originally some discussions with @mrocklin

Contributor

jreback commented Jun 4, 2016

came out of this post: http://stackoverflow.com/questions/29709918/pandas-and-category-replacement

and originally some discussions with @mrocklin

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Jun 4, 2016

Current coverage is 84.23%

Merging #13361 into master will increase coverage by <.01%

@@             master     #13361   diff @@
==========================================
  Files           138        138          
  Lines         50724      50742    +18   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42726      42744    +18   
  Misses         7998       7998          
  Partials          0          0          

Powered by Codecov. Last updated by 103f7d3...ccaeb76

codecov-io commented Jun 4, 2016

Current coverage is 84.23%

Merging #13361 into master will increase coverage by <.01%

@@             master     #13361   diff @@
==========================================
  Files           138        138          
  Lines         50724      50742    +18   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42726      42744    +18   
  Misses         7998       7998          
  Partials          0          0          

Powered by Codecov. Last updated by 103f7d3...ccaeb76

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 4, 2016

Contributor

Yeah, I'm definitely guilty of mostly thinking of Categoricals as a "memory efficient string" type.

If I understand your approach in that answer correct, this is what it would look like to concat? It's faster than re-categorizing the whole set of data, but slower than my union_categoricals because (I think) it has to re-hash every element (unless that's possible to avoid?), where I'm reusing the existing integer codes.

In [50]: %%timeit
    ...: 
    ...: from pandas.types.concat import _concat_categorical
    ...: 
    ...: recoded = []
    ...: new_cats = []
    ...: 
    ...: for i, c in enumerate([a_cat,b_cat,a_cat,b_cat,a_cat]):
    ...:     if i == 0:
    ...:         new_cats = c.categories
    ...:     else:
    ...:         new_cats = c.categories.union(new_cats)
    ...: 
    ...: for c in [a_cat,b_cat,a_cat,b_cat,a_cat]:
    ...:     recoded.append(pd.Categorical(c.get_values(), categories=new_cats))
    ...: 
    ...: _concat_categorical(recoded)
1 loops, best of 3: 299 ms per loop
Contributor

chris-b1 commented Jun 4, 2016

Yeah, I'm definitely guilty of mostly thinking of Categoricals as a "memory efficient string" type.

If I understand your approach in that answer correct, this is what it would look like to concat? It's faster than re-categorizing the whole set of data, but slower than my union_categoricals because (I think) it has to re-hash every element (unless that's possible to avoid?), where I'm reusing the existing integer codes.

In [50]: %%timeit
    ...: 
    ...: from pandas.types.concat import _concat_categorical
    ...: 
    ...: recoded = []
    ...: new_cats = []
    ...: 
    ...: for i, c in enumerate([a_cat,b_cat,a_cat,b_cat,a_cat]):
    ...:     if i == 0:
    ...:         new_cats = c.categories
    ...:     else:
    ...:         new_cats = c.categories.union(new_cats)
    ...: 
    ...: for c in [a_cat,b_cat,a_cat,b_cat,a_cat]:
    ...:     recoded.append(pd.Categorical(c.get_values(), categories=new_cats))
    ...: 
    ...: _concat_categorical(recoded)
1 loops, best of 3: 299 ms per loop
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

Here is a much simpler method along the lines of what I had suggested

In [66]: def f():
   ....:     cats = Index(a_cat.categories.tolist() + b_cat.categories.difference(a_cat.categories).tolist())
   ....:     new_b = b_cat.set_categories(cats)
   ....:     return pd.Categorical.from_codes(np.concatenate([a_cat.codes,new_b.codes]),cats)
   ....: 

In [72]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values()]))
10 loops, best of 3: 110 ms per loop

n [81]: %timeit f()
10 loops, best of 3: 63.4 ms per loop

These are not exactly equal as the categories in the when concatting are in a different order (as they are seen differently; this is the 'expected' result). In fact the 'result' is the correct one here. The categories of all of 1 appear before all of 2.

I am sure this could be optimized further as there are some checks in set_categories that are unecessary for this purpose (that is very generic).

This relies on the fact that you don't need to reset the codes in 1 at all (just extend the categories).

Contributor

jreback commented Jun 4, 2016

Here is a much simpler method along the lines of what I had suggested

In [66]: def f():
   ....:     cats = Index(a_cat.categories.tolist() + b_cat.categories.difference(a_cat.categories).tolist())
   ....:     new_b = b_cat.set_categories(cats)
   ....:     return pd.Categorical.from_codes(np.concatenate([a_cat.codes,new_b.codes]),cats)
   ....: 

In [72]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values()]))
10 loops, best of 3: 110 ms per loop

n [81]: %timeit f()
10 loops, best of 3: 63.4 ms per loop

These are not exactly equal as the categories in the when concatting are in a different order (as they are seen differently; this is the 'expected' result). In fact the 'result' is the correct one here. The categories of all of 1 appear before all of 2.

I am sure this could be optimized further as there are some checks in set_categories that are unecessary for this purpose (that is very generic).

This relies on the fact that you don't need to reset the codes in 1 at all (just extend the categories).

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

In fact I assert that this could be WAY faster all you have to do is this.

we construct a union of the new categories (order actually doesn matter, though it happens to be in order). call this cats (which is what I construct above)
we know that the codes are going to be

In [101]: dict(zip(range(len(cats)),cats))
Out[101]: 
{0: 'aaaaa',
 1: 'bbbbb',
 2: 'cccccc',
 3: 'ddddddd',
 4: 'eeeeeeee',
 5: 'ddddddddd',
 6: 'fffffff',
 7: 'gggggggg'}

Each of the cats that you want to union also has this mapping (but to different codes). All you have to do it map the original code -> new code. Concat then codes and you are done. This is essentially what I am doing above, but It could be a single loop, you can even do it like this.

In [129]: indexer = cats.get_indexer(b_cat.categories)

In [130]: indexer
Out[130]: array([3, 5, 4, 6, 7])

In [131]: new_indexer = np.empty(len(cats),dtype=int)

In [132]: new_indexer.fill(-1)

In [133]: new_indexer[-len(b_cat.categories):] = indexer

In [134]: new_indexer
Out[134]: array([-1, -1, -1,  3,  5,  4,  6,  7])

Now you have a direct map from b_cat codes -> cat codes.

Contributor

jreback commented Jun 4, 2016

In fact I assert that this could be WAY faster all you have to do is this.

we construct a union of the new categories (order actually doesn matter, though it happens to be in order). call this cats (which is what I construct above)
we know that the codes are going to be

In [101]: dict(zip(range(len(cats)),cats))
Out[101]: 
{0: 'aaaaa',
 1: 'bbbbb',
 2: 'cccccc',
 3: 'ddddddd',
 4: 'eeeeeeee',
 5: 'ddddddddd',
 6: 'fffffff',
 7: 'gggggggg'}

Each of the cats that you want to union also has this mapping (but to different codes). All you have to do it map the original code -> new code. Concat then codes and you are done. This is essentially what I am doing above, but It could be a single loop, you can even do it like this.

In [129]: indexer = cats.get_indexer(b_cat.categories)

In [130]: indexer
Out[130]: array([3, 5, 4, 6, 7])

In [131]: new_indexer = np.empty(len(cats),dtype=int)

In [132]: new_indexer.fill(-1)

In [133]: new_indexer[-len(b_cat.categories):] = indexer

In [134]: new_indexer
Out[134]: array([-1, -1, -1,  3,  5,  4,  6,  7])

Now you have a direct map from b_cat codes -> cat codes.

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 4, 2016

Contributor

Thanks. That's actually almost exactly what I'm doing, but my code is way more complex than it needs to be because I'm directly messing with the hash tables, rather than using the index set ops and get_indexer.

Here's a MUCH simpler impl based on your ideas that gets to similar performance

In [105]: to_union = [a_cat,b_cat,a_cat,b_cat,a_cat]

In [106]: %timeit union_categoricals(to_union)
10 loops, best of 3: 53.5 ms per loop

In [107]: def simple_union_cat(to_union):
     ...:     for i, c in enumerate(to_union):
     ...:         if i == 0:
     ...:             cats = c.categories
     ...:         else:
     ...:             cats = cats.union(c.categories)
     ...:     new_codes = []
     ...:     for c in to_union:
     ...:         indexer = cats.get_indexer(c.categories)
     ...:         new_codes.append(indexer.take(c.codes))
     ...:     codes = np.concatenate(new_codes)
     ...:     return pd.Categorical.from_codes(codes, cats)

In [108]: %timeit simple_union_cat(to_union)
10 loops, best of 3: 73.2 ms per loop
Contributor

chris-b1 commented Jun 4, 2016

Thanks. That's actually almost exactly what I'm doing, but my code is way more complex than it needs to be because I'm directly messing with the hash tables, rather than using the index set ops and get_indexer.

Here's a MUCH simpler impl based on your ideas that gets to similar performance

In [105]: to_union = [a_cat,b_cat,a_cat,b_cat,a_cat]

In [106]: %timeit union_categoricals(to_union)
10 loops, best of 3: 53.5 ms per loop

In [107]: def simple_union_cat(to_union):
     ...:     for i, c in enumerate(to_union):
     ...:         if i == 0:
     ...:             cats = c.categories
     ...:         else:
     ...:             cats = cats.union(c.categories)
     ...:     new_codes = []
     ...:     for c in to_union:
     ...:         indexer = cats.get_indexer(c.categories)
     ...:         new_codes.append(indexer.take(c.codes))
     ...:     codes = np.concatenate(new_codes)
     ...:     return pd.Categorical.from_codes(codes, cats)

In [108]: %timeit simple_union_cat(to_union)
10 loops, best of 3: 73.2 ms per loop
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

@chris-b1 great!

I would move the API to core/categorical.py (or maybe types/concat.py)

Contributor

jreback commented Jun 4, 2016

@chris-b1 great!

I would move the API to core/categorical.py (or maybe types/concat.py)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

I think instead of .union on the categories it makes sense to append in order (so make lists out of them as Union sorts)

Contributor

jreback commented Jun 4, 2016

I think instead of .union on the categories it makes sense to append in order (so make lists out of them as Union sorts)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

this should raise if any of the cats are ordered I think

Contributor

jreback commented Jun 4, 2016

this should raise if any of the cats are ordered I think

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 4, 2016

Contributor

prob want an asv for this as well.

Contributor

jreback commented Jun 4, 2016

prob want an asv for this as well.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 5, 2016

Contributor

lgtm. pls add a whatsnew entry. I think maybe a tiny doc-mention is categorical.rst would be good as well.

Contributor

jreback commented Jun 5, 2016

lgtm. pls add a whatsnew entry. I think maybe a tiny doc-mention is categorical.rst would be good as well.

@jreback jreback added this to the 0.18.2 milestone Jun 5, 2016

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 5, 2016

Contributor

@jreback - I updated this with a doc note

Contributor

chris-b1 commented Jun 5, 2016

@jreback - I updated this with a doc note

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 5, 2016

Contributor

cc @janschulz
@jorisvandenbossche

Contributor

jreback commented Jun 5, 2016

cc @janschulz
@jorisvandenbossche

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 6, 2016

Contributor

I'm actually pretty far along (well, in terms of it basically working, not necessarily docs/tests/etc) on #10153, so if you'd prefer to review it all at once, I can push those changes onto this branch.

group = ['aaaaaaaa', 'bbbbbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']
df = pd.DataFrame({'a': np.random.choice(group, 10000000).astype('object'),
                   'b': np.random.choice(group, 10000000).astype('object'),
                   'c': np.random.choice(group, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)


In [1]: %timeit df = pd.read_csv('strings.csv').apply(pd.Categorical);
1 loops, best of 3: 6.39 s per loop

In [2]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loops, best of 3: 3.65 s per loop
Contributor

chris-b1 commented Jun 6, 2016

I'm actually pretty far along (well, in terms of it basically working, not necessarily docs/tests/etc) on #10153, so if you'd prefer to review it all at once, I can push those changes onto this branch.

group = ['aaaaaaaa', 'bbbbbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']
df = pd.DataFrame({'a': np.random.choice(group, 10000000).astype('object'),
                   'b': np.random.choice(group, 10000000).astype('object'),
                   'c': np.random.choice(group, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)


In [1]: %timeit df = pd.read_csv('strings.csv').apply(pd.Categorical);
1 loops, best of 3: 6.39 s per loop

In [2]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loops, best of 3: 3.65 s per loop
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 6, 2016

Contributor

let's do that in a separate branch (u can do it on top of this one)

Contributor

jreback commented Jun 6, 2016

let's do that in a separate branch (u can do it on top of this one)

Show outdated Hide outdated pandas/types/concat.py
for i, c in enumerate(to_union):
if i == 0:
cats = c.categories.tolist()

This comment has been minimized.

@shoyer

shoyer Jun 6, 2016

Member

Use numpy arrays (or pandas indexes) and concatenate them all at once rather than converting things into lists. Something like:

cats = to_union[0].categories
appended_cats = cats.append([c.categories for c in to_union[1:]])
unique_cats = appended_cats.unique()

In the worst case, suppose you have n categoricals of fixed size, each of which has all unique values.

My version takes linear time (and space), whereas your version will require quadratic time -- n index/hash table constructions of O(n) elements each, on average.

@shoyer

shoyer Jun 6, 2016

Member

Use numpy arrays (or pandas indexes) and concatenate them all at once rather than converting things into lists. Something like:

cats = to_union[0].categories
appended_cats = cats.append([c.categories for c in to_union[1:]])
unique_cats = appended_cats.unique()

In the worst case, suppose you have n categoricals of fixed size, each of which has all unique values.

My version takes linear time (and space), whereas your version will require quadratic time -- n index/hash table constructions of O(n) elements each, on average.

This comment has been minimized.

@jreback

jreback Jun 6, 2016

Contributor

you soln won't preseve order, though with a small mod I think it could, need to np.concatenate the categories directly (that's why they were listifed in the first place). Index union doesn't preserve order.

@jreback

jreback Jun 6, 2016

Contributor

you soln won't preseve order, though with a small mod I think it could, need to np.concatenate the categories directly (that's why they were listifed in the first place). Index union doesn't preserve order.

This comment has been minimized.

@jreback

jreback Jun 6, 2016

Contributor

though to be honest I can't imagine this is a perf issue. The entire point of this is that categories is << codes. But I suppose if it can be done with no additional complexity then it should.

@jreback

jreback Jun 6, 2016

Contributor

though to be honest I can't imagine this is a perf issue. The entire point of this is that categories is << codes. But I suppose if it can be done with no additional complexity then it should.

This comment has been minimized.

@chris-b1

chris-b1 Jun 6, 2016

Contributor

Thanks @shoyer. I didn't know pd.unique preserves order (numpy sorts), but seems to be the case. Is that an api guarantee?

In [12]: pd.Index(['b','c']).append(pd.Index(['a'])).unique()
Out[12]: array(['b', 'c', 'a'], dtype=object)
@chris-b1

chris-b1 Jun 6, 2016

Contributor

Thanks @shoyer. I didn't know pd.unique preserves order (numpy sorts), but seems to be the case. Is that an api guarantee?

In [12]: pd.Index(['b','c']).append(pd.Index(['a'])).unique()
Out[12]: array(['b', 'c', 'a'], dtype=object)

This comment has been minimized.

@jreback

jreback Jun 6, 2016

Contributor

@chris-b1 yes it IS a guarantee (and that's why its much faster than np.unique)

@jreback

jreback Jun 6, 2016

Contributor

@chris-b1 yes it IS a guarantee (and that's why its much faster than np.unique)

Show outdated Hide outdated pandas/types/concat.py
indexer = cats.get_indexer(c.categories)
new_codes.append(indexer.take(c.codes))
codes = np.concatenate(new_codes)
return Categorical.from_codes(codes, cats)

This comment has been minimized.

@jankatins

jankatins Jun 6, 2016

Contributor

This still does a validation, maybe change to use the fastpath way? Categorical(codes, categories=categories, ordered=False, fastpath=True)

@jankatins

jankatins Jun 6, 2016

Contributor

This still does a validation, maybe change to use the fastpath way? Categorical(codes, categories=categories, ordered=False, fastpath=True)

This comment has been minimized.

@jreback

jreback Jun 6, 2016

Contributor

maybe we should add a validate=True flag to .from_codes to (optionally) skip the validation?

@jreback

jreback Jun 6, 2016

Contributor

maybe we should add a validate=True flag to .from_codes to (optionally) skip the validation?

This comment has been minimized.

@jankatins

jankatins Jun 6, 2016

Contributor

Basically you would just end up doing two if fastpath checks right after another (first "from_codes, then ininit)...fastpath=True` is deep internal stuff, so why not use the constructor as intended?

@jankatins

jankatins Jun 6, 2016

Contributor

Basically you would just end up doing two if fastpath checks right after another (first "from_codes, then ininit)...fastpath=True` is deep internal stuff, so why not use the constructor as intended?

Show outdated Hide outdated pandas/tests/test_categorical.py
@@ -3943,6 +3943,38 @@ def f():
'category', categories=list('cab'))})
tm.assert_frame_equal(result, expected)
def test_union(self):

This comment has been minimized.

@jankatins

jankatins Jun 6, 2016

Contributor

does that belong to the concat tests or the categoricals? The API is in concat and not in Categorical (which I like... :-) )

@jankatins

jankatins Jun 6, 2016

Contributor

does that belong to the concat tests or the categoricals? The API is in concat and not in Categorical (which I like... :-) )

This comment has been minimized.

@jreback

jreback Jun 6, 2016

Contributor

yeah it could go there

@jreback

jreback Jun 6, 2016

Contributor

yeah it could go there

@jankatins

This comment has been minimized.

Show comment
Hide comment
@jankatins

jankatins Jun 6, 2016

Contributor

Seems that people want Categoricals mostly as memory efficient strings :-)/:-( I like that it is not part of the categorical API but a helper function (see #9927 (comment)).

Contributor

jankatins commented Jun 6, 2016

Seems that people want Categoricals mostly as memory efficient strings :-)/:-( I like that it is not part of the categorical API but a helper function (see #9927 (comment)).

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 6, 2016

Contributor

yeah I think people are really wanting pd.String, but pd.Categorical is working (mostly)

Contributor

jreback commented Jun 6, 2016

yeah I think people are really wanting pd.String, but pd.Categorical is working (mostly)

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 7, 2016

Contributor

Alright, updated for those comments, let me know if you see anything else. Thanks for fastpath suggestion @janschulz, that ends up making a decent performance impact.

Contributor

chris-b1 commented Jun 7, 2016

Alright, updated for those comments, let me know if you see anything else. Thanks for fastpath suggestion @janschulz, that ends up making a decent performance impact.

@jreback jreback changed the title from WIP/API/ENH: union Categorical to API/ENH: union Categorical Jun 7, 2016

@@ -201,6 +201,43 @@ def convert_categorical(x):
return Categorical(concatted, rawcats)
def union_categoricals(to_union):
"""

This comment has been minimized.

@jreback

jreback Jun 7, 2016

Contributor

add a versionadded tag

@jreback

jreback Jun 7, 2016

Contributor

add a versionadded tag

Show outdated Hide outdated pandas/tools/tests/test_concat.py
@@ -919,6 +920,37 @@ def test_concat_keys_with_none(self):
keys=['b', 'c', 'd', 'e'])
tm.assert_frame_equal(result, expected)
def test_union_categorical(self):
# GH 13361
s = Categorical(list('abc'))

This comment has been minimized.

@jreback

jreback Jun 7, 2016

Contributor

add a test that makes sure that the returned categories are in order of appearance in the union (you might have an existing one which checks this accidently, but make an explict one, and mark it)

@jreback

jreback Jun 7, 2016

Contributor

add a test that makes sure that the returned categories are in order of appearance in the union (you might have an existing one which checks this accidently, but make an explict one, and mark it)

Show outdated Hide outdated pandas/types/concat.py
Parameters
----------
to_union : list like of Categorical

This comment has been minimized.

@jreback

jreback Jun 7, 2016

Contributor

add Raises (and list when that happens)

@jreback

jreback Jun 7, 2016

Contributor

add Raises (and list when that happens)

Show outdated Hide outdated pandas/util/testing.py
@@ -963,12 +963,17 @@ def assertNotIsInstance(obj, cls, msg=''):
def assert_categorical_equal(left, right, check_dtype=True,
obj='Categorical'):
obj='Categorical', ignore_order=False):
assertIsInstance(left, pd.Categorical, '[Categorical] ')

This comment has been minimized.

@jreback

jreback Jun 7, 2016

Contributor

can you add a doc-string

@jreback

jreback Jun 7, 2016

Contributor

can you add a doc-string

Show outdated Hide outdated pandas/types/concat.py
for c in to_union):
raise TypeError("dtype of categories must be the same")
unique_cats = unique(np.concatenate([c.categories for c in to_union]))

This comment has been minimized.

@shoyer

shoyer Jun 7, 2016

Member

Is this really safe if categories is an index with a pandas-specific dtype? e.g., datetime64 with timezone or period.

That's why I thought we needed to use the Index.append() method.

@shoyer

shoyer Jun 7, 2016

Member

Is this really safe if categories is an index with a pandas-specific dtype? e.g., datetime64 with timezone or period.

That's why I thought we needed to use the Index.append() method.

This comment has been minimized.

@chris-b1

chris-b1 Jun 7, 2016

Contributor

Oh good point - I'll change that and add some tests.

@chris-b1

chris-b1 Jun 7, 2016

Contributor

Oh good point - I'll change that and add some tests.

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 7, 2016

Contributor

@shoyer, @jreback - updated for your comments

Contributor

chris-b1 commented Jun 7, 2016

@shoyer, @jreback - updated for your comments

Show outdated Hide outdated doc/source/categorical.rst
Unioning
~~~~~~~~
If you want to combine categoricals that do not necessarily have

This comment has been minimized.

@jreback

jreback Jun 7, 2016

Contributor

versionadded tag here

@jreback

jreback Jun 7, 2016

Contributor

versionadded tag here

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 7, 2016

Contributor

can u post timings?

Contributor

jreback commented Jun 7, 2016

can u post timings?

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Jun 7, 2016

Contributor

I also updated my comment at the top. Hard to say if this is "good," but 8x faster than the really naive approach, so that's something.

In [10]: %timeit np.concatenate([a,b,a,b,a])
10 loops, best of 3: 82.7 ms per loop

In [12]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values()]))
1 loops, best of 3: 344 ms per loop

In [14]: %timeit union_categoricals([a_cat,b_cat,a_cat,b_cat,a_cat])
10 loops, best of 3:  40.9 ms per loop
Contributor

chris-b1 commented Jun 7, 2016

I also updated my comment at the top. Hard to say if this is "good," but 8x faster than the really naive approach, so that's something.

In [10]: %timeit np.concatenate([a,b,a,b,a])
10 loops, best of 3: 82.7 ms per loop

In [12]: %timeit pd.Categorical(np.concatenate([a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values(),b_cat.get_values(),
                                                a_cat.get_values()]))
1 loops, best of 3: 344 ms per loop

In [14]: %timeit union_categoricals([a_cat,b_cat,a_cat,b_cat,a_cat])
10 loops, best of 3:  40.9 ms per loop
Show outdated Hide outdated pandas/types/concat.py
if any(c.ordered for c in to_union):
raise TypeError("Can only combine unordered Categoricals")
first = to_union[0]

This comment has been minimized.

@shoyer

shoyer Jun 8, 2016

Member

You should graceful catch the condition that the list of categoricals is empty.

@shoyer

shoyer Jun 8, 2016

Member

You should graceful catch the condition that the list of categoricals is empty.

raise TypeError("dtype of categories must be the same")
cats = first.categories
unique_cats = cats.append([c.categories for c in to_union[1:]]).unique()

This comment has been minimized.

@shoyer

shoyer Jun 8, 2016

Member

.unique() returns a NumPy array, which should fail the tests for non-supported dtypes. I think you need .drop_duplicates() here.

@shoyer

shoyer Jun 8, 2016

Member

.unique() returns a NumPy array, which should fail the tests for non-supported dtypes. I think you need .drop_duplicates() here.

This comment has been minimized.

@shoyer

shoyer Jun 8, 2016

Member

But if the tests pass, I am clearly confused.

@shoyer

shoyer Jun 8, 2016

Member

But if the tests pass, I am clearly confused.

This comment has been minimized.

@chris-b1

chris-b1 Jun 8, 2016

Contributor

The tests will (should!) pass, because I'm wrapping back in an index, and the non-numpy index types have .unique() overloaded to return an Index. But it likely is cleaner to use .drop_duplicates()

@chris-b1

chris-b1 Jun 8, 2016

Contributor

The tests will (should!) pass, because I'm wrapping back in an index, and the non-numpy index types have .unique() overloaded to return an Index. But it likely is cleaner to use .drop_duplicates()

This comment has been minimized.

@jreback

jreback Jun 8, 2016

Contributor

you should just and use pd.unique directly I think
less copies that way too

append is meant for 2 indexes only

@jreback

jreback Jun 8, 2016

Contributor

you should just and use pd.unique directly I think
less copies that way too

append is meant for 2 indexes only

This comment has been minimized.

@chris-b1

chris-b1 Jun 8, 2016

Contributor

Actually drop_duplicates is quite a bit slower, it looks like it's coercing to object rather than using the hash tables that unique does.

In [74]: %timeit Index(a).append(Index(b)).drop_duplicates()
10 loops, best of 3: 197 ms per loop

In [75]: %timeit Index(Index(a).append(Index(b)).unique())
10 loops, best of 3: 71.5 ms per loop
@chris-b1

chris-b1 Jun 8, 2016

Contributor

Actually drop_duplicates is quite a bit slower, it looks like it's coercing to object rather than using the hash tables that unique does.

In [74]: %timeit Index(a).append(Index(b)).drop_duplicates()
10 loops, best of 3: 197 ms per loop

In [75]: %timeit Index(Index(a).append(Index(b)).unique())
10 loops, best of 3: 71.5 ms per loop

This comment has been minimized.

@chris-b1

chris-b1 Jun 8, 2016

Contributor

Just to be clear it's a bit messier than that, idx.unique() returns an Index type for categorical, datetime (tz), and period. But it returns a numpy array for ints/floats/object (maybe this should be changed?), which is why I have to wrap the results in Index again.

@chris-b1

chris-b1 Jun 8, 2016

Contributor

Just to be clear it's a bit messier than that, idx.unique() returns an Index type for categorical, datetime (tz), and period. But it returns a numpy array for ints/floats/object (maybe this should be changed?), which is why I have to wrap the results in Index again.

This comment has been minimized.

@jreback

jreback Jun 8, 2016

Contributor

I think there is an issue for that - if not pls create one

@jreback

jreback Jun 8, 2016

Contributor

I think there is an issue for that - if not pls create one

This comment has been minimized.

@chris-b1

chris-b1 Jun 8, 2016

Contributor
@chris-b1

This comment has been minimized.

@shoyer

shoyer Jun 8, 2016

Member

Oops -- I just opened a new one: #13395

@shoyer

shoyer Jun 8, 2016

Member

Oops -- I just opened a new one: #13395

This comment has been minimized.

@jreback

jreback Jun 8, 2016

Contributor

ok close that new one - I think there is one about series as well - this is an old compat thing

@jreback

jreback Jun 8, 2016

Contributor

ok close that new one - I think there is one about series as well - this is an old compat thing

@sinhrks

This comment has been minimized.

Show comment
Hide comment
@sinhrks

sinhrks Jun 8, 2016

Member

xref #10186. We can fix CategoricalIndex.union as the same after this PR.

Member

sinhrks commented Jun 8, 2016

xref #10186. We can fix CategoricalIndex.union as the same after this PR.

@jreback jreback closed this in d5bea25 Jun 8, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 8, 2016

Contributor

thanks @chris-b1 nice enhancement!

Contributor

jreback commented Jun 8, 2016

thanks @chris-b1 nice enhancement!

@chris-b1 chris-b1 deleted the chris-b1:union-categorical branch Jun 8, 2016

@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Jun 9, 2016

Member

Quick follow-up question: do we really want to encourage users importing this from pandas.types.concat? If we don't want to expose that implementation detail as API, we probably should export it to the main namespace.

Member

shoyer commented Jun 9, 2016

Quick follow-up question: do we really want to encourage users importing this from pandas.types.concat? If we don't want to expose that implementation detail as API, we probably should export it to the main namespace.

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Jun 9, 2016

Contributor

Quick follow-up question: do we really want to encourage users importing this from pandas.types.concat? If we don't want to expose that implementation detail as API, we probably should export it to the main namespace.

👍

Or pd.Categorical.union(*cats) or similar

Contributor

max-sixty commented Jun 9, 2016

Quick follow-up question: do we really want to encourage users importing this from pandas.types.concat? If we don't want to expose that implementation detail as API, we probably should export it to the main namespace.

👍

Or pd.Categorical.union(*cats) or similar

@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Jun 9, 2016

Member

I like the class method pd.Categorical.union(cats)

Member

shoyer commented Jun 9, 2016

I like the class method pd.Categorical.union(cats)

@jankatins

This comment has been minimized.

Show comment
Hide comment
@jankatins

jankatins Jun 9, 2016

Contributor

Why as a class method? As far as I remember we also don't encourage to use Categorical directly but usually as s.astype("category") -> pd.union_categorical(a, b)? [Edit] ok, this doesn't work because the functions does only take Categorical and not Series of type category[/]

[Edit] Or is that analog to the "constructor" Categorical.from_codes(), like Categorical.from_union(*cats)?

Another thought is that the (public) usecase for this is mainly to use categoricals as a memory efficient string type ("I don't care about the meaning of the categories, just concat them"), so I guess if ever there is a pd.String, this function is then mostly unneeded (the internal usecase in reading in cats directly shouldn't need a public API). From that perspective a pandas.types.concat.union_categorical(a,b) is also fine...

[edit]
My new favorite is Categorical.from_union().

BTW: should that method take only Categoricals or also Series of type cat?

Contributor

jankatins commented Jun 9, 2016

Why as a class method? As far as I remember we also don't encourage to use Categorical directly but usually as s.astype("category") -> pd.union_categorical(a, b)? [Edit] ok, this doesn't work because the functions does only take Categorical and not Series of type category[/]

[Edit] Or is that analog to the "constructor" Categorical.from_codes(), like Categorical.from_union(*cats)?

Another thought is that the (public) usecase for this is mainly to use categoricals as a memory efficient string type ("I don't care about the meaning of the categories, just concat them"), so I guess if ever there is a pd.String, this function is then mostly unneeded (the internal usecase in reading in cats directly shouldn't need a public API). From that perspective a pandas.types.concat.union_categorical(a,b) is also fine...

[edit]
My new favorite is Categorical.from_union().

BTW: should that method take only Categoricals or also Series of type cat?

@@ -201,6 +201,57 @@ def convert_categorical(x):
return Categorical(concatted, rawcats)
def union_categoricals(to_union):

This comment has been minimized.

@jankatins

jankatins Jun 9, 2016

Contributor

Add a ignore_order=False kwarg? That would prevent changed/copied orig categoricals if you ever need this... Would only disable the order check

if not ignore_order and any(c.ordered for c in to_union):
        raise TypeError("Can only combine unordered Categoricals")

It would still return a unordered cat, of course.

@jankatins

jankatins Jun 9, 2016

Contributor

Add a ignore_order=False kwarg? That would prevent changed/copied orig categoricals if you ever need this... Would only disable the order check

if not ignore_order and any(c.ordered for c in to_union):
        raise TypeError("Can only combine unordered Categoricals")

It would still return a unordered cat, of course.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 9, 2016

Contributor

actually, maybe we ought to just hijack pd.concat, this could be a fast-path that just calls union_categoricals if all of the inputs are Categoricals in the first place. From an API perspective it feels right, there is a little bit of added code complexity inside Concat though.

Contributor

jreback commented Jun 9, 2016

actually, maybe we ought to just hijack pd.concat, this could be a fast-path that just calls union_categoricals if all of the inputs are Categoricals in the first place. From an API perspective it feels right, there is a little bit of added code complexity inside Concat though.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 9, 2016

Contributor

so I like @janschulz Categorical.from_union() as well

Contributor

jreback commented Jun 9, 2016

so I like @janschulz Categorical.from_union() as well

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Jun 9, 2016

Member

At the moment, pd.concat raises if there are different categories:

In [58]: s1 = pd.Series(list('abcab'), dtype='category')

In [59]: s2 = pd.Series(list('cbdcb'), dtype='category')

In [60]: pd.concat([s1, s2])
ValueError: incompatible categories in categorical concat

So that would have to change to put this functionality in pd.concat. @jreback Is that what you were thinking of? Or provide a keyword in concat to ignore incompatible categories? (although I don't think concat should get a keyword specifically for categoricals).
Personally I am not in favor of changing the behaviour of pd.concat.

Also, if we expose this publicly, I think it should handle categorical Series objects as well (as this is what is encouraged to use, not Categoricals directly), as @janschulz also noted above.

Member

jorisvandenbossche commented Jun 9, 2016

At the moment, pd.concat raises if there are different categories:

In [58]: s1 = pd.Series(list('abcab'), dtype='category')

In [59]: s2 = pd.Series(list('cbdcb'), dtype='category')

In [60]: pd.concat([s1, s2])
ValueError: incompatible categories in categorical concat

So that would have to change to put this functionality in pd.concat. @jreback Is that what you were thinking of? Or provide a keyword in concat to ignore incompatible categories? (although I don't think concat should get a keyword specifically for categoricals).
Personally I am not in favor of changing the behaviour of pd.concat.

Also, if we expose this publicly, I think it should handle categorical Series objects as well (as this is what is encouraged to use, not Categoricals directly), as @janschulz also noted above.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 9, 2016

Contributor

@jorisvandenbossche yes maybe that is too much overloading.

Contributor

jreback commented Jun 9, 2016

@jorisvandenbossche yes maybe that is too much overloading.

@makmanalp

This comment has been minimized.

Show comment
Hide comment
@makmanalp

makmanalp Mar 17, 2017

Contributor

@jreback @jorisvandenbossche So then would the concat workflow be like:

  1. Load df1 and df2
  2. Union all categoricals from df1 and df2
  3. Run astype(new_categorical) on df1 and df2
  4. Concatenate df1 and df2

?

Contributor

makmanalp commented Mar 17, 2017

@jreback @jorisvandenbossche So then would the concat workflow be like:

  1. Load df1 and df2
  2. Union all categoricals from df1 and df2
  3. Run astype(new_categorical) on df1 and df2
  4. Concatenate df1 and df2

?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 17, 2017

Contributor

that looks about right
but can u show an example?

Contributor

jreback commented Mar 17, 2017

that looks about right
but can u show an example?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment