New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: union_categoricals w/Series & CategoricalIndex #14173

Closed
jreback opened this Issue Sep 7, 2016 · 8 comments

Comments

Projects
None yet
3 participants
@jreback
Contributor

jreback commented Sep 7, 2016

In [25]: from pandas.types.concat import union_categoricals

In [26]: s1 = pd.Series(['a', 'b'], dtype='category')
    ...: s3 = pd.Series(['a', 'b', 'c'], dtype='category')

In [27]: s1
Out[27]: 
0    a
1    b
dtype: category
Categories (2, object): [a, b]

In [28]: s3
Out[28]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

I don't see why this shouldn't work. This should return a Categorical.

In [29]: union_categoricals([s1,s3])
AttributeError: 'Series' object has no attribute 'categories'

actual categoricals are combinable

In [30]: union_categoricals([s1.values,s3.values])
Out[30]: 
[a, b, a, b, c]
Categories (3, object): [a, b, c]

This works fine

In [31]: union_categoricals([s1.values,pd.CategoricalIndex(s3.values)])
Out[31]: 
[a, b, a, b, c]
Categories (3, object): [a, b, c]

But this is broken

In [32]: union_categoricals([pd.CategoricalIndex(s1.values),pd.CategoricalIndex(s3.values)])
AttributeError: 'CategoricalIndex' object has no attribute 'is_dtype_equal'
@jreback

This comment has been minimized.

Show comment
Hide comment
Contributor

jreback commented Sep 7, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 7, 2016

Contributor

xref to #13767 though this is independent

Contributor

jreback commented Sep 7, 2016

xref to #13767 though this is independent

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 7, 2016

Contributor

further I think we need a doc-note at the end of the unionining section in categorical.rst. Showing that when unioned categoricals are re-coded. Even though this is technically a dtype implementation detail that the categoricals hide, unioning re-codes, so the user should be aware.

In [16]: c1 = pd.Categorical(['a', 'b'])
    ...: c2 = pd.Categorical(['b', 'c'])

In [17]: c1
Out[17]: 
[a, b]
Categories (2, object): [a, b]

# in c1, b is coded to 1
In [20]: c1.codes
Out[20]: array([0, 1], dtype=int8)

In [21]: c2
Out[21]: 
[b, c]
Categories (2, object): [b, c]

# in c2, b is coded to 0
In [22]: c2.codes
Out[22]: array([0, 1], dtype=int8)

In [23]: from pandas.types.concat import union_categoricals

# b is now re-coded to 1 (same as c2, but not the same as c1)
In [24]: union_categoricals([c1,c2])
Out[24]: 
[a, b, b, c]
Categories (3, object): [a, b, c]

In [25]: union_categoricals([c1,c2]).codes
Out[25]: array([0, 1, 1, 2], dtype=int8)

note that some users, @mrocklin (dask), actually like/want this!

Contributor

jreback commented Sep 7, 2016

further I think we need a doc-note at the end of the unionining section in categorical.rst. Showing that when unioned categoricals are re-coded. Even though this is technically a dtype implementation detail that the categoricals hide, unioning re-codes, so the user should be aware.

In [16]: c1 = pd.Categorical(['a', 'b'])
    ...: c2 = pd.Categorical(['b', 'c'])

In [17]: c1
Out[17]: 
[a, b]
Categories (2, object): [a, b]

# in c1, b is coded to 1
In [20]: c1.codes
Out[20]: array([0, 1], dtype=int8)

In [21]: c2
Out[21]: 
[b, c]
Categories (2, object): [b, c]

# in c2, b is coded to 0
In [22]: c2.codes
Out[22]: array([0, 1], dtype=int8)

In [23]: from pandas.types.concat import union_categoricals

# b is now re-coded to 1 (same as c2, but not the same as c1)
In [24]: union_categoricals([c1,c2])
Out[24]: 
[a, b, b, c]
Categories (3, object): [a, b, c]

In [25]: union_categoricals([c1,c2]).codes
Out[25]: array([0, 1, 1, 2], dtype=int8)

note that some users, @mrocklin (dask), actually like/want this!

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 7, 2016

Member

Yep, I agree that union_categoricals should handle categorical serieses

Member

jorisvandenbossche commented Sep 7, 2016

Yep, I agree that union_categoricals should handle categorical serieses

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Sep 8, 2016

Contributor

How should the index be handled in the Series case - thrown away, or add an ignore_index kw to support what concat does?

In [1]: pd.concat([pd.Series([1,2,3]), pd.Series([3,4,5])])
Out[1]: 
0    1
1    2
2    3
0    3
1    4
2    5
dtype: int64

In [2]: pd.concat([pd.Series([1,2,3]), pd.Series([3,4,5])], ignore_index=True)
Out[2]: 
0    1
1    2
2    3
3    3
4    4
5    5
dtype: int64
Contributor

chris-b1 commented Sep 8, 2016

How should the index be handled in the Series case - thrown away, or add an ignore_index kw to support what concat does?

In [1]: pd.concat([pd.Series([1,2,3]), pd.Series([3,4,5])])
Out[1]: 
0    1
1    2
2    3
0    3
1    4
2    5
dtype: int64

In [2]: pd.concat([pd.Series([1,2,3]), pd.Series([3,4,5])], ignore_index=True)
Out[2]: 
0    1
1    2
2    3
3    3
4    4
5    5
dtype: int64
@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 8, 2016

Member

I would do it similar as concat does (although that in many cases the ignore_index=True case would be a better default ...)

Member

jorisvandenbossche commented Sep 8, 2016

I would do it similar as concat does (although that in many cases the ignore_index=True case would be a better default ...)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 8, 2016

Contributor

I disagree, I think this should always return a Categorical. pd.concat is designed to return Series/DataFrame, but union_categoricals specifically is meant for returning a Categorical.

Until we add this functionaility to pd.concat, concat of Series is a big tricky, but I don't think we should change the api of this function (otherwise it just becomes another concat)

Contributor

jreback commented Sep 8, 2016

I disagree, I think this should always return a Categorical. pd.concat is designed to return Series/DataFrame, but union_categoricals specifically is meant for returning a Categorical.

Until we add this functionaility to pd.concat, concat of Series is a big tricky, but I don't think we should change the api of this function (otherwise it just becomes another concat)

@chris-b1

This comment has been minimized.

Show comment
Hide comment
@chris-b1

chris-b1 Sep 8, 2016

Contributor

Oh, you wouldn't even box it back in the original type? I guess that makes sense as that usecase should really be handled by concat once the api is figured out.

Contributor

chris-b1 commented Sep 8, 2016

Oh, you wouldn't even box it back in the original type? I guess that makes sense as that usecase should really be handled by concat once the api is figured out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment