Any categorical dtype object's .unique() changes categories #18291

toobaz · 2017-11-14T20:46:48Z

Code Sample, a copy-pastable example if possible

In [2]: pd.Categorical([1,3,2,4], categories=[4,2,3,1,5]).unique()
Out[2]: 
[1, 3, 2, 4]
Categories (4, int64): [1, 3, 2, 4]

same happens with CategoricalIndex and Series of dtype=category (as long as ordered=False).

Problem description

The general pandas approach to categoricals seems to change the .categories field as seldom as possible. So I see no reason why unique(), which by definition does not affect the set of distinct values, should change the categories (reordering them based on appearance, and dropping unused ones).

Expected Output

In [2]: pd.Categorical([1,3,2,4], categories=[4,2,3,1,5]).unique()
Out[2]: 
[1, 3, 2, 4]
Categories (4, int64): [4, 2, 3, 1, 5]

Output of `pd.show_versions()`

In [3]: pd.show_versions()

INSTALLED VERSIONS

commit: 63e8527
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+131.g63e8527d3
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

The text was updated successfully, but these errors were encountered:

jschendel · 2017-11-14T21:38:19Z

You can maintain the order of used categories by passing ordered=True. Based on the docstring, dropping unused categories is intentional, though I'm not sure of the exact reasoning:

pandas/pandas/core/categorical.py

Lines 2018 to 2026 in 63e8527

    
               def unique(self): 
        
                   """ 
        
                   Return the ``Categorical`` which ``categories`` and ``codes`` are 
        
                   unique. Unused categories are NOT returned. 
        
                   - unordered category: values and categories are sorted by appearance 
        
                     order. 
        
                   - ordered category: values are sorted by appearance order, categories 
        
                     keeps existing order.

Seems like this could be better documented. I don't think that docstring is displayed anywhere in the documentation/api reference? The best I could find was an example in a note at the end of the working with categories section of the documentation, and a couple examples in the api reference for pandas.unique.

toobaz · 2017-11-14T21:48:15Z

Based on the docstring, dropping unused categories is intentional, though I'm not sure of the exact reasoning:

Indeed, I should have clarified that I'm not claiming the behavior is in contrast with the docs: I just see no justification for it, and - unless there is any which I'm missing - I suggest to change it.

For the records: this came from a test in #17897, and in particular from the fact that CategoricalIndex.drop_duplicates() does not change the categories.

toobaz · 2017-11-14T21:51:34Z

By the way: I think "Return the Categorical which categories and codes are unique." (from the docstring) doesn't make much sense as the categories are always unique, right?

jorisvandenbossche · 2017-11-14T22:08:03Z

It seems this change was done in #10508 without much discussion. So we should check the interaction with groupby on categorical columns

jorisvandenbossche · 2017-11-14T22:15:05Z

I actually raised this before here #15939 (comment), but that discussion was mainly closed by referring to the docs and saying that we could further discuss it in a new issue, which you now opened :-)

This one might also be relevant to look at: #15439 (as I said above, we should explore the interaction with groupby)

toobaz · 2017-11-15T08:58:38Z

we should explore the interaction with groupby

Sure this might be a good occasion to discuss (more for consistency than for the actual implementation, which is relatively simple to adapt).

Current behavior.

if sort=True, then the categories order is followed, be them ordered=True or not
if sort=False, then if categories are ordered=True, then their order is used (except that unused categories go at the end), while if ordered=False, then the order of appearance is used (again, with unused categories at the end)

While I wouldn't change the behavior with sort=True (which is the default), I see the following possibilities for sort=False:

leaving everything as it is
in the sort=False, ordered=True case, follow the order of appearance rather than the order of categories
in the sort=False, ordered=False case, follow the order of categories rather than the order of appearance

1 seems to me a bit inconsistent, and I have a slight preference (for coherence with other dtypes) for 2 over 3 (despite 3. being potentially slightly more efficient). Anyway, in those cases (if any) in which sort=False and we follow the order of categories, I suggest to place unused categories at their natural place rather than at the end, which doesn't make any sense to me.

toobaz · 2017-11-15T09:38:36Z

Uhm... 3. would reopen #8868 , so we probably want to avoid it.

toobaz mentioned this issue Nov 14, 2017

API: add "level=" argument to MultiIndex.unique() #17897

Merged

4 tasks

gfyoung added API Design Categorical Categorical Data Type labels Nov 16, 2017

topper-123 mentioned this issue Jul 18, 2018

API: Categorical.unique() should not drop unused categories #21648

Closed

h-vetinari mentioned this issue Sep 24, 2018

API/ENH: overhaul/unify/improve .unique #22824

Open

6 tasks

mroeschke added the Bug label Jun 28, 2020

This was referenced Nov 28, 2020

ENH: Categorical.unique can keep same dtype #38135

Closed

BUG: Categorical.unique should keep dtype unchanged #38140

Merged

jreback modified the milestones: 1.2, 1.3 Dec 2, 2020

jreback closed this as completed in #38140 Apr 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any categorical dtype object's .unique() changes categories #18291

Any categorical dtype object's .unique() changes categories #18291

toobaz commented Nov 14, 2017

INSTALLED VERSIONS

jschendel commented Nov 14, 2017

toobaz commented Nov 14, 2017

toobaz commented Nov 14, 2017

jorisvandenbossche commented Nov 14, 2017

jorisvandenbossche commented Nov 14, 2017

toobaz commented Nov 15, 2017

toobaz commented Nov 15, 2017 •

edited

Loading

Any categorical dtype object's .unique() changes categories #18291

Any categorical dtype object's .unique() changes categories #18291

Comments

toobaz commented Nov 14, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jschendel commented Nov 14, 2017

toobaz commented Nov 14, 2017

toobaz commented Nov 14, 2017

jorisvandenbossche commented Nov 14, 2017

jorisvandenbossche commented Nov 14, 2017

toobaz commented Nov 15, 2017

toobaz commented Nov 15, 2017 • edited Loading

Output of `pd.show_versions()`

toobaz commented Nov 15, 2017 •

edited

Loading