Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any categorical dtype object's .unique() changes categories #18291

Closed
toobaz opened this issue Nov 14, 2017 · 7 comments · Fixed by #38140
Closed

Any categorical dtype object's .unique() changes categories #18291

toobaz opened this issue Nov 14, 2017 · 7 comments · Fixed by #38140
Labels
API Design Bug Categorical Categorical Data Type
Milestone

Comments

@toobaz
Copy link
Member

toobaz commented Nov 14, 2017

Code Sample, a copy-pastable example if possible

In [2]: pd.Categorical([1,3,2,4], categories=[4,2,3,1,5]).unique()
Out[2]: 
[1, 3, 2, 4]
Categories (4, int64): [1, 3, 2, 4]

same happens with CategoricalIndex and Series of dtype=category (as long as ordered=False).

Problem description

The general pandas approach to categoricals seems to change the .categories field as seldom as possible. So I see no reason why unique(), which by definition does not affect the set of distinct values, should change the categories (reordering them based on appearance, and dropping unused ones).

Expected Output

In [2]: pd.Categorical([1,3,2,4], categories=[4,2,3,1,5]).unique()
Out[2]: 
[1, 3, 2, 4]
Categories (4, int64): [4, 2, 3, 1, 5]

Output of pd.show_versions()

In [3]: pd.show_versions()

INSTALLED VERSIONS

commit: 63e8527
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+131.g63e8527d3
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@jschendel
Copy link
Member

You can maintain the order of used categories by passing ordered=True. Based on the docstring, dropping unused categories is intentional, though I'm not sure of the exact reasoning:

def unique(self):
"""
Return the ``Categorical`` which ``categories`` and ``codes`` are
unique. Unused categories are NOT returned.
- unordered category: values and categories are sorted by appearance
order.
- ordered category: values are sorted by appearance order, categories
keeps existing order.

Seems like this could be better documented. I don't think that docstring is displayed anywhere in the documentation/api reference? The best I could find was an example in a note at the end of the working with categories section of the documentation, and a couple examples in the api reference for pandas.unique.

@toobaz
Copy link
Member Author

toobaz commented Nov 14, 2017

Based on the docstring, dropping unused categories is intentional, though I'm not sure of the exact reasoning:

Indeed, I should have clarified that I'm not claiming the behavior is in contrast with the docs: I just see no justification for it, and - unless there is any which I'm missing - I suggest to change it.

For the records: this came from a test in #17897, and in particular from the fact that CategoricalIndex.drop_duplicates() does not change the categories.

@toobaz
Copy link
Member Author

toobaz commented Nov 14, 2017

By the way: I think "Return the Categorical which categories and codes are unique." (from the docstring) doesn't make much sense as the categories are always unique, right?

@jorisvandenbossche
Copy link
Member

It seems this change was done in #10508 without much discussion. So we should check the interaction with groupby on categorical columns

@jorisvandenbossche
Copy link
Member

I actually raised this before here #15939 (comment), but that discussion was mainly closed by referring to the docs and saying that we could further discuss it in a new issue, which you now opened :-)

This one might also be relevant to look at: #15439 (as I said above, we should explore the interaction with groupby)

@toobaz
Copy link
Member Author

toobaz commented Nov 15, 2017

we should explore the interaction with groupby

Sure this might be a good occasion to discuss (more for consistency than for the actual implementation, which is relatively simple to adapt).

Current behavior.

  • if sort=True, then the categories order is followed, be them ordered=True or not
  • if sort=False, then if categories are ordered=True, then their order is used (except that unused categories go at the end), while if ordered=False, then the order of appearance is used (again, with unused categories at the end)

While I wouldn't change the behavior with sort=True (which is the default), I see the following possibilities for sort=False:

  1. leaving everything as it is
  2. in the sort=False, ordered=True case, follow the order of appearance rather than the order of categories
  3. in the sort=False, ordered=False case, follow the order of categories rather than the order of appearance

1 seems to me a bit inconsistent, and I have a slight preference (for coherence with other dtypes) for 2 over 3 (despite 3. being potentially slightly more efficient). Anyway, in those cases (if any) in which sort=False and we follow the order of categories, I suggest to place unused categories at their natural place rather than at the end, which doesn't make any sense to me.

@toobaz
Copy link
Member Author

toobaz commented Nov 15, 2017

Uhm... 3. would reopen #8868 , so we probably want to avoid it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants