Get category mapping for categoricals #14883

Closed
mcrumiller opened this issue Mar 6, 2024 · 2 comments
Labels
A-dtype-categorical (Area: categorical data type), enhancement (New feature or an improvement of an existing feature)

Comments

Contributor

mcrumiller commented Mar 6, 2024

Description

Although internal implementations should usually remain hidden, categorical data is widely understood to be represented by integers, and this representation is often useful to developers. I propose we expose the rev mapping in the API via the following:

  • Series.cat.get_map - get the map for a categorical series as a dictionary
  • pl.StringCache.get_map - get the global categorical map

This has also come up on Discord a few times. We usually refer people to s.unique(), but that does not always show all of the categories, because it only returns values that are still present in the series.

mcrumiller added the enhancement label Mar 6, 2024
Collaborator

deanm0000 commented Mar 7, 2024

I like them both. For the first one, though, you can already do:

import polars as pl

s = pl.Series("a", ["apple", "banana", "apple", "carrot", "date", "eggplant"], dtype=pl.Categorical)
s = s.filter(s != "carrot")
# with_row_index puts the index first, so each row is (index, category)
{x[1]: x[0] for x in s.cat.get_categories().to_frame().with_row_index("i").iter_rows()}
{'apple': 0, 'banana': 1, 'carrot': 2, 'date': 3, 'eggplant': 4}

This works even though carrot is no longer in s.

Actually, you don't even need to go through a frame. Alternatively, you can do:

{val: i for i, val in enumerate(s.cat.get_categories())}

c-peters added the A-dtype-categorical label Mar 11, 2024
@c-peters
Collaborator

The indices of cat.get_categories() are equal to the physical values. I'm not convinced that we really need a new method for this, as the method by @deanm0000 already gives you the mapping.
