Allow categoricals/enums backed by smaller int dtypes #13109

Open
mcrumiller opened this issue Dec 18, 2023 · 0 comments
Labels
A-dtype-categorical (Area: categorical data type), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@mcrumiller
Contributor

mcrumiller commented Dec 18, 2023

Description

MATLAB's categorical arrays are always backed by the smallest integer type that can hold the number of categories. This isn't necessarily the best implementation, since adding a new category once the limit is reached requires an upcast, but it helps a lot with performance when the number of categories is small. For example, using a u8 to represent the categories saves space.

I was wondering what people thought about either:

  1. Taking the Matlab approach, where the user never touches the underlying type, but that type is dynamic; or
  2. Allowing the underlying int type to be specified during construction, e.g. pl.Categorical(dtype=pl.UInt8).
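
To make the first option concrete, here is a minimal pure-Python sketch of how a backing width could be picked from the category count. It is only an illustration, not how Polars or MATLAB implement it: the helper names `smallest_uint_dtype` and `encode` are hypothetical, and it assumes codes run from 0 to n - 1 with nulls tracked separately rather than by a reserved code.

```python
def smallest_uint_dtype(n_categories: int) -> str:
    """Return the name of the narrowest unsigned int type whose range
    covers the codes 0 .. n_categories - 1 (hypothetical helper)."""
    if n_categories < 0:
        raise ValueError("n_categories must be non-negative")
    for bits in (8, 16, 32, 64):
        # An unsigned type with `bits` bits holds 2**bits distinct codes.
        if n_categories <= 2**bits:
            return f"u{bits}"
    raise OverflowError("too many categories for a 64-bit code")


def encode(values: list[str]) -> tuple[list[str], list[int], str]:
    """Dictionary-encode strings in order of first appearance and report
    the narrowest type that can hold the resulting codes."""
    categories: list[str] = []
    index: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in index:
            # A new category takes the next free code.
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes, smallest_uint_dtype(len(categories))
```

Under these assumptions, 256 categories still fit in a u8, and the 257th is the point where the dynamic approach would need to upcast to a u16.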
Projects
Status: Ready
Development

No branches or pull requests

2 participants