Unordered enum data type #16699

Open
butterlyn opened this issue Jun 3, 2024 · 4 comments
Labels
A-dtype-categorical (Area: categorical data type), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments

@butterlyn

Description

Following on from the suggestion in #16689.

Add a boolean parameter ordered to polars.Enum so that Enum dtypes can be compared irrespective of their category order.

The following should raise no errors:

import polars as pl

assert pl.Enum(categories=["yes", "no"], ordered=False) == pl.Enum(categories=["no", "yes"], ordered=True)
assert pl.Enum(categories=["yes", "no"], ordered=True) != pl.Enum(categories=["no", "yes"], ordered=True)
assert pl.Enum(categories=["yes", "no"], ordered=False) == pl.Enum(categories=["no", "yes"], ordered=False)
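
One possible reading of the proposed comparison rule can be sketched without Polars at all. The EnumSpec class below is purely hypothetical (it is not part of the Polars API) and assumes that category order is ignored whenever at least one side is unordered, and respected only when both sides are ordered:

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)
class EnumSpec:
    """Hypothetical stand-in for an Enum dtype with an `ordered` flag."""

    categories: tuple
    ordered: bool = True

    def __eq__(self, other):
        if not isinstance(other, EnumSpec):
            return NotImplemented
        if self.ordered and other.ordered:
            # Both sides care about order: compare position by position.
            return self.categories == other.categories
        # At least one side is unordered: compare the category sets only.
        return set(self.categories) == set(other.categories)

    def __hash__(self):
        # Hash on the category set so that equal objects hash equally.
        return hash(frozenset(self.categories))


print(EnumSpec(("yes", "no"), ordered=False) == EnumSpec(("no", "yes")))
print(EnumSpec(("yes", "no")) == EnumSpec(("no", "yes")))
```

Under this reading the equality is not transitive: an unordered spec equals two ordered specs that are not equal to each other, which is something an actual implementation would probably need to decide on.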

Example use case - unit testing

The intended purpose is to allow for defining an unordered pl.Enum in unit tests which can be used in columns of a DataFrame/LazyFrame supplied to polars.testing.assert_frame_equal. The idea is that the unit test should check that the correct pl.Enum is cast to the correct columns without caring about the order of the enum categories defined in the source code.

For example, for my_module:

import polars as pl

source_code_dataframe = pl.DataFrame(
    data={
        "enum_column": ["yes", "yes", "no"]
    },
    schema={
        "enum_column": pl.Enum(["no", "yes"]),  # defaults to ordered=True
    }
)

We could write a unit test:

import polars as pl
from polars.testing import assert_frame_equal

from my_module import source_code_dataframe

def test_dataframe():
    expected = pl.DataFrame(
        data={
            "enum_column": ["yes", "yes", "no"]
        },
        schema={
            "enum_column": pl.Enum(["yes", "no"], ordered=False),  # specify that different enum category order shouldn't raise a dtype error
        }
    )
    assert_frame_equal(source_code_dataframe, expected)

@butterlyn added the enhancement label on Jun 3, 2024
@stinodego added the accepted and A-dtype-categorical labels on Jun 3, 2024

@stinodego
Member

stinodego commented Jun 3, 2024

If you just need this for unit tests, you can just write your own assertion util for this, e.g. cast your Enums to strings before checking equality, and then check Enum categories separately.

Still this is an important variant of categorical data that we should support.

@butterlyn
Author

Thanks @stinodego, yeah I mostly need this for unit testing

If you recommend casting Enums to strings when using polars.testing.assert_frame_equal, perhaps you'd consider reopening this: #16075? 😁 No fuss if not, happy to make do in the meantime since unordered enum can serve the same purpose

@stinodego
Member

> If you recommend casting Enums to strings when using polars.testing.assert_frame_equal, perhaps you'd consider reopening this: #16075? 😁 No fuss if not, happy to make do in the meantime since unordered enum can serve the same purpose

No, because the point I made there still stands :)

@mcrumiller
Contributor

@butterlyn for now you can just make sure to always sort your categories before creating your Enum dtype.

import polars as pl
from polars.testing import assert_series_equal

def sorted_enum(categories):
    return pl.Enum(sorted(categories))

assert_series_equal(
    pl.Series(["a"], dtype=sorted_enum(["a", "b", "c"])),
    pl.Series(["a"], dtype=sorted_enum(["b", "c", "a"])),  # different order
)
