List(Categorical/Enum).contains(Categorical/Enum) is broken since 0.20.2 #14559

Closed
2 tasks done
Object905 opened this issue Feb 17, 2024 · 1 comment · Fixed by #14744
Labels: A-dtype-categorical (Area: categorical data type), accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@Object905
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# pl.enable_string_cache()
# exception with both - global string cache and without

df = pl.DataFrame([(["a", "b"], "c"), (["a", "b"], "a")], schema={"li": pl.List(pl.Categorical), "x": pl.Categorical})

print(df.select(pl.col("li").list.contains(pl.col("x").cast(pl.Utf8))))
# ok

print(df.select(pl.col("li").list.contains(pl.col("x"))))
# polars.exceptions.InvalidOperationError: is_in operation not supported for dtypes `cat` and `list[cat]`


enum = pl.Enum(["a", "b", "c"])

df = pl.DataFrame([(["a", "b"], "c"), (["a", "b"], "a")], schema={"li": pl.List(enum), "x": pl.Categorical})


print(df.select(pl.col("li").list.contains(pl.col("x").cast(pl.Utf8))))
# ok

print(df.select(pl.col("li").list.contains(pl.col("x"))))
# polars.exceptions.InvalidOperationError: is_in operation not supported for dtypes `cat` and `list[enum]`

Log output

No response

Issue description

list.contains is broken for the Categorical-Categorical, Enum-Enum and Categorical-Enum combinations.
The error occurs both with and without pl.enable_string_cache().

Expected behavior

It should work, as it did in 0.20.2.

Installed versions

--------Version info---------
Polars:               0.20.9
Index type:           UInt32
Platform:             Linux-6.7.4-arch1-1-x86_64-with-glibc2.39
Python:               3.12.1 (main, Feb  5 2024, 17:00:58) [GCC 13.2.1 20230801]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               24.2.1
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.27
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9
@Object905 Object905 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 17, 2024
@stinodego stinodego added the A-dtype-categorical Area: categorical data type label Feb 17, 2024
@mcrumiller
Contributor

mcrumiller commented Feb 18, 2024

There are a few missing is_in implementations:

  • String is in Categorical
  • String is in Enum

These also fail, but perhaps they should:

  • Categorical is in Enum
  • Enum is in Categorical

I don't know if those last two should be implemented. Both Categorical and Enums should coerce strings into their revmap representation, but if your data is already a Categorical and/or Enum, then it means that you have explicitly placed them into particular sets, and those sets are not always comparable.
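
To illustrate why already-encoded values are not always comparable, here is a toy model in plain Python. It is not how Polars implements Categorical or Enum; the `encode` and `decode` helpers and the category orders are made up for the example. It only shows that integer codes from two different category mappings can collide, so comparing them directly can give a wrong answer, while comparing the decoded strings does not.

```python
# Toy model only: NOT Polars internals. Each "column" is a list of integer
# codes plus the category order that gives those codes their meaning.

def encode(values, categories):
    """Encode strings as integer codes using the position in `categories`."""
    code_of = {cat: i for i, cat in enumerate(categories)}
    return [code_of[v] for v in values]

def decode(codes, categories):
    """Map integer codes back to their strings."""
    return [categories[c] for c in codes]

# Hypothetical category orders for two independently built columns.
list_categories = ["a", "b"]   # "a" -> 0, "b" -> 1
value_categories = ["b", "a"]  # "b" -> 0, "a" -> 1

list_codes = encode(["a"], list_categories)    # the list ["a"]
value_codes = encode(["b"], value_categories)  # the value "b"

# Comparing raw codes: both are 0, so this claims "b" is in ["a"].
naive_match = value_codes[0] in list_codes

# Comparing decoded strings gives the correct answer.
decoded_match = (
    decode(value_codes, value_categories)[0]
    in decode(list_codes, list_categories)
)

print(naive_match, decoded_match)
```

Casting one side to a string before the comparison, as in the `cast(pl.Utf8)` lines of the reproducible example, corresponds to the decoded comparison here.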

I do think we could benefit from a slightly more informative error message than, for example:

polars.exceptions.InvalidOperationError: `is_in` cannot check for Enum(Some(local), Physical) values in String data

e.g. explain why it cannot check. Also the enum innards are probably not very useful in the error message.
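
As a rough sketch of the kind of message suggested above, the hypothetical helper below builds an error string from plain dtype names and a reason. The function name, its parameters, and the wording are assumptions for illustration only; none of this is Polars API or the message Polars actually emits.

```python
# Hypothetical helper, not part of Polars: formats an `is_in` error that
# names the dtypes plainly and states why the check is rejected.

def is_in_error_message(value_dtype: str, container_dtype: str, reason: str) -> str:
    """Return an error message without internal dtype details."""
    return (
        f"`is_in` cannot check for {value_dtype} values in "
        f"{container_dtype} data: {reason}"
    )

message = is_in_error_message(
    "Enum",
    "Categorical",
    "the two types may use different category sets",
)
print(message)
```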
