Polars returns the incorrect number of elements in a list after calling .unique() #14834

Closed
2 tasks done
adeboyed opened this issue Mar 4, 2024 · 3 comments
Labels
bug Something isn't working invalid A bug report that is not actually a bug needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

adeboyed commented Mar 4, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.DataFrame([ pl.Series("col", [['A', 'U'], ['U', 'U']], dtype=pl.List(pl.Utf8)) ])
print(df.with_columns(pl.col('col').list.unique().len().alias('col uniq count')))

Log output

shape: (2, 2)
┌────────────┬────────────────┐
│ col        ┆ col uniq count │
│ ---        ┆ ---            │
│ list[str]  ┆ u32            │
╞════════════╪════════════════╡
│ ["A", "U"] ┆ 2              │
│ ["U", "U"] ┆ 2              │
└────────────┴────────────────┘

Issue description

Running df.with_columns(pl.col('col').list.unique().alias('col uniq')) shows that list.unique() itself returns the correct result:

shape: (2, 2)
┌────────────┬────────────┐
│ col        ┆ col uniq   │
│ ---        ┆ ---        │
│ list[str]  ┆ list[str]  │
╞════════════╪════════════╡
│ ["A", "U"] ┆ ["A", "U"] │
│ ["U", "U"] ┆ ["U"]      │
└────────────┴────────────┘

Even chaining the two steps as separate calls does not help:

df.with_columns(pl.col('col').list.unique().alias('col uniq'))\
.with_columns(
    pl.col('col uniq').len().alias('col uniq count')
)

Output:

shape: (2, 3)
┌────────────┬────────────┬────────────────┐
│ col        ┆ col uniq   ┆ col uniq count │
│ ---        ┆ ---        ┆ ---            │
│ list[str]  ┆ list[str]  ┆ u32            │
╞════════════╪════════════╪════════════════╡
│ ["A", "U"] ┆ ["A", "U"] ┆ 2              │
│ ["U", "U"] ┆ ["U"]      ┆ 2              │
└────────────┴────────────┴────────────────┘

Expected behavior

shape: (2, 2)
┌────────────┬────────────────┐
│ col        ┆ col uniq count │
│ ---        ┆ ---            │
│ list[str]  ┆ u32            │
╞════════════╪════════════════╡
│ ["A", "U"] ┆ 2              │
│ ["U", "U"] ┆ 1              │
└────────────┴────────────────┘

Installed versions

--------Version info---------
Polars:               0.20.13
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  0.9.0
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.25.2
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.20
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@adeboyed adeboyed added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 4, 2024
MarcoGorelli (Collaborator) commented Mar 4, 2024

You're looking for list.len, not len.

len just returns the length of the Series. Since that is a single value, it gets broadcast to the height of the DataFrame (see https://docs.pola.rs/user-guide/concepts/contexts/#selection).

import polars as pl
df = pl.DataFrame([ pl.Series("col", [['A', 'U'], ['U', 'U']], dtype=pl.List(pl.Utf8)) ])
print(df.with_columns(pl.col('col').list.unique().list.len().alias('col uniq count')))
shape: (2, 2)
┌────────────┬────────────────┐
│ col        ┆ col uniq count │
│ ---        ┆ ---            │
│ list[str]  ┆ u32            │
╞════════════╪════════════════╡
│ ["A", "U"] ┆ 2              │
│ ["U", "U"] ┆ 1              │
└────────────┴────────────────┘

@MarcoGorelli MarcoGorelli added the invalid A bug report that is not actually a bug label Mar 4, 2024
MarcoGorelli (Collaborator) commented

closing then, but thanks for the issue! lmk if i've misunderstood and I can reopen

adeboyed (Author) commented Mar 4, 2024

@MarcoGorelli Thanks so much for identifying my issue and replying so quickly.
