Confusion about change from `pl.count()` to `pl.len()` #14498

stevenlis · 2024-02-14T23:22:19Z

Description

The upgrade guide indicates that:

Note that pl.count() and group_by(...).count() are unchanged. These count the number of rows in the context, so nulls are not applicable in the same way.

But based on my testing:

Both pl.count() and Expr.count() now behave the same way, ignoring nulls. This behavior is also mentioned in the API reference of pl.count(), which states, "This function is syntactic sugar for col(columns).count()."
group_by().agg(pl.count()) is getting a "DeprecationWarning: pl.count() is deprecated. Please use pl.len() instead."

Meanwhile, I don't know why we can use group_by().agg(pl.count()) as @mcrumiller mentioned here. This seems counterintuitive to me. Having both pl.len() and pl.count() creates inconsistency and confusion. Instead of two separate functions/expressions, why not have a single function/expression with an ignore_nulls argument?

Link

https://docs.pola.rs/releases/upgrade/0.20/#count-now-ignores-null-values


ritchie46 · 2024-02-14T23:36:17Z

What is counter-intuitive? We don't have a notion of null rows, so we cannot add an ignore_nulls argument. The two behave differently, and count will not be exposed as a general catch-all.

That's why we only expose it on Expr, while pl.len() will be exposed as the catch-all. They are different for a reason.

mcrumiller · 2024-02-14T23:36:25Z

Everything conforms to how the documentation specifies.

import polars as pl
from polars import col

df = pl.DataFrame({"a": [1, 2, None]})

df.select(pl.count())              # returns 3 (counts all rows)
df.select(pl.count("a"))           # returns 2 (counts non-nulls only)
df.select(col("a").count())        # returns 2 (counts non-nulls only)
df.group_by(pl.lit(True)).count()  # returns 3 (counts all rows)

We cannot ignore nulls in group_by().count() because there is no concept of a null row. Is [0, 1, null] a null row? Is [null] a null row? (It's not the same as an empty row, which would be [].) A group_by operation returns a frame, not a row, so we cannot say how many non-null rows there are. Does this make sense?

I agree that it is confusing. It is self-consistent, but it is nonetheless confusing.

stevenlis · 2024-02-15T00:01:49Z

@ritchie46 @mcrumiller Thanks for the explanations. It seems like I missed the following statement in the doc.

This way of using the function is deprecated. Please use :func:len instead.

stevenlis added the documentation Improvements or additions to documentation label Feb 14, 2024

stevenlis closed this as completed Feb 15, 2024
