Confusion about change from pl.count() to pl.len() #14498

Closed
stevenlis opened this issue Feb 14, 2024 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@stevenlis

Description

The upgrade guide indicates that:

Note that pl.count() and group_by(...).count() are unchanged. These count the number of rows in the context, so nulls are not applicable in the same way.

But based on my testing:

  • Both pl.count() and Expr.count() now behave the same way, ignoring nulls. This behavior is also mentioned in the API reference of pl.count(), which states, "This function is syntactic sugar for col(columns).count()."
  • group_by().agg(pl.count()) is getting a "DeprecationWarning: pl.count() is deprecated. Please use pl.len() instead."

Meanwhile, I don't know why we can use group_by().agg(pl.count()), as @mcrumiller mentioned here. This seems counterintuitive to me. Having both pl.len() and pl.count() creates inconsistency and confusion. Instead of two separate functions/expressions, why not have a single function/expression with an ignore_nulls argument?

Link

https://docs.pola.rs/releases/upgrade/0.20/#count-now-ignores-null-values

@stevenlis stevenlis added the documentation Improvements or additions to documentation label Feb 14, 2024
@ritchie46
Member

What is counterintuitive? We don't have a notion of null rows, so we cannot add an ignore_nulls argument. The two behave differently, and count will not be exposed as a general catch-all.

That's why we only expose count on Expr, while pl.len() will be exposed as the catch-all. They are different for a reason.

@mcrumiller
Contributor

mcrumiller commented Feb 14, 2024

Everything conforms to how the documentation specifies.

import polars as pl
from polars import col

df = pl.DataFrame({"a": [1, 2, None]})

df.select(pl.count())              # returns 3 (counts all rows)
df.select(pl.count("a"))           # returns 2 (counts non-nulls only)
df.select(col("a").count())        # returns 2 (counts non-nulls only)
df.group_by(pl.lit(True)).count()  # returns 3 (counts all rows)

The reason we cannot ignore nulls in group_by().count() is that there is no concept of a null row. Is [0, 1, null] a null row? Is [null] a null row? (It's not the same as an empty row, which would be [].) A group_by operation returns a frame, not a single column, so we cannot say how many non-null rows there are: the term "null row" has no definition. Does this make sense?

I agree that it is confusing. It is self-consistent, but it is nonetheless confusing.
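The "null row" ambiguity can be shown without Polars at all. The following is only a plain-Python sketch, using a list of tuples as a stand-in for a frame and None as a stand-in for null; the two candidate definitions of a "non-null row" are illustrative assumptions, not anything Polars implements.

```python
# A tiny stand-in for a frame: a list of rows, with None playing the role of null.
rows = [
    (0, 1, None),
    (None, None, None),
    (2, 3, 4),
]

# Counting rows needs no notion of null: every row counts.
row_count = len(rows)

# Counting non-null values is well defined per column.
non_null_per_column = [
    sum(value is not None for value in column) for column in zip(*rows)
]

# "Non-null rows" depends on which definition one picks.
rows_without_any_null = sum(all(v is not None for v in row) for row in rows)
rows_not_entirely_null = sum(any(v is not None for v in row) for row in rows)

print(row_count)               # 3
print(non_null_per_column)     # [2, 2, 1]
print(rows_without_any_null)   # 1
print(rows_not_entirely_null)  # 2
```

The two row-level definitions disagree (1 versus 2), which is why a single ignore_nulls switch on a row count would have no obvious meaning, whereas the per-column count is unambiguous.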

@stevenlis
Author

stevenlis commented Feb 15, 2024

@ritchie46 @mcrumiller Thanks for the explanations. It seems like I missed the following statement in the doc.

This way of using the function is deprecated. Please use :func:`len` instead.
