Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize references to "null" in methods #17272

Closed
Mateuscvieira opened this issue Jun 28, 2024 · 9 comments
Closed

Standardize references to "null" in methods #17272

Mateuscvieira opened this issue Jun 28, 2024 · 9 comments
Labels
A-api Area: changes to the public API accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@Mateuscvieira
Copy link

Description

This is kind of nitpicking, but I can't tell you how many times I have had to rerun cells because of this. In the current version of Polars, the method to drop null values is called drop_nulls, while the method to fill null values is called fill_null (no s).
Can we please rename this? Either both plural or both singular.

@Mateuscvieira Mateuscvieira added the enhancement New feature or an improvement of an existing feature label Jun 28, 2024
@stinodego stinodego added A-api Area: changes to the public API needs decision Awaiting decision by a maintainer labels Jun 28, 2024
@stinodego stinodego added this to the 1.0.0 milestone Jun 28, 2024
@stinodego stinodego added the accepted Ready for implementation label Jun 28, 2024
@stinodego
Copy link
Member

We will rename fill_null to fill_nulls. This aligns with has_nulls / drop_nulls.

Also for fill_nan -> fill_nans.

@JulianCologne
Copy link
Contributor

JulianCologne commented Jun 29, 2024

We will rename fill_null to fill_nulls. This aligns with has_nulls / drop_nulls.

Also for fill_nan -> fill_nans.

different opinion here! imo we should drop the "s" for all cases

Reasons

  • fill_nan / fill_null are ROW level operation. For each row you fill a single nan / null. This is also in line with column identifiers which should be singular instead of plural ("name" instead of "names", "age" instead of "ages", "value" instead of "values", ...)
  • has_nulls is strictly speaking semantically incorrect. nulls as a plural implies that it checks for multiple null values in a column which is not the case! It checks for a single null and returns if the column contains a single null (potentially more) -> should be has_null
  • drop_nulls: strong opinion for a rename to drop_null_rows which would make it much more descriptive! This was probably "copied" from pandas fillna which is much different because it can also remove columns. Otherwise would also lean towards drop_null

@stinodego
Copy link
Member

stinodego commented Jun 29, 2024

  • fill_nan / fill_null are ROW level operation. For each row you fill a single nan / null. This is also in line with column identifiers which should be singular instead of plural ("name" instead of "names", "age" instead of "ages", "value" instead of "values", ...)

Not sure. fill_nulls says "fill all the nulls in this column with value X". What makes you think this is a row-level operation?

  • has_nulls is strictly speaking semantically incorrect. nulls as a plural implies that it checks for multiple null values in a column which is not the case! It checks for a single null and returns if the column contains a single null (potentially more) -> should be has_null

I have considered this, but you conveniently gloss over the fact that has_null isn't quite correct either because it also returns true with multiple nulls. Neither has_null or has_nulls is completely correct, but contains_at_least_one_null is too long, so we have to choose.

  • drop_nulls: strong opinion for a rename to drop_null_rows which would make it much more descriptive! This was probably "copied" from pandas fillna which is much different because it can also remove columns. Otherwise would also lean towards drop_null

An expression doesn't have any rows, so drop_null_rows makes no sense.

For what it's worth, personally I feel like fill_null, drop_null, has_null feels better, but I cannot really make a good argument for it.

Anyway, I'll sleep on this one before merging it, but I'm not convinced at all by your arguments here.

@gab23r
Copy link
Contributor

gab23r commented Jun 29, 2024

I personally prefer the singular form as well. These functions drop/fill or check the existence of any null value so it makes sense to remove the s.

Moreover, I think the fill_null is much more used than drop_nulls and has_nulls, so it will break less code.

@JulianCologne
Copy link
Contributor

JulianCologne commented Jun 29, 2024

I have considered this, but you conveniently gloss over the fact that has_null isn't quite correct either because it also returns true with multiple nulls. Neither has_null or has_nulls is completely correct, but contains_at_least_one_null is too long, so we have to choose.

has_null is completely correct imo. It answers the question if the column has/contains null. The quantity is irrelevat here 🤓.

Thinking about this a bit more the best solution imo would be actually be to completely remove has_null(s) and introduce Expr.contains (contains(None) with fast-path) as a super-function. Everyone is familiar with this concept. (Also contains isn't called contains_at_least_one because that is implied)

Thoughts? 💭
I think that would improve the api (there is already 5x contains but not yet on Expr) 🤓😎

@lyngc
Copy link

lyngc commented Jun 29, 2024

Also prefer singular

@stinodego
Copy link
Member

Moreover, I think the fill_null is much more used than drop_nulls and has_nulls, so it will break less code.

Not sure about fill_null vs drop_nulls frequency, but has_nulls was added very recently so renaming it will be very low impact.

@ritchie46
Copy link
Member

After some thought I agree with @JulianCologne. I was thinking about this yesterday and is_null is an elementwise question. The same can be said for fill_null.

I think we should ask if it is a single row/elementwise question and if so go for singular.

drop_nulls isn't an elementwise operation, so it is fine to be plural here. As the longer version would be drop_null_rows.

I want to put this on a hold as I don't think this merits a change at all.

@stinodego
Copy link
Member

I want to put this on a hold as I don't think this merits a change at all.

Agree. Status quo is fine, I think. I'll close this for now.

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jun 30, 2024
@stinodego stinodego removed their assignment Jun 30, 2024
@stinodego stinodego removed this from the 1.0.0 milestone Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants