Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: The Extract function should filter rows before selecting columns #547

Merged
merged 4 commits into from
Nov 28, 2023

Conversation

eitsupi
Copy link
Collaborator

@eitsupi eitsupi commented Nov 27, 2023

For base data.frame, it is possible to filter on columns that are not included in the result, such as mtcars[mtcars$cyl >= 8, c("disp", "mpg")], but until now this was not possible with polars DataFrame because column selection was performed first.

This PR fixes this behavior by filtering first and also allows filtering by Expr for LazyFrame.

@eitsupi eitsupi marked this pull request as draft November 27, 2023 13:05
@eitsupi eitsupi marked this pull request as ready for review November 27, 2023 13:21
@etiennebacher
Copy link
Collaborator

I wonder how much of the base R syntax we want to support in polars. Shouldn't we just promote the existing syntax? It is convenient to extract a variable by name or position, but if we allow filtering with [ too then people can expect more support for base R syntax that would simply increase the maintenance burden for not so much benefit

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 28, 2023

I wonder how much of the base R syntax we want to support in polars. Shouldn't we just promote the existing syntax? It is convenient to extract a variable by name or position, but if we allow filtering with [ too then people can expect more support for base R syntax that would simply increase the maintenance burden for not so much benefit

I see your point.

But I basically think that defining S3 methods for base R functions should be done in this package. (The exception is functions that use NSE which should use rlang.)
This is because if we don't define the S3 method here, each downstream package may implement it.

@etiennebacher
Copy link
Collaborator

But I basically think that defining S3 methods for base R functions should be done in this package

I'm completely fine with defining the S3 methods here ([.DataFrame was already here before this PR). What I'd like to avoid is to get users thinking they can use all base R syntax on a DataFrame. If the users see they can filter with base R methods, they could also expect being able to create new variables with [<-.DataFrame (which doesn't exist here).

When I look at the diff, I see that it was already possible to filter rows with with [ so I don't really understand, does this PR add new behavior?

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 28, 2023

When I look at the diff, I see that it was already possible to filter rows with with [ so I don't really understand, does this PR add new behavior?

This PR basically just swaps the order of select and filter.
In the main branch, the select is done first, so the filter can only be used on the columns after the select is done.

Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright so even if we want to discuss more about the S3 methods, this can be done in an issue, this PR is not really related. Thanks!

@eitsupi eitsupi changed the title fix: Extract should filtering first fix: The Extract function should filter rows before selecting columns Nov 28, 2023
@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 28, 2023

Thanks for looking at this.

I have a complaint that method chaining with $ does not fit well with base pipe (e.g. foo |> _$bar() is not allowed), so I think it would be great if basic operations could be done with base R functions.
(At this point, I am not interested in supporting [<- column changes as they do not fit well with base pipe.)

Also, since the Polars API is subject to potentially breaking changes (e.g., the change from with_column to with_columns a while ago), it would make sense to have a stable base R API, I think.

@eitsupi eitsupi merged commit 2fbf67d into main Nov 28, 2023
17 checks passed
@eitsupi eitsupi deleted the fix-extract branch November 28, 2023 07:39
@etiennebacher
Copy link
Collaborator

What would the advantage of foo |> _$bar() over foo$bar()? In my understanding $ replaces the purpose of |> here

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 28, 2023

What would the advantage of foo |> _$bar() over foo$bar()? In my understanding $ replaces the purpose of |> here

This is the simplest example. In practice, it would look something like this:

foo |>
  bar() |>
  pl$LazyFrame() # <- We can't pipe after this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants