
Enhancement: Subset selection similar to R dplyr filter #26809

Closed
rasmuse opened this issue Jun 12, 2019 · 5 comments

Comments

@rasmuse

rasmuse commented Jun 12, 2019

Edited: Fixed typo

This is a feature request. I start by showing how I usually work and then suggest a new method, NDFrame.subset, including an implementation sketch.

I have also written a longer explanation and some examples here (personal blog) and here (Jupyter notebook).

Happy to discuss, and if there is interest in this I could probably provide a PR.

Code Sample

To filter row subsets of DataFrames or Series based on their values, I usually write something like

data_subset = (
    data
    .pipe(lambda d: d[d['some_column'] > 0])
    .pipe(lambda d: d[complicated_predicate(d)])
    # etc, chaining operations as necessary
)

Problem description

This works perfectly well, but the syntax seems unnecessarily complicated given how often I do this operation. For comparison, using filter() in R's dplyr package you would write

data.subset <- data %>%
    filter(some.column > 0) %>%
    filter(complicated.predicate(.))
    # etc, chaining operations as necessary

To my eyes (although I don't use R much), the R code makes the intention more visible because

  1. the R code has fewer brackets and other punctuation, and
  2. the verb filter is much more specific than pipe.

Suggestion

I would suggest adding a method subset to NDFrame. A minimal implementation could be something like this:

class NDFrame:

    ...

    def subset(self, predicate, *args, **kwargs):
        # Keep only the rows for which the predicate returns True
        return self[predicate(self, *args, **kwargs)]

This could be used as follows:

data_subset = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    # etc, chaining operations as necessary
)
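
For anyone who wants to play with the idea without modifying pandas itself, here is a self-contained sketch; the monkey-patching onto DataFrame and the toy data are only for illustration, not part of the proposal:

import pandas as pd

def subset(self, predicate, *args, **kwargs):
    # Keep only the rows for which the predicate returns True
    return self[predicate(self, *args, **kwargs)]

# Attach to DataFrame just to try out the proposed API
pd.DataFrame.subset = subset

data = pd.DataFrame({'some_column': [-1, 0, 2, 7]})

data_subset = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(lambda d: d['some_column'] < 5)
)
print(data_subset)  # only the row with some_column == 2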
@WillAyd
Member

WillAyd commented Jun 12, 2019

You are aware that you can rewrite your original example like this, right?

data[(data['some_column'] > 0) & (complicated_predicate(data))]

There's also Numexpr for some ops.

Unless I'm missing something, it's not clear the suggested syntax adds anything.
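
For reference, the boolean-mask version and a query() equivalent might look like this on a made-up frame (as far as I know, pandas can evaluate the query() expression with numexpr when that package is installed; complicated_predicate here is just a stand-in):

import pandas as pd

data = pd.DataFrame({'some_column': [-1, 2, 3], 'other': [10, 20, 30]})

def complicated_predicate(d):
    # Stand-in for a more involved condition returning a boolean Series
    return d['other'] < 30

# Boolean-mask version from above
mask_version = data[(data['some_column'] > 0) & complicated_predicate(data)]

# query() version of the simple comparisons
query_version = data.query('some_column > 0 and other < 30')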

@rasmuse
Author

rasmuse commented Jun 12, 2019

Thank you, yes, I am aware.

Maybe I should have been clearer that the suggestion is only about readability and ease of editing. I often have to perform a long list of related operations (maybe 10-15 in a row) on datasets, and in those cases I find it most readable to express them as a chain of calls, with each line representing one thought. Unless complicated_predicate is very tightly coupled to the comparison data['some_column'] > 0, I prefer to keep the two on separate lines, because that makes it easier to

  • understand the code as a chain of operations,
  • remove or comment out one line, and
  • change the order of the operations.

The other way to preserve these benefits is to write something like

d = data
d = d[d['some_column'] > 0]
d = d[complicated_predicate(d)]
d = d.unstack()
d = d.groupby('some_level').mean()
# etc, maybe 5-10 more lines of stacking, unstacking, selecting, grouping, ...

result = d  # finally assign a more meaningful name

But in that case I find it much more readable, and faster, to write

result = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    .unstack()
    .groupby('some_level').mean()
    # etc
)

I find that the latter example is much more readable because it's easier to scan the code and say to myself "uh-huh, subset, subset, unstack, group, mean, ...". In the end this helps me focus on the problem domain.
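
As a concrete (made-up) illustration of such a chain written with the existing .pipe pattern, where the index level names and data are invented:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['a', 'b'], ['x', 'y'], [2017, 2018]],
    names=['some_level', 'other_level', 'year'],
)
data = pd.Series(np.arange(8) - 3, index=idx)

result = (
    data
    .pipe(lambda d: d[d > 0])          # the "subset" step, as written today
    .unstack('year')                   # years become columns
    .groupby('some_level').mean()      # mean within each some_level group
)
print(result)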

@jreback
Contributor

jreback commented Jun 12, 2019

@rasmuse you can already do this in a nice chained way, as .loc accepts a callable:

data_subset = (
    data
    .loc[lambda d: d[d['some_column'] > 0]]
    .loc[lambda d: d[complicated_predicate(d)]]
    # etc, chaining operations as necessary
)

here are a couple of nice articles:
https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69
https://tomaugspurger.github.io/method-chaining.html

and in the pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-callable

@jreback jreback closed this as completed Jun 12, 2019
@rasmuse
Author

rasmuse commented Jun 12, 2019

@jreback Wow, thank you for that! Embarrassed that I had not seen it in the docs yet. I had already seen the article at towardsdatascience.com, and it very much resembles my way of working, but it did not mention .loc accepting a callable.

And just for completeness, I guess you meant to write

data_subset = (
    data
    .loc[lambda d: d['some_column'] > 0]
    .loc[complicated_predicate]
    # etc, chaining operations as necessary
)
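
To make that concrete, here is a small runnable check of the corrected pattern; the data and the stand-in complicated_predicate are made up:

import pandas as pd

def complicated_predicate(d):
    # Stand-in for a real predicate: any callable returning a boolean Series
    return d['some_column'] % 2 == 0

data = pd.DataFrame({'some_column': [-2, -1, 1, 2, 4]})

data_subset = (
    data
    .loc[lambda d: d['some_column'] > 0]
    .loc[complicated_predicate]
)
print(data_subset)  # keeps the positive, even values: 2 and 4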

@TomAugspurger
Contributor

TomAugspurger commented Jun 12, 2019 via email
