
Enhancement: Subset selection similar to R dplyr filter #26809

Closed
rasmuse opened this issue Jun 12, 2019 · 5 comments

Comments

@rasmuse

rasmuse commented Jun 12, 2019

Edited: Fixed typo

This is a feature request. I start by showing how I usually work and then suggest a new method, NDFrame.subset, including an implementation sketch.

I have also written a longer explanation and some examples here (personal blog) and here (Jupyter notebook).

Happy to discuss, and if there is interest in this I could probably provide a PR.

Code Sample

To filter row subsets of DataFrames or Series based on their values, I usually write something like

data_subset = (
    data
    .pipe(lambda d: d[d['some_column'] > 0])
    .pipe(lambda d: d[complicated_predicate(d)])
    # etc, chaining operations as necessary
)

Problem description

This works perfectly well, but the syntax seems unnecessarily complicated given how often I do this operation. For comparison, using filter() in R's dplyr package you would write

data.subset <- data %>%
    filter(some.column > 0) %>%
    filter(complicated.predicate(.))
    # etc, chaining operations as necessary

To my eyes (although I don't use R much), the R code makes the intention more visible because

  1. the R code has fewer brackets and other punctuation, and
  2. the verb filter is much more specific than pipe.

Suggestion

I would suggest adding a method subset to NDFrame. A minimal implementation could be something like this:

class NDFrame:

    ...

    def subset(self, predicate, *args, **kwargs):
        # Keep only the rows for which the predicate returns True
        return self[predicate(self, *args, **kwargs)]

This could be used as follows:

data_subset = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    # etc, chaining operations as necessary
)
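
For anyone who wants to play with the idea without modifying pandas itself, here is a self-contained sketch; the monkey-patching onto DataFrame and the toy data are only for illustration, not part of the proposal:

import pandas as pd

def subset(self, predicate, *args, **kwargs):
    # Keep only the rows for which the predicate returns True
    return self[predicate(self, *args, **kwargs)]

# Attach to DataFrame just to try out the proposed API
pd.DataFrame.subset = subset

data = pd.DataFrame({'some_column': [-1, 0, 2, 7]})

data_subset = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(lambda d: d['some_column'] < 5)
)
print(data_subset)  # only the row with some_column == 2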
@WillAyd
Member

WillAyd commented Jun 12, 2019

You are aware that you can rewrite your original example like this, right?

data[(data['some_column'] > 0) & (complicated_predicate(data))]

There's also Numexpr for some ops.

Unless I'm missing something, it's not clear the suggested syntax adds anything.
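
For reference, the boolean-mask version and a query() equivalent might look like this on a made-up frame (as far as I know, pandas can evaluate the query() expression with numexpr when that package is installed; complicated_predicate here is just a stand-in):

import pandas as pd

data = pd.DataFrame({'some_column': [-1, 2, 3], 'other': [10, 20, 30]})

def complicated_predicate(d):
    # Stand-in for a more involved condition returning a boolean Series
    return d['other'] < 30

# Boolean-mask version from above
mask_version = data[(data['some_column'] > 0) & complicated_predicate(data)]

# query() version of the simple comparisons
query_version = data.query('some_column > 0 and other < 30')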

@rasmuse
Author

rasmuse commented Jun 12, 2019

Thank you, yes, I am aware.

Maybe I should have been clearer that the suggestion is only about readability and ease of editing. I often have to perform a long list of related operations (maybe 10-15 in a row) on datasets, and in those cases I find it most readable to express them as a chain of calls, with each line representing one thought. Unless complicated_predicate is very tightly coupled to the comparison data['some_column'] > 0, I prefer to keep the two on separate lines, because that makes it easier to

  • understand the code as a chain of operations,
  • remove or comment out one line, and
  • change the order of the operations.

The other way to preserve these benefits is to write something like

d = data
d = d[d['some_column'] > 0]
d = d[complicated_predicate(d)]
d = d.unstack()
d = d.groupby('some_level').mean()
# etc, maybe 5-10 more lines of stacking, unstacking, selecting, grouping, ...

result = d  # finally assign a more meaningful name

But in that case I find it much more readable, and faster, to write

result = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    .unstack()
    .groupby('some_level').mean()
    # etc
)

I find that the latter example is much more readable because it's easier to scan the code and say to myself "uh-huh, subset, subset, unstack, group, mean, ...". In the end this helps me focus on the problem domain.
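
As a concrete (made-up) illustration of such a chain written with the existing .pipe pattern, where the index level names and data are invented:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['a', 'b'], ['x', 'y'], [2017, 2018]],
    names=['some_level', 'other_level', 'year'],
)
data = pd.Series(np.arange(8) - 3, index=idx)

result = (
    data
    .pipe(lambda d: d[d > 0])          # the "subset" step, as written today
    .unstack('year')                   # years become columns
    .groupby('some_level').mean()      # mean within each some_level group
)
print(result)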

@jreback
Contributor

jreback commented Jun 12, 2019

@rasmuse you can already do this in a nice chained way, as .loc accepts a callable:

data_subset = (
    data
    .loc[lambda d: d[d['some_column'] > 0]]
    .loc[lambda d: d[complicated_predicate(d)]]
    # etc, chaining operations as necessary
)

here are a couple of nice articles:
https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69
https://tomaugspurger.github.io/method-chaining.html

and in the pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-callable

@jreback jreback closed this as completed Jun 12, 2019
@rasmuse
Author

rasmuse commented Jun 12, 2019

@jreback Wow, thank you for that! Embarrassed that I had not seen it in the docs yet. I had already seen the article at towardsdatascience.com, and it very much resembles my way of working, but it did not mention .loc accepting a callable.

And just for completeness, I guess you meant to write

data_subset = (
    data
    .loc[lambda d: d['some_column'] > 0]
    .loc[complicated_predicate]
    # etc, chaining operations as necessary
)
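
To make that concrete, here is a small runnable check of the corrected pattern; the data and the stand-in complicated_predicate are made up:

import pandas as pd

def complicated_predicate(d):
    # Stand-in for a real predicate: any callable returning a boolean Series
    return d['some_column'] % 2 == 0

data = pd.DataFrame({'some_column': [-2, -1, 1, 2, 4]})

data_subset = (
    data
    .loc[lambda d: d['some_column'] > 0]
    .loc[complicated_predicate]
)
print(data_subset)  # keeps the positive, even values: 2 and 4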

@TomAugspurger
Contributor

TomAugspurger commented Jun 12, 2019 via email
