DEPR: filter & select #12401

Closed
1 of 2 tasks
jreback opened this issue Feb 20, 2016 · 15 comments
Labels
API Design Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@jreback
Contributor

jreback commented Feb 20, 2016

do we need label selectors? we should for sure just have a single method for this. maybe call it query_labels? to be consistent with .query as the workhorse for data selection.

xref #6599

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Feb 20, 2016
@jreback jreback added this to the 0.19.0 milestone Feb 20, 2016
@jorisvandenbossche
Member

I personally find filter a useful function (at least I have used it to good purpose in my own work) to select certain columns. See also the examples added in #12399. Although it should rather be called select ...

Less sure about select. That seems less useful, certainly now that loc accepts a function.

@jreback
Contributor Author

jreback commented Feb 14, 2017

I think I have revised my thoughts here.

we should promote (in the doc / the-one-way-to-do-it) .select as the main label filtering function, and deprecate .filter (which ATM serves the same purpose). Maybe needs some API tweaks.

.filter is traditionally a data selection / filtering function.

@jorisvandenbossche
Member

They are quite different at the moment:

  • filter:
    • acts on columns by default (for dataframe)
    • can select based on list, or simple 'like'/more advanced regex
  • select:
    • acts on index by default
    • selects based on function applied to index labels
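
To make the contrast concrete, here is a minimal sketch. The frame and its labels are made up for illustration. Because `DataFrame.select` is not available in current pandas releases, its function-on-index-labels behaviour is emulated with a boolean mask passed to `.loc`; treat that as an assumed equivalent, not the original implementation.

```python
import pandas as pd

# Illustrative data only; none of these labels come from the thread.
df = pd.DataFrame(
    {"temp_min": [1, 2, 3], "temp_max": [4, 5, 6], "wind": [7, 8, 9]},
    index=["site_a", "site_b", "other"],
)

# filter: works on column labels by default.
by_items = df.filter(items=["wind"])    # explicit list of labels
by_like = df.filter(like="temp")        # substring match
by_regex = df.filter(regex=r"_max$")    # regular expression

# select-style: a function applied to each index (row) label.
def crit(label):
    return label.startswith("site")

by_func = df.loc[[crit(label) for label in df.index]]
```

The first three calls pick columns; the last one picks rows, which is the default-axis mismatch described in the list above.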

@jreback
Contributor Author

jreback commented Feb 14, 2017

further: .filter uses .select for regex matching in its implementation.

@jreback
Contributor Author

jreback commented Feb 14, 2017

further: we use .filter() in .groupby() to allow a filter for group inclusion (boolean return)
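
For readers unfamiliar with that usage, a small sketch of `GroupBy.filter` follows. The data is invented for illustration. The callable returns one boolean per group, which is a different contract from `DataFrame.filter` matching on labels.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 10, 20]})

# The callable receives each group's sub-frame and returns a single bool;
# all rows of a group are kept or dropped together.
kept = df.groupby("key").filter(lambda g: g["val"].sum() > 5)
```

Here group `a` sums to 3 and is dropped, while group `b` sums to 30 and is kept.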

@shoyer
Member

shoyer commented Feb 15, 2017

I have found DataFrame.filter to be useful, especially with like or regex. I have never used DataFrame.select, which feels very non-idiomatic to me.

So I would be happy to deprecate select. It's also highly confusing how GroupBy.filter works like DataFrame.select, not .filter.

@jreback jreback modified the milestones: 0.20.0, 0.21.0 Mar 29, 2017
@dkasak

dkasak commented Jun 29, 2017

> It's also highly confusing how GroupBy.filter works like DataFrame.select, not .filter.

I agree this is highly confusing. Is renaming one of those out of the question? filter is a common name for a higher-order function which filters elements based on the result of a Boolean-valued function that was passed in, exactly like GroupBy.filter, so that seems like an appropriate name for what is currently DataFrame.select. There's also Python's builtin filter function.

Another option might be merging the functionality of select and filter under one name, so it supports both list-like and function arguments.

@jreback
Contributor Author

jreback commented Jul 15, 2017

so the problem as highlighted by @jorisvandenbossche is that .select acts on the index (which is what groupby.filter and boolean selection do). so it is a highly confusing name.

.filter is also a confusing name as it acts on the labels of columns.

We need a combined functionality of the current DataFrame.select/filter (IOW to select labels from an axis; it should accept a list-like, scalar and callable, like most other functions)

signature should be something like this (default for most functions is axis=0)

def select_labels(arraylike or scalar or callable, axis=0, regex=False)

now as to what to do:

  • select_labels I think is a nice name (open to suggestions), though other systems (spark & sql), use .select to mean label/column selection.
  • deprecate .select in favor of .select_labels
  • deprecate .filter in favor of select_labels
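
As a rough illustration of what such a combined method might do, here is a sketch written as a free function. `select_labels` is hypothetical and is not part of pandas; the dispatch order (callable, then regex, then list-like, then scalar) is an assumption made for this example, not something settled in the discussion.

```python
import re

import pandas as pd


def select_labels(obj, selector, axis=0, regex=False):
    """Hypothetical helper, not a pandas API: select labels along an axis."""
    labels = obj.index if axis == 0 else obj.columns
    if callable(selector):
        # Keep labels for which the function returns a truthy value.
        mask = [bool(selector(label)) for label in labels]
    elif regex:
        # Treat the selector as a regular expression searched in each label.
        pattern = re.compile(selector)
        mask = [pattern.search(str(label)) is not None for label in labels]
    elif isinstance(selector, (list, tuple, set, pd.Index)):
        # List-like: keep labels that appear in the collection.
        wanted = set(selector)
        mask = [label in wanted for label in labels]
    else:
        # Scalar: keep labels equal to it.
        mask = [label == selector for label in labels]
    return obj.loc[mask] if axis == 0 else obj.loc[:, mask]


# Illustrative data only.
df = pd.DataFrame({"ab": [1, 2], "cd": [3, 4]}, index=["x", "y"])
row_x = select_labels(df, "x")
col_cd = select_labels(df, ["cd"], axis=1)
col_a = select_labels(df, lambda c: c.startswith("a"), axis=1)
col_c = select_labels(df, "^c", axis=1, regex=True)
```

Building a boolean mask in every branch keeps the original label order, whatever the kind of selector.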

@dkasak interested in taking this on?

@shoyer
Member

shoyer commented Jul 15, 2017

I would suggest simply deprecating/removing select without making a replacement. Indexing is a fine alternative.

DataFrame.filter() is useful. I wish it were called select instead, both because that matches SQL and filter suggests filtering rows with a boolean expression (like filter in dplyr or Ibis), but I don't think changing the name is worth the hassle.

In general, I think we should avoid making small changes in the API for the basic grammar of data manipulation in pandas, unless we rethink things more broadly for a larger, breaking change (e.g., in pandas2).

@jreback
Contributor Author

jreback commented Jul 15, 2017

> but I don't think changing the name is worth the hassle

sure it is - pandas is going to exist for 1.x for quite some time

better to make changes to the right spelling sooner rather than later

I am all for deprecating filter and calling it select (or select_labels)

@dkasak

dkasak commented Jul 16, 2017

I don't have time to handle this at the moment, but I may be interested in doing it when time permits if it hasn't been done already by then.

FWIW, upon some thought, I still think changing the name of .filter to .select* would be best. I don't feel strongly about .select vs .select_labels. I generally prefer shorter names, but the added verbosity here might make things clearer. Calling it .select has the benefit that only one name is deprecated, not two.

I'm not so sure about dropping the current behaviour of .select entirely because I have a use case which I'm not sure how to implement without it (and without resorting to things like .reset_index() to regain the ability to select by using a function).

In particular, I have a MultiIndex with 2 levels, each of which has elements of type str. In other words, each index value is conceptually a pair of strings. Currently I'm doing something like

df.select(lambda x: condition1(x[0]) and condition2(x[1]))

and similar to select particular rows. How could this be implemented without current .select functionality?
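
One possible spelling without `.select`, offered only as a sketch: build a boolean mask from the index and pass it to `.loc`. The data and the bodies of `condition1` and `condition2` below are hypothetical stand-ins, since the comment does not define them.

```python
import pandas as pd

# Illustrative two-level index of strings.
index = pd.MultiIndex.from_tuples(
    [("apple", "red"), ("apple", "green"), ("banana", "yellow")],
    names=["fruit", "colour"],
)
df = pd.DataFrame({"n": [1, 2, 3]}, index=index)


# Hypothetical stand-ins for condition1 / condition2 from the comment.
def condition1(first):
    return first.startswith("a")


def condition2(second):
    return second != "green"


# Iterating a MultiIndex yields one tuple per row, so x[0] and x[1]
# are the two level values, as in the original lambda.
mask = [condition1(x[0]) and condition2(x[1]) for x in df.index]
selected = df.loc[mask]
```

This avoids `.reset_index()`, at the cost of spelling out the iteration over the index by hand.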

@jreback
Contributor Author

jreback commented Jul 16, 2017

can you show a complete example of how you are using select?

@TomAugspurger TomAugspurger mentioned this issue Sep 11, 2017
6 tasks
jreback added a commit to jreback/pandas that referenced this issue Sep 22, 2017
jreback added a commit to jreback/pandas that referenced this issue Sep 29, 2017
jreback added a commit to jreback/pandas that referenced this issue Sep 29, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 1, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 1, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 2, 2017
@jreback jreback modified the milestones: 0.21.0, 1.0 Oct 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017
jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
@jorisvandenbossche
Member

jorisvandenbossche commented Dec 5, 2017

On the pandas-dev mailing list, concern was raised about the deprecation of select, see https://mail.python.org/pipermail/pandas-dev/2017-November/000649.html

I think the example makes a point. For me the alternative like .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] is harder to read and to teach than .select(complex_fxn_that_selects_a_few_cols), which makes the deprecation of select a step backwards for those cases.
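
A runnable version of the `.loc` spelling from this comment, as a sketch: the frame and the selector `price_columns` are hypothetical, and the selector is assumed to return a boolean mask over the column labels.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"price_usd": [1.0], "price_eur": [0.9], "qty": [3]})


# Hypothetical selector: returns a boolean mask over the column labels.
def price_columns(columns):
    return columns.str.startswith("price")


# The callable passed to .loc receives the frame itself, so the labels
# have to be pulled out with .columns before the selector can run.
prices = df.loc[:, lambda frame: price_columns(frame.columns)]
```

The extra lambda that unwraps `frame.columns` is the readability cost the comment refers to.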

@jondo

jondo commented Jan 16, 2018

The deprecation message currently only suggests a replacement for the case axis=0.

I suggest expanding this to:

use df.loc[df.index.map(crit)] to select labels, df.loc(axis=1)[df.columns.map(crit)] to select columns.
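
Both suggested replacements can be tried on a toy frame. This is a sketch with invented labels and criteria; `crit` in the message above corresponds to the two lambdas here.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame(
    {"alpha": [1, 2, 3], "beta": [4, 5, 6]},
    index=["r1", "r2", "x3"],
)

# axis=0: map the criterion over the row labels to get a boolean mask.
rows = df.loc[df.index.map(lambda label: label.startswith("r"))]

# axis=1: same idea on the column labels, via .loc(axis=1).
cols = df.loc(axis=1)[df.columns.map(lambda label: label.startswith("b"))]
```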

@smcinerney

I only just found out about this change and the doc still doesn't give guidance. For actual selection by column value, people also use numpy operators np.select(condlist, choicelist, ...) (for multiple values) and np.where(cond, valTrue, valFalse) for two values. Is that good/bad/another alternative? Witness the confusion on SO. I think the root of the issue is that pandas' select verb disagreed with what numpy and SQL select do, and hence created confusion.
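
For reference, a short sketch of what the two numpy functions do, using invented data. Both compute a value per element from conditions on the data; neither selects rows or columns by label, so they address a different problem from the deprecated `select`.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"score": [35, 60, 90]})

# np.where: elementwise choice between two values.
passed = np.where(df["score"] >= 50, "pass", "fail")

# np.select: the first matching condition wins, with a default otherwise.
grade = np.select(
    [df["score"] >= 80, df["score"] >= 50],
    ["high", "mid"],
    default="low",
)
```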

There's still a docbug needed on this, but first we need to know what you actually recommend.
