ENH: Harmonize drop and rename API #12392

Closed
nickeubank opened this issue Feb 19, 2016 · 44 comments
Labels
API Design Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@nickeubank
Contributor

rename accepts a columns argument or an index argument, while drop looks for a labels and axis pair. I don't know about anyone else, but I have to check the help file every time I come back to pandas to remember which takes which.

How would people feel about adding columns and index arguments to drop? They could just be added in addition to labels/axis if we want to provide backwards compatibility and just raise an exception if the user tries to mix them.
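For reference, the mismatch described above looks like this in practice. This is a minimal sketch using a made-up frame (the column names are illustrative only); both calls rely solely on the signatures quoted in this comment.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the two calling conventions.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# rename: the target axis is selected by keyword name (index= / columns=).
renamed = df.rename(columns={"a": "x"})

# drop: the target axis is selected by a labels / axis pair.
dropped = df.drop("b", axis=1)

print(list(renamed.columns))
print(list(dropped.columns))
```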

@jreback
Contributor

jreback commented Feb 19, 2016

actually this is a bigger API issue that @TomAugspurger and I briefly touched on here

.rename and .rename_axis
.reindex and .reindex_axis
are consistent with each other

.drop and .fillna are also consistent (just not with the others)

So, thoughts on how to proceed here? I'd rather not make ad hoc changes, but instead try to construct an overall consistent way of doing things; we can certainly provide back-compat, but unifying things is probably a good thing.

@jreback jreback added API Design Needs Discussion Requires discussion from core team before further action labels Feb 19, 2016
@jreback jreback changed the title ENH: Harmonize drop and rename API ENH: Harmonize drop and rename API Feb 19, 2016
@TomAugspurger
Contributor

I don't have a strong preference for one style over the other. The only upshot of the .rename(index=, columns=) approach is that you can do both at once instead of .rename_axis(index).rename_axis(columns, axis=1), very minor.

I would slightly favor just recommending and documenting the _axis methods (with labels, axis) rather than changing any method signatures.

@jreback
Contributor

jreback commented Feb 19, 2016

do you think we should add corresponding .drop_axis and .fillna_axis? or too much clutter

@nickeubank
Contributor Author

Personally, I have a preference for columns and index as arguments -- they've always felt more intuitive and pythonic to me. But that's second to the value of harmonization.

Just documenting the _axis methods still leaves an uncomfortable inconsistency though, no? That only offers a workaround; I'd be in favor of fixing .drop and .fillna.

I'm agnostic on adding .drop_axis/.fillna_axis methods.

If we change the .drop and .fillna methods to take columns, index, do we still want to support the labels, axis arguments for backwards compatibility or break the api?

@jreback
Contributor

jreback commented Feb 19, 2016

why don't you list all of the relevant methods (might be some more that I am forgetting), and make a proposal.

@nickeubank
Contributor Author

OK

@nickeubank
Contributor Author

drop and fillna:

  • change primary arguments from labels, axis to columns, index
  • Accept labels, axis arguments for backward compatibility, but move to back of argument list
    (note this will break code by people who passed labels as first positional argument, but ok since it will throw an exception
    if no positional arguments are allowed)

drop_axis and fillna_axis:

  • New method that accepts labels, axis

Others:

  • Could implement for apply if we really wanted? I'm disinclined, but possible.
  • Could implement for add(), sub(), mul(), div(), radd(), rsub(), etc...

Open question:

  • How should these work for panels? (I never use panels, so not sure of best practices)

@jreback
Contributor

jreback commented Feb 19, 2016

see that's the problem. In reality we should leave everything alone and maybe just change reindex/rename. The labels/axis idiom is much more common (and to be honest quite a bit more useful). Rarely do you actually change 2 things at once (which violates many pythonic principles). I would rather chain things like:

.reindex(...., axis='index').reindex(...., axis='columns')

though we are actually flexible enough to accept both paradigms.
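A small sketch of the chained form next to the one-call form, for comparison. It assumes a pandas version in which reindex accepts an axis keyword (that keyword was added by the commits that closed this issue, milestone 0.21.0); the frame and labels are made up.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r0", "r1"])

# labels / axis idiom: one axis per call, chained.
chained = df.reindex(["r1", "r0"], axis="index").reindex(["b", "a"], axis="columns")

# index / columns idiom: both axes in a single call.
both = df.reindex(index=["r1", "r0"], columns=["b", "a"])

print(chained.equals(both))
```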

@nickeubank
Contributor Author

Oh, I don't really care about the "two things at once" -- I just liked the "columns" argument for being more meaningful.

So your preference is:

reindex/rename:
- change primary arguments to labels / axis
- keep taking columns / index for backwards compatibility?

That's fine by me -- like I said, I'm mostly interested in harmonization!

@max-sixty
Contributor

The labels/axis idiom is much more common (and to be honest quite a bit more useful). Rarely do you actually change 2 things at once (which violates many pythonic principles).

+1

And, I know people have gone back & forth on this a bit - but I would also 'vote' for:

  • .rename being like xarray: renaming axes names only or, where the object has a name (currently Series), renaming the object
  • .relabel used for reindexing-like operations with a mapping from old to new labels

@shoyer
Member

shoyer commented Feb 19, 2016

The labels/axis idiom is much more common (and to be honest quite a bit more useful). Rarely do you actually change 2 things at once (which violates many pythonic principles).

I agree that changing 2 things at once is not a great API, but I agree with @nickeubank that explicit columns and index arguments make for more readable code: compare df.drop(columns='foo') vs df.drop('foo', axis='columns') (or worse, df.drop('foo', axis=1), which is assuredly more common because it's less typing).
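The three spellings being compared can be put side by side. This sketch assumes a pandas version where drop accepts columns= (added later in this thread's lifetime, via #17644); the frame is invented for illustration.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"foo": [1, 2], "bar": [3, 4]})

by_keyword = df.drop(columns="foo")             # axis named by the keyword itself
by_axis_name = df.drop("foo", axis="columns")   # labels / axis, axis spelled out
by_axis_number = df.drop("foo", axis=1)         # labels / axis, numeric axis

# All three produce the same result; only readability differs.
print(by_keyword.equals(by_axis_name) and by_keyword.equals(by_axis_number))
```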

@jorisvandenbossche
Member

I would like to avoid adding new methods such as drop_axis (which is actually not a good name IMO, as it sounds like you want to drop a full axis, while you want to drop certain items from an axis)

Further, I think we should make a clear distinction between methods that modify the axis (rename, drop, reindex), and methods that perform operation over a certain axis (apply, add, ..). Those last ones use the axis= idiom to specify the direction of operation, and that is indeed a common idiom. I think the discussion should only be about rename, reindex and drop

I personally also like the explicit column and index arguments in eg df.rename(columns=..) (this reads very natural). So I would not like to see these go (or deprecated).

It is not really good API design, but I think it is perfectly possible to combine both idioms in one method for all of the discussed functions as kind of a compromise?
For example, changes of the current signature could be:

  • df.reindex(index=None, columns=None, ...) -> df.reindex(labels=None, index=None, columns=None, axis=0, ...)
  • df.drop(labels, axis=0, ...) -> df.drop(labels=None, axis=0, index=None, columns=None, ...)

Which would be I think backwards compatible?
That would kind of harmonize the api for the different methods, but have the bad design of providing two ways to do something in one function.
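To make the compromise concrete, here is a rough sketch of how one function could accept both idioms and reject a mix of them. The helper name `resolve_targets` and its error messages are hypothetical, written only for this illustration; this is not pandas code.

```python
def resolve_targets(labels=None, axis=0, index=None, columns=None):
    """Hypothetical helper: normalize either idiom to {axis_name: labels}."""
    axis_names = {0: "index", 1: "columns", "index": "index", "columns": "columns"}

    if labels is not None:
        # labels / axis idiom: must not be combined with index= / columns=.
        if index is not None or columns is not None:
            raise ValueError("cannot mix 'labels' with 'index' or 'columns'")
        if axis not in axis_names:
            raise ValueError(f"invalid axis: {axis!r}")
        return {axis_names[axis]: labels}

    # index / columns idiom: either or both may be given.
    targets = {}
    if index is not None:
        targets["index"] = index
    if columns is not None:
        targets["columns"] = columns
    if not targets:
        raise ValueError("specify 'labels', 'index' or 'columns'")
    return targets


print(resolve_targets(["a"], axis=1))
print(resolve_targets(index=["r0"], columns=["a"]))
```

Both call styles end up in the same normalized form, so the rest of a method body would not need to know which idiom the caller used.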

@jorisvandenbossche
Member

And, I know people have gone back & forth on this a bit - but I would also 'vote' for:

  • .rename being like xarray: renaming axes names only or, where the object has a name (currently Series), renaming the object
  • .relabel used for reindexing-like operations with a mapping from old to new labels

@MaximilianR Maybe open a separate issue to discuss that? What kind of idiom to use in the signature maybe depends on this, but the question of adding such a method is a separate discussion I think.

@nickeubank
Contributor Author

I think that @jorisvandenbossche's suggestion works perfectly. The real brilliance is that it even works for someone who used positional arguments for rename (i.e. typed df.rename({0:-99}) instead of df.rename(index={0:-99}))!

@nickeubank
Contributor Author

I take that back – if somebody uses more than one positional argument (index and columns) the results will differ.

On further reflection, I think we only have two choices: break the API, or tack the new arguments on to the end of the argument list so anyone who uses positional arguments is OK.

@jorisvandenbossche
Member

I take that back – if somebody uses more than one positional argument (index and columns) the results will differ.

I think even that should be possible to detect and warn. If the user did originally df.reindex(index, columns), with the new signature df.reindex(labels=None, index=None, columns=None, axis=0, ...) those would map to labels and index, but as you shouldn't use both at the same time, we can detect this case and give an informative message.
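A sketch of that detection, assuming the proposed argument order (labels, index, columns, axis). The function name `reindex_args`, the warning category and the message text are all hypothetical choices for this illustration, not anything pandas does.

```python
import warnings


def reindex_args(labels=None, index=None, columns=None, axis=None):
    """Hypothetical argument handling for the proposed reindex signature."""
    if labels is not None and index is not None:
        if columns is not None or axis is not None:
            raise TypeError("cannot mix 'labels'/'axis' with 'index'/'columns'")
        # A legacy call reindex(index, columns) lands here: the old `index`
        # arrives as `labels` and the old `columns` arrives as `index`.
        warnings.warn(
            "interpreting two positional arguments as (index, columns); "
            "pass them by keyword instead",
            FutureWarning,
            stacklevel=2,
        )
        return {"index": labels, "columns": index}

    if labels is not None:
        # labels / axis idiom, defaulting to the index.
        name = "columns" if axis in (1, "columns") else "index"
        return {name: labels}

    # index / columns idiom passed by keyword.
    given = {"index": index, "columns": columns}
    return {name: value for name, value in given.items() if value is not None}
```

With this shape, `reindex_args(["r0"], ["a"])` keeps its old meaning and emits a warning rather than failing.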

@nickeubank
Contributor Author

@jorisvandenbossche My impression was that "backwards compatibility" / "not breaking the API" means that old code still runs fine -- an informative error beats a silent failure, but seems like that's still API-breaking.

An overview of where I think we stand:

1. Do nothing

2. Backwards Compatible

rename(index=None, columns=None, **kwargs) ->
rename(index=None, columns=None, labels=None, axis=0, **kwargs)
(where **kwargs now takes labels,axis)

drop(labels, axis=0, level=None, inplace=False, errors='raise')->
drop(labels, axis=0, level=None, inplace=False, errors='raise', index=None, columns=None)

Pros:
* Backwards compatible
* Can use both with same named arguments

Cons:
* Cannot use both with same positional argument patterns

3. Break-API - All options available

rename(index=None, columns=None, **kwargs) ->
rename(labels=None, axis=None, index=None, columns=None, **kwargs)

drop(labels, axis=0, level=None, inplace=False, errors='raise')->
drop(labels, axis=None, index=None, columns=None, level=None, inplace=False, errors='raise')

Pros:
* Backwards compatible for people who use named arguments
* Allows all forms of interaction

Cons:
* API Breaking

4. Break-API -- adopt labels,axis

rename(index=None, columns=None, **kwargs) ->
rename(labels=None, axis=0, **kwargs)

Pros:
* Conforms with syntax of other functions like apply
* Minimal number of functions broken

Cons:
* labels/axis less readable than index/columns

5. Break-API -- adopt columns/index

drop(labels, axis=0, level=None, inplace=False, errors='raise')->
drop(index=None, columns=None, level=None, inplace=False, errors='raise')
Pros:
* More readable new API
* Only breaks a few functions

Cons:
* Not consistent with use of [transformation]/axis argument structure in other places

My take:

I think we should shoot for either 2 (to ensure backwards compatibility) or 4. 2 because I think API breaking for these kinds of core functions is bad, and 4 because I'm increasingly won over by @jreback's argument -- while I prefer index/columns in general, I think that labels/axis is more consistent with the general pandas library, and I think minimal API breaking is desirable.

@jorisvandenbossche
Member

Nice overview!

@jorisvandenbossche My impression was that "backwards compatibility" / "not breaking the API" means that old code still runs fine -- an informative error beats a silent failure, but seems like that's still API-breaking.

@nickeubank An informative message does not necessarily need to be an error! It can also be a warning (or we can even decide to just pass it through correctly without warning, although I wouldn't do that). So I am still convinced this can be done in a backwards compatible way (and your options 2 and 3 can be combined).

2. Backwards Compatible
...
Cons:

  • Cannot use both with same positional argument patterns

I don't think this is really a con, as using it with only positional arguments is never a sane thing to do regarding clarity of your code :-)

Further, I think there is a 6th option: use separate methods for the two idioms (like reindex / reindex_axis)

So I think we have to choose between:

a) combine both idioms within the same methods and live with the bad API design (in a back compat or incompat way -> your options 2 and 3)
b) choose one of the idioms and deprecate the other (your options 4 and 5)
c) have separate methods for each idiom

I would personally be in favor of a)

@nickeubank
Contributor Author

@jorisvandenbossche good call about positional argument differences not being a big deal.

I think that makes option 2 (backwards compatible with both sets of arguments) my preference.

@jreback
Contributor

jreback commented Feb 22, 2016

@nickeubank can you survey all the methods and see which use each idiom? kind of like a value_counts, most important is prob number per class of idiom. (e.g. make several categories and measure how many methods of each type of idiom we have for both). Just to get an overview of the entire API.

@nickeubank
Contributor Author

@jreback Sure, but will need some time -- busy week!

@jreback
Contributor

jreback commented Feb 22, 2016

@nickeubank np. this issue would be for 0.19.0 in any event.

@jreback jreback added this to the 0.19.0 milestone Feb 22, 2016
@nickeubank
Contributor Author

A DataFrame has ~200 methods. Those that take columns as a modifier argument:

  • pivot
  • pivot_table
  • reindex
  • rename
  • sort (but now deprecated -- sort_values uses axis)

Also note that columns is a keyword for the following, but in a somewhat different context:

  • All to_[format] calls
  • from_items
  • from_records

axis is in too many to count, but the ones that seem to use as a modifier (as reindex uses columns) in alphabetical order:

  • add
  • align
  • all
  • any
  • apply
  • compound
  • corrwith
  • count
  • cummax, cummin, etc.
  • div, divide
  • diff
  • dropna
  • eq
  • fillna
  • floordiv
    ... (ok, gonna stop there. You get the idea. It's everywhere)

In light of that, I would vote for leaving drop and company as they are, and adding labels/axis named arguments to rename/reindex (and pivot?). My vote is to put at the end of the argument list for full backwards compatibility, but am open to suggestions.
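A survey like this can be roughly reproduced by introspection. The sketch below counts public DataFrame attributes whose signature has a parameter named axis, columns, index or labels. It is only an approximation of the hand count above: it does not distinguish "modifier" uses from other uses of the same keyword, and the numbers depend on the installed pandas version.

```python
import inspect
from collections import defaultdict

import pandas as pd

params_of_interest = ("axis", "columns", "index", "labels")
methods_by_param = defaultdict(list)

for name in dir(pd.DataFrame):
    if name.startswith("_"):
        continue  # public API only
    attr = getattr(pd.DataFrame, name, None)
    if not callable(attr):
        continue  # skip properties and plain attributes
    try:
        signature = inspect.signature(attr)
    except (TypeError, ValueError):
        continue  # no introspectable signature
    for param in params_of_interest:
        if param in signature.parameters:
            methods_by_param[param].append(name)

# One count per keyword, in the spirit of a value_counts over idioms.
for param in params_of_interest:
    print(param, len(methods_by_param[param]))
```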

@nickeubank
Contributor Author

Revisiting this, seems like we came to a consensus on two things then got stuck.

Consensus:

  • Current state is problematic and harmonization is desirable
  • The norm in pandas is clearly labels/axis, not columns/index. So we should probably move
    rename/reindex to labels/axis.

No Consensus:

Seems we have three options:

Option 1: Add labels/axis to end of the argument list, leave columns/index in place
Pros:

  • Fully backward compatible

Cons:

  • Doesn't quite achieve harmonization

Option 2: Put labels/axis at the front of the argument list, push back columns/index but still accept

Pros:

  • Backward compatible for named arguments
  • If users pass only one positional argument, also backwards compatible. In old framework, that would correspond to index argument; in new framework, would correspond to labels with a default axis of 0.
  • If users pass multiple positional arguments (index and columns in old framework), an exception would be raised, since no value that columns would accept is a valid axis argument, so the failure would not be silent.

Cons:

  • Will break old code that used both columns and index

Option 3: Replace columns/index with labels/axis
Pros:

  • Cleaner

Cons:

  • Not backward compatible

Personally, I like 1 or 2 (though my indifference between the two is partially motivated by the fact I always name my arguments so they're equivalent for me ;))

@toobaz
Member

toobaz commented Jul 17, 2017

we actually already have rename_axis and reindex_axis for exactly this (for the axis-keyword idiom). So we could add a new drop-like method with the named axes idiom
But, what name to use for this? As the current drop should actually be "drop_axis", and the existing drop should be changed.
Is it needed to have two functions for each operation?

I think having two methods doing the same thing is confusing (less so if the documentation of each just clarified the difference from the other, but still I don't think both are worth keeping).

@MaximilianR Maybe open a separate issue to discuss that?

Done: #16990 . Clearly this discussion on the signature also applies to that bug, assuming my proposal (of adding .relabel) is accepted. I'm personally slightly in favor of index=, just because it is more common in pandas methods (although I do realize the difference between working on values and on indices, it's still good if the two have a similar interface).

@jreback
Contributor

jreback commented Oct 2, 2017

@jorisvandenbossche any possibility of getting this in? obviously aside from #17644 which is merged

@TomAugspurger
Contributor

TomAugspurger commented Oct 5, 2017

What's left to do here? The same changes to reindex and rename as Joris made to drop?

If so, I can put together a PR this afternoon.

@jreback
Contributor

jreback commented Oct 5, 2017

yep i think so; that’s a bit more involved though

@TomAugspurger
Contributor

TomAugspurger commented Oct 5, 2017

Yes, I was just going to post that :) I may have found a (somewhat) hacky solution. Will have the start of a PR in a bit.

The difficulty is disambiguating

>>> df.rename(fn, axis=1)  # OK
>>> df.rename(index=fn, axis=1)  # TypeError

But I may have a way.
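The two cases can be checked directly. This sketch assumes a pandas version where rename already accepts axis (0.21.0 or later, going by this issue's milestone); the frame is made up, and `str.upper` stands in for `fn`.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A positional mapper plus axis: applied to the column labels.
upper_cols = df.rename(str.upper, axis=1)

# index= together with axis= is ambiguous, so it is rejected.
try:
    df.rename(index=str.upper, axis=1)
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

print(list(upper_cols.columns))
print(type(mixed_error).__name__)
```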

@TomAugspurger
Contributor

How much do we want to do the other side of this though? I'm writing the release notes for adding axis to rename, and it reads strangely coming right after the drop section adding index / columns.

I'm comfortable with recommending index=, columns= as the preferred way going forward. Adding axis to reindex and rename is (implicitly) recommending the other style.

@toobaz
Member

toobaz commented Oct 5, 2017

I'm comfortable with recommending index=, columns= as the preferred way going forward

I think that @nickeubank 's comment provides strong evidence in favor of axis=. Together with coherence with numpy, which won't harm, and with the use of dim= in xarray. And while apparently axis=1 is not considered very pythonic (not so obvious to me), and coherence with numpy is not top priority, being able to do axis="columns" looks to me sufficient to restore readability.

Keeping both approaches where index= and columns= are already present is the best solution, but I think the standard/recommended way should be axis=, which incidentally is also often simpler to implement.

@TomAugspurger
Contributor

Yes, re-reading that comment does make a good case for it.

OK then, I'll put up my WIP for rename, and finish it up later tonight.

@toobaz
Member

toobaz commented Oct 6, 2017

(By the way: something else good, and very pythonic, about axis= is that the reader knows by definition that a method he once saw used on e.g. index works exactly in the same way on columns, or vice-versa)

@jorisvandenbossche
Member

I disagree with that comment (#12392 (comment)): it is correct that the axis idiom is used a lot more in pandas, but we are speaking here about very specific functions where this comparison does not hold.
Eg in df.mean(axis=) you are applying the function over either axis (this would be difficult to express with index= or columns= arguments). But in the rename/drop methods, you are altering one of the axes, not applying a function along one of the axes. In that case, the index/columns args do make sense in a way that is not comparable to all those other methods that take the axis arg (and in that sense: yes, I personally will recommend people to write drop(columns=[..]) instead of drop([..], axis=1)).
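A short illustration of that distinction, with an invented frame: in mean the axis argument picks the direction of the computation, while in drop the keyword names the axis whose labels are altered (drop(columns=...) assumes pandas 0.21.0 or later).

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 4.0]})

# Operating along an axis: axis selects the direction of the reduction.
col_means = df.mean(axis=0)  # one value per column
row_means = df.mean(axis=1)  # one value per row

# Altering an axis: the keyword names the axis whose labels are removed.
without_b = df.drop(columns=["b"])

print(col_means.to_dict())
print(row_means.tolist())
print(list(without_b.columns))
```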

But anyhow, that's not really that relevant anymore :-) As it is good to make them consistent anyway, which means adding axis to rename, and then people can do what they like most.

@TomAugspurger Thanks for picking this up! Will look at the PR now.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 6, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 10, 2017
TomAugspurger added a commit that referenced this issue Oct 10, 2017
* API: Added axis argument to rename

xref: #12392

* API: Accept 'axis' keyword argument for reindex
@TomAugspurger TomAugspurger modified the milestones: 0.21.0, Next Major Release Oct 12, 2017
@TomAugspurger
Contributor

Were reindex and rename the last ones needed here? Can this be closed?

@jorisvandenbossche
Member

Yes, I think drop, rename and reindex were the only ones.

Closed by #17644, #17800 and #17842

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.21.0 Oct 13, 2017
@ghost ghost mentioned this issue Jul 22, 2019
4 tasks
7 participants