Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add inplace-kwarg to df.update #22286

Closed
wants to merge 2 commits into from

Conversation

h-vetinari
Copy link
Contributor

There wasn't a lot of discussion yet in #21858, so I thought I'd speed things up with a PR. So far:

While it may not be desirable to break the python dict-default of update being inplace (this PR leaves the default unchanged), I think it's very relevant for chaining, and (IMO) pandas' philosophy in general.

In #21858, I quoted @TomAugspurger (commenting in #21841 about not adding an option to inplace), which I find a good statement:

Generally, we're moving away from inplace operations. It's confusing whether inplace means no copy or not. [...] We recommend chaining your method calls, and hope to provide better memory control in the future.

I find this applies just as much to df.update, hence the PR.

@codecov
Copy link

codecov bot commented Aug 11, 2018

Codecov Report

Merging #22286 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22286      +/-   ##
==========================================
+ Coverage   92.18%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50810    50815       +5     
==========================================
+ Hits        46839    46844       +5     
  Misses       3971     3971
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.37% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.2% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb784ca...83f7f5e. Read the comment docs.

@gfyoung gfyoung requested review from jreback and toobaz August 11, 2018 23:43
@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Aug 11, 2018
@@ -5214,6 +5214,9 @@ def update(self, other, join='left', overwrite=True, filter_func=None,
raise_conflict : bool, default False
If True, will raise a ValueError if the DataFrame and `other`
both contain non-NA data in the same place.
inplace : bool, default True
If True, follows the convention by ``dict`` of updating inplace. If
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionadded tag

remove the commentary about dict

if inplace:
df.update(other, inplace=inplace)
else:
df = df.update(other, inplace=inplace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not an actual test, you need to rename to something else and assert that the original is not changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a valid objection - though I haven't seen it anywhere on the tests I've worked on that had an inplace-kwarg.

@jreback
Copy link
Contributor

jreback commented Aug 14, 2018

I am really -1 on inplace generally. To the extent that this is actually doing inplace operations, I guess explict is better than implicit.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 14, 2018

I am really -1 on inplace generally. To the extent that this is actually doing inplace operations, I guess explict is better than implicit.

This is one of the few(?) methods that's inplace by default - so to move away from that, it's necessary to introduce an inplace-kwarg in the first place, to then eventually, maybe deprecate away from inplace=True... ;-)

@h-vetinari
Copy link
Contributor Author

@jreback All green. I opened a follow-up for giving the same capabilities to Series.update, #22358.

@Liam3851
Copy link
Contributor

Going towards my comments on #21858 and the name "update" seeming wrong for a non-inplace method. It seems to me that there are a lot of methods which perform variations on the same functionality (join, combine, combine_first, update). It seems to me that non-inplace update is basically a slightly-modified version of join or combine. Perhaps deprecation of update and the addition of its specific features (e.g. filter_func) as optional parameters to one of the other non-inplace methods would be more appropriate?

@h-vetinari
Copy link
Contributor Author

@Liam3851 I think we see the world quite differently on that. To me, the essential functionality of update (like any other method) is the intended result, in this case a DF suitably updated with another one.

Whether that operation is inplace or not is purely a semantic distinction, and (IMHO) down to your habits - meaning no offense, of course. To illustrate my point, the exact same argument could be used to argue that fillna must work in place (hypothetical quote):

To me, fillna connotes inplace -- you're filling the "holes" of this DataFrame, why should this return another object?

Regarding your suggestion to wrap this into other methods, I don't think it would work. join does something different, combine_first should be deprecated (#21859), and combine needs a func-argument. If there were a default function that would allow using combine as update, I could maybe see it work - all the existing keywords of update would work, plus it could be without inplace from the get-go. But I tried quickly writing such a function, and it essentially boils down to rewriting the essential functionality of update manually.

Finally, I find "combine" to be a much too general term, when the thought process a user might try to materialise is simply "update this DF with that DF, but don't overwrite" (or maybe do...). For that reason, I never used combine (resp. needed to), because update was always the more obvious choice.

@Liam3851
Copy link
Contributor

@h-vetinari To me "update" has a distinct connotation in terms of both dicts and SQL as changing the underlying data, but I agree the point is largely semantic.

My personal experience as a longtime user is that I've hardly ever used update, while combine_first is bread-and-butter and has an analog in xarray, so I'm a pretty strong -1 on deprecating the latter (#21859) and replacing it with the former.

@h-vetinari
Copy link
Contributor Author

@jreback I incorporated the requested changes a week ago.

@h-vetinari
Copy link
Contributor Author

@jreback Rebased and all green. The content has not changed from a week ago.

@h-vetinari
Copy link
Contributor Author

@jreback Two weeks of green. ;-)
Would appreciate some progress on this as it's blocking PRs for #22358 and #21855

@jreback
Copy link
Contributor

jreback commented Aug 28, 2018

@h-vetinari there are a number of open questions on this API generally. maybe @TomAugspurger @jorisvandenbossche have some thoughts

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 28, 2018

@jreback OK cool, let's have a discussion. Could you specify which open questions you see?

@h-vetinari
Copy link
Contributor Author

@jreback @TomAugspurger @jorisvandenbossche

Do you see any downside to adding an inplace kwarg to update? That in and of itself should not be controversial - explicit is better than implicit, no?

The discussion about the joins can be continued in #21855.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 10, 2018

@jreback @TomAugspurger @jorisvandenbossche @toobaz

This PR has been green for 4 weeks (the issue has been open another month on top of that with little discussion) - could I please have some guidance of how to proceed (then I'd also offer to tackle PRs for #22358 and #21855)?

As this PR suggests, I'd prefer to add an inplace-kwarg to update, but could imagine another option as well (quoting myself from the issue):

Alternatively, if update is such a reserved name, one could think of having the required functionality (fusing two dataframes with given precedence and requirement for output dimensions) in a separate method called e.g. df.fuse or df.coalesce?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 10, 2018 via email

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 10, 2018

@TomAugspurger
Thanks for the response! I responded to your remark in the issue with the suggestion quoted above (of then having a different method for that functionality) - any comment about that? Especially in chained calls, I find it not desirable to have inplace operations (whether returning self or not).

Aside from that, does any inplace operation currently return something other than None?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 10, 2018 via email

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 10, 2018

@TomAugspurger

Why do you prefer not-inplace ops in method chains, or more precisely, why does it matter?

That's easy to answer: because the original might still be needed later in computation (example below). That's why - if this is not about the functionality, but rather the semantic distinction of what "update" means - I suggested a different method name.

As for the example, here's the simple setup:

N = 1000
df = pd.DataFrame({'x': np.random.randint(0, 10, (N,))})
df2 = df.sample(int(N/2)).add(2).mod(10)
df3 = df.sample(int(N/2)).add(2).mod(10)

My suggestion: with inplace=False (1 sloc):

res = df.loc[df.update(df2, inplace=False).gt(df.update(df3, inplace=False)).x].mean()

Even better would be something that's not inplace by default (1 sloc):

res = df.loc[df.fuse(df2).gt(df.fuse(df3)).x].mean()

With self-returning update, it would work as well, at the cost of (potentially many) copy-calls (1 sloc):

res = df.loc[df.copy().update(df2).gt(df.copy().update(df3)).x].mean()

Current status (5 sloc):

dfo = df.copy()  # original
dfi = df.copy()  # intermediate
dfi.update(df2)
df.update(df3)
res = dfo.loc[dfi.gt(df).x].mean()

@TomAugspurger
Copy link
Contributor

Thanks. I think I would recommend your 3rd solution. I prefer that to adding additional keyword arguments or new methods.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger

Thanks. I think I would recommend your 3rd solution. I prefer that to adding additional keyword arguments or new methods.

I get your point, just two comments:

  • the inplace-ing would/will get even more complicated with allowing different joins for update (ENH: more joins for DataFrame.update #21855)

  • but perhaps even more importantly:

    [...] does any inplace operation currently return something other than None?

    This was the reason I didn't understand your comment in the issue - I didn't get that you meant that the method should return something despite being inplace, as this goes against all (my) experience in python/pandas.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 10, 2018 via email

@pep8speaks
Copy link

Hello @h-vetinari! Thanks for updating the PR.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

@h-vetinari I don't mind the code changes you did, they look fine. The issue is that inplace=True is no-where the default anywhere in pandas. Changing this is simply not-tenable, w/o a deprecation cycle (e.g. adding the keyword defaulting to None and then showing a warning if its not passed).

So we have .update (in-place defaults) and .combine_first which is not very standard terminology.

In an ideal world I think adding .coalesce is probably the right thing to do (does R use this term?).
which is basically a rename of .combine_first, and deprecate .update.

@h-vetinari
Copy link
Contributor Author

@jreback

Thanks for the comment, I strongly agree in principle.

The issue is that inplace=True is no-where the default anywhere in pandas

It is effectively in .update, just without having the explicit kwarg...

In an ideal world I think adding .coalesce is probably the right thing to do (does R use this term?).
which is basically a rename of .combine_first, and deprecate .update.

I really like this idea. This would also basically solve the whole discussion about more joins for .update (see #21855), which is why I brought up .coalesce (or .fuse) as an alternative there as well...

And yes, dplyr uses "coalesce", which itself is inspired by SQL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf#page.15

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

It is effectively in .update, just without having the explicit kwarg...

yes and that is exactly the problem. We can't simply add an inplace=True keyword here, which I think would be way more confusing

@h-vetinari
Copy link
Contributor Author

yes and that is exactly the problem. We can't simply add an inplace=True keyword here, which I think would be way more confusing

I disagree that this would be more confusing (just more explicit), but this is moot anyway if we follow the .coalesce route. Should I open an issue to introduce .coalesce (with the capabilities of update; to be able to eventually deprecate combine_first and update)?

@jreback
Copy link
Contributor

jreback commented Sep 23, 2018

@h-vetinari i think that would be ok, to get some more commentary on this, esp from @jorisvandenbossche and @TomAugspurger (and some off-line discussions that I had with @cpcloud )

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

closing this. there are many issues about this, and this PR is stale.

@jreback jreback closed this Dec 3, 2018
@h-vetinari
Copy link
Contributor Author

It wasn't stale just stalled - it's a PR I haven't pushed because there are more pressing issues to me.

You mention "many" issues, but so far, no-one has given an actual example (aside from personal preference that .update should always mean inplace).

In any case, I'm letting this be, but tried to restart the conversation in #22812. Would appreciate if you could comment there. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add inplace-kwarg to update
6 participants