Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: coalesce-method (upgrade for update/combine_first) #22812

Open
h-vetinari opened this issue Sep 23, 2018 · 11 comments
Open

ENH: coalesce-method (upgrade for update/combine_first) #22812

h-vetinari opened this issue Sep 23, 2018 · 11 comments
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Sep 23, 2018

The state of update/combine_first in v0.23:

I tried to make some steps towards #21855 and #21859 by adding an inplace-kwarg to df.update in #22286, which has been stalled in discussion whether update should ever be inplace at all, resp. how to move away from inplacing generally.

Today, some headway was made with the comment by @jreback:

So we have .update (in-place defaults) and .combine_first which is not very standard terminology.
In an ideal world I think adding .coalesce is probably the right thing to do (does R use this term?).
which is basically a rename of .combine_first, and deprecate .update.

which I'm strongly in favour of (with the caveat that it should use the capabilities of update; I suggested something similar in #21855; would also solve most of the discussion there). And yes, dplyr uses "coalesce", which itself is inspired by SQL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf#page.15

This discussion is opened on the advice of @jreback, who would like to involve:

[...] to get some more commentary on this, esp from @jorisvandenbossche and @TomAugspurger (and some off-line discussions that I had with @cpcloud )

Also tagging the other participants of #21855: @gfyoung @toobaz

Summing up this proposal:

  1. Add .coalesce to generic.py, à la:
    def coalesce(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False):
    which is not inplace and inherited by DataFrame/Series
  2. support different joins, at least: join='left'|'outer'|'inner'|'right' (most of the discussion in ENH: more joins for DataFrame.update #21855 is about potentially allowing different joins for different axes, and which keywords to use for that).
  3. (potentially; not essential to the proposal) slowly deprecate .update and .combine_first
@jorisvandenbossche
Copy link
Member

Just about the name:

One disadvantage of this name: not being native speaking, and not coming from a database background, I would not even have the slightest idea what even the word itself would mean, let alone what it would do.

I know it is a term used in SQL (and which is a big plus for using such a name), and in the meantime I know it from there, but IMO it is a rather newcomer unfriendly name. In that sense, I find 'combine' or 'update' more natural terminology.


Then about the functionality: I think a coalesce method would need to be rather like a combine_first than as a update, in terms of default overwrite behaviour (which might not be what you want?), since the SQL coalesce is about the first non-missing value.


(sidenote in general: sorry for the slow progress at the moment in those discussions, it seems there are currently not many other core devs that have time to participate, and such discussions take time / need input of other people. That can certainly be annoying, but is part of the open source process ..)

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 27, 2018

@jorisvandenbossche

Thanks for the response! I understand that dev time is a very limited resource... ;-)

I don't feel strongly about the name. I'd be fine with having it under update (which some people strongly oppose to being not inplace; hence this issue), and I'd be fine with having it as coalesce (which has the SQL background; though you're right that the default for overwrite should then be False).

I'd also be fine with naming it something shorter (and less specific than update/coalesce/combine_first), like fuse, which - in terms of language difficulty - should be IMO on par with melt/pivot/merge, and would avoid clashing with people's pre-conceived notions about what capabilities the name implies (inplace/overwrite/join-style etc.). It's also nice and short to type.

https://www.dictionary.com/browse/fuse:
[...]
verb (used with object), fused, fus·ing.
2. to combine or blend by melting together; melt.
3. to unite or blend into a whole, as if by melting together:

The author skillfully fuses these fragments DataFrames into a cohesive whole.

verb (used without object), fused, fus·ing.
[...]
5. to become united or blended:

The two groups DataFrames fused to create one strong union.

In any case, my main point is about having the desired capabilities, not about a specific name.

@h-vetinari
Copy link
Contributor Author

@gfyoung
Copy link
Member

gfyoung commented Dec 3, 2018

@h-vetinari : Yikes! Sorry that we all went dark on this...

Given that you already had some backing already on this proposal from @jreback (and in some way from @jorisvandenbossche ), I would suggest that you try implementing this and open a PR. In terms of implementation, I'm inclined to agree that naming / behavior should be consistent with SQL.

If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.

@gfyoung
Copy link
Member

gfyoung commented Dec 3, 2018

@pandas-dev/pandas-core @h-vetinari : As a side note, I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile...

Perhaps we / you might want to clean those up first for merging before opening yet another PR? 😉

@h-vetinari
Copy link
Contributor Author

@gfyoung

If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.

My preference would be .fuse, as outlined in #22812 (comment)

I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile...
Perhaps we / you might want to clean those up first for merging before opening yet another PR?

I try to keep all my PRs current and have responded quickly to any and all feedback, but since that is such a scarce resource (and in my case, >95% of it is done by @jreback, it has to be acknowledged. So thanks for that! :) ), I have no problem doing several big PRs in parallel, which increases the total feedback throughput I receive and therefore lets me make faster progress.

In detail:

@wesm
Copy link
Member

wesm commented Dec 3, 2018

If we're changing the name, I'm wondering if it's worth creating yet-another-name versus using something somewhat standardized (COALESCE from SQL, for better or for worse)

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Dec 3, 2018

@wesm
Thanks for taking the time to answer.

The pre-existing name is a strong argument for sure. While I care much more about the functionality than what name it resides under, I also see the appeal of not having the baggage of preconceptions that come with the name (e.g. coalesce does not have a notion of a join/overwrite in SQL land, or update being inplace, etc.)

This could look as follows in a putative docstring:

    def fuse(self, other, join='left', overwrite=True, filter_func=None,
             errors='ignore'):
        """
        Unite two DataFrames into one, filling non-NA values where possible.

        This method always aligns on index and columns. The parameters easily
        allow switching between different usage patterns, which sometimes have
        dedicated methods in other languages (or previous versions of pandas):

        * `update` (e.g. for python-dict):
            same result as `df.fuse(other)`, but `fuse` is not inplace.
        * `coalesce` (SQL):
            equivalent (entry-by-entry) to `df.fuse(other, overwrite=False)`.
        * `combine_first` (pandas before version 1.0):
            `df.combine_first(other)` is equivalent to
            `df.fuse(other, join='outer', overwrite=False) or
            `other.fuse(df, join='outer')`.

        Two additional keywords allow further tuning of the behavior, e.g. to
        raise if non-NA values coincide.

        Parameters
        ----------
        other : DataFrame, or object coercible into a DataFrame
            [rest is the same as for df.update (but with more joins)]

That being said, I'd be just as happy with coalesce.

@wesm
Copy link
Member

wesm commented Dec 4, 2018

The name "fuse" doesn't really connote the "overlay" aspect of combine_first to me, but I don't have super strong feelings about it

@h-vetinari
Copy link
Contributor Author

It's really a pity that the .update story in pandas is still a wasteland over two years later (e.g. Series.update has no kwargs, and cannot be pipelined due to forcefully being in-place), much less providing something for coalesce or equivalent.

PS. I was prepared to implement all that along the lines of the docstring above (and do it right, going deep into the bowels of the type promotion to avoid painful inconsistencies for the user), but decided to spend my energies elsewhere after investing 100s of hours that ended up being repaid with what I can only describe as hostility (and - where my changes got taken over - basically zero attribution).

@pwwang
Copy link

pwwang commented Mar 17, 2022

@h-vetinari Is this something you desired: https://pwwang.github.io/datar/notebooks/coalesce/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

6 participants