ENH: coalesce-method (upgrade for update/combine_first) #22812

h-vetinari · 2018-09-23T14:07:41Z

The state of update/combine_first in v0.23:

.update signature does not match between DataFrame/Series (ENH: unify signature for df.update and Series.update #22358)
df.update has a join-kwarg that only supports left, although the source code itself notes:
# TODO: Support other joins (ENH: more joins for DataFrame.update #21855)
.update is one of the (very) few pandas-methods that's inplace by default, but does not have an inplace-kwarg (ENH: add inplace-kwarg to df.update #22286)
.combine_first is effectively (the not-yet-implemented) .update(join='outer'), has an awkward, non-standard name, and far fewer capabilities than .update. (DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859)
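For readers less familiar with the two methods, the difference described above can be sketched as follows. This is a minimal illustration with made-up data; it only uses the existing `DataFrame.update` and `DataFrame.combine_first` calls.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the issue): `other` has one extra row (index 3).
df = pd.DataFrame({"a": [1.0, np.nan, 3.0]}, index=[0, 1, 2])
other = pd.DataFrame({"a": [10.0, 20.0, np.nan, 40.0]}, index=[0, 1, 2, 3])

# combine_first returns a new object, keeps the caller's non-NA values,
# fills its holes from `other`, and uses the union of both indexes.
combined = df.combine_first(other)

# update works in place and returns None; it overwrites with the non-NA
# values of `other` and keeps only the caller's index (a "left" join).
updated = df.copy()
result = updated.update(other)

print(combined)
print(updated)
print(result)
```

Note that `update` cannot be chained, because it returns `None`.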

I tried to make some steps towards #21855 and #21859 by adding an inplace-kwarg to df.update in #22286, which has stalled in a discussion about whether update should ever be inplace at all, and how to move away from inplace operations generally.

Today, some headway was made with the comment by @jreback:

So we have .update (in-place defaults) and .combine_first which is not very standard terminology.
In an ideal world I think adding .coalesce is probably the right thing to do (does R use this term?).
which is basically a rename of .combine_first, and deprecate .update.

which I'm strongly in favour of (with the caveat that it should use the capabilities of update; I suggested something similar in #21855; would also solve most of the discussion there). And yes, dplyr uses "coalesce", which itself is inspired by SQL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf#page.15

This discussion is opened on the advice of @jreback, who would like to involve:

[...] to get some more commentary on this, esp from @jorisvandenbossche and @TomAugspurger (and some off-line discussions that I had with @cpcloud )

Also tagging the other participants of #21855: @gfyoung @toobaz

Summing up this proposal:

Add .coalesce to generic.py, à la:
def coalesce(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False):
which is not inplace and inherited by DataFrame/Series
support different joins, at least: join='left'|'outer'|'inner'|'right' (most of the discussion in ENH: more joins for DataFrame.update #21855 is about potentially allowing different joins for different axes, and which keywords to use for that).
(potentially; not essential to the proposal) slowly deprecate .update and .combine_first
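A rough sketch of what the proposal summarised above could look like, built only from existing pandas calls (`DataFrame.align` and `DataFrame.where`). The free function `coalesce` here is hypothetical and not part of pandas; it covers only the `join` and `overwrite` parameters and leaves out `filter_func` and `raise_conflict`.

```python
import numpy as np
import pandas as pd


def coalesce(df, other, join="left", overwrite=True):
    """Hypothetical, non-inplace sketch of the proposed method."""
    # Align both frames on index and columns using the requested join.
    left, right = df.align(other, join=join)
    if overwrite:
        # Prefer non-NA values of `other`, fall back to `df`.
        return right.where(right.notna(), left)
    # Prefer non-NA values of `df`, fall back to `other`.
    return left.where(left.notna(), right)


# Illustrative data (not from the issue).
df = pd.DataFrame({"a": [1.0, np.nan, 3.0]}, index=[0, 1, 2])
other = pd.DataFrame({"a": [10.0, 20.0, np.nan, 40.0]}, index=[0, 1, 2, 3])

like_update = coalesce(df, other)                                  # left join, overwrite
like_combine_first = coalesce(df, other, join="outer", overwrite=False)
right_joined = coalesce(df, other, join="right")

print(like_update)
print(like_combine_first)
print(right_joined)
```

Under these assumptions the defaults reproduce what `df.update(other)` does to the values, but return a new object, while `join='outer', overwrite=False` gives the same values as `df.combine_first(other)` for this data.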


jorisvandenbossche · 2018-09-27T12:11:14Z

Just about the name:

One disadvantage of this name: not being a native speaker, and not coming from a database background, I would not have the slightest idea what the word itself means, let alone what the method would do.

I know it is a term used in SQL (and which is a big plus for using such a name), and in the meantime I know it from there, but IMO it is a rather newcomer unfriendly name. In that sense, I find 'combine' or 'update' more natural terminology.

Then about the functionality: I think a coalesce method would need to behave more like combine_first than like update in terms of default overwrite behaviour (which might not be what you want?), since the SQL coalesce is about the first non-missing value.
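The "first non-missing value" semantics mentioned above can be illustrated with a small sketch. The helper `first_non_missing` is hypothetical and treats `None` and NaN as stand-ins for SQL `NULL`, which is an assumption made for this example only.

```python
import math

import pandas as pd


def first_non_missing(*values):
    """Return the first argument that is neither None nor NaN, else None."""
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue
        return v
    return None


# Illustrative data (not from the issue).
left = pd.Series([1.0, float("nan"), float("nan")])
right = pd.Series([9.0, 2.0, float("nan")])

# Entry-by-entry coalesce: the left value wins whenever it is present.
elementwise = [first_non_missing(x, y) for x, y in zip(left, right)]

# combine_first follows the same rule (no overwrite) ...
kept_first = left.combine_first(right)

# ... whereas update overwrites with every non-NA value of `right`.
overwritten = left.copy()
overwritten.update(right)

print(elementwise)
print(kept_first.tolist())
print(overwritten.tolist())
```

In the first row the two methods disagree: `combine_first` keeps the left value, `update` replaces it with the right one.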

(sidenote in general: sorry for the slow progress at the moment in those discussions, it seems there are currently not many other core devs that have time to participate, and such discussions take time / need input of other people. That can certainly be annoying, but is part of the open source process ..)

h-vetinari · 2018-09-27T12:53:22Z

@jorisvandenbossche

Thanks for the response! I understand that dev time is a very limited resource... ;-)

I don't feel strongly about the name. I'd be fine with having it under update (though some people strongly oppose update being non-inplace; hence this issue), and I'd be fine with having it as coalesce (which has the SQL background; though you're right that the default for overwrite should then be False).

I'd also be fine with naming it something shorter (and less specific than update/coalesce/combine_first), like fuse, which - in terms of language difficulty - should be IMO on par with melt/pivot/merge, and would avoid clashing with people's pre-conceived notions about what capabilities the name implies (inplace/overwrite/join-style etc.). It's also nice and short to type.

https://www.dictionary.com/browse/fuse:
[...]
verb (used with object), fused, fus·ing.
2. to combine or blend by melting together; melt.
3. to unite or blend into a whole, as if by melting together:

The author skillfully fuses these ~~fragments~~ DataFrames into a cohesive whole.

verb (used without object), fused, fus·ing.
[...]
5. to become united or blended:

The two ~~groups~~ DataFrames fused to create one strong union.

In any case, my main point is about having the desired capabilities, not about a specific name.

h-vetinari · 2018-12-03T06:53:22Z

Re-pinging @jreback @jorisvandenbossche @TomAugspurger @cpcloud @gfyoung @toobaz

gfyoung · 2018-12-03T09:13:22Z

@h-vetinari : Yikes! Sorry that we all went dark on this...

Given that you already had some backing on this proposal from @jreback (and in some way from @jorisvandenbossche ), I would suggest that you try implementing this and open a PR. In terms of implementation, I'm inclined to agree that naming / behavior should be consistent with SQL.

If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.

gfyoung · 2018-12-03T09:17:51Z

@pandas-dev/pandas-core @h-vetinari : As a side note, I noticed that you have a handful of substantive PRs open that touch upon some pretty big functionality questions but seem to have reached impasses, either as a result of core devs not answering or, @h-vetinari, your not updating them for a while...

Perhaps we / you might want to clean those up first for merging before opening yet another PR? 😉

h-vetinari · 2018-12-03T17:46:13Z

@gfyoung

If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be.

My preference would be .fuse, as outlined in #22812 (comment)

I noticed that you have a handful of substantive PRs open that touch upon some pretty big functionality questions but seem to have reached impasses, either as a result of core devs not answering or, @h-vetinari, your not updating them for a while...
Perhaps we / you might want to clean those up first for merging before opening yet another PR?

I try to keep all my PRs current and have responded quickly to any and all feedback, but since that is such a scarce resource (and in my case, >95% of it is done by @jreback, it has to be acknowledged. So thanks for that! :) ), I have no problem doing several big PRs in parallel, which increases the total feedback throughput I receive and therefore lets me make faster progress.

In detail:

ENH: Add set_index to Series #22225 is stalled in API discussion (see API: capabilities of df.set_index #24046)
DEPR: join_axes-kwarg in pd.concat #22318 has been waiting for input from @jorisvandenbossche for 2 months now (haven't pushed that PR as not very important to me)
API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167 is progressing (after some big precursors had to be dealt with)
API: Unify .update to generic #23192 is stuck on a long chain of blocking PRs/issues: TST: add test coverage for maybe_promote #23982 -> BUG/Internals: maybe_promote #23833 -> BUG/Internals: maybe_upcast_putmask #23823 -> API: Unify .update to generic #23192
DEPR: deprecate default of skipna=False in infer_dtype #24050 is quite fresh
(I also closed ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645 as outdated since you looked)
and finally two other small PRs not worth mentioning

wesm · 2018-12-03T20:24:22Z

If we're changing the name, I'm wondering if it's worth creating yet-another-name versus using something somewhat standardized (COALESCE from SQL, for better or for worse)

h-vetinari · 2018-12-03T23:58:33Z

@wesm
Thanks for taking the time to answer.

The pre-existing name is a strong argument for sure. While I care much more about the functionality than what name it resides under, I also see the appeal of not having the baggage of preconceptions that come with the name (e.g. coalesce does not have a notion of a join/overwrite in SQL land, or update being inplace, etc.)

This could look as follows in a putative docstring:

    def fuse(self, other, join='left', overwrite=True, filter_func=None,
             errors='ignore'):
        """
        Unite two DataFrames into one, filling non-NA values where possible.

        This method always aligns on index and columns. The parameters easily
        allow switching between different usage patterns, which sometimes have
        dedicated methods in other languages (or previous versions of pandas):

        * `update` (e.g. for python-dict):
            same result as `df.fuse(other)`, but `fuse` is not inplace.
        * `coalesce` (SQL):
            equivalent (entry-by-entry) to `df.fuse(other, overwrite=False)`.
        * `combine_first` (pandas before version 1.0):
            `df.combine_first(other)` is equivalent to
            `df.fuse(other, join='outer', overwrite=False)` or
            `other.fuse(df, join='outer')`.

        Two additional keywords allow further tuning of the behavior, e.g. to
        raise if non-NA values coincide.

        Parameters
        ----------
        other : DataFrame, or object coercible into a DataFrame
            [rest is the same as for df.update (but with more joins)]

That being said, I'd be just as happy with coalesce.

wesm · 2018-12-04T00:02:36Z

The name "fuse" doesn't really connote the "overlay" aspect of combine_first to me, but I don't have super strong feelings about it

h-vetinari · 2021-06-06T13:01:07Z

It's really a pity that the .update story in pandas is still a wasteland over two years later (e.g. Series.update has no kwargs, and cannot be pipelined due to forcefully being in-place), much less providing something for coalesce or equivalent.
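To make the pipelining complaint concrete: because `Series.update` returns `None`, it cannot sit in the middle of a method chain. One possible workaround is a small copying wrapper passed to `Series.pipe`. The helper name `updated` is made up for this sketch and is not a pandas API.

```python
import numpy as np
import pandas as pd


def updated(s, other):
    """Hypothetical helper: non-inplace variant of Series.update."""
    out = s.copy()
    out.update(other)  # modifies the copy in place, returns None
    return out


# Illustrative data (not from the issue).
s = pd.Series([1.0, np.nan, 3.0])
other = pd.Series([np.nan, 20.0, 30.0])

# The wrapper can be chained, unlike s.update(other) itself.
result = s.pipe(updated, other).mul(2)

print(result.tolist())
print(s.tolist())  # the original Series is left untouched
```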

PS. I was prepared to implement all that along the lines of the docstring above (and do it right, going deep into the bowels of the type promotion to avoid painful inconsistencies for the user), but decided to spend my energies elsewhere after investing 100s of hours that ended up being repaid with what I can only describe as hostility (and - where my changes got taken over - basically zero attribution).

pwwang · 2022-03-17T18:44:21Z

@h-vetinari Is this something you desired: https://pwwang.github.io/datar/notebooks/coalesce/

gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Sep 23, 2018

h-vetinari mentioned this issue Sep 26, 2018: API/ENH: overhaul/unify/improve .unique #22824 (Open, 6 tasks)

jorisvandenbossche mentioned this issue Sep 27, 2018: DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859 (Open)

h-vetinari mentioned this issue Dec 3, 2018: ENH: add inplace-kwarg to df.update #22286 (Closed, 4 tasks)

h-vetinari mentioned this issue Dec 3, 2018: API/DEPR: Deprecate inplace parameter #16529 (Open)

h-vetinari mentioned this issue Feb 24, 2019: REF: Fix maybe_promote #25425 (Closed, 3 tasks)

mroeschke added Enhancement and removed API Design labels Jun 22, 2021

jreback mentioned this issue May 31, 2023: DEPR: combine_first #53461 (Open)
