ENH: more joins for DataFrame.update #21855

h-vetinari · 2018-07-11T06:27:40Z

This should be non-controversial, as even the source code for DataFrame.update literally says (https://github.com/pandas-dev/pandas/blob/v0.23.3/pandas/core/frame.py#L5054):
# TODO: Support other joins

I tried to look if a previous issue for this exists, but did not find one.

Some thoughts that arise:

default for join should clearly be 'left'
df1.update(df2, join='right', overwrite=some_boolean) would be the same as df2.update(df1, join='left', overwrite=not some_boolean). IMO this is not a terrible redundancy, as it allows each user to choose the order that more easily fits their thought pattern.
df1.combine_first(df2) would be the same as df1.update(df2, join='outer', overwrite=False), only that combine_first has much fewer options and controls (i.e. filter_func and raise_conflict). Actually, I'd very much like to deprecate combine_first before pandas 1.0. Only difference is that update returns None, which should be changed as well IMO -- relevant xrefs: ENH: add inplace-kwarg to update #21858 DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859
this should IMO also support a way to control which axes are joined in what way (edit: the below was the original proposal; better variants are discussed in ENH: more joins for DataFrame.update #21855 (comment)).
- The first way that came to mind would be with an axis=0|1|None-keyword, like in DataFrame.align. However, upon further investigation, I don't believe this to be a good choice, as anything other than axis=None would implicitly have to choose a join for the other axis to actually decide the index/columns of the result.
- Since "explicit is better than implicit", I'd like to propose a version with just one kwarg, namely:
```
join=['left', 'left']  # same as 'left' (and so on for 'inner'|'outer'|'right')
join=['left', 'inner'] | ['left', 'outer'] etc. (for all other 12 combinations)
```
- I'd say list and tuple would be reasonable to allow as containers, but not more.

The text was updated successfully, but these errors were encountered:

gfyoung · 2018-07-12T00:36:08Z

I think this is a pretty solid proposal. That being said, given that this is a core part of DataFrame, let's ping @pandas-dev/pandas-core as well. Thoughts?

jreback · 2018-07-12T00:41:49Z

we only join on a single axes at a time
so your last point is a no go as you have written it
however look at how pd.concat handles the non concat axes which is what you are describing

h-vetinari · 2018-07-12T07:31:30Z

@jreback I don't think I follow completely. Of course concat and align join on both axes, the former always defaulting to 'outer', and the latter explicitly with axis=None.

But more to the point, both methods satisfy different needs, and update has yet another set of requirements. Consider the following:

one = pd.DataFrame([[0, 1], [2, 3]], columns=list('ab'))
two = pd.DataFrame([[10, 11], [12, 13]], index=[1, 2], columns=list('bc'))

## this case is clear: join on both axes
one.align(two, join='outer', axis=None)
#      a    b   c
# 0  0.0  1.0 NaN
# 1  2.0  3.0 NaN
# 2  NaN  NaN NaN
#     a     b     c
# 0 NaN   NaN   NaN
# 1 NaN  10.0  11.0
# 2 NaN  12.0  13.0

## this aligns only columns
one.align(two, join='outer', axis=1)
#    a  b   c
# 0  0  1 NaN
# 1  2  3 NaN
#     a   b   c
# 1 NaN  10  11
# 2 NaN  12  13

But this does not help us, as this does not give an answer whether the index should be [0,1] | [0,1,2] | [1,2] | [1], and reasonable arguments can be made at least for 'left' and 'outer'.

concat also does not solve the problem either:

pd.concat([one, two], axis=0, sort=False)
#      a   b     c
# 0  0.0   1   NaN
# 1  2.0   3   NaN
# 1  NaN  10  11.0
# 2  NaN  12  13.0
pd.concat([one, two], axis=1)
#      a    b     b     c
# 0  0.0  1.0   NaN   NaN
# 1  2.0  3.0  10.0  11.0
# 2  NaN  NaN  12.0  13.0

Actually, one can very easily emulate joins in two axes with concat, (with the restriction that the concatenation axis can just be 'outer'|'inner'), although it took some figuring out, because the join_axes kwarg is not very well-documented...

## 'right'-join for the non-concatenation axis
pd.concat([one, two], join='outer', axis=1, join_axes=[two.index])
#      a    b   b   c
# 1  2.0  3.0  10  11
# 2  NaN  NaN  12  13

The only thing missing here would be the processing of the 'b' columns (and a decision on which columns to retain). Assuming we'd chose 'left' for the columns, we'd have

one.update(two, join=['right', 'left'])
#      a   b
# 1  2.0  10
# 2  NaN  12
one.update(two, join=['right', 'left'], overwrite=False)
#      a   b
# 1  2.0   3
# 2  NaN  12

Personally, I think this is much more readable than:

one.update(two, join='left', axis=1, join_axes=[two.index], overwrite=False)

So, summing up, align does not care what the non-concatenation axis looks like, but update must know to decide on the right non-concatenation index (and at least 'left' and 'outer' should be possible there), and concat is essentially built (with the join_axes-kwarg) to do just what I propose to do. Of course, concat has to deal with concatenating potentially many objects, whereas update just takes two -- this allows the flexibility to add a more user-friendly API for update.

h-vetinari · 2018-07-17T13:20:35Z

@jreback Any response to the above?

h-vetinari · 2018-08-06T06:44:10Z

@pandas-dev @jreback gentle ping :)

toobaz · 2018-08-09T14:31:09Z

I'm +1 on deprecating combine_first, allowing update to replace it. I'm +1 in general on the proposed new skills for update. I'm -0.5 however on the proposed values for join: I think I would rather have

index = 'left'|'right'|...
columns = 'left'|'right'|...

(which as a bonus - and only if trivial implementation-wise - could accept an actual Index instance) and deprecate join.
The reasons are:

I can't think of any other method in the API which has something similar to join=['left', 'right'], while several methods have index= and columns=
I want to be able to pass the concatenation index for columns while using the default for index (e.g. if the two dataframes have the same index and I'm lazy)
Series.update still does not support join, so we have no consistency constraints on that side

Am I missing anything @h-vetinari ?

h-vetinari · 2018-08-09T15:27:43Z

@toobaz

[...] I would rather have

index = 'left'|'right'|...
columns = 'left'|'right'|...

Good idea! Certainly better than the ['left', 'right'] thing I came up with. I like it a lot, the only question is how unambiguous the kwarg-names are in the context of:

DataFrame.update(other, index='left', columns='left', overwrite=True,
                 filter_func=None, raise_conflict=False)

And even though join currently does nothing, I guess removing it counts as an API-break...?

The idea for the "bonus" is very similar to the join_axes-kwarg of concat, but with an intuitive meaning. This would essentially wrap reindex then? +0.5

Thinking a bit more about join, maybe it could be left in to broadcast its value to index/columns, with higher precedence for the latter. There should probably be a warning if all three are set? I could only see that with None-defaults for all, and some logic as follows:

def DataFrame.update(other, join=None, index=None, columns=None, overwrite=True,
                     filter_func=None, raise_conflict=False):
[...]
if all(x is not None for x in [join, index, columns]):
    raise UserWarning('The join-keyword will be ignored, since both index/columns are set.')
join = 'left' if join is None else join
index = join if index is None else index
columns = join if columns is None else columns
[...]

Alternatively, join could just be slapped with a DeprecationWarning, and phased out completely with 1.0... Ultimately, I think this would be easiest, and (combined with your index/columns-suggestion) might be my favourite solution so far.

h-vetinari · 2018-08-13T15:09:18Z

@toobaz Any thoughts to the above, resp. preferences for the summary below (not counting deprecation cycles)? I'm torn between options 1. & 2.

replace join (currently does nothing) with index and columns
leave join and broadcast its value to index and columns (if the latter are not set)
something else

h-vetinari · 2018-08-15T06:14:00Z

Considerations for the API to allow different join-types for update:

Desirable to allow different join-types for index/columns
Possibly have index / columns-kwargs together with (or instead of) join - could also allow passing index object instead of 'left'|'outer'|... (see ENH: more joins for DataFrame.update #21855 (comment))
Signature between DataFrame.update and Series.update should match as much as possible (xref ENH: unify signature for df.update and Series.update #22358)

So then, the question is which combination makes the most sense. Here the variants so far for DataFrame.update:

DataFrame.update(other, index='left', columns='left', ...) with deprecation of join
DataFrame.update(other, join='left', index=None, columns=None, ...); with index/columns superseding join:
[...]
index = join if index is None else index
columns = join if index is None else columns
DataFrame.update(other, join=None, index=None, columns=None, ...) like above but with no default for join to be able to issue a warning:
[...]
if all(x is not None for x in [join, index, columns]):
raise UserWarning('The join-keyword will be ignored, since both index/columns are set.')
join = 'left' if join is None else join
index = join if index is None else index
columns = join if columns is None else columns

And the variants so far for Series.update:

Series.update(other, index='left', overwrite=True, filter_func=None, raise_conflict=False)
Series.update(other, join='left', overwrite=True, filter_func=None, raise_conflict=False)
Series.update(other, join='left', index=None, overwrite=True, ...) ; with index superseding join:
[...]
index = join if index is None else index
Series.update(other, join=None, index=None, overwrite=True, ...) like above but with no default for join to be able to issue a warning:
[...]
if all(x is not None for x in [join, index]):
raise UserWarning('The join-keyword will be ignored, since index has been passed.')
join = 'left' if join is None else join
index = join if index is None else index

I think that - from the POV of consistency (and the idea of @toobaz to be able to pass indexes directly) - the most reasonable choice would be DF_1./S_1. In particular, then join for df.update should be deprecated in v0.24. Alternatively, DF_2./S_3. also seems reasonable to me.

[edit 180823]: added some more options
[edit 180905]: rewrote comment for more clarity

h-vetinari · 2018-08-20T06:14:27Z

@toobaz If I may ask for your input to the above (since your last response), then I could formulate a cleaned-up proposal which can then be reviewed by everybody.

h-vetinari · 2018-09-05T12:27:23Z

@gfyoung @toobaz @jreback @TomAugspurger @jorisvandenbossche

With v0.24 scheduled at the end of the month (and the deprecation policy leading up to v1.0), I wanted to bring attention to this issue again. Depending on the choice of API (see #21855 (comment)), it might be necessary to deprecate the join-kwarg for df.update (which currently does nothing), so I think this should be discussed.

h-vetinari · 2018-09-11T16:26:40Z

@wesm
Very gentle ping to the BDFL - I've been wanting to work on this for 2 months but can't move past the API discussion that essentially nobody has time (or motivation) for. Goal is more joins for df.update, noted as TODO even in the source code:

https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/frame.py#L5054

Open questions:

if and how to allow different joins for different axes (see summary in ENH: more joins for DataFrame.update #21855 (comment)) - depending on the choice, a (kwarg-)deprecation might be necessary (and v0.24 is around the corner, which is the last chance to deprecate before 1.0).
whether to allow update to have inplace=False, see ENH: add inplace-kwarg to df.update #22286 (thinking also towards potentially deprecating combine_first, see OP and DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859)
alternatively (though unlikely), leave df.update as-is and have a separate method like df.fuse or df.coalesce that fuses two dataframes with given precedence, and requirements for the output axes

gfyoung · 2018-09-11T18:40:07Z

@h-vetinari : Sorry that we've been pretty dark on this. I think there is definitely interest, but time is certainly a factor here. @jreback @toobaz : thoughts?

toobaz · 2018-09-11T19:07:36Z

Still very busy, sorry. Will be back next week.

wesm · 2018-09-13T20:50:40Z

I'm completely underwater until mid-November. If there is truly an impasse where I can help / weigh in, please let me know and I'll make the time

gfyoung · 2018-09-13T21:01:14Z

@wesm : I think @toobaz will be freed up next week, so we should be okay (he's probably best to speak to this issue at this point). Time is a scarce commodity indeed 🙂

h-vetinari · 2018-09-18T00:13:21Z

@jreback @toobaz @jorisvandenbossche

Friendly ping. :)

jorisvandenbossche · 2018-09-18T08:17:02Z

Some random thoughts for now:

I am not fully sure that I like the idea of index and columns keyword. It is true that they are already used in different methods, but they are also used in a different way. Eg in drop, you pass the actual index or column labels to drop. Comparing to merge, where you have a left_on/right_on (next to on), we could also think about a index_join/column_join (just thinking aloud, not sure I like this myself :), and it is of course not about the left/right frame here)
Maybe a more concrete worry is that in eg index='outer' it is not directly saying that this is about how to join.
One advantage of having (keeping) the single join keyword is that you can specify with a single keyword to eg outer join both axes, instead of having to pass both index='outer', columns='outer'.
If we add such functionality to have different join types for both axes, isn't this something we should start with adding to align ? (since that is the workhorse that will probably be used under the hood, and which allows you to already do outer updates manually)
To be clear, not saying that this should necessarily be done at the same, but I think we should make sure that the API we choose also works there.

h-vetinari · 2018-09-19T06:42:27Z

@toobaz
I'll wait for your answer before I respond to Joris (been spamming enough as is).

This was referenced Jul 11, 2018

ENH: add inplace-kwarg to update #21858

Open

DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859

Open

gfyoung added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 12, 2018

h-vetinari mentioned this issue Jul 16, 2018

DEPR: Pandas 0.24.0 Deprecations #21894

Closed

3 tasks

h-vetinari mentioned this issue Jul 17, 2018

DEPR: let's deprecate #18262

Closed

34 tasks

h-vetinari mentioned this issue Aug 11, 2018

ENH: add inplace-kwarg to df.update #22286

Closed

4 tasks

h-vetinari mentioned this issue Aug 15, 2018

ENH: unify signature for df.update and Series.update #22358

Open

h-vetinari mentioned this issue Aug 23, 2018

DEPR: join_axes-kwarg in pd.concat #22318

Merged

4 tasks

h-vetinari mentioned this issue Sep 23, 2018

ENH: coalesce-method (upgrade for update/combine_first) #22812

Open

h-vetinari mentioned this issue Oct 16, 2018

API: Unify .update to generic #23192

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: more joins for DataFrame.update #21855

ENH: more joins for DataFrame.update #21855

h-vetinari commented Jul 11, 2018 •

edited

Loading

gfyoung commented Jul 12, 2018

jreback commented Jul 12, 2018

h-vetinari commented Jul 12, 2018 •

edited

Loading

h-vetinari commented Jul 17, 2018

h-vetinari commented Aug 6, 2018 •

edited

Loading

toobaz commented Aug 9, 2018

h-vetinari commented Aug 9, 2018 •

edited

Loading

h-vetinari commented Aug 13, 2018

h-vetinari commented Aug 15, 2018 •

edited

Loading

h-vetinari commented Aug 20, 2018

h-vetinari commented Sep 5, 2018

h-vetinari commented Sep 11, 2018 •

edited

Loading

gfyoung commented Sep 11, 2018 •

edited

Loading

toobaz commented Sep 11, 2018

wesm commented Sep 13, 2018

gfyoung commented Sep 13, 2018 •

edited

Loading

h-vetinari commented Sep 18, 2018

jorisvandenbossche commented Sep 18, 2018

h-vetinari commented Sep 19, 2018

ENH: more joins for DataFrame.update #21855

ENH: more joins for DataFrame.update #21855

Comments

h-vetinari commented Jul 11, 2018 • edited Loading

gfyoung commented Jul 12, 2018

jreback commented Jul 12, 2018

h-vetinari commented Jul 12, 2018 • edited Loading

h-vetinari commented Jul 17, 2018

h-vetinari commented Aug 6, 2018 • edited Loading

toobaz commented Aug 9, 2018

h-vetinari commented Aug 9, 2018 • edited Loading

h-vetinari commented Aug 13, 2018

h-vetinari commented Aug 15, 2018 • edited Loading

h-vetinari commented Aug 20, 2018

h-vetinari commented Sep 5, 2018

h-vetinari commented Sep 11, 2018 • edited Loading

gfyoung commented Sep 11, 2018 • edited Loading

toobaz commented Sep 11, 2018

wesm commented Sep 13, 2018

gfyoung commented Sep 13, 2018 • edited Loading

h-vetinari commented Sep 18, 2018

jorisvandenbossche commented Sep 18, 2018

h-vetinari commented Sep 19, 2018

h-vetinari commented Jul 11, 2018 •

edited

Loading

h-vetinari commented Jul 12, 2018 •

edited

Loading

h-vetinari commented Aug 6, 2018 •

edited

Loading

h-vetinari commented Aug 9, 2018 •

edited

Loading

h-vetinari commented Aug 15, 2018 •

edited

Loading

h-vetinari commented Sep 11, 2018 •

edited

Loading

gfyoung commented Sep 11, 2018 •

edited

Loading

gfyoung commented Sep 13, 2018 •

edited

Loading