Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: more joins for DataFrame.update #21855

Open
h-vetinari opened this issue Jul 11, 2018 · 19 comments
Open

ENH: more joins for DataFrame.update #21855

h-vetinari opened this issue Jul 11, 2018 · 19 comments
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Jul 11, 2018

This should be non-controversial, as even the source code for DataFrame.update literally says (https://github.com/pandas-dev/pandas/blob/v0.23.3/pandas/core/frame.py#L5054):
# TODO: Support other joins

I tried to look if a previous issue for this exists, but did not find one.

Some thoughts that arise:

  • default for join should clearly be 'left'
  • df1.update(df2, join='right', overwrite=some_boolean) would be the same as df2.update(df1, join='left', overwrite=not some_boolean). IMO this is not a terrible redundancy, as it allows each user to choose the order that more easily fits their thought pattern.
  • df1.combine_first(df2) would be the same as df1.update(df2, join='outer', overwrite=False), only that combine_first has much fewer options and controls (i.e. filter_func and raise_conflict). Actually, I'd very much like to deprecate combine_first before pandas 1.0. Only difference is that update returns None, which should be changed as well IMO -- relevant xrefs: ENH: add inplace-kwarg to update #21858 DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859
  • this should IMO also support a way to control which axes are joined in what way (edit: the below was the original proposal; better variants are discussed in ENH: more joins for DataFrame.update #21855 (comment)).
    • The first way that came to mind would be with an axis=0|1|None-keyword, like in DataFrame.align. However, upon further investigation, I don't believe this to be a good choice, as anything other than axis=None would implicitly have to choose a join for the other axis to actually decide the index/columns of the result.
    • Since "explicit is better than implicit", I'd like to propose a version with just one kwarg, namely:
    join=['left', 'left']  # same as 'left' (and so on for 'inner'|'outer'|'right')
    join=['left', 'inner'] | ['left', 'outer'] etc. (for all other 12 combinations)
    
    • I'd say list and tuple would be reasonable to allow as containers, but not more.
@gfyoung
Copy link
Member

gfyoung commented Jul 12, 2018

I think this is a pretty solid proposal. That being said, given that this is a core part of DataFrame, let's ping @pandas-dev/pandas-core as well. Thoughts?

@jreback
Copy link
Contributor

jreback commented Jul 12, 2018

we only join on a single axes at a time
so your last point is a no go as you have written it
however look at how pd.concat handles the non concat axes which is what you are describing

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jul 12, 2018

@jreback I don't think I follow completely. Of course concat and align join on both axes, the former always defaulting to 'outer', and the latter explicitly with axis=None.

But more to the point, both methods satisfy different needs, and update has yet another set of requirements. Consider the following:

one = pd.DataFrame([[0, 1], [2, 3]], columns=list('ab'))
two = pd.DataFrame([[10, 11], [12, 13]], index=[1, 2], columns=list('bc'))

## this case is clear: join on both axes
one.align(two, join='outer', axis=None)
#      a    b   c
# 0  0.0  1.0 NaN
# 1  2.0  3.0 NaN
# 2  NaN  NaN NaN
#     a     b     c
# 0 NaN   NaN   NaN
# 1 NaN  10.0  11.0
# 2 NaN  12.0  13.0

## this aligns only columns
one.align(two, join='outer', axis=1)
#    a  b   c
# 0  0  1 NaN
# 1  2  3 NaN
#     a   b   c
# 1 NaN  10  11
# 2 NaN  12  13

But this does not help us, as this does not give an answer whether the index should be [0,1] | [0,1,2] | [1,2] | [1], and reasonable arguments can be made at least for 'left' and 'outer'.

concat also does not solve the problem either:

pd.concat([one, two], axis=0, sort=False)
#      a   b     c
# 0  0.0   1   NaN
# 1  2.0   3   NaN
# 1  NaN  10  11.0
# 2  NaN  12  13.0
pd.concat([one, two], axis=1)
#      a    b     b     c
# 0  0.0  1.0   NaN   NaN
# 1  2.0  3.0  10.0  11.0
# 2  NaN  NaN  12.0  13.0

Actually, one can very easily emulate joins in two axes with concat, (with the restriction that the concatenation axis can just be 'outer'|'inner'), although it took some figuring out, because the join_axes kwarg is not very well-documented...

## 'right'-join for the non-concatenation axis
pd.concat([one, two], join='outer', axis=1, join_axes=[two.index])
#      a    b   b   c
# 1  2.0  3.0  10  11
# 2  NaN  NaN  12  13

The only thing missing here would be the processing of the 'b' columns (and a decision on which columns to retain). Assuming we'd chose 'left' for the columns, we'd have

one.update(two, join=['right', 'left'])
#      a   b
# 1  2.0  10
# 2  NaN  12
one.update(two, join=['right', 'left'], overwrite=False)
#      a   b
# 1  2.0   3
# 2  NaN  12

Personally, I think this is much more readable than:

one.update(two, join='left', axis=1, join_axes=[two.index], overwrite=False)

So, summing up, align does not care what the non-concatenation axis looks like, but update must know to decide on the right non-concatenation index (and at least 'left' and 'outer' should be possible there), and concat is essentially built (with the join_axes-kwarg) to do just what I propose to do. Of course, concat has to deal with concatenating potentially many objects, whereas update just takes two -- this allows the flexibility to add a more user-friendly API for update.

@h-vetinari
Copy link
Contributor Author

@jreback Any response to the above?

@h-vetinari h-vetinari mentioned this issue Jul 17, 2018
34 tasks
@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 6, 2018

@pandas-dev @jreback gentle ping :)

@toobaz
Copy link
Member

toobaz commented Aug 9, 2018

I'm +1 on deprecating combine_first, allowing update to replace it. I'm +1 in general on the proposed new skills for update. I'm -0.5 however on the proposed values for join: I think I would rather have

index = 'left'|'right'|...
columns = 'left'|'right'|...

(which as a bonus - and only if trivial implementation-wise - could accept an actual Index instance) and deprecate join.
The reasons are:

  • I can't think of any other method in the API which has something similar to join=['left', 'right'], while several methods have index= and columns=
  • I want to be able to pass the concatenation index for columns while using the default for index (e.g. if the two dataframes have the same index and I'm lazy)
  • Series.update still does not support join, so we have no consistency constraints on that side

Am I missing anything @h-vetinari ?

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 9, 2018

@toobaz

[...] I would rather have

index = 'left'|'right'|...
columns = 'left'|'right'|...

Good idea! Certainly better than the ['left', 'right'] thing I came up with. I like it a lot, the only question is how unambiguous the kwarg-names are in the context of:

DataFrame.update(other, index='left', columns='left', overwrite=True,
                 filter_func=None, raise_conflict=False)

And even though join currently does nothing, I guess removing it counts as an API-break...?

The idea for the "bonus" is very similar to the join_axes-kwarg of concat, but with an intuitive meaning. This would essentially wrap reindex then? +0.5

Thinking a bit more about join, maybe it could be left in to broadcast its value to index/columns, with higher precedence for the latter. There should probably be a warning if all three are set? I could only see that with None-defaults for all, and some logic as follows:

def DataFrame.update(other, join=None, index=None, columns=None, overwrite=True,
                     filter_func=None, raise_conflict=False):
[...]
if all(x is not None for x in [join, index, columns]):
    raise UserWarning('The join-keyword will be ignored, since both index/columns are set.')
join = 'left' if join is None else join
index = join if index is None else index
columns = join if columns is None else columns
[...]

Alternatively, join could just be slapped with a DeprecationWarning, and phased out completely with 1.0... Ultimately, I think this would be easiest, and (combined with your index/columns-suggestion) might be my favourite solution so far.

@h-vetinari
Copy link
Contributor Author

@toobaz Any thoughts to the above, resp. preferences for the summary below (not counting deprecation cycles)? I'm torn between options 1. & 2.

  1. replace join (currently does nothing) with index and columns
  2. leave join and broadcast its value to index and columns (if the latter are not set)
  3. something else

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 15, 2018

Considerations for the API to allow different join-types for update:

So then, the question is which combination makes the most sense. Here the variants so far for DataFrame.update:

  1. DataFrame.update(other, index='left', columns='left', ...) with deprecation of join
  2. DataFrame.update(other, join='left', index=None, columns=None, ...); with index/columns superseding join:
    [...]
    index = join if index is None else index
    columns = join if index is None else columns
  3. DataFrame.update(other, join=None, index=None, columns=None, ...) like above but with no default for join to be able to issue a warning:
    [...]
    if all(x is not None for x in [join, index, columns]):
    raise UserWarning('The join-keyword will be ignored, since both index/columns are set.')
    join = 'left' if join is None else join
    index = join if index is None else index
    columns = join if columns is None else columns

And the variants so far for Series.update:

  1. Series.update(other, index='left', overwrite=True, filter_func=None, raise_conflict=False)
  2. Series.update(other, join='left', overwrite=True, filter_func=None, raise_conflict=False)
  3. Series.update(other, join='left', index=None, overwrite=True, ...) ; with index superseding join:
    [...]
    index = join if index is None else index
  4. Series.update(other, join=None, index=None, overwrite=True, ...) like above but with no default for join to be able to issue a warning:
    [...]
    if all(x is not None for x in [join, index]):
    raise UserWarning('The join-keyword will be ignored, since index has been passed.')
    join = 'left' if join is None else join
    index = join if index is None else index

I think that - from the POV of consistency (and the idea of @toobaz to be able to pass indexes directly) - the most reasonable choice would be DF_1./S_1. In particular, then join for df.update should be deprecated in v0.24. Alternatively, DF_2./S_3. also seems reasonable to me.

[edit 180823]: added some more options
[edit 180905]: rewrote comment for more clarity

@h-vetinari
Copy link
Contributor Author

@toobaz If I may ask for your input to the above (since your last response), then I could formulate a cleaned-up proposal which can then be reviewed by everybody.

@h-vetinari
Copy link
Contributor Author

@gfyoung @toobaz @jreback @TomAugspurger @jorisvandenbossche

With v0.24 scheduled at the end of the month (and the deprecation policy leading up to v1.0), I wanted to bring attention to this issue again. Depending on the choice of API (see #21855 (comment)), it might be necessary to deprecate the join-kwarg for df.update (which currently does nothing), so I think this should be discussed.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 11, 2018

@wesm
Very gentle ping to the BDFL - I've been wanting to work on this for 2 months but can't move past the API discussion that essentially nobody has time (or motivation) for. Goal is more joins for df.update, noted as TODO even in the source code:

https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/frame.py#L5054

Open questions:

@gfyoung
Copy link
Member

gfyoung commented Sep 11, 2018

@h-vetinari : Sorry that we've been pretty dark on this. I think there is definitely interest, but time is certainly a factor here. @jreback @toobaz : thoughts?

@toobaz
Copy link
Member

toobaz commented Sep 11, 2018

Still very busy, sorry. Will be back next week.

@wesm
Copy link
Member

wesm commented Sep 13, 2018

I'm completely underwater until mid-November. If there is truly an impasse where I can help / weigh in, please let me know and I'll make the time

@gfyoung
Copy link
Member

gfyoung commented Sep 13, 2018

@wesm : I think @toobaz will be freed up next week, so we should be okay (he's probably best to speak to this issue at this point). Time is a scarce commodity indeed 🙂

@h-vetinari
Copy link
Contributor Author

@jreback @toobaz @jorisvandenbossche

Friendly ping. :)

@jorisvandenbossche
Copy link
Member

Some random thoughts for now:

  • I am not fully sure that I like the idea of index and columns keyword. It is true that they are already used in different methods, but they are also used in a different way. Eg in drop, you pass the actual index or column labels to drop. Comparing to merge, where you have a left_on/right_on (next to on), we could also think about a index_join/column_join (just thinking aloud, not sure I like this myself :), and it is of course not about the left/right frame here)
    Maybe a more concrete worry is that in eg index='outer' it is not directly saying that this is about how to join.
  • One advantage of having (keeping) the single join keyword is that you can specify with a single keyword to eg outer join both axes, instead of having to pass both index='outer', columns='outer'.
  • If we add such functionality to have different join types for both axes, isn't this something we should start with adding to align ? (since that is the workhorse that will probably be used under the hood, and which allows you to already do outer updates manually)
    To be clear, not saying that this should necessarily be done at the same, but I think we should make sure that the API we choose also works there.

@h-vetinari
Copy link
Contributor Author

@toobaz
I'll wait for your answer before I respond to Joris (been spamming enough as is).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

6 participants