Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge on index vs merge on column - different NaN handling #13371

Open
ialong opened this issue Jun 5, 2016 · 10 comments
Open

merge on index vs merge on column - different NaN handling #13371

ialong opened this issue Jun 5, 2016 · 10 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@ialong
Copy link

ialong commented Jun 5, 2016

In the following examples there are three different merge behaviours when it comes to handling NaNs.
They are all based on pd.merge(..., how="left", ...)

The difference depends on:

1) Whether we are merging using an index or a column.
2) Whether the column keys we are merging on are the same value or not (i.e. if left_on = right_on).

Arguably, if we specify "left" as the merging criterion, the desired behaviour is to have NaNs in the columns coming from the right dataframe where there is no match between the left and right dataframes' key columns (see first merge in example below, 'd' and 'e' columns).
The problem is, if we are merging on left's index, the NaNs get filled with the index values from the left dataframe even if the names of the two columns don't match ('c' and 'd' in the example). We are thus led to believe there was a perfect match between the index of the left dataframe and the "key" column of the right dataframe ('d' here).

Gotchas:

-There is something puzzling going on with the new indices of the resulting dataframe (when merging on index).
-Type casting occurs when merging on index, perhaps suggesting NaNs are explicitly filled in a second step.

Proposed behaviour:

Maybe it is simply a matter of removing this NaN filling step.

Better yet, the "key" column in the merged dataframe should perhaps bear the name of left's index not of the "right_on" key (provided we used left's index to merge). I.e. in the second merge of the example, the 'd' column should be called 'c'.

This is really the source of the confusion when the two names are different. When they are the same the "no NaN" behaviour is arguably legitimate.

Also it might be worthwhile to cast the final column back to the original dtype if there are no NaNs.

Maybe this is not really an issue though, more something to be aware of. I would be interested in hearing any motivation behind this behaviour.

Looking forward to reading your thoughts!

Code Sample:

df1 = pd.DataFrame(columns=['a','b','c'], data=[[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d','e'], data=[[3,14],[6,15],[13,16]])

print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'

print "pd.merge(df1, df2, how='left', left_on='c', right_on='d')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='d'), '\n'

print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d'), '\n'

df2.rename(columns={'d':'c'}, inplace=True)

print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'

print "pd.merge(df1, df2, how='left', left_on='c', right_on='c')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='c'), '\n'

print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c'), '\n'

Output:

screen shot 2016-06-05 at 21 06 48

screen shot 2016-06-05 at 21 07 05

output of pd.show_versions():

INSTALLED VERSIONS


commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Jun 6, 2016

#13170 will cast join keys if possible.

Yeah I would agree that if the index has the same name as a join column and it is not joined on (e.g. left_index=True), then prob should just raise an error (or maybe a warning). I doubt this is tested.

So why don't you do a pull-requests to implement your suggestions; see how much is needed to change to make consistent.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Nov 30, 2016

I started looking at #14076 to see if I can do something intelligent, and then found the discussion here. @jreback has a comment from Jun 6 that is confusing to me. Let's assume you are doing a left-join, left_index=True and right_on='some_column_name'. There are 3 cases to consider:

  1. The left index has the same name 'some_column_name' as the column being merged on.
  2. The left index has a different name from 'some_column_name'.
  3. The left index has no name at all.

In the first case, this is like the 4th merge in the original example, and I think the behavior is correct.

In the second case, this is like the 2nd merge above, but I think the name of the resulting column should come from the left DataFrame index name, because the left's DataFrame index is what is preserved as a column in the result. In the 2nd merge example above, it comes from the right, which is confusing.

In the third case, since there is no name for the left index, we could raise an error, or maybe the resulting name of the column should be called "left_index" (or something like that), to make it clear that it was that particular index that is used to create the values for the column.

When the merges involve a MultiIndex, then I think these are the cases:

  1. The left index has the same names as the set of names from the right DataFrame. In this case, everything works as expected.

  2. The left index has a complete set of name. Only some of those names (but not all) match the names on the right DataFrame. In this case, we do as I suggested in (2) above, namely, use the names from the left index.

  3. The left index is missing some (or all) names. In this case, I think we should raise an error, as it's not clear what a good naming convention should be for missing names in a MultiIndex. If we raise an error here, we probably should raise an error in the third case above for a single index.

Looking forward to your opinion.

@ialong
Copy link
Author

ialong commented Dec 1, 2016

Hi @Dr-Irv!
Thank you for joining the discussion, this has been on my to-do list for a while, hopefully we can work out something together.
I think you have completely understood the issue and thank you for giving the clear summary above.

As far as potential solutions go, here are my thoughts for each of the three cases you identify:

1) I agree this works as expected, although I'd like to get rid of the suspicious type casting behaviour (hopefully #13170 fixes this already, but I think there might be something fishy going on in the background that tells me "left"-merging on the left index is handled incorrectly).

2) I think your solution is the right one. For the sake of argument, we could keep both sets of columns and name them e.g. colName_x, colName_y (as in the non-"left"-merge). However, I think this is potentially misleading as it hides the fact that these were columns used for merging (it also gets messy if, as you say only some names match and some do not). So I agree with you.

3) We could raise an error, or we could retain the left index as it is and drop the "right_on" columns. I prefer the latter since it works similarly to joining two dataframes on index and, since we are "left"-merging, we are saying that the joining index/column we care about is the left one. The right one might be useful as a way to get us other values/fields we care about (contained in the right dataframe), but we should not necessarily expect the "right_on" column(s) to be appropriately named and to have reasonable values beyond the ones already included in "left_index".

I'd like the solution to be minimal, avoiding many "case/switch"-type statements.
Perhaps the simplest solution, at a high level, is to perform a merge on the two dataframes after resetting the index to a column (in which case merge already has the right behaviour) and then setting the index back as appropriate (i.e. using the left_index column names).
I am not sure if this is computationally inefficient and whether in practice it will be easy. Worth testing though.

Let me know what you think.

PS:
One thing to consider is whether we like the behaviour in the first example of my original post, where both "left_on" and "right_on" columns are kept. This is potentially useful as a hack to know which values of "left_on" are (or are not) in "right_on" but is arguably confusing to have them both in the final dataframe (especially since we have "left"-merged).
I would say they should be dropped, but this might change existing behaviour too much.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 1, 2016

@ialong Regarding your idea of what to do when the left index has no name at all (or in the case of a MultiIndex, has some levels missing names), you suggested to "retain the left index as it is and drop the "right_on" columns.". As I thought about this some more, I realized that there is another ambiguity that might be an underlying cause of the issue.

Namely, suppose you are doing a left merge where you have left_index=True and right_on='some_column_name'. As a left merge on the index, I would expect that the index would be preserved. But instead, what pandas does now is create a new index, and the index/column used for the merge becomes a column in the resulting DataFrame.

Consider this behavior with a MultiIndex:

mi = pd.MultiIndex.from_tuples([(1,10),(2,20),(3,30),(4,40)], names=['a','b'])
df1 = pd.DataFrame({'c': [100,200,300,400]}, index=mi)
df2 = pd.DataFrame({'a' : [1,3], 'b' : [10, 30], 'e' : [1000, 3000]})
print("df1\n", df1)
print("df2\n", df2)
print("merge\n", pd.merge(df1, df2, how='left', left_index=True, right_on=['a','b']))

This yields:

df1
         c
a b      
1 10  100
2 20  200
3 30  300
4 40  400
df2
    a   b     e
0  1  10  1000
1  3  30  3000
merge
      c    a     b       e
0  100  1.0  10.0  1000.0
1  200  2.0  20.0     NaN
1  300  3.0  30.0  3000.0
1  400  4.0  40.0     NaN

Note that the index columns ['a','b'] from df1 are not used in the index. One could argue that the result should be (ignoring the dtype issue for a moment):

            c       e
a   b                
1.0 10.0  100  1000.0
2.0 20.0  200     NaN
3.0 30.0  300  3000.0
4.0 40.0  400     NaN

This latter DataFrame has ['a','b'] as the index, which, at least to me, would be a more intuitive result, because the merge request was for a left join, on the left index, so you'd expect the index to be preserved.

Now, I imagine that if we were to change this undocumented behavior, we'd have a compatibility issue, but maybe that should be handled via a new parameter to merge.

Curious to get feedback from @jreback .

@ialong
Copy link
Author

ialong commented Dec 1, 2016

@Dr-Irv I agree! Although maybe it's enough to have an argument like "reset_index=False". It does not seem like erroneous behaviour since in your example the column names actually agree.

The problem (and what we need to fix) is that, if the names differ, those that survive are the names of the right dataframe's columns, without the appropriate NaNs in place.


The more I think about this, the more it seems like merges should always be done on columns, rather than indices, and then we "set_index" as specified by either the inputs or the arguments to merge.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 1, 2016

@ialong You wrote: "if the names differ, those that survive are the names of the right dataframe's columns,". I'm not sure I agree. If you are doing a left join, then I think the names of the left dataframe index (or columns) should survive.

I've had situations where I want to merge using the index on one DataFrame and columns from another, so I don't think we should remove that functionality as you suggest in the last paragraph. I think we just need agreement on what the behavior should be in this case. And we have to work through the 4 cases of left, right, inner, and outer joins, when one frame is joining on index, and the other on columns, and determine what the outcome should be in each of those cases.

@ialong
Copy link
Author

ialong commented Dec 1, 2016

"if the names differ, those that survive are the names of the right dataframe's columns,". I'm not sure I agree. If you are doing a left join, then I think the names of the left dataframe index (or columns) should survive.

@Dr-Irv That's the behaviour of my second example actually (maybe things have changed since then?):

index in "left" is called "c" but right_on='d'

as you can see the result is a column "d" with no NaNs, a meaningless index and no "c" column anywhere.

Of course if we were merging on two columns (rather than index and column), both survive (first example).


The more I think about this, the more it seems like merges should always be done on columns, rather than indices, and then we "set_index" as specified by either the inputs or the arguments to merge.

Sorry I realise this was not very clear, what I meant was that the merge routine should (internally) be called on columns.
So if you pass left_index=True or right_index=True merge would first reset_index(), then merge on the resulting columns, then set_index('cols') again (unless an argument has been passed to tell it not to set the index again, like I was suggesting in the previous post).

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 1, 2016

@ialong wrote:

That's the behaviour of my second example actually (maybe things have changed since then?):
index in "left" is called "c" but right_on='d'
as you can see the result is a column "d" with no NaNs, a meaningless index and no "c" column anywhere.

I am suggesting that the current behavior should be changed, so that in your second example, the column name would be 'c' that survived.

Sorry I realise this was not very clear, what I meant was that the merge routine should (internally) be called on columns.

I am not at all familiar with the current implementation, so I have no opinion on that!

@holy-motors
Copy link

There no activity for more than two years. Let's close this now and re-open if required.

@Johndg1974
Copy link

can someone help me with python homework?

@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action and removed API Design labels May 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

7 participants