
pd.merge regression when doing a left-join with missing data on the right. Result has a Float64Index #28220

Open
dworvos opened this issue Aug 29, 2019 · 8 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@dworvos
Contributor

dworvos commented Aug 29, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

X = pd.DataFrame({ "count": [1, 2] }, index=["A", "B"])
Y = pd.DataFrame({"name": ["A", "C"], "value": [100, 200]})
Z = pd.merge(X, Y, left_index=True, right_on="name", how='left')
print(Z.to_string())
# in 0.23.4
#    count name  value
# 0      1    A  100.0
# 1      2    B    NaN

# in 0.24.2
#    count name  value
# 0      1    A  100.0
# 1      2    B    NaN

# in 0.25.1
#      count name  value
# 0.0      1    A  100.0
# NaN      2    B    NaN

assert isinstance(Z.index, pd.Int64Index)  # fails in 0.25.1 (index is a Float64Index)

Problem description

I looked on the GitHub tracker for similar issues, but the closest I found was #24897. In previous versions of pandas, pd.merge would return an Int64Index, but it now returns a Float64Index. I didn't see this behaviour documented in the release notes of 0.25, but please let me know if I've missed it. This bug is easy to reproduce in a virtualenv.
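For context, the promotion itself is the usual NA-in-integer-index behaviour; a minimal repro independent of the merge path (my own illustration, not from the issue):

```python
import numpy as np
import pandas as pd

# An int64 index has no slot for NaN, so appending a missing value
# promotes the whole index to float64 -- the same promotion the
# merge result's index undergoes here in 0.25.1.
idx = pd.Index([0, 1], dtype="int64")
promoted = idx.append(pd.Index([np.nan]))
print(promoted.dtype)  # float64
```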

Expected Output

   count name  value
0      1    A  100.0
1      2    B    NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.utf-8
LANG : en_US.utf-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@WillAyd
Member

WillAyd commented Aug 30, 2019

Yea I think your expected output makes sense. If you'd like to debug and see where the regression was introduced, we would certainly take a PR to fix it

@WillAyd WillAyd added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 30, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Aug 30, 2019
@dworvos
Contributor Author

dworvos commented Sep 3, 2019

Looks like a bug was introduced with change hash 41166c6 (which first introduces a failure to do the join correctly). The current incorrect behaviour was introduced with change hash cc3b2f0. Both commits are from the same author - do you think it may be a better use of time if I reached out to the author to better understand the two commits listed above and get some colour on a fix before digging in myself?

@TomAugspurger
Contributor

cc @JustinZhengBC if you could take a look.

@dworvos
Contributor Author

dworvos commented Sep 11, 2019

Taking a deeper dive, I see that the original intent was to fix #24212, where missing values in the index got the last value because of fancy indexing with -1. It looks like, to solve this problem, the index was expanded to include the NA value of the dtype before doing the fancy indexing. This solution has the unfortunate side effect of promoting the dtype if it's an integer index.

I'm not familiar with this part of the code, but it looks like the indexers returned by _get_join_indexers and their respective indexes should be the same shape. If that's the case, we can use np.where to select the correct fill value from the respective result index (i.e. the left index if how='left', or the right index when how='right'). I think I can put together a PR (please let me know if my solution is not the most efficient) and some test cases. I suspect my PR won't keep the exact output I listed in my description, but I think keeping the left index is also an acceptable solution in this case. However, it may be an interface change between 0.24 and 0.25.

Just wanted to post an update before I go and run the rest of the test suite. I've gotten the test case in #24212 and this one to pass.
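A rough sketch of the np.where idea above (variable names and shapes are my assumptions, not the actual pandas internals):

```python
import numpy as np

# Hypothetical sketch: instead of appending an NA slot to the index and
# letting the -1 fancy index pick it up, select a fill value from the
# left index wherever the right indexer reports no match (how='left').
left_index = np.array([10, 20, 30])      # left join keys
right_indexer = np.array([0, -1, 1])     # -1 marks "no match on the right"
right_index = np.array([100, 200])

# Clip -1 to a valid position first, then overwrite unmatched rows
# with the corresponding left index value.
matched = right_index[np.clip(right_indexer, 0, None)]
result = np.where(right_indexer == -1, left_index, matched)
print(result)  # [100  20 200]
```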

@dworvos
Contributor Author

dworvos commented Sep 12, 2019

I'm not sure it's possible to make everyone happy with my PR; it feels like #17257, #24212, and I (although I'm pretty flexible either way) want mutually exclusive index behaviours for the same operation. However, in the spirit of keeping consistency with the right/left join from SQL, what are everyone's thoughts on returning the join keys as the index in all four cases (inner, outer, left, and right)? The data would remain in the same range as the columns being joined (i.e. no unexpected values).

This change causes the test cases expecting a RangeIndex to fail, so it's probably a pretty breaking change (although one could argue the current behaviour is also a quasi-breaking change). Let me know what you think.
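At the user level, the proposed "join keys as the index" behaviour can be approximated today by lifting the key into a column and setting it back afterwards (my own sketch of the assumed semantics, not the PR's implementation):

```python
import pandas as pd

X = pd.DataFrame({"count": [1, 2]}, index=["A", "B"])
Y = pd.DataFrame({"name": ["A", "C"], "value": [100, 200]})

# Merge on the key column, then make the join keys the result index,
# so unmatched rows from either side still carry a real key, not NaN.
out = pd.merge(
    X.rename_axis("name").reset_index(), Y, on="name", how="outer"
).set_index("name", drop=False)
print(out.index.tolist())  # ['A', 'B', 'C']
```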

@TomAugspurger
Contributor

This change causes the test cases expecting a RangeIndex to fail so it's probably a pretty breaking change (although one could argue the current behaviour is also a quasi-breaking change)

Can you give an example of what changes?

@dworvos
Contributor Author

dworvos commented Sep 12, 2019

This is an example of some of the breaking behaviour, based on the unit tests in test_merge.py:

df1 = pd.DataFrame({"entity_id": [101, 102]})
s = pd.Series([None, None], index=[101, 102], name="days")

result = df1.merge(s.to_frame("days"), left_on="entity_id", right_index=True)
# test case expects result to be a RangeIndex(0, 2)
# with the PR it would be:
#      entity_id  days
# 101        101  None
# 102        102  None

This example would be my PR's solution to #17257:

# right join
      year other  a  b
2000  2000     a  2  6
2000  2000     b  2  6
2000  2000     c  2  6
2001  2001     d  3  7
2002  2002     e  4  8
2002  2002     f  4  8
2002  2002     g  4  8
2003  2003   NaN  1  5

# outer join
      year other    a    b
2000  2000     a  2.0  6.0
2000  2000     b  2.0  6.0
2000  2000     c  2.0  6.0
2001  2001     d  3.0  7.0
2002  2002     e  4.0  8.0
2002  2002     f  4.0  8.0
2002  2002     g  4.0  8.0
2004  2004     h  NaN  NaN
2003  2003   NaN  1.0  5.0

This converts the resulting index to be in the set of keys expected by the domain of the join (i.e. years). If the user wishes to keep the index of the left frame in the resulting join, I admit it's a bit more tedious, as one would have to duplicate the index as a column. Thinking out loud, a possible mitigation may be to fold the index into the resulting frame as "index" + lsuffix in this case. Although suboptimal, it may be an acceptable compromise compared to an NA and a possible dtype promotion in the index.
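The "duplicate the index as a column" workaround mentioned above could look like this (a sketch under the assumption that the left frame's original index would otherwise be lost):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]}, index=[10, 20])
right = pd.DataFrame({"a": [1, 3], "b": ["x", "y"]})

# Preserve the left frame's original index by folding it into a column
# before merging; it survives the join as the "index" column.
res = left.reset_index().merge(right, on="a", how="left")
print(res["index"].tolist())  # [10, 20]
```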

This example would be my PR's solution to #24212:

# left join
    id left_data right_data
11  11   left_11        NaN
22  22   left_22   right_22
33  33   left_33   right_33

Solves the left join from taking the last key and keeping the index in the user's expected domain.

Finally, here's the sample output from different types of merges from some unit tests I'm putting together (for brevity, I've excluded parts of the cases, as they would flow into the same code block; swapping left and right should be commutative, except for row ordering).

Apologies for my edit, the test cases weren't quite correct (but the gist should be the same)

X = pd.DataFrame({ "count": [1, 2] }, index=["A", "B"])
Y = pd.DataFrame({ "name": ["A", "C"],      "value": [100, 200]})
Z = pd.DataFrame({ "name": ["A", "A", "C"], "value": [100, 200, 300]})
E = pd.DataFrame(columns=["name", "value"])

pd.merge(X, Y, left_index=True, right_on=["name"], how=how)
# left
#     count name  value
# A      1    A  100.0
# B      2    B    NaN
#
# right
#    count name  value
# 0    1.0    A    100
# 1    NaN    C    200
# note: same as Y.join(X, on="name") except column ordering
#
# inner
#     count name  value
# A      1    A    100
#
# outer
#     count name  value
# A    1.0    A  100.0
# B    2.0    B    NaN
# C    NaN    C  200.0

pd.merge(X, Z, left_index=True, right_on=["name"], how=how)
# left
#     count name  value
# A      1    A  100.0
# A      1    A  200.0
# B      2    B    NaN
#
# right
#     count name  value
# 0    1.0    A    100
# 1    1.0    A    200
# 2    NaN    C    300
# note: same as Z.join(X, on="name") except column ordering
#
# inner
#     count name  value
# A      1    A    100
# A      1    A    200
#
# outer
#     count name  value
# A    1.0    A  100.0
# A    1.0    A  200.0
# B    2.0    B    NaN
# C    NaN    C  300.0

pd.merge(X, E, left_index=True, right_on=["name"], how=how)
# left
#     count name value
# A      1    A   NaN
# B      2    B   NaN
#
# right
# Empty DataFrame
# Columns: [count, name, value]
# Index: []
# note: same as E.join(X, on="name") except column ordering
#
# inner
#  Empty DataFrame
# Columns: [count, name, value]
# Index: []
#
# outer
#     count name value
# A      1    A   NaN
# B      2    B   NaN

The PR would also work with a MultiIndex:

X = pd.DataFrame({ "count": [1, 2] }, index=pd.MultiIndex.from_arrays([ ["A", "B"], [0, 0] ], names=["name", "offset"]))
Y = pd.DataFrame({ "name": ["A", "A", "C"], "offset": [0, 1, 0], "value": [100, 200, 300]})
E = pd.DataFrame(columns=["value", "name", "offset"])

pd.merge(X, Y, left_index=True, right_on=["name", "offset"], how='left')
#       count name  offset  value
# A 0      1    A       0  100.0
# B 0      2    B       0    NaN

pd.merge(X, Y, left_index=True, right_on=["name", "offset"], how='right')
#    count name  offset  value
# 0    1.0    A       0    100
# 1    NaN    A       1    200
# 2    NaN    C       0    300
# note: same as Y.join(X, on=["name", "offset"]) except column ordering

pd.merge(X, Y, left_index=True, right_on=["name", "offset"], how='inner')
#      count name  offset  value
# A 0      1    A       0    100

pd.merge(X, Y.set_index(["name", "offset"]), left_index=True, right_index=True, how='outer')
#      count name  offset  value
# A 0    1.0    A       0  100.0
# B 0    2.0    B       0    NaN
# A 1    NaN    A       1  200.0
# C 0    NaN    C       0  300.0

pd.merge(X, E, left_index=True, right_on=["name", "offset"], how='inner')
# Empty DataFrame
# Columns: [count, value, name, offset]
# Index: []

Sorry this reply has been long-winded; please let me know what you think.

@dworvos
Contributor Author

dworvos commented Sep 28, 2019

I've tried a more conservative approach, which hopefully works for enough people. My thought is to add an additional parameter to pd.merge (I've called it index_na_value, but I'm not wedded to the name) which, if specified, is used instead of na_value_for_dtype(index.dtype, compat=False). The user can then accept the current default behaviour of a dtype promotion, or supply an NA value specific to their domain, without breaking too much older code. I've updated the pull request.
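Note that index_na_value is a proposed parameter, not an existing pandas API. A user-level sketch of the same effect today: substitute a domain-specific sentinel for the NA index entry and undo the float promotion by hand.

```python
import numpy as np
import pandas as pd

# The kind of promoted float index a left join with a missing right key
# yields in 0.25.1; replace the NA slot with a chosen sentinel (-1 here,
# an arbitrary choice for illustration) and cast back to int64.
idx = pd.Index([0.0, np.nan])
filled = pd.Index(np.where(idx.isna(), -1, idx)).astype("int64")
print(filled.tolist())  # [0, -1]
```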
