
pd.merge regression when doing a left-join with missing data on the right. Result has a Float64Index #28220

Open
dworvos opened this issue Aug 29, 2019 · 8 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@dworvos
Contributor

dworvos commented Aug 29, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

X = pd.DataFrame({ "count": [1, 2] }, index=["A", "B"])
Y = pd.DataFrame({"name": ["A", "C"], "value": [100, 200]})
Z = pd.merge(X, Y, left_index=True, right_on="name", how='left')
print(Z.to_string())
# in 0.23.4
#    count name  value
# 0      1    A  100.0
# 1      2    B    NaN

# in 0.24.2
#    count name  value
# 0      1    A  100.0
# 1      2    B    NaN

# in 0.25.1
#      count name  value
# 0.0      1    A  100.0
# NaN      2    B    NaN

assert isinstance(Z.index, pd.Int64Index)  # fails in 0.25.1 (index is a Float64Index)

Problem description

I looked on the GitHub tracker for similar issues, but the closest I found was #24897. In previous versions of pandas, pd.merge would return an Int64Index, but it now returns a Float64Index. I didn't see this behaviour documented in the release notes of 0.25, but please let me know if I've missed it. This bug is easy to reproduce in a virtualenv.
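For context, the promotion itself is the usual NA-in-integer-index behaviour; a minimal repro independent of the merge path (my own illustration, not from the issue):

```python
import numpy as np
import pandas as pd

# An int64 index has no slot for NaN, so appending a missing value
# promotes the whole index to float64 -- the same promotion the
# merge result's index undergoes here in 0.25.1.
idx = pd.Index([0, 1], dtype="int64")
promoted = idx.append(pd.Index([np.nan]))
print(promoted.dtype)  # float64
```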

Expected Output

   count name  value
0      1    A  100.0
1      2    B    NaN

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.utf-8
LANG : en_US.utf-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@WillAyd
Member

WillAyd commented Aug 30, 2019

Yea I think your expected output makes sense. If you'd like to debug and see where the regression was introduced, we would certainly take a PR to fix it

@WillAyd WillAyd added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 30, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Aug 30, 2019
@dworvos
Contributor Author

dworvos commented Sep 3, 2019

Looks like a bug was introduced with change hash 41166c6 (which first introduces a failure to do the join correctly). The current incorrect behaviour was introduced with change hash cc3b2f0. Both commits are from the same author - do you think it may be a better use of time if I reached out to the author to better understand the two commits listed above and get some colour on a fix before digging in myself?

@TomAugspurger
Contributor

cc @JustinZhengBC if you could take a look.

@dworvos
Contributor Author

dworvos commented Sep 11, 2019

Taking a deeper dive, I see that the original intent was to fix #24212, where missing values in the index got the last value because of fancy indexing with -1. It looks like, to solve this problem, the index was expanded to include the NA value of the dtype before doing the fancy indexing. This solution has the unfortunate side effect of promoting the dtype if it's an integer index.

I'm not familiar with this part of the code, but it looks like the indexers returned by _get_join_indexers and their respective indexes should be the same shape. If that's the case, we can use np.where to select the correct fill value from the respective result index (i.e. the left index if how='left', or the right index when how='right'). I think I can put together a PR (please let me know if my solution is not the most efficient) and some test cases. I suspect my PR won't keep the exact output I listed in my description, but I think keeping the left index is also an acceptable solution in this case. However, it may be an interface change between 0.24 and 0.25.

Just wanted to post an update before I go and run the rest of the test suite. I've gotten the test case in #24212 and this one to pass.
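A rough sketch of the np.where idea above (variable names and shapes are my assumptions, not the actual pandas internals):

```python
import numpy as np

# Hypothetical sketch: instead of appending an NA slot to the index and
# letting the -1 fancy index pick it up, select a fill value from the
# left index wherever the right indexer reports no match (how='left').
left_index = np.array([10, 20, 30])      # left join keys
right_indexer = np.array([0, -1, 1])     # -1 marks "no match on the right"
right_index = np.array([100, 200])

# Clip -1 to a valid position first, then overwrite unmatched rows
# with the corresponding left index value.
matched = right_index[np.clip(right_indexer, 0, None)]
result = np.where(right_indexer == -1, left_index, matched)
print(result)  # [100  20 200]
```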

@dworvos
Contributor Author

dworvos commented Sep 12, 2019

I'm not sure it's possible to make everyone happy with my PR; it feels like #17257, #24212, and I (although I'm pretty flexible either way) want mutually exclusive index behaviours for the same operation. However, in the spirit of keeping consistency with the right/left join from SQL, what are everyone's thoughts on returning the join keys as the index in all four cases (inner, outer, left, and right)? The data would remain in the same range as the columns being joined (i.e. no unexpected values).

This change causes the test cases expecting a RangeIndex to fail, so it's probably a pretty breaking change (although one could argue the current behaviour is also a quasi-breaking change). Let me know what you think.
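At the user level, the proposed "join keys as the index" behaviour can be approximated today by lifting the key into a column and setting it back afterwards (my own sketch of the assumed semantics, not the PR's implementation):

```python
import pandas as pd

X = pd.DataFrame({"count": [1, 2]}, index=["A", "B"])
Y = pd.DataFrame({"name": ["A", "C"], "value": [100, 200]})

# Merge on the key column, then make the join keys the result index,
# so unmatched rows from either side still carry a real key, not NaN.
out = pd.merge(
    X.rename_axis("name").reset_index(), Y, on="name", how="outer"
).set_index("name", drop=False)
print(out.index.tolist())  # ['A', 'B', 'C']
```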

@TomAugspurger
Contributor

This change causes the test cases expecting a RangeIndex to fail so it's probably a pretty breaking change (although one could argue the current behaviour is also a quasi-breaking change)

Can you give an example of what changes?

@dworvos
Contributor Author

dworvos commented Sep 12, 2019

This is an example of some of the breaking behaviour, based on the unit tests in test_merge.py:

df1 = pd.DataFrame({"entity_id": [101, 102]})
s = pd.Series([None, None], index=[101, 102], name="days")

result = df1.merge(s.to_frame("days"), left_on="entity_id", right_index=True)
# test case expects result to be a RangeIndex(0, 2)
# with the PR it would be:
#      entity_id  days
# 101        101  None
# 102        102  None

This example would be my PR's solution to #17257:

# right join
      year other  a  b
2000  2000     a  2  6
2000  2000     b  2  6
2000  2000     c  2  6
2001  2001     d  3  7
2002  2002     e  4  8
2002  2002     f  4  8
2002  2002     g  4  8
2003  2003   NaN  1  5

# outer join
      year other    a    b
2000  2000     a  2.0  6.0
2000  2000     b  2.0  6.0
2000  2000     c  2.0  6.0
2001  2001     d  3.0  7.0
2002  2002     e  4.0  8.0
2002  2002     f  4.0  8.0
2002  2002     g  4.0  8.0
2004  2004     h  NaN  NaN
2003  2003   NaN  1.0  5.0

This converts the resulting index to be in the set of keys expected by the domain of the join (i.e. years). If the user wishes to keep the index of the left frame in the resulting join, I admit it's a bit more tedious, as one would have to duplicate the index as a column. Thinking out loud, a possible mitigation may be to fold the index into the resulting frame as "index" + lsuffix in this case. Although suboptimal, it may be an acceptable compromise compared to an NA and a possible dtype promotion in the index.
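The "duplicate the index as a column" workaround mentioned above could look like this (a sketch under the assumption that the left frame's original index would otherwise be lost):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]}, index=[10, 20])
right = pd.DataFrame({"a": [1, 3], "b": ["x", "y"]})

# Preserve the left frame's original index by folding it into a column
# before merging; it survives the join as the "index" column.
res = left.reset_index().merge(right, on="a", how="left")
print(res["index"].tolist())  # [10, 20]
```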

This example would be my PR's solution to #24212:

# left join
    id left_data right_data
11  11   left_11        NaN
22  22   left_22   right_22
33  33   left_33   right_33

Solves the left join from taking the last key and keeping the index in the user's expected domain.

Finally, here's the sample output from different types of merges from some unit tests I'm putting together (for brevity, I've excluded parts of the cases, as they would flow into the same code block; swapping left and right should be commutative, except for row ordering).

Apologies for my edit, the test cases weren't quite correct (but the gist should be the same)

X = pd.DataFrame({ "count": [1, 2] }, index=["A", "B"])
Y = pd.DataFrame({ "name": ["A", "C"],      "value": [100, 200]})
Z = pd.DataFrame({ "name": ["A", "A", "C"], "value": [100, 200, 300]})
E = pd.DataFrame(columns=["name", "value"])

pd.merge(X, Y, left_index=True, right_on=["name"], how=how)
# left
#     count name  value
# A      1    A  100.0
# B      2    B    NaN
#
# right
#    count name  value
# 0    1.0    A    100
# 1    NaN    C    200
# note: same as Y.join(X, on="name") except column ordering
#
# inner
#     count name  value
# A      1    A    100
#
# outer
#     count name  value
# A    1.0    A  100.0
# B    2.0    B    NaN
# C    NaN    C  200.0

pd.merge(X, Z, left_index=True, right_on=["name"], how=how)
# left
#     count name  value
# A      1    A  100.0
# A      1    A  200.0
# B      2    B    NaN
#
# right
#     count name  value
# 0    1.0    A    100
# 1    1.0    A    200
# 2    NaN    C    300
# note: same as Z.join(X, on="name") except column ordering
#
# inner
#     count name  value
# A      1    A    100
# A      1    A    200
#
# outer
#     count name  value
# A    1.0    A  100.0
# A    1.0    A  200.0
# B    2.0    B    NaN
# C    NaN    C  300.0

pd.merge(X, E, left_index=True, right_on=["name"], how=how)
# left
#     count name value
# A      1    A   NaN
# B      2    B   NaN
#
# right
# Empty DataFrame
# Columns: [count, name, value]
# Index: []
# note: same as E.join(X, on="name") except column ordering
#
# inner
#  Empty DataFrame
# Columns: [count, name, value]
# Index: []
#
# outer
#     count name value
# A      1    A   NaN
# B      2    B   NaN

The PR would also work with a MultiIndex:

X = pd.DataFrame({ "count": [1, 2] }, index=pd.MultiIndex.from_arrays([ ["A", "B"], [0, 0] ], names=["name", "offset"]))
Y = pd.DataFrame({ "name": ["A", "A", "C"], "offset": [0, 1, 0], "value": [100, 200, 300]})
E = pd.DataFrame(columns=["value", "name", "offset"])

pd.merge(X, Y, left_index=True, right_on=["name", "offset"], how='left')
#       count name  offset  value
# A 0      1    A       0  100.0
# B 0      2    B       0    NaN

pd.merge(X, Y, left_index=True, right_on=["name", "offset"], how='right')
#    count name  offset  value
# 0    1.0    A       0    100
# 1    NaN    A       1    200
# 2    NaN    C       0    300
# note: same as Y.join(X, on=["name", "offset"]) except column ordering

pd.merge(X, Y, left_index=True, right_on=["name", "offset"], how='inner')
#      count name  offset  value
# A 0      1    A       0    100

pd.merge(X, Y.set_index(["name", "offset"]), left_index=True, right_index=True, how='outer')
#      count name  offset  value
# A 0    1.0    A       0  100.0
# B 0    2.0    B       0    NaN
# A 1    NaN    A       1  200.0
# C 0    NaN    C       0  300.0

pd.merge(X, E, left_index=True, right_on=["name", "offset"], how='inner')
# Empty DataFrame
# Columns: [count, value, name, offset]
# Index: []

Sorry this reply has been long-winded; please let me know what you think.

@dworvos
Contributor Author

dworvos commented Sep 28, 2019

I've tried a more conservative approach, which hopefully works for enough people. My thought is to add an additional parameter to pd.merge (I've called it index_na_value, but I'm not wedded to the name) which, if specified, is used instead of na_value_for_dtype(index.dtype, compat=False). The user can then accept the current default behaviour of a dtype promotion, or supply an NA value specific to their domain, without breaking too much older code. I've updated the pull request.
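Note that index_na_value is a proposed parameter, not an existing pandas API. A user-level sketch of the same effect today: substitute a domain-specific sentinel for the NA index entry and undo the float promotion by hand.

```python
import numpy as np
import pandas as pd

# The kind of promoted float index a left join with a missing right key
# yields in 0.25.1; replace the NA slot with a chosen sentinel (-1 here,
# an arbitrary choice for illustration) and cast back to int64.
idx = pd.Index([0.0, np.nan])
filled = pd.Index(np.where(idx.isna(), -1, idx)).astype("int64")
print(filled.tolist())  # [0, -1]
```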
