Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: false negatives for multiindex joins validation #42272

Open
2 tasks
cf-vrgl opened this issue Jun 28, 2021 · 2 comments
Open
2 tasks

BUG: false negatives for multiindex joins validation #42272

cf-vrgl opened this issue Jun 28, 2021 · 2 comments
Labels
Bug Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@cf-vrgl
Copy link

cf-vrgl commented Jun 28, 2021

  • [x ] I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

a = pd.DataFrame(range(10),
                 index=pd.MultiIndex.from_arrays(
                     [
                         pd.Index([1]*10, name="shared_and_not_unique"),
                         pd.Index(range(10), name="unique_and_not_in_b")]))

b = pd.DataFrame(range(2),
                 index=pd.Index([1, 2], name="shared_and_not_unique"))

c = pd.merge(a, b, left_index=True, right_index=True, validate="1:1")

d = pd.merge(a.droplevel("unique_and_not_in_b"), b, left_index=True, right_index=True, validate="1:1")

Problem description

pd.merge checks whether the index for the right and left are unique. It does not check whether the levels used for the join are unique. In the example above, a has a unique index, but the subset of a's index levels that overlap with b's index is not unique. As a result, a many to one merge is performed and 1:1 validation check is passed.

Expected Output

MergeError for c.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.2.4
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 1.4.3
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.18
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

@cf-vrgl cf-vrgl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2021
@attack68
Copy link
Contributor

Yes I agree the output on master, and agree a merge error should be raised since it is documented that:

left_indexbool, default False
    Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, 
    the number of keys in the other DataFrame (either the index or a number of columns) 
    must match the number of levels.

right_indexbool, default False
    Use the index from the right DataFrame as the join key. Same caveats as left_index.

This should fail when counting the number of keys/levels, irrespective of the validate kwarg.

@attack68 attack68 added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2021
@cf-vrgl
Copy link
Author

cf-vrgl commented Jun 28, 2021

It would be great to have a parameter when joining on index for both left and right that allows for joining on levels that exist in both indices. Continuing the example above, there are times when I would like to keep "unique_and_not_in_b" in the index after the merge. The below does not include it in the index of the result after the merge.

import pandas as pd
import numpy as np

a = pd.DataFrame(range(10),
                 index=pd.MultiIndex.from_arrays(
                     [
                         pd.Index([1]*10, name="shared_and_not_unique"),
                         pd.Index(range(10), name="unique_and_not_in_b")]))

b = pd.DataFrame(range(2),
                 index=pd.Index([1, 2], name="shared_and_not_unique"))


e = pd.merge(a, b, on=list(np.intersect1d(a.index.names, b.index.names)))

@mroeschke mroeschke added the Error Reporting Incorrect or improved errors from pandas label Aug 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants