BUG: false negatives for multiindex joins validation #42272

cf-vrgl · 2021-06-28T01:52:01Z

[x ] I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

a = pd.DataFrame(range(10),
                 index=pd.MultiIndex.from_arrays(
                     [
                         pd.Index([1]*10, name="shared_and_not_unique"),
                         pd.Index(range(10), name="unique_and_not_in_b")]))

b = pd.DataFrame(range(2),
                 index=pd.Index([1, 2], name="shared_and_not_unique"))

c = pd.merge(a, b, left_index=True, right_index=True, validate="1:1")

d = pd.merge(a.droplevel("unique_and_not_in_b"), b, left_index=True, right_index=True, validate="1:1")

Problem description

pd.merge checks whether the index for the right and left are unique. It does not check whether the levels used for the join are unique. In the example above, a has a unique index, but the subset of a's index levels that overlap with b's index is not unique. As a result, a many to one merge is performed and 1:1 validation check is passed.

Expected Output

MergeError for c.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.2.4
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 1.4.3
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.18
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

The text was updated successfully, but these errors were encountered:

attack68 · 2021-06-28T14:47:38Z

Yes I agree the output on master, and agree a merge error should be raised since it is documented that:

left_indexbool, default False
    Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, 
    the number of keys in the other DataFrame (either the index or a number of columns) 
    must match the number of levels.

right_indexbool, default False
    Use the index from the right DataFrame as the join key. Same caveats as left_index.

This should fail when counting the number of keys/levels, irrespective of the validate kwarg.

cf-vrgl · 2021-06-28T15:34:34Z

It would be great to have a parameter when joining on index for both left and right that allows for joining on levels that exist in both indices. Continuing the example above, there are times when I would like to keep "unique_and_not_in_b" in the index after the merge. The below does not include it in the index of the result after the merge.

import pandas as pd
import numpy as np

a = pd.DataFrame(range(10),
                 index=pd.MultiIndex.from_arrays(
                     [
                         pd.Index([1]*10, name="shared_and_not_unique"),
                         pd.Index(range(10), name="unique_and_not_in_b")]))

b = pd.DataFrame(range(2),
                 index=pd.Index([1, 2], name="shared_and_not_unique"))


e = pd.merge(a, b, on=list(np.intersect1d(a.index.names, b.index.names)))

cf-vrgl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2021

attack68 added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 28, 2021

mroeschke added the Error Reporting Incorrect or improved errors from pandas label Aug 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: false negatives for multiindex joins validation #42272

BUG: false negatives for multiindex joins validation #42272

cf-vrgl commented Jun 28, 2021

INSTALLED VERSIONS

attack68 commented Jun 28, 2021

cf-vrgl commented Jun 28, 2021

BUG: false negatives for multiindex joins validation #42272

BUG: false negatives for multiindex joins validation #42272

Comments

cf-vrgl commented Jun 28, 2021

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

attack68 commented Jun 28, 2021

cf-vrgl commented Jun 28, 2021

Output of `pd.show_versions()`