left join fails in case of non-unique indices #5391

Closed
behzadnouri opened this Issue Oct 30, 2013 · 4 comments

Comments

Projects
None yet
2 participants
Contributor

behzadnouri commented Oct 30, 2013

It seems to me that join operation fails if the index is not of unique values. The particular circumastance that I observed this was with multi-index:

df1.set_index( [ 'col1', 'col2', 'col3' ], inplace=True )
df2.join ( df1, on=['cola', 'colb', 'colc' ], how='left' )

I understand that the above join operation is not well-defined for non-unique index values, but pandas gives wrong values even for unique matches. ( no warnings, error messages whatsoever )

In case checking for index integrity has a heavy performance cost, it should be documented that this method fails if the index is not unique. ( or alternatively have the optional argument to enforce integrity check )

I could get correct join by doing below:

df1.drop_duplicates( cols=[ 'col1', 'col2', 'col3' ], inplace=True )
df1.set_index( [ 'col1', 'col2', 'col3' ], inplace=True )
df2.join ( df1, on=['cola', 'colb', 'colc' ], how='left' )
Contributor

jreback commented Oct 30, 2013

can u post the frames that reproduce this?
(a minimal sample)

Contributor

behzadnouri commented Nov 4, 2013

below is an example; As I mentioned above, I understand that the set_index here results in an index of non-unique values, but even for the X, Y, B which is unique we get wrong value;

>>> pd.__version__
'0.12.0-1040-gc435e72'
>>> df1
  col1 col2 col3  val1
0    X    Y    A     1
1    X    Y    A     4
2    X    Y    B     5
3    X    Y    C     6
4    X    Y    C     9
>>> df2
  cola colb colc val2
0    X    Y    A    a
1    X    Y    B    b
2    X    Y    C    c
>>> df1.set_index( [ 'col1', 'col2', 'col3' ], inplace=True )
>>> df2.join ( df1, on=['cola', 'colb', 'colc' ], how='left' )
  cola colb colc val2  val1
0    X    Y    A    a     1
1    X    Y    B    b     4
2    X    Y    C    c     5
Contributor

jreback commented Nov 4, 2013

works in the case of an outer join, so could be a bug (or their is some sort of uniqueness guarantee there)...mark as a bug

In [12]: df2.join ( df1, on=['cola', 'colb', 'colc' ], how='outer' )
Out[12]: 
  cola colb colc val2  val1
0    X    Y    A    a     1
0    X    Y    A    a     4
1    X    Y    B    b     5
2    X    Y    C    c     6
2    X    Y    C    c     9

@jreback jreback modified the milestone: 0.15.0, 0.14.0 Apr 4, 2014

@jreback jreback modified the milestone: 0.15.0, 0.15.1 Jul 28, 2014

Contributor

jreback commented Aug 21, 2014

closed by #7853

jreback closed this Aug 21, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment