Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452

Closed
Dr-Irv opened this issue Mar 22, 2018 · 5 comments

Comments

Projects
None yet
3 participants
@Dr-Irv
Copy link
Contributor

commented Mar 22, 2018

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.0.dev0+657.g01882ba5b'

In [3]: df1 =  pd.DataFrame({'v1' : range(12)}, index=pd.MultiIndex.from_product([list('abc'),list('xy'),[1,2]], names=['abc','xy','num']))
   ...: df1
   ...:
Out[3]:
            v1
abc xy num
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

In [4]: df2 = pd.DataFrame({'v2': [100*i for i in range(1,7)]}, index=pd.MultiIndex.from_product([list('abc'), list('xy')],names=['abc','xy']))

In [5]: df2
Out[5]:
         v2
abc xy
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

In [6]: df1.merge(df2, on=['abc','xy'])  # 'num' disappears
Out[6]:
        v1   v2
abc xy
a   x    0  100
    x    1  100
    y    2  200
    y    3  200
b   x    4  300
    x    5  300
    y    6  400
    y    7  400
c   x    8  500
    x    9  500
    y   10  600
    y   11  600

In [7]: df1.reset_index().merge(df2, on=['abc','xy']) # This preserves 'num'
Out[7]:
   abc xy  num  v1   v2
0    a  x    1   0  100
1    a  x    2   1  100
2    a  y    1   2  200
3    a  y    2   3  200
4    b  x    1   4  300
5    b  x    2   5  300
6    b  y    1   6  400
7    b  y    2   7  400
8    c  x    1   8  500
9    c  x    2   9  500
10   c  y    1  10  600
11   c  y    2  11  600

In [8]: df1.merge(df2, on='xy')  # 'abc' and 'num' disappear
Out[8]:
    v1   v2
xy
x    0  100
x    0  300
x    0  500
x    1  100
x    1  300
x    1  500
x    4  100
x    4  300
x    4  500
x    5  100
x    5  300
x    5  500
x    8  100
x    8  300
x    8  500
x    9  100
x    9  300
x    9  500
y    2  200
y    2  400
y    2  600
y    3  200
y    3  400
y    3  600
y    6  200
y    6  400
y    6  600
y    7  200
y    7  400
y    7  600
y   10  200
y   10  400
y   10  600
y   11  200
y   11  400
y   11  600

Problem description

It seems that the new feature implemented in #17484 that allows merging on a combination of columns and index levels can drop index levels, which is really non-intuitive. In the first example, the index level named "num" gets dropped, while in the last example, both "abc" and "xy" are dropped.

If this is the desired behavior, then it needs to be carefully documented.

N.B. There is also an error in the docs of merging.rst that says this feature was introduced in v.0.22, but it will be introduced in v0.23

I'm guessing @jmmease will need to look at this.

Expected Output

In [6]: df1.merge(df2, on=['abc','xy'])
Out[6]:
            v1   v2
abc xy num
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

In [8]: df1.merge(df2, on='xy')
Out[8]:
   abc_x  num  v1 abc_y   v2
xy
x      a    1   0     a  100
x      a    1   0     b  300
x      a    1   0     c  500
x      a    2   1     a  100
x      a    2   1     b  300
x      a    2   1     c  500
x      b    1   4     a  100
x      b    1   4     b  300
x      b    1   4     c  500
x      b    2   5     a  100
x      b    2   5     b  300
x      b    2   5     c  500
x      c    1   8     a  100
x      c    1   8     b  300
x      c    1   8     c  500
x      c    2   9     a  100
x      c    2   9     b  300
x      c    2   9     c  500
y      a    1   2     a  200
y      a    1   2     b  400
y      a    1   2     c  600
y      a    2   3     a  200
y      a    2   3     b  400
y      a    2   3     c  600
y      b    1   6     a  200
y      b    1   6     b  400
y      b    1   6     c  600
y      b    2   7     a  200
y      b    2   7     b  400
y      b    2   7     c  600
y      c    1  10     a  200
y      c    1  10     b  400
y      c    1  10     c  600
y      c    2  11     a  200
y      c    2  11     b  400
y      c    2  11     c  600

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+657.g01882ba5b
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jonmmease

This comment has been minimized.

Copy link
Contributor

commented Mar 22, 2018

This is the expected behavior.

Rationale

When index levels are not included as on params to merge, the merge operation drops all index levels (this is true of 0.23 and earlier). For example, if we reset abc into a column in both df1 and df2, and then merge on the common abc column, the xy and num index levels are dropped in the result.

In [1]: df1.reset_index('abc').merge(df2.reset_index('abc'), on='abc')
Out[1]: 
   abc  v1   v2
0    a   0  100
1    a   0  200
2    a   1  100
3    a   1  200
4    a   2  100
5    a   2  200
6    a   3  100
7    a   3  200
8    b   4  300
9    b   4  400
10   b   5  300
11   b   5  400
12   b   6  300
13   b   6  400
14   b   7  300
15   b   7  400
16   c   8  500
17   c   8  600
18   c   9  500
19   c   9  600
20   c  10  500
21   c  10  600
22   c  11  500
23   c  11  600

Now (in 0.23), when index levels are included as on params to merge, the merge operation preserves these index levels (and only these levels) in the output. For example:

In[2]: df1.merge(df2, on='abc')
Out[2]: 
     v1   v2
abc         
a     0  100
a     0  200
a     1  100
a     1  200
a     2  100
a     2  200
a     3  100
a     3  200
b     4  300
b     4  400
b     5  300
b     5  400
b     6  300
b     6  400
b     7  300
b     7  400
c     8  500
c     8  600
c     9  500
c     9  600
c    10  500
c    10  600
c    11  500
c    11  600

Preserving all index levels (even those not referenced as on params) would be a reasonable thing to do, but doing so consistently would require a change to the legacy behavior of merge.

Documentation

In terms of documentation, there is a note in the sphinx docs for Merging on a combination of columns and index levels that says:

Note: When DataFrames are merged on a string that matches an index level in both frames, the index level is preserved as an index level in the resulting DataFrame.

I think this is accurate, but it could be more explicit regarding the fate of the remaining index levels.

Thanks for pointing out the errant New in version version in the documentation! I guess this one was missed when 0.22 turned into 0.23.

Care to contribute some changes to the documentation that would have helped clarify the expected behavior for you?

cc @TomAugspurger @jorisvandenbossche @jreback

@jreback

This comment has been minimized.

Copy link
Contributor

commented Mar 23, 2018

I agree here with @jmmease here. merge is an SQL style merge that happens to allow level merging. What to do with a non-merged level is to drop it. I suppose you could add it as an additional level in the result, but this would not be normally very useful / intuitive. We might consider warning if levels are dropped though.

@Dr-Irv

This comment has been minimized.

Copy link
Contributor Author

commented Mar 23, 2018

@jreback I think this is about whether levels of the row axis MultiIndex are thought of as columns or not. To me, I think of a MultiIndex as just another set of columns that is used as a primary key. Using reset_index or set_index, I can make levels into columns and vice-versa. And in SQL, if I have a set of columns, and do a merge, all of the columns are preserved. My view relates to the ideas presented in #8162.

So would we consider an option to merge that preserves the index columns not directly involved in the join?

What @jmmease is saying is that when you do a merge with a MultiIndex, levels not included in the merge keys are dropped. I'll attempt to make this clear in the docs.

@Dr-Irv

This comment has been minimized.

Copy link
Contributor Author

commented Mar 23, 2018

So with 0.23 development, here is an interesting behavior for join.

In [7]: df1.join(df2, on=['abc','xy'], how='inner')
Out[7]:
         v1   v2
    num
a x 1     0  100
    2     1  100
  y 1     2  200
    2     3  200
b x 1     4  300
    2     5  300
  y 1     6  400
    2     7  400
c x 1     8  500
    2     9  500
  y 1    10  600
    2    11  600

In [8]: df1.join(df2, on='xy', how='inner')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
#### Traceback omitted

ValueError: len(left_on) must equal the number of levels in the index of "right"

In [9]: df2.join(df1, on=['abc','xy'], how='inner')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
### Traceback omitted

ValueError: len(left_on) must equal the number of levels in the index of "right"

So the first join (which does not work in v0.22) gives the "correct" result, except that the names of the index have been lost. Isn't that a bug?

The second join and third join both fail. The third one is just a reversal of which DataFrame is on the left versus right.

So explaining all of this behavior in the docs is a bit of a challenge.

@Dr-Irv

This comment has been minimized.

Copy link
Contributor Author

commented Mar 23, 2018

I have a fix for the bug I reported in the previous comment about join dropping the names

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.