ERR automatic broadcast for merging different levels, #9455 #12219

nbonnotte · 2016-02-03T12:42:05Z

I'm adding tests to close #9455 after pull request #12158 solved the issue

But I confess I'm a bit surprised by the result:

In [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: df2a = DataFrame(columns=['c', 'd'], data=[[2, 3]])

In [4]: df2b = DataFrame(columns=['c', 'e'], data=[[4, 5]])

In [5]: df2 = concat([df2a, df2b], keys=['l', 'r'], axis=1)

In [6]: df2.index.name = 'a'

In [7]: df2 = df2.reset_index()

In [8]: df1
Out[8]: 
   a  b
0  0  1

In [9]: df2
Out[9]: 
   a  l     r   
      c  d  c  e
0  0  2  3  4  5

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

Is that all right?

jreback · 2016-02-03T13:49:57Z

It's a correct result, but not very useful. Merging between different levels cannot automatically broadcast.

jreback · 2016-02-03T13:50:45Z

I would like to see this raise a ValueError instead of actually merging.

nbonnotte · 2016-02-18T13:46:46Z

I thought it was related to the multi-index not being lexsorted, but it is not:

In [36]: merge(df1, df2.sortlevel(axis=1), on='a')
Out[36]:
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I'm updating the name of the pull request.

nbonnotte · 2016-02-18T15:41:15Z

@jreback I'm a bit confused about what should or should not be allowed.

How is this different from #2024, which was closed by this commit?

jreback · 2016-02-18T17:10:45Z

oh, I think we ought to change that behavior and raise. I think this should be a hard error, or at the very least a useful Warning (I agree it's not a PerformanceWarning, which is prob just triggered incidentally).

Are there any situations where this is NOT an error?

nbonnotte · 2016-02-18T21:47:06Z

@jreback What do you mean by "not an error"?

jreback · 2016-02-18T21:59:06Z

I mean, can you come up with a usecase where this is actually useful? e.g. merging something with multi-levels with a single level frame? (so you have mixed tuples and column labels).

nbonnotte · 2016-02-18T22:40:27Z

Wanting to merge a single-level dataframe with a multi-level one seems natural to me (and, apparently, to other users, since this functionality was explicitly implemented).

If this is to be prevented with a ValueError, how would one get the same result? One would have to do something like

multilevel_df.columns = multilevel_df.columns.values

which is not really difficult to do (or to guess?)
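A minimal sketch of that workaround, assuming a recent pandas where `MultiIndex.to_flat_index()` is available (it is not mentioned in this thread, which assigns `columns.values` instead). After `reset_index()` the join column is labelled `('a', '')`, so the sketch renames it to plain `'a'` before merging:

```python
import pandas as pd

# Rebuild the frames from the example at the top of the thread.
df1 = pd.DataFrame(columns=["a", "b"], data=[[0, 1]])
df2a = pd.DataFrame(columns=["c", "d"], data=[[2, 3]])
df2b = pd.DataFrame(columns=["c", "e"], data=[[4, 5]])
df2 = pd.concat([df2a, df2b], keys=["l", "r"], axis=1)
df2.index.name = "a"
df2 = df2.reset_index()

# Flatten the two-level columns into plain tuples, turning the
# ('a', '') label back into 'a' so it matches the key in df1.
flat = df2.copy()
flat.columns = [
    "a" if col == ("a", "") else col
    for col in df2.columns.to_flat_index()
]

# Both frames now have single-level columns, so the merge is unambiguous.
flattened_result = pd.merge(df1, flat, on="a")
```

The result has the same tuple-labelled columns as `Out[10]` above, but the flattening is now an explicit step taken by the caller.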

On the other hand, what are the drawbacks to keeping the feature? As far as I'm concerned, I was a bit surprised with the result, because I would naively have expected to get a multilevel dataframe, like

   a  b  l     r   
         c  d  c  e
0  0  1  2  3  4  5

But maybe that's because I'm still not familiar enough with the spirit of the API.
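For comparison, one way to obtain the multi-level result sketched above is to pad the single-level frame to two levels before merging. This is only an illustration under the assumption of a recent pandas; the name `df1_two_level` and the choice of an empty string as the padding label are mine, not something the thread settles on:

```python
import pandas as pd

# Rebuild the frames from the example at the top of the thread.
df1 = pd.DataFrame(columns=["a", "b"], data=[[0, 1]])
df2a = pd.DataFrame(columns=["c", "d"], data=[[2, 3]])
df2b = pd.DataFrame(columns=["c", "e"], data=[[4, 5]])
df2 = pd.concat([df2a, df2b], keys=["l", "r"], axis=1)
df2.index.name = "a"
df2 = df2.reset_index()  # the key column is now labelled ('a', '')

# Pad every single-level label with an empty second level.
df1_two_level = df1.copy()
df1_two_level.columns = pd.MultiIndex.from_tuples(
    [(col, "") for col in df1.columns]
)

# Both sides now have two column levels; the key is the full tuple,
# wrapped in a list so it is read as one label rather than two.
two_level_result = pd.merge(df1_two_level, df2, on=[("a", "")])
```

Here the columns of `two_level_result` stay a two-level `MultiIndex`, so selections such as `two_level_result["l"]` keep working.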

Anyway, raising an error here will break things. Following the principle "be liberal in what you accept, be conservative in what you do", I would suggest keeping the API as it is.

jreback · 2016-02-18T22:45:56Z

@nbonnotte you are totally missing my point. I am not asking about merging a single level to a SINGLE level of a multi-level at all.

from the example above

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I can't imagine this is useful in any way and would always be an error. However if someone speaks up and says, hey this might be useful? I want to see that case.

nbonnotte · 2016-02-19T08:41:33Z

@jreback I feel like none of us understand what the other is saying. I am not talking either about actually merging a single level to a single level of a multi-level.

What I am saying is: what are the drawbacks to keeping the feature? From my (still naive) point of view, I would find it difficult to guess the result of the operation: here, the multi-level is "flattened" (not sure the term is correct), but it could have been possible to instead merge the single-level index to a single level of the multi-level, right?

So I'm not saying we should actually merge a single level to a multi-level, just that we could, and it would seem to me at least as natural as the current result... which therefore might appear as unpredictable. That's a drawback, for me.

Are there any other drawbacks? Why should it be an error?

As for a use case, what about #2024 ?

Otherwise, I have an example. I often use pandas to work on the features I'm going to feed to scikit-learn, and I have features from many different origins: for instance, for each day weather data (temperature, pressure, etc.), calendar data (is it a bank holiday? a school holiday?) and let's say the target value of the previous day. At some point, I thus have multiple multi-level dataframes on the one hand, e.g. with ('weather', 'temperature'), ('weather', 'pressure'), ... and ('calendar', 'is_bank_holiday'), ('calendar', 'is_school_holiday') ..., and single-level dataframes on the other hand, for instance containing 'yesterday_target'. And I want to merge all that, and give the result to scikit-learn. But I like multi-level dataframes because they make it so much easier to select a subset of the features, e.g. for plotting, or just to exclude some features, so I'd like to get one in the end (or a single-level with tuples, which can easily be converted).

Sure, I can think of some ways to arrive at the same result without merging a single-level dataframe to a multi-level one. But it is simpler if it's directly possible.
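As an illustration of one such alternative, here is a minimal sketch that builds the grouped feature table with `concat` and group keys instead of `merge`. The dates, values, and the extra `"target"` group name are made up for the example; only the feature names come from the description above:

```python
import pandas as pd

days = pd.date_range("2016-02-01", periods=3, name="day")

# Hypothetical per-day feature tables sharing the same index.
weather = pd.DataFrame(
    {"temperature": [3.0, 5.5, 4.0], "pressure": [1012, 1008, 1015]},
    index=days,
)
calendar = pd.DataFrame(
    {"is_bank_holiday": [False, False, False],
     "is_school_holiday": [False, True, True]},
    index=days,
)
target = pd.DataFrame({"yesterday_target": [10.0, 12.0, 11.0]}, index=days)

# Each dict key becomes the top column level, so the single-level
# `target` frame gets a group name of its own.
features = pd.concat(
    {"weather": weather, "calendar": calendar, "target": target}, axis=1
)

# Selecting one group of features is then a plain lookup.
weather_only = features["weather"]
```

The cost is that the single-level frame needs a group label of its own, which is exactly the decision that an automatic merge would otherwise make silently.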

Otherwise, we can raise an error, and wait to see if anyone complains. You tell me.

jreback · 2016-02-19T13:34:40Z

@nbonnotte

my point is that the default of this operation is not generally wanted (IOW a user almost certainly would be surprised that this DOES not merge on a particular level). So I would like to see this raise a helpful error message and force the user to be proactive (e.g. pass an option) to actually merge a combined multi-level with a single level.

The default is too flexible here. We could simply show a warning, but warnings are often ignored.

I don't have a concrete idea of how to do this ATM.

nbonnotte · 2016-02-19T15:40:55Z

@jreback Here we go. Is that ok?

Hmm, I should add a whatsnew note for the API change.

jreback · 2016-02-19T15:51:03Z

pandas/tools/merge.py

@@ -193,6 +193,10 @@ def __init__(self, left, right, how='inner', on=None,
                'can not merge DataFrame with instance of '
                'type {0}'.format(type(right)))

+        # prevent merging between different levels
+        if left.columns.nlevels != right.columns.nlevels:
+            raise ValueError('can not merge between different levels')


can you make this even more expressive, e.g. indicate which/how many on left & right.
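A minimal sketch of what a more expressive message could look like, written as a standalone helper rather than as the actual change to `merge.py`. The function name `check_same_column_levels` and the exact wording of the message are assumptions made for illustration:

```python
import pandas as pd


def check_same_column_levels(left, right):
    """Hypothetical helper: refuse frames whose column depths differ."""
    n_left = left.columns.nlevels
    n_right = right.columns.nlevels
    if n_left != n_right:
        # Report both depths so the caller can see which side to fix.
        raise ValueError(
            "can not merge between different levels "
            "({0} levels on the left, {1} on the right)".format(n_left, n_right)
        )


single = pd.DataFrame(columns=["a", "b"], data=[[0, 1]])
multi = pd.DataFrame(
    columns=pd.MultiIndex.from_tuples([("a", ""), ("c", "c1")]),
    data=[[0, 2]],
)

try:
    check_same_column_levels(single, multi)
    level_error = None
except ValueError as exc:
    level_error = str(exc)
```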

jreback · 2016-02-19T15:51:20Z

comments on this

@jorisvandenbossche @TomAugspurger

jorisvandenbossche · 2016-02-20T00:43:43Z

I find this a difficult one (in the sense of "there is no clear 'best' solution" IMO).
I agree the behaviour is somewhat unexpected and probably in most cases not what the user would have wanted (favoring a clear error message), but I don't know if I find this worth possibly breaking some users' code (in the end, it was explicitly implemented, and the current result is not 'wrong'. It is possible that you want the result, and in such cases you will need to write more code to work around the error)

jreback · 2016-02-23T16:47:22Z

I would be ok with a warning (UserWarning?) in this case (the PerformanceWarning is only triggered in some cases and is not really relevant).

This is really a somewhat unexpected result in almost all cases, I think, and not an intended one. (And if it is intended, then the user could explicitly construct an Index of tuples.)
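The warning variant can be sketched the same way. This is a hypothetical standalone helper, not the code that was merged; the name `warn_on_level_mismatch` and the message text are assumptions:

```python
import warnings

import pandas as pd


def warn_on_level_mismatch(left, right):
    """Hypothetical helper: warn, rather than raise, when column depths differ."""
    n_left = left.columns.nlevels
    n_right = right.columns.nlevels
    if n_left != n_right:
        warnings.warn(
            "merging between different levels can give an unintended result "
            "({0} levels on the left, {1} on the right)".format(n_left, n_right),
            UserWarning,
            stacklevel=2,  # point the warning at the caller's line
        )


single = pd.DataFrame(columns=["a", "b"], data=[[0, 1]])
multi = pd.DataFrame(
    columns=pd.MultiIndex.from_tuples([("a", ""), ("c", "c1")]),
    data=[[0, 2]],
)

# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_level_mismatch(single, multi)
```

Unlike the `PerformanceWarning` seen earlier, a check like this fires whenever the depths differ, regardless of whether the columns are lexsorted.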

nbonnotte · 2016-02-24T12:25:34Z

@jreback like that?

jreback · 2016-02-24T13:38:50Z

I like that.

@jorisvandenbossche @TomAugspurger ?

TomAugspurger · 2016-02-24T13:46:08Z

I'm still trying to wrap my head around the issue, and what the ideal outcome is, but yeah I think that looks right.

nbonnotte · 2016-02-24T14:21:37Z

I've just detected an inconsistency with .join, or .merge with a specific parameter configuration:

In [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: columns = MultiIndex.from_tuples([('a', ''), ('c', 'c1')])

In [4]: df2 = DataFrame(columns=columns, data=[[0, 2]])

In [6]: merge(df1, df2, on='a')
Out[6]:
   a  b  (c, c1)
0  0  1        2

In [7]: df1.join(df2, on='a')
Out[7]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

In [16]: merge(df1, df2, left_on='a', left_index=False, right_index=True)
Out[16]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

This is just wrong. But I guess it can be treated as a separate issue.

jreback · 2016-02-24T15:42:28Z

@nbonnotte that's the same exact issue. Note that I doubt this is tested very well. So might be good to add all of these examples as tests.

nbonnotte · 2016-02-25T09:15:08Z

Ah, but I was wrong, the result is totally correct! My bad, I'm adding some tests

nbonnotte · 2016-02-26T11:47:42Z

@jreback Tests added, all green.

jreback · 2016-02-26T12:02:46Z

pandas/tools/tests/test_merge.py

        df = DataFrame([(1, 2, 3), (4, 5, 6)], columns=['a', 'b', 'c'])
        new_df = df.groupby(['a']).agg({'b': [np.mean, np.sum]})
        other_df = DataFrame(
            [(1, 2, 3), (7, 10, 6)], columns=['a', 'b', 'd'])
        other_df.set_index('a', inplace=True)
-
-        result = merge(new_df, other_df, left_index=True, right_index=True)
+        # GH 9455, 12219


is this the only time this warning is shown in the entire codebase?

What do you mean?

The message is shown when pandas.tools.merge:merge is used with different levels, that is in DataFrame.merge and DataFrame.join, and I think that's pretty much it.

jreback · 2016-02-26T12:03:05Z

ok, let's add a whatsnew note for this API change.

jreback · 2016-02-26T12:03:33Z

add #12219 (comment) as tests as well.

closes #9455 closes #12219

nbonnotte · 2016-03-13T10:19:58Z

I don't understand why the tests just failed with Python 3.4:

======================================================================
FAIL: test_format (pandas.tests.indexes.test_base.TestIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tests/indexes/test_base.py", line 730, in test_format
    self.assertEqual(formatted, expected)
AssertionError: Lists differ: ['2016-03-13 09:54:19.205'] != ['2016-03-13 09:54:19.205000']

nbonnotte · 2016-03-13T16:11:54Z

@jreback All green

jreback · 2016-04-18T17:30:22Z

can you rebase/update

jreback · 2016-04-25T14:38:24Z

thanks!

jreback · 2016-05-05T16:34:24Z

pandas/tools/merge.py

@@ -193,6 +195,13 @@ def __init__(self, left, right, how='inner', on=None,
                'can not merge DataFrame with instance of '
                'type {0}'.format(type(right)))

+        # warn user when merging between different levels
+        if left.columns.nlevels != right.columns.nlevels:


ahh didn't even notice, this is ONLY checking the levels, it should be doing something like:

if left._get_axis(axis).nlevels != right._get_axis(axis).nlevels: ...

Oh, I see. I've been away for some time, but I just went back. I can do a PR for that.
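A minimal sketch of an axis-aware comparison along those lines. The helper name `nlevels_differ` is hypothetical, and it uses the public `DataFrame.axes` list instead of the private `_get_axis` accessor quoted above:

```python
import pandas as pd


def nlevels_differ(left, right, axis):
    """Hypothetical helper: compare label depth on one axis.

    axis=0 looks at the row index, axis=1 at the columns.
    """
    # DataFrame.axes is [index, columns], so the axis number indexes it directly.
    return left.axes[axis].nlevels != right.axes[axis].nlevels


flat_index = pd.DataFrame({"x": [1, 2]}, index=["r0", "r1"])
deep_index = pd.DataFrame(
    {"x": [1, 2]},
    index=pd.MultiIndex.from_tuples([("g", "r0"), ("g", "r1")]),
)

# The columns have the same depth, so a columns-only check sees no mismatch...
differs_on_columns = nlevels_differ(flat_index, deep_index, axis=1)
# ...while the row indexes differ in depth.
differs_on_index = nlevels_differ(flat_index, deep_index, axis=0)
```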

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas Difficulty Intermediate labels Feb 3, 2016

jreback added this to the Next Major Release milestone Feb 3, 2016

jreback removed Difficulty Intermediate labels Feb 3, 2016

nbonnotte changed the title ~~TST for merging non-lexsorted multi-indexed dataframe, #9455~~ ERR automatic broadcast for merging different levels, #9455 Feb 18, 2016

jreback added the API Design label Feb 18, 2016

jreback reviewed Feb 19, 2016

jreback reviewed Feb 26, 2016

ERR automatic broadcast for merging different levels, #9455

6532ab2

closes #9455 closes #12219

jreback modified the milestones: 0.18.1, Next Major Release Mar 13, 2016

jreback closed this in bb9b9c5 Apr 25, 2016

jreback mentioned this pull request May 5, 2016

ERR: warning on merging on unequal levels for an Index #13094

Open

jreback reviewed May 5, 2016

dinya mentioned this pull request Sep 9, 2020

DEPR: Merging on different number of levels #34862

Closed

2 tasks

This pull request was closed.
