New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR automatic broadcast for merging different levels, #9455 #12219

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
4 participants
@nbonnotte
Contributor

nbonnotte commented Feb 3, 2016

I'm adding tests to close #9455 after pull request #12158 solved the issue

But I confess I'm a bit surprised by the result:

n [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: df2a = DataFrame(columns=['c', 'd'], data=[[2, 3]])

In [4]: df2b = DataFrame(columns=['c', 'e'], data=[[4, 5]])

In [5]: df2 = concat([df2a, df2b], keys=['l', 'r'], axis=1)

In [6]: df2.index.name = 'a'

In [7]: df2 = df2.reset_index()

In [8]: df1
Out[8]: 
   a  b
0  0  1

In [9]: df2
Out[9]: 
   a  l     r   
      c  d  c  e
0  0  2  3  4  5

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

Is it all right ?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 3, 2016

Contributor

its a correct result, but not very useful. merging between different levels cannot automatically broadcast.

Contributor

jreback commented Feb 3, 2016

its a correct result, but not very useful. merging between different levels cannot automatically broadcast.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 3, 2016

Contributor

I would like to see this raise a ValueError instead of actually merging.

Contributor

jreback commented Feb 3, 2016

I would like to see this raise a ValueError instead of actually merging.

@jreback jreback added this to the Next Major Release milestone Feb 3, 2016

@nbonnotte nbonnotte changed the title from TST for merging non-lexsorted multi-indexed dataframe, #9455 to ERR automatic broadcast for merging different levels, #9455 Feb 18, 2016

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 18, 2016

Contributor

I thought it was related to the multi-index not being lexsorted, but it is not:

In [36]: merge(df1, df2.sortlevel(axis=1), on='a')
Out[36]:
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I'm updating the name of the pull request.

Contributor

nbonnotte commented Feb 18, 2016

I thought it was related to the multi-index not being lexsorted, but it is not:

In [36]: merge(df1, df2.sortlevel(axis=1), on='a')
Out[36]:
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I'm updating the name of the pull request.

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 18, 2016

Contributor

@jreback I'm a bit confused about what should or should not be allowed.

How is this different from #2024, which was closed by this commit?

Contributor

nbonnotte commented Feb 18, 2016

@jreback I'm a bit confused about what should or should not be allowed.

How is this different from #2024, which was closed by this commit?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 18, 2016

Contributor

oh, I think we out to change that behavior and raise. I think this should be a hard error, or at the very least a useful Warning (I agree its not PerformanceWarning which is prob just triggered).

Are there any situations where this is NOT an error?

Contributor

jreback commented Feb 18, 2016

oh, I think we out to change that behavior and raise. I think this should be a hard error, or at the very least a useful Warning (I agree its not PerformanceWarning which is prob just triggered).

Are there any situations where this is NOT an error?

@jreback jreback added the API Design label Feb 18, 2016

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 18, 2016

Contributor

@jreback What do you mean by "not an error"?

Contributor

nbonnotte commented Feb 18, 2016

@jreback What do you mean by "not an error"?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 18, 2016

Contributor

I mean, can you come up with a usecase where this is actually useful? e.g. merging something with multi-levels with a single level frame? (so you have mixed tuples and column labels).

Contributor

jreback commented Feb 18, 2016

I mean, can you come up with a usecase where this is actually useful? e.g. merging something with multi-levels with a single level frame? (so you have mixed tuples and column labels).

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 18, 2016

Contributor

Wanting to merge a single-level dataframe with a multi-level one seems natural to me (and, apparently, to other users, since this functionality was explicitly implemented).

If this is to be prevented with a ValueError, how to get the same result? One would have to do something like

multilevel_df.columns = multilevel_df.columns.values

which is not really difficult to do (or to guess?)

On the other hand, what are the drawbacks to keeping the feature? As far as I'm concerned, I was a bit surprised with the result, because I would naively have expected to get a multilevel dataframe, like

   a  b  l     r   
         c  d  c  e
0  0  1  2  3  4  5

But maybe that's because I'm still not familiar enough with the spirit of the API.

Anyway, raising an error here will break things. Following the principle "be liberal in what you accept, be conservative in what you do", I would suggest to keep the API as it is.

Contributor

nbonnotte commented Feb 18, 2016

Wanting to merge a single-level dataframe with a multi-level one seems natural to me (and, apparently, to other users, since this functionality was explicitly implemented).

If this is to be prevented with a ValueError, how to get the same result? One would have to do something like

multilevel_df.columns = multilevel_df.columns.values

which is not really difficult to do (or to guess?)

On the other hand, what are the drawbacks to keeping the feature? As far as I'm concerned, I was a bit surprised with the result, because I would naively have expected to get a multilevel dataframe, like

   a  b  l     r   
         c  d  c  e
0  0  1  2  3  4  5

But maybe that's because I'm still not familiar enough with the spirit of the API.

Anyway, raising an error here will break things. Following the principle "be liberal in what you accept, be conservative in what you do", I would suggest to keep the API as it is.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 18, 2016

Contributor

@nbonnotte you are totally missing my point. I am not asking about merging a single level to a SINGLE level of a multi-level at all.

from the example above

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I can't imagine this is useful in any way and would always be an error. However if someone speaks up and says, hey this might be useful? I want to see that case.

Contributor

jreback commented Feb 18, 2016

@nbonnotte you are totally missing my point. I am not asking about merging a single level to a SINGLE level of a multi-level at all.

from the example above

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I can't imagine this is useful in any way and would always be an error. However if someone speaks up and says, hey this might be useful? I want to see that case.

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 19, 2016

Contributor

@jreback I feel like none of us understand what the other is saying. I am not talking either about actually merging a single level to a single level of a multi-level.

What I am saying is: what are the drawbacks to letting the feature? From my (still naive) point of view, I would find it difficult to guess the result of the operation: here, the multi-level is "flattened" (not sure the term is correct), but it could have been possible to instead merge the single-level index to a single level of the multi-level, right?

So I'm not saying we should actually merge a single level to a multi-level, just that we could, and it would seem to me at least as natural as the current result... which therefore might appear as unpredictable. That's a drawback, for me.

Are there any other drawbacks? Why should it be an error?

As for a use case, what about #2024 ?

Otherwise, I have an example. I often use pandas to work on the features I'm going to feed to scikit-learn, and I have features from many different origins: for instance, for each day weather data (temperature, pressure, etc.), calendar data (is it a bank holiday? a school holiday?) and let's say the target value of the previous day. At some point, I thus have multiple multi-level dataframes on the one hand, e.g. with ('weather', 'temperature'), ('weather', 'pressure'), ... and ('calendar', 'is_bank_holiday'), ('calendar', 'is_school_holiday') ..., and single-level dataframes on the other hand, for instance containing 'yesterday_target'. And I want to merge all that, and give the result to scikit-learn. But I like multi-level dataframes because they make it so much easier to select a subset of the features, e.g. for plotting, or just to exclude some features, so I'd like to get one in the end (or a single-level with tuples, which can easily be converted).

Sure, I can think of some ways to arrive to the same result without merging a single-level dataframe to a multi-level one. But it is simpler if it's directly possible.

Otherwise, we can raise an error, and wait to see if anyone complains. You tell me.

Contributor

nbonnotte commented Feb 19, 2016

@jreback I feel like none of us understand what the other is saying. I am not talking either about actually merging a single level to a single level of a multi-level.

What I am saying is: what are the drawbacks to letting the feature? From my (still naive) point of view, I would find it difficult to guess the result of the operation: here, the multi-level is "flattened" (not sure the term is correct), but it could have been possible to instead merge the single-level index to a single level of the multi-level, right?

So I'm not saying we should actually merge a single level to a multi-level, just that we could, and it would seem to me at least as natural as the current result... which therefore might appear as unpredictable. That's a drawback, for me.

Are there any other drawbacks? Why should it be an error?

As for a use case, what about #2024 ?

Otherwise, I have an example. I often use pandas to work on the features I'm going to feed to scikit-learn, and I have features from many different origins: for instance, for each day weather data (temperature, pressure, etc.), calendar data (is it a bank holiday? a school holiday?) and let's say the target value of the previous day. At some point, I thus have multiple multi-level dataframes on the one hand, e.g. with ('weather', 'temperature'), ('weather', 'pressure'), ... and ('calendar', 'is_bank_holiday'), ('calendar', 'is_school_holiday') ..., and single-level dataframes on the other hand, for instance containing 'yesterday_target'. And I want to merge all that, and give the result to scikit-learn. But I like multi-level dataframes because they make it so much easier to select a subset of the features, e.g. for plotting, or just to exclude some features, so I'd like to get one in the end (or a single-level with tuples, which can easily be converted).

Sure, I can think of some ways to arrive to the same result without merging a single-level dataframe to a multi-level one. But it is simpler if it's directly possible.

Otherwise, we can raise an error, and wait to see if anyone complains. You tell me.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 19, 2016

Contributor

@nbonnotte

my point is that the default of this operation is not generally wanted (IOW a user almost certainly would be suprised that this DOES not merge on a particular level). So I would like to see this raise a helpful error message and force the user to be proactive (e.g. pass an option) to actually merge a combined multi-level with a single level.

The default is too flexible here. We could simply show a warning, but warnings are often ignored.

I don't have a concrete idea of how to do this ATM.

Contributor

jreback commented Feb 19, 2016

@nbonnotte

my point is that the default of this operation is not generally wanted (IOW a user almost certainly would be suprised that this DOES not merge on a particular level). So I would like to see this raise a helpful error message and force the user to be proactive (e.g. pass an option) to actually merge a combined multi-level with a single level.

The default is too flexible here. We could simply show a warning, but warnings are often ignored.

I don't have a concrete idea of how to do this ATM.

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 19, 2016

Contributor

@jreback Here we go. Is that ok?

Hum, I should add a note in a what's new with the API change

Contributor

nbonnotte commented Feb 19, 2016

@jreback Here we go. Is that ok?

Hum, I should add a note in a what's new with the API change

@jreback

View changes

Show outdated Hide outdated pandas/tools/merge.py Outdated
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback
Contributor

jreback commented Feb 19, 2016

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Feb 20, 2016

Member

I find this a difficult one (in the sense of "there is not clear 'best' solution" IMO).
I agree the behaviour is somewhat unexpected and probably in most cases not what the user would have wanted (favoring a clear error message), but I don't know if I find this worth possibly breaking some users' code (in the end, it was explicitly implemented, and the current result is not 'wrong'. It is possible that you want the result, and in such cases you will need to do more code to workaround the error)

Member

jorisvandenbossche commented Feb 20, 2016

I find this a difficult one (in the sense of "there is not clear 'best' solution" IMO).
I agree the behaviour is somewhat unexpected and probably in most cases not what the user would have wanted (favoring a clear error message), but I don't know if I find this worth possibly breaking some users' code (in the end, it was explicitly implemented, and the current result is not 'wrong'. It is possible that you want the result, and in such cases you will need to do more code to workaround the error)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 23, 2016

Contributor

I would be ok with a warning UserWarning? in this case (the PerformanceWarning is only triggered in some cases and is not really relevant).

This is really a bit unexpected result I think in almost all cases and not an intended result. (and if it is, then the user could explicity construct an Index of tuples)

Contributor

jreback commented Feb 23, 2016

I would be ok with a warning UserWarning? in this case (the PerformanceWarning is only triggered in some cases and is not really relevant).

This is really a bit unexpected result I think in almost all cases and not an intended result. (and if it is, then the user could explicity construct an Index of tuples)

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 24, 2016

Contributor

@jreback like that?

Contributor

nbonnotte commented Feb 24, 2016

@jreback like that?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback
Contributor

jreback commented Feb 24, 2016

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger Feb 24, 2016

Contributor

I'm still trying to wrap my head around the issue, and what the ideal outcome is, but yeah I think that looks right.

Contributor

TomAugspurger commented Feb 24, 2016

I'm still trying to wrap my head around the issue, and what the ideal outcome is, but yeah I think that looks right.

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 24, 2016

Contributor

I've just detected an inconsistency with .join, or .merge with a specific parameter configuration:

In [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: columns = MultiIndex.from_tuples([('a', ''), ('c', 'c1')])

In [4]: df2 = DataFrame(columns=columns, data=[[0, 2]])

In [6]: merge(df1, df2, on='a')
Out[6]:
   a  b  (c, c1)
0  0  1        2

In [7]: df1.join(df2, on='a')
Out[7]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

In [16]: merge(df1, df2, left_on='a', left_index=False, right_index=True)
Out[16]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

This is just wrong. But I guess it can be treated as a separate issue.

Contributor

nbonnotte commented Feb 24, 2016

I've just detected an inconsistency with .join, or .merge with a specific parameter configuration:

In [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: columns = MultiIndex.from_tuples([('a', ''), ('c', 'c1')])

In [4]: df2 = DataFrame(columns=columns, data=[[0, 2]])

In [6]: merge(df1, df2, on='a')
Out[6]:
   a  b  (c, c1)
0  0  1        2

In [7]: df1.join(df2, on='a')
Out[7]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

In [16]: merge(df1, df2, left_on='a', left_index=False, right_index=True)
Out[16]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

This is just wrong. But I guess it can be treated as a separate issue.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 24, 2016

Contributor

@nbonnotte that's the same exact issue. Note that I doubt this is tested very well. So might be good to add all of these examples as tests.

Contributor

jreback commented Feb 24, 2016

@nbonnotte that's the same exact issue. Note that I doubt this is tested very well. So might be good to add all of these examples as tests.

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 25, 2016

Contributor

Ah, but I was wrong, the result is totally correct! My bad, I'm adding some tests

Contributor

nbonnotte commented Feb 25, 2016

Ah, but I was wrong, the result is totally correct! My bad, I'm adding some tests

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Feb 26, 2016

Contributor

@jreback Tests added, all green.

Contributor

nbonnotte commented Feb 26, 2016

@jreback Tests added, all green.

df = DataFrame([(1, 2, 3), (4, 5, 6)], columns=['a', 'b', 'c'])
new_df = df.groupby(['a']).agg({'b': [np.mean, np.sum]})
other_df = DataFrame(
[(1, 2, 3), (7, 10, 6)], columns=['a', 'b', 'd'])
other_df.set_index('a', inplace=True)
result = merge(new_df, other_df, left_index=True, right_index=True)
# GH 9455, 12219

This comment has been minimized.

@jreback

jreback Feb 26, 2016

Contributor

is the only time this warning is shown in the entire codebase?

@jreback

jreback Feb 26, 2016

Contributor

is the only time this warning is shown in the entire codebase?

This comment has been minimized.

@nbonnotte

nbonnotte Feb 26, 2016

Contributor

What do you mean?

The message is shown when pandas.tools.merge:merge is used with different levels, that is in DataFrame.mergeand DataFrame.join, and I think that's pretty much it.

@nbonnotte

nbonnotte Feb 26, 2016

Contributor

What do you mean?

The message is shown when pandas.tools.merge:merge is used with different levels, that is in DataFrame.mergeand DataFrame.join, and I think that's pretty much it.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 26, 2016

Contributor

ok, let's add a whats note for this API changes.

Contributor

jreback commented Feb 26, 2016

ok, let's add a whats note for this API changes.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 26, 2016

Contributor

add #12219 (comment) as tests as well.

Contributor

jreback commented Feb 26, 2016

add #12219 (comment) as tests as well.

@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Mar 13, 2016

Contributor

I don't understand what the tests just failed with Python 3.4:

======================================================================
FAIL: test_format (pandas.tests.indexes.test_base.TestIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tests/indexes/test_base.py", line 730, in test_format
    self.assertEqual(formatted, expected)
AssertionError: Lists differ: ['2016-03-13 09:54:19.205'] != ['2016-03-13 09:54:19.205000']
Contributor

nbonnotte commented Mar 13, 2016

I don't understand what the tests just failed with Python 3.4:

======================================================================
FAIL: test_format (pandas.tests.indexes.test_base.TestIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tests/indexes/test_base.py", line 730, in test_format
    self.assertEqual(formatted, expected)
AssertionError: Lists differ: ['2016-03-13 09:54:19.205'] != ['2016-03-13 09:54:19.205000']
@nbonnotte

This comment has been minimized.

Show comment
Hide comment
@nbonnotte

nbonnotte Mar 13, 2016

Contributor

@jreback All green

Contributor

nbonnotte commented Mar 13, 2016

@jreback All green

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 13, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 18, 2016

Contributor

can you rebase/update

Contributor

jreback commented Apr 18, 2016

can you rebase/update

@jreback jreback closed this in bb9b9c5 Apr 25, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 25, 2016

Contributor

thanks!

Contributor

jreback commented Apr 25, 2016

thanks!

@@ -193,6 +195,13 @@ def __init__(self, left, right, how='inner', on=None,
'can not merge DataFrame with instance of '
'type {0}'.format(type(right)))
# warn user when merging between different levels
if left.columns.nlevels != right.columns.nlevels:

This comment has been minimized.

@jreback

jreback May 5, 2016

Contributor

ahh didn't even notice, this is ONLY checking the levels, it should be doing something like:

if left._get_axis(axis).nlevels != right._get_axis(axis).nlevels:
   ...
@jreback

jreback May 5, 2016

Contributor

ahh didn't even notice, this is ONLY checking the levels, it should be doing something like:

if left._get_axis(axis).nlevels != right._get_axis(axis).nlevels:
   ...

This comment has been minimized.

@nbonnotte

nbonnotte May 5, 2016

Contributor

Oh, I see. I've been away for some time, but I just went back. I can do a PR for that.

@nbonnotte

nbonnotte May 5, 2016

Contributor

Oh, I see. I've been away for some time, but I just went back. I can do a PR for that.

nps added a commit to nps/pandas that referenced this pull request May 17, 2016

ERR automatic broadcast for merging different levels
closes pandas-dev#9455

Author: Nicolas Bonnotte <nicolas.bonnotte@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Jeff Reback <jeff@reback.net>

Closes pandas-dev#12219 from nbonnotte/9455-test-for-merge and squashes the following commits:

6532ab2 [Nicolas Bonnotte] ERR automatic broadcast for merging different levels, pandas-dev#9455
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment