Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR automatic broadcast for merging different levels, #9455 #12219

Closed
wants to merge 1 commit into from
Closed

ERR automatic broadcast for merging different levels, #9455 #12219

wants to merge 1 commit into from

Conversation

nbonnotte
Copy link
Contributor

I'm adding tests to close #9455 after pull request #12158 solved the issue

But I confess I'm a bit surprised by the result:

n [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: df2a = DataFrame(columns=['c', 'd'], data=[[2, 3]])

In [4]: df2b = DataFrame(columns=['c', 'e'], data=[[4, 5]])

In [5]: df2 = concat([df2a, df2b], keys=['l', 'r'], axis=1)

In [6]: df2.index.name = 'a'

In [7]: df2 = df2.reset_index()

In [8]: df1
Out[8]: 
   a  b
0  0  1

In [9]: df2
Out[9]: 
   a  l     r   
      c  d  c  e
0  0  2  3  4  5

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

Is it all right ?

@jreback
Copy link
Contributor

jreback commented Feb 3, 2016

its a correct result, but not very useful. merging between different levels cannot automatically broadcast.

@jreback
Copy link
Contributor

jreback commented Feb 3, 2016

I would like to see this raise a ValueError instead of actually merging.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas Difficulty Intermediate labels Feb 3, 2016
@jreback jreback added this to the Next Major Release milestone Feb 3, 2016
@nbonnotte nbonnotte changed the title TST for merging non-lexsorted multi-indexed dataframe, #9455 ERR automatic broadcast for merging different levels, #9455 Feb 18, 2016
@nbonnotte
Copy link
Contributor Author

I thought it was related to the multi-index not being lexsorted, but it is not:

In [36]: merge(df1, df2.sortlevel(axis=1), on='a')
Out[36]:
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I'm updating the name of the pull request.

@nbonnotte
Copy link
Contributor Author

@jreback I'm a bit confused about what should or should not be allowed.

How is this different from #2024, which was closed by this commit?

@jreback
Copy link
Contributor

jreback commented Feb 18, 2016

oh, I think we out to change that behavior and raise. I think this should be a hard error, or at the very least a useful Warning (I agree its not PerformanceWarning which is prob just triggered).

Are there any situations where this is NOT an error?

@nbonnotte
Copy link
Contributor Author

@jreback What do you mean by "not an error"?

@jreback
Copy link
Contributor

jreback commented Feb 18, 2016

I mean, can you come up with a usecase where this is actually useful? e.g. merging something with multi-levels with a single level frame? (so you have mixed tuples and column labels).

@nbonnotte
Copy link
Contributor Author

Wanting to merge a single-level dataframe with a multi-level one seems natural to me (and, apparently, to other users, since this functionality was explicitly implemented).

If this is to be prevented with a ValueError, how to get the same result? One would have to do something like

multilevel_df.columns = multilevel_df.columns.values

which is not really difficult to do (or to guess?)

On the other hand, what are the drawbacks to keeping the feature? As far as I'm concerned, I was a bit surprised with the result, because I would naively have expected to get a multilevel dataframe, like

   a  b  l     r   
         c  d  c  e
0  0  1  2  3  4  5

But maybe that's because I'm still not familiar enough with the spirit of the API.

Anyway, raising an error here will break things. Following the principle "be liberal in what you accept, be conservative in what you do", I would suggest to keep the API as it is.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2016

@nbonnotte you are totally missing my point. I am not asking about merging a single level to a SINGLE level of a multi-level at all.

from the example above

In [10]: merge(df1, df2, on='a')
pandas/tools/merge.py:467: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
  self.right = self.right.drop(right_drop, axis=1)
Out[10]: 
   a  b  (l, c)  (l, d)  (r, c)  (r, e)
0  0  1       2       3       4       5

I can't imagine this is useful in any way and would always be an error. However if someone speaks up and says, hey this might be useful? I want to see that case.

@nbonnotte
Copy link
Contributor Author

@jreback I feel like none of us understand what the other is saying. I am not talking either about actually merging a single level to a single level of a multi-level.

What I am saying is: what are the drawbacks to letting the feature? From my (still naive) point of view, I would find it difficult to guess the result of the operation: here, the multi-level is "flattened" (not sure the term is correct), but it could have been possible to instead merge the single-level index to a single level of the multi-level, right?

So I'm not saying we should actually merge a single level to a multi-level, just that we could, and it would seem to me at least as natural as the current result... which therefore might appear as unpredictable. That's a drawback, for me.

Are there any other drawbacks? Why should it be an error?

As for a use case, what about #2024 ?

Otherwise, I have an example. I often use pandas to work on the features I'm going to feed to scikit-learn, and I have features from many different origins: for instance, for each day weather data (temperature, pressure, etc.), calendar data (is it a bank holiday? a school holiday?) and let's say the target value of the previous day. At some point, I thus have multiple multi-level dataframes on the one hand, e.g. with ('weather', 'temperature'), ('weather', 'pressure'), ... and ('calendar', 'is_bank_holiday'), ('calendar', 'is_school_holiday') ..., and single-level dataframes on the other hand, for instance containing 'yesterday_target'. And I want to merge all that, and give the result to scikit-learn. But I like multi-level dataframes because they make it so much easier to select a subset of the features, e.g. for plotting, or just to exclude some features, so I'd like to get one in the end (or a single-level with tuples, which can easily be converted).

Sure, I can think of some ways to arrive to the same result without merging a single-level dataframe to a multi-level one. But it is simpler if it's directly possible.

Otherwise, we can raise an error, and wait to see if anyone complains. You tell me.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2016

@nbonnotte

my point is that the default of this operation is not generally wanted (IOW a user almost certainly would be suprised that this DOES not merge on a particular level). So I would like to see this raise a helpful error message and force the user to be proactive (e.g. pass an option) to actually merge a combined multi-level with a single level.

The default is too flexible here. We could simply show a warning, but warnings are often ignored.

I don't have a concrete idea of how to do this ATM.

@nbonnotte
Copy link
Contributor Author

@jreback Here we go. Is that ok?

Hum, I should add a note in a what's new with the API change

@@ -193,6 +193,10 @@ def __init__(self, left, right, how='inner', on=None,
'can not merge DataFrame with instance of '
'type {0}'.format(type(right)))

# prevent merging between different levels
if left.columns.nlevels != right.columns.nlevels:
raise ValueError('can not merge between different levels')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this even more expressive, e.g. indicate which/how many on left & right.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2016

comments on this

@jorisvandenbossche @TomAugspurger

@jorisvandenbossche
Copy link
Member

I find this a difficult one (in the sense of "there is not clear 'best' solution" IMO).
I agree the behaviour is somewhat unexpected and probably in most cases not what the user would have wanted (favoring a clear error message), but I don't know if I find this worth possibly breaking some users' code (in the end, it was explicitly implemented, and the current result is not 'wrong'. It is possible that you want the result, and in such cases you will need to do more code to workaround the error)

@jreback
Copy link
Contributor

jreback commented Feb 23, 2016

I would be ok with a warning UserWarning? in this case (the PerformanceWarning is only triggered in some cases and is not really relevant).

This is really a bit unexpected result I think in almost all cases and not an intended result. (and if it is, then the user could explicity construct an Index of tuples)

@nbonnotte
Copy link
Contributor Author

@jreback like that?

@jreback
Copy link
Contributor

jreback commented Feb 24, 2016

I like that.

@jorisvandenbossche @TomAugspurger ?

@TomAugspurger
Copy link
Contributor

I'm still trying to wrap my head around the issue, and what the ideal outcome is, but yeah I think that looks right.

@nbonnotte
Copy link
Contributor Author

I've just detected an inconsistency with .join, or .merge with a specific parameter configuration:

In [2]: df1 = DataFrame(columns=['a', 'b'], data=[[0, 1]])

In [3]: columns = MultiIndex.from_tuples([('a', ''), ('c', 'c1')])

In [4]: df2 = DataFrame(columns=columns, data=[[0, 2]])

In [6]: merge(df1, df2, on='a')
Out[6]:
   a  b  (c, c1)
0  0  1        2

In [7]: df1.join(df2, on='a')
Out[7]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

In [16]: merge(df1, df2, left_on='a', left_index=False, right_index=True)
Out[16]:
   a  b  (a, )  (c, c1)
0  0  1      0        2

This is just wrong. But I guess it can be treated as a separate issue.

@jreback
Copy link
Contributor

jreback commented Feb 24, 2016

@nbonnotte that's the same exact issue. Note that I doubt this is tested very well. So might be good to add all of these examples as tests.

@nbonnotte
Copy link
Contributor Author

Ah, but I was wrong, the result is totally correct! My bad, I'm adding some tests

@nbonnotte
Copy link
Contributor Author

@jreback Tests added, all green.

df = DataFrame([(1, 2, 3), (4, 5, 6)], columns=['a', 'b', 'c'])
new_df = df.groupby(['a']).agg({'b': [np.mean, np.sum]})
other_df = DataFrame(
[(1, 2, 3), (7, 10, 6)], columns=['a', 'b', 'd'])
other_df.set_index('a', inplace=True)

result = merge(new_df, other_df, left_index=True, right_index=True)
# GH 9455, 12219
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the only time this warning is shown in the entire codebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?

The message is shown when pandas.tools.merge:merge is used with different levels, that is in DataFrame.mergeand DataFrame.join, and I think that's pretty much it.

@jreback
Copy link
Contributor

jreback commented Feb 26, 2016

ok, let's add a whats note for this API changes.

@jreback
Copy link
Contributor

jreback commented Feb 26, 2016

add #12219 (comment) as tests as well.

@nbonnotte
Copy link
Contributor Author

I don't understand what the tests just failed with Python 3.4:

======================================================================
FAIL: test_format (pandas.tests.indexes.test_base.TestIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tests/indexes/test_base.py", line 730, in test_format
    self.assertEqual(formatted, expected)
AssertionError: Lists differ: ['2016-03-13 09:54:19.205'] != ['2016-03-13 09:54:19.205000']

@nbonnotte
Copy link
Contributor Author

@jreback All green

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 13, 2016
@jreback
Copy link
Contributor

jreback commented Apr 18, 2016

can you rebase/update

@jreback jreback closed this in bb9b9c5 Apr 25, 2016
@jreback
Copy link
Contributor

jreback commented Apr 25, 2016

thanks!

@@ -193,6 +195,13 @@ def __init__(self, left, right, how='inner', on=None,
'can not merge DataFrame with instance of '
'type {0}'.format(type(right)))

# warn user when merging between different levels
if left.columns.nlevels != right.columns.nlevels:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh didn't even notice, this is ONLY checking the levels, it should be doing something like:

if left._get_axis(axis).nlevels != right._get_axis(axis).nlevels:
   ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I've been away for some time, but I just went back. I can do a PR for that.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: better error message on invalid on with multi-index columns
4 participants