Obscure AttributeError when dropping on a multi-index dataframe #12078

Closed
nbonnotte opened this Issue Jan 18, 2016 · 9 comments

Contributor

nbonnotte commented Jan 18, 2016

In [2]:  df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [3]: df = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [4]: df
Out[4]: 
b  a b1 b2
c    c1 c2
0  1  3  4

In [5]: df.drop('a', axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-b59fbf92d28f> in <module>()
----> 1 df.drop('a', axis=1)

/home/nicolas/Git/pandas/pandas/core/generic.pyc in drop(self, labels, axis, level, inplace, errors)
   1617                 new_axis = axis.drop(labels, level=level, errors=errors)
   1618             else:
-> 1619                 new_axis = axis.drop(labels, errors=errors)
   1620             dropped = self.reindex(**{axis_name: new_axis})
   1621             try:

/home/nicolas/Git/pandas/pandas/core/index.pyc in drop(self, labels, level, errors)
   5729                     inds.append(loc)
   5730                 else:
-> 5731                     inds.extend(lrange(loc.start, loc.stop))
   5732             except KeyError:
   5733                 if errors != 'ignore':

AttributeError: 'numpy.ndarray' object has no attribute 'start'

This is related to issue #11640. I have been working on a solution that I submitted in the pull request #11717, but the said solution was controversial, so I'm creating this issue to separate the problems.

I'll make a PR soon enough.

Contributor

nbonnotte commented Jan 18, 2016

I'm a bit confused.

As I have understood the API, here .drop should not work, because 'a' is not a column, and we should just have a more meaningful error message. If I wanted to remove the columns whose first level is 'a', I should do df.drop('a', axis=1, level=0). Right?

On the other hand, if we consider

In [4]: dg = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

In [5]: dg
Out[5]: 
b  a b1 b2
c    c1 c2
0  1  3  4

then dg and df are equivalent:

In [7]: from pandas.util.testing import assert_frame_equal

In [8]: assert_frame_equal(df, dg) or "No error raised"
Out[8]: 'No error raised'

but

In [14]: dg.drop('a', axis=1)
Out[14]: 
b b1 b2
c c1 c2
0  3  4

Here is what happens:

  • In MultiIndex.drop (see here), inside the try ... except block, a ValueError is raised because the labels ['a'] are not contained in the axis, which is correct.
  • Then we go on, to loc = self.get_loc(label), with here label='a'
  • In MultiIndex.get_loc, since the key 'a' is not a tuple, the parameter level=0 is automagically added (see here)

Does that mean that, in the API as it should be, in .drop the parameter level=0 was intended to be superfluous? That is, df.drop('a', axis=1) should be equivalent to df.drop('a', axis=1, level=0) ?

What should I do in my pull request?

As a side note, the reason why .drop fails for the first example df and not for the second example dg comes later: for the former, .get_loc returns a boolean mask, and for the latter it returns a slice, but .drop forgets to handle a boolean mask (see those lines)
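The return shapes of .get_loc mentioned here could be normalized before .drop collects positions. The following is a minimal sketch and not the pandas implementation: loc_to_positions and its length parameter are hypothetical names, and it assumes the result is an integer, a slice, or a boolean mask with one entry per index element.

```python
import numbers


def loc_to_positions(loc, length):
    """Convert a get_loc-style result into a list of integer positions.

    Hypothetical helper for illustration only. `length` is the number
    of entries in the index that `loc` refers to.
    """
    if isinstance(loc, slice):
        # Contiguous match, the case .drop already handled.
        return list(range(*loc.indices(length)))
    if isinstance(loc, numbers.Integral) and not isinstance(loc, bool):
        # Single position.
        return [int(loc)]
    # Anything else is treated as a boolean mask (e.g. a numpy array).
    mask = list(loc)
    if len(mask) != length:
        raise ValueError("boolean mask length does not match index length")
    return [i for i, flag in enumerate(mask) if flag]


# The two results shown in this thread map to the same position:
print(loc_to_positions(slice(0, 1, None), 3))
print(loc_to_positions([True, False, False], 3))
```

With such a step in place, the slice case and the mask case would feed the same list of positions into the rest of .drop.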

Also, I feel the need to say that I'm sorry if it seems that I am insisting a bit on those issues about .drop. I just like to understand things, and I'm confused about what the code pretends to be doing, what it should in theory do, and what it actually does. I guess that's bound to happen on such a complex project, and I'd be glad to help in any direction I can.

Contributor

jreback commented Jan 19, 2016

In [1]: dg = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

In [6]: dg.columns.is_lexsorted()
Out[6]: True

In [7]: df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [8]: df = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()
In [9]: df.columns.is_lexsorted()
Out[9]: False

The difference is that when the columns are not lexsorted this doesn't work: the error is incorrectly propagated and an incorrect path is taken, showing an error message which doesn't make sense. So you need to see where the difference is and what is happening to the exceptions.

jreback added this to the Next Major Release milestone Jan 19, 2016

Contributor

nbonnotte commented Jan 19, 2016

Oki doki, I'll do that ^^

Contributor

nbonnotte commented Jan 27, 2016

I couldn't find any other exception that would be raised but incorrectly propagated. Except the one that shows up, of course.

And this exception is raised for the reason I gave:

  • when the multi-index is lexsorted, .get_loc() returns a slice
  • when it is not, it returns a boolean mask, but what comes next in MultiIndex.drop can't handle that (see those lines)

In [2]: ref = pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c'])

In [3]: pbm = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index().columns

In [6]: ref.is_lexsorted()
Out[6]: True

In [7]: pbm.is_lexsorted()
Out[7]: False

In [8]: ref.drop('a')
Out[8]: 
MultiIndex(levels=[[u'a', u'b1', u'b2'], [u'', u'c1', u'c2']],
           labels=[[1, 2], [1, 2]],
           names=[u'b', u'c'])

In [9]: pbm.drop('a')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-fcb8cd09713a> in <module>()
----> 1 pbm.drop('a')

/home/nicolas/Git/pandas/pandas/indexes/multi.py in drop(self, labels, level, errors)
   1091                     inds.append(loc)
   1092                 else:
-> 1093                     inds.extend(lrange(loc.start, loc.stop))
   1094             except KeyError as e:
   1095                 if errors != 'ignore':

AttributeError: 'numpy.ndarray' object has no attribute 'start'

In [10]: ref.get_loc('a')
Out[10]: slice(0, 1, None)

In [11]: pbm.get_loc('a')
Out[11]: array([ True, False, False], dtype=bool)

In [12]: ref.get_loc('a').start
Out[12]: 0

In [13]: pbm.get_loc('a').start
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-2a974e7413c7> in <module>()
----> 1 pbm.get_loc('a').start

AttributeError: 'numpy.ndarray' object has no attribute 'start'

But maybe I'm just not looking at the right place. Am I missing something?

Contributor

jreback commented Jan 27, 2016

yeh, prob just not correctly implemented.

Contributor

nbonnotte commented Jan 27, 2016

Can I correct the implementation, so that .drop works for a non lexsorted multi-index in the same way as for a lexsorted one? :D

In [2]: ref = pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c'])

In [3]: pbm = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index().columns

In [4]: ref.is_lexsorted()
Out[4]: True

In [5]: pbm.is_lexsorted()
Out[5]: False

In [6]: ref.values
Out[6]: array([('a', ''), ('b1', 'c1'), ('b2', 'c2')], dtype=object)

In [7]: pbm.values
Out[7]: array([('a', ''), ('b1', 'c1'), ('b2', 'c2')], dtype=object)

In [8]: ref.drop('a')
Out[8]: 
MultiIndex(levels=[[u'a', u'b1', u'b2'], [u'', u'c1', u'c2']],
           labels=[[1, 2], [1, 2]],
           names=[u'b', u'c'])

Beware that this simple correction might change the API of both .drop and .groupby, as we discussed in the pull request #11717 😇

So perhaps a safer option would be to first have ref.drop('a') raise a KeyError or ValueError because 'a' is not a correct value, the proper way being ref.drop('a', level=0)? And then correct the implementation.

Let me know what I can do.

Contributor

jreback commented Jan 27, 2016

I think .drop on a DataFrame is fine (your example is not that). you can simply lexsort the pivot table I think.
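The lexsort workaround could look like the sketch below. It assumes a recent pandas, where sort_index(axis=1) sorts MultiIndex columns and is_monotonic_increasing serves as the sortedness check. The frame is built directly from unsorted tuples rather than through pivot_table, so it illustrates the idea and is not an exact reproduction of the df above, whose values look sorted while its internal codes are not.

```python
import pandas as pd

# Columns deliberately given in a non-sorted order (illustrative data).
cols = pd.MultiIndex.from_tuples(
    [("b2", "c2"), ("a", ""), ("b1", "c1")], names=["b", "c"]
)
df = pd.DataFrame([[4, 1, 3]], columns=cols)

# Workaround: sort the columns first, then drop on the first level.
sorted_df = df.sort_index(axis=1)
dropped = sorted_df.drop("a", axis=1)

# Alternative: name the level explicitly, which does not rely on sorting.
by_level = df.drop("a", axis=1, level=0)

print(list(dropped.columns))
print(list(by_level.columns))
```

Sorting changes the column order, while the level=0 form keeps the original order of the remaining columns.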

Contributor

nbonnotte commented Jan 27, 2016

The problem with the DataFrame arises because of the problem with the MultiIndex, as shown in my examples.

What can I do to remove the obscure error message?

Contributor

jreback commented Jan 27, 2016

ahh, yes, see if you can

tm.assert_index_equal(pbm.drop('a'), ref.drop('a'))

though you may want to output a PerformanceWarning for pbm.drop('a')

you'll have to look and see how it's used elsewhere.
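A toy model of this suggestion, using plain tuples instead of a MultiIndex, might look like the following. Everything here is hypothetical: SlowPathWarning stands in for pandas' PerformanceWarning, drop_first_level is an invented name, and sortedness is checked on the visible values, whereas pandas tracks it on the internal codes.

```python
import warnings
from bisect import bisect_left, bisect_right


class SlowPathWarning(UserWarning):
    """Hypothetical stand-in for pandas' PerformanceWarning."""


def drop_first_level(tuples, label):
    """Drop every tuple whose first element equals `label`."""
    firsts = [t[0] for t in tuples]
    if label not in firsts:
        raise KeyError(label)
    if firsts == sorted(firsts):
        # Sorted: the matches form one contiguous run (the "slice" case).
        start = bisect_left(firsts, label)
        stop = bisect_right(firsts, label)
        return tuples[:start] + tuples[stop:]
    # Unsorted: fall back to a full scan (the "boolean mask" case).
    warnings.warn(
        "dropping on an unsorted index requires a full scan",
        SlowPathWarning,
        stacklevel=2,
    )
    return [t for t in tuples if t[0] != label]


ref = [("a", ""), ("b1", "c1"), ("b2", "c2")]
pbm = [("b2", "c2"), ("a", ""), ("b1", "c1")]

# Both paths should drop the same entries, up to ordering.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", SlowPathWarning)
    assert sorted(drop_first_level(pbm, "a")) == drop_first_level(ref, "a")
```

The point of the sketch is that the slow path returns the same result as the fast one and only signals its cost through a warning.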

jreback modified the milestone: 0.18.0, Next Major Release Jan 27, 2016

nbonnotte added a commit to nbonnotte/pandas that referenced this issue Jan 28, 2016

BUG in MultiIndex.drop for not-lexsorted multi-indexes, #12078
Closes #12078
5e765d0

jreback closed this in f673af1 Jan 28, 2016
