BUG: Indexes still include values that have been deleted #2770

Closed
darindillon opened this issue Jan 29, 2013 · 35 comments
Labels
API Design Bug Enhancement Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@darindillon

Using pandas 0.10. If we create a DataFrame with a MultiIndex and then delete all the rows with value X, we'd expect the index to no longer show value X. But it does.
Note the apparent inconsistency between "index" and "index.levels": one shows that the values have been deleted, but the other doesn't.

import pandas

x = pandas.DataFrame([['deleteMe', 1, 9], ['keepMe', 2, 9], ['keepMeToo', 3, 9]], columns=['first', 'second', 'third'])
x = x.set_index(['first', 'second'], drop=False)

x = x[x['first'] != 'deleteMe']  # Chop off all the 'deleteMe' rows

print(x.index)  # Good: the index no longer has any rows with 'deleteMe'. But...

print(x.index.levels)  # Bad: the levels still list 'deleteMe'. But why? We deleted those rows.

x.groupby(level='first').sum()  # Bad: it creates a dummy row for the rows we deleted!

We don't want the deleted values to show up in that groupby. Can we eliminate them?

@ghost

ghost commented Jan 29, 2013

related #2655 (maybe)

@wesm
Member

wesm commented Feb 9, 2013

This is kind of a tricky problem: for example, when should you "recompute" the levels? I have to table this until I have a chance to look a bit more deeply. Another solution is to exclude levels with no observations in such a groupby.

@darindillon
Author

Well, is there any easy workaround I can use? Like if I know I have this problem, can I manually call a .rebuild_index() or something? I've played around with all the obvious possibilities (short of creating a brand new dataframe) and can't find any workaround.
It's the last line (the .groupby(...).sum() ) that I care about -- that's the one I need to make the bad data go away.

EDIT: better clarification. I have one function that builds the dataset and drops the rows. At that point, I know I'm in the situation described in this issue, and I'd like to do my workaround there. But then the .groupby().sum() happens much much later in a different function. I could easily hack that second function as you say (exclude levels with no observations) but it makes more sense to keep my workaround code in the first function. Any ideas?
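
One way to keep the cleanup in the function that builds the frame is to rebuild the MultiIndex from the tuples it still contains, so that the levels are recomputed from the remaining rows. This is only a minimal sketch using the example frame from the original post; it is an illustration added for readers, not something a participant proposed at this point in the thread, and it assumes a pandas version in which `MultiIndex.from_tuples` accepts a `names` argument.

```python
import pandas as pd

x = pd.DataFrame(
    [["deleteMe", 1, 9], ["keepMe", 2, 9], ["keepMeToo", 3, 9]],
    columns=["first", "second", "third"],
)
x = x.set_index(["first", "second"], drop=False)
x = x[x["first"] != "deleteMe"]  # drop the unwanted rows

# Rebuild the index from the tuples that are actually present, so the
# levels are recomputed from the observed values only.
x.index = pd.MultiIndex.from_tuples(x.index.tolist(), names=x.index.names)

first_level = list(x.index.levels[0])
print(first_level)
```

Because the rebuild happens right after the rows are dropped, any later function that receives the frame sees levels that match its contents.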

@michaelaye
Contributor

How about the workaround that I proposed for #2655 ? In your case maybe

x.groupby(x.index.get_level_values(1)).sum() 

should do the correct thing, if I'm not wrong? I don't know why, but the result of this function delivers updated values.

@darindillon
Author

Yes, that works, but the code that does .groupby().sum() is in one function and the code that removes the value from the table is in another function. It would be much clearer to use a workaround that cleans up the problem with the dataframe in the function that creates it. That way any other function could use the dataframe without having to do your trick.

@michaelaye
Contributor

Ehm, can you confirm that this problem still exists with 0.10.1?
I just tried your example, and I don't see a dummy row with index "deleteMe" ?

In [9]: print x.index.levels
[Index([deleteMe, keepMe, keepMeToo], dtype=object), Int64Index([1, 2, 3], dtype=int64)]

In [10]: x.groupby(level='first').sum()
Out[10]: 
           second  third
first                   
keepMe          2      9
keepMeToo       3      9

@jreback
Contributor

jreback commented Mar 14, 2013

is this closable? @tavistmorph does this exist in 0.11-dev?

@darindillon
Author

It's still an issue. It still happens for me in 0.10.1 and 0.11 (as of the last time I pulled, at least). Just run the code snippet in my original post and you can see it.

Michael -- the deleted row appears in your output above on step #9 ("deleteMe" should not be there since we deleted it) and then it appears in the output for step 10 ("first" should not appear since all the rows with the "first" value were deleted).

@wesm
Member

wesm commented Mar 15, 2013

This isn't really a bug. Perhaps an option should be added to return an array of observed values in a particular level in the index (which is what you're after)?

@michaelaye
Contributor

Can you clarify what you mean by observed? Do you mean that the object is a view into the original object (I don't know if it is), and that's why it still contains the 'deleteMe' index?

@wesm
Member

wesm commented Mar 15, 2013

The levels are not computed from the actual observed values. For example, in R you can have a factor (categorical variable) in which some distinct values are not observed:

> d
[1] b c b c
Levels: a b c
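
pandas has a close analogue of the R factor above in `pd.Categorical`, which can also declare a category that never occurs in the data. The sketch below is an illustration of the "declared versus observed" distinction only; it is not the code that backs MultiIndex, and the variable names are chosen for this example.

```python
import pandas as pd

# Same data as the R factor: values b, c, b, c with declared levels a, b, c.
d = pd.Categorical(["b", "c", "b", "c"], categories=["a", "b", "c"])

declared = list(d.categories)  # every category the object knows about
observed = list(d.unique())    # only the values that actually occur
codes = d.codes.tolist()       # integer positions into `categories`

print(declared, observed, codes)
```

The category "a" stays in `declared` even though no element refers to it, which is the same situation as 'deleteMe' remaining in `x.index.levels`.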

@michaelaye
Contributor

Version: '0.12.0-1184-gc73b957'
The MultiIndex still shows all previously existing index values and therefore still is confusing to the user who looks at it, after chopping off the 'deleteMe' rows:

In [10]: x.index
Out[10]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
           labels=[[1, 2], [1, 2]],
           names=[u'first', u'second'])

but at least the groupby does not create an empty row anymore for previously existing indices:

In [12]: x.groupby(level='first').sum()
Out[12]:
           second  third
first
keepMe          2      9
keepMeToo       3      9

[2 rows x 2 columns]

so the discussion now boils down to the confusion of looking at df.index. I would argue, as I am looking often at the index to see what I am working with, that I would still be very puzzled by the index showing old values and from that point on I would not trust the results anymore.

@jtratner
Contributor

If you print that MultiIndex, it looks like what you want:

In [7]: mi
Out[7]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
           labels=[[1, 2], [1, 2]],
           names=[u'first', u'second'])

In [8]: print mi
first      second
keepMe     2
keepMeToo  3

Thus, a simple way to handle this is to examine your indices with print rather than the repr that IPython shows you.

The MultiIndex repr isn't really intuitive in any case, unless you understand that it's a categorical and that labels represent the integer positions of the levels at each location. You shouldn't need to care about that as a consumer of a MultiIndex. And if you understand the internal representation, you can then also understand why it doesn't matter whether there are extra levels.

The issue becomes clearer with a more complicated MI:

In [2]: ind = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b', 'c'], ['d', 'e', 'f', 'g', 'h']])

In [3]: ind
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])

In [4]: print ind

a  d
   e
b  f
   g
c  h
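
The relationship between levels and labels described above can be checked directly. This is a minimal sketch that assumes a recent pandas release, where the attribute printed as `labels` in the reprs in this thread is exposed as `codes`; on the versions used in the thread the attribute name would differ.

```python
import pandas as pd

ind = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b", "c"], ["d", "e", "f", "g", "h"]]
)

# Integer positions into levels[0], one per row of the index.
codes0 = ind.codes[0].tolist()

# Decoding by hand: look up each code in the level.
decoded = list(ind.levels[0].take(ind.codes[0]))

# The public accessor performs the same lookup.
via_api = list(ind.get_level_values(0))

print(codes0, decoded, via_api)
```

Since the values a user sees come only from the codes, an entry in `levels` that no code points to has no effect on the contents of the index.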

@jtratner
Contributor

Based on your previous comment, it seems like the key issue here (groupby showing unused levels) is now resolved. Can we close this or edit this issue to be a feature request? (e.g., method to allow MI to consolidate its levels)

As an aside, my perspective is that it's more intuitive to have the entire level set remain, because it makes slices very clear (and you can share the memory for storing levels):

In [15]: ind
Out[15]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])

In [16]: ind[:2]
Out[16]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[0, 0], [0, 1]])

In [17]: ind[2:4]
Out[17]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[1, 1], [2, 3]])

In [18]: ind[4:5]
Out[18]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[2], [4]])

@michaelaye
Contributor

I would argue that to use repr as a way to examine pandas objects is the default and advertised use case, as pandas docs are full of that, so I don't find it really satisfying and a tad inconsistent that I have to resort to printing an object for clarity while repr works for all (most?) other cases.
I don't really understand what you want to show in your last comment. What does it have to do with deleted indexes remaining in the index?

@michaelaye
Contributor

I would also like to point out that the pandas core team does not seem to have come to a consistent conclusion on how to handle this, as we have 3 issues related to this, and in one (#2655) the claim is made that it is no bug, while the 2 others (#3686 and this one) have been marked as a bug. Maybe you guys should have an internal discussion about it.

@ghost

ghost commented Nov 30, 2013

@michaelaye, I think you (legitimately) missed the point wes was making. My guess is that you're under a misconception
about the role of levels. They are not the equivalent of a regular Index's labels; that equivalent is mi.labels.
wes made this point to you, and so has jtratner (in a comment on #2655), to no avail.

Test yourself with this example:

In [10]: MultiIndex.from_tuples([[0,1],[0,2]])
Out[10]: 
MultiIndex(levels=[[0], [1, 2]],
           labels=[[0, 0], [0, 1]])

Do you understand why the first element in levels only has one item?
Have you noticed that the number of elements in levels is not directly related to the
number of rows in the frame? Then it makes sense that rows could be deleted without
levels logically having to change.

You may find it counter-intuitive (I did in the past), but then the problem to be addressed
is the misunderstanding, not the implementation. Hopefully, now you know.

The fact that a groupby emitted a group for entries that appear in levels but not in labels
(What wes meant by "unobserved") was a bug, and it has been fixed.

I would venture a guess that the reason this non-bug issue has lingered for so long is
lack of time or significance or, indeed, patience to spell things out like this and not a
paucity of managerial suggestions or demerits from you.

Also:

  1. issue labels are not holy scripture. (removed "bug" label) The extra groupby
    row was a bug though.

I agree with @jtratner, we can close this.
If someone wants that consolidate method he suggested, open an issue.
Personally, I don't see the need.

@ghost ghost closed this as completed Nov 30, 2013
@michaelaye
Contributor

Thank you for your efforts. I was indeed puzzled by the meaning of 'unobserved' and 'observed' and finally understand Wes' comment. Still, there are API calls that take levels as an argument, e.g. groupby(). If other users don't find it confusing to have a list of levels not representing the current state, then it must be me.
I am sad to see that my effort to bring these issues forward is interpreted as a demerit of your excellent work, and I apologize if I upset anyone.

@jtratner
Contributor

If you're finding something wrong with groupby (ie you end up with spurious
levels in final output) can you post it?

@michaelaye
Contributor

I don't have anything showing up wrong, and I didn't mean to imply that. My work style relies very much on looking at indices and columns with __repr__, because most of the time my dataframes are just too big for displaying them to be helpful. Your suggestion of printing it resolves the potential confusion from having glimpsed the content of levels, but it is now the only object I would need to print. Any chance the definitions of __str__ and __repr__ could be swapped for the MultiIndex, or would that mess up other things?

@jtratner
Contributor

jtratner commented Dec 1, 2013

@michaelaye unlikely to happen - repr is set to be something that's potentially eval'able. Makes it much easier to reproduce indexes when their repr can be copy/pasted.

@michaelaye
Contributor

Understood. Thanks for your patience.

@ghost

ghost commented Dec 12, 2013

The pandas API doesn't fit in my head anymore. For reference, df.index.get_level_values
might be relevant for whatever use case this was a problem for. It does the right thing.

    ...: 
    ...: x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
    ...: x = x.set_index(['first','second'], drop=False)
    ...: 
    ...: print x.index.get_level_values(0)
    ...: x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows
    ...: print x.index.get_level_values(0)
    ...: 
Index([u'deleteMe', u'keepMe', u'keepMeToo'], dtype='object')
Index([u'keepMe', u'keepMeToo'], dtype='object')

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 23, 2017
jreback added a commit to jreback/pandas that referenced this issue Mar 25, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 2, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 6, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017
@toobaz
Member

toobaz commented Dec 4, 2017

I think this can be closed: the default behavior is as intended, and the method MultiIndex.remove_unused_levels() has been added as a simple fix for whoever doesn't like the default behavior.
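
Applied to the example from the original post, the method mentioned above can be used as follows. This is a minimal sketch that assumes a pandas version that ships `MultiIndex.remove_unused_levels()`; note that it returns a new index rather than modifying the existing one, so the result has to be assigned back.

```python
import pandas as pd

x = pd.DataFrame(
    [["deleteMe", 1, 9], ["keepMe", 2, 9], ["keepMeToo", 3, 9]],
    columns=["first", "second", "third"],
)
x = x.set_index(["first", "second"], drop=False)
x = x[x["first"] != "deleteMe"]

# Default behavior: filtering keeps the full set of levels.
before = list(x.index.levels[0])

# Opt in to trimming; the trimmed index must be assigned back.
x.index = x.index.remove_unused_levels()
after = list(x.index.levels[0])

print(before, after)
```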

@jreback
Contributor

jreback commented Dec 4, 2017

yep this is now the accepted soln.

@vldbnc

vldbnc commented Feb 24, 2023

How could MultiIndex.remove_unused_levels() be the accepted solution?
You might filter a dataframe based on a level and create a new df or series:

df_new = df[df.index.isin(['VALUE'], level=0)]

The newly created df_new correctly shows a MultiIndex, with df_new.index having only 'VALUE' on level=0, but df_new.index.levels[0] still shows all the index names on level=0 from the original df.
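
The behavior described in this comment can be reproduced with a small frame. The data below is hypothetical (the comment does not show its `df`), and the sketch assumes a recent pandas release. It contrasts the stored levels with the values that are actually present in the filtered index.

```python
import pandas as pd

# Hypothetical stand-in for the commenter's df.
df = pd.DataFrame(
    {"n": [1, 2, 3]},
    index=pd.MultiIndex.from_arrays([["VALUE", "OTHER", "VALUE"], [1, 2, 3]]),
)
df_new = df[df.index.isin(["VALUE"], level=0)]

# Stored levels are carried over from df unchanged.
stored = list(df_new.index.levels[0])

# Values present in the filtered index.
observed = list(df_new.index.get_level_values(0).unique())

# Trimming is opt-in and produces a new index.
trimmed = list(df_new.index.remove_unused_levels().levels[0])

print(stored, observed, trimmed)
```

The filter itself selects the right rows; only the `levels` attribute keeps the unused entry until the trimmed index is assigned back to `df_new.index`.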


9 participants