BUG: Indexes still include values that have been deleted #2770

darindillon · 2013-01-29T18:36:11Z

Using pandas 0.10. If we create a DataFrame with a MultiIndex, then delete all the rows with value X, we'd expect the index to no longer show value X. But it does.
Note the apparent inconsistency between "index" and "index.levels" -- one shows the values have been deleted but the other doesn't.

import pandas

x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
x = x.set_index(['first','second'], drop=False)

x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows

print x.index #Good: Index no longer has any rows with 'deleteMe'. But....

print x.index.levels #Bad: index still shows the "deleteMe" values are there. But why? We deleted them.

x.groupby(level='first').sum() #Bad: it's creating a dummy row for the rows we deleted!

We don't want the deleted values to show up in that groupby. Can we eliminate them?
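For readers on a current pandas, the same observation can be restated in Python 3 syntax. This is only a sketch of the report above, not part of the original post; the variable names `remaining` and `level_values` are introduced here for illustration.

```python
import pandas as pd

x = pd.DataFrame(
    [["deleteMe", 1, 9], ["keepMe", 2, 9], ["keepMeToo", 3, 9]],
    columns=["first", "second", "third"],
)
x = x.set_index(["first", "second"], drop=False)

# Chop off all the 'deleteMe' rows
x = x[x["first"] != "deleteMe"]

# The index entries themselves no longer contain 'deleteMe' ...
remaining = list(x.index)
# ... but the stored level values still do.
level_values = list(x.index.levels[0])

print(remaining)
print(level_values)
```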

ghost · 2013-01-29T21:36:59Z

related #2655 (maybe)

wesm · 2013-02-09T18:08:00Z

This is kind of a tricky problem: for example, when should you "recompute" the levels? I have to table this until I have a chance to look a bit more deeply. Another solution is to exclude levels with no observations in such a groupby.

darindillon · 2013-02-09T18:24:54Z

Well, is there any easy workaround I can use? Like if I know I have this problem, can I manually call a .rebuild_index() or something? I've played around with all the obvious possibilities (short of creating a brand new dataframe) and can't find any workaround.
It's the last line (the .groupby(...).sum() ) that I care about -- that's the one I need to make the bad data go away.

EDIT: better clarification. I have one function that builds the dataset and drops the rows. At that point, I know I'm in the situation described in this issue, and I'd like to do my workaround there. But then the .groupby().sum() happens much much later in a different function. I could easily hack that second function as you say (exclude levels with no observations) but it makes more sense to keep my workaround code in the first function. Any ideas?

michaelaye · 2013-02-10T04:20:40Z

How about the workaround that I proposed for #2655 ? In your case maybe

x.groupby(x.index.get_level_values(1)).sum()

should do the correct thing, if I'm not wrong? I don't know why, but the result of this function delivers updated values.

darindillon · 2013-02-10T05:08:28Z

Yes, that works; but the code that does .groupby().sum() is in one function and the code that removes the value from the table is in another function. It would be much clearer to use a workaround that cleans up the problem with the dataframe in the function that creates it -- that way any other function could use the dataframe without having to do your trick.

michaelaye · 2013-02-10T08:07:11Z

Ehm, can you confirm that this problem still exists with 0.10.1?
I just tried your example, and I don't see a dummy row with index "deleteMe" ?

In [9]: print x.index.levels
[Index([deleteMe, keepMe, keepMeToo], dtype=object), Int64Index([1, 2, 3], dtype=int64)]

In [10]: x.groupby(level='first').sum()
Out[10]: 
           second  third
first                   
keepMe          2      9
keepMeToo       3      9

jreback · 2013-03-14T11:06:49Z

is this closable? @tavistmorph does this exist in 0.11-dev?

darindillon · 2013-03-14T17:05:14Z

It's still an issue. It still happens for me in 0.10.1 and 0.11 (as of the last time I pulled, at least). Just run the code snippet in my original post and you can see it.

Michael -- the deleted row appears in your output above on step #9 ("deleteMe" should not be there since we deleted it) and then it appears in the output for step 10 ("first" should not appear since all the rows with the "first" value were deleted).

wesm · 2013-03-15T01:26:27Z

This isn't really a bug. Perhaps an option should be added to return an array of observed values in a particular level in the index (which is what you're after)?

michaelaye · 2013-03-15T01:34:06Z

Can you clarify what you mean by observed? Do you mean that the object is a view into the original object (I don't know if it is), and that's why it still contains the 'deleteMe' index?

wesm · 2013-03-15T02:29:41Z

The levels are not computed from the actual observed values. For example, in R you can have a factor (categorical variable) in which some distinct values are not observed:

> d
[1] b c b c
Levels: a b c
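The R factor above has a close pandas analogue in `pd.Categorical`, which may make the "observed vs. declared" distinction easier to see from Python. This is a sketch added for illustration; the names `declared` and `observed` are not from the thread.

```python
import pandas as pd

# Values b, c, b, c with declared categories a, b, c ('a' is never observed)
d = pd.Categorical(["b", "c", "b", "c"], categories=["a", "b", "c"])

declared = list(d.categories)
observed = list(d.remove_unused_categories().categories)

print(declared)
print(observed)
```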

michaelaye · 2013-11-30T03:10:34Z

Version: '0.12.0-1184-gc73b957'
The MultiIndex still shows all previously existing index values and therefore still is confusing to the user who looks at it, after chopping off the 'deleteMe' rows:

In [10]: x.index
Out[10]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
           labels=[[1, 2], [1, 2]],
           names=[u'first', u'second'])

but at least the groupby does not create an empty row anymore for previously existing indices:

In [12]: x.groupby(level='first').sum()
Out[12]:
           second  third
first
keepMe          2      9
keepMeToo       3      9

[2 rows x 2 columns]

so the discussion now boils down to the confusion of looking at df.index. I would argue, as I am looking often at the index to see what I am working with, that I would still be very puzzled by the index showing old values and from that point on I would not trust the results anymore.

jtratner · 2013-11-30T04:42:36Z

If you print that MultiIndex, it looks like what you want:

In [7]: mi
Out[7]:
MultiIndex(levels=[[u'deleteMe', u'keepMe', u'keepMeToo'], [1, 2, 3]],
           labels=[[1, 2], [1, 2]],
           names=[u'first', u'second'])

In [8]: print mi
first      second
keepMe     2
keepMeToo  3

Thus, a simple way to handle this is to examine your indices with print rather than the repr that IPython shows you.

The MultiIndex repr isn't really intuitive in any case, unless you understand that it's a categorical and that labels represent the integer positions of the levels at each location. You shouldn't need to care about that as a consumer of a MultiIndex. And if you understand the internal representation, you can then also understand why it doesn't matter whether there are extra levels.

The issue becomes clearer with a more complicated MI:

In [2]: ind = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b', 'c'], ['d', 'e', 'f', 'g', 'h']])

In [3]: ind
Out[3]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])

In [4]: print ind

a  d
   e
b  f
   g
c  h
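To make the categorical representation concrete, the sketch below rebuilds the first level's values from `levels` and the integer positions. One assumption to note: the attribute shown as `labels` in the reprs above was renamed to `codes` in pandas 0.24, so the snippet uses `codes` and needs a reasonably recent pandas.

```python
import pandas as pd

ind = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b", "c"], ["d", "e", "f", "g", "h"]]
)

# Integer positions into levels[0], one per row
codes0 = list(ind.codes[0])

# Looking each position up in levels[0] reproduces the visible values
rebuilt = list(ind.levels[0].take(ind.codes[0]))

print(codes0)
print(rebuilt)
print(list(ind.get_level_values(0)))
```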

jtratner · 2013-11-30T04:47:38Z

Based on your previous comment, it seems like the key issue here (groupby showing unused levels) is now resolved. Can we close this or edit this issue to be a feature request? (e.g., method to allow MI to consolidate its levels)

As an aside, my perspective is that it's more intuitive to have the entire level set remain, because it makes slices very clear (and you can share the memory for storing levels):

In [15]: ind
Out[15]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[0, 0, 1, 1, 2], [0, 1, 2, 3, 4]])

In [16]: ind[:2]
Out[16]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[0, 0], [0, 1]])

In [17]: ind[2:4]
Out[17]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[1, 1], [2, 3]])

In [18]: ind[4:5]
Out[18]:
MultiIndex(levels=[[u'a', u'b', u'c'], [u'd', u'e', u'f', u'g', u'h']],
           labels=[[2], [4]])
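The slicing behaviour shown above can also be checked programmatically. A minimal sketch, again assuming a pandas version where the integer positions are exposed as `codes` rather than `labels`; `sub` is a name introduced here:

```python
import pandas as pd

ind = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b", "c"], ["d", "e", "f", "g", "h"]]
)

sub = ind[2:4]

# The slice keeps the full level set ...
sub_levels = list(sub.levels[0])
# ... and only the integer positions shrink.
sub_codes = list(sub.codes[0])
# The values a consumer sees are just the two 'b' rows.
sub_values = list(sub.get_level_values(0))

print(sub_levels, sub_codes, sub_values)
```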

michaelaye · 2013-11-30T06:48:24Z

I would argue that to use repr as a way to examine pandas objects is the default and advertised use case, as pandas docs are full of that, so I don't find it really satisfying and a tad inconsistent that I have to resort to printing an object for clarity while repr works for all (most?) other cases.
I don't really understand what you want to show in your last comment. What does it have to do with deleted indexes remaining in the index?

michaelaye · 2013-11-30T06:52:15Z

I would also like to point out that the pandas core team does not seem to have come to a consistent conclusion on how to handle this, as we have three issues related to this, and in one (#2655) the claim is made that it is not a bug, while the two others (#3686 and this one) have been marked as bugs. Maybe you guys should have an internal discussion about it.

ghost · 2013-11-30T12:07:54Z

@michaelaye, I think you (legitimately) missed the point wes was making. My guess is that you're under a misconception
about the role of levels. It is not the equivalent of a regular Index's labels; that equivalent is mi.labels.
wes made this point to you, and so did jtratner in #2655 (comment), to no avail.

Test yourself with this example:

In [10]: MultiIndex.from_tuples([[0,1],[0,2]])
Out[10]: 
MultiIndex(levels=[[0], [1, 2]],
           labels=[[0, 0], [0, 1]])

Do you understand why the first element in levels only has one item?
Have you noticed that the number of elements in levels is not directly related to the
number of rows in the frame? Does it then make sense that rows could be deleted without
levels logically having to change?

You may find it counter-intuitive (I did in the past), but then the problem to be addressed
is the misunderstanding, not the implementation. Hopefully, now you know.

The fact that a groupby emitted a group for entries that appear in levels but not in labels
(What wes meant by "unobserved") was a bug, and it has been fixed.

I would venture a guess that the reason this non-bug issue has lingered for so long is
lack of time or significance or, indeed, patience to spell things out like this and not a
paucity of managerial suggestions or demerits from you.

Also:

issue labels are not holy scripture. ~~(removed "bug" label)~~ The extra groupby
row was a bug though.

I agree with @jtratner, we can close this.
If someone wants that consolidate method he suggested, open an issue.
Personally, I don't see the need.

michaelaye · 2013-11-30T21:32:14Z

Thank you for your efforts. I was indeed puzzled by the meaning of 'unobserved' and 'observed' and finally understand Wes' comment. Still, there are API calls that take levels as an argument, e.g. groupby(). If other users don't find it confusing to have a list of levels not representing the current state, then it must be me.
I am sad to see that my effort to bring these issues forward is interpreted as demerit of your excellent work and apologize if I upset anyone.

jtratner · 2013-11-30T22:04:39Z

If you're finding something wrong with groupby (ie you end up with spurious
levels in final output) can you post it?

michaelaye · 2013-12-01T03:00:40Z

I don't have anything showing up wrong, and I didn't mean to imply that. My work style relies heavily on looking at indices and columns with __repr__, because most of the time my dataframes are too big for displaying them to be helpful. Your suggestion of printing it resolves the potential confusion from having glimpsed the content of levels, but it is now the only object I would need to print. Any chance the definitions of __str__ and __repr__ could be swapped for the MultiIndex, or would that mess up other things?

jtratner · 2013-12-01T03:43:22Z

@michaelaye unlikely to happen - repr is set to be something that's potentially eval'able. Makes it much easier to reproduce indexes when their repr can be copy/pasted.

michaelaye · 2013-12-01T03:46:34Z

Understood. Thanks for your patience.

ghost · 2013-12-12T15:47:28Z

The pandas API doesn't fit in my head anymore. For reference, df.index.get_level_values
might be relevant for whatever use case this was a problem for. It does the right thing.

    ...: 
    ...: x = pandas.DataFrame([['deleteMe',1, 9],['keepMe',2, 9],['keepMeToo',3, 9]], columns=['first','second', 'third'])
    ...: x = x.set_index(['first','second'], drop=False)
    ...: 
    ...: print x.index.get_level_values(0)
    ...: x = x[x['first'] != 'deleteMe'] #Chop off all the 'deleteMe' rows
    ...: print x.index.get_level_values(0)
    ...: 
Index([u'deleteMe', u'keepMe', u'keepMeToo'], dtype='object')
Index([u'keepMe', u'keepMeToo'], dtype='object')
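In Python 3 the same check might look like the sketch below. Chaining `.unique()` onto `get_level_values` is an addition here (not something from the comment above) to collapse the result to the distinct values that are still present; `observed` is an illustrative name.

```python
import pandas as pd

x = pd.DataFrame(
    [["deleteMe", 1, 9], ["keepMe", 2, 9], ["keepMeToo", 3, 9]],
    columns=["first", "second", "third"],
)
x = x.set_index(["first", "second"], drop=False)
x = x[x["first"] != "deleteMe"]  # Chop off all the 'deleteMe' rows

# get_level_values reflects the rows that are actually present,
# unlike x.index.levels, which keeps the original level set.
observed = list(x.index.get_level_values("first").unique())

print(observed)
```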


toobaz · 2017-12-04T15:19:33Z

I think this can be closed: the default behavior is as intended, and the method MultiIndex.remove_unused_levels() has been added as a simple fix for whoever doesn't like the default behavior.
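A minimal sketch of how `MultiIndex.remove_unused_levels()` could be applied in the function that builds and filters the frame, which is the kind of workaround asked for earlier in the thread. The function name `build_frame` is hypothetical; `remove_unused_levels()` returns a new index, so it has to be assigned back.

```python
import pandas as pd


def build_frame():
    """Hypothetical builder: create the frame, drop rows, tidy the index."""
    x = pd.DataFrame(
        [["deleteMe", 1, 9], ["keepMe", 2, 9], ["keepMeToo", 3, 9]],
        columns=["first", "second", "third"],
    )
    x = x.set_index(["first", "second"], drop=False)
    x = x[x["first"] != "deleteMe"]
    # Rebuild the levels so they only contain values still in use
    x.index = x.index.remove_unused_levels()
    return x


x = build_frame()
first_level = list(x.index.levels[0])
second_level = list(x.index.levels[1])
print(first_level, second_level)
```

Any later consumer of the frame then sees levels that match the remaining rows, without needing its own workaround.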

jreback · 2017-12-04T15:32:21Z

yep this is now the accepted soln.

vldbnc · 2023-02-24T11:04:57Z

How can MultiIndex.remove_unused_levels() be the accepted solution?
You might filter a dataframe based on a level and create a new df or series:

df_new = df[df.index.isin(['VALUE'], level=0)]

The newly created df_new correctly shows a MultiIndex, with df_new.index having only 'VALUE' on level=0, but df_new.index.levels[0] still shows all the index names on level=0 from the original df.
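The situation described here can be reproduced with a small frame. The data below is made up for illustration ('OTHER', the `v` column and the second-level keys are assumptions); the sketch shows that filtering alone leaves the level set untouched and that the levels only shrink once `remove_unused_levels()` is called explicitly on the filtered index.

```python
import pandas as pd

# Hypothetical data: two first-level keys, 'VALUE' and 'OTHER'
df = pd.DataFrame(
    {"v": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("VALUE", "x"), ("OTHER", "y"), ("VALUE", "z")]
    ),
)

df_new = df[df.index.isin(["VALUE"], level=0)]

# Filtering does not touch the stored levels ...
levels_before = list(df_new.index.levels[0])
# ... so the cleanup is an explicit, separate step.
levels_after = list(df_new.index.remove_unused_levels().levels[0])

print(levels_before)
print(levels_after)
```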

michaelaye mentioned this issue Feb 1, 2013

index.levels not being updated by groupby #2655

Closed

michaelaye mentioned this issue May 23, 2013

dataframe.drop(col,axis=1) does not drop column from column.levels in multiindex dataframe #3686

Closed

ghost closed this as completed Nov 30, 2013

jreback mentioned this issue Jun 30, 2014

BUG: Unexpected behavior from DataFrame.index.levels #7614

Closed

jreback mentioned this issue Nov 18, 2014

BUG: .stack(dropna=False) looks through views incorrectly for dataframe views with multi-index columns #8844

Closed

jreback mentioned this issue Nov 26, 2014

MultiIndexing Issue #8893

Closed

jreback mentioned this issue Dec 5, 2014

Dropping rows from inner level of multiindex not removing name of rows from index #9013

Closed

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

support for removing unused levels (internally)

ae6b9ec

xref pandas-dev#2770

jreback mentioned this issue Mar 16, 2017

BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694

Closed

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

support for removing unused levels (internally)

50ac461

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

support for removing unused levels (internally)

84bb2b9

xref pandas-dev#2770

jreback mentioned this issue Mar 16, 2017

ENH: support for removing unused levels of a MultiIndex (interally) #15700

Closed

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

support for removing unused levels (internally)

ff58f82

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

support for removing unused levels (internally)

35e6a0f

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

support for removing unused levels (internally)

dbf1c94

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

support for removing unused levels (internally)

aa6190f

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

support for removing unused levels (internally)

f3ec8ac

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

support for removing unused levels (internally)

4a2e3ac

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 23, 2017

support for removing unused levels (internally)

512541a

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Mar 25, 2017

support for removing unused levels (internally)

dc28dff

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Apr 2, 2017

support for removing unused levels (internally)

7533579

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017

support for removing unused levels (internally)

6d5d456

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017

support for removing unused levels (internally)

7677df2

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Apr 6, 2017

support for removing unused levels (internally)

21e5e49

xref pandas-dev#2770

jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017

support for removing unused levels (internally)

b234bdb

xref pandas-dev#2770

jreback mentioned this issue Apr 27, 2017

MultiIndex levels() and get_level_values() give different results #16160

Closed

TomAugspurger mentioned this issue May 1, 2017

BUG: multi-index joining returns wrong multiindex #16182

Closed

dfolch mentioned this issue Jun 11, 2017

column multiindex and reindex inconsistency #16626

Closed

toobaz mentioned this issue Oct 16, 2017

API: support "unique=True" in MultiIndex.get_level_values() #17896

Closed

jreback closed this as completed Dec 4, 2017

toobaz mentioned this issue Jan 1, 2018

x in pd.MultiIndex.drop(x) #19027

Closed

lussong mentioned this issue Jun 17, 2019

Treemap visualization with multi group by filed raises KeyError apache/superset#7710

Closed

3 tasks

qsourav mentioned this issue Nov 29, 2023

Bug in pyarrow.from_pandas() when input has MultiIndex index columns having non-string names apache/arrow#38983

Open

SultanOrazbayev mentioned this issue Jul 12, 2024

[BUG] Trim unused levels when verifying dataframe formatting. sktime/sktime#6754

Draft
