Removing Unused Levels in MultiIndex with NaN values corrupts Index #18616

jlandercy · 2017-12-03T22:58:59Z

Minimal, Complete, and Verifiable Example

Below is an MCVE of the behavior:

import pandas as pd

# Trial data:
data = {
    'key1': list(range(6)) * 2,
    'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None],
    'data': ['a'] * 12,
}
# Load the data:
df0 = pd.DataFrame(data)
# Flat index (int64 is upcast to float64 because None is converted to NaN):
df1 = df0.set_index('key2')
# MultiIndex containing NaN in the second level:
df2 = df0.set_index(['key1', 'key2'])
# On pandas 0.21.0, the NaN entries are remapped to a duplicate of the last level value:
idx = df2.index.remove_unused_levels()
# The two indexes are then no longer equal:
idx.equals(df2.index)  # False

Problem description

Calling remove_unused_levels on a MultiIndex containing NaN creates a new MultiIndex that is not equal to the original, contrary to what the documentation says:

The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.

This is why I suspect it is a bug.

Float Index

A flat Index stores NaN as one of its values:

>>> df1.index
Float64Index([100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 300.0, 300.0,
              nan, nan, nan],
             dtype='float64', name='key2')

A MultiIndex does not: NaN never appears in its levels, and missing entries get the label -1 instead:

>>> df2.index
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
           names=['key1', 'key2'])
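
The representation shown above can be inspected directly. The following is a minimal sketch, assuming a recent pandas release in which the `labels` attribute is named `codes` (the rename happened in pandas 0.24); the variable names are illustrative only.

```python
import pandas as pd

# Same trial data as in the report above.
df0 = pd.DataFrame({
    'key1': list(range(6)) * 2,
    'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None],
    'data': ['a'] * 12,
})
index = df0.set_index(['key1', 'key2']).index

# Integer positions into the 'key2' level; -1 marks a missing value.
key2_codes = index.codes[1].tolist()
# The level itself only holds the distinct non-missing values.
key2_levels = index.levels[1]
# NaN is materialised only when the level values are expanded again.
missing_count = int(index.get_level_values('key2').isna().sum())

print(key2_codes)
print(list(key2_levels))
print(missing_count)
```

The point of the sketch is that NaN lives in the codes (as -1), not in the levels, so any operation that rebuilds the levels has to treat -1 specially.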

MultiIndex corruption

When the levels are refreshed, the NaN entries point to a copy of the last float value of the level (here 300.0). This corrupts the index, because those auto-filled values have no meaning.

>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 3, 3, 3, 3, 3]],
           names=['key1', 'key2'])

As a consequence, the indexes are not equal (which contradicts the documentation):

>>> df2.index.remove_unused_levels().equals(df2.index)
False

Even worse, the original value 300.0 (position 2) is no longer referenced, so it becomes an unused value in the newly generated index.

To confirm this, let's apply the method twice:

>>> df2.index.remove_unused_levels().remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]],
           names=['key1', 'key2'])

Expected Output

I believe the expected output of set_index and remove_unused_levels should be:

>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, nan]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]],
           names=['key1', 'key2'])

The problem also occurs when rows are removed from the DataFrame, which is when it makes sense to use remove_unused_levels to clean up the index. While building the MCVE, however, I found that it occurs on the whole index as well, whatever the level order.
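
The row-removal scenario can be checked with a short sketch. It assumes a pandas release that already contains the fix discussed later in this thread and that exposes `codes` instead of `labels` (pandas 0.24 or later); the names `trimmed` and `cleaned` are illustrative. The order of the remaining level values is deliberately not asserted, since the method does not promise a sorted result.

```python
import pandas as pd

df0 = pd.DataFrame({
    'key1': list(range(6)) * 2,
    'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None],
    'data': ['a'] * 12,
})
df2 = df0.set_index(['key1', 'key2'])

# Drop every row whose key2 is 100; the NaN rows are kept because NaN != 100.
trimmed = df2[df2.index.get_level_values('key2') != 100]

# Filtering does not prune the levels, so 100.0 is still listed although unused.
still_listed = 100.0 in trimmed.index.levels[1]

# Rebuild the levels so that only referenced values remain.
cleaned = trimmed.index.remove_unused_levels()

same_as_before = cleaned.equals(trimmed.index)       # the documented guarantee
pruned = 100.0 not in cleaned.levels[1]              # the unused value is gone
nan_codes = int((cleaned.codes[1] == -1).sum())      # NaN rows keep the -1 code

print(still_listed, same_as_before, pruned, nan_codes)
```

On pandas 0.21.0, the version in this report, the `equals` check is the one expected to fail.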

Pandas Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jreback · 2017-12-03T23:02:24Z

This is a duplicate of #18417 and is already fixed in master by #18426. Thanks for the report.

jreback · 2017-12-03T23:02:53Z

cc @toobaz anything in here we should add to tests?

jreback · 2017-12-03T23:04:13Z

In [3]: pd.__version__
Out[3]: '0.22.0.dev0+280.gf04637b'

In [2]: df2.index.remove_unused_levels()
Out[2]: 
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
           names=['key1', 'key2'])

@jlandercy note that we never have NaN in the levels; it's simply a -1 code.

jlandercy · 2017-12-03T23:09:12Z

Yes, I guess dealing with NaN in an Index may lead to a lot of trouble (this value does not behave well, for example under comparison).

So, if I understand correctly, the problem will vanish once I upgrade to v0.22.0.

Thank you for your work, Pandas is a great tool.

toobaz · 2017-12-04T05:43:04Z

> Yes, I guess dealing with NaN in an Index may lead to a lot of trouble (this value does not behave well, for example under comparison).

In principle, NaNs in indexes should behave just like normal values with respect to comparison (differently from NaNs in values). However, this is currently affected by #18455 (which should be fixed soon) for flat Indexes, and by #18485 (which will probably need more time) for MultiIndexes.
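
The distinction between NaN as a plain value and NaN as an index key can be sketched as follows. This is only an illustration on a recent pandas release, not a statement about the versions affected by the issues cited above; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# As a plain floating-point value, NaN never compares equal to itself.
nan_equals_itself = bool(np.nan == np.nan)

# As an index key, NaN is treated like any other label.
left = pd.Index([100.0, 200.0, np.nan])
right = pd.Index([100.0, 200.0, np.nan])
same = left.equals(right)          # element-wise match, NaN included
position = left.get_loc(np.nan)    # NaN can be looked up like a normal key

print(nan_equals_itself, same, position)
```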

> cc @toobaz anything in here we should add to tests?

I don't think so... the case of the MVCE above (with no unused levels in the input) is already covered.

jreback closed this as completed Dec 3, 2017

jreback added the Duplicate Report, Missing-data, and MultiIndex labels Dec 3, 2017

jreback added this to the Next Major Release milestone Dec 3, 2017
