Removing Unused Levels in MultiIndex with NaN values corrupts Index #18616

Closed
jlandercy opened this issue Dec 3, 2017 · 5 comments
Labels
Duplicate Report, Missing-data, MultiIndex

Comments

jlandercy commented Dec 3, 2017

Minimal, Complete, and Verifiable Example

Below is an MCVE of the behavior:

import pandas as pd

# Trial data ('key2' contains None, which pandas converts to NaN):
data = {
    'key1': list(range(6)) * 2,
    'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None],
    'data': ['a'] * 12,
}
# Load data:
df0 = pd.DataFrame(data)
# Flat index (int64 upcast to float64 because None is converted to NaN):
df1 = df0.set_index('key2')
# MultiIndex containing NaN on the second level:
df2 = df0.set_index(['key1', 'key2'])
# On pandas 0.21.0, NaN entries end up pointing at a copy of the last level value:
idx = df2.index.remove_unused_levels()
# The two indexes are then no longer equal:
idx.equals(df2.index)  # False

Problem description

Calling remove_unused_levels on a MultiIndex containing NaN creates a new MultiIndex that is not equal to the original, whereas the documentation says:

The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.

This is why I suspect it is a bug.

Float Index

A flat Index stores NaN as one of its values:

>>> df1.index
Float64Index([100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 300.0, 300.0,
              nan, nan, nan],
             dtype='float64', name='key2')

A MultiIndex does not: it marks NaN with a negative label (-1) instead of storing it as a level value:

>>> df2.index
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
           names=['key1', 'key2'])
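The level/label split shown above can be inspected directly. This is a minimal sketch, assuming pandas 0.24 or later, where the `labels` attribute printed in this report is exposed as `codes`; the variable names `mi`, `key2_levels`, and `key2_codes` are only illustrative.

```python
import pandas as pd

# Same trial data as in the report above.
df0 = pd.DataFrame({
    "key1": list(range(6)) * 2,
    "key2": [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None],
    "data": ["a"] * 12,
})
mi = df0.set_index(["key1", "key2"]).index

# The distinct non-missing values of the second level.
key2_levels = [float(v) for v in mi.levels[1]]
# One integer position per row; missing values are marked with -1.
key2_codes = [int(c) for c in mi.codes[1]]

print(key2_levels)
print(key2_codes)
```

The point of the sketch is that NaN never appears among the level values; it only shows up as the -1 marker in the per-row positions.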

MultiIndex corruption

When the levels are refreshed, the NaN entries point to a copy of the last float value of the level (here 300.0). This leaves the index corrupted, because those auto-filled values have no meaning.

>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 3, 3, 3, 3, 3]],
           names=['key1', 'key2'])

As a consequence, the indexes are not equal (which contradicts the documentation):

>>> df2.index.remove_unused_levels().equals(df2.index)
False

Even worse, the original value 300.0 is no longer referenced, so it becomes an unused level value in the newly generated index.

To confirm this, let's apply the method twice:

>>> df2.index.remove_unused_levels().remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]],
           names=['key1', 'key2'])
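One plausible explanation for both outputs above is that the -1 label is treated as an ordinary position, so it wraps around to the last level value. The pure-Python sketch below is not the pandas source; `naive_remove_unused` is a hypothetical helper that ignores the special meaning of -1, and it happens to reproduce the levels and labels reported here.

```python
def naive_remove_unused(levels, codes):
    """Remap codes to 0..n-1 WITHOUT treating -1 as a missing-value marker."""
    # Distinct codes in order of first appearance, e.g. [0, 1, 2, -1].
    uniques = list(dict.fromkeys(codes))
    # Old code -> new code. mapping[-1] silently overwrites the LAST slot.
    mapping = [0] * len(levels)
    for new_code, old_code in enumerate(uniques):
        mapping[old_code] = new_code
    # levels[-1] silently picks the last level value for the "missing" code.
    new_levels = [levels[c] for c in uniques]
    new_codes = [mapping[c] for c in codes]
    return new_levels, new_codes


levels = [100.0, 200.0, 300.0]
codes = [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]

once = naive_remove_unused(levels, codes)
twice = naive_remove_unused(*once)
print(once)
print(twice)
```

After one pass the level list gains a duplicate 300.0 and the NaN rows share a code with the genuine 300.0 rows; the second pass then drops the now-unreferenced original 300.0.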

Expected Output

I believe the expected output of set_index and remove_unused_levels should be:

>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, nan]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]],
           names=['key1', 'key2'])

The problem also occurs when rows are removed from the DataFrame, which is when it makes sense to call remove_unused_levels to clean up the index. While building the MCVE, however, I found that the same behavior appears on the whole index, whatever the level order.

Pandas Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jreback (Contributor) commented Dec 3, 2017

this is a duplicate of #18417 and fixed in #18426 (in master) already. thanks for the report.

jreback closed this as completed Dec 3, 2017
jreback added the Duplicate Report, Missing-data, and MultiIndex labels Dec 3, 2017
jreback added this to the Next Major Release milestone Dec 3, 2017

jreback (Contributor) commented Dec 3, 2017

cc @toobaz anything in here we should add to tests?

jreback (Contributor) commented Dec 3, 2017

In [3]: pd.__version__
Out[3]: '0.22.0.dev0+280.gf04637b'

In [2]: df2.index.remove_unused_levels()
Out[2]: 
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
           labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
           names=['key1', 'key2'])

@jlandercy note that we never have nan in the levels; it's simply a -1 code.
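To illustrate the -1 convention, here is a pure-Python sketch of a remapping that leaves the missing-value marker alone. It is not the pandas implementation; `remove_unused` and `na_code` are hypothetical names used only for this example.

```python
def remove_unused(levels, codes, na_code=-1):
    """Drop unreferenced level values while preserving the missing-value code."""
    # Codes that actually occur, excluding the missing-value marker.
    used = sorted({c for c in codes if c != na_code})
    # Old code -> new, densely packed code.
    remap = {old: new for new, old in enumerate(used)}
    new_levels = [levels[c] for c in used]
    # The marker passes through untouched; it never indexes into levels.
    new_codes = [na_code if c == na_code else remap[c] for c in codes]
    return new_levels, new_codes


levels = [100.0, 200.0, 300.0]

# Every level value is used: nothing changes, and NaN rows keep code -1.
full = remove_unused(levels, [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1])

# 100.0 is unused: it is dropped and the remaining codes shift down.
trimmed = remove_unused(levels, [1, 1, 1, 2, 2, -1, -1, -1])
print(full)
print(trimmed)
```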

jlandercy (Author) commented Dec 3, 2017

Yes I guess dealing with NaN in Index may lead to a lot of troubles (this entity does not behave well, such as comparison).

So, if I understand correctly, the problem will vanish once I move to v0.22.0.

Thank you for your work, Pandas is a great tool.

toobaz (Member) commented Dec 4, 2017

> Yes I guess dealing with NaN in Index may lead to a lot of troubles (this entity does not behave well, such as comparison).

In principle, NaNs in indexes should behave just like normal values with respect to comparison (differently from NaNs in values). However, this is currently affected by #18455 (which should be fixed soon) for flat Indexes, and #18485 (which will probably need more time) for MultiIndexes.
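The difference between NaN as a plain value and NaN inside an index can be sketched like this, assuming a recent pandas release in which the issues referenced above no longer apply. The variable names are only illustrative.

```python
import pandas as pd

nan = float("nan")

# As a plain float, NaN never compares equal to itself.
scalar_equal = (nan == nan)

# Inside an index, NaN entries in matching positions are treated as equal.
left = pd.Index([100.0, nan])
right = pd.Index([100.0, nan])
index_equal = bool(left.equals(right))

print(scalar_equal, index_equal)
```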

> cc @toobaz anything in here we should add to tests?

I don't think so... the case of the MVCE above (with no unused levels in the input) is already covered.
