sort_index for MultiIndex DataFrame silently fills NaN in index column #25818

@dddping

Code Sample

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([['A0', 'B0'], ['A0', 'B1'],
                   ['A1', 'B0'], ['A1', 'B1'], ['A3', np.nan]], names=['ia', 'ib'])
df = pd.DataFrame(np.arange(10).reshape(5, 2), mi, columns=['bar', 'foo'])
df2 = df.copy()
# reorder the labels of level 1 so that the level is no longer sorted
df2.index.set_levels(['B1', 'B0'], level=1, inplace=True)

df2
        bar  foo
ia ib
A0 B1     0    1
   B0     2    3
A1 B1     4    5
   B0     6    7
A3 NaN    8    9

# sort_index on df2 now fills B0 into the (A3, NaN) row, whereas sort_index on df does not
df2.sort_index()
       bar  foo
ia ib
A0 B0    2    3
   B1    0    1
A1 B0    6    7
   B1    4    5
A3 B0    8    9

Problem description

I suspect that the unsorted level in the DataFrame's MultiIndex causes the NaN to be filled in automatically.
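A minimal sketch for inspecting that suspicion, assuming a pandas version that exposes `MultiIndex.codes` (0.24 or later). It uses the non-inplace form of `set_levels` and prints, for both indexes, the labels of level `ib`, the integer codes pointing into them, and whether the level is sorted. A missing value is stored as code -1, so it has no label of its own in the level.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("A0", "B0"), ("A0", "B1"), ("A1", "B0"), ("A1", "B1"), ("A3", np.nan)],
    names=["ia", "ib"],
)
# Non-inplace equivalent of the set_levels call in the code sample.
mi2 = mi.set_levels(["B1", "B0"], level=1)

for name, index in [("df", mi), ("df2", mi2)]:
    level = index.levels[1]              # distinct labels of level "ib"
    codes = np.asarray(index.codes[1])   # positions into `level`; -1 marks NaN
    print(name, list(level), codes.tolist(), level.is_monotonic_increasing)
```

Only the level labels differ between the two indexes; the codes, including the -1 for the NaN entry, are the same.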

Note: I ended up with a DataFrame whose MultiIndex has an unsorted level after a broadcast operation.
Here is the code:

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 3),
                                      mklbl('B', 2),
                                      mklbl('C', 2),
                                      mklbl('D', 2)],names=['ia','ib','ic','id'])

micolumns = ['foo','bah']

dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

dfmi = dfmi.drop('A2')
# the first index level now lists more values than the index actually uses,
# and this may lead to the level changing from ['A0', 'A1', 'A2'] to ['A1', 'A0']
bs = dfmi.loc[('A0', 'B0')].copy().rename({'D1': 'D2'})
# the addition below produces NaN in some rows because of broadcasting
(dfmi + bs).sort_index()
               bah   foo
ic id ia ib
C0 D0 A0 B0    2.0   0.0
         B1   10.0   8.0
      A1 B0   18.0  16.0
         B1   26.0  24.0
   D1 A0 B0    NaN   NaN
         B1    NaN   NaN
      A1 B0    NaN   NaN
         B1    NaN   NaN
   D2 A0 NaN   NaN   NaN
C1 D0 A0 B0   10.0   8.0
         B1   18.0  16.0
      A1 B0   26.0  24.0
         B1   34.0  32.0
   D1 A0 B0    NaN   NaN
         B1    NaN   NaN
      A1 B0    NaN   NaN
         B1    NaN   NaN
   D2 A0 NaN   NaN   NaN # again, the (D2, NaN, NaN) row has been changed: ia is now A0
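The comment about the first level listing more values than it uses can be checked directly. The sketch below uses a smaller, hypothetical two-level frame (not the `dfmi` from the report) and shows that `drop` keeps the unused label in `levels`, while `MultiIndex.remove_unused_levels()` discards it. Whether cleaning the levels avoids the reported behaviour is not something this sketch establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, used only to illustrate unused levels.
index = pd.MultiIndex.from_product(
    [["A0", "A1", "A2"], ["B0", "B1"]], names=["ia", "ib"]
)
frame = pd.DataFrame({"foo": np.arange(len(index))}, index=index)

dropped = frame.drop("A2")
# "A2" no longer appears in any row, yet it is still listed in the level.
print(list(dropped.index.levels[0]))

cleaned_index = dropped.index.remove_unused_levels()
# Only the labels that are actually used remain.
print(list(cleaned_index.levels[0]))
```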

Expected Output

sort_index should not change the contents of the DataFrame.
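One way to express this expectation as a check is sketched below. The helper `index_nan_counts` is a hypothetical name introduced here; it counts missing entries per index level through the codes (-1 marks a missing value). The comparison assumes a pandas version with `MultiIndex.codes` and uses the non-inplace form of `set_levels`.

```python
import numpy as np
import pandas as pd


def index_nan_counts(index):
    """Count missing entries per level of a MultiIndex (code -1 means NaN)."""
    return {
        name: int((np.asarray(codes) == -1).sum())
        for name, codes in zip(index.names, index.codes)
    }


mi = pd.MultiIndex.from_tuples(
    [("A0", "B0"), ("A0", "B1"), ("A1", "B0"), ("A1", "B1"), ("A3", np.nan)],
    names=["ia", "ib"],
)
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=mi, columns=["bar", "foo"])
df2 = df.copy()
df2.index = df2.index.set_levels(["B1", "B0"], level=1)

before = index_nan_counts(df2.index)
after = index_nan_counts(df2.sort_index().index)
# If sorting leaves the index labels untouched, the two dicts are equal.
print(before, after)
```

A difference between `before` and `after` would mean that sorting replaced a missing index entry with a real label, which is the behaviour reported here.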

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: 4.1.0
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
