Removing Unused Levels in MultiIndex with NaN values corrupts Index #18616
Labels: Duplicate Report, Missing-data, MultiIndex
Minimal, Complete, and Verifiable Example
Below is an MCVE of the behavior:
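The original snippet did not survive in this copy of the report. A minimal sketch of a comparable setup is given here; the column names and all values except `300.0` are hypothetical, chosen only so that one float level contains `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the value 300.0 is taken from the report.
df = pd.DataFrame(
    {
        "a": [100.0, 200.0, np.nan, 300.0],  # float column containing NaN
        "b": ["w", "x", "y", "z"],
        "value": [1, 2, 3, 4],
    }
)

# Build a two-level MultiIndex, then refresh its levels.
original = df.set_index(["a", "b"]).index
refreshed = original.remove_unused_levels()

# The documentation states that the result should be .equals() to the original.
print(refreshed.equals(original))
```

The report describes this comparison failing on the affected version; on a release where the behavior is fixed it is expected to succeed.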
Problem description
Using the method `remove_unused_levels` on a MultiIndex containing `NaN` creates a new MultiIndex that is not equal to the original, although the documentation says the two should be equal. This is why I suspect it is a bug.
Float Index
A single `Index` uses `NaN` as a modality. A `MultiIndex` does not; it uses a negative modality index (`-1`) instead.
MultiIndex corruption
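As background for this section, the two representations of `NaN` contrasted above can be inspected directly. This is a minimal sketch, assuming pandas 0.24 or later (where the integer positions are exposed as `codes`; earlier releases called them `labels`) and made-up values apart from `300.0`:

```python
import numpy as np
import pandas as pd

# A plain float Index keeps NaN as one of its values.
flat = pd.Index([100.0, np.nan, 300.0])
print(flat.isna())    # boolean mask marking the NaN position

# A MultiIndex keeps NaN out of the level values and marks it with code -1.
mi = pd.MultiIndex.from_arrays([[100.0, np.nan, 300.0], ["x", "y", "z"]])
print(mi.levels[0])   # level values, without NaN
print(mi.codes[0])    # integer codes; -1 stands for the missing value
```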
When refreshed, `NaN` values point to a copy of the last `float` modality (here `300.0`) of the level. This leads to a kind of corrupted index, because those auto-filled values do not have any meaning.
As a consequence, the two indexes are not equal, which contradicts the documentation.
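One way to check for this kind of corruption is to compare the missing-value positions before and after the refresh. The helper below, `nan_codes_preserved`, is hypothetical (it is not part of pandas) and assumes pandas 0.24 or later for the `codes` attribute:

```python
import numpy as np
import pandas as pd


def nan_codes_preserved(before: pd.MultiIndex, after: pd.MultiIndex) -> bool:
    """Hypothetical check: missing positions keep code -1 and levels stay unique."""
    for level in range(before.nlevels):
        missing_before = np.asarray(before.codes[level]) == -1
        missing_after = np.asarray(after.codes[level]) == -1
        # A NaN that now points at a real level value changes this mask.
        if not np.array_equal(missing_before, missing_after):
            return False
        # A duplicated level value (such as a second 300.0) is also a red flag.
        if not after.levels[level].is_unique:
            return False
    return True


mi = pd.MultiIndex.from_arrays([[100.0, np.nan, 300.0], ["x", "y", "z"]])
ok = nan_codes_preserved(mi, mi.remove_unused_levels())
print(ok)
```

On a version showing the behavior described here, the check would be expected to fail, because the `NaN` position ends up with a non-negative code.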
Even worse, the original value (`300.0`) is not referenced anymore, so it becomes an unused value/modality in the newly generated index. Applying the method a second time confirms this.
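The double application can be sketched as follows. The frame is hypothetical (only `300.0` comes from the report), and one row is dropped so that the first level really has an unused value to remove. If the method behaves as documented, the second call should leave the levels unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [100.0, 200.0, np.nan, 300.0],
        "b": ["w", "x", "y", "z"],
        "value": [1, 2, 3, 4],
    }
).set_index(["a", "b"])

# Drop the 200.0 row; the level still lists 200.0 until it is refreshed.
subset = df.iloc[[0, 2, 3]]

once = subset.index.remove_unused_levels()
twice = once.remove_unused_levels()

print(once.levels[0])
print(twice.levels[0])
# A second call that changes the levels again means the first call left
# an unused value behind.
print(once.levels[0].equals(twice.levels[0]))
```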
Expected Output
I believe the expected output of `set_index` and `remove_unused_levels` should be:
The problem also occurs when rows are removed from the DataFrame, in which case it makes sense to use the method `remove_unused_levels` to clean up the index. Anyway, when building the MCVE I found that it was working on the whole Index, whatever the level order.
Pandas Versions