ENH: improve partial key indexing performance #38650
Labels
Enhancement
Indexing
Related to indexing on series/frames, not to indexes themselves
MultiIndex
Performance
Memory or execution speed performance
Indexing for entries in a series or dataframe with a multi-index has dramatically worse performance when using partial keys. For example, for a series with a 3-level multi-index, `series[x]` is dramatically slower than `series[(x, y, z)]`. Please see the session below for a minimal reproduction:

From looking at the code, it seems that
`BaseMultiIndexCodesEngine` (pandas/_libs/index.pyx) essentially creates a hash table (a `libindex.UInt64Engine`) with entries for each full key. This makes full-key lookups fast. Meanwhile, for partial key lookups on a sorted index, `MultiIndex._get_level_indexer` will do two binary searches for each provided level of the key.

Instead of a flattened-index approach (again, via `libindex.UInt64Engine`), it would be excellent if pandas implemented nested indexes (nested hash tables, if you will) for this. This would have a small impact on the performance of full key lookups (current lookup complexity is O(1), new complexity would be O(number_of_levels)). However, partial key lookups would greatly benefit, going from O(log(size_of_series)) to O(number_of_levels_in_key).
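To make the nested-index idea concrete, here is a rough pure-Python sketch. It is only an illustration of the data structure and does not reflect how pandas is implemented. The names `build_nested_index` and `lookup` are hypothetical, and it assumes every key has the same number of levels.

```python
def build_nested_index(keys):
    """Build one dict per level; leaves are lists of row positions.

    Hypothetical helper, not a pandas API. Assumes all keys are tuples
    with the same number of levels. Duplicate keys are allowed.
    """
    root = {}
    for pos, key in enumerate(keys):
        node = root
        for label in key[:-1]:
            node = node.setdefault(label, {})
        node.setdefault(key[-1], []).append(pos)
    return root


def lookup(index, partial_key):
    """Return sorted positions of all rows whose key starts with partial_key."""
    node = index
    for label in partial_key:
        # One hash lookup per provided level; KeyError if the label is absent.
        node = node[label]
    # Gather every position stored beneath the node that was reached.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, list):
            positions.extend(current)
        else:
            stack.extend(current.values())
    return sorted(positions)


keys = [("a", 1, "x"), ("a", 1, "y"), ("a", 2, "x"), ("b", 1, "x")]
index = build_nested_index(keys)

partial = lookup(index, ("a",))        # rows matching the first level only
deeper = lookup(index, ("a", 1))       # rows matching the first two levels
full = lookup(index, ("b", 1, "x"))    # full-key lookup
```

In this sketch, reaching the matching subtree costs one hash lookup per level in the key, but collecting the positions is still proportional to the number of matching rows. For a sorted index, each node could presumably store a `(start, stop)` range instead, so that a partial lookup returns a slice without walking the subtree.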