Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: improve partial key indexing performance #38650

Open
DiegoAlbertoTorres opened this issue Dec 23, 2020 · 3 comments
Open

ENH: improve partial key indexing performance #38650

DiegoAlbertoTorres opened this issue Dec 23, 2020 · 3 comments
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance

Comments

@DiegoAlbertoTorres
Copy link
Contributor

Indexing for entries in a series or dataframe with a multi-index has dramatically worse performance when using partial keys.

For example, for a series with a 3-level multi-index, series[x] is dramatically slower than series[(x,y,z)]. Please see the session below for a minimal reproduction:

from datetime import datetime
import pandas as pd
import numpy as np
import timeit


dates = pd.date_range('1997-01-01', '2020-12-31')
cats = list('abcdefghijklmnop')

multi = pd.MultiIndex.from_product([dates, cats, cats])

series = pd.Series(np.random.rand(len(multi)), index=multi)
date = datetime(year=2020, month=1, day=1)

# generate cache
series.loc[(date, 'a', 'a')]

repeats = 1000
len(series)
## -- End pasted text --
Out[1]: 2244096

In [2]: # Performance of indexing with full keys

In [3]: timeit.timeit(lambda: series.loc[(date, 'a', 'a')], number=repeats)
Out[3]: 0.08493588299825205

In [4]: # Performance of indexing with partial keys

In [5]: timeit.timeit(lambda: series.loc[date], number=repeats)
Out[5]: 4.771169905001443

From looking at the code, it seems that BaseMultiIndexCodesEngine (pandas/_libs/index.pyx) essentially creates a hash table (a libindex.UInt64Engine) with entries for each full key. This makes full-key lookups fast. Meanwhile, for partial key lookups on a sorted index, MultiIndex._get_level_indexer will do two binary searches for each provided level of the key.

Instead of a flattened-index approach (again, via libindex.UInt64Engine), it would be excellent if Pandas implemented nested indeces (nested hash tables, if you will) for this. This would have a small impact on the performance of full key lookups (current lookup complexity is O(1), new complexity would be O(number_of_levels)). However, partial key lookups would greatly benefit, going from O(log(size_of_series)) to O(number_of_levels_in_key).

@DiegoAlbertoTorres DiegoAlbertoTorres added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 23, 2020
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 23, 2020
@jreback jreback added this to the Contributions Welcome milestone Dec 23, 2020
@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

xref #20803
possible implementation via KDTrees: #9365

@mroeschke
Copy link
Member

Here are the graphiz profiling of the two calls. It appears that most of the bottleneck is in the binary search calls for the partial key lookup so leveraging a new data structure would probably yield a performance improvement.

Full key
full_key

Part key
part_key

@jreback jreback modified the milestones: Contributions Welcome, 1.5 Feb 11, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@jbrockmendel
Copy link
Member

The full-key timing is almost exactly unchanged (.083s), but the partial-indexing timing is down from 4.77s to .289s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants