New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New engine for MultiIndex? #18519

Closed
toobaz opened this Issue Nov 27, 2017 · 3 comments

Comments

Projects
None yet
3 participants
@toobaz
Member

toobaz commented Nov 27, 2017

Currently, MultiIndex.get_loc() and MultiIndex.get_indexer() both rely on an _engine which is either a MultiIndexObjectEngine or a MultiIndexHashEngine: but both of these are thin layers over the flat ObjectEngine. This means that the actual structure of labels and levels is completely discarded (except e.g. for partial indexing, see _get_level_indexer()).

In principle, a completely different scheme could be used:

  • first look for the key elements in levels, and find the corresponding code
  • then look for the code in the levels

In most cases, the second part should be the computationally expensive one. It would consist in running nlevels searches in arrays of dtype=int (the .labels) rather than (as it is now) one search in an object array in which each element is actually a tuple of nlevels elements. My guess is that thanks to vectorization the former should be much faster than the latter.

Moreover (and maybe more importantly), with the current engine fixing a bug such as #18485 is a nightmare. And the same applies to

In [2]: (4, True) in pd.MultiIndex.from_tuples([(4, 1)])
Out[2]: True

and probably others. This is because even though levels are not mixed, the elements are compared as objects.

One caveat is that the single levels would be very often non-unique, and I'm not sure what is the impact of this with the current implementation of hash tables.

@chris-b1

This comment has been minimized.

Contributor

chris-b1 commented Nov 27, 2017

See also previous discussion in #1752.

@toobaz

This comment has been minimized.

Member

toobaz commented Nov 27, 2017

Aha, and #16324, which was a response to #16319, which was a reaction to #15245, which was a response to #13904, which was a follow-up to the issue @chris-b1 mentioned.

And while the MultiIndexHashEngine is not what I described above (it's probably better), indeed only now I see that

In [2]: mi = pd.MultiIndex.from_product([[True, False], range(1, 10000)])

In [3]: (1, 3) in mi
Out[3]: False

In [4]: mi = pd.MultiIndex.from_product([[1, np.nan], range(1, 10000)])

In [5]: (np.nan, 3) in mi
Out[5]: True

In [6]: mi = pd.MultiIndex.from_product([[1, np.nan], range(1, 10)])

In [7]: (np.nan, 3) in mi
Out[7]: False

that is, large indexes are free from #18485 and friends.

@toobaz

This comment has been minimized.

Member

toobaz commented Nov 27, 2017

OK, after reviewing this evidence, I guess we probably don't need a new engine: but I think we should (be able to) improve the MultiIndexEngine to the point that we can stop relying on the ObjectEngine for small MI, by having it hash codes rather than values.

@toobaz toobaz referenced this issue Jan 4, 2018

Merged

REF: codes-based MultiIndex engine #19074

3 of 3 tasks complete

@jreback jreback added this to the 0.23.0 milestone Jan 17, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment