PERF: improved performance of small multiindexes #16324

Merged (1 commit) on May 11, 2017

Conversation

@jreback (Contributor) commented May 11, 2017

closes #16319

When I did #15245 the goal was twofold:

  • Reduce the hashtable creation time for a MultiIndex
  • Reduce the memory footprint

original impl

These are interrelated: the original implementation of a MultiIndex constructed a tuple for each value (the length of the tuple is the number of levels) and then built a PyObject-based hashtable over those tuples. This is pretty slow and has the side effect of blowing up memory, as a large list of tuples is quite heavyweight.

Indexing is then simple: hash the passed tuple and look it up in the hashtable. This is quite fast.
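
For illustration, a minimal sketch of that idea, using a plain Python dict as a stand-in for the actual PyObject hashtable (not the real engine code):

import numpy as np
import pandas as pd

# one Python tuple per row -- this materialized list is the heavyweight part
mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(20), list('ABCD')])
tuples = list(mi.values)
mapping = {t: i for i, t in enumerate(tuples)}  # object-based hashtable

# a lookup is then a single hash + probe
assert mapping[(999, 19, 'C')] == mi.get_loc((999, 19, 'C'))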

hashtable impl

The new implementation instead hashes the levels themselves and applies transforms to preserve ordering. This is much faster than the original implementation because each level has a uniform dtype, so the hashes can be constructed in a vectorized way. It is both fast and memory efficient.

Looking up a value involves constructing the hash of the passed tuple. This has some overhead: for implementation simplicity it is partly Python code and takes a few steps, so it could be improved. The upshot is that hashing the single value to be looked up is slower than in the original implementation.
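
A rough sketch of the hash-based idea (using the public hashing helper, not the actual engine internals): hash every row in one vectorized pass over the levels, then map uint64 hashes to positions instead of mapping Python tuples to positions.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(20), list('ABCD')])
row_hashes = pd.util.hash_pandas_object(mi).values   # one uint64 per row, vectorized
table = {h: i for i, h in enumerate(row_hashes)}     # stand-in for the C hashtable

# a single lookup has to hash its one tuple the same way -- this scalar path is
# the slower part described above (the real engine also re-checks the values to
# guard against hash collisions)
key = pd.MultiIndex.from_tuples([(999, 19, 'C')])
key_hash = pd.util.hash_pandas_object(key).values[0]
assert table[key_hash] == mi.get_loc((999, 19, 'C'))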

Naive timing of an indexing operation on a fresh DataFrame is very misleading. %timeit runs many iterations: the first one pays the hashtable construction cost, while subsequent ones hit the cached engine. Because %timeit reports the minimum, the construction cost is excluded entirely. What we actually care about is the total: construction plus the sum of the access times.

We always have to construct the hashtable in order to index at all, so its cost is more than relevant; in fact it tends to dominate.
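
A fairer way to measure this (a hypothetical helper, not code from this PR) is to start from a fresh index each run, so the engine construction is counted once, and then add the cost of all subsequent lookups:

import string
import time

import numpy as np
import pandas as pd

def total_indexing_cost(n_lookups=10):
    # fresh MultiIndex, so no engine is cached yet
    mi = pd.MultiIndex.from_product(
        [np.arange(1000), np.arange(20), list(string.ascii_letters)])
    start = time.perf_counter()
    for _ in range(n_lookups):
        mi.get_loc((999, 19, 'E'))  # first call builds the hashtable, the rest hit it
    return time.perf_counter() - start

print(total_indexing_cost(1), total_indexing_cost(100))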

This PR keeps the original implementation for smaller MultiIndexes. I chose a cutoff of 10,000 elements (below this, the hash-based construction is still slightly slower, by about 1.5x, but that balances against the memory bloat of the object-based engine). Above this cutoff the hash-based engine is vastly better. (If some enterprising soul wants to draw the graph of amortized construction cost, that would be great!)

So we choose the indexing engine at run time. That gives us good performance for smallish indexes and nice scaling for larger hashtables.
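
Conceptually the selection looks like this (an illustrative sketch; the engine names come from this PR's diff, but the helper below is not the actual pandas code):

import numpy as np
import pandas as pd

SIZE_CUTOFF = 10000  # the threshold chosen in this PR

def describe_engine(mi):
    # small indexes keep the tuple/object engine (fastest per lookup),
    # larger ones get the hash-based engine (cheap to build, memory efficient)
    if len(mi) > SIZE_CUTOFF:
        return 'MultiIndexEngine (hash-based)'
    return 'MultiIndexObjectEngine (tuple-based)'

small = pd.MultiIndex.from_product([np.arange(100), list('AB')])
large = pd.MultiIndex.from_product([np.arange(1000), np.arange(20)])
print(describe_engine(small))
print(describe_engine(large))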

setup

import string

import numpy as np
import pandas as pd

mi_large = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(20), list(string.ascii_letters)],
    names=['one', 'two', 'three'])
mi_med = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(10), list('A')],
    names=['one', 'two', 'three'])
mi_small = pd.MultiIndex.from_product(
    [np.arange(100), list('A'), list('A')],
    names=['one', 'two', 'three'])

These timings show the index construction cost (first execution) and the marginal cost after that (second execution), for the small, medium, and large indexes.

v0.19.2

In [2]: %time mi_small.get_loc((99, 'A'))
CPU times: user 281 µs, sys: 90 µs, total: 371 µs
Wall time: 329 µs
Out[2]: slice(99, 100, None)

In [3]: %time mi_small.get_loc((99, 'A'))
CPU times: user 117 µs, sys: 20 µs, total: 137 µs
Wall time: 125 µs
Out[3]: slice(99, 100, None)

In [4]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 3.37 ms, sys: 918 µs, total: 4.29 ms
Wall time: 3.67 ms
Out[4]: 9999

In [5]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 28 µs, sys: 0 ns, total: 28 µs
Wall time: 31 µs
Out[5]: 9999

In [6]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 875 ms, sys: 67.9 ms, total: 943 ms
Wall time: 943 ms
Out[6]: 1039978

In [7]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 31 µs, sys: 0 ns, total: 31 µs
Wall time: 33.9 µs
Out[7]: 1039978

PR

In [2]: %time mi_small.get_loc((99, 'A'))
CPU times: user 254 µs, sys: 37 µs, total: 291 µs
Wall time: 260 µs
Out[2]: slice(99, 100, None)

In [3]: %time mi_small.get_loc((99, 'A'))
CPU times: user 129 µs, sys: 17 µs, total: 146 µs
Wall time: 135 µs
Out[3]: slice(99, 100, None)

In [4]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 3.31 ms, sys: 1.12 ms, total: 4.43 ms
Wall time: 5.75 ms
Out[4]: 9999

In [5]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 32 µs, sys: 1 µs, total: 33 µs
Wall time: 35.8 µs
Out[5]: 9999

In [6]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 130 ms, sys: 40.3 ms, total: 170 ms
Wall time: 171 ms
Out[6]: 1039978

In [7]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 3.44 ms, sys: 348 µs, total: 3.79 ms
Wall time: 3.53 ms
Out[7]: 1039978

@jreback added the MultiIndex and Performance (Memory or execution speed) labels May 11, 2017
@jreback added this to the 0.20.2 milestone May 11, 2017
@jorisvandenbossche (Member) commented May 11, 2017

@jreback Did you post the wrong timings? As you show them, it is actually slower for the big one, and the same for the smaller one.

@jorisvandenbossche (Member):

Or are the PR timings being compared against 0.19.2? (I assumed they were against master or 0.20.1.) In that case, with this PR we are still considerably slower than 0.19.2 (after the first indexing operation and hashtable creation).

When I was testing this yesterday, I also assumed there was some size-based trade-off between the new hashtable-based MultiIndexEngine and the ObjectEngine. So I tried to find a good cut-off by varying the MI size, but I actually got systematically faster results with the ObjectEngine for MIs up to 1 million elements.
I have now repeated that exercise using this PR, fixing the engine to each of the two implementations instead of letting it depend on the size:

In [16]: pd.MultiIndex.object_engine = False

In [17]: ts = []
    ...: 
    ...: for n in [3, 10, 30, 100, 300, 1000]:
    ...:     idx = pd.MultiIndex.from_product([np.arange(n), np.arange(n)])
    ...:     key = idx.values[0]
    ...:     t = %timeit -o idx.get_loc(key)
    ...:     ts.append(t)
    ...: 
    ...: 
100 loops, best of 3: 1.86 ms per loop
1000 loops, best of 3: 1.98 ms per loop
100 loops, best of 3: 2.28 ms per loop
100 loops, best of 3: 1.97 ms per loop
The slowest run took 5.58 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.93 ms per loop
The slowest run took 89.39 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 2.23 ms per loop

In [18]: pd.MultiIndex.object_engine = True

In [19]: ts2 = []
    ...: 
    ...: for n in [3, 10, 30, 100, 300, 1000]:
    ...:     idx = pd.MultiIndex.from_product([np.arange(n), np.arange(n)])
    ...:     key = idx.values[0]
    ...:     t = %timeit -o idx.get_loc(key)
    ...:     ts2.append(t)
    ...: 
    ...: 
The slowest run took 52.39 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.07 µs per loop
The slowest run took 45.32 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.18 µs per loop
The slowest run took 65.73 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.21 µs per loop
The slowest run took 215.95 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.12 µs per loop
The slowest run took 2562.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.12 µs per loop
The slowest run took 21262.27 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.51 µs per loop

So the engine creation gets a lot slower (and that is important as well, of course), but after that the actual indexing operation is still consistently faster when using the ObjectEngine, even for len(self) > 1000.
The same holds for the example you showed above. And the strange thing is that in the original PR #15245 you actually showed that the timing for that benchmark improved considerably.

@jorisvandenbossche (Member):

Regarding that last remark about the asv benchmarks: the reason is that they do the indexing timings on a 'cold' index, so the result is dominated by the engine creation rather than the actual get_loc operation. Maybe we should benchmark both.
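
A hypothetical asv-style sketch of what benchmarking both could look like (class and method names are illustrative, and the copy() trick to get a 'cold' engine is an assumption, not an existing pandas benchmark):

import numpy as np
import pandas as pd

class MultiIndexGetLoc(object):

    def setup(self):
        self.mi = pd.MultiIndex.from_product(
            [np.arange(1000), np.arange(20)], names=['one', 'two'])
        self.mi.get_loc((500, 10))  # warm the engine once

    def time_get_loc_warm(self):
        # pure lookup cost: the engine is already built
        self.mi.get_loc((999, 19))

    def time_get_loc_cold(self):
        # approximate the 'cold' cost: a copied index starts without a cached
        # engine, so this pays the hashtable creation (plus a small copy overhead)
        self.mi.copy(deep=True).get_loc((999, 19))

if __name__ == '__main__':
    bench = MultiIndexGetLoc()
    bench.setup()
    bench.time_get_loc_warm()
    bench.time_get_loc_cold()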

codecov bot commented May 11, 2017

Codecov Report

Merging #16324 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #16324      +/-   ##
==========================================
- Coverage   90.37%   90.34%   -0.04%     
==========================================
  Files         161      161              
  Lines       50863    50866       +3     
==========================================
- Hits        45966    45953      -13     
- Misses       4897     4913      +16
Flag       Coverage Δ
#multiple  88.12% <100%> (-0.02%) ⬇️
#single    40.18% <46.15%> (-0.26%) ⬇️

Impacted Files                    Coverage Δ
pandas/core/util/hashing.py       90.35% <100%> (-1.69%) ⬇️
pandas/core/dtypes/dtypes.py      94.92% <100%> (ø) ⬆️
pandas/core/indexes/multi.py      96.56% <100%> (-0.17%) ⬇️
pandas/io/gbq.py                  25% <0%> (-58.34%) ⬇️
pandas/core/common.py             90.68% <0%> (-0.35%) ⬇️
pandas/core/indexes/category.py   98.18% <0%> (-0.31%) ⬇️
pandas/core/frame.py              97.59% <0%> (-0.1%) ⬇️
pandas/core/series.py             94.71% <0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codecov bot commented May 11, 2017

Codecov Report

Merging #16324 into master will decrease coverage by 0.05%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #16324      +/-   ##
==========================================
- Coverage   90.39%   90.34%   -0.06%     
==========================================
  Files         161      161              
  Lines       50863    50866       +3     
==========================================
- Hits        45978    45954      -24     
- Misses       4885     4912      +27
Flag       Coverage Δ
#multiple  88.12% <100%> (-0.04%) ⬇️
#single    40.18% <46.15%> (-0.26%) ⬇️

Impacted Files                     Coverage Δ
pandas/core/indexes/multi.py       96.56% <100%> (-0.17%) ⬇️
pandas/core/util/hashing.py        90.35% <100%> (-1.69%) ⬇️
pandas/core/dtypes/dtypes.py       94.92% <100%> (ø) ⬆️
pandas/io/gbq.py                   25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py      63.23% <0%> (-1.82%) ⬇️
pandas/core/indexes/category.py    98.18% <0%> (-0.31%) ⬇️
pandas/core/frame.py               97.59% <0%> (-0.1%) ⬇️
pandas/core/series.py              94.71% <0%> (-0.1%) ⬇️


@jreback (Contributor, Author) commented May 11, 2017

@jorisvandenbossche please read my implementation notes. The timings are correct; your measurement is incorrect.

So the engine creation gets a lot slower (and that is important as well, of course), but after that the …

This is by far the most important part of the cost.

@jorisvandenbossche (Member):

I had read your comments in the classes, but I don't see an explanation there for my comments above. So please clarify if you can.

To be clear, I am not saying the ObjectEngine (the old implementation) is necessarily better (and thanks for this PR!). It is just that the trade-off between the performance of engine creation and of the actual indexing operation is a difficult one, because it depends on how many indexing operations you do. So it will always be hard to pick a good cut-off size between the two, because it depends on the use case (that said, we should simply choose a sensible one).

If you do many indexing operations (usually not good usage, of course, but not always avoidable), the ObjectEngine is still a lot faster than the MultiIndexEngine even for large indexes, because the faster get_loc then outweighs the slower initial creation.
If you do only a few indexing operations, the MultiIndexEngine, thanks to its much faster creation, can indeed be a big improvement.
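
To put that trade-off in rough numbers (a back-of-the-envelope model, not a measurement): if an engine costs T_build to create and T_lookup per get_loc, then N lookups cost T_build + N * T_lookup in total, so the ObjectEngine wins once N > (T_build_object - T_build_hash) / (T_lookup_hash - T_lookup_object).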

@jreback (Contributor, Author) commented May 11, 2017

@jorisvandenbossche your comments indicate you did not read my update at the top.

If you do many indexing operations (usually not good usage, of course, but not always avoidable), the ObjectEngine is still a lot faster than the MultiIndexEngine even for large indexes, because the faster get_loc then outweighs the slower initial creation.

prove it

@jreback (Contributor, Author) commented May 11, 2017

To be clear, I am not saying the ObjectEngine (the old implementation) is necessarily better (and thanks for this PR!). It is just that the trade-off between the performance of engine creation and of the actual indexing operation is a difficult one, because it depends on how many indexing operations you do. So it will always be hard to pick a good cut-off size between the two, because it depends on the use case (that said, we should simply choose a sensible one).

Sure, there is a trade-off, but above roughly 10,000 elements the hash-based implementation is dramatically better. (Sure, there is a break-even point, and it is probably somewhat larger, maybe 50k or 100k elements, BUT the chosen cutoff does reflect the ballooning memory of the original implementation.)

Simply compare the In [6] timings above.

The hashtable creation cost vastly outweighs everything else. timeit does NOT take this into account (well, it does warn that things are being cached, which is exactly what is happening here).

@jorisvandenbossche (Member) commented May 11, 2017

@jorisvandenbossche your comments indicate you did not read my update at the top.

OK, I thought you meant the comments in the code. GitHub doesn't always refresh itself if you keep the page open, so I only saw it now.
That said, I completely understand what you are saying there (thanks for writing it up, it makes the discussion clearer!), as I think is also clear from my comments. There is a trade-off between index hashtable creation and the actual lookup operation (the timings I showed were only about the second part). One thing I was not considering was the memory aspect, and that was of course also an important reason (maybe the original one) for the change.

But what I want to make clear is that this 10k cutoff can be very detrimental when you do a lot of single indexing operations above that size (and so use the hash-based engine). And you easily end up with many indexing operations, even if you don't do them manually in a loop.
As an example, the combine_first in the original issue report does around 5 get_loc calls for each column in the frame, leading to 72 calls in total for that specific example. Of course, that example falls below the 10k threshold and so would be fixed by this PR.

Maybe the combination of a large MI and many indexing operations is not that common, so maybe we shouldn't care, but if you are in that situation it can make a big difference.

Visual example (code to reproduce in http://nbviewer.jupyter.org/gist/jorisvandenbossche/d39063411b480e1fa825b0cb5c1d56fd); the solid line is the first get_loc (so including the hashtable creation overhead), the dotted lines are subsequent get_loc calls (added to the first):

[figure: cumulative get_loc time vs MultiIndex size for the ObjectEngine and the hash-based MultiIndexEngine]

So with 10 get_loc calls the cutoff is around 10^5, but if you made the same plot with 1000 get_loc calls, the ObjectEngine would be slightly faster even for a 10^7-sized MI (though probably with blown-up memory, as you indicated).

But, as I don't think we want to expose this as an option, we have to choose a cutoff size, and I think the 10k you used is probably fine.
That at least covers the typical case of a normal number of multi-indexed columns (e.g. the case in the reported issue).

@jorisvandenbossche (Member) left a review:

Some other comments apart from our discussion :-)

def time_multiindex_string_get_loc(self):
    self.mistring.get_loc((999, 19, 'Z'))
    self.mi_small.get_loc((99, 'A'))
@jorisvandenbossche (Member):

The index actually has 3 levels (did that change? I think at first it had only 2 levels).
So this is then not a full key, and I suppose it takes a different code path? (Maybe also useful to benchmark, but not what we want in this PR?)

@jreback (Contributor, Author):

fixed

        return super(MultiIndexObjectEngine, self).get_loc(val)


cdef class MultiIndexEngine(MultiIndexObjectEngine):
@jorisvandenbossche (Member):

Not sure it is needed to subclass MultiIndexObjectEngine, as you actually override all the methods it defines. So for code clarity, I would just subclass IndexEngine as before.

@jreback (Contributor, Author):

fixed


# choose our engine based on our size
# the hashing based MultiIndex for larger
# sizes, and the MultiIndexObject for smaller
@jorisvandenbossche (Member):

Maybe refer to the GitHub issue/PR number for the discussion about it?

@jreback (Contributor, Author):

done

@jreback (Contributor, Author) commented May 11, 2017

Yeah, the calls after hashtable construction are basically immaterial once you've paid that cost. Sure, you can always overwhelm it, but that is just user error (and to be honest I'm not sure how much we could do about that). A .get_indexer call does this lookup efficiently for many keys at once.
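
For example (an illustrative snippet, not code from this PR):

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(20), list('ABCD')])
keys = [(0, 0, 'A'), (999, 19, 'D'), (500, 10, 'B')]

# one vectorized call instead of hashing each key separately in a loop
positions = mi.get_indexer(keys)            # integer positions, -1 for missing keys
assert list(positions) == [mi.get_loc(k) for k in keys]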

@jreback (Contributor, Author) commented May 11, 2017

thanks for the review @jorisvandenbossche

Labels: MultiIndex, Performance (Memory or execution speed)
Projects: None yet
Successfully merging this pull request may close these issues:
PERF: regression in MultiIndex get_loc performance
3 participants