PERF: improved performance of small multiindexes #16324

Merged (1 commit) on May 11, 2017

Conversation

@jreback (Contributor) commented May 11, 2017

closes #16319

When I did #15245 the goal was twofold:

  • Reduce the hashtable creation time for a MultiIndex
  • Reduce the memory footprint

original impl

These are interrelated: the original implementation of a MultiIndex constructed a tuple for each value (the length of the tuple is the number of levels) and then built a PyObject-based hashtable over those tuples. This is pretty slow and has the side effect of blowing up memory, as a large list of tuples is quite heavyweight.

Indexing is then simple: hash the passed tuple and look it up in the hashtable. This is quite fast.
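
For illustration, a minimal sketch of that idea, using a plain Python dict as a stand-in for the actual PyObject hashtable (not the real engine code):

import numpy as np
import pandas as pd

# one Python tuple per row -- this materialized list is the heavyweight part
mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(20), list('ABCD')])
tuples = list(mi.values)
mapping = {t: i for i, t in enumerate(tuples)}  # object-based hashtable

# a lookup is then a single hash + probe
assert mapping[(999, 19, 'C')] == mi.get_loc((999, 19, 'C'))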

hashtable impl

The new implementation instead hashes the levels themselves and applies transforms to preserve ordering. This is much faster than the original implementation because each level has a uniform dtype, so the hashes can be constructed in a vectorized way. It is both fast and memory efficient.

Looking up a value involves constructing the hash of the passed tuple. This has some overhead: for implementation simplicity it is partly Python code and takes a few steps, so it could be improved. The upshot is that hashing the single value to be looked up is slower than in the original implementation.
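
A rough sketch of the hash-based idea (using the public hashing helper, not the actual engine internals): hash every row in one vectorized pass over the levels, then map uint64 hashes to positions instead of mapping Python tuples to positions.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(20), list('ABCD')])
row_hashes = pd.util.hash_pandas_object(mi).values   # one uint64 per row, vectorized
table = {h: i for i, h in enumerate(row_hashes)}     # stand-in for the C hashtable

# a single lookup has to hash its one tuple the same way -- this scalar path is
# the slower part described above (the real engine also re-checks the values to
# guard against hash collisions)
key = pd.MultiIndex.from_tuples([(999, 19, 'C')])
key_hash = pd.util.hash_pandas_object(key).values[0]
assert table[key_hash] == mi.get_loc((999, 19, 'C'))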

Naive timing of an indexing operation on a fresh DataFrame is very misleading. %timeit runs many iterations: the first one pays the hashtable construction cost, while subsequent ones hit the cached engine. Because %timeit reports the minimum, the construction cost is excluded entirely. What we actually care about is the total: construction plus the sum of the access times.

We always have to construct the hashtable in order to index at all, so its cost is more than relevant; in fact it tends to dominate.
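
A fairer way to measure this (a hypothetical helper, not code from this PR) is to start from a fresh index each run, so the engine construction is counted once, and then add the cost of all subsequent lookups:

import string
import time

import numpy as np
import pandas as pd

def total_indexing_cost(n_lookups=10):
    # fresh MultiIndex, so no engine is cached yet
    mi = pd.MultiIndex.from_product(
        [np.arange(1000), np.arange(20), list(string.ascii_letters)])
    start = time.perf_counter()
    for _ in range(n_lookups):
        mi.get_loc((999, 19, 'E'))  # first call builds the hashtable, the rest hit it
    return time.perf_counter() - start

print(total_indexing_cost(1), total_indexing_cost(100))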

This PR keeps the original implementation for smaller MultiIndexes. I chose a cutoff of 10,000 elements (below this, the hash-based construction is still slightly slower, by about 1.5x, but that balances against the memory bloat of the object-based engine). Above this cutoff the hash-based engine is vastly better. (If some enterprising soul wants to draw the graph of amortized construction cost, that would be great!)

So we choose the indexing engine at run time. That gives us good performance for smallish indexes and nice scaling for larger hashtables.
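
Conceptually the selection looks like this (an illustrative sketch; the engine names come from this PR's diff, but the helper below is not the actual pandas code):

import numpy as np
import pandas as pd

SIZE_CUTOFF = 10000  # the threshold chosen in this PR

def describe_engine(mi):
    # small indexes keep the tuple/object engine (fastest per lookup),
    # larger ones get the hash-based engine (cheap to build, memory efficient)
    if len(mi) > SIZE_CUTOFF:
        return 'MultiIndexEngine (hash-based)'
    return 'MultiIndexObjectEngine (tuple-based)'

small = pd.MultiIndex.from_product([np.arange(100), list('AB')])
large = pd.MultiIndex.from_product([np.arange(1000), np.arange(20)])
print(describe_engine(small))
print(describe_engine(large))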

setup

import string

import numpy as np
import pandas as pd

mi_large = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(20), list(string.ascii_letters)],
    names=['one', 'two', 'three'])
mi_med = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(10), list('A')],
    names=['one', 'two', 'three'])
mi_small = pd.MultiIndex.from_product(
    [np.arange(100), list('A'), list('A')],
    names=['one', 'two', 'three'])

These timings show the index construction cost (first execution) and the marginal cost after that (second execution), for the small, medium, and large indexes.

v0.19.2

In [2]: %time mi_small.get_loc((99, 'A'))
CPU times: user 281 µs, sys: 90 µs, total: 371 µs
Wall time: 329 µs
Out[2]: slice(99, 100, None)

In [3]: %time mi_small.get_loc((99, 'A'))
CPU times: user 117 µs, sys: 20 µs, total: 137 µs
Wall time: 125 µs
Out[3]: slice(99, 100, None)

In [4]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 3.37 ms, sys: 918 µs, total: 4.29 ms
Wall time: 3.67 ms
Out[4]: 9999

In [5]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 28 µs, sys: 0 ns, total: 28 µs
Wall time: 31 µs
Out[5]: 9999

In [6]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 875 ms, sys: 67.9 ms, total: 943 ms
Wall time: 943 ms
Out[6]: 1039978

In [7]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 31 µs, sys: 0 ns, total: 31 µs
Wall time: 33.9 µs
Out[7]: 1039978

PR

In [2]: %time mi_small.get_loc((99, 'A'))
CPU times: user 254 µs, sys: 37 µs, total: 291 µs
Wall time: 260 µs
Out[2]: slice(99, 100, None)

In [3]: %time mi_small.get_loc((99, 'A'))
CPU times: user 129 µs, sys: 17 µs, total: 146 µs
Wall time: 135 µs
Out[3]: slice(99, 100, None)

In [4]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 3.31 ms, sys: 1.12 ms, total: 4.43 ms
Wall time: 5.75 ms
Out[4]: 9999

In [5]: %time mi_med.get_loc((999, 9, 'A'))
CPU times: user 32 µs, sys: 1 µs, total: 33 µs
Wall time: 35.8 µs
Out[5]: 9999

In [6]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 130 ms, sys: 40.3 ms, total: 170 ms
Wall time: 171 ms
Out[6]: 1039978

In [7]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 3.44 ms, sys: 348 µs, total: 3.79 ms
Wall time: 3.53 ms
Out[7]: 1039978

@jreback added the MultiIndex and Performance (Memory or execution speed) labels May 11, 2017
@jreback added this to the 0.20.2 milestone May 11, 2017
@jorisvandenbossche (Member) commented May 11, 2017

@jreback Did you post the wrong timings? As you show them, it is actually slower for the big one, and the same for the smaller one.

@jorisvandenbossche (Member):

Or are the PR timings being compared against 0.19.2? (I assumed they were against master or 0.20.1.) In that case, with this PR we are still considerably slower than 0.19.2 (after the first indexing operation and hashtable creation).

When I was testing this yesterday, I also assumed there was some size-based trade-off between the new hashtable-based MultiIndexEngine and the ObjectEngine. So I tried to find a good cut-off by varying the MI size, but I actually got systematically faster results with the ObjectEngine for MIs up to 1 million elements.
I have now repeated that exercise using this PR, fixing the engine to each of the two implementations instead of letting it depend on the size:

In [16]: pd.MultiIndex.object_engine = False

In [17]: ts = []
    ...: 
    ...: for n in [3, 10, 30, 100, 300, 1000]:
    ...:     idx = pd.MultiIndex.from_product([np.arange(n), np.arange(n)])
    ...:     key = idx.values[0]
    ...:     t = %timeit -o idx.get_loc(key)
    ...:     ts.append(t)
    ...: 
    ...: 
100 loops, best of 3: 1.86 ms per loop
1000 loops, best of 3: 1.98 ms per loop
100 loops, best of 3: 2.28 ms per loop
100 loops, best of 3: 1.97 ms per loop
The slowest run took 5.58 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.93 ms per loop
The slowest run took 89.39 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 2.23 ms per loop

In [18]: pd.MultiIndex.object_engine = True

In [19]: ts2 = []
    ...: 
    ...: for n in [3, 10, 30, 100, 300, 1000]:
    ...:     idx = pd.MultiIndex.from_product([np.arange(n), np.arange(n)])
    ...:     key = idx.values[0]
    ...:     t = %timeit -o idx.get_loc(key)
    ...:     ts2.append(t)
    ...: 
    ...: 
The slowest run took 52.39 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.07 µs per loop
The slowest run took 45.32 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.18 µs per loop
The slowest run took 65.73 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.21 µs per loop
The slowest run took 215.95 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.12 µs per loop
The slowest run took 2562.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.12 µs per loop
The slowest run took 21262.27 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.51 µs per loop

So the engine creation gets a lot slower (and that is important as well, of course), but after that the actual indexing operation is still consistently faster when using the ObjectEngine, even for len(self) > 1000.
The same holds for the example you showed above. And the strange thing is that in the original PR #15245 you actually showed that the timing for that benchmark improved considerably.

@jorisvandenbossche (Member):

Regarding that last remark about the asv benchmarks: the reason is that they do the indexing timings on a 'cold' index, so the result is dominated by the engine creation rather than the actual get_loc operation. Maybe we should benchmark both.
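
A hypothetical asv-style sketch of what benchmarking both could look like (class and method names are illustrative, and the copy() trick to get a 'cold' engine is an assumption, not an existing pandas benchmark):

import numpy as np
import pandas as pd

class MultiIndexGetLoc(object):

    def setup(self):
        self.mi = pd.MultiIndex.from_product(
            [np.arange(1000), np.arange(20)], names=['one', 'two'])
        self.mi.get_loc((500, 10))  # warm the engine once

    def time_get_loc_warm(self):
        # pure lookup cost: the engine is already built
        self.mi.get_loc((999, 19))

    def time_get_loc_cold(self):
        # approximate the 'cold' cost: a copied index starts without a cached
        # engine, so this pays the hashtable creation (plus a small copy overhead)
        self.mi.copy(deep=True).get_loc((999, 19))

if __name__ == '__main__':
    bench = MultiIndexGetLoc()
    bench.setup()
    bench.time_get_loc_warm()
    bench.time_get_loc_cold()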

codecov bot commented May 11, 2017

Codecov Report

Merging #16324 into master will decrease coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #16324      +/-   ##
==========================================
- Coverage   90.37%   90.34%   -0.04%     
==========================================
  Files         161      161              
  Lines       50863    50866       +3     
==========================================
- Hits        45966    45953      -13     
- Misses       4897     4913      +16
Flag       Coverage Δ
#multiple  88.12% <100%> (-0.02%) ⬇️
#single    40.18% <46.15%> (-0.26%) ⬇️

Impacted Files                    Coverage Δ
pandas/core/util/hashing.py       90.35% <100%> (-1.69%) ⬇️
pandas/core/dtypes/dtypes.py      94.92% <100%> (ø) ⬆️
pandas/core/indexes/multi.py      96.56% <100%> (-0.17%) ⬇️
pandas/io/gbq.py                  25% <0%> (-58.34%) ⬇️
pandas/core/common.py             90.68% <0%> (-0.35%) ⬇️
pandas/core/indexes/category.py   98.18% <0%> (-0.31%) ⬇️
pandas/core/frame.py              97.59% <0%> (-0.1%) ⬇️
pandas/core/series.py             94.71% <0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codecov bot commented May 11, 2017

Codecov Report

Merging #16324 into master will decrease coverage by 0.05%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #16324      +/-   ##
==========================================
- Coverage   90.39%   90.34%   -0.06%     
==========================================
  Files         161      161              
  Lines       50863    50866       +3     
==========================================
- Hits        45978    45954      -24     
- Misses       4885     4912      +27
Flag       Coverage Δ
#multiple  88.12% <100%> (-0.04%) ⬇️
#single    40.18% <46.15%> (-0.26%) ⬇️

Impacted Files                     Coverage Δ
pandas/core/indexes/multi.py       96.56% <100%> (-0.17%) ⬇️
pandas/core/util/hashing.py        90.35% <100%> (-1.69%) ⬇️
pandas/core/dtypes/dtypes.py       94.92% <100%> (ø) ⬆️
pandas/io/gbq.py                   25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py      63.23% <0%> (-1.82%) ⬇️
pandas/core/indexes/category.py    98.18% <0%> (-0.31%) ⬇️
pandas/core/frame.py               97.59% <0%> (-0.1%) ⬇️
pandas/core/series.py              94.71% <0%> (-0.1%) ⬇️


@jreback (Contributor, Author) commented May 11, 2017

@jorisvandenbossche please read my implementation notes. The timings are correct; your measurement is incorrect.

So the engine creation gets a lot slower (and that is important as well, of course), but after that the …

This is by far the most important part of the cost.

@jorisvandenbossche (Member):

I had read your comments in the classes, but I don't see an explanation there for my comments above. So please clarify if you can.

To be clear, I am not saying the ObjectEngine (the old implementation) is necessarily better (and thanks for this PR!). It is just that the trade-off between the performance of engine creation and of the actual indexing operation is a difficult one, because it depends on how many indexing operations you do. So it will always be hard to pick a good cut-off size between the two, because it depends on the use case (that said, we should simply choose a sensible one).

If you do many indexing operations (usually not good usage, of course, but not always avoidable), the ObjectEngine is still a lot faster than the MultiIndexEngine even for large indexes, because the faster get_loc then outweighs the slower initial creation.
If you do only a few indexing operations, the MultiIndexEngine, thanks to its much faster creation, can indeed be a big improvement.
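
To put that trade-off in rough numbers (a back-of-the-envelope model, not a measurement): if an engine costs T_build to create and T_lookup per get_loc, then N lookups cost T_build + N * T_lookup in total, so the ObjectEngine wins once N > (T_build_object - T_build_hash) / (T_lookup_hash - T_lookup_object).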

@jreback (Contributor, Author) commented May 11, 2017

@jorisvandenbossche your comments indicate you did not read my update at the top.

If you do many indexing operations (usually not good usage, of course, but not always avoidable), the ObjectEngine is still a lot faster than the MultiIndexEngine even for large indexes, because the faster get_loc then outweighs the slower initial creation.

prove it

@jreback (Contributor, Author) commented May 11, 2017

To be clear, I am not saying the ObjectEngine (the old implementation) is necessarily better (and thanks for this PR!). It is just that the trade-off between the performance of engine creation and of the actual indexing operation is a difficult one, because it depends on how many indexing operations you do. So it will always be hard to pick a good cut-off size between the two, because it depends on the use case (that said, we should simply choose a sensible one).

Sure, there is a trade-off, but above roughly 10,000 elements the hash-based implementation is dramatically better. (Sure, there is a break-even point, and it is probably somewhat larger, maybe 50k or 100k elements, BUT the chosen cutoff does reflect the ballooning memory of the original implementation.)

Simply compare the In [6] timings above.

The hashtable creation cost vastly outweighs everything else. timeit does NOT take this into account (well, it does warn that things are being cached, which is exactly what is happening here).

@jorisvandenbossche (Member) commented May 11, 2017

@jorisvandenbossche your comments indicate you did not read my update at the top.

OK, I thought you meant the comments in the code. GitHub doesn't always refresh itself if you keep the page open, so I only saw it now.
That said, I completely understand what you are saying there (thanks for writing it up, it makes the discussion clearer!), as I think is also clear from my comments. There is a trade-off between index hashtable creation and the actual lookup operation (the timings I showed were only about the second part). One thing I was not considering was the memory aspect, and that was of course also an important reason (maybe the original one) for the change.

But what I want to make clear is that this 10k cutoff can be very detrimental when you do a lot of single indexing operations above that size (and so use the hash-based engine). And you easily end up with many indexing operations, even if you don't do them manually in a loop.
As an example, the combine_first in the original issue report does around 5 get_loc calls for each column in the frame, leading to 72 calls in total for that specific example. Of course, that example falls below the 10k threshold and so would be fixed by this PR.

Maybe the combination of a large MI and many indexing operations is not that common, so maybe we shouldn't care, but if you are in that situation it can make a big difference.

Visual example (code to reproduce in http://nbviewer.jupyter.org/gist/jorisvandenbossche/d39063411b480e1fa825b0cb5c1d56fd); the solid line is the first get_loc (so including the hashtable creation overhead), the dotted lines are subsequent get_loc calls (added to the first):

[figure: cumulative get_loc time vs MultiIndex size for the ObjectEngine and the hash-based MultiIndexEngine]

So with 10 get_loc calls the cutoff is around 10^5, but if you made the same plot with 1000 get_loc calls, the ObjectEngine would be slightly faster even for a 10^7-sized MI (though probably with blown-up memory, as you indicated).

But, as I don't think we want to expose this as an option, we have to choose a cutoff size, and I think the 10k you used is probably fine.
That at least covers the typical case of a normal number of multi-indexed columns (e.g. the case in the reported issue).

@jorisvandenbossche (Member) left a review:

Some other comments apart from our discussion :-)

def time_multiindex_string_get_loc(self):
    self.mistring.get_loc((999, 19, 'Z'))
    self.mi_small.get_loc((99, 'A'))
@jorisvandenbossche (Member):

The index actually has 3 levels (did that change? I think at first it had only 2 levels).
So this is then not a full key, and I suppose it takes a different code path? (Maybe also useful to benchmark, but not what we want in this PR?)

@jreback (Contributor, Author):

fixed

        return super(MultiIndexObjectEngine, self).get_loc(val)


cdef class MultiIndexEngine(MultiIndexObjectEngine):
@jorisvandenbossche (Member):

Not sure it is needed to subclass MultiIndexObjectEngine, as you actually override all the methods it defines. So for code clarity, I would just subclass IndexEngine as before.

@jreback (Contributor, Author):

fixed


# choose our engine based on our size
# the hashing based MultiIndex for larger
# sizes, and the MultiIndexObject for smaller
@jorisvandenbossche (Member):

Maybe refer to the GitHub issue/PR number for the discussion about it?

@jreback (Contributor, Author):

done

@jreback (Contributor, Author) commented May 11, 2017

Yeah, the calls after hashtable construction are basically immaterial once you've paid that cost. Sure, you can always overwhelm it, but that is just user error (and to be honest I'm not sure how much we could do about that). A .get_indexer call does this lookup efficiently for many keys at once.
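
For example (an illustrative snippet, not code from this PR):

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([np.arange(1000), np.arange(20), list('ABCD')])
keys = [(0, 0, 'A'), (999, 19, 'D'), (500, 10, 'B')]

# one vectorized call instead of hashing each key separately in a loop
positions = mi.get_indexer(keys)            # integer positions, -1 for missing keys
assert list(positions) == [mi.get_loc(k) for k in keys]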

@jreback (Contributor, Author) commented May 11, 2017

thanks for the review @jorisvandenbossche

Labels: MultiIndex, Performance (Memory or execution speed)
Projects: None yet
Successfully merging this pull request may close these issues:
PERF: regression in MultiIndex get_loc performance
3 participants