PERF: improve MultiIndex get_loc performance #16346

Merged

Conversation

jorisvandenbossche (Member)

While I was timing MultiIndex.get_loc in #16324, further profiling showed some 'quick wins'.

Rationale:

  • a big part of the time was spent checking for hash collisions in _check_for_collisions. That function handles one or multiple labels generically; I added a slightly adapted version specialized for a single tuple
  • another important factor was the hash creation with hash_tuples. Again, I added a specialized version for a single tuple (a rough sketch of this follows below)

@jreback I have some questions (will add them as comments), as this is not my specialty, but good opportunity to get more familiar with the inner guts :-)
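
To make the second point concrete, here is a rough sketch of what such a single-tuple hasher looks like (illustrative only, not the exact PR code; it assumes the private helper _combine_hash_arrays from pandas.core.util.hashing):

import numpy as np
from pandas.core.util.hashing import hash_array, _combine_hash_arrays

def hash_tuple(val, encoding='utf8', hash_key=None):
    # hash each scalar of the tuple separately and combine the results,
    # avoiding the MultiIndex round-trip that hash_tuples does for a
    # list of tuples
    hashes = (hash_array(np.array([v]), encoding=encoding,
                         hash_key=hash_key, categorize=False)
              for v in val)
    return _combine_hash_arrays(hashes, len(val))[0]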

Some timings using

import string
mi_large = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(20), list(string.ascii_letters)],
    names=['one', 'two', 'three'])

master:

In [2]: %time mi_large.get_loc((999, 19, 'E'))   # <--- first slower cold one
CPU times: user 140 ms, sys: 8 ms, total: 148 ms
Wall time: 145 ms
Out[2]: 1039978

In [3]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 4.33 ms
Out[3]: 1039978

In [4]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 4.18 ms
Out[4]: 1039978

In [5]: %timeit mi_large.get_loc((999, 19, 'E'))
100 loops, best of 3: 2.56 ms per loop

In [6]: %%timeit
   ...: for _ in range(1000):
   ...:     mi_large.get_loc((999, 19, 'E'))
   ...: 
1 loop, best of 3: 2.62 s per loop

PR:

In [3]: %time mi_large.get_loc((999, 19, 'E'))   # <--- first slower cold one
CPU times: user 128 ms, sys: 24 ms, total: 152 ms
Wall time: 152 ms
Out[3]: 1039978

In [4]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 716 µs
Out[4]: 1039978

In [5]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 371 µs
Out[5]: 1039978

In [6]: %timeit mi_large.get_loc((999, 19, 'E'))
1000 loops, best of 3: 189 µs per loop

In [15]: %%timeit
    ...: for _ in range(1000):
    ...:     mi_large.get_loc((999, 19, 'E'))
    ...: 
1 loop, best of 3: 188 ms per loop

So somewhere between a 5x and 15x improvement (with big variability between individual timings).

jorisvandenbossche added the Performance (memory or execution speed) label May 12, 2017
# version of _check_for_collisions above for single label (tuple)

result = self.mi[loc]
if not array_equivalent(result, label):
Member Author

I want to compare 2 tuples here, but cannot just do result == label, as this would not compare equal when there are NaNs.
For now I used array_equivalent, which does what I want (but still has some overhead). But do we have another utility to do that?

Contributor

In [4]: %timeit array_equivalent((1, 2, np.nan), (1, 2, np.nan))
34.6 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: compare = lambda x, y: all(np.isnan(e) and np.isnan(f) or e == f for e, f in zip(x, y))

In [8]: compare((1,2, np.nan), (1, 2, np.nan))
Out[8]: True

In [9]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
5.37 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Member Author

Yes, much simpler (and faster)!

Contributor

As long as we're optimizing, math.isnan is quite a bit faster for scalars:

In [3]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
4.14 µs ± 66.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: import math

In [5]: compare2 = lambda x, y: all(math.isnan(e) and math.isnan(f) or e == f for e, f in zip(x, y))

In [6]: %timeit compare2((1,2, np.nan), (1, 2, np.nan))
1.34 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Contributor

actually I think you need to use isnull; both np.isnan and math.isnan are not complete (IOW they won't, for example, handle NaT)

Contributor

but since this is cython, you can use lib.is_null_datetimelike (covers the bases of None, nan, NaT)
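
A minimal sketch of the comparison with that helper (illustrative; it assumes is_null_datetimelike is importable from pandas._libs.lib, as suggested above):

from pandas._libs.lib import is_null_datetimelike

def compare3(x, y):
    # treat any missing value (None, nan, NaT) in the same position as equal
    return all(is_null_datetimelike(e) and is_null_datetimelike(f)
               or e == f
               for e, f in zip(x, y))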

Member Author

Actually, I wanted to test whether this comparison was working on a real use case with NaNs in the index, but this seems not to work anyway (regardless of this collision check):

In [11]: mi = pd.MultiIndex.from_product([['A', 'B'], [1, np.nan]])

In [12]: mi
Out[12]: 
MultiIndex(levels=[['A', 'B'], [1]],
           labels=[[0, 0, 1, 1], [0, -1, 0, -1]])

In [13]: mi.get_loc(('A', np.nan))
...
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()

KeyError: ('A', nan)

The above is with 0.19.2, so this was already failing with the PyObject hashtable

@@ -264,7 +287,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):

try:
vals = hashing.hash_object_array(vals, hash_key, encoding)
except TypeError:
except (TypeError, ValueError):
Member Author

I needed to add this ValueError because I got the following error with the line above:

In [6]: pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-6197ea2d3075> in <module>()
----> 1 pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)

/home/joris/scipy/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    286 
    287         try:
--> 288             vals = hashing.hash_object_array(vals, hash_key, encoding)
    289         except TypeError:
    290             # we have mixed types

/home/joris/scipy/pandas/pandas/_libs/hashing.pyx in pandas._libs.hashing.hash_object_array (pandas/_libs/hashing.c:1608)()

ValueError: Does not understand character buffer dtype format string ('w')

so this gives a ValueError and not a TypeError, while the version in the except clause does work. But I don't know whether it is OK to just catch that as well, or whether I should rather change hash_object_array.

Contributor

the hashing doesn't handle fixed-length string arrays (it prob could, but we don't generally have them). If it's an object array it will work.

In [21]: np.array(['E'])
Out[21]: 
array(['E'], 
      dtype='<U1')

In [22]: np.array(['E'], dtype=object)
Out[22]: array(['E'], dtype=object)

In [20]: pd.core.util.hashing.hash_array(np.array(['E'], dtype=object), categorize=False)
Out[20]: array([16179053688037232491], dtype=uint64)

Member Author

yes, that is related to how I coerce to an array above. When using np.array([val]), I don't get object arrays. But inferring the dtype with infer_dtype_from_array adds some overhead (probably not too much, though)

hash

"""
hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,
Member Author

Not sure whether np.array([v]) will always do the correct coercion of the type. In the hash_tuples version, this coercion to dtyped arrays happens in MultiIndex.from_tuples(vals).

More generally, a large part of the remaining time is spent in the hash_array function. A more specialized hasher for single scalars would probably improve it further.

Contributor

for a single element np.array will work, but if you have multiple elements it will not; use this instead:

In [10]: from pandas.core.dtypes.cast import infer_dtype_from_array

In [12]: infer_dtype_from_array(['foo', np.nan])
Out[12]: (numpy.object_, ['foo', nan])

In [14]: dtype, arr = infer_dtype_from_array(['foo', np.nan])

In [15]: np.array(arr, dtype=dtype)
Out[15]: array(['foo', nan], dtype=object)

In [16]: np.array(['foo', np.nan])
Out[16]: 
array(['foo', 'nan'], 
      dtype='<U3')

Contributor

yep

Member Author

but if you have multiple elements it will not

it will always be a single element, I think (it is the individual elements of the single tuple that are put in an array)

Contributor

numpy converts strings to fixed length, which is wrong

use the routine to have it correct

jorisvandenbossche changed the title from "Perf mi get loc hash" to "PERF: improve MultiIndex get_loc performance" May 12, 2017
codecov bot commented May 12, 2017

Codecov Report

Merging #16346 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #16346      +/-   ##
==========================================
- Coverage   90.38%   90.36%   -0.02%     
==========================================
  Files         161      161              
  Lines       50916    50920       +4     
==========================================
- Hits        46021    46015       -6     
- Misses       4895     4905      +10
Flag        Coverage Δ
#multiple   88.14% <100%> (-0.01%) ⬇️
#single     40.22% <14.28%> (-0.11%) ⬇️

Impacted Files                 Coverage Δ
pandas/core/util/hashing.py    90.67% <100%> (+0.32%) ⬆️
pandas/core/indexes/multi.py   96.56% <100%> (ø) ⬆️
pandas/io/gbq.py               25% <0%> (-58.34%) ⬇️
pandas/core/common.py          90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py           97.68% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codecov bot commented May 12, 2017

Codecov Report

Merging #16346 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #16346      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.01%     
==========================================
  Files         161      161              
  Lines       50949    50963      +14     
==========================================
+ Hits        46052    46060       +8     
- Misses       4897     4903       +6
Flag        Coverage Δ
#multiple   88.15% <100%> (ø) ⬆️
#single     40.2% <26.31%> (-0.11%) ⬇️

Impacted Files                 Coverage Δ
pandas/core/indexes/multi.py   96.56% <100%> (ø) ⬆️
pandas/core/util/hashing.py    92.96% <100%> (+2.61%) ⬆️
pandas/io/gbq.py               25% <0%> (-58.34%) ⬇️
pandas/core/frame.py           97.69% <0%> (-0.1%) ⬇️
pandas/core/common.py          91.39% <0%> (+0.33%) ⬆️


@@ -921,6 +923,16 @@ cdef class MultiIndexHashTable(HashTable):
"hash collision\nlocs:\n{}\n"
"result:\n{}\nmi:\n{}".format(alocs, result, mi))

def _check_for_collision(self, Py_ssize_t loc, object label):
Contributor

make this cdef; it's hitting overhead because it has to go to python and back to cython
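
A rough sketch of the suggestion (illustrative Cython, not the exact PR code; the body mirrors the diff quoted above):

cdef _check_for_collision(self, Py_ssize_t loc, object label):
    # declared cdef so the call stays inside Cython, avoiding the
    # Python-call overhead mentioned above
    result = self.mi[loc]
    if not array_equivalent(result, label):
        raise KeyError(label)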



def _hash_scalar(val, encoding='utf8', hash_key=None):
"""
Contributor

this is duplicating lots of code. I'm sure there is a way to do this a bit more generically.

Member Author

Yes, I know, but this was mainly to test.
And from that, it appears this makes it a lot faster than using hash_array (the commented-out part in hash_tuple). So not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part), but for most of it that is not possible, as there are slight differences.

Contributor

what does all of this buy you? (IOW, can you post updated timings?)

maintaining a separate code path for scalars will cause future issues: these will need to be kept in sync with the array hashing if any code changes are made. you can easily share code here, which would make this more palatable.

Member Author

In [4]: %timeit pd.core.util.hashing.hash_tuple2((999, np.nan, 'E'))
380 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit pd.core.util.hashing.hash_tuple((999, np.nan, 'E'))
81.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

hash_tuple2 uses hash_array (the commented-out version in the current branch); hash_tuple uses hash_scalar

Contributor

ok, it's prob reasonable, but I'm willing to sacrifice some perf to get some shared code (IOW something between old and new is prob a good compromise vs. repeating lots of code)

Member Author

OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing; it only has some specific logic before that to convert the scalar to a proper array.
This reduces a lot of the code duplication, and has only a minor perf impact.

Apart from that, it still has the separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree is certainly not ideal for code maintenance, but using something more general (e.g. infer_dtype) gives a big perf penalty. A sketch of the approach follows below.
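
A rough sketch of that approach, pieced together from the diff snippets quoted in this thread (illustrative, not the exact PR code):

import numpy as np
from pandas import isnull
from pandas.core.dtypes.cast import infer_dtype_from_scalar
from pandas.core.util.hashing import hash_array

def _hash_scalar(val, encoding='utf8', hash_key=None):
    if isnull(val):
        # same convention as _hash_categorical uses for missing values
        return np.array([np.iinfo(np.uint64).max], dtype='u8')
    if getattr(val, 'tzinfo', None) is not None:
        # drop the tz so a tz-aware Timestamp hashes like its naive value
        val = val.tz_localize(None)
    # convert the scalar to a 1-element array of the inferred dtype, then
    # reuse the existing array-hashing machinery
    dtype, val = infer_dtype_from_scalar(val)
    vals = np.array([val], dtype=dtype)
    return hash_array(vals, encoding=encoding, hash_key=hash_key,
                      categorize=False)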

else:
if not string_like:
from pandas import Index
vals = Index(vals).values
Member Author

This is also a bit of an ugly part. I use Index(vals) to get the correct type coercion (e.g. for a Timestamp object, to ensure it does the same as hash_array). But I don't do this from the beginning because it is much slower than needed for simple numerical values.
It would be nice to have a utility function to convert a list of values to an array with the dtype "how pandas wants it" (but maybe that exists and I just don't know about it). See the example below.
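
For illustration, the kind of coercion meant here (hypothetical session):

In [1]: pd.Index([pd.Timestamp('2017-05-16')]).values
Out[1]: array(['2017-05-16T00:00:00.000000000'], dtype='datetime64[ns]')

In [2]: np.array([pd.Timestamp('2017-05-16')])
Out[2]: array([Timestamp('2017-05-16 00:00:00')], dtype=object)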

Contributor

It would be nice to have a utility function to convert a list of values to an array with the dtype

this is what Series/Index constructors do. About the most complicated code that exists.

jorisvandenbossche force-pushed the perf-mi-get_loc-hash branch 3 times, most recently from fcaa5bb to afa8775 on May 15, 2017 17:37
else:
vals = np.array([val])

if vals.dtype == np.object_:
Contributor

you should just call infer_dtype_from_scalar instead of reinventing the wheel

Member Author

Ah, you can only "just call" that if you know it exists :-)

But thanks for the pointer, that is exactly what I was hoping already existed.

I made a slight adaptation (an extra keyword) to get the behaviour I need here, but I can also put that in the hash_scalar method instead of in infer_dtype_from_scalar (it is just a bit easier to do there).
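
For reference, roughly how that helper behaves (hypothetical session):

In [1]: from pandas.core.dtypes.cast import infer_dtype_from_scalar

In [2]: infer_dtype_from_scalar(1)
Out[2]: (numpy.int64, 1)

In [3]: infer_dtype_from_scalar('foo')
Out[3]: (numpy.object_, 'foo')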

@@ -79,6 +79,23 @@ def test_hash_tuples(self):
result = hash_tuples(tups[0])
assert result == expected[0]

def test_hash_tuple(self):
# test equivalence between hash_tuples and hash_tuple
for tup in [(1, 'one'), (1, np.nan)]:
Contributor

test with pd.NaT as well (see the sketch below)
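
A sketch of the extra case (hypothetical; it mirrors the equivalence test quoted above):

for tup in [(1, 'one'), (1, np.nan), (1.0, pd.NaT, 'A')]:
    result = hash_tuple(tup)
    expected = hash_tuples([tup])[0]
    assert result == expected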

@@ -333,7 +333,7 @@ def maybe_promote(dtype, fill_value=np.nan):
return dtype, fill_value


def infer_dtype_from_scalar(val, pandas_dtype=False):
def infer_dtype_from_scalar(val, pandas_dtype=False, use_datetimetz=True):
Contributor

you don't need this extra parameter; instead you can pass pandas_dtype=True. What is the issue?

Member Author

pandas_dtype=True does not return np.datetime64 but our DatetimeTZDtype, and then Periods also get converted, which I don't need either.
So the combination I need (tz-aware timestamps converted to the numpy dtype instead of the pandas extension type or object, but Periods kept as objects) is not possible with the current options. This is a bit of a strange combination, but that's the consequence of how those values are returned from Index.values (datetime64 without tz, but Periods as objects), which is how they are hashed (see the illustration below).

But indeed, if we add this, ignore_tz is probably a better name
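
Roughly what the two modes return for a tz-aware timestamp (hypothetical session; the integer value in the last output is elided):

In [1]: ts = pd.Timestamp('2017-05-16', tz='Europe/Brussels')

In [2]: infer_dtype_from_scalar(ts)  # default: kept as object
Out[2]: (numpy.object_, Timestamp('2017-05-16 00:00:00+0200', tz='Europe/Brussels'))

In [3]: infer_dtype_from_scalar(ts, pandas_dtype=True)  # extension dtype
Out[3]: (datetime64[ns, Europe/Brussels], ...)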

if isnull(val):
# this is to be consistent with the _hash_categorical implementation
return np.array([np.iinfo(np.uint64).max], dtype='u8')

Contributor

if you need to handle datetime w/tz directly (IOW, we basically ignore the tz), then I would:

if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)

I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document and test.

Member Author

yes, I can certainly do that check here as well.

It is maybe better to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.

Contributor

yeah, I think it's better locally

jorisvandenbossche (Member Author) commented May 16, 2017

Ran the benchmarks with the latest version of this PR:

Against master:

$ asv continuous upstream/master HEAD -b time_multiindex
...
     before      after     ratio
  [0ea0f25b] [c29dab95]
-     2.73s   298.28ms      0.11  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Against 0.19.2 (which combines the slowdowns/improvements of the several PRs), it now also gives an improvement for that benchmark:

$ asv continuous v0.19.2 HEAD -b time_multiindex
...
    before     after       ratio
  [6f525eec] [c29dab95]
-  744.29ms   334.65ms      0.45  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
-  716.03ms   175.14ms      0.24  indexing.MultiIndexing.time_multiindex_large_get_loc
-  736.39ms   166.29ms      0.23  indexing.MultiIndexing.time_multiindex_get_indexer
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

# for tz-aware datetimes, we need the underlying naive UTC value and
# not the tz aware object or pd extension type (as
# infer_dtype_from_scalar would do)
if not isinstance(val, tslib.Timestamp):
Contributor

see #16372 (for later)

jreback added this to the 0.20.2 milestone May 16, 2017
jreback (Contributor) commented May 16, 2017

@jorisvandenbossche lgtm. Merge when ready; for 0.20.2 is good.

jreback closed this May 16, 2017
jreback reopened this May 16, 2017
jorisvandenbossche merged commit 34ebad8 into pandas-dev:master May 17, 2017
jorisvandenbossche (Member Author)

@jreback thanks for the review!

jreback (Contributor) commented May 17, 2017

nice fix @jorisvandenbossche !

pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)
TomAugspurger pushed a commit that referenced this pull request May 30, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
jorisvandenbossche deleted the perf-mi-get_loc-hash branch November 27, 2017 19:51