PERF: improve MultiIndex get_loc performance #16346
Conversation
jorisvandenbossche
added the
Performance
label
May 12, 2017
+    # version of _check_for_collisions above for single label (tuple)
+
+    result = self.mi[loc]
+    if not array_equivalent(result, label):
jorisvandenbossche
May 12, 2017
Owner
I want to compare 2 tuples here, but cannot just do result == label, as that would not compare equal when there are NaNs.
For now I used array_equivalent, which does what I want (but still has some overhead). But do we have another utility to do that?
jreback
May 12, 2017
Contributor
In [4]: %timeit array_equivalent((1, 2, np.nan), (1, 2, np.nan))
34.6 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: compare = lambda x, y: all(np.isnan(e) and np.isnan(f) or e == f for e, f in zip(x, y))
In [8]: compare((1,2, np.nan), (1, 2, np.nan))
Out[8]: True
In [9]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
5.37 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
TomAugspurger
May 13, 2017
Contributor
As long as we're optimizing, math.isnan is quite a bit faster for scalars:
In [3]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
4.14 µs ± 66.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: import math
In [5]: compare2 = lambda x, y: all(math.isnan(e) and math.isnan(f) or e == f for e, f in zip(x, y))
In [6]: %timeit compare2((1,2, np.nan), (1, 2, np.nan))
1.34 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
jreback
May 13, 2017
Contributor
actually I think you need to use isnull; both np.isnan and math.isnan are not complete (IOW they won't, for example, handle NaT)
jreback
May 13, 2017
Contributor
but since this is cython, you can use lib.is_null_datetimelike (covers the bases of None, nan, and NaT)
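The null-safe comparison being discussed can be sketched with the public pd.isnull, which (unlike math.isnan or np.isnan) handles None, nan, and NaT alike. This is a hypothetical helper, not the PR's actual code; tuples_equal is an invented name:

```python
import numpy as np
import pandas as pd

# Hypothetical helper sketching jreback's suggestion: pd.isnull covers
# None, nan and NaT, where math.isnan would raise TypeError on non-floats.
def tuples_equal(x, y):
    return len(x) == len(y) and all(
        (pd.isnull(e) and pd.isnull(f)) or e == f
        for e, f in zip(x, y)
    )

print(tuples_equal((1, 2, np.nan), (1, 2, np.nan)))  # True
print(tuples_equal(('A', pd.NaT), ('A', pd.NaT)))    # True (NaT handled)
print(tuples_equal((1, 2, np.nan), (1, 2, 3)))       # False
```

Note that pd.NaT == pd.NaT is False, so the explicit null check is what makes the NaT case compare equal.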
jorisvandenbossche
May 15, 2017
Owner
Actually, I wanted to test whether this comparison was working on a real use case with NaNs in the index, but this does not seem to work anyway (regardless of this collision check):
In [11]: mi = pd.MultiIndex.from_product([['A', 'B'], [1, np.nan]])
In [12]: mi
Out[12]:
MultiIndex(levels=[['A', 'B'], [1]],
labels=[[0, 0, 1, 1], [0, -1, 0, -1]])
In [13]: mi.get_loc(('A', np.nan))
...
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
KeyError: ('A', nan)
The above is with 0.19.2, so this was already failing with the PyObject hashtable
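For context, the NaN is not stored in the level values at all; it is encoded as -1 in the integer codes (called labels in pandas 0.19/0.20, codes in current pandas), so there is no level entry for the hash-table lookup of ('A', nan) to hit:

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['A', 'B'], [1, np.nan]])
# NaN is dropped from the level and encoded as code -1
# (the "labels=[[0, 0, 1, 1], [0, -1, 0, -1]]" shown above).
print(list(mi.codes[1]))  # [0, -1, 0, -1]
```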
+        hash
+
+        """
+        hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,
jorisvandenbossche
May 12, 2017
Owner
Not sure whether np.array([v]) will always do the correct type coercion. In the hash_tuples version, this coercion to dtyped arrays happens in MultiIndex.from_tuples(vals).
More generally, a large part of the remaining time is spent in the hash_array function; a more specialized hasher for single scalars would probably improve it further.
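The approach under discussion, hashing one scalar by wrapping it in a one-element array, can be sketched with the public pd.util.hash_array. hash_scalar_sketch is a hypothetical name, and the real hash_tuple additionally combines the per-element hashes:

```python
import numpy as np
import pandas as pd

# Sketch only: hash a single scalar via a 1-element array.
# dtype=object sidesteps numpy's fixed-width string dtype,
# which the hashing code does not handle (see below in the thread).
def hash_scalar_sketch(val):
    arr = np.array([val], dtype=object)
    return pd.util.hash_array(arr, categorize=False)[0]

print(hash_scalar_sketch('E'))  # 16179053688037232491, per the thread below
```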
jreback
May 12, 2017
Contributor
for a single element np.array will work, but if you have multiple elements it will not; use this instead:
In [10]: from pandas.core.dtypes.cast import infer_dtype_from_array
In [12]: infer_dtype_from_array(['foo', np.nan])
Out[12]: (numpy.object_, ['foo', nan])
In [14]: dtype, arr = infer_dtype_from_array(['foo', np.nan])
In [15]: np.array(arr, dtype=dtype)
Out[15]: array(['foo', nan], dtype=object)
In [16]: np.array(['foo', np.nan])
Out[16]:
array(['foo', 'nan'],
dtype='<U3')
jorisvandenbossche
May 12, 2017
Owner
but if you have multiple elements it will not
it will always be a single element, I think (it's the individual elements of the single tuple that are put in an array)
jreback
May 12, 2017
Contributor
numpy converts strings to a fixed-length dtype, which is wrong here
use the routine to get it correct
@@ -264,7 +287,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
     try:
         vals = hashing.hash_object_array(vals, hash_key, encoding)
-    except TypeError:
+    except (TypeError, ValueError):
jorisvandenbossche
May 12, 2017
Owner
I needed to add this ValueError because I got the following error with the line above:
In [6]: pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-6197ea2d3075> in <module>()
----> 1 pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
/home/joris/scipy/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
286
287 try:
--> 288 vals = hashing.hash_object_array(vals, hash_key, encoding)
289 except TypeError:
290 # we have mixed types
/home/joris/scipy/pandas/pandas/_libs/hashing.pyx in pandas._libs.hashing.hash_object_array (pandas/_libs/hashing.c:1608)()
ValueError: Does not understand character buffer dtype format string ('w')
so this gives a ValueError and not a TypeError, while the version in the except clause does work. But I don't know whether it is OK to just catch that as well, or whether I should rather change hash_object_array
jreback
May 12, 2017
Contributor
the hashing doesn't handle fixed-length string arrays (it probably could, but we don't generally have them). If it's an object array it will work.
In [21]: np.array(['E'])
Out[21]:
array(['E'],
dtype='<U1')
In [22]: np.array(['E'], dtype=object)
Out[22]: array(['E'], dtype=object)
In [20]: pd.core.util.hashing.hash_array(np.array(['E'], dtype=object), categorize=False)
Out[20]: array([16179053688037232491], dtype=uint64)
jorisvandenbossche
May 12, 2017
Owner
yes, that is related to how I coerce to an array above. When using np.array([val]), I don't get object arrays. But inferring the dtype with infer_dtype_from_array adds some overhead (though probably not too much)
jorisvandenbossche
changed the title from
Perf mi get loc hash to PERF: improve MultiIndex get_loc performance
May 12, 2017
codecov
bot
commented
May 12, 2017
Codecov Report
@@ Coverage Diff @@
## master #16346 +/- ##
==========================================
- Coverage 90.38% 90.36% -0.02%
==========================================
Files 161 161
Lines 50916 50920 +4
==========================================
- Hits 46021 46015 -6
- Misses 4895 4905 +10
Continue to review full report at Codecov.
codecov
bot
commented
May 12, 2017
Codecov Report
@@ Coverage Diff @@
## master #16346 +/- ##
==========================================
- Coverage 90.38% 90.37% -0.01%
==========================================
Files 161 161
Lines 50949 50963 +14
==========================================
+ Hits 46052 46060 +8
- Misses 4897 4903 +6
Continue to review full report at Codecov.
@@ -921,6 +923,16 @@ cdef class MultiIndexHashTable(HashTable):
                 "hash collision\nlocs:\n{}\n"
                 "result:\n{}\nmi:\n{}".format(alocs, result, mi))
+    def _check_for_collision(self, Py_ssize_t loc, object label):
jreback
May 12, 2017
Contributor
make this cdef; it's hitting overhead because it has to go to Python and back to Cython
+
+
+def _hash_scalar(val, encoding='utf8', hash_key=None):
+    """
jreback
May 15, 2017
Contributor
this is duplicating lots of code. I'm sure there is a way to do this a bit more generically.
jorisvandenbossche
May 15, 2017
Owner
Yes, I know, but this was mainly to test.
That said, this makes it a lot faster than using hash_array (the commented part in hash_tuple), so I'm not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part), but for most of it that is not possible, as there are slight differences.
jreback
May 15, 2017
Contributor
what does all of this buy you? (IOW, can you post updated timings?)
maintaining a separate code path for scalars will cause future issues. These will need to be kept in sync (with the array hashing) if any code changes are made. You can easily share code here, which would make this more palatable.
jorisvandenbossche
May 15, 2017
Owner
In [4]: %timeit pd.core.util.hashing.hash_tuple2((999, np.nan, 'E'))
380 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit pd.core.util.hashing.hash_tuple((999, np.nan, 'E'))
81.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
hash_tuple2 uses the hash_array (the commented out version in the current branch), the hash_tuple uses the hash_scalar
jreback
May 15, 2017
Contributor
ok, it's probably reasonable, but I'm willing to sacrifice some perf to get some shared code (IOW, something between old and new is probably a good compromise versus repeating lots of code)
jorisvandenbossche
May 15, 2017
Owner
OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing; it only has some specific logic before that to convert the scalar to a proper array.
This reduces a lot of the code duplication and has only a minor perf impact.
Apart from that, it still has the separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree is certainly not ideal for code maintenance, but using something more general (e.g. infer_dtype) gives a big perf penalty.
+    else:
+        if not string_like:
+            from pandas import Index
+            vals = Index(vals).values
jorisvandenbossche
May 15, 2017
Owner
This is also a bit of an ugly part. I use Index(vals) to get the correct type coercion (e.g. for a Timestamp object, to ensure it does the same as hash_array). But I don't do this from the beginning because it is much slower than needed for simple numerical values.
It would be nice to have a utility function to convert a list of values to an array with the dtype "how pandas wants it" (but maybe that exists and I just don't know about it)
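The coercion difference can be seen with a Timestamp using only public APIs: np.array keeps it as a Python object, while Index applies pandas' own coercion to datetime64[ns], which is the form hash_array then operates on:

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp('2017-05-15')
# Plain np.array leaves the Timestamp as a Python object...
print(np.array([ts]).dtype)         # object
# ...while Index applies pandas' type coercion.
print(pd.Index([ts]).values.dtype)  # datetime64[ns]
```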
jreback
May 15, 2017
Contributor
It would be nice to have a utility function to convert a list of values to an array with the dtype
this is what Series/Index constructors do. About the most complicated code that exists.
+    else:
+        vals = np.array([val])
+
+    if vals.dtype == np.object_:
jreback
May 15, 2017
Contributor
you should just call infer_dtype_from_scalar instead of reinventing the wheel
jorisvandenbossche
May 16, 2017
Owner
Ah, you can only just call that if you know it exists :-)
But thanks for the pointer, that is exactly what I was hoping already existed.
I made a slight adaptation (an extra keyword) to get the behaviour I need here, but I could also put that in the hash_scalar method instead of in infer_dtype_from_scalar (though there it is a bit easier to do).
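For reference, infer_dtype_from_scalar lives in pandas internals, so this import path is valid as of pandas 0.20 but is not a stable public API and its signature has changed across versions; a minimal sketch of the basic call:

```python
import numpy as np
# Internal pandas API: path valid since 0.20, but subject to change.
from pandas.core.dtypes.cast import infer_dtype_from_scalar

# Returns the dtype pandas would infer for the scalar, plus the
# (possibly converted) value.
dtype, val = infer_dtype_from_scalar('foo')
print(dtype)  # object
```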
@@ -79,6 +79,23 @@ def test_hash_tuples(self):
     result = hash_tuples(tups[0])
     assert result == expected[0]
+    def test_hash_tuple(self):
+        # test equivalence between hash_tuples and hash_tuple
+        for tup in [(1, 'one'), (1, np.nan)]:
@@ -333,7 +333,7 @@ def maybe_promote(dtype, fill_value=np.nan):
     return dtype, fill_value
-def infer_dtype_from_scalar(val, pandas_dtype=False):
+def infer_dtype_from_scalar(val, pandas_dtype=False, use_datetimetz=True):
jreback
May 16, 2017
Contributor
you don't need this extra parameter; instead you can pass pandas_dtype=True. What is the issue?
jorisvandenbossche
May 16, 2017
Owner
pandas_dtype=True does not return a np.datetime64 but our DatetimeTZDtype, and then Periods also get converted, which I don't need either.
So the combination I need, i.e. tz-aware timestamps as their numpy dtype instead of as a pandas extension type or object, but Periods kept as objects, is not possible with the current options. This is a bit of a strange combination, but that's the consequence of how those are returned from the Index values (datetime64 without tz, but Periods as objects), which is how they are hashed.
But indeed, if we add this, ignore_tz is probably a better name
+    if isnull(val):
+        # this is to be consistent with the _hash_categorical implementation
+        return np.array([np.iinfo(np.uint64).max], dtype='u8')
+
jreback
May 16, 2017
Contributor
if you need to handle datetime w/tz directly (IOW, we basically ignore the tz), then I would:
if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)
I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document, and test it.
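The suggested tz check can be exercised with public APIs; tz_localize(None) drops the timezone and keeps the naive wall-clock value (note this is the local wall time, not the UTC value, which is the subject of the related issue #16372 referenced later):

```python
import pandas as pd

ts = pd.Timestamp('2017-05-16 12:00', tz='US/Eastern')
# Drop the timezone info, keeping the naive wall-clock value,
# following the check suggested above.
if getattr(ts, 'tzinfo', None) is not None:
    ts = ts.tz_localize(None)
print(ts)  # 2017-05-16 12:00:00
```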
jorisvandenbossche
May 16, 2017
Owner
yes, I can certainly do that check here as well.
That is maybe better, to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.
jorisvandenbossche
added some commits
May 11, 2017
Ran the benchmarks with the latest version of this PR. Against master:
Against 0.19.2 (which combines the slowdown/improvement of the several PRs), this now also gives an improvement for that benchmark:
jreback
referenced
this pull request
May 16, 2017
Open
API/BUG: hashing of datetimes is based on UTC values #16372
+    # for tz-aware datetimes, we need the underlying naive UTC value and
+    # not the tz aware object or pd extension type (as
+    # infer_dtype_from_scalar would do)
+    if not isinstance(val, tslib.Timestamp):
jreback
added this to the
0.20.2
milestone
May 16, 2017
@jorisvandenbossche lgtm. merge when ready (for 0.20.2) is good.
jreback
closed this
May 16, 2017
jreback
reopened this
May 16, 2017
jorisvandenbossche
merged commit 34ebad8
into pandas-dev:master
May 17, 2017
5 checks passed
@jreback thanks for the review!
nice fix @jorisvandenbossche !
pawroman
added a commit
to pawroman/pandas
that referenced
this pull request
May 18, 2017
pawroman
b2699e7
pcluo
added a commit
to pcluo/pandas
that referenced
this pull request
May 22, 2017
jorisvandenbossche + pcluo
1fa8201
TomAugspurger
added a commit
to TomAugspurger/pandas
that referenced
this pull request
May 29, 2017
jorisvandenbossche + TomAugspurger
c6ce9ea
TomAugspurger
added Backported and removed Needs Backport
labels
May 30, 2017
TomAugspurger
added a commit
that referenced
this pull request
May 30, 2017
jorisvandenbossche + TomAugspurger
3af2646
jorisvandenbossche
referenced
this pull request
Jun 9, 2017
Closed
Observing significant (up to 30x) __get_item__ performance drop between 0.19.2 and 0.20.x #16644
stangirala
added a commit
to stangirala/pandas
that referenced
this pull request
Jun 11, 2017
jorisvandenbossche + stangirala
745d1b7
jorisvandenbossche commented May 12, 2017
While I was timing the MultiIndex get_loc in #16324, I saw some 'quick wins' with some further profiling.
Rationale:
- _check_for_collisions: this function was a generic one for one or multiple labels. I added a slightly adapted version specialized for a single label (tuple).
- hash_tuples: again, added a specialized version for a single tuple.
@jreback I have some questions (will add them as comments), as this is not my specialty, but it's a good opportunity to get more familiar with the inner guts :-)
Some timings using
master:
PR:
So, something between a 5x and a 15x improvement (big variability between individual timings)