PERF: improve MultiIndex get_loc performance #16346

Merged

jorisvandenbossche merged 8 commits into pandas-dev:master from jorisvandenbossche:perf-mi-get_loc-hash on May 17, 2017

Conversation

3 participants
@jorisvandenbossche
Member

jorisvandenbossche commented May 12, 2017

While timing the MultiIndex get_loc in #16324, I saw some 'quick wins' from some further profiling.

Rationale:

  • A big part of the time was spent checking for a hash collision in _check_for_collisions. This function was a generic one for one or multiple labels; I added a slightly adapted version specialized for a single tuple (a rough sketch of the idea is shown below).
  • Another important factor was the hash creation with hash_tuples. Again, I added a specialized version for a single tuple.

@jreback I have some questions (will add them as comments), as this is not my specialty, but good opportunity to get more familiar with the inner guts :-)
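
To make the first bullet concrete, here is a minimal pure-Python sketch (not the PR's actual Cython code) of what the specialized single-label check boils down to: look up the tuple stored at the hashed location and verify it really equals the queried label, treating NaN as equal to NaN:

import numpy as np
import pandas as pd

def check_single_label(mi, loc, label):
    # Hypothetical helper: mi is a MultiIndex and loc is the position the
    # hash table returned for label. If the stored tuple does not match,
    # the lookup was a hash collision and we should raise KeyError.
    result = mi[loc]
    # "e != e" is only True for NaN-like values, so NaN compares equal to NaN
    if not all((e != e and f != f) or e == f for e, f in zip(result, label)):
        raise KeyError(label)
    return loc

mi = pd.MultiIndex.from_tuples([(1, 'a'), (2, np.nan)])
check_single_label(mi, 1, (2, np.nan))   # -> 1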

Some timings using

import string
import numpy as np
import pandas as pd

mi_large = pd.MultiIndex.from_product(
    [np.arange(1000),
     np.arange(20), list(string.ascii_letters)],
    names=['one', 'two', 'three'])

master:

In [2]: %time mi_large.get_loc((999, 19, 'E'))   # <--- first slower cold one
CPU times: user 140 ms, sys: 8 ms, total: 148 ms
Wall time: 145 ms
Out[2]: 1039978

In [3]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 4.33 ms
Out[3]: 1039978

In [4]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 4.18 ms
Out[4]: 1039978

In [5]: %timeit mi_large.get_loc((999, 19, 'E'))
100 loops, best of 3: 2.56 ms per loop

In [6]: %%timeit
   ...: for _ in range(1000):
   ...:     mi_large.get_loc((999, 19, 'E'))
   ...: 
1 loop, best of 3: 2.62 s per loop

PR:

In [3]: %time mi_large.get_loc((999, 19, 'E'))   # <--- first slower cold one
CPU times: user 128 ms, sys: 24 ms, total: 152 ms
Wall time: 152 ms
Out[3]: 1039978

In [4]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 716 µs
Out[4]: 1039978

In [5]: %time mi_large.get_loc((999, 19, 'E'))
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 371 µs
Out[5]: 1039978

In [6]: %timeit mi_large.get_loc((999, 19, 'E'))
1000 loops, best of 3: 189 µs per loop

In [15]: %%timeit
    ...: for _ in range(1000):
    ...:     mi_large.get_loc((999, 19, 'E'))
    ...: 
1 loop, best of 3: 188 ms per loop

So something between a 5x and 15x improvement (with big variability between individual timings).

+ # version of _check_for_collisions above for single label (tuple)
+
+ result = self.mi[loc]
+ if not array_equivalent(result, label):

@jorisvandenbossche

jorisvandenbossche May 12, 2017

Member

I want to compare 2 tuples here, but cannot just do result == label, as this would not compare equal when there are NaNs.
For now I used array_equivalent, which does what I want (but still has some overhead). But do we have another utility to do that?

@jreback

jreback May 12, 2017

Contributor
In [4]: %timeit array_equivalent((1, 2, np.nan), (1, 2, np.nan))
34.6 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: compare = lambda x, y: all(np.isnan(e) and np.isnan(f) or e == f for e, f in zip(x, y))

In [8]: compare((1,2, np.nan), (1, 2, np.nan))
Out[8]: True

In [9]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
5.37 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@jorisvandenbossche

jorisvandenbossche May 12, 2017

Member

Yes, much simpler (and faster)!

@TomAugspurger

TomAugspurger May 13, 2017

Contributor

As long as we're optimizing, math.isnan is quite a bit faster for scalars:

In [3]: %timeit compare((1,2, np.nan), (1, 2, np.nan))
4.14 µs ± 66.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [4]: import math

In [5]: compare2 = lambda x, y: all(math.isnan(e) and math.isnan(f) or e == f for e, f in zip(x, y))

In [6]: %timeit compare2((1,2, np.nan), (1, 2, np.nan))
1.34 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

@jreback

jreback May 13, 2017

Contributor

Actually, I think you need to use isnull; both np.isnan and math.isnan are not complete (IOW they won't, for example, handle NaT)
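
A quick illustration of the difference (this snippet is just for illustration and is not part of the PR):

import math
import numpy as np
import pandas as pd

# pd.isnull covers all of pandas' missing-value markers:
[pd.isnull(v) for v in (None, np.nan, pd.NaT)]   # [True, True, True]

# math.isnan only accepts real numbers, so e.g. None blows up
# (and pd.NaT is not a plain float either):
try:
    math.isnan(None)
except TypeError as exc:
    print(exc)   # e.g. "must be real number, not NoneType"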

@jreback

jreback May 13, 2017

Contributor

But since this is Cython, you can use lib.is_null_datetimelike (it covers the bases of None, NaN, NaT)

@jorisvandenbossche

jorisvandenbossche May 15, 2017

Member

Actually, I wanted to test whether this comparison was working on a real use case with NaNs in the index, but this seems not to work anyway (regardless of this collision check):

In [11]: mi = pd.MultiIndex.from_product([['A', 'B'], [1, np.nan]])

In [12]: mi
Out[12]: 
MultiIndex(levels=[['A', 'B'], [1]],
           labels=[[0, 0, 1, 1], [0, -1, 0, -1]])

In [13]: mi.get_loc(('A', np.nan))
...
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()

KeyError: ('A', nan)

The above is with 0.19.2, so this was already failing with the PyObject hashtable

pandas/core/util/hashing.py
@@ -264,7 +287,7 @@ def hash_array(vals, encoding='utf8', hash_key=None, categorize=True):
try:
vals = hashing.hash_object_array(vals, hash_key, encoding)
- except TypeError:
+ except (TypeError, ValueError):

@jorisvandenbossche

jorisvandenbossche May 12, 2017

Member

I needed to add this ValueError because I got the following error with the line above:

In [6]: pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-6197ea2d3075> in <module>()
----> 1 pd.core.util.hashing.hash_array(np.array(['E']), categorize=False)

/home/joris/scipy/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    286 
    287         try:
--> 288             vals = hashing.hash_object_array(vals, hash_key, encoding)
    289         except TypeError:
    290             # we have mixed types

/home/joris/scipy/pandas/pandas/_libs/hashing.pyx in pandas._libs.hashing.hash_object_array (pandas/_libs/hashing.c:1608)()

ValueError: Does not understand character buffer dtype format string ('w')

so this gives a ValueError and not a TypeError, while the version in the except clause does work. But I don't know whether it is OK to just catch that as well, or whether I should rather change hash_object_array itself

@jreback

jreback May 12, 2017

Contributor

The hashing doesn't handle fixed-length string arrays (it probably could, but we don't generally have them). If it's an object array, it will work.

In [21]: np.array(['E'])
Out[21]: 
array(['E'], 
      dtype='<U1')

In [22]: np.array(['E'], dtype=object)
Out[22]: array(['E'], dtype=object)

In [20]: pd.core.util.hashing.hash_array(np.array(['E'], dtype=object), categorize=False)
Out[20]: array([16179053688037232491], dtype=uint64)

@jorisvandenbossche

jorisvandenbossche May 12, 2017

Member

Yes, that is related to how I coerce to an array above. When using np.array([val]), I don't get object arrays. But inferring the dtype with infer_dtype_from_array adds some overhead (though probably not too much)

pandas/core/util/hashing.py
+ hash
+
+ """
+ hashes = (hash_array(np.array([v]), encoding=encoding, hash_key=hash_key,

@jorisvandenbossche

jorisvandenbossche May 12, 2017

Member

Not sure whether np.array([v]) will always do the correct type coercion. In the hash_tuples version, this coercion to dtyped arrays happens in MultiIndex.from_tuples(vals).

More generally, a large part of the remaining time is spent in the hash_array function. A more specialized hasher for single scalars would probably improve it further

@jreback

jreback May 12, 2017

Contributor

For a single element np.array will work, but if you have multiple elements it will not; use this instead:

In [10]: from pandas.core.dtypes.cast import infer_dtype_from_array

In [12]: infer_dtype_from_array(['foo', np.nan])
Out[12]: (numpy.object_, ['foo', nan])

In [14]: dtype, arr = infer_dtype_from_array(['foo', np.nan])

In [15]: np.array(arr, dtype=dtype)
Out[15]: array(['foo', nan], dtype=object)

In [16]: np.array(['foo', np.nan])
Out[16]: 
array(['foo', 'nan'], 
      dtype='<U3')

@jreback

jreback May 12, 2017

Contributor

yep

@jorisvandenbossche

jorisvandenbossche May 12, 2017

Member

but if you have multiple elements it will not

it will always be a single element, I think (it is the individual elements of the single tuple that are each put in an array)

@jreback

jreback May 12, 2017

Contributor

numpy converts strings to fixed-length dtypes, which is wrong

use the routine to have it correct

@jorisvandenbossche jorisvandenbossche changed the title from Perf mi get loc hash to PERF: improve MultiIndex get_loc performance May 12, 2017

@codecov

codecov bot commented May 12, 2017

Codecov Report

Merging #16346 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16346      +/-   ##
==========================================
- Coverage   90.38%   90.36%   -0.02%     
==========================================
  Files         161      161              
  Lines       50916    50920       +4     
==========================================
- Hits        46021    46015       -6     
- Misses       4895     4905      +10
Flag Coverage Δ
#multiple 88.14% <100%> (-0.01%) ⬇️
#single 40.22% <14.28%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/util/hashing.py 90.67% <100%> (+0.32%) ⬆️
pandas/core/indexes/multi.py 96.56% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.68% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bdbcb6...e88b658. Read the comment docs.

@codecov

codecov bot commented May 12, 2017

Codecov Report

Merging #16346 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16346      +/-   ##
==========================================
- Coverage   90.38%   90.37%   -0.01%     
==========================================
  Files         161      161              
  Lines       50949    50963      +14     
==========================================
+ Hits        46052    46060       +8     
- Misses       4897     4903       +6
Flag Coverage Δ
#multiple 88.15% <100%> (ø) ⬆️
#single 40.2% <26.31%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 96.56% <100%> (ø) ⬆️
pandas/core/util/hashing.py 92.96% <100%> (+2.61%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.69% <0%> (-0.1%) ⬇️
pandas/core/common.py 91.39% <0%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d3bef8...8acc9e8. Read the comment docs.

@@ -921,6 +923,16 @@ cdef class MultiIndexHashTable(HashTable):
"hash collision\nlocs:\n{}\n"
"result:\n{}\nmi:\n{}".format(alocs, result, mi))
+ def _check_for_collision(self, Py_ssize_t loc, object label):

@jreback

jreback May 12, 2017

Contributor

Make this cdef; it's hitting overhead because it has to go to Python and back to Cython

+
+
+def _hash_scalar(val, encoding='utf8', hash_key=None):
+ """

@jreback

jreback May 15, 2017

Contributor

This is duplicating lots of code. I'm sure there is a way to do this a bit more generically.

@jorisvandenbossche

jorisvandenbossche May 15, 2017

Member

Yes, I know, but this was mainly to test.
And apart from that, it makes it a lot faster than using hash_array (the commented part in hash_tuple). So I'm not sure how to solve that. In principle I can put the common parts in helper functions (e.g. the redistributing part), but for most of it that is not possible, as there are slight differences.

@jreback

jreback May 15, 2017

Contributor

What does all of this buy you? (IOW, can you post updated timings?)

Maintaining a separate code path for scalars will cause future issues: these will need to be kept in sync with the array hashing if any code changes are made. You can easily share code here, which would make this more palatable.

@jorisvandenbossche

jorisvandenbossche May 15, 2017

Member
In [4]: %timeit pd.core.util.hashing.hash_tuple2((999, np.nan, 'E'))
380 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit pd.core.util.hashing.hash_tuple((999, np.nan, 'E'))
81.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

hash_tuple2 uses hash_array (the commented-out version in the current branch); hash_tuple uses the new hash_scalar

@jreback

jreback May 15, 2017

Contributor

OK, it's probably reasonable, but I'm willing to sacrifice some perf to get some shared code (IOW, something between the old and the new is probably a good compromise versus repeating lots of code)

@jorisvandenbossche

jorisvandenbossche May 15, 2017

Member

OK, I pushed a new version that almost fully reuses the hash_array function for the actual hashing, and only has some specific logic before that to convert the scalar to a proper array.
This reduces a lot of the code duplication, and has only a minor perf impact.

Apart from that, it still has a separate code path to first convert the scalar to an array, which might be a bit brittle, and I agree is certainly not ideal for code maintenance, but using something more general (e.g. infer_dtype) gives a big perf penalty.
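
For reference, a rough sketch of that reuse (assuming the private helper _combine_hash_arrays from pandas/core/util/hashing.py; the real PR code additionally special-cases the scalar-to-array conversion):

import numpy as np
from pandas.core.util.hashing import hash_array, _combine_hash_arrays

def hash_tuple_sketch(val, encoding='utf8', hash_key=None):
    # Hash each element of the tuple as a length-1 object array via the
    # existing hash_array, then combine the per-element uint64 hashes the
    # same way hash_tuples does.
    hashes = (hash_array(np.array([v], dtype=object), encoding=encoding,
                         hash_key=hash_key, categorize=False)
              for v in val)
    return _combine_hash_arrays(hashes, len(val))[0]

hash_tuple_sketch((999, 19, 'E'))   # -> a uint64 scalar hash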

pandas/core/util/hashing.py
+ else:
+ if not string_like:
+ from pandas import Index
+ vals = Index(vals).values

@jorisvandenbossche

jorisvandenbossche May 15, 2017

Member

This is also a bit of an ugly part. I use Index(vals) to get the correct type coercion (e.g. for a Timestamp object, to ensure it does the same as hash_array). But I don't do this from the start because it is much slower than needed for simple numerical values.
It would be nice to have a utility function to convert a list of values to an array with the dtype "how pandas wants it" (but maybe that exists and I just don't know about it)
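
A small illustration of the coercion difference (outputs as of roughly this pandas version; np.array keeps the Timestamp as an object, while Index coerces to the underlying datetime64 values that hash_array then sees):

import numpy as np
import pandas as pd

ts = pd.Timestamp('2017-05-15 12:00', tz='US/Eastern')

np.array([ts]).dtype          # dtype('O') -- plain object array
pd.Index([ts]).values.dtype   # dtype('<M8[ns]') -- naive-UTC datetime64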

@jreback

jreback May 15, 2017

Contributor

It would be nice to have a utility function to convert a list of values to an array with the dtype

this is what Series/Index constructors do. About the most complicated code that exists.

pandas/core/util/hashing.py
+ else:
+ vals = np.array([val])
+
+ if vals.dtype == np.object_:

@jreback

jreback May 15, 2017

Contributor

You should just call infer_dtype_from_scalar instead of reinventing the wheel

@jorisvandenbossche

jorisvandenbossche May 16, 2017

Member

Ah, you can only just call that if you know it exists :-)

But thanks for the pointer, that is exactly what I was hoping already existed.

I made a slight adaptation (extra keyword) to get the behaviour I need here, but I can also put that in the hash_scalar method instead of in infer_dtype_from_scalar (but there it is a bit easier to do).

pandas/tests/util/test_hashing.py
@@ -79,6 +79,23 @@ def test_hash_tuples(self):
result = hash_tuples(tups[0])
assert result == expected[0]
+ def test_hash_tuple(self):
+ # test equivalence between hash_tuples and hash_tuple
+ for tup in [(1, 'one'), (1, np.nan)]:

@jreback

jreback May 15, 2017

Contributor

test with pd.NaT as well
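
A sketch of how the suggestion could be folded into the equivalence test (using the hash_tuple/hash_tuples names from this PR):

import numpy as np
import pandas as pd
from pandas.core.util.hashing import hash_tuple, hash_tuples

# each single tuple should hash the same via both code paths,
# including a NaT element as suggested above
for tup in [(1, 'one'), (1, np.nan), (1.0, pd.NaT, 'A')]:
    assert hash_tuple(tup) == hash_tuples([tup])[0]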

pandas/core/dtypes/cast.py
@@ -333,7 +333,7 @@ def maybe_promote(dtype, fill_value=np.nan):
return dtype, fill_value
-def infer_dtype_from_scalar(val, pandas_dtype=False):
+def infer_dtype_from_scalar(val, pandas_dtype=False, use_datetimetz=True):

@jreback

jreback May 16, 2017

Contributor

You don't need this extra parameter; instead you can pass pandas_dtype=True. What is the issue?

@jorisvandenbossche

jorisvandenbossche May 16, 2017

Member

pandas_dtype=True does not return a np.datetime64 but our DatetimeTZDtype, and furthermore Periods then also get converted, which I also don't need.
So the combination I need, i.e. tz-aware timestamps as their numpy dtype instead of as a pandas extension type or as object, but keeping Periods as objects, is not possible with the current options. This is a bit of a strange combination, but that's the consequence of how those are returned from the Index values (as datetime64 without tz, but Periods as objects), which is how they are hashed.

But indeed, if we add this, ignore_tz is probably a better name

+ if isnull(val):
+ # this is to be consistent with the _hash_categorical implementation
+ return np.array([np.iinfo(np.uint64).max], dtype='u8')
+

@jreback

jreback May 16, 2017

Contributor

If you need to handle datetime w/tz directly (IOW, we basically ignore the tz), then I would:

if getattr(val, 'tzinfo', None) is not None:
    val = val.tz_localize(None)

I suppose an option to ignore tz is fine for infer_dtype_from_scalar, but if you add it I would rename, document and test.

@jorisvandenbossche

jorisvandenbossche May 16, 2017

Member

Yes, I can certainly do that check here as well.

It is maybe better then to keep the custom logic here, as the keyword added to infer_dtype_from_scalar would not be used anywhere else.

@jreback

jreback May 16, 2017

Contributor

Heh, I think it's better locally

@jorisvandenbossche

Member

jorisvandenbossche commented May 16, 2017

Ran the benchmarks with the latest version of this PR:

Against master:

$ asv continuous upstream/master HEAD -b time_multiindex
...
     before      after     ratio
  [0ea0f25b] [c29dab95]
-     2.73s   298.28ms      0.11  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Against 0.19.2 (which combines the slowdown/improvement of the several PRs), this also gives an improvement now for that benchmark:

$ asv continuous v0.19.2 HEAD -b time_multiindex
...
    before     after       ratio
  [6f525eec] [c29dab95]
-  744.29ms   334.65ms      0.45  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
-  716.03ms   175.14ms      0.24  indexing.MultiIndexing.time_multiindex_large_get_loc
-  736.39ms   166.29ms      0.23  indexing.MultiIndexing.time_multiindex_get_indexer
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
+ # for tz-aware datetimes, we need the underlying naive UTC value and
+ # not the tz aware object or pd extension type (as
+ # infer_dtype_from_scalar would do)
+ if not isinstance(val, tslib.Timestamp):

@jreback

jreback May 16, 2017

Contributor

see #16372 (for later)

@jreback jreback added this to the 0.20.2 milestone May 16, 2017

@jreback

Contributor

jreback commented May 16, 2017

@jorisvandenbossche lgtm. merge when ready (for 0.20.2) is good.

@jreback jreback closed this May 16, 2017

@jreback jreback reopened this May 16, 2017

@jorisvandenbossche jorisvandenbossche merged commit 34ebad8 into pandas-dev:master May 17, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 50%)
Details
codecov/project 90.37% (target 82%)
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jorisvandenbossche

Member

jorisvandenbossche commented May 17, 2017

@jreback thanks for the review!

@jreback

Contributor

jreback commented May 17, 2017

nice fix @jorisvandenbossche !

pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017

Merge remote-tracking branch 'upstream/master' into fix-panel-fillna
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (#16374)
  TST: followup to #16364, catch errstate warnings (#16373)
  DOC: new oauth token
  TST: Add test for clip-na (#16369)
  ENH: Draft metadata specification doc for Apache Parquet (#16315)
  MAINT: Add .iml to .gitignore (#16368)
  BUG/API: Categorical constructor scalar categories (#16340)
  ENH: Provide dict object for to_dict() #16122 (#16220)
  PERF: improved clip performance (#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (#16348)
  ...

pcluo added a commit to pcluo/pandas that referenced this pull request May 22, 2017

PERF: improve MultiIndex get_loc performance (#16346)
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017

PERF: improve MultiIndex get_loc performance (#16346)
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)

TomAugspurger added a commit that referenced this pull request May 30, 2017

PERF: improve MultiIndex get_loc performance (#16346)
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples
(cherry picked from commit 34ebad8)

stangirala added a commit to stangirala/pandas that referenced this pull request Jun 11, 2017

PERF: improve MultiIndex get_loc performance (#16346)
* PERF: improve hash collision check for single MI labels
* PERF: specialized hash function for single tuples

@jorisvandenbossche jorisvandenbossche deleted the jorisvandenbossche:perf-mi-get_loc-hash branch Nov 27, 2017
