PERF: casting loc to labels dtype before searchsorted #14551
@@ -1907,6 +1907,7 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels):
     return np.array(labels == loc, dtype=bool)
 else:
     # sorted, so can return slice object -> view
+    loc = labels.dtype.type(loc)
jreback
Nov 1, 2016
Contributor
Maybe this should be done like:
    loc, orig_loc = labels.dtype.type(loc), loc
    if loc != orig_loc:
        loc = orig_loc
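A minimal sketch of the round-trip check suggested above, written as a standalone helper. The name cast_if_lossless and the sample labels array are hypothetical and not part of pandas. The OverflowError branch is an assumption about NumPy behaviour: some NumPy versions raise for out-of-range Python ints instead of wrapping, so both cases are handled.

```python
import numpy as np


def cast_if_lossless(labels, loc):
    """Cast loc to labels.dtype, falling back to loc if the value would change.

    Hypothetical helper for illustration only.
    """
    orig_loc = loc
    try:
        loc = labels.dtype.type(loc)
    except OverflowError:
        # Some NumPy versions raise for out-of-range Python ints
        # instead of wrapping around; keep the original value then.
        return orig_loc
    if loc != orig_loc:
        # The cast changed the value (wrap-around or truncation).
        return orig_loc
    return loc


labels = np.array([0, 1, 1, 2], dtype=np.int16)
in_range = cast_if_lossless(labels, 1)            # becomes an int16 scalar
out_of_range = cast_if_lossless(labels, 100000)   # does not fit in int16
```

With this check the cast is only used when it preserves the value, so a lookup for a value outside the dtype's range behaves as it did before the cast was introduced.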
jorisvandenbossche
Nov 1, 2016
Owner
Yeah, I am not sure how much checking would be needed here.
My understanding is that it probably is not needed in this case, as loc is coming from loc = level_index.get_loc(key), and labels and level_index are from the same MultiIndex. So I would assume that get_loc can only return an existing label, and so should fit in the dtype of labels?
(but doing the check is probably not much of a performance issue either)
jorisvandenbossche
Nov 1, 2016
Owner
Apparently not :-)
But only a partial indexing test fails. loc can also be a slice, and in partial indexing it is expected that searchsorted raises an error for it; now dtype.type(loc) already raises a slightly different error before that point. That can easily be solved with:
    try:
        loc = labels.dtype.type(loc)
    except TypeError:
        # this occurs when loc is a slice (partial string indexing)
        # but the TypeError raised by searchsorted in this case
        # is caught in Index._has_valid_type()
        pass
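The guarded cast above can be sketched outside pandas as follows. This is an illustration under assumptions, not the merged code: cast_loc_to_labels_dtype and the sample labels array are hypothetical, and the slice lookup at the end only imitates the "sorted, so can return slice object" branch from the diff.

```python
import numpy as np


def cast_loc_to_labels_dtype(labels, loc):
    """Return loc as a scalar of labels.dtype when loc is a plain scalar.

    Hypothetical helper for illustration only.
    """
    try:
        return labels.dtype.type(loc)
    except TypeError:
        # loc is not a scalar (for example a slice); return it unchanged
        # so that later code raises or handles it as before.
        return loc


# Sorted int16 labels, standing in for one level of a MultiIndex.
labels = np.array([0, 0, 1, 1, 2, 2], dtype=np.int16)

loc = cast_loc_to_labels_dtype(labels, 1)
start = labels.searchsorted(loc, side="left")
stop = labels.searchsorted(loc, side="right")
result = slice(start, stop)  # positions of all entries equal to loc

# A slice is passed through untouched.
partial = slice(0, 2)
passed_through = cast_loc_to_labels_dtype(labels, partial)
```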
jorisvandenbossche referenced this pull request on Nov 1, 2016: Pandas 0.12 is much faster than Pandas 0.18 #14549 (closed)
codecov-io commented Nov 1, 2016
Current coverage is 85.26% (diff: 100%)
@@             master   #14551   diff @@
==========================================
  Files           140      140
  Lines         50672    50676     +4
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43207    43211     +4
  Misses         7465     7465
  Partials          0        0
jreback added the Indexing and Performance labels on Nov 1, 2016
jreback added this to the 0.19.1 milestone on Nov 1, 2016
lgtm. just release note. I think this can only help :>
OK, added whatsnew notice. Will merge, but will open a new issue for follow-up on this (other cases, benchmarks that capture this (at the moment there are none)).
jorisvandenbossche merged commit 1d95179 into pandas-dev:master on Nov 2, 2016
jorisvandenbossche referenced this pull request on Nov 2, 2016: PERF: better use of searchsorted for indexing performance #14565 (open)
jorisvandenbossche added a commit that referenced this pull request on Nov 3, 2016: a95ce63
yarikoptic added a commit to neurodebian/pandas that referenced this pull request on Nov 18, 2016: dd3759d
amolkahat added a commit to amolkahat/pandas that referenced this pull request on Nov 26, 2016: 87ee289 (jorisvandenbossche + amolkahat)
jorisvandenbossche commented Nov 1, 2016 (edited)
I was intrigued by the profiling results of the example below (MultiIndex loc indexing, based on the example in #14549), where searchsorted seemed to take the majority of the computation time.
It seems that searchsorted casts both inputs (in this case labels and loc) to a common dtype. The labels of the MultiIndex were int16 in this case, while loc (the output from Index.get_loc) is a Python int.
By casting loc to the dtype of labels, the specific example gets a ca. 20x speed improvement.
On master:
With this PR:
Just putting it here as a showcase.
The actual PR probably needs some more work (other places where this can be done, whether loc can ever be out of bounds for that dtype, benchmarks, ...).
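A small standalone sketch of the effect described above, using a plain NumPy array in place of the MultiIndex labels. The array size, the value of loc and the repeat count are arbitrary assumptions, and this is not the benchmark from the PR. Whether the cast changes the timing at all depends on the NumPy version, so the sketch only prints the two measurements and makes no claim about their ratio.

```python
import timeit

import numpy as np

# Sorted int16 codes, standing in for the labels of one MultiIndex level.
labels = np.repeat(np.arange(1000, dtype=np.int16), 1000)

loc = 500                          # Python int, as returned by a lookup
loc_cast = labels.dtype.type(loc)  # same value as an int16 scalar

# Both calls find the same insertion point.
idx_plain = labels.searchsorted(loc, side="left")
idx_cast = labels.searchsorted(loc_cast, side="left")

# Timings vary by machine and NumPy version.
t_plain = timeit.timeit(lambda: labels.searchsorted(loc), number=100)
t_cast = timeit.timeit(lambda: labels.searchsorted(loc_cast), number=100)
print(f"python int: {t_plain:.6f}s, int16 scalar: {t_cast:.6f}s")
```

The point of the comparison is that the result is identical either way, so the cast is purely a performance measure.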