PERF/REF: improve performance of Series.searchsorted, PandasArray.searchsorted, collect functionality #22034
NumPy's searchsorted is slow when the values to search for aren't of the same dtype as the array being searched. By ensuring that the input value has the same dtype as the underlying array, the search is sped up significantly.
>>> n = 1_000_000
>>> s = pd.Series(( * n +  * n +  * n), dtype='int8')
>>> %timeit s.searchsorted(1) # python int
15.2 ms # master
9.75 µs # this PR
The improvements are largest when the dtype isn't int64 or float64, but for those cases the improvement is also significant (10x).
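To make the idea concrete, here is a minimal sketch in plain NumPy. It is not the code in this PR: `searchsorted_same_dtype` is a hypothetical helper, and it assumes the scalar already fits in the array's dtype (no bounds checking).

```python
import numpy as np


def searchsorted_same_dtype(arr, value, side="left"):
    """Hypothetical helper: cast a scalar to arr's dtype, then search.

    Assumes `value` is representable in arr.dtype; bounds are not checked.
    """
    value = arr.dtype.type(value)  # e.g. Python int -> np.int8
    return arr.searchsorted(value, side=side)


arr = np.array([1, 2, 2, 3], dtype="int8")
left = searchsorted_same_dtype(arr, 2)
right = searchsorted_same_dtype(arr, 2, side="right")
```

Because the scalar and the array now share a dtype, NumPy can compare them directly instead of converting the array to a common type first.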
@@            Coverage Diff             @@
##           master   #22034      +/-   ##
==========================================
+ Coverage   91.73%   91.73%   +<.01%
==========================================
  Files         173      173
  Lines       52848    52869      +21
==========================================
+ Hits        48482    48502      +20
- Misses       4366     4367       +1
EDIT: New implementation before I go.
A few issues:
Tests (Series only atm) are still good, see below. I don't expect ASVs for Int64 and Float64 to show any meaningful change, while UInt64Index should get a nice improvement.
I'll get back to this in about two weeks, probably. Comments obviously welcome until then.
In the latest commit, I'm automatically casting ints and uints to the correct dtype, but I also do some overflow checks, so we're not downcasting out-of-bounds values. I think this is a good solution and is very fast.
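Roughly, the integer handling could look like the sketch below. The names `cast_int_for_search` and `safe_searchsorted` are made up for illustration and are not the functions in the commit; the assumption is that out-of-bounds or non-integer values are simply left uncast, so NumPy falls back to its usual (slower) type promotion.

```python
import numpy as np


def cast_int_for_search(arr, value):
    """Hypothetical helper: cast an integer scalar to arr's integer dtype,
    but only if it fits. Anything else is returned unchanged."""
    if not isinstance(value, (int, np.integer)) or isinstance(value, bool):
        return value  # not an integer scalar: leave it alone
    info = np.iinfo(arr.dtype)
    if info.min <= value <= info.max:
        return arr.dtype.type(value)  # safe: no overflow possible
    return value  # out of bounds: do not downcast


def safe_searchsorted(arr, value, side="left"):
    return arr.searchsorted(cast_int_for_search(arr, value), side=side)


arr = np.array([-5, 0, 7, 100], dtype="int8")
in_bounds = safe_searchsorted(arr, 7)       # cast to int8, fast path
out_of_bounds = safe_searchsorted(arr, 300)  # 300 does not fit in int8
```

The bounds check uses `np.iinfo`, so the same helper works for signed and unsigned integer dtypes.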
But downcasting floats is more confusing, i.e. if the input is float64 and the array is float32 or float16, should we downcast?
Some options for floats:
I'm favoring option 4 or 5, but edge toward 5.
The reason is that in many use cases performance is really not the user's prime concern, and they just want things to work with minimal effort. Other users (beginners) may not understand types that well. So emitting dtype warnings would annoy/confuse such users. So, IMO option 5 would be best.
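As a small illustration of why the float question above is less clear-cut than the integer case, downcasting a float64 value to float32 can change the result, since the cast rounds the value. This is only a sketch with an example value I picked (0.1); it is not taken from the PR's tests.

```python
import numpy as np

arr = np.array([0.1], dtype="float32")
value = np.float64(0.1)

# Without downcasting, NumPy compares in float64. float32(0.1) is slightly
# larger than float64(0.1), so the value sorts before the only element.
uncast = arr.searchsorted(value, side="right")

# With downcasting, the value rounds to exactly the stored element,
# so side="right" places it after that element.
downcast = arr.searchsorted(np.float32(value), side="right")
```

Here the two calls give different insertion points (0 without the cast, 1 with it), which is the kind of surprise any automatic float downcasting would need to account for.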