PERF/REF: improve performance of Series.searchsorted, PandasArray.searchsorted, collect functionality #22034

topper-123 · 2018-07-24T01:15:02Z

closes PERF: better use of searchsorted for indexing performance #14565
ASV's added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

NumPy's searchsorted is slow when the values to search for don't have the same dtype as the array being searched. Ensuring that the input value has the same dtype as the underlying array speeds up the search significantly.

>>> n = 1_000_000
>>> s = pd.Series(([1] * n + [2] * n + [3] * n), dtype='int8')
>>> %timeit s.searchsorted(1)  # python int
15.2 ms  # master
9.75 µs  # this PR
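
For illustration, the idea can be sketched directly on a NumPy array. This is a minimal sketch, not the code in this PR: `searchsorted_same_dtype` is a hypothetical helper name, and it assumes the key fits in the array's dtype.

```python
import numpy as np


def searchsorted_same_dtype(arr, value, side="left"):
    """Hypothetical helper: cast `value` to arr.dtype, then search.

    Assumes `value` is representable in arr.dtype; no bounds check here.
    """
    value = np.asarray(value, dtype=arr.dtype)
    return arr.searchsorted(value, side=side)


n = 1000
arr = np.array([1] * n + [2] * n + [3] * n, dtype="int8")

pos = searchsorted_same_dtype(arr, 2)              # key cast to int8 first
many = searchsorted_same_dtype(arr, [1, 3])        # list-like keys work too
right = searchsorted_same_dtype(arr, 3, side="right")
```

Because the key already has the array's dtype, NumPy has no reason to convert the whole array before searching.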

ASV results

      before           after         ratio
     [b9754556]       [fd6c117b]
-        93.9±9μs         10.9±0μs     0.12  series_methods.SearchSorted.time_searchsorted('int64')
-        95.4±0μs         10.9±2μs     0.11  series_methods.SearchSorted.time_searchsorted('float64')
-        140±10μs       10.7±0.4μs     0.08  series_methods.SearchSorted.time_searchsorted('str')
-      1.34±0.1ms       10.8±0.3μs     0.01  series_methods.SearchSorted.time_searchsorted('uint16')
-     1.52±0.03ms         10.7±0μs     0.01  series_methods.SearchSorted.time_searchsorted('int16')
-      1.79±0.1ms         12.5±2μs     0.01  series_methods.SearchSorted.time_searchsorted('int8')
-      1.46±0.1ms       10.2±0.8μs     0.01  series_methods.SearchSorted.time_searchsorted('int32')
-        1.46±0ms         9.54±1μs     0.01  series_methods.SearchSorted.time_searchsorted('uint8')
-      1.46±0.1ms         9.38±2μs     0.01  series_methods.SearchSorted.time_searchsorted('float32')
-      1.71±0.2ms       10.4±0.9μs     0.01  series_methods.SearchSorted.time_searchsorted('uint32')
-      1.82±0.1ms         10.7±0μs     0.01  series_methods.SearchSorted.time_searchsorted('uint64')
-     2.14±0.04ms         10.7±0μs     0.00  series_methods.SearchSorted.time_searchsorted('float16')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

The improvements are largest when the dtype isn't int64 or float64, but even for those two dtypes the improvement is significant (roughly 10x).

topper-123 · 2018-07-24T01:20:35Z

pandas/core/series.py

-                                         side=side, sorter=sorter)
+        if not is_extension_type(self._values):
+            value = np.asarray(value, dtype=self._values.dtype)
+            value = value[..., np.newaxis] if value.ndim == 0 else value


Without this shim, searchsorted returns a scalar, so this line only ensures that a 1-dim array is returned, as in master.

I prefer the numpy convention of returning a scalar from searchsorted when possible, but that can be for another PR.
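
A small sketch of what the shim does, using plain NumPy only (the variable names are illustrative, not taken from the PR):

```python
import numpy as np

arr = np.array([1, 2, 3], dtype="int8")

# A scalar key becomes a 0-dim array after np.asarray.
value = np.asarray(2, dtype=arr.dtype)

# Same expression as in the diff: add an axis only for 0-dim input.
value_1d = value[..., np.newaxis] if value.ndim == 0 else value

scalar_result = arr.searchsorted(value)     # 0-dim key -> scalar result
array_result = arr.searchsorted(value_1d)   # 1-dim key -> 1-dim result
```

The only difference between the two calls is the shape of the result, which is what keeps the return type the same as in master.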

can you add a comment here on why you are doing this (and the expl above as well)

topper-123 · 2018-07-24T09:38:33Z

The appveyor failure is a resource error, so unrelated to this PR.

jreback

ping on green.

jreback · 2018-07-24T22:04:06Z

pandas/core/series.py

-                                         side=side, sorter=sorter)
+        if not is_extension_type(self._values):
+            value = np.asarray(value, dtype=self._values.dtype)
+            value = value[..., np.newaxis] if value.ndim == 0 else value


can you add a comment here on why you are doing this (and the expl above as well)

codecov · 2018-07-25T09:29:33Z

Codecov Report

Merging #22034 into master will increase coverage by <.01%.
The diff coverage is 96.29%.

@@            Coverage Diff             @@
##           master   #22034      +/-   ##
==========================================
+ Coverage   91.73%   91.73%   +<.01%     
==========================================
  Files         173      173              
  Lines       52848    52869      +21     
==========================================
+ Hits        48482    48502      +20     
- Misses       4366     4367       +1

Flag	Coverage Δ
#multiple	`90.3% <96.29%> (ø)`	⬆️
#single	`41.72% <66.66%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/base.py	`98.25% <ø> (ø)`	⬆️
pandas/core/series.py	`93.68% <100%> (-0.02%)`	⬇️
pandas/core/base.py	`97.76% <100%> (ø)`	⬆️
pandas/core/arrays/numpy_.py	`93.66% <100%> (+0.14%)`	⬆️
pandas/core/algorithms.py	`94.77% <94.73%> (-0.01%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5557e36...bcbe226.

topper-123 · 2018-07-25T14:08:43Z

green.

jreback · 2018-07-26T12:38:17Z

pandas/core/series.py

@@ -2077,8 +2077,14 @@ def __rmatmul__(self, other):
    def searchsorted(self, value, side='left', sorter=None):
        if sorter is not None:
            sorter = ensure_platform_int(sorter)
-        return self._values.searchsorted(Series(value)._values,
-                                         side=side, sorter=sorter)
+        if not is_extension_type(self._values):


hmm, this is orphaning pandas.core.base.searchsorted e.g. Index.searchsorted calls this. can you fix this up? (and add an asv for same). I think ok to make this if clause a helper function, maybe in algos.py and just call it in both places.

Great idea consolidating the two methods, I'll do that.

Now, there won't be a speed difference for Int64Index from this. But UInt64Index will benefit from a common implementation, as it also experiences the casting problem today:

>>> i = pd.Index([1] * n + [2] * n + [3] * 3, dtype='uint64')
>>> %timeit i.searchsorted(2)
1.11 ms ± 4.98 µs per loop

So yeah, a common implementation is a great idea. I'll get to it.

pep8speaks · 2018-08-01T17:34:28Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 23, 2019 at 20:27 Hours UTC

topper-123 · 2018-08-01T17:37:55Z

EDIT: New implementation before I go.

A few issues:

- I've decided not to downcast float64 to float32/16. Downcasting may lose precision and therefore give wrong results.
- Tests should be added for when value is out of bounds for an integer dtype.
- Where do I put ASVs for Index.searchsorted?

Tests (Series only atm) are still good, see below. I don't expect ASVs for Int64 and Float64 to show any meaningful change, while UInt64Index should get a nice improvement.

      before           after         ratio
     [9c118668]       [67efbda4]
-        77.7±1μs         24.4±1μs     0.31  series_methods.SearchSorted.time_searchsorted('int64')
-        80.1±6μs       10.7±0.8μs     0.13  series_methods.SearchSorted.time_searchsorted('float64')
-         107±2μs         9.94±1μs     0.09  series_methods.SearchSorted.time_searchsorted('str')
-      1.56±0.3ms         26.8±2μs     0.02  series_methods.SearchSorted.time_searchsorted('int32')
-        1.42±0ms         23.2±2μs     0.02  series_methods.SearchSorted.time_searchsorted('uint16')
-      1.45±0.1ms         23.2±1μs     0.02  series_methods.SearchSorted.time_searchsorted('uint32')
-     1.42±0.03ms         22.0±0μs     0.02  series_methods.SearchSorted.time_searchsorted('int16')
-        1.56±0ms       22.1±0.4μs     0.01  series_methods.SearchSorted.time_searchsorted('uint8')
-     1.71±0.09ms         22.0±1μs     0.01  series_methods.SearchSorted.time_searchsorted('uint64')
-        1.95±0ms         18.9±2μs     0.01  series_methods.SearchSorted.time_searchsorted('int8')
-      1.34±0.1ms       9.94±0.7μs     0.01  series_methods.SearchSorted.time_searchsorted('float32')
-        2.08±0ms         9.38±0μs     0.00  series_methods.SearchSorted.time_searchsorted('float16')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'll get back to this in about two weeks, probably. Comments are obviously welcome until then.

topper-123 · 2018-08-01T23:59:27Z

pandas/core/common.py

+    value_arr = np.array([value]) if is_scalar(value) else np.array(value)
+    if (value_arr < iinfo.min).any() or (value_arr > iinfo.max).any():
+        msg = "Value {} out of bound for dtype {}".format(value, dtype)
+        raise ValueError(msg)


Notice I raise here when value is out of bounds. I think this is reasonable behaviour and avoids some overflow surprises.
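
As a rough, self-contained sketch of this kind of check (not the exact code in pandas/core/common.py; `cast_int_key` is a hypothetical name, and the error message simply follows the snippet above):

```python
import numpy as np


def cast_int_key(value, dtype):
    """Hypothetical helper: cast an integer key to `dtype`, raising if
    any element falls outside the range that dtype can represent."""
    dtype = np.dtype(dtype)
    iinfo = np.iinfo(dtype)
    # Treat scalars and list-likes uniformly for the bounds check.
    value_arr = np.atleast_1d(np.asarray(value))
    if (value_arr < iinfo.min).any() or (value_arr > iinfo.max).any():
        msg = "Value {} out of bound for dtype {}".format(value, dtype)
        raise ValueError(msg)
    return np.asarray(value, dtype=dtype)


ok = cast_int_key(100, "int8")        # fits in int8
try:
    cast_int_key(300, "int8")         # int8 holds -128..127
    raised = False
except ValueError:
    raised = True
```

Checking before casting means an out-of-range key is reported explicitly instead of wrapping around silently.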

topper-123 · 2018-08-02T00:01:11Z

pandas/core/series.py

+
+        if is_scalar(result):
+            # ensure that a 1-dim array is returned
+            result = np.array([result])


This bit is just to maintain backward compatibility, so that Series.searchsorted always returns a 1-dim array. I'd like to remove this in a future PR, if there's agreement on that.

topper-123 · 2018-08-11T11:58:44Z

In the latest commit, I'm automatically casting ints and uints to the correct dtype, but also doing some overflow checks, so we're not downcasting out-of-bound values. I think this is a good solution and is very fast.

But downcasting floats is more confusing, i.e. if the input is float64 and the array is float32 or float16, should we downcast?

Some options for floats:

1. Throw an error if the input is not dtype compatible with the array.
2. Silently force casting of the input to the dtype of the array (fastest, but risks wrong results?)
3. Keep current behaviour, i.e. just delegate the decision to numpy (slow, because we force upcasting of the array for float32 and float16)
4. Keep current behaviour, but emit a warning if input and array are not dtype compatible (slow, but allows users to fix the speed problem themselves; the warnings can be annoying)
5. Keep current behaviour, but add a boolean option to pd.options to emit a warning if input and array are not dtype compatible, and add a suitable explanation of the issue to the searchsorted doc string (like option 4, but avoids the annoying warnings)

I'm favoring option 4 or 5, but lean toward 5.

The reason is that in many use cases performance is really not the user's prime concern, and they just want things to work with minimal effort. Other users (beginners) may not understand types that well. So emitting dtype warnings would annoy/confuse such users. So, IMO option 5 would be best.
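
To make the risk of silently casting the input (the second option above) concrete, here is a small NumPy-only sketch. The values are chosen for illustration, and the explicit `astype(np.float64)` stands in for the upcasting of the array mentioned above:

```python
import numpy as np

# float16 values near 1.0 are spaced 2**-10 apart, so the middle
# element is the next representable float16 after 1.0.
arr = np.array([1.0, 1.0009765625, 2.0], dtype=np.float16)
key = np.float64(1.0004)  # lies strictly between arr[0] and arr[1]

# Compare in float64: the key sorts after arr[0].
exact_pos = arr.astype(np.float64).searchsorted(key)

# Downcast the key instead: 1.0004 rounds to 1.0 in float16,
# so the key now ties with arr[0] and sorts before it.
downcast_pos = arr.searchsorted(np.float16(key))
```

Here the two approaches disagree (`exact_pos` is 1, `downcast_pos` is 0), which is the kind of wrong result a forced downcast can produce.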

jreback · 2018-08-11T13:27:25Z

don’t do anything for floats
ints are worth it because we commonly have non int64 and you do search on them

for floats you very rarely search and all of the above are too complex

topper-123 · 2019-02-03T11:47:00Z

The Azure failures seem unrelated to this PR, as I don't touch I/O or the html repr:

··· frame_methods.Repr.time_html_repr_trunc_mi                  failed
2019-02-03T09:40:31.5401600Z [ 11.57%] ···· Traceback (most recent call last):
2019-02-03T09:40:31.5401919Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 1039, in main_run_server
2019-02-03T09:40:31.5402018Z                    main_run(run_args)
2019-02-03T09:40:31.5402334Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 913, in main_run
2019-02-03T09:40:31.5402415Z                    result = benchmark.do_run()
2019-02-03T09:40:31.5402718Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 412, in do_run
2019-02-03T09:40:31.5402779Z                    return self.run(*self._current_params)
2019-02-03T09:40:31.5403097Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 506, in run
2019-02-03T09:40:31.5403275Z                    min_run_count=self.min_run_count)
2019-02-03T09:40:31.5403615Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/asv/benchmark.py", line 569, in benchmark_timing
2019-02-03T09:40:31.5403707Z                    timing = timer.timeit(number)
2019-02-03T09:40:31.5403992Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/timeit.py", line 176, in timeit
2019-02-03T09:40:31.5404050Z                    timing = self.inner(it, self.timer)
2019-02-03T09:40:31.5404485Z                  File "<timeit-src>", line 6, in inner
2019-02-03T09:40:31.5404530Z                  File "/home/vsts/work/1/s/asv_bench/benchmarks/frame_methods.py", line 226, in time_html_repr_trunc_mi
2019-02-03T09:40:31.5404570Z                    self.df3._repr_html_()
2019-02-03T09:40:31.5404627Z                  File "/home/vsts/work/1/s/pandas/core/frame.py", line 651, in _repr_html_
2019-02-03T09:40:31.5404671Z                    import IPython
2019-02-03T09:40:31.5404898Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/IPython/__init__.py", line 55, in <module>
2019-02-03T09:40:31.5404967Z                    from .terminal.embed import embed
2019-02-03T09:40:31.5405198Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/IPython/terminal/embed.py", line 16, in <module>
2019-02-03T09:40:31.5405245Z                    from IPython.terminal.interactiveshell import TerminalInteractiveShell
2019-02-03T09:40:31.5405502Z                  File "/home/vsts/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/IPython/terminal/interactiveshell.py", line 81, in <module>
2019-02-03T09:40:31.5405705Z                    if not _stream or not hasattr(_stream, 'isatty') or not _stream.isatty():
2019-02-03T09:40:31.5405746Z                ValueError: I/O operation on closed file
2019-02-03T09:40:31.5405794Z

I don't think this is related to this PR.

jreback · 2019-02-23T18:37:43Z

asv_bench/benchmarks/series_methods.py

+              'float16', 'float32', 'float64',
+              'str']
+    param_names = ['dtype']
+


for a followup can add EA types here (Int8 and so on)

jreback · 2019-02-23T18:38:11Z

doc/source/whatsnew/v0.25.0.rst

@@ -64,7 +64,8 @@ Performance Improvements

 - Significant speedup in `SparseArray` initialization that benefits most operations, fixing performance regression introduced in v0.20.0 (:issue:`24985`)
 - `DataFrame.to_stata()` is now faster when outputting data with any string or non-native endian columns (:issue:`25045`)
-
+- Improved performance of :meth:`Series.searchsorted`. The speedup is especially large when the dtype is
+  int8/int16/int32 and the searched key is within the integer bounds for the dtype(:issue:`22034`)


need a space before the parens

jreback · 2019-02-23T18:40:21Z

pandas/core/algorithms.py

+    ----------
+    arr: numpy.array or ExtensionArray
+        array to search in. Cannot be Index, Series or PandasArray, as that
+        would cause a RecursionError.


not sure what this is referring to. why is this not an array-like here?

Yes, this text is wrong now; it can indeed be array-like.

jreback · 2019-02-23T18:43:27Z

pandas/tests/arrays/test_array.py

+        expected = np.array([1, 2], dtype=np.intp)
+        tm.assert_numpy_array_equal(result, expected)
+
+    def test_search_sorted_datetime64_scalar(self):


can you test for timedelta & datetime w/tz as well

topper-123 · 2019-02-23T21:42:23Z

Comments addressed.

jreback · 2019-02-24T03:40:14Z

thanks @topper-123

* Correct contribution guide docbuild instruction (pandas-dev#25479) * TST/REF: Add pytest idiom to test_frequencies.py (pandas-dev#25430) * BUG: Fix index type casting in read_json with orient='table' and float index (pandas-dev#25433) (pandas-dev#25434) * BUG: Groupby.agg with reduction function with tz aware data (pandas-dev#25308) * BUG: Groupby.agg cannot reduce with tz aware data * Handle output always as UTC * Add whatsnew * isort and add another fixed groupby.first/last issue * bring condition at a higher level * Add try for _try_cast * Add comments * Don't pass the utc_dtype explicitly * Remove unused import * Use string dtype instead * DOC: Fix docstring for read_sql_table (pandas-dev#25465) * ENH: Add Series.str.casefold (pandas-dev#25419) * Fix PR10 error and Clean up docstrings from functions related to RT05 errors (pandas-dev#25132) * Fix unreliable test (pandas-dev#25496) * DOC: Clarifying doc/make.py --single parameter (pandas-dev#25482) * fix MacPython / pandas-wheels ci failures (pandas-dev#25505) * DOC: Reword Series.interpolate docstring for clarity (pandas-dev#25491) * Changed insertion order to sys.path (pandas-dev#25486) * TST: xfail non-writeable pytables tests with numpy 1.16x (pandas-dev#25517) * STY: use pytest.raises context manager (arithmetic, arrays, computati… (pandas-dev#25504) * BUG: Fix RecursionError during IntervalTree construction (pandas-dev#25498) * STY: use pytest.raises context manager (plotting, reductions, scalar...) (pandas-dev#25483) * STY: use pytest.raises context manager (plotting, reductions, scalar...) * revert removed testing in test_timedelta.py * remove TODO from test_frame.py * skip py2 ci failure * BUG: Fix potential segfault after pd.Categorical(pd.Series(...), categories=...) (pandas-dev#25368) * Make DataFrame.to_html output full content (pandas-dev#24841) * BUG-16807-1 SparseFrame fills with default_fill_value if data is None (pandas-dev#24842) Closes pandas-devgh-16807. 
* DOC: Add conda uninstall pandas to contributing guide (pandas-dev#25490) * fix pandas-dev#25487 add modify documentation * fix segfault when running with cython coverage enabled, xref cython#2879 (pandas-dev#25529) * TST: inline empty_frame = DataFrame({}) fixture (pandas-dev#24886) * DOC: Polishing typos out of doc/source/user_guide/indexing.rst (pandas-dev#25528) * STY: use pytest.raises context manager (frame) (pandas-dev#25516) * DOC: Fix pandas-dev#24268 by updating description for keep in Series.nlargest (pandas-dev#25358) * DOC: Fix pandas-dev#24268 by updating description for keep * fix MacPython / pandas-wheels ci failures (pandas-dev#25537) * TST/CLN: Remove more Panel tests (pandas-dev#25550) * BUG: caught typeError in series.at (pandas-dev#25506) (pandas-dev#25533) * ENH: Add errors parameter to DataFrame.rename (pandas-dev#25535) * ENH: GH13473 Add errors parameter to DataFrame.rename * TST: Skip IntervalTree construction overflow test on 32bit (pandas-dev#25558) * DOC: Small fixes to 0.24.2 whatsnew (pandas-dev#25559) * minor typo error (pandas-dev#25574) * BUG: in error message raised when invalid axis parameter (pandas-dev#25553) * BLD: Fixed pip install with no numpy (pandas-dev#25568) * Document the behavior of `axis=None` with `style.background_gradient` (pandas-dev#25551) * fix minor typos in dsintro.rst (pandas-dev#25579) * BUG: Handle readonly arrays in period_array (pandas-dev#25556) * BUG: Handle readonly arrays in period_array Closes pandas-dev#25403 * DOC: Fix typo in tz_localize (pandas-dev#25598) * BUG: secondary y axis could not be set to log scale (pandas-dev#25545) (pandas-dev#25586) * TST: add test for groupby on list of empty list (pandas-dev#25589) * TYPING: Small fixes to make stubgen happy (pandas-dev#25576) * CLN: Parmeterize test cases (pandas-dev#25355)

topper-123 force-pushed the searchsorted_perf branch from 0aa112b to 96b79cc Compare July 24, 2018 01:16

topper-123 commented Jul 24, 2018

gfyoung added the Performance (memory or execution speed performance) and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Jul 24, 2018

gfyoung requested a review from jreback July 24, 2018 04:23

jreback requested changes Jul 24, 2018

jreback added this to the 0.24.0 milestone Jul 24, 2018

jreback requested changes Jul 26, 2018

topper-123 force-pushed the searchsorted_perf branch from 353fb02 to 0c01d60 Compare August 1, 2018 17:34

topper-123 force-pushed the searchsorted_perf branch 3 times, most recently from 67efbda to dfe264a Compare August 1, 2018 23:52

topper-123 commented Aug 1, 2018

topper-123 commented Aug 2, 2018

topper-123 force-pushed the searchsorted_perf branch from dfe264a to f4961a9 Compare August 11, 2018 12:39

topper-123 force-pushed the searchsorted_perf branch 8 times, most recently from fb17187 to c4bde9f Compare August 14, 2018 23:19

jreback added this to the 0.25.0 milestone Feb 2, 2019

topper-123 force-pushed the searchsorted_perf branch 2 times, most recently from 42b4e7c to 8248536 Compare February 3, 2019 09:08

topper-123 force-pushed the searchsorted_perf branch 4 times, most recently from 45aea02 to 07291ea Compare February 9, 2019 18:56

tp and others added 8 commits February 23, 2019 11:30

improve performance of Series.searchsorted

6ad3f12

added explanation

60742c3

Make common impl. with Index.searchsorted

672802d

Simplify implementation

c1a337c

rebase

686a0a1

collect into one function

ea8280e

move searchsorted to algorithms.py

a9905fd

Guard against IntegerArray + cleanups

9e6ed43

topper-123 force-pushed the searchsorted_perf branch from 07291ea to 9e6ed43 Compare February 23, 2019 11:31

jreback requested changes Feb 23, 2019

cleanups

bcbe226

topper-123 force-pushed the searchsorted_perf branch from 0b53050 to bcbe226 Compare February 23, 2019 20:27

jreback approved these changes Feb 24, 2019

jreback merged commit df039bf into pandas-dev:master Feb 24, 2019

topper-123 deleted the searchsorted_perf branch February 24, 2019 08:14

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

PERF/REF: improve performance of Series.searchsorted, PandasArray.searchsorted, collect functionality (pandas-dev#22034)

3b47037

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

PERF/REF: improve performance of Series.searchsorted, PandasArray.searchsorted, collect functionality (pandas-dev#22034)

47c430f

topper-123 mentioned this pull request May 17, 2019

Speed problem for searchsorted when different integer dtypes numpy/numpy#13579

Open
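The idea behind this PR and the linked NumPy issue can be illustrated with a minimal sketch: cast the value being searched for to the dtype of the sorted array before calling `searchsorted`, so that both sides share a dtype. This is not the code merged here. The helper name `searchsorted_same_dtype` is hypothetical, and the sketch only handles integer scalars on integer arrays.

```python
import numpy as np


def searchsorted_same_dtype(arr, value, side="left"):
    """Hypothetical helper: search a sorted integer array for an integer
    scalar, casting the scalar to arr.dtype first when it fits."""
    if arr.dtype.kind in "iu" and isinstance(value, (int, np.integer)):
        info = np.iinfo(arr.dtype)
        if info.min <= int(value) <= info.max:
            # Give the scalar the array's dtype so both sides match.
            value = arr.dtype.type(value)
    # Values that do not fit (or are not integers) use NumPy's own handling.
    return arr.searchsorted(value, side=side)


arr = np.array([1] * 5 + [2] * 5 + [3] * 5, dtype="int8")
left = searchsorted_same_dtype(arr, 2)                 # first position of 2
right = searchsorted_same_dtype(arr, 2, side="right")  # one past the last 2
```

The range check is there because a blind cast of an out-of-range value to a small integer dtype would either fail or wrap around, so the sketch falls back to the plain call in that case.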

topper-123 commented Jul 24, 2018 (edited)

topper-123 Jul 24, 2018

jreback Jul 24, 2018

topper-123 commented Jul 24, 2018

jreback left a comment

jreback Jul 24, 2018

codecov bot commented Jul 25, 2018 (edited)

topper-123 commented Jul 25, 2018

jreback Jul 26, 2018

topper-123 Jul 26, 2018

pep8speaks commented Aug 1, 2018 (edited)

topper-123 commented Aug 1, 2018 (edited)

topper-123 Aug 1, 2018

topper-123 Aug 2, 2018 (edited)

topper-123 commented Aug 11, 2018

jreback commented Aug 11, 2018

topper-123 commented Feb 3, 2019

jreback Feb 23, 2019

jreback Feb 23, 2019

jreback Feb 23, 2019

topper-123 Feb 23, 2019

jreback Feb 23, 2019

topper-123 commented Feb 23, 2019

jreback commented Feb 24, 2019

pep8speaks: comment last updated on February 23, 2019 at 20:27 Hours UTC
