Loc enhancements #22826

rtlee9 · 2018-09-25T01:15:56Z

closes loc very slow on sorted, non-unique index with list of labels ar argument #9466
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Improves performance of IndexEngine.get_indexer_non_unique by using binary search when:

the index is monotonically increassing, and
the length of the iterable loc key is sufficiently small

For now I've conservatively set the loc key size threshold to 5 items -- any keys larger than this will resort to the current full index scan. It would probably make sense to increase this threshold for larger indexes, but that might require further analysis. Any feedback appreciated.

     [ab9dbd64]       [b704c5bb]
     <master>         <loc-enhancements>
-         383±4ms          211±3ms     0.55  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
-        59.0±2ms         11.9±1ms     0.20  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      69.4±0.6ms          445±3μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      66.3±0.3ms          423±1μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      66.1±0.6ms          320±2μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.2±0.4ms          330±3μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      65.7±0.3ms          286±3μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.3±0.5ms          295±2μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Use binary search instead of re-indexing if the iterable key length is small enough

Adds benchmarks for non-unique, sorted indices in NumericSeriesIndexing and NonNumericSeriesIndexing classes

pep8speaks · 2018-09-25T01:15:58Z

Hello @rtlee9! Thanks for submitting the PR.

There are no PEP8 issues in the file asv_bench/benchmarks/indexing.py !

codecov · 2018-09-25T02:22:31Z

Codecov Report

Merging #22826 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #22826      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50827       +7     
==========================================
+ Hits        46850    46860      +10     
+ Misses       3970     3967       -3

Flag	Coverage Δ
#multiple	`90.61% <ø> (ø)`	⬆️
#single	`42.37% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/formats/format.py	`98.35% <0%> (-0.01%)`	⬇️
pandas/core/strings.py	`98.63% <0%> (ø)`	⬆️
pandas/core/indexes/multi.py	`95.45% <0%> (ø)`	⬆️
pandas/core/frame.py	`97.2% <0%> (ø)`	⬆️
pandas/core/generic.py	`96.67% <0%> (ø)`	⬆️
pandas/io/formats/style.py	`96.43% <0%> (ø)`	⬆️
pandas/core/series.py	`93.74% <0%> (ø)`	⬆️
pandas/core/nanops.py	`95.14% <0%> (+0.01%)`	⬆️
pandas/core/algorithms.py	`94.72% <0%> (+0.02%)`	⬆️
pandas/io/parsers.py	`95.6% <0%> (+0.05%)`	⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab9dbd...4ad3006. Read the comment docs.

jreback · 2018-09-25T12:04:30Z

asv_bench/benchmarks/indexing.py

+        (Int64Index, Float64Index),
+        ('unique_monotonic_inc', 'nonunique_monotonic_inc'),
+    ]
+    param_names = ['index dtype', 'index structure']


did you actually try this? I am shocked if this works when it has spaces in the name, or maybe the names are just mapped by postion. in any event these need underscores

Yeah I tried this and you can see it working in this sample output from the asv run:

[ 97.63%] ··· indexing.NumericSeriesIndexing.time_loc_list_like ok [ 97.63%] ··· ========================================== ========================= ============ index dtype index structure ------------------------------------------ ------------------------- ------------ pandas.core.indexes.numeric.Int64Index unique_monotonic_inc 379±2μs pandas.core.indexes.numeric.Int64Index nonunique_monotonic_inc 69.3±0.5ms pandas.core.indexes.numeric.Float64Index unique_monotonic_inc 170±1ms pandas.core.indexes.numeric.Float64Index nonunique_monotonic_inc 65.7±0.3ms ========================================== ========================= ============

I've added underscores in commit 4ad3006.

jreback · 2018-09-25T12:05:29Z

asv_bench/benchmarks/indexing.py

+        ('string', 'datetime'),
+        ('unique_monotonic_inc', 'nonunique_monotonic_inc'),
+    ]
+    param_names = ['index dtype', 'index structure']


added underscores in commit 4ad3006

jreback · 2018-09-25T12:06:26Z

pandas/_libs/index.pyx

-                if val not in d:
-                    d[val] = []
-                d[val].append(i)
+        # map each starget to its position in the index


what if you drop the len(stargets) < 5 and just use it if its monotonic_increasing? does the small case actually make any difference here?

If you drop the len(stargets) < 5, then we'd be running a binary search against the index for each item in a potentially large set of targets -- runtime should be O(m log n) where m is the number of items in the set and n is the length of the index. Presumably, this would be slower when m is large enough compared to the current behavior which is to run through each item in the index and check if it is in the set of targets, which should be O(n) assuming constant time checks for whether an item is in the set of targets. Thoughts?

can you add a case where this is true in the asv's and compare?

yes, in each of the following asv's stargets is either length one or length two

[ab9dbd64] [b704c5bb] <master> <loc-enhancements> - 383±4ms 211±3ms 0.55 indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc') - 59.0±2ms 11.9±1ms 0.20 indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr') - 69.4±0.6ms 445±3μs 0.01 indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc') - 66.3±0.3ms 423±1μs 0.01 indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc') - 66.1±0.6ms 320±2μs 0.00 indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc') - 69.2±0.4ms 330±3μs 0.00 indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc') - 65.7±0.3ms 286±3μs 0.00 indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc') - 69.3±0.5ms 295±2μs 0.00 indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc') SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

jreback · 2018-09-25T12:11:58Z

I think this would close #15364 as well (add in the whatsnew and at the top), and pls add that example if its needed

might close this #19464

rtlee9 · 2018-09-26T20:43:51Z

Thanks for the feedback. This PR only improves performance if the index is sorted, so I'll have to take a look at those issues in more detail later tonight or tomorrow.

M00NSH0T · 2018-09-28T19:38:11Z

would this also increase the speed of multi-index loc calls (on sorted multi-index dfs)?

rtlee9 · 2018-09-30T21:28:13Z

@jreback here's an example:

import pandas as pd
import numpy as np

# create a dataframe of length 10^7 with a non-unique, monotonically increasing index
N = 10 ** 7
repeat_loc = 362
df = pd.DataFrame(np.random.random((N, 1)))
df.index = list(range(N))
df.index = df.index.insert(item=df.index[repeat_loc], loc=repeat_loc)[:-1]

# benchmark performance
%timeit df.loc[repeat_loc]  # 225 µs
%timeit df.loc[[repeat_loc]]  # 633 ms using pandas 0.23.4, 354 µs using this commit

rtlee9 · 2018-09-30T21:36:20Z

My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.

jreback · 2018-10-01T12:09:20Z

thanks @rtlee9 if you would like to have a look at those referenced issues to see if they are closable now would be great.

rtlee9 · 2018-10-02T03:40:30Z

No problem. I took another look at those issues and I don't think they're closable -- here's why:

My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.

Please let me know if you think otherwise though.

rtlee9 added 2 commits September 24, 2018 17:25

PERF: improve get_indexer_non_unique on sorted, non-unique indexes

16177af

Use binary search instead of re-indexing if the iterable key length is small enough

PERF: parameterize index structure for asv indexing benchmarks

8b704c5

Adds benchmarks for non-unique, sorted indices in NumericSeriesIndexing and NonNumericSeriesIndexing classes

jreback requested changes Sep 25, 2018

View reviewed changes

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Sep 25, 2018

CLN: use underscore in asv param names

4ad3006

jreback added this to the 0.24.0 milestone Oct 1, 2018

jreback approved these changes Oct 1, 2018

View reviewed changes

jreback merged commit b92b043 into pandas-dev:master Oct 1, 2018

rtlee9 deleted the loc-enhancements branch October 2, 2018 03:31

This was referenced Aug 19, 2023

PERF: Unreliable performance of .loc with big non-unique index dataframe #54550

Closed

PERF: Increase threashold for using binary search in IndexEngine #54746

Merged

Uh oh!

Loc enhancements #22826

Loc enhancements #22826

Uh oh!

Conversation

rtlee9 commented Sep 25, 2018

Uh oh!

pep8speaks commented Sep 25, 2018

Uh oh!

codecov bot commented Sep 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Sep 25, 2018

Uh oh!

rtlee9 commented Sep 26, 2018

Uh oh!

M00NSH0T commented Sep 28, 2018

Uh oh!

rtlee9 commented Sep 30, 2018

Uh oh!

rtlee9 commented Sep 30, 2018

Uh oh!

jreback commented Oct 1, 2018

Uh oh!

rtlee9 commented Oct 2, 2018

Uh oh!

Uh oh!

codecov bot commented Sep 25, 2018 •

edited

Loading