New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loc enhancements #22826

Merged
merged 3 commits into from Oct 1, 2018

Conversation

Projects
None yet
4 participants
@rtlee9
Copy link
Contributor

rtlee9 commented Sep 25, 2018

  • closes #9466
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Improves performance of IndexEngine.get_indexer_non_unique by using binary search when:

  • the index is monotonically increassing, and
  • the length of the iterable loc key is sufficiently small

For now I've conservatively set the loc key size threshold to 5 items -- any keys larger than this will resort to the current full index scan. It would probably make sense to increase this threshold for larger indexes, but that might require further analysis. Any feedback appreciated.

     [ab9dbd64]       [b704c5bb]
     <master>         <loc-enhancements>
-         383±4ms          211±3ms     0.55  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
-        59.0±2ms         11.9±1ms     0.20  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      69.4±0.6ms          445±3μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      66.3±0.3ms          423±1μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      66.1±0.6ms          320±2μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.2±0.4ms          330±3μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      65.7±0.3ms          286±3μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.3±0.5ms          295±2μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

rtlee9 added some commits Sep 5, 2018

PERF: improve get_indexer_non_unique on sorted, non-unique indexes
Use binary search instead of re-indexing if the iterable key length
is small enough
PERF: parameterize index structure for asv indexing benchmarks
Adds benchmarks for non-unique, sorted indices in NumericSeriesIndexing
and NonNumericSeriesIndexing classes
@pep8speaks

This comment has been minimized.

Copy link

pep8speaks commented Sep 25, 2018

Hello @rtlee9! Thanks for submitting the PR.

@codecov

This comment has been minimized.

Copy link

codecov bot commented Sep 25, 2018

Codecov Report

Merging #22826 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22826      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50827       +7     
==========================================
+ Hits        46850    46860      +10     
+ Misses       3970     3967       -3
Flag Coverage Δ
#multiple 90.61% <ø> (ø) ⬆️
#single 42.37% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/format.py 98.35% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.63% <0%> (ø) ⬆️
pandas/core/indexes/multi.py 95.45% <0%> (ø) ⬆️
pandas/core/frame.py 97.2% <0%> (ø) ⬆️
pandas/core/generic.py 96.67% <0%> (ø) ⬆️
pandas/io/formats/style.py 96.43% <0%> (ø) ⬆️
pandas/core/series.py 93.74% <0%> (ø) ⬆️
pandas/core/nanops.py 95.14% <0%> (+0.01%) ⬆️
pandas/core/algorithms.py 94.72% <0%> (+0.02%) ⬆️
pandas/io/parsers.py 95.6% <0%> (+0.05%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab9dbd...4ad3006. Read the comment docs.

(Int64Index, Float64Index),
('unique_monotonic_inc', 'nonunique_monotonic_inc'),
]
param_names = ['index dtype', 'index structure']

This comment has been minimized.

@jreback

jreback Sep 25, 2018

Contributor

did you actually try this? I am shocked if this works when it has spaces in the name, or maybe the names are just mapped by postion. in any event these need underscores

This comment has been minimized.

@rtlee9

rtlee9 Sep 26, 2018

Contributor

Yeah I tried this and you can see it working in this sample output from the asv run:

[ 97.63%] ··· indexing.NumericSeriesIndexing.time_loc_list_like                                                     ok
[ 97.63%] ··· ========================================== ========================= ============
                             index dtype                      index structure                  
              ------------------------------------------ ------------------------- ------------
                pandas.core.indexes.numeric.Int64Index      unique_monotonic_inc     379±2μs   
                pandas.core.indexes.numeric.Int64Index    nonunique_monotonic_inc   69.3±0.5ms 
               pandas.core.indexes.numeric.Float64Index     unique_monotonic_inc     170±1ms   
               pandas.core.indexes.numeric.Float64Index   nonunique_monotonic_inc   65.7±0.3ms 
              ========================================== ========================= ============

I've added underscores in commit 4ad3006.

('string', 'datetime'),
('unique_monotonic_inc', 'nonunique_monotonic_inc'),
]
param_names = ['index dtype', 'index structure']

This comment has been minimized.

@jreback

jreback Sep 25, 2018

Contributor

same

This comment has been minimized.

@rtlee9

rtlee9 Sep 26, 2018

Contributor

added underscores in commit 4ad3006

if val not in d:
d[val] = []
d[val].append(i)
# map each starget to its position in the index

This comment has been minimized.

@jreback

jreback Sep 25, 2018

Contributor

what if you drop the len(stargets) < 5 and just use it if its monotonic_increasing? does the small case actually make any difference here?

This comment has been minimized.

@rtlee9

rtlee9 Sep 26, 2018

Contributor

If you drop the len(stargets) < 5, then we'd be running a binary search against the index for each item in a potentially large set of targets -- runtime should be O(m log n) where m is the number of items in the set and n is the length of the index. Presumably, this would be slower when m is large enough compared to the current behavior which is to run through each item in the index and check if it is in the set of targets, which should be O(n) assuming constant time checks for whether an item is in the set of targets. Thoughts?

This comment has been minimized.

@jreback

jreback Sep 28, 2018

Contributor

can you add a case where this is true in the asv's and compare?

This comment has been minimized.

@rtlee9

rtlee9 Sep 29, 2018

Contributor

yes, in each of the following asv's stargets is either length one or length two

     [ab9dbd64]       [b704c5bb]
     <master>         <loc-enhancements>
-         383±4ms          211±3ms     0.55  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')
-        59.0±2ms         11.9±1ms     0.20  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-      69.4±0.6ms          445±3μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      66.3±0.3ms          423±1μs     0.01  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      66.1±0.6ms          320±2μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.2±0.4ms          330±3μs     0.00  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
-      65.7±0.3ms          286±3μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
-      69.3±0.5ms          295±2μs     0.00  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Sep 25, 2018

I think this would close #15364 as well (add in the whatsnew and at the top), and pls add that example if its needed

might close this #19464

@rtlee9

This comment has been minimized.

Copy link
Contributor

rtlee9 commented Sep 26, 2018

Thanks for the feedback. This PR only improves performance if the index is sorted, so I'll have to take a look at those issues in more detail later tonight or tomorrow.

@M00NSH0T

This comment has been minimized.

Copy link

M00NSH0T commented Sep 28, 2018

would this also increase the speed of multi-index loc calls (on sorted multi-index dfs)?

@rtlee9

This comment has been minimized.

Copy link
Contributor

rtlee9 commented Sep 30, 2018

@jreback here's an example:

import pandas as pd
import numpy as np

# create a dataframe of length 10^7 with a non-unique, monotonically increasing index
N = 10 ** 7
repeat_loc = 362
df = pd.DataFrame(np.random.random((N, 1)))
df.index = list(range(N))
df.index = df.index.insert(item=df.index[repeat_loc], loc=repeat_loc)[:-1]

# benchmark performance
%timeit df.loc[repeat_loc]  # 225 µs
%timeit df.loc[[repeat_loc]]  # 633 ms using pandas 0.23.4, 354 µs using this commit
@rtlee9

This comment has been minimized.

Copy link
Contributor

rtlee9 commented Sep 30, 2018

My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.

@jreback jreback added this to the 0.24.0 milestone Oct 1, 2018

@jreback

jreback approved these changes Oct 1, 2018

@jreback jreback merged commit b92b043 into pandas-dev:master Oct 1, 2018

5 checks passed

ci/circleci: py27_compat Your tests passed on CircleCI!
Details
ci/circleci: py35_ascii Your tests passed on CircleCI!
Details
ci/circleci: py36_locale Your tests passed on CircleCI!
Details
ci/circleci: py36_locale_slow Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

jreback commented Oct 1, 2018

thanks @rtlee9 if you would like to have a look at those referenced issues to see if they are closable now would be great.

@rtlee9 rtlee9 deleted the rtlee9:loc-enhancements branch Oct 2, 2018

@rtlee9

This comment has been minimized.

Copy link
Contributor

rtlee9 commented Oct 2, 2018

No problem. I took another look at those issues and I don't think they're closable -- here's why:

My understanding is that issue #15364 isn't specific to sorted indices (based on this example), in which case this PR would not close that issue. Same thing for #19464 -- for this issue I confirmed by pulling down the asv benchmarks from their commits and found no difference in performance. Please let me know if I'm misunderstanding anything.

Please let me know if you think otherwise though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment