PERF: casting loc to labels dtype before searchsorted #14551

Merged
merged 3 commits into from Nov 2, 2016

3 participants
@jorisvandenbossche
Member

jorisvandenbossche commented Nov 1, 2016

I was intrigued by the profiling results of the example below (MultiIndex loc indexing, based on the example in #14549), where searchsorted seemed to take the majority of the computation time.
It seems that searchsorted casts both inputs (here labels and loc) to a common dtype. In this case the labels of the MultiIndex were int16, while loc (the output of Index.get_loc) is a Python int.

By casting loc to the dtype of labels, this specific example gets roughly a 20x speedup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(500*5000)}, index=pd.MultiIndex.from_product([pd.date_range("2014-01-01", periods=500), range(5000)]))
dt = pd.Timestamp('2015-01-01')
%timeit df.loc[dt]

On master:

In [3]: %timeit df.loc[dt]
The slowest run took 5.70 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 9.39 ms per loop

with this PR:

In [3]: %timeit df.loc[dt]
The slowest run took 122.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 422 µs per loop

I am just putting this here as a showcase.
The actual PR probably needs some more work: other places where this can be done, whether loc can ever be out of bounds for that dtype, benchmarks, etc.
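
The idea described above can be tried outside pandas with plain NumPy. The following is only a minimal sketch: the helper name `label_slice` and the synthetic `labels` array are made up for illustration and are not the code in pandas/indexes/multi.py. It casts a Python int to the dtype of a sorted int16 array before the two searchsorted calls. Whether the cast changes the timing depends on the NumPy version, so no timings are claimed here.

```python
import numpy as np

# Sorted int16 labels, shaped like the first level of the example MultiIndex:
# 500 distinct codes, each repeated 5000 times (an assumption for illustration).
labels = np.repeat(np.arange(500, dtype=np.int16), 5000)


def label_slice(labels, loc):
    """Return the slice of sorted `labels` whose entries equal `loc`."""
    # Cast loc to the labels dtype so both searchsorted inputs share a dtype.
    loc = labels.dtype.type(loc)
    start = labels.searchsorted(loc, side="left")
    stop = labels.searchsorted(loc, side="right")
    return slice(int(start), int(stop))


result = label_slice(labels, 365)
```

Timing `labels.searchsorted(365)` against `labels.searchsorted(np.int16(365))` on the same array is one way to check whether the mismatch matters in a given environment.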

pandas/indexes/multi.py
@@ -1907,6 +1907,7 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels):
                 return np.array(labels == loc, dtype=bool)
             else:
                 # sorted, so can return slice object -> view
+                loc = labels.dtype.type(loc)

@jreback

jreback Nov 1, 2016

Contributor

maybe should do this like

loc, orig_loc = labels.dtype.type(loc), loc
if loc != orig_loc:
    loc = orig_loc
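
A self-contained version of this round-trip check might look like the sketch below. The function name `cast_if_lossless` is hypothetical. The sketch assumes that, depending on the NumPy version, an out-of-range Python int either raises OverflowError or wraps around (possibly with a deprecation warning), so it handles both cases and falls back to the original value.

```python
import warnings

import numpy as np


def cast_if_lossless(loc, dtype):
    """Cast `loc` to `dtype`, returning the original `loc` if the cast is lossy."""
    try:
        with warnings.catch_warnings():
            # Some NumPy versions only warn when an out-of-range int wraps.
            warnings.simplefilter("ignore")
            cast = dtype.type(loc)
    except (OverflowError, TypeError):
        # Out of range for the dtype, or not a scalar at all (e.g. a slice).
        return loc
    # A wrapped value no longer compares equal to the original.
    return cast if cast == loc else loc
```

With this fallback, a value that does not fit the dtype is handed to searchsorted unchanged, which keeps the behaviour from before the cast was introduced.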

@jorisvandenbossche

jorisvandenbossche Nov 1, 2016

Member

Yeah, I am not sure how much checking is needed here.

My understanding is that it is probably not needed in this case, since loc comes from loc = level_index.get_loc(key), and labels and level_index belong to the same MultiIndex. So I would assume that get_loc can only return an existing label, which should therefore fit in the dtype of labels?

(But doing the check is probably not much of a performance issue either.)

@jreback

jreback Nov 1, 2016

Contributor

that sounds right; if the test suite passes, prob ok!

@jorisvandenbossche

jorisvandenbossche Nov 1, 2016

Member

Apparently not :-)
But only a partial indexing test fails. loc can also be a slice, and in partial indexing searchsorted is expected to raise an error in that case. Now dtype.type(loc) already raises a slightly different error before searchsorted is reached. That can easily be solved with:

                try:
                    loc = labels.dtype.type(loc)
                except TypeError:
                    # this occurs when loc is a slice (partial string indexing)
                    # but the TypeError raised by searchsorted in this case
                    # is caught in Index._has_valid_type()
                    pass

@jreback

jreback Nov 1, 2016

Contributor

sure, or just test whether it's a scalar to begin with
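
As a rough sketch of that alternative, the cast can be guarded by an explicit type check instead of a try/except. The helper name `cast_scalar_loc` is made up for illustration, and treating "scalar" as "Python or NumPy integer" is an assumption; the thread does not say which check was meant.

```python
import numpy as np


def cast_scalar_loc(loc, labels):
    """Cast integer scalars to the dtype of `labels`; pass anything else through."""
    if isinstance(loc, (int, np.integer)):
        return labels.dtype.type(loc)
    # Slices (partial string indexing) reach searchsorted unchanged, so the
    # error raised there stays the same as before the cast was added.
    return loc
```

Compared with catching TypeError, the explicit check never attempts a cast on a slice, so no exception is raised and then suppressed on that path.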

@codecov-io

codecov-io Nov 1, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14551 into master will increase coverage by <.01%

@@             master     #14551   diff @@
==========================================
  Files           140        140          
  Lines         50672      50676     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43207      43211     +4   
  Misses         7465       7465          
  Partials          0          0          

Powered by Codecov. Last update 60a335e...6447e4c

@jreback jreback added this to the 0.19.1 milestone Nov 1, 2016

@jreback

jreback Nov 1, 2016

Contributor

lgtm. just release note. I think this can only help :>

@jorisvandenbossche

jorisvandenbossche Nov 2, 2016

Member

OK, added whatsnew notice. Will merge, but will open a new issue for follow-up on this (other cases, benchmarks that capture this (at the moment there are none)).

@jorisvandenbossche jorisvandenbossche merged commit 1d95179 into pandas-dev:master Nov 2, 2016

jorisvandenbossche added a commit that referenced this pull request Nov 3, 2016

yarikoptic added a commit to neurodebian/pandas that referenced this pull request Nov 18, 2016

Merge tag 'v0.19.1' into debian
Version 0.19.1

* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (#14573)
  [Backport #14545] BUG/API: Index.append with mixed object/Categorical indices (#14545)
  DOC: rst fixes
  [Backport #14567] DEPR: add deprecation warning for com.array_equivalent (#14567)
  [Backport #14551] PERF: casting loc to labels dtype before searchsorted (#14551)
  [Backport #14536] BUG: DataFrame.quantile with NaNs (GH14357) (#14536)
  [Backport #14520] BUG: don't close user-provided file handles in C parser (GH14418) (#14520)
  [Backport #14392] BUG: Dataframe constructor when given dict with None value (#14392)
  [Backport #14514] BUG: Don't parse inline quotes in skipped lines (#14514)
  [Backport #14543] BUG: tseries ceil doc fix (#14543)
  [Backport #14541] DOC: Simplify the gbq integration testing procedure for contributors (#14541)
  [Backport #14527] BUG/ERR: raise correct error when sql driver is not installed (#14527)
  [Backport #14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (#14501)
  [Backport #14442] DOC: Expand on reference docs for read_json() (#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport #14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport #14496] BLD: Support Cython 0.25
  [Backport #14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport #14476] PERF: performance regression in Series.asof (#14476)
  ...

amolkahat added a commit to amolkahat/pandas that referenced this pull request Nov 26, 2016
