New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Improve performance of CategoricalIndex.is_unique #21107

Merged
merged 1 commit into from Jun 4, 2018

Conversation

Projects
None yet
4 participants
@topper-123
Contributor

topper-123 commented May 17, 2018

CategoricalIndex.is_unique creates an extraneous boolean array. By changing CategoricalIndex.is_unique to use CategoricalIndex._engine.is_unique instead, this array creation is avoided. We simultaneously get to set is_monotonic* for free, and therefore will save time, if that property is called later.

Demonstration

Setup:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))

Currently, ci.is_unique is about the same (disregarding@readonly_cache) as:

>>> from pandas._libs.hashtable import duplicated_int64
>>> not duplicated_int64(ci.codes.astype('int64')).any()
False
>>> %timeit duplicated_int64(ci.codes.astype('int64')).any()
46.7 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Notice that the duplicated_int64() creates an boolean array, which is not needed and slows the operation down.

If we instead use ci._engine.is_unique to check for uniqueness, the check is roughly similar to:

>>> from pandas._libs.algos import is_monotonic_int64
>>> is_monotonic_int64(ci.codes.astype('int64'), False)
(True, False, False)  # (is_monotonic_inc, is_monotonic_dec, is_unique)
>>> %timeit is_monotonic_int64(ci.codes.astype('int64'), False)
23.3 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is faster than the other version, as the intermediate boolean array is not created in this version. Also, is it (IMO) more idiomatic, as index._engine is in general supposed to be used for this kind of index content checks.

@codecov

This comment has been minimized.

codecov bot commented May 17, 2018

Codecov Report

Merging #21107 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21107   +/-   ##
=======================================
  Coverage   91.83%   91.83%           
=======================================
  Files         153      153           
  Lines       49498    49498           
=======================================
  Hits        45458    45458           
  Misses       4040     4040
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.88% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.03% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e033c06...06aef54. Read the comment docs.

@@ -581,6 +581,16 @@ def test_is_monotonic(self, data, non_lexsorted_data):
assert c.is_monotonic_increasing
assert not c.is_monotonic_decreasing
@pytest.mark.parametrize('unique, not_unique',

This comment has been minimized.

@jschendel

jschendel May 17, 2018

Member

Might be slightly cleaner to do the parametrize like:

@pytest.mark.parametrize('values, expected', [([1, 2, 3], True), ...])

Then just have one assert in the test:

assert ci.is_unique is expected

This comment has been minimized.

@topper-123

topper-123 May 18, 2018

Contributor

Yeah, that is cleaner. I've changed it.

(list('aba'), False)])
def test_is_unique(self, values, expected):
ci = CategoricalIndex(values)
assert ci.is_unique == expected

This comment has been minimized.

@jschendel

jschendel May 18, 2018

Member

Can you use is instead of ==? A little bit of paranoia on my end, but it'd guard against the unlikely event that is_unique mistakenly starts returning equivalent non-booleans (e.g. 0/1).

This comment has been minimized.

@topper-123

topper-123 May 18, 2018

Contributor

Ok, I've changed that.

@jreback jreback added this to the 0.23.1 milestone May 19, 2018

@jreback

is there an asv for this? if not can you add (ideally for all index types)

@topper-123

This comment has been minimized.

Contributor

topper-123 commented May 19, 2018

is_unique is a cached property, so I don't think an ASV makes sense in this case. (I assume setup in ASV is called once for each time method, but not once for every call to the time method, haven't checked).

So this PR improves first-time calls to is_unique and also makes calling is_unique and is_monotonic* in succesion faster, as both are set simultaneously. So e.g. ci.is_monotonic_increasing and ci.is_unique will be much faster (on the first call, on subsequent calls both are cached).

@jreback

This comment has been minimized.

Contributor

jreback commented May 19, 2018

how does this diminish the value of an asv?

sure if the subsequent calls are cached the mean time is lower but this should still show an asv improvement yes?

@topper-123

This comment has been minimized.

Contributor

topper-123 commented May 20, 2018

My impression has been that these neasurement are done in a “best of 3” style (like %timeit). For such tests the uncached version always loses.

Haven’t looked into how asv does it, so I could be mistaken here.

@topper-123

This comment has been minimized.

Contributor

topper-123 commented May 20, 2018

It seems to me that ASV does "best of X" style of testing. from the docs:

The timing itself is based on the Python standard library’s timeit module, with some extensions for automatic heuristics shamelessly stolen from IPython’s %timeit magic function

I could add the timing tests, but I assume these improvement is not seen well in repeated calls to is_unique. You want them added still?

@jreback

This comment has been minimized.

Contributor

jreback commented May 21, 2018

we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)

(pandas) bash-3.2$ grep -r is_unique asv_bench/benchmarks/
asv_bench/benchmarks//algorithms.py:        # cache is_unique
asv_bench/benchmarks//algorithms.py:        self.idx_int_dup.is_unique
@topper-123

This comment has been minimized.

Contributor

topper-123 commented Jun 4, 2018

we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)

(pandas) bash-3.2$ grep -r is_unique asv_bench/benchmarks/
asv_bench/benchmarks//algorithms.py: # cache is_unique
asv_bench/benchmarks//algorithms.py: self.idx_int_dup.is_unique

Those two are in a setup method, so timing are not done on them:

class DuplicatedUniqueIndex(object):

    goal_time = 0.2

    def setup(self):
        N = 10**5
        self.idx_int_dup = pd.Int64Index(np.arange(N * 5))
        # cache is_unique
        self.idx_int_dup.is_unique

     def time_duplicated_unique_int(self):

So currently there is no tests for is_unique in asv_bench, and it probably doesn't make sense IMO, as is_unique is a cached property and timing it will just return the cached value, and not time the speed of the algorithm.

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 4, 2018

hmm ok. can you open a new issue to create asv's in some form or another for this (and is_unique in general). w/o asv's someone will ineveitable change something which breaks this.

maybe should just temporary turn cached_properties off with a context manager when running them.

@jreback

jreback approved these changes Jun 4, 2018

@jreback jreback merged commit 9f95f7d into pandas-dev:master Jun 4, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@topper-123

This comment has been minimized.

Contributor

topper-123 commented Jun 4, 2018

maybe should just temporary turn cached_properties off with a context manager when running them.

Is there currently a context manager or option for this in Pandas?

@jreback

This comment has been minimized.

Contributor

jreback commented Jun 4, 2018

no i think would need to create one for purposes of measuring cached properties

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018

TomAugspurger added a commit that referenced this pull request Jun 12, 2018

david-liu-brattle-1 added a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

@topper-123 topper-123 deleted the topper-123:categorical_index_is_unique branch Oct 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment