PERF: Improve performance of CategoricalIndex.is_unique #21107

topper-123 · 2018-05-17T19:14:55Z

CategoricalIndex.is_unique creates an extraneous boolean array. By changing CategoricalIndex.is_unique to use CategoricalIndex._engine.is_unique instead, this array creation is avoided. We simultaneously get to set is_monotonic* for free, and therefore will save time, if that property is called later.

Demonstration

Setup:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))

Currently, ci.is_unique is about the same (disregarding@readonly_cache) as:

>>> from pandas._libs.hashtable import duplicated_int64
>>> not duplicated_int64(ci.codes.astype('int64')).any()
False
>>> %timeit duplicated_int64(ci.codes.astype('int64')).any()
46.7 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Notice that the duplicated_int64() creates an boolean array, which is not needed and slows the operation down.

If we instead use ci._engine.is_unique to check for uniqueness, the check is roughly similar to:

>>> from pandas._libs.algos import is_monotonic_int64
>>> is_monotonic_int64(ci.codes.astype('int64'), False)
(True, False, False)  # (is_monotonic_inc, is_monotonic_dec, is_unique)
>>> %timeit is_monotonic_int64(ci.codes.astype('int64'), False)
23.3 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is faster than the other version, as the intermediate boolean array is not created in this version. Also, is it (IMO) more idiomatic, as index._engine is in general supposed to be used for this kind of index content checks.

codecov · 2018-05-17T20:18:19Z

Codecov Report

Merging #21107 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21107   +/-   ##
=======================================
  Coverage   91.83%   91.83%           
=======================================
  Files         153      153           
  Lines       49498    49498           
=======================================
  Hits        45458    45458           
  Misses       4040     4040

Flag	Coverage Δ
#multiple	`90.23% <100%> (ø)`	⬆️
#single	`41.88% <0%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`97.03% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e033c06...06aef54. Read the comment docs.

jschendel · 2018-05-17T22:05:42Z

pandas/tests/indexes/test_category.py

@@ -581,6 +581,16 @@ def test_is_monotonic(self, data, non_lexsorted_data):
        assert c.is_monotonic_increasing
        assert not c.is_monotonic_decreasing

+    @pytest.mark.parametrize('unique, not_unique',


Might be slightly cleaner to do the parametrize like:

@pytest.mark.parametrize('values, expected', [([1, 2, 3], True), ...])

Then just have one assert in the test:

assert ci.is_unique is expected

Yeah, that is cleaner. I've changed it.

jschendel · 2018-05-18T18:04:16Z

pandas/tests/indexes/test_category.py

+        (list('aba'), False)])
+    def test_is_unique(self, values, expected):
+        ci = CategoricalIndex(values)
+        assert ci.is_unique == expected


Can you use is instead of ==? A little bit of paranoia on my end, but it'd guard against the unlikely event that is_unique mistakenly starts returning equivalent non-booleans (e.g. 0/1).

Ok, I've changed that.

jreback

is there an asv for this? if not can you add (ideally for all index types)

topper-123 · 2018-05-19T21:43:51Z

is_unique is a cached property, so I don't think an ASV makes sense in this case. (I assume setup in ASV is called once for each time method, but not once for every call to the time method, haven't checked).

So this PR improves first-time calls to is_unique and also makes calling is_unique and is_monotonic* in succesion faster, as both are set simultaneously. So e.g. ci.is_monotonic_increasing and ci.is_unique will be much faster (on the first call, on subsequent calls both are cached).

jreback · 2018-05-19T23:36:21Z

how does this diminish the value of an asv?

sure if the subsequent calls are cached the mean time is lower but this should still show an asv improvement yes?

topper-123 · 2018-05-20T00:08:40Z

My impression has been that these neasurement are done in a “best of 3” style (like %timeit). For such tests the uncached version always loses.

Haven’t looked into how asv does it, so I could be mistaken here.

topper-123 · 2018-05-20T22:51:42Z

It seems to me that ASV does "best of X" style of testing. from the docs:

The timing itself is based on the Python standard library’s timeit module, with some extensions for automatic heuristics shamelessly stolen from IPython’s %timeit magic function

I could add the timing tests, but I assume these improvement is not seen well in repeated calls to is_unique. You want them added still?

jreback · 2018-05-21T11:02:37Z

we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)

(pandas) bash-3.2$ grep -r is_unique asv_bench/benchmarks/
asv_bench/benchmarks//algorithms.py:        # cache is_unique
asv_bench/benchmarks//algorithms.py:        self.idx_int_dup.is_unique

topper-123 · 2018-06-04T17:01:25Z

we do have asv's for others. sure if can tag so these are run once is better (not sure if that is possible)

(pandas) bash-3.2$ grep -r is_unique asv_bench/benchmarks/
asv_bench/benchmarks//algorithms.py: # cache is_unique
asv_bench/benchmarks//algorithms.py: self.idx_int_dup.is_unique

Those two are in a setup method, so timing are not done on them:

class DuplicatedUniqueIndex(object):

    goal_time = 0.2

    def setup(self):
        N = 10**5
        self.idx_int_dup = pd.Int64Index(np.arange(N * 5))
        # cache is_unique
        self.idx_int_dup.is_unique

     def time_duplicated_unique_int(self):

So currently there is no tests for is_unique in asv_bench, and it probably doesn't make sense IMO, as is_unique is a cached property and timing it will just return the cached value, and not time the speed of the algorithm.

jreback · 2018-06-04T21:42:58Z

hmm ok. can you open a new issue to create asv's in some form or another for this (and is_unique in general). w/o asv's someone will ineveitable change something which breaks this.

maybe should just temporary turn cached_properties off with a context manager when running them.

topper-123 · 2018-06-04T22:10:42Z

maybe should just temporary turn cached_properties off with a context manager when running them.

Is there currently a context manager or option for this in Pandas?

jreback · 2018-06-04T22:16:05Z

no i think would need to create one for purposes of measuring cached properties

(cherry picked from commit 9f95f7d)

topper-123 force-pushed the categorical_index_is_unique branch 2 times, most recently from 7cc1a45 to 6071248 Compare May 17, 2018 19:21

jschendel reviewed May 17, 2018

View reviewed changes

jschendel added Performance Memory or execution speed performance Categorical Categorical Data Type labels May 17, 2018

topper-123 force-pushed the categorical_index_is_unique branch from 6071248 to 0b7309e Compare May 18, 2018 05:43

jschendel reviewed May 18, 2018

View reviewed changes

Improve performance of CategoricalIndex.is_unique

06aef54

topper-123 force-pushed the categorical_index_is_unique branch from 0b7309e to 06aef54 Compare May 18, 2018 18:49

jreback added this to the 0.23.1 milestone May 19, 2018

jreback requested changes May 19, 2018

View reviewed changes

jreback approved these changes Jun 4, 2018

View reviewed changes

jreback merged commit 9f95f7d into pandas-dev:master Jun 4, 2018

TomAugspurger added the Needs Backport label Jun 12, 2018

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018

Improve performance of CategoricalIndex.is_unique (pandas-dev#21107)

21775b6

(cherry picked from commit 9f95f7d)

TomAugspurger pushed a commit that referenced this pull request Jun 12, 2018

Improve performance of CategoricalIndex.is_unique (#21107)

28ba35a

(cherry picked from commit 9f95f7d)

TomAugspurger removed the Needs Backport label Jun 12, 2018

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

Improve performance of CategoricalIndex.is_unique (pandas-dev#21107)

0e176dd

topper-123 deleted the categorical_index_is_unique branch October 27, 2018 08:16

Uh oh!

PERF: Improve performance of CategoricalIndex.is_unique #21107

PERF: Improve performance of CategoricalIndex.is_unique #21107

Uh oh!

Conversation

topper-123 commented May 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Demonstration

Uh oh!

codecov bot commented May 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jschendel May 17, 2018

Choose a reason for hiding this comment

Uh oh!

topper-123 May 18, 2018

Choose a reason for hiding this comment

Uh oh!

jschendel May 18, 2018

Choose a reason for hiding this comment

Uh oh!

topper-123 May 18, 2018

Choose a reason for hiding this comment

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

topper-123 commented May 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented May 19, 2018

Uh oh!

topper-123 commented May 20, 2018

Uh oh!

topper-123 commented May 20, 2018

Uh oh!

jreback commented May 21, 2018

Uh oh!

topper-123 commented Jun 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Jun 4, 2018

Uh oh!

topper-123 commented Jun 4, 2018

Uh oh!

jreback commented Jun 4, 2018

Uh oh!

Uh oh!

topper-123 commented May 17, 2018 •

edited

Loading

codecov bot commented May 17, 2018 •

edited

Loading

topper-123 commented May 19, 2018 •

edited

Loading

topper-123 commented Jun 4, 2018 •

edited

Loading