PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

topper-123 · 2018-06-27T20:09:49Z

progress towards PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
ASVs added
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Currently, a lot of indexing/slicing ops on CategoricalIndexes goes through CategoricalIndex._create_categorical, which can be a slow operation, because calling data._set_dtype there is slow. This PR improves performance by avoiding calling data._set_dtype as often, specifically when the new and the old dtypes are equal.

An complicating issue was that comparisons of CategoricalDtype was quite slow, so the improvements I saw in #20395 were offset by slowdowns in other places. To avoid this, CategoricalDtype.__equal__ needed to become smarter and pandas has to pass round existing dtypes rather than only categories and ordered. This is ok, as CategoricalDtype is immutable. This minimizes the need to call the expensive CategoricalDtype.__hash__ method in CategoricalDtype.__equal__ and makes comparisons much faster.

Some notable results

First setup:

>>> n = 100_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n))
>>> ci = pd.CategoricalIndex(c)
>>> df = pd.DataFrame({'A': range(n * 3)}, index=ci)
>>> sl = slice(n, n * 2)

Results:

>>> %timeit c[sl]
13.9 µs  # master
4.43 µs  # this PR
>>> %timeit ci[sl]
740 µs  # master
12.7 µs  # this PR
>>> %timeit df.iloc[sl]
855 µs  # master
72.2 µs  # this PR
>>> %timeit df.loc['b']
3.23 ms  # master
1.62 ms  # this PR

Benchmarks

benchmarks/indexing.py:

      before           after         ratio
     [36422a88]       [e0c62df0]
+        53.4±0μs         61.1±0μs     1.14  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-         477±4ms          414±4ms     0.87  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-        476±40ns         381±20ns     0.80  indexing.MethodLookup.time_lookup_iloc
-      1.23±0.2ms          367±0μs     0.30  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
-      1.29±0.2ms          344±5μs     0.27  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_incr')
-         115±8μs         19.5±2μs     0.17  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_decr')
-         122±2μs         19.5±0μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('non_monotonic')
-         125±8μs         19.8±2μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_incr')
-         121±2μs         13.4±0μs     0.11  indexing.CategoricalIndexIndexing.time_getitem_slice('non_monotonic')
-         129±2μs         12.5±0μs     0.10  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_incr')
-        195±20μs       13.4±0.2μs     0.07  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

benchmarks/categoricals.py:

      before           after         ratio
     [a620e725]       [04667f67]
+        10.4±0ms       11.7±0.5ms     1.12  categoricals.Concat.time_union
+          3.42μs           3.76μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')
-          3.66μs           3.20μs     0.87  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')
-      20.8±0.7ms         13.9±0ms     0.67  categoricals.ValueCounts.time_value_counts(False)
-        23.4±0ms         12.2±0ms     0.52  categoricals.ValueCounts.time_value_counts(True)
-          13.3μs           5.86μs     0.44  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-          13.3μs           4.88μs     0.37  categoricals.CategoricalSlicing.time_getitem_slice('non_monotonic')
-          17.1μs           6.12μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-          17.1μs           6.11μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
-          15.5μs           4.39μs     0.28  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
-          22.0μs           6.11μs     0.28  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I haven't run the whole test suite, as that takes a long time (4-5 hours?) for me. Would appreciate input first and, if I need to run the whole suite, if there is a smarter way to do it.

topper-123 · 2018-06-27T20:33:51Z

pandas/core/indexes/category.py

@@ -169,7 +169,7 @@ def _create_categorical(cls, data, categories=None, ordered=None,
                data = data.set_categories(categories, ordered=ordered)
            elif ordered is not None and ordered != data.ordered:
                data = data.set_ordered(ordered)
-            if isinstance(dtype, CategoricalDtype):
+            if isinstance(dtype, CategoricalDtype) and dtype != data.dtype:


It should be noted that in master this dtype comparison is quite slow:

>>> d = pd.api.types.CategoricalDtype(categories=['a', 'b', 'c']) >>> %timeit d == d 149 µs # master 524 ns # this PR

This is the reason for the focus on improving dtype comparisons also in this PR, so this check doesn't cause slowdowns on other parts of pandas. Otherwise the performance benefits of this PR would be ambivalent, causing some slowdowns also).

topper-123 · 2018-06-27T21:14:30Z

The CircleCi error seems unrelated : AssertionError: Caused unexpected warning(s): ['ResourceWarning']. in pandas/tests/io/parser/dialect.py, which isn't part of this PR.

codecov · 2018-06-28T00:04:47Z

Codecov Report

Merging #21659 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21659      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         154      154              
  Lines       49555    49559       +4     
==========================================
+ Hits        45542    45546       +4     
  Misses       4013     4013

Flag	Coverage Δ
#multiple	`90.27% <100%> (ø)`	⬆️
#single	`42.02% <100%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.63% <ø> (ø)`	⬆️
pandas/core/indexes/category.py	`97.01% <100%> (ø)`	⬆️
pandas/core/dtypes/dtypes.py	`95.95% <100%> (+0.04%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a620e72...ff2de3f. Read the comment docs.

jreback

can you add to the perf issue for 0.24.0 (If we only have them on 0.23.2, then pls add a new one)

jreback · 2018-06-28T00:38:04Z

pandas/core/dtypes/dtypes.py

        """
        if isinstance(other, compat.string_types):
            return other == self.name

+        if other is self:
+            return True


prob doesn't make any difference, but this could be elif (existing if)

topper-123 · 2018-06-28T11:02:03Z

Comments accounted for.

TomAugspurger

Looks reasonable at a quick glance.

jreback · 2018-06-29T00:36:09Z

thanks @topper-123

…ypes for greater slicing speed (pandas-dev#21659)

avoid unneeded recoding af categoricals

ed1928f

topper-123 commented Jun 27, 2018

View reviewed changes

gfyoung added Experimental Categorical Categorical Data Type Performance Memory or execution speed performance and removed Experimental labels Jun 27, 2018

jreback requested changes Jun 28, 2018

View reviewed changes

jreback added this to the 0.24.0 milestone Jun 28, 2018

adjusted for comments

ff2de3f

TomAugspurger reviewed Jun 28, 2018

View reviewed changes

jreback approved these changes Jun 29, 2018

View reviewed changes

jreback merged commit c45bb0b into pandas-dev:master Jun 29, 2018

topper-123 mentioned this pull request Jun 29, 2018

PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395

Closed

topper-123 deleted the avoid-categorical-recoding branch July 24, 2018 23:50

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDt…

62c5e51

…ypes for greater slicing speed (pandas-dev#21659)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

topper-123 commented Jun 27, 2018 •

edited

Loading

topper-123 Jun 27, 2018 •

edited by gfyoung

Loading

topper-123 commented Jun 27, 2018

codecov bot commented Jun 28, 2018 •

edited

Loading

jreback left a comment

jreback Jun 28, 2018

topper-123 commented Jun 28, 2018

TomAugspurger left a comment

jreback commented Jun 29, 2018

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

Conversation

topper-123 commented Jun 27, 2018 • edited Loading

Some notable results

Benchmarks

topper-123 Jun 27, 2018 • edited by gfyoung Loading

Choose a reason for hiding this comment

topper-123 commented Jun 27, 2018

codecov bot commented Jun 28, 2018 • edited Loading

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

jreback Jun 28, 2018

Choose a reason for hiding this comment

topper-123 commented Jun 28, 2018

TomAugspurger left a comment

Choose a reason for hiding this comment

jreback commented Jun 29, 2018

topper-123 commented Jun 27, 2018 •

edited

Loading

topper-123 Jun 27, 2018 •

edited by gfyoung

Loading

codecov bot commented Jun 28, 2018 •

edited

Loading