Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

merged 2 commits into from Jun 29, 2018


None yet
4 participants
Copy link

commented Jun 27, 2018

  • progress towards #20395
  • ASVs added
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Currently, a lot of indexing/slicing ops on CategoricalIndexes goes through CategoricalIndex._create_categorical, which can be a slow operation, because calling data._set_dtype there is slow. This PR improves performance by avoiding calling data._set_dtype as often, specifically when the new and the old dtypes are equal.

An complicating issue was that comparisons of CategoricalDtype was quite slow, so the improvements I saw in #20395 were offset by slowdowns in other places. To avoid this, CategoricalDtype.__equal__ needed to become smarter and pandas has to pass round existing dtypes rather than only categories and ordered. This is ok, as CategoricalDtype is immutable. This minimizes the need to call the expensive CategoricalDtype.__hash__ method in CategoricalDtype.__equal__ and makes comparisons much faster.

Some notable results

First setup:

>>> n = 100_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n))
>>> ci = pd.CategoricalIndex(c)
>>> df = pd.DataFrame({'A': range(n * 3)}, index=ci)
>>> sl = slice(n, n * 2)


>>> %timeit c[sl]
13.9 µs  # master
4.43 µs  # this PR
>>> %timeit ci[sl]
740 µs  # master
12.7 µs  # this PR
>>> %timeit df.iloc[sl]
855 µs  # master
72.2 µs  # this PR
>>> %timeit df.loc['b']
3.23 ms  # master
1.62 ms  # this PR



      before           after         ratio
     [36422a88]       [e0c62df0]
+        53.4±0μs         61.1±0μs     1.14  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-         477±4ms          414±4ms     0.87  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-        476±40ns         381±20ns     0.80  indexing.MethodLookup.time_lookup_iloc
-      1.23±0.2ms          367±0μs     0.30  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
-      1.29±0.2ms          344±5μs     0.27  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_incr')
-         115±8μs         19.5±2μs     0.17  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_decr')
-         122±2μs         19.5±0μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('non_monotonic')
-         125±8μs         19.8±2μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_incr')
-         121±2μs         13.4±0μs     0.11  indexing.CategoricalIndexIndexing.time_getitem_slice('non_monotonic')
-         129±2μs         12.5±0μs     0.10  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_incr')
-        195±20μs       13.4±0.2μs     0.07  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_decr')



      before           after         ratio
     [a620e725]       [04667f67]
+        10.4±0ms       11.7±0.5ms     1.12  categoricals.Concat.time_union
+          3.42μs           3.76μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')
-          3.66μs           3.20μs     0.87  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')
-      20.8±0.7ms         13.9±0ms     0.67  categoricals.ValueCounts.time_value_counts(False)
-        23.4±0ms         12.2±0ms     0.52  categoricals.ValueCounts.time_value_counts(True)
-          13.3μs           5.86μs     0.44  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-          13.3μs           4.88μs     0.37  categoricals.CategoricalSlicing.time_getitem_slice('non_monotonic')
-          17.1μs           6.12μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-          17.1μs           6.11μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
-          15.5μs           4.39μs     0.28  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
-          22.0μs           6.11μs     0.28  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')


I haven't run the whole test suite, as that takes a long time (4-5 hours?) for me. Would appreciate input first and, if I need to run the whole suite, if there is a smarter way to do it.

@@ -169,7 +169,7 @@ def _create_categorical(cls, data, categories=None, ordered=None,
data = data.set_categories(categories, ordered=ordered)
elif ordered is not None and ordered != data.ordered:
data = data.set_ordered(ordered)
if isinstance(dtype, CategoricalDtype):
if isinstance(dtype, CategoricalDtype) and dtype != data.dtype:

This comment has been minimized.

Copy link

topper-123 Jun 27, 2018

Author Contributor

It should be noted that in master this dtype comparison is quite slow:

>>> d = pd.api.types.CategoricalDtype(categories=['a', 'b', 'c'])
>>> %timeit d == d
149 µs  # master
524 ns  # this PR

This is the reason for the focus on improving dtype comparisons also in this PR, so this check doesn't cause slowdowns on other parts of pandas. Otherwise the performance benefits of this PR would be ambivalent, causing some slowdowns also).


This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2018

The CircleCi error seems unrelated : AssertionError: Caused unexpected warning(s): ['ResourceWarning']. in pandas/tests/io/parser/, which isn't part of this PR.


This comment has been minimized.

Copy link

commented Jun 28, 2018

Codecov Report

Merging #21659 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21659      +/-   ##
+ Coverage    91.9%    91.9%   +<.01%     
  Files         154      154              
  Lines       49555    49559       +4     
+ Hits        45542    45546       +4     
  Misses       4013     4013
Flag Coverage Δ
#multiple 90.27% <100%> (ø) ⬆️
#single 42.02% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/ 95.63% <ø> (ø) ⬆️
pandas/core/indexes/ 97.01% <100%> (ø) ⬆️
pandas/core/dtypes/ 95.95% <100%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a620e72...ff2de3f. Read the comment docs.

Copy link

left a comment

can you add to the perf issue for 0.24.0 (If we only have them on 0.23.2, then pls add a new one)

if isinstance(other, compat.string_types):
return other ==

if other is self:
return True

This comment has been minimized.

Copy link

jreback Jun 28, 2018


prob doesn't make any difference, but this could be elif (existing if)

@jreback jreback added this to the 0.24.0 milestone Jun 28, 2018


This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2018

Comments accounted for.

Copy link

left a comment

Looks reasonable at a quick glance.

@jreback jreback merged commit c45bb0b into pandas-dev:master Jun 29, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

This comment has been minimized.

Copy link

commented Jun 29, 2018

thanks @topper-123

@topper-123 topper-123 deleted the topper-123:avoid-categorical-recoding branch Jul 24, 2018

alimcmaster1 added a commit to alimcmaster1/pandas that referenced this pull request Aug 12, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.