Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659

Merged
merged 2 commits into from
Jun 29, 2018

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jun 27, 2018

Currently, a lot of indexing/slicing ops on CategoricalIndexes goes through CategoricalIndex._create_categorical, which can be a slow operation, because calling data._set_dtype there is slow. This PR improves performance by avoiding calling data._set_dtype as often, specifically when the new and the old dtypes are equal.

An complicating issue was that comparisons of CategoricalDtype was quite slow, so the improvements I saw in #20395 were offset by slowdowns in other places. To avoid this, CategoricalDtype.__equal__ needed to become smarter and pandas has to pass round existing dtypes rather than only categories and ordered. This is ok, as CategoricalDtype is immutable. This minimizes the need to call the expensive CategoricalDtype.__hash__ method in CategoricalDtype.__equal__ and makes comparisons much faster.

Some notable results

First setup:

>>> n = 100_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n))
>>> ci = pd.CategoricalIndex(c)
>>> df = pd.DataFrame({'A': range(n * 3)}, index=ci)
>>> sl = slice(n, n * 2)

Results:

>>> %timeit c[sl]
13.9 µs  # master
4.43 µs  # this PR
>>> %timeit ci[sl]
740 µs  # master
12.7 µs  # this PR
>>> %timeit df.iloc[sl]
855 µs  # master
72.2 µs  # this PR
>>> %timeit df.loc['b']
3.23 ms  # master
1.62 ms  # this PR

Benchmarks

benchmarks/indexing.py:

      before           after         ratio
     [36422a88]       [e0c62df0]
+        53.4±0μs         61.1±0μs     1.14  indexing.NumericSeriesIndexing.time_iloc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>)
-         477±4ms          414±4ms     0.87  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_incr')
-        476±40ns         381±20ns     0.80  indexing.MethodLookup.time_lookup_iloc
-      1.23±0.2ms          367±0μs     0.30  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_decr')
-      1.29±0.2ms          344±5μs     0.27  indexing.CategoricalIndexIndexing.time_getitem_bool_array('monotonic_incr')
-         115±8μs         19.5±2μs     0.17  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_decr')
-         122±2μs         19.5±0μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('non_monotonic')
-         125±8μs         19.8±2μs     0.16  indexing.CategoricalIndexIndexing.time_getitem_list_like('monotonic_incr')
-         121±2μs         13.4±0μs     0.11  indexing.CategoricalIndexIndexing.time_getitem_slice('non_monotonic')
-         129±2μs         12.5±0μs     0.10  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_incr')
-        195±20μs       13.4±0.2μs     0.07  indexing.CategoricalIndexIndexing.time_getitem_slice('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

benchmarks/categoricals.py:

      before           after         ratio
     [a620e725]       [04667f67]
+        10.4±0ms       11.7±0.5ms     1.12  categoricals.Concat.time_union
+          3.42μs           3.76μs     1.10  categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')
-          3.66μs           3.20μs     0.87  categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')
-      20.8±0.7ms         13.9±0ms     0.67  categoricals.ValueCounts.time_value_counts(False)
-        23.4±0ms         12.2±0ms     0.52  categoricals.ValueCounts.time_value_counts(True)
-          13.3μs           5.86μs     0.44  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_decr')
-          13.3μs           4.88μs     0.37  categoricals.CategoricalSlicing.time_getitem_slice('non_monotonic')
-          17.1μs           6.12μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_incr')
-          17.1μs           6.11μs     0.36  categoricals.CategoricalSlicing.time_getitem_list_like('non_monotonic')
-          15.5μs           4.39μs     0.28  categoricals.CategoricalSlicing.time_getitem_slice('monotonic_incr')
-          22.0μs           6.11μs     0.28  categoricals.CategoricalSlicing.time_getitem_list_like('monotonic_decr')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I haven't run the whole test suite, as that takes a long time (4-5 hours?) for me. Would appreciate input first and, if I need to run the whole suite, if there is a smarter way to do it.

@@ -169,7 +169,7 @@ def _create_categorical(cls, data, categories=None, ordered=None,
data = data.set_categories(categories, ordered=ordered)
elif ordered is not None and ordered != data.ordered:
data = data.set_ordered(ordered)
if isinstance(dtype, CategoricalDtype):
if isinstance(dtype, CategoricalDtype) and dtype != data.dtype:
Copy link
Contributor Author

@topper-123 topper-123 Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be noted that in master this dtype comparison is quite slow:

>>> d = pd.api.types.CategoricalDtype(categories=['a', 'b', 'c'])
>>> %timeit d == d
149 µs  # master
524 ns  # this PR

This is the reason for the focus on improving dtype comparisons also in this PR, so this check doesn't cause slowdowns on other parts of pandas. Otherwise the performance benefits of this PR would be ambivalent, causing some slowdowns also).

@topper-123
Copy link
Contributor Author

The CircleCi error seems unrelated : AssertionError: Caused unexpected warning(s): ['ResourceWarning']. in pandas/tests/io/parser/dialect.py, which isn't part of this PR.

@gfyoung gfyoung added Experimental Categorical Categorical Data Type Performance Memory or execution speed performance and removed Experimental labels Jun 27, 2018
@codecov
Copy link

codecov bot commented Jun 28, 2018

Codecov Report

Merging #21659 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21659      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         154      154              
  Lines       49555    49559       +4     
==========================================
+ Hits        45542    45546       +4     
  Misses       4013     4013
Flag Coverage Δ
#multiple 90.27% <100%> (ø) ⬆️
#single 42.02% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.63% <ø> (ø) ⬆️
pandas/core/indexes/category.py 97.01% <100%> (ø) ⬆️
pandas/core/dtypes/dtypes.py 95.95% <100%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a620e72...ff2de3f. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add to the perf issue for 0.24.0 (If we only have them on 0.23.2, then pls add a new one)

"""
if isinstance(other, compat.string_types):
return other == self.name

if other is self:
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob doesn't make any difference, but this could be elif (existing if)

@jreback jreback added this to the 0.24.0 milestone Jun 28, 2018
@topper-123
Copy link
Contributor Author

Comments accounted for.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable at a quick glance.

@jreback jreback merged commit c45bb0b into pandas-dev:master Jun 29, 2018
@jreback
Copy link
Contributor

jreback commented Jun 29, 2018

thanks @topper-123

@topper-123 topper-123 deleted the avoid-categorical-recoding branch July 24, 2018 23:50
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants