cache DateOffset attrs now that they are immutable #21582

jbrockmendel · 2018-06-21T23:55:20Z

TL;DR ~6x speedup in set_index for PeriodIndex-like column.

Alright! Now that DateOffset objects are immutable (#21341), we can can start caching stuff. This was pretty much the original motivation that brought me here, so I'm pretty psyched to finally make this happen.

The motivating super-slow operation is df.set_index. Profiling before/after with:

idx = pd.period_range('May 1973', freq='M', periods=10**5)
df = pd.DataFrame({"A": 1, "B": idx})
out = df.set_index("B", append=True)

Total Runtime Before: 32.708 seconds
Total Runtime After: 5.340 seconds

pstats output (truncated) before:

        1    0.000    0.000   31.903   31.903 pandas/core/frame.py:3807(set_index)
        1    0.000    0.000   31.897   31.897 pandas/core/indexes/base.py:4823(_ensure_index_from_sequences)
        1    0.000    0.000   31.896   31.896 pandas/core/indexes/multi.py:1246(from_arrays)
        1    0.000    0.000   31.896   31.896 pandas/core/arrays/categorical.py:2590(_factorize_from_iterables)
        2    0.001    0.000   31.896   15.948 pandas/core/arrays/categorical.py:2553(_factorize_from_iterable)
        2    0.000    0.000   31.895   15.948 pandas/core/arrays/categorical.py:318(__init__)
        2    0.000    0.000   31.512   15.756 pandas/util/_decorators.py:136(wrapper)
        2    0.002    0.001   31.512   15.756 pandas/core/algorithms.py:576(factorize)
  1600011    1.168    0.000   28.211    0.000 pandas/tseries/offsets.py:338(__ne__)
        4    1.820    0.455   28.042    7.010 {method 'argsort' of 'numpy.ndarray' objects}
  1600011    4.016    0.000   27.042    0.000 pandas/tseries/offsets.py:324(__eq__)
  3200022   16.987    0.000   21.856    0.000 pandas/tseries/offsets.py:291(_params)
        2    0.000    0.000    3.460    1.730 pandas/core/algorithms.py:449(_factorize_array)
        1    0.617    0.617    3.445    3.445 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}
  3200023    3.200    0.000    3.200    0.000 {sorted}
3400729/3400727    1.235    0.000    1.235    0.000 {isinstance}
   400004    0.984    0.000    1.060    0.000 pandas/tseries/offsets.py:400(freqstr)
  3200022    0.840    0.000    0.840    0.000 {method 'copy' of 'dict' objects}
  3200023    0.829    0.000    0.829    0.000 {method 'items' of 'dict' objects}

pstats output (truncated) after:

        1    0.000    0.000    4.571    4.571 pandas/core/frame.py:3807(set_index)
        1    0.000    0.000    4.561    4.561 pandas/core/indexes/base.py:4823(_ensure_index_from_sequences)
        1    0.000    0.000    4.561    4.561 pandas/core/indexes/multi.py:1246(from_arrays)
        1    0.000    0.000    4.561    4.561 pandas/core/arrays/categorical.py:2590(_factorize_from_iterables)
        2    0.001    0.000    4.561    2.280 pandas/core/arrays/categorical.py:2553(_factorize_from_iterable)
        2    0.000    0.000    4.560    2.280 pandas/core/arrays/categorical.py:318(__init__)
        2    0.000    0.000    4.506    2.253 pandas/util/_decorators.py:136(wrapper)
        2    0.003    0.001    4.506    2.253 pandas/core/algorithms.py:576(factorize)
        4    1.170    0.292    4.090    1.022 {method 'argsort' of 'numpy.ndarray' objects}
  1600011    0.870    0.000    3.138    0.000 pandas/tseries/offsets.py:337(__ne__)
  1600011    1.475    0.000    2.267    0.000 pandas/tseries/offsets.py:325(__eq__)
3400729/3400727    0.845    0.000    0.846    0.000 {isinstance}

The _params calls that make up half of the runtime in the before version doesn't even make the cut for the pstats output in the after version.

There is some more tweaking around the edges we can do for perf, but this is the big one. (Also another big one when columns can have PeriodDtype).

passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2018-06-22T03:34:40Z

Codecov Report

Merging #21582 into master will decrease coverage by <.01%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##           master   #21582      +/-   ##
==========================================
- Coverage    91.9%    91.9%   -0.01%     
==========================================
  Files         153      153              
  Lines       49549    49547       -2     
==========================================
- Hits        45539    45537       -2     
  Misses       4010     4010

Flag	Coverage Δ
#multiple	`90.3% <91.66%> (-0.01%)`	⬇️
#single	`41.78% <41.66%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/tseries/offsets.py	`97.16% <91.66%> (-0.01%)`	⬇️
pandas/util/testing.py	`85.27% <0%> (ø)`	⬆️
pandas/core/generic.py	`96.22% <0%> (ø)`	⬆️
pandas/core/tools/numeric.py	`100% <0%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa3e683...4d8411d. Read the comment docs.

jbrockmendel · 2018-06-22T03:46:54Z

Looks like the follow-up (short-circuit a bunch of isinstance calls, lose __ne__ overhead) gets another 2.8x speedup

jreback

can you add a whatsnew note
do we have sufficient asvs to specifically test this? if not can you add

…ache

jbrockmendel · 2018-06-22T18:10:59Z

can you add a whatsnew note
do we have sufficient asvs to specifically test this? if not can you add

Done.

jreback · 2018-06-22T22:57:45Z

thanks!

jbrockmendel added 2 commits June 21, 2018 14:45

cache DateOffset attrs now that they are immutable

f697065

fix pickle errors

3aba608

gfyoung requested a review from jreback June 22, 2018 08:33

jreback requested changes Jun 22, 2018

View reviewed changes

jreback added this to the 0.24.0 milestone Jun 22, 2018

jbrockmendel added 2 commits June 22, 2018 11:06

Merge branch 'master' of https://github.com/pandas-dev/pandas into pc…

e30c211

…ache

asv, whatsnew note

4d8411d

jreback approved these changes Jun 22, 2018

View reviewed changes

jreback merged commit 922c8ba into pandas-dev:master Jun 22, 2018

jbrockmendel deleted the pcache branch June 23, 2018 00:11

jbrockmendel mentioned this pull request Jun 23, 2018

More speedups for Period comparisons #21606

Merged

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

cache DateOffset attrs now that they are immutable (pandas-dev#21582)

607e414

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache DateOffset attrs now that they are immutable #21582

cache DateOffset attrs now that they are immutable #21582

jbrockmendel commented Jun 21, 2018

codecov bot commented Jun 22, 2018 •

edited

Loading

jbrockmendel commented Jun 22, 2018

jreback left a comment

jbrockmendel commented Jun 22, 2018

jreback commented Jun 22, 2018

cache DateOffset attrs now that they are immutable #21582

cache DateOffset attrs now that they are immutable #21582

Conversation

jbrockmendel commented Jun 21, 2018

codecov bot commented Jun 22, 2018 • edited Loading

Codecov Report

jbrockmendel commented Jun 22, 2018

jreback left a comment

Choose a reason for hiding this comment

jbrockmendel commented Jun 22, 2018

jreback commented Jun 22, 2018

codecov bot commented Jun 22, 2018 •

edited

Loading