PERF: Speeds up creation of Period, PeriodArray, with Offset freq #23589

TomAugspurger · 2018-11-09T03:41:20Z

master:

In [2]: freq = pd.tseries.offsets.Day()
   ...:
   ...: %timeit pd.Period("2001", freq=freq)

294 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit pd.Period._maybe_convert_freq(freq)
   ...:
64.7 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

branch:

In [2]: freq = pd.tseries.offsets.Day()
   ...:
   ...: %timeit pd.Period("2001", freq=freq)

158 µs ± 2.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit pd.Period._maybe_convert_freq(freq)
193 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

While looking at the profile in snakeviz, it seems that much of the time in
Period._maybe_convert_freq was spent importing modules. _maybe_convert_freq
calls offsets.to_offset, which imports a Python function inside the function
body, so the import statement runs on every call.

Does Cython not handle this well?
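
For readers who want to see the general effect in isolation, here is a minimal pure-Python sketch. It does not reproduce the pandas or Cython code path: `math.sqrt` stands in for the imported function, and the helper names (`sqrt_with_local_import`, `sqrt_with_module_import`, `compare`) are hypothetical. It only shows that an import statement inside a function body is executed again on each call, even when the module is already loaded.

```python
import timeit
from math import sqrt as _sqrt  # bound once, when this module is loaded


def sqrt_with_local_import(x):
    # The import statement executes on every call: Python consults
    # sys.modules and rebinds the local name each time.
    from math import sqrt
    return sqrt(x)


def sqrt_with_module_import(x):
    # Uses the name that was bound once at module level.
    return _sqrt(x)


def compare(number=200_000):
    """Return (local_import_seconds, module_import_seconds) for `number` calls."""
    local = timeit.timeit(lambda: sqrt_with_local_import(2.0), number=number)
    module = timeit.timeit(lambda: sqrt_with_module_import(2.0), number=number)
    return local, module


if __name__ == "__main__":
    local, module = compare()
    print(f"function-level import: {local:.3f} s")
    print(f"module-level import:   {module:.3f} s")
```

The absolute numbers depend on the machine and Python version, so none are quoted here. How much of the cost seen in the profile comes from this pattern in compiled Cython code is a separate question that this sketch does not answer.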


TomAugspurger · 2018-11-09T03:41:48Z

cc @jbrockmendel

codecov · 2018-11-09T04:22:04Z

Codecov Report

Merging #23589 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #23589      +/-   ##
==========================================
- Coverage   92.25%   92.25%   -0.01%     
==========================================
  Files         161      161              
  Lines       51237    51260      +23     
==========================================
+ Hits        47269    47290      +21     
- Misses       3968     3970       +2

Flag	Coverage Δ
#multiple	`90.63% <ø> (ø)`	⬆️
#single	`42.33% <ø> (+0.03%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/arrays/datetimes.py	`98.42% <0%> (-0.43%)`	⬇️
pandas/core/arrays/period.py	`98.08% <0%> (-0.01%)`	⬇️
pandas/core/dtypes/generic.py	`100% <0%> (ø)`	⬆️
pandas/core/arrays/datetimelike.py	`95.92% <0%> (+0.02%)`	⬆️
pandas/core/arrays/timedeltas.py	`93.78% <0%> (+0.03%)`	⬆️
pandas/util/testing.py	`86.78% <0%> (+0.14%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce62a5c...795a7d1. Read the comment docs.

pandas/_libs/tslibs/period.pyx

jorisvandenbossche · 2018-11-09T11:15:07Z

This already reduces the slowdown (using the example code from the plotting benchmark) from about 5.5 s to 2 s, but it is still much slower than on 0.23 (100 ms).

Issues from the previous time this happened: #11831, #12909, #12903

TomAugspurger · 2018-11-09T12:39:11Z

master:

· Discovering benchmarks
· Running 3 total benchmarks (1 commits * 1 environments * 3 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_taugspurger_Envs_pandas-dev_bin_python3
[ 16.67%] ··· plotting.TimeseriesPlotting.time_plot_irregular                                                                                        265±0ms
[ 33.33%] ··· plotting.TimeseriesPlotting.time_plot_regular                                                                                          4.94±0s
[ 50.00%] ··· plotting.TimeseriesPlotting.time_plot_regular_compat                                                                                   204±0ms

branch:

· Discovering benchmarks
· Running 3 total benchmarks (1 commits * 1 environments * 3 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_taugspurger_Envs_pandas-dev_bin_python3
[ 16.67%] ··· plotting.TimeseriesPlotting.time_plot_irregular                                                                                        204±0ms
[ 33.33%] ··· plotting.TimeseriesPlotting.time_plot_regular                                                                                          1.90±0s
[ 50.00%] ··· plotting.TimeseriesPlotting.time_plot_regular_compat                                                                                   197±0ms

jreback · 2018-11-09T13:41:26Z

thanks!

jorisvandenbossche · 2018-11-09T13:50:14Z

To be clear, although this PR improved performance a bit, the perf regression is not yet fixed
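
To put the figures quoted in this thread side by side, the short sketch below computes the speedup ratios from the timings reported above. The numbers are copied from the comments. The `speedup` helper and the dictionary labels are only illustrative, and the "remaining gap" uses the approximate 2 s and 100 ms figures from the review comment.

```python
def speedup(before_s, after_s):
    """Ratio of the old time to the new time (values above 1 mean faster)."""
    return before_s / after_s


# (master, branch) timings in seconds, as reported in this thread
timings = {
    "Period('2001', freq=Day())": (294e-6, 158e-6),
    "Period._maybe_convert_freq(freq)": (64.7e-6, 193e-9),
    "plotting time_plot_regular (asv)": (4.94, 1.90),
}

for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.1f}x faster on the branch")

# Approximate gap that remains relative to 0.23 (about 2 s now vs. 100 ms then)
remaining_gap = speedup(2.0, 0.1)
print(f"still roughly {remaining_gap:.0f}x slower than 0.23")
```

The large gain in `_maybe_convert_freq` translates into a much smaller gain for `Period` construction and plotting, which is consistent with the point that the regression is not yet fixed.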

jbrockmendel · 2018-11-09T13:58:26Z

> but it is still a lot slower than on 0.23

IIRC the version of to_offset in tslibs.offsets was implemented largely for aesthetic purposes, to have the runtime import in just one place. If there is a perf hit, let's absolutely revert it.

The long-term solution may just be to move to_offset up to cython. The PITA is that this requires moving most of tseries.offsets up to cython.

jorisvandenbossche · 2018-11-09T14:05:39Z

I'm not saying that the remaining performance hit is necessarily due to to_offset (I didn't look at it in enough detail for that), just that there are still performance problems in general with Periods.

For the plotting, it seems to be coming from Period.asfreq, which uses to_offset, amongst other things.


TomAugspurger added the Datetime, Performance, and Period labels Nov 9, 2018

TomAugspurger added this to the 0.24.0 milestone Nov 9, 2018

jbrockmendel reviewed Nov 9, 2018


TomAugspurger added 2 commits November 9, 2018 06:27

move to to_offset

8daba2c

fixup

795a7d1

jreback merged commit f691711 into pandas-dev:master Nov 9, 2018


qwhelan mentioned this pull request Dec 4, 2018

WIP: speed up Period creation #23500

Closed

4 tasks

jorisvandenbossche mentioned this pull request Dec 16, 2018

PERF: regression in time series plotting #24304

Closed
