PERF: some regressions #11084

Closed
1 of 13 tasks
jreback opened this issue Sep 13, 2015 · 3 comments
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version

Comments

@jreback
Contributor

jreback commented Sep 13, 2015

http://pydata.github.io/pandas/# is a view of performance since 0.14 (it's not every tag, but a sampling).
The regressions page is now working here.

timeseries / period related:

@jreback jreback added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Sep 13, 2015
@jreback jreback modified the milestones: 0.17.0, 0.17.1 Sep 13, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.17.1 Sep 20, 2015
@jreback jreback modified the milestones: 0.17.1, 0.17.0 Sep 20, 2015
@jorisvandenbossche
Member

@sinhrks I was looking at the time series plotting slowdown (time_plot_timeseries_period; timeseries plotting has been roughly 5 times slower since 0.16.2).

It is related to some of the Period changes, namely that freq is no longer a string but a DateOffset object.
If you profile df.plot(), most of the time is spent in to_offset. At a certain point (in converter.py:convert), an object-dtype array of Period objects is converted back to a PeriodIndex:

In [1]: values = pd.period_range('1/1/1975', periods=2000).astype(object).values

In [2]: values
Out[2]:
array([Period('1975-01-01', 'D'), Period('1975-01-02', 'D'),
       Period('1975-01-03', 'D'), ..., Period('1980-06-20', 'D'),
       Period('1980-06-21', 'D'), Period('1980-06-22', 'D')], dtype=object)

In [3]: %timeit pd.PeriodIndex(values, freq='D')
100 loops, best of 3: 1.86 ms per loop

The above is with 0.16.2; on master this takes 109 ms instead of 1.86 ms. The reason for the slowdown is that PeriodIndex._from_arraylike tries to extract the freq from each object and checks whether it equals the given freq. Previously this was a string equality check; now it is a DateOffset/string equality check.
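
To make the cost model concrete, here is a minimal pure-Python sketch of the two kinds of check. It does not use pandas: `FakeOffset`, `FakePeriod`, `parse_offset` and both `freq_matches_*` helpers are hypothetical stand-ins, and the assumption that comparing an offset object with a string re-parses the string on every comparison is only a model of the behaviour described above, not the actual pandas code.

```python
import re

# Hypothetical stand-ins for illustration only; none of these are pandas APIs.
calls = {"parse": 0}
_FREQ_RE = re.compile(r"\d*([A-Za-z]+)")


def parse_offset(freqstr):
    """Parse a frequency string into a FakeOffset, counting every call."""
    calls["parse"] += 1
    match = _FREQ_RE.fullmatch(freqstr)
    if match is None:
        raise ValueError(f"invalid frequency string: {freqstr!r}")
    return FakeOffset(match.group(1))


class FakeOffset:
    """Offset-like object whose comparison with a string parses the string."""

    def __init__(self, code):
        self.code = code

    def __eq__(self, other):
        if isinstance(other, str):
            # Assumed expensive step: the string is parsed on each comparison.
            other = parse_offset(other)
        return isinstance(other, FakeOffset) and self.code == other.code

    def __hash__(self):
        return hash(self.code)


class FakePeriod:
    """Period-like object carrying an offset object as its freq."""

    def __init__(self, ordinal, freq):
        self.ordinal = ordinal
        self.freq = freq


def freq_matches_by_object(periods, freq):
    """Offset/string equality per element: one parse per element."""
    return all(p.freq == freq for p in periods)


def freq_matches_by_string(periods, freq):
    """Plain string equality per element: no parsing at all."""
    return all(p.freq.code == freq for p in periods)


periods = [FakePeriod(i, FakeOffset("D")) for i in range(2000)]

ok_object = freq_matches_by_object(periods, "D")
parses_after_object = calls["parse"]

ok_string = freq_matches_by_string(periods, "D")
parses_after_string = calls["parse"] - parses_after_object
```

In this model the object-based check parses the frequency string once per element, while the string-based check never parses. That matches the shape of the slowdown, though the real numbers depend on what to_offset does internally.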

Looking for a possible fix: this commit, jorisvandenbossche@55ecbf0 (making it compare strings again), resolves most of the perf issue. But I was wondering, do you know a better approach?
Maybe we could prevent this step (array of Period objects -> PeriodIndex) altogether in the plotting code? (although this is initially called from

@sinhrks
Member

sinhrks commented Sep 21, 2015

Thanks for catching this. In addition to your ideas, caching the str -> freq mapping in to_offset may work. This is already done in get_offset, and I think the same cache can be used.

Let me look into this.
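
A rough sketch of what such a cache could look like, in plain Python rather than pandas code. `Offset`, `parse_freq`, `to_offset_cached` and `_offset_cache` are hypothetical names chosen for illustration, and the parsing rule is a toy; the only point is that each distinct frequency string is parsed once and the resulting object is reused.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Offset:
    """Hypothetical immutable offset, so cached instances are safe to share."""
    n: int
    code: str


_FREQ_RE = re.compile(r"(\d*)([A-Za-z]+)")
_offset_cache = {}       # maps frequency string -> Offset
stats = {"parsed": 0}    # counts how often the slow path runs


def parse_freq(freqstr):
    """Slow path: parse a string such as 'D' or '2H' into an Offset."""
    stats["parsed"] += 1
    match = _FREQ_RE.fullmatch(freqstr.strip())
    if match is None:
        raise ValueError(f"invalid frequency string: {freqstr!r}")
    count, code = match.groups()
    return Offset(int(count) if count else 1, code)


def to_offset_cached(freq):
    """Return an Offset for freq, parsing each distinct string only once."""
    if isinstance(freq, Offset):
        return freq
    try:
        return _offset_cache[freq]
    except KeyError:
        offset = _offset_cache[freq] = parse_freq(freq)
        return offset


# 2000 lookups of the same string trigger a single parse.
offsets = [to_offset_cached("D") for _ in range(2000)]
```

Because every caller receives the same cached object, the sketch uses a frozen dataclass; sharing mutable offsets through a cache would let one caller's change leak into another's.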

@jreback
Contributor Author

jreback commented Jan 1, 2020

closing as stale

@jreback jreback closed this as completed Jan 1, 2020
@jreback jreback modified the milestones: Contributions Welcome, No action Jan 1, 2020