Plotting performance deterioration on DataFrames with date-index #4705

Closed
davaco opened this issue Aug 29, 2013 · 4 comments

Comments

@davaco

commented Aug 29, 2013

The code sample below shows a big performance drop when plotting in pandas 0.12, compared to pandas 0.11:

In Pandas 0.12 : "Ran in 0:00:29.542475 secs"
In Pandas 0.11 : "Ran in 0:00:06.653506 secs"

It only happens on a date-indexed DataFrame:

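from datetime import datetime

from numpy.random import randn
from pandas import DataFrame, date_range

N = 10000  # number of rows (one per day)
M = 25     # number of columns

# Date-indexed DataFrame with random data
df = DataFrame(randn(N, M), index=date_range('1/1/1975', periods=N))

# Time a single call to plot()
t0 = datetime.now()
df.plot()
print("Ran in %s secs" % (datetime.now() - t0))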

@jorisvandenbossche

Member

commented Aug 30, 2013

As I also wrote on the mailing list, I can confirm this. For me it takes 35 s (0.12) vs 9 s (0.11) on Windows 7 (both with Matplotlib 1.2.1).

You can see it here (together with the result of %prun):
http://nbviewer.ipython.org/5868420/pandas-slow-plotting-011.ipynb
http://nbviewer.ipython.org/5868420/pandas-slow-plotting-012.ipynb
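
For readers without IPython, a rough equivalent of %prun can be sketched with the standard library's cProfile and pstats modules. The helper name profile_call is hypothetical and not part of pandas or IPython; it only illustrates how a profile like the ones in the notebooks could be collected.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, text report).

    Hypothetical helper: a plain-Python stand-in for IPython's %prun.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Write the report to a string, sorted by cumulative time,
    # limited to the `top` most expensive entries.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

With the DataFrame from the first comment, this could be used as `_, report = profile_call(df.plot)` followed by `print(report)`.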

I looked into it a little. I am not an expert at all, but I thought I would share some insights in case they are useful:

  • The direct reason it is so slow in 0.12 is that it looks up the config options 250025 times (once for each point), because it searches for 'display.encoding' (this is why you see a lot of re in the %prun output). This was introduced by commit ae50103, which changed the unicode repr of PeriodIndex. Just reverting this commit reduces the time from ca. 35 s to ca. 15 s.
  • This is of course not the real reason, because the index values should not be converted to strings at all! Why they are converted to strings I do not know. I tried to debug it, and I end up on line https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L131 with date being a PeriodIndex with 10000 elements. Why I end up there I don't understand, because the mapping over the PeriodIndex (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L115) should feed individual elements to get_datevalue instead of the whole Index.
  • Plotting these time series is very slow, even in the less slow 0.11. Almost all of the time is spent inside converter.py, more specifically in mapping all points in the index through get_datevalue (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L121). Some observations:
    1. The function get_datevalue is called for every point in the DataFrame, so 10000x25 times in the example. That is, it is called again for every column, while the index is the same for all columns. This seems redundant; ideally it should be done only once?
    2. The function get_datevalue is applied to the individual elements of the PeriodIndex to get the ordinal value. This does not seem very efficient to me, and I would think it can be vectorised. With values being a PeriodIndex, I think values.asfreq(axis.freq).values is equivalent to (but much faster than) values.map(lambda x: get_datevalue(x, axis.freq))?
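
The two observations can be illustrated with a small sketch. It is not the actual converter code: get_ordinal is a hypothetical stand-in for get_datevalue, and the sizes are reduced to keep it quick. It also assumes a recent pandas, where `.values` on a PeriodIndex is expected to return Period objects rather than ordinals, so the integer ordinals are read through `.asi8` instead.

```python
import numpy as np
import pandas as pd

N, M = 1000, 5  # smaller than the 10000x25 of the original example
index = pd.period_range("1975-01-01", periods=N, freq="D")

calls = 0


def get_ordinal(period, freq):
    """Hypothetical stand-in for get_datevalue: one Period -> ordinal."""
    global calls
    calls += 1
    return period.asfreq(freq).ordinal


# Observation 1: converting element-wise, once per column, repeats the
# same work M times even though every column shares the same index.
per_column = [
    np.array([get_ordinal(p, "D") for p in index]) for _ in range(M)
]

# Observation 2: the same ordinals, computed once and vectorised.
shared = index.asfreq("D").asi8

print("element-wise calls:", calls)  # N * M Python-level calls
print("results match:", all((col == shared).all() for col in per_column))
```

The element-wise path makes N * M Python-level calls, while the vectorised path converts the whole index in a single operation that can be reused for every column.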
@jorisvandenbossche

Member

commented Aug 30, 2013

After another look, I might have figured it out. Short version: with this commit (jorisvandenbossche@d63e77c) the time goes down from 30s to only 210ms!


Long version:

If you think this is a sensible change, I will put it in a PR (in any case, Travis passes, but I don't know how well this is covered by the tests).

@jreback

Contributor

commented Aug 30, 2013

@jorisvandenbossche seems reasonable
pls put in a PR - u can add some tests (to at least ensure that there is a valid plot)
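
A smoke test along those lines might look like the sketch below. It is not the test that was added to the pandas test suite; the function name check_period_series_plot is hypothetical, and it assumes matplotlib is installed and uses the non-interactive Agg backend so that no display is needed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, must be set before pyplot

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def check_period_series_plot(n=100):
    """Plot a period-indexed Series and check that one line was drawn.

    Hypothetical smoke test: returns the number of plotted points.
    """
    index = pd.period_range("1975-01-01", periods=n, freq="D")
    series = pd.Series(np.random.randn(n), index=index)

    ax = series.plot()
    try:
        lines = ax.get_lines()
        assert len(lines) == 1, "expected exactly one line on the axes"
        return len(lines[0].get_ydata())
    finally:
        plt.close(ax.figure)  # avoid leaking figures between tests
```

This only checks that a line with the expected number of points exists; it says nothing about how fast the plot was produced.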

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Sep 7, 2013
PERF: faster plotting of PeriodIndex (pandas-dev#4705)
TST: add test for PeriodSeries to SeriesPlots test
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Sep 7, 2013
PERF: faster plotting of PeriodIndex (pandas-dev#4705)
TST: add test for plotting PeriodSeries to SeriesPlots test
@jreback

Contributor

commented Sep 7, 2013

closed by #4722

jreback closed this Sep 7, 2013
