Speed up DatetimeConverter for plotting #6636

agijsberts · 2014-03-14T18:05:35Z

I've recently started using pandas (impressed so far!) and found that plotting large data (from around 100k samples) is quite slow. I traced the bottleneck to the _dt_to_float_ordinal helper function called by DatetimeConverter (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L144).

More specifically, this function uses matplotlib's date2num, which converts arrays and iterables using a slow list comprehension. Since pandas seems to natively store datetimes as epoch+nanoseconds in an int64 array, it would be much faster to use matplotlib's vectorized epoch2num instead. In a testcase with 1 million points, using epoch2num is about 100 times faster than date2num:

from pandas import date_range, DataFrame
from numpy import int64, arange
from matplotlib import pyplot, dates
import time

n = 10**6  # integer sample count (periods= expects an int, not the float 1e6)

df = DataFrame(arange(n), index = date_range('20130101', periods=n, freq='S'))

start = time.time()
pyplot.plot(df.index, df)
print('date2num took {0:g}s'.format(time.time() - start))
pyplot.show()

# monkey patch
import pandas.tseries.converter
def _my_dt_to_float_ordinal(dt):
    try:
        base = dates.epoch2num(dt.astype(int64) / 1.0E9)
    except AttributeError:
        base = dates.date2num(dt)
    return base
pandas.tseries.converter._dt_to_float_ordinal = _my_dt_to_float_ordinal

start = time.time()
pyplot.plot(df.index, df)
print('epoch2num took {0:g}s'.format(time.time() - start))
pyplot.show()

Unfortunately, I am not familiar enough with pandas to know whether date2num is used intentionally or to implement a proper patch myself that works in all cases.
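
A minimal sketch of the arithmetic behind the vectorized conversion described above, using only NumPy so that it does not depend on a particular pandas or matplotlib version. It assumes the date convention discussed in this thread (float days counted from 0001-01-01, so that 1970-01-01 has ordinal 719163). The function names are hypothetical and are not part of pandas or matplotlib.

```python
from datetime import datetime

import numpy as np

# Ordinal of the Unix epoch in the "days since 0001-01-01" convention.
EPOCH_ORDINAL = datetime(1970, 1, 1).toordinal()
SECONDS_PER_DAY = 86400.0


def ns_to_float_ordinal(ns):
    """Vectorized: int64 nanoseconds since the Unix epoch -> float day ordinals."""
    ns = np.asarray(ns, dtype=np.int64)
    return EPOCH_ORDINAL + ns / 1e9 / SECONDS_PER_DAY


def datetimes_to_float_ordinal(dts):
    """Per-element conversion with a list comprehension (the slow path)."""
    return np.array([
        dt.toordinal()
        + (dt.hour * 3600 + dt.minute * 60 + dt.second + dt.microsecond / 1e6)
        / SECONDS_PER_DAY
        for dt in dts
    ])


# 1000 timestamps at one-second spacing, stored as datetime64[ns].
stamps = np.datetime64("2013-01-01", "ns") + np.arange(1000) * np.timedelta64(1, "s")

fast = ns_to_float_ordinal(stamps.astype(np.int64))
slow = datetimes_to_float_ordinal(stamps.astype("datetime64[us]").tolist())
```

Both paths should agree to within floating-point rounding; the vectorized one avoids creating a Python datetime object per sample, which is where the time goes for large arrays.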

jreback · 2014-03-14T18:38:47Z

I think they have the same effect, so this should be good. @TomAugspurger

@agijsberts pls do a pull-request and we can get this in.

I think we may need to manually validate that the graphs are correct, as we don't do comparison graphs per se (more of a validation that they plot and that the returned objects are 'ok').

pls add a vbench for this as well.

good catch

jreback · 2014-03-14T18:39:19Z

https://github.com/pydata/pandas/wiki has a section on how to do the PR

TomAugspurger · 2014-03-15T14:01:50Z

Applying your change didn't seem to break any tests, so this should be good. @jreback do you know if there will be any problems on 32-bit systems?

@agijsberts a pull request would be great for this. Let me know if you have any trouble.

jreback · 2014-03-15T14:32:28Z

@agijsberts yep let's give a try on this

agijsberts · 2014-03-15T14:44:16Z

Sorry, things are moving a bit slow since it's my first time preparing a PR (setting up git, virtualenv etc.). I'm manually checking the correctness of the plots at the moment (at least w.r.t. the current implementation). Expect a PR later today.

Note by the way that pandas' plotting functions (e.g., DataFrame.plot) do not benefit from this patch, as they do other trickery with time axes. The patch is therefore unfortunately only helpful in use-cases where matplotlib's functions are used directly with the time index.

jreback · 2014-03-17T19:50:55Z

@agijsberts can you elaborate on this point, e.g. about DataFrame.plot not using this new speedup?

agijsberts · 2014-03-17T21:09:58Z

@jreback The function _dt_to_float_ordinal is (almost?) exclusively used by the DatetimeConverter, which is invoked if you pass a DatetimeIndex directly as one of the axes to matplotlib.

As far as I can tell, DataFrame.plot instead insists on converting any DatetimeIndex to a PeriodIndex (e.g., https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py#L1520, https://github.com/pydata/pandas/blob/master/pandas/tseries/plotting.py#L56). These plots will therefore use the completely different PeriodConverter. This converter for instance uses unix time rather than MPL's new date format (days since 0001) and uses its own tick locators and date formatters.
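
To make the difference between the two numeric representations concrete, here is a small sketch. It is not the code that DataFrame.plot runs internally; it only assumes that a daily PeriodIndex counts ordinals from the Unix epoch, while the matplotlib convention discussed here counts days from 0001-01-01. The variable names are illustrative.

```python
from datetime import datetime

import pandas as pd

idx = pd.date_range("2013-01-01", periods=5, freq="D")

# What the PeriodConverter path works with: period ordinals (days since 1970-01-01).
periods = idx.to_period()  # frequency is taken from the index
period_ordinals = [p.ordinal for p in periods]

# What the DatetimeConverter path works with: days since 0001-01-01.
day_ordinals = [ts.toordinal() for ts in idx]

# The two differ by a constant offset, the ordinal of 1970-01-01.
offset = day_ordinals[0] - period_ordinals[0]
```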

My 2 cents: the optimal and unifying solution would be for MPL to support numpy's datetime64, which would ideally also allow plotting on the scale of nanoseconds. There doesn't seem to be any ongoing work in that direction though (matplotlib/matplotlib#1097).

jreback · 2014-03-17T21:17:38Z

ok

since looks like MPL is somewhat behind the curve here

a datetime to period index conversion is fast, so is there a problem, speed or otherwise?

aside from this inelegance (which prob has a reason behind it - haven't delved into the plotting code) - can u see why this is done? and if it warrants change (eg unify to one or the other)

agijsberts · 2014-03-17T21:57:36Z

My guess is that the two converters come from different backgrounds: the PeriodConverter and plotting code seem to originate from scikits.timeseries, while DatetimeConverter instead comes from MPL's plot_date function and dates.DateFormatter.

Conversion with PeriodConverter is actually quite fast, but DataFrame.plot does quite a bit more than just converting and plotting. For instance, conversion is done dynamically when zooming/panning. Though a nice feature, it does seem to make zooming and panning somewhat laggy for large data (say >1M points).

I do not think it really warrants change, as both approaches (DataFrame.plot vs pyplot.plot) seem to have clear benefits. The former is easier and more feature-rich, while the latter is simply faster (and obviously gives the user more low-level control). In this context, the DatetimeConverter is actually just a small convenience for those users who prefer to use matplotlib directly.
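
The two approaches compared above can be sketched side by side as follows. This is only an illustration with made-up data and hypothetical variable names; it assumes a headless backend (Agg) so that it runs without a display, and it makes no claim about relative timings on any particular version.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, assumed here so no window is needed

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(
    {"value": np.arange(n)},
    index=pd.date_range("2013-01-01", periods=n, freq="min"),
)

fig, (ax_pandas, ax_mpl) = plt.subplots(2, 1)

# High-level route: pandas manages the time axis, tick locators and formatters.
df["value"].plot(ax=ax_pandas)

# Low-level route: pass the DatetimeIndex straight to matplotlib.
ax_mpl.plot(df.index, df["value"].values)
fig.autofmt_xdate()
```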

jreback · 2014-03-17T22:04:46Z

ok that sounds good

would u put a small PR together to put something like that into the plotting docs (not sure exactly where) so a user would know when it 'pays' to go to a low-level approach

agijsberts · 2014-03-18T13:39:22Z

Just prepared PR #6660. Is this more or less what you had in mind?

jreback added Visualization labels Mar 14, 2014

jreback added this to the 0.14.0 milestone Mar 14, 2014

agijsberts mentioned this issue Mar 15, 2014

PERF: Speed up DatetimeConverter by using Matplotlib's epoch2num when possible... #6650

Merged

TomAugspurger closed this as completed in #6650 Mar 17, 2014

TomAugspurger mentioned this issue May 25, 2014

test_dateindex_conversion fails on Python 3.4 / NumPy 1.8.1 / MPL 1.4 master / Ubuntu 12.04 #7233

Closed

agijsberts mentioned this issue Oct 27, 2014

DOC: update docs on direct plotting with matplotlib (GH8614) #8655

Closed
