Speed up DatetimeConverter for plotting #6636

Closed
agijsberts opened this issue Mar 14, 2014 · 11 comments · Fixed by #6650
Labels
Performance Memory or execution speed performance Visualization plotting

@agijsberts
Contributor

I've recently started using pandas (impressed so far!) and found that plotting large data (from around 100k samples) is quite slow. I traced the bottleneck to the _dt_to_float_ordinal helper function called by DatetimeConverter (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L144).

More specifically, this function uses matplotlib's date2num, which converts arrays and iterables using a slow list comprehension. Since pandas seems to natively store datetimes as epoch+nanoseconds in an int64 array, it would be much faster to use matplotlib's vectorized epoch2num instead. In a test case with 1 million points, using epoch2num is about 100 times faster than date2num:

from pandas import date_range, DataFrame
from numpy import int64, arange
from matplotlib import pyplot, dates
import time

n = 10**6  # one million samples, as an integer

df = DataFrame(arange(n), index = date_range('20130101', periods=n, freq='S'))

start = time.time()
pyplot.plot(df.index, df)
print('date2num took {0:g}s'.format(time.time() - start))
pyplot.show()

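# monkey patch: swap pandas' converter helper for a vectorized version
import pandas.tseries.converter
def _my_dt_to_float_ordinal(dt):
    try:
        # int64 nanoseconds since the epoch -> seconds -> matplotlib date number
        base = dates.epoch2num(dt.astype(int64) / 1.0E9)
    except AttributeError:
        # no astype (e.g. a plain datetime object): fall back to the slow path
        base = dates.date2num(dt)
    return base
pandas.tseries.converter._dt_to_float_ordinal = _my_dt_to_float_ordinal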

start = time.time()
pyplot.plot(df.index, df)
print('epoch2num took {0:g}s'.format(time.time() - start))
pyplot.show()

Unfortunately, I am not familiar enough with pandas to know whether date2num is used intentionally or to implement a proper patch myself that works in all cases.
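
A minimal standard-library sketch of why the two routes should agree, assuming the float date number is the proleptic Gregorian ordinal (day 1 is 0001-01-01) plus the elapsed fraction of the day. The helper names below are hypothetical and are not part of pandas or matplotlib; the sketch only mirrors the arithmetic, not the vectorized speed.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86400.0
# Ordinal of 1970-01-01 when day 1 is 0001-01-01.
EPOCH_ORDINAL = datetime(1970, 1, 1).toordinal()


def ordinal_from_datetime(dt):
    """Per-object route: ordinal day plus the fraction of the day elapsed."""
    midnight = datetime(dt.year, dt.month, dt.day)
    return dt.toordinal() + (dt - midnight).total_seconds() / SECONDS_PER_DAY


def epoch_ns(dt):
    """Integer nanoseconds since 1970-01-01 for a naive datetime."""
    delta = dt - datetime(1970, 1, 1)
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000


def ordinal_from_epoch_ns(ns):
    """Arithmetic route: epoch nanoseconds -> seconds -> float ordinal days."""
    return EPOCH_ORDINAL + (ns / 1e9) / SECONDS_PER_DAY


# A few second-resolution samples starting at the date used in the test case.
samples = [datetime(2013, 1, 1) + timedelta(seconds=s)
           for s in (0, 1, 3600, 86399, 10**6)]

# Largest disagreement between the two routes, in days.
max_diff = max(abs(ordinal_from_datetime(d) - ordinal_from_epoch_ns(epoch_ns(d)))
               for d in samples)
print("largest difference in days:", max_diff)
```

On these samples the two routes differ by no more than float rounding, which is consistent with the expectation that swapping date2num for epoch2num leaves the plotted positions unchanged.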

@jreback
Contributor

jreback commented Mar 14, 2014

I think they are the same effect, so this should be good. @TomAugspurger

@agijsberts pls do a pull request and we can get this in.

I think we may need to manually validate that the graphs are correct, as we don't do comparison graphs per se (more of a validation that they plot and that the returned objects are 'ok').

pls add a vbench for this as well.

good catch

@jreback jreback added this to the 0.14.0 milestone Mar 14, 2014
@jreback
Contributor

jreback commented Mar 14, 2014

see the https://github.com/pydata/pandas/wiki section on how to do the PR

@TomAugspurger
Contributor

Applying your change didn't seem to break any tests, so this should be good. @jreback do you know if there will be any problems on 32-bit systems?

@agijsberts a pull request would be great for this. Let me know if you have any trouble.

@jreback
Contributor

jreback commented Mar 15, 2014

@agijsberts yep let's give a try on this

@agijsberts
Contributor Author

Sorry, things are moving a bit slow since it's my first time preparing a PR (setting up git, virtualenv etc.). I'm manually checking the correctness of the plots at the moment (at least w.r.t. the current implementation). Expect a PR later today.

Note by the way that pandas' plotting functions (e.g., DataFrame.plot) do not benefit from this patch, as they do other trickery with time axes. The patch is therefore unfortunately only helpful in use-cases where matplotlib's functions are used directly with the time index.

@jreback
Contributor

jreback commented Mar 17, 2014

@agijsberts can you elaborate on this point, e.g. about DataFrame.plot not using this new speedup?

@agijsberts
Contributor Author

@jreback The function _dt_to_float_ordinal is (almost?) exclusively used by the DatetimeConverter, which is invoked if you pass a DatetimeIndex directly as one of the axes to matplotlib.

As far as I can tell, DataFrame.plot instead insists on converting any DatetimeIndex to a PeriodIndex (e.g., https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py#L1520, https://github.com/pydata/pandas/blob/master/pandas/tseries/plotting.py#L56). These plots will therefore use the completely different PeriodConverter. This converter for instance uses unix time rather than MPL's new date format (days since 0001) and uses its own tick locators and date formatters.

My 2 cents: the optimal and unifying solution would be for MPL to support numpy's datetime64, which would ideally also allow plotting on the scale of nanoseconds. There doesn't seem to be any ongoing work in that direction though (matplotlib/matplotlib#1097).
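
A rough back-of-the-envelope check of the nanosecond point, assuming dates are held as float64 days since 0001 as described above. The date used is simply the day this issue was opened, and nothing here calls matplotlib.

```python
import math
from datetime import date

SECONDS_PER_DAY = 86400.0

# Float ordinal (days since 0001-01-01, day 1 based) for 2014-03-14.
ordinal = float(date(2014, 3, 14).toordinal())

# Gap between adjacent float64 values near that ordinal, in days and seconds.
step_days = math.ulp(ordinal)
step_seconds = step_days * SECONDS_PER_DAY

print("float ordinal resolution: {0:.2f} microseconds".format(step_seconds * 1e6))
```

Adjacent float64 values near that date are about 10 microseconds apart, so a float day count cannot distinguish individual nanoseconds, whereas an int64 nanosecond count can.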

@jreback
Contributor

jreback commented Mar 17, 2014

ok

since looks like MPL is somewhat behind the curve here

a datetime-to-period index conversion is fast, so is there a problem, speed or otherwise?

aside from this inelegance (which prob has a reason behind it - haven't delved into the plotting code) - can u see why this is done? and if it warrants change (eg unify to one or the other)

@agijsberts
Contributor Author

My guess is that the two converters came from different backgrounds: the PeriodConverter and plotting code seem to originate from scikits.timeseries, while DatetimeConverter instead comes from MPL's plot_date function and dates.DateFormatter.

Conversion with PeriodConverter is actually quite fast, but DataFrame.plot does quite a bit more than just converting and plotting. For instance, conversion is done dynamically when zooming/panning. Though a nice feature, it does seem to make zooming and panning somewhat laggy for large data (say >1M points).

I do not think it really warrants change, as both approaches (DataFrame.plot vs pyplot.plot) seem to have clear benefits. The former is easier and more feature-rich, while the latter is simply faster (and obviously gives the user more low-level control). In this context, the DatetimeConverter is actually just a small convenience for those users who prefer to use matplotlib directly.

@jreback
Contributor

jreback commented Mar 17, 2014

ok that sounds good

would u put a small PR together to put something like that into the plotting docs (not sure exactly where) so a user would know when it 'pays' to go to a low-level approach

@agijsberts
Contributor Author

Just prepared PR #6660. Is this more or less what you had in mind?
