Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
to_datetime poor performance parsing string datetimes #1571
I want to convert to datetime64 a Series that contains datetimes as strings. The format is '%Y-%m-%d %H:%M:%S' ('2012-07-06 10:05:58', for instance).
Casting the strings array into a datetime64 array in numpy (or using Series.astype("datetime64")) is fast, but it transforms the datetimes according to the local timezone, which is not the behavior I want in this case. Pandas to_datetime function does the parsing right, but it is much slower.
However, it is also possible to do the parsing right and fast with numpy by appending the "+0000" timezone suffix to every string before parsing/casting to datetime64. So I wonder, is there any reason why to_datetime() runs much slower than this approach?
Thanks and regards.
Some sample code to illustrate the issue:
import os import time import numpy as np from datetime import datetime, timedelta from pandas import DatetimeIndex, Series, to_datetime from pandas.tseries.offsets import Minute DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S" # Decorator to execute code in a faked GMT timezone def GMT_Timezone(func): def wrapper(*args, **kwargs): # Set timezone to GMT previous_tz = time.tzname os.environ['TZ'] = 'GMT' time.tzset() # Execute function result = func(*args, **kwargs) # Set timezone to previous state os.environ['TZ'] = previous_tz time.tzset() return result return wrapper def generate_test_data(): start_date = datetime(2012, 1, 1) end_date = datetime(2012, 1, 31) minutes = DatetimeIndex(start=start_date, end=end_date, freq=Minute()) str_minutes = Series(minutes.map(lambda d: d.strftime(DATETIME_FORMAT))) return str_minutes def simple_casting(data): return data.astype("datetime64") def shifting_afterwards(data): casted = data.astype("datetime64") shifted = DatetimeIndex(casted) - timedelta(seconds=time.altzone) return Series(shifted) def concat_gmt_tz(data): data = np.add(data, "+0000") return data.astype("datetime64") def using_to_datetime(data): return to_datetime(data) @GMT_Timezone def faking_tz(data): return data.astype("datetime64") if __name__ == '__main__': test_data = generate_test_data() # Some aproaches to parse the datetime string using_to_dt = using_to_datetime(test_data) simple = simple_casting(test_data) shifted = shifting_afterwards(test_data) faked_tz = faking_tz(test_data) concat_tz = concat_gmt_tz(test_data) # Simple casting does not work. During the process, numpy converts the date # into GMT according to the local timezone. assert not (simple == using_to_dt).all(), "Fails if run from GMT timezone." # Shifting after casting does not work either, the problem being that # we should shift using a different timezone depending on the date we are # parsing (summer vs winter) -> more complex + error-prone + slower assert not (shifted == using_to_dt).all() # Faking the timezone to be GMT avoids datetime transformations, so it seems # to work, although since we are messing with environmental variables # it gets risky (multithreading, other side-effects?) assert (faked_tz == using_to_dt).all() assert (concat_tz == using_to_dt).all() # Now to performance from timeit import Timer ITERATIONS = 5 t = Timer("using_to_datetime(test_data)", "from __main__ import generate_test_data, using_to_datetime;" + \ "test_data = generate_test_data()") t_using_to_dt = t.timeit(ITERATIONS) / ITERATIONS t = Timer("faking_tz(test_data)", "from __main__ import generate_test_data, faking_tz;" + \ "test_data = generate_test_data()") t_faking_tz = t.timeit(ITERATIONS) / ITERATIONS t = Timer("concat_gmt_tz(test_data)", "from __main__ import generate_test_data, concat_gmt_tz;" + \ "test_data = generate_test_data()") t_concat_tz = t.timeit(ITERATIONS) / ITERATIONS print "to_datetime():", t_using_to_dt print "faking tz:", t_faking_tz # to_datetime() ~90 times slower than .astype("datetime64") print "Ratio:", t_using_to_dt / t_faking_tz print print "to_datetime():", t_using_to_dt print "concat tz:", t_concat_tz # to_datetime() ~270 times slower than concat tz + .astype("datetime64") print "Ratio:", t_using_to_dt / t_concat_tz
I ran your file on my box and I get a performance difference of about >2x.
In : %timeit using_to_datetime(test_data)
In : %timeit faking_tz(test_data)
In : %timeit concat_gmt_tz(test_data)
I think it's because pandas is using dateutil internally while numpy uses its own parser that's faster:
In : from numpy.core._mx_datetime_parser import datetime_from_string as p2
In : from dateutil.parser import parse as p1
In : %timeit test_data.apply(p1)
In : %timeit test_data.apply(p2)
I'll see whether we can convert pandas to use the faster date parsing code
Sorry, I should have said it before, but I'm using the '1.8.0.dev-6a06466' version for numpy and '0.8.0' for pandas. It seems like I get much different performances due to recent improvements in numpy.
The results I get are these:
I also run some simple tests using both numpy '1.8.0.dev-6a06466' and '1.6.2' to compare performance:
In : import numpy as np
The results are:
~25 times faster in the newer version
So parsing data with the current develpment version for numpy seems to be significantly faster than using to_datetime. Maybe it could be possible for to_datetime to make use of the new numpy improvements in the future, or maybe try to apply the same optimizations. It would be really nice to be able to use to_datetime with a performance similar to that offered by numpy.
Thanks again and regards.
I was able to optimize the ISO8601 case and bring down the parsing time on 20000 strings from 1.87 seconds to 22.1 ms (85x improvement). Do you think this is adequate? I don't think things can get all that much faster than this.
Wow, definitely it is a huge improvement. It's even faster than doing .astype("datetime64"). These are the results I get now in terms of performance:
However, I've noticed the result values are different now (actually, the asserts in my sample code fail). It seems like it is making a transformation of the datetimes according to the local timezone so, for instance, "2012-01-01 00:00:00" becomes "2011-12-31 23:00:00" in my timezone (CET). This is coherent with the results yielded by .astype("datetime64") and np.array([...], dtype="datetime64").
For my application I would like to have a timezone-agnostic parsing utility just like the older to_datetime, but maybe it makes more sense that to_datetime behaves like it does now, I don't really know. This is a discussion I'm not fit to get in ;).
Please, let me know what you think about this.
Awesome work! After this fix it even runs about 3x faster on top of your first 85x improvement. That really makes a difference in large datasets.
My sample code outputs these estimates now: