PERF: allow even more flexible ISO 8601 datetime parsing #11899

Closed
femtotrader opened this Issue Dec 25, 2015 · 5 comments

Comments

Projects
None yet
2 participants

Hello,

I noticed that there is a huge code speed difference with to_datetime execution when format is not given and when it's given.

I wonder if there is not some room for improvements here!

In [1]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 3.31 s, sys: 481 ms, total: 3.79 s
Wall time: 4.13 s

In [2]: df
Out[274]:
          Symbol                   Date      Bid      Ask
0        AUD/USD  20140101 21:55:34.404  0.88796  0.88922
1        AUD/USD  20140101 21:55:34.444  0.88805  0.88914
2        AUD/USD  20140101 21:55:34.475  0.88809  0.88910
3        AUD/USD  20140101 21:55:48.962  0.88811  0.88908
4        AUD/USD  20140101 21:56:38.293  0.88808  0.88887
...          ...                    ...      ...      ...
1947101  AUD/USD  20140131 21:59:48.048  0.87525  0.87589
1947102  AUD/USD  20140131 21:59:54.599  0.87527  0.87589
1947103  AUD/USD  20140131 21:59:56.927  0.87531  0.87588
1947104  AUD/USD  20140131 21:59:59.365  0.87531  0.87574
1947105  AUD/USD  20140131 22:00:00.038  0.87531  0.87574

[1947106 rows x 4 columns]

In [3]: %time pd.to_datetime(df['Date'])
CPU times: user 11min 44s, sys: 19.4 s, total: 12min 4s
Wall time: 12min 46s
Out[3]:
0         2014-01-01 21:55:34.404
1         2014-01-01 21:55:34.444
2         2014-01-01 21:55:34.475
3         2014-01-01 21:55:48.962
4         2014-01-01 21:56:38.293
                    ...
1947101   2014-01-31 21:59:48.048
1947102   2014-01-31 21:59:54.599
1947103   2014-01-31 21:59:56.927
1947104   2014-01-31 21:59:59.365
1947105   2014-01-31 22:00:00.038
Name: Date, dtype: datetime64[ns]

In [4]: fmt='%Y%m%d %H:%M:%S.%f'

In [5]: %time pd.to_datetime(df['Date'], format=fmt)
CPU times: user 37.3 s, sys: 1.31 s, total: 38.6 s
Wall time: 40 s

In [6]: timedelta(minutes=12, seconds=46) / timedelta(seconds=40)
Out[6]: 19.15

There is x19.15 factor!!!

Sample data can be found here
https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing

See also pydata/pandas-datareader#153

Contributor

jreback commented Dec 26, 2015

this falls back to dateutil which is in python space for parsing. So it behooves you to specify the format in this case. No real easy way around this, unless the ISO 8601 parser was modified to accept this format (which is not that difficult as its the same format with no separators).

If you are another brave soul would like to submit a pull-request it would be great.

jreback added this to the Someday milestone Dec 26, 2015

Contributor

jreback commented Dec 26, 2015

Here's an easy way to repro things like this:
You can also use infer_datetime_format=True which will effectively guess the format for you (if its close to ISO 8601)

In [16]: s = Series(pd.date_range('20130101',freq='ms',periods=10000)).dt.strftime('%Y%m%d %H:%M:%S.%f')

In [17]: s.head()
Out[17]: 
0    20130101 00:00:00.000000
1    20130101 00:00:00.001000
2    20130101 00:00:00.002000
3    20130101 00:00:00.003000
4    20130101 00:00:00.004000
dtype: object

In [18]: %timeit pd.to_datetime(s)
1 loops, best of 3: 1.09 s per loop

In [19]: %timeit pd.to_datetime(s,format='%Y%m%d %H:%M:%S.%f')
10 loops, best of 3: 61.9 ms per loop

In [20]: %timeit pd.to_datetime(s,infer_datetime_format=True)
10 loops, best of 3: 62.8 ms per loop

Thanks @jreback for infer_datetime_format this is good for my use case.

Not sure if I need to send PR to pandas or to dateutil

https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py

Contributor

jreback commented Dec 26, 2015

I am referring to pandas of course, code is here

jreback changed the title from to_datetime is very slow when no format is given to PERF: allow even more flexible ISO 8601 datetime parsing Dec 26, 2015

Contributor

jreback commented Dec 26, 2015

xref #9714

@jreback jreback modified the milestone: 0.18.0, Someday Jan 19, 2016

jreback closed this in 5de6b84 Jan 26, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment