pd.to_datetime much slower with supplied format than when format is inferred #10178

dsimmie · 2015-05-20T12:46:20Z

Converting a column of date strings is much slower when a date format is supplied than when the format is inferred. I would've thought there should be less work to do when the format is known (and supplied).

To test

import pandas as pd
df = pd.DataFrame({'date_text':["2015-05-18" for i in range(10**6)]})
%timeit pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
10 loops, best of 3: 115 ms per loop
#Top line from %prun of same command:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1       0.095    0.095    0.095  0.095   {pandas.tslib.array_to_datetime} 

%timeit pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
1 loops, best of 3: 2.27 s per loop
#Top line from %prun of same command:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1       2.282    2.282    2.282  2.282   {pandas.tslib.array_strptime}

This plot is taken from this S/O post which shows the difference over a larger range of sizes (and compared to other methods).

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.10
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.2.0-b1
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: 1.4.0
rpy2: None
sqlalchemy: 1.0.0
pymysql: None
psycopg2: None

jreback · 2015-05-20T13:00:14Z

@dsimmie as I explained to you. This is as expected. It is doing regex-matching. You are welcome to have a look.

jreback · 2015-05-20T13:03:06Z

See the code here: https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L277

There are fast-paths for ISO8601 strings and %Y%m%d. Others hit the regex engine.

sinhrks · 2015-05-20T13:05:41Z

Maybe we should clarify format kw is for flexibility rather than performance? I assume not passing format (and dateutil will be used in most cases) is faster than regex.

jreback · 2015-05-20T13:06:33Z

No, dateutil is almost NEVER used, and it is MUCH slower; it's all in Python. When I say regex, I mean a specially constructed strptime-like object in Cython.
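To illustrate why a supplied format costs one regex match per value, here is a minimal pure-Python sketch of a strptime-like parser that compiles the format into a regex once and then matches every string against it. It is not the pandas implementation (which is written in Cython); `DIRECTIVE_PATTERNS`, `compile_format` and `parse_with_regex` are hypothetical names, and only `%Y`, `%m` and `%d` are handled.

```python
import re
from datetime import datetime

# Hypothetical, heavily reduced directive table; a real strptime supports many more.
DIRECTIVE_PATTERNS = {
    "Y": r"(?P<Y>\d{4})",
    "m": r"(?P<m>\d{1,2})",
    "d": r"(?P<d>\d{1,2})",
}


def compile_format(fmt):
    """Translate a strptime-style format string into a compiled regex."""
    parts = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            # Unsupported directives raise KeyError in this sketch.
            parts.append(DIRECTIVE_PATTERNS[fmt[i + 1]])
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def parse_with_regex(values, fmt):
    """Parse every string in `values` by matching it against the format regex."""
    pattern = compile_format(fmt)  # compiled once, matched once per value
    out = []
    for value in values:
        match = pattern.match(value)
        if match is None:
            raise ValueError("%r does not match format %r" % (value, fmt))
        fields = match.groupdict()
        out.append(datetime(int(fields["Y"]), int(fields["m"]), int(fields["d"])))
    return out
```

Even with the regex compiled up front, each value still pays for a match, a group lookup and several integer conversions, which is the per-element cost a dedicated ISO8601 parser avoids.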

jreback · 2015-05-20T13:08:01Z

you can certainly profile this if you'd like: https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L2257

dsimmie · 2015-05-20T13:32:23Z

Hi Jeff. You said: "If you have repeated non-ISO dates it will help a lot. Since you have an ISO date it doesn't make much difference (as the parser is in c anyhow)". It does make a difference however and supplying the single unchanging format slows this operation significantly. The date format I put in, %Y-%m-%d, is a valid ISO8601 date if perhaps not a datetime. Seeing how '%Y%m%d' is given a fast-path perhaps that string could also get one.

https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L297

if format == '%Y%m%d' or format == '%Y-%m-%d':
    try:    
        result = _attempt_YYYYMMDD(arg, coerce=coerce)
    except:
        raise ValueError("cannot convert the input to '%Y%m%d' date format")

A change like that would necessitate a change to _attempt_YYYYMMDD that would involve splitting on the hyphen, and I don't know if that is agreeable. I don't really mind that inferring is quicker than being told, but it is certainly not obvious behaviour when you have put in an ISO8601 date string to start.

jreback · 2015-05-20T13:56:07Z

@dsimmie I misspoke a bit - no caching with repeated dates (it could be done and I have seen it done, not sure of the utility; this is a cache not on the format but on the actual date values themselves), sort of a separate issue.

The change you propose will be much slower (%Y%m%d parses quickly because it can be turned into an integer).
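The integer trick can be sketched in plain Python as follows. This is only an illustration of the idea, not the code behind `_attempt_YYYYMMDD`, and `parse_yyyymmdd` is a hypothetical name.

```python
from datetime import date


def parse_yyyymmdd(text):
    """Parse a compact 'YYYYMMDD' string using integer arithmetic only."""
    number = int(text)                  # "20150518" -> 20150518
    year, rest = divmod(number, 10000)  # 2015, 518
    month, day = divmod(rest, 100)      # 5, 18
    return date(year, month, day)       # date() rejects impossible month/day values
```

A hyphenated string such as "2015-05-18" cannot take this route, because `int("2015-05-18")` raises `ValueError`; it would first have to be split or stripped of hyphens, which adds per-string work.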

rather, you could map certain ISO8601 like formats to the generic format (which is fast pathed), e.g.

something like:

iso_formats = set(['%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d'])
if format in iso_formats:
    format = '%Y-%m-%d %H:%M:%S.%f'

This is what infer_datetime_format=True does after it guesses.

dsimmie · 2015-05-20T14:02:32Z

OK thanks for the clarification and your time. It would be nice if that format string '%Y-%m-%d' was fast-pathed... agree my solution was naive, I haven't seen any of this code before and haven't read it in any detail yet.

jorisvandenbossche · 2015-05-20T14:06:43Z

Update: this is now a bit repetitive, but I was already typing:

I think the point here is that there is a fastpath for ISO8601 formatted strings. With infer_datetime_format, if the format guessing returns %Y-%m-%d this fast path is used, but when giving this format manually, this fast path is not used.
For non-ISO strings, you won't see this difference between inferred and provided format I think.

So we could do this checking for fastpath after infer_datetime_format is handled (so for both this and manually provided format)
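The ordering suggested here can be sketched roughly as below. This is a toy model of the control flow only, not the code in `pandas/tseries/tools.py`; `guess_format`, `choose_parser`, the returned labels and the contents of `ISO_LIKE_FORMATS` are all assumptions made for illustration.

```python
from datetime import datetime

# Assumption: formats treated as ISO8601-like for fast-path purposes.
ISO_LIKE_FORMATS = frozenset([
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S.%f",
])


def guess_format(sample):
    """Hypothetical stand-in for format inference; only recognises plain ISO dates."""
    try:
        datetime.strptime(sample, "%Y-%m-%d")
    except ValueError:
        return None
    return "%Y-%m-%d"


def choose_parser(values, fmt=None, infer=False):
    """Return a label for the code path a call would take in this toy model."""
    if fmt is None and infer and values:
        fmt = guess_format(values[0])
    # The fast-path check runs after inference, so a supplied format and an
    # inferred format are treated identically.
    if fmt is None:
        return "generic_parser"
    if fmt in ISO_LIKE_FORMATS:
        return "iso_fast_path"
    return "strptime"
```

With this ordering, `choose_parser(values, fmt="%Y-%m-%d")` and `choose_parser(values, infer=True)` on ISO dates both end up on the same path, while a non-ISO format such as `"%d/%m/%Y"` still goes through the strptime route.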

@dsimmie reopening this, as this is a valid improvement I think.

jreback · 2015-05-20T14:17:39Z

agreed, this is a valid issue. (and the fix is pretty straightforward as I describe above)

jreback · 2015-07-20T23:25:42Z

closed by #10615

jreback added the Datetime and Performance labels May 20, 2015

jreback added this to the Someday milestone May 20, 2015

dsimmie closed this as completed May 20, 2015

jorisvandenbossche reopened this May 20, 2015

jreback modified the milestones: 0.17.0, Someday May 20, 2015

chris-b1 mentioned this issue Jul 18, 2015

PERF: Improve perf of to_datetime with ISO format #10615

Merged

jreback closed this as completed Jul 20, 2015
