pd.to_datetime much slower with supplied format than when format is inferred #10178

Closed
dsimmie opened this issue May 20, 2015 · 11 comments
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance

@dsimmie

dsimmie commented May 20, 2015

Converting a column of date strings is much slower when a date format is supplied than when the format is inferred. I would have thought there would be less work to do when the format is known (and supplied).

To test

import pandas as pd
from pandas import DataFrame

df = DataFrame({'date_text': ["2015-05-18" for i in range(10**6)]})
%timeit pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
10 loops, best of 3: 115 ms per loop
#Top line from %prun of same command:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1       0.095    0.095    0.095  0.095   {pandas.tslib.array_to_datetime} 

%timeit pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
1 loops, best of 3: 2.27 s per loop
#Top line from %prun of same command:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1       2.282    2.282    2.282  2.282   {pandas.tslib.array_strptime}

This plot is taken from this S/O post which shows the difference over a larger range of sizes (and compared to other methods).

[Image: perf_comparison]

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.10
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.2.0-b1
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: 1.4.0
rpy2: None
sqlalchemy: 1.0.0
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented May 20, 2015

@dsimmie as I explained to you. This is as expected. It is doing regex-matching. You are welcome to have a look.

@jreback jreback added Datetime Datetime data dtype Performance Memory or execution speed performance labels May 20, 2015
@jreback jreback added this to the Someday milestone May 20, 2015
@jreback
Contributor

jreback commented May 20, 2015

See the code here: https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L277

There are fast-paths for ISO8601 strings and %Y%m%d. Others hit the regex engine.
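
As a rough illustration of why a fast path can be so much cheaper, here is a pure-Python sketch. It is not the pandas implementation: `parse_generic` and `parse_iso_date_fixed` are made-up names, and the standard library's `datetime.strptime` stands in for the regex-based path. A fixed-layout string can be read by position, while a general format has to be translated into a pattern and matched.

```python
from datetime import datetime


def parse_generic(text, fmt):
    # Generic path: strptime translates the format into a regular expression
    # and matches the input string against it.
    return datetime.strptime(text, fmt)


def parse_iso_date_fixed(text):
    # Hypothetical fast path: a YYYY-MM-DD string has a fixed layout, so the
    # fields can be read by position without consulting a format string.
    if len(text) != 10 or text[4] != "-" or text[7] != "-":
        raise ValueError("not a YYYY-MM-DD string: %r" % (text,))
    return datetime(int(text[0:4]), int(text[5:7]), int(text[8:10]))
```

Both functions return the same `datetime` for `"2015-05-18"`; the second simply does less work per string.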

@sinhrks
Member

sinhrks commented May 20, 2015

Maybe we should clarify that the format keyword is for flexibility rather than performance? I assume that not passing format (in which case dateutil is used in most cases) is faster than the regex path.

@jreback
Contributor

jreback commented May 20, 2015

No, dateutil is almost NEVER used, and it is MUCH slower; it's all in Python. When I say regex, I mean a specially constructed strptime-like object in Cython.

@jreback
Contributor

jreback commented May 20, 2015

you can certainly profile this if you'd like: https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L2257

@dsimmie
Author

dsimmie commented May 20, 2015

Hi Jeff. You said: "If you have repeated non-ISO dates it will help a lot. Since you have an ISO date it doesn't make much difference (as the parser is in c anyhow)". It does make a difference however and supplying the single unchanging format slows this operation significantly. The date format I put in, %Y-%m-%d, is a valid ISO8601 date if perhaps not a datetime. Seeing how '%Y%m%d' is given a fast-path perhaps that string could also get one.

https://github.com/pydata/pandas/blob/master/pandas/tseries/tools.py#L297

if format == '%Y%m%d' or format == '%Y-%m-%d':
    try:    
        result = _attempt_YYYYMMDD(arg, coerce=coerce)
    except:
        raise ValueError("cannot convert the input to '%Y%m%d' date format")

A change like that would necessitate a change to _attempt_YYYYMMDD that would involve splitting on the hyphen, and I don't know if that is agreeable. I don't really mind that inferring is quicker than being told, but it is certainly not obvious behaviour when you have supplied an ISO8601 date string to start with.

@jreback
Contributor

jreback commented May 20, 2015

@dsimmie I misspoke a bit - no caching with repeated dates (it could be done and I have seen it done, not sure of the utility; this is a cache not on the format but on the actual date values themselves), sort of a separate issue.
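
For reference, the kind of value cache mentioned here could look roughly like the sketch below. This is an illustration only, not something pandas does in this code path: `parse_with_value_cache` is a hypothetical helper, and `datetime.strptime` stands in for the real parser.

```python
from datetime import datetime


def parse_with_value_cache(strings, fmt):
    # Cache keyed on the date string itself (not on the format): each
    # distinct value is parsed once, and repeats become dictionary lookups.
    cache = {}
    out = []
    for s in strings:
        parsed = cache.get(s)
        if parsed is None:
            parsed = datetime.strptime(s, fmt)
            cache[s] = parsed
        out.append(parsed)
    return out
```

Whether this helps depends entirely on how many repeated values the input contains, which is why its utility is uncertain in general.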

the change you propose will be much slower (%Y%m%d parses quickly because it can be turned into an integer).

rather, you could map certain ISO8601 like formats to the generic format (which is fast pathed), e.g.

something like:

iso_formats = set(['%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d'])
if format in iso_formats:
    format = '%Y-%m-%d %H:%M:%S.%f'

This is what infer_datetime_format=True does after it guesses.
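
On the earlier point that %Y%m%d parses quickly because it can be turned into an integer, the arithmetic can be sketched as follows. This is only the idea, written in plain Python with a made-up function name (`yyyymmdd_to_date`); it does not reproduce `_attempt_YYYYMMDD`.

```python
from datetime import date


def yyyymmdd_to_date(value):
    # Interpret the whole string as one integer, e.g. "20150518" -> 20150518,
    # then peel off the fields with integer arithmetic.
    n = int(value)
    year, rest = divmod(n, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)
```

A hyphenated string such as "2015-05-18" cannot be handled this way without first splitting or stripping the separators, which is the extra work the proposed change would add.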

@dsimmie
Author

dsimmie commented May 20, 2015

OK thanks for the clarification and your time. It would be nice if that format string '%Y-%m-%d' was fast-pathed... agree my solution was naive, I haven't seen any of this code before and haven't read it in any detail yet.

@dsimmie dsimmie closed this as completed May 20, 2015
@jorisvandenbossche
Member

Update: this is now a bit repetitive, but I was already typing:

I think the point here is that there is a fastpath for ISO8601 formatted strings. With infer_datetime_format, if the format guessing returns %Y-%m-%d this fast path is used, but when giving this format manually, this fast path is not used.
For non-ISO strings, you won't see this difference between inferred and provided format I think.

So we could do the fast-path check after infer_datetime_format is handled, so that it applies to both an inferred format and a manually provided one.
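
The ordering suggested here can be sketched like this. Everything in the block is an assumption made for illustration: `ISO_LIKE_FORMATS`, `guess_format` and `choose_path` are hypothetical names, the inference is a toy that only recognises YYYY-MM-DD, and the returned labels just name which kind of parser would be picked.

```python
from datetime import datetime

# Assumption for this sketch: formats treated as ISO 8601-like.
ISO_LIKE_FORMATS = frozenset([
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S.%f",
])


def guess_format(sample):
    # Toy stand-in for format inference; it only recognises YYYY-MM-DD.
    try:
        datetime.strptime(sample, "%Y-%m-%d")
    except ValueError:
        return None
    return "%Y-%m-%d"


def choose_path(values, fmt=None, infer=False):
    # Step 1: resolve the format, whether it was supplied or inferred.
    if fmt is None and infer and values:
        fmt = guess_format(values[0])
    # Step 2: one fast-path check that covers both cases. With no format at
    # all, this sketch defaults to the ISO-aware parser.
    if fmt is None or fmt in ISO_LIKE_FORMATS:
        return "iso_parser"
    return "strptime_parser"
```

With this ordering, `choose_path(values, fmt="%Y-%m-%d")` and `choose_path(values, infer=True)` end up on the same path for ISO dates, which is the behaviour the issue asks for.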

@dsimmie reopening this, as this is a valid improvement I think.

@jreback
Contributor

jreback commented May 20, 2015

agreed, this is a valid issue. (and the fix is pretty straightforward as I describe above)

@jreback
Contributor

jreback commented Jul 20, 2015

closed by #10615

@jreback jreback closed this as completed Jul 20, 2015

4 participants