
PERF: add datetime caching kw in to_datetime #11665

Closed
jreback opened this Issue Nov 20, 2015 · 18 comments

@jreback
Contributor

commented Nov 20, 2015

I'll propose a cache_datetime=False keyword as an addition to read_csv and pd.to_datetime.

this would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.

Care must be taken if a format kw is provided (in to_datetime, as the cache will have to be exposed). This would be optional (and default False), as I think if you have unique dates this could modestly slow things down (but can be revisited if needed).

This might also need to accept a list of column names (like parse_dates) to enable per-column caching (e.g. you might want to apply it to a column, but not the index, for example).

possibly we could overload parse_dates='cache' to mean this as well

trivial example

In [1]: pd.DataFrame({'A' : ['20130101 00:00:00']*10000}).to_csv('test.csv',index=True)

In [14]: def parser(x):
   ....:         uniques = pd.Series(pd.unique(x))
   ....:         d = pd.to_datetime(uniques)
   ....:         d.index = uniques
   ....:         return pd.Series(x).map(d).values
   ....: 
In [3]: df1 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'])

In [4]: df2 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)

In [17]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
1 loops, best of 3: 969 ms per loop

In [18]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
100 loops, best of 3: 5.31 ms per loop

In [7]: df1.equals(df2)
Out[7]: True
@stevenmanton


commented Nov 25, 2015

I recently stumbled across the exact same optimization. Converting 5 million strings to datetimes takes 2 seconds instead of 10 minutes:

pd.to_datetime(df['date'])  # takes 10 minutes
df['date'].map({k: pd.to_datetime(k) for k in df['date'].unique()})  # takes 2 seconds

It seems like the caching/optimization strategy should live with to_datetime (or _to_datetime) and not necessarily with read_csv, which calls to_datetime internally anyway. Perhaps to_datetime could internally use the above unique-then-map strategy when it detects an array as input. It might also be possible, though probably more complicated, to cache values in parse_datetime_string in tslib.pyx.

It even seems as though you might want this optimization on by default. The only real overhead is the call to unique, which will make things slower if there aren't many duplicate strings. However, when there are duplicate strings, which I imagine is a very common case, the optimized code is hundreds of times faster.
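The unique-then-map strategy described above can be sketched as a standalone helper (the name `to_datetime_cached` is illustrative, not a pandas API):

```python
import pandas as pd

def to_datetime_cached(s, **kwargs):
    """Parse each distinct string once, then broadcast the results back.

    Illustrative sketch of the strategy above; not part of pandas itself.
    """
    uniques = pd.Series(pd.unique(s))
    # One vectorized parse over the distinct values only...
    mapping = pd.Series(pd.to_datetime(uniques, **kwargs).values, index=uniques)
    # ...then a cheap hash lookup to expand back to full length.
    return s.map(mapping)
```

With many repeated dates, the parse cost then scales with the number of distinct values rather than the length of the input.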

@jreback

Contributor Author

commented Nov 25, 2015

@stevenmanton yes this could be implemented internally in to_datetime, but I think you need a keyword there to avoid recursively calling yourself (and to catch all of the cases), as well as to avoid doing this on a single timestamp conversion.

not saying this needs to be a kw to read_csv, as we already have enough!

@jreback jreback changed the title PERF: add datetime caching kw to read_csv/to_datetime PERF: add datetime caching kw in to_datetime Nov 25, 2015

@stevenmanton


commented Nov 25, 2015

Note that the above method is also about 10x faster than using infer_datetime_format=True:

pd.to_datetime(df['date'], infer_datetime_format=True)  # takes 25 seconds
@stevenmanton


commented Nov 25, 2015

How about a simple recursive wrapper?

def to_datetime(s, cache=True, **kwargs):
    if cache:
        # parse each unique string once, then map the results back onto the input
        return s.map({k: to_datetime(k, cache=False, **kwargs) for k in s.unique()})
    else:
        return pd.to_datetime(s, **kwargs)
@jreback

Contributor Author

commented Nov 25, 2015

you could do this inside to_datetime, which calls _to_datetime

@jreback

Contributor Author

commented Nov 25, 2015

note infer_datetime_format could ALSO be used with this (as this fixes the parsing issue when you have non-ISO separators).

@stevenmanton


commented Nov 25, 2015

It looks like some of the parser tools call _to_datetime so I wonder if the optimization would have to be inside that function, not just in to_datetime?

Not sure what you mean by using infer_datetime_format in conjunction with the optimization. You mean this?

def to_datetime(s, **kwargs):
    return s.map({k: pd.to_datetime(k, **kwargs) for k in s.unique()})

@jreback

Contributor Author

commented Nov 25, 2015

@stevenmanton it could be in either place

what I mean is you prob want cache=True as the default, and people can still pass infer_datetime_format if they want (to be honest this should prob default to True as well, but that's a separate issue), which helps when you have non-ISO formats but limited uniques; it's orthogonal to cache

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Jan 24, 2016

@jreback

Contributor Author

commented Jan 24, 2016

@stevenmanton want to do a PR for this?

@stevenmanton


commented Jan 30, 2016

@jreback I just got back from vacation, but I'll take a look and try to PR something.

@jreback

Contributor Author

commented Jan 30, 2016

great this would be a nice perf enhancement!

@stevenmanton


commented Feb 2, 2016

I took a stab at this, but after a couple of hours of looking through the code I'm not sure of the best way to proceed. The small changes I made are failing for a number of reasons. It seems like a lot of the complexity is due to the number of inputs and outputs (tuple, list, string, Series, Index, etc.) accepted by the to_datetime function.

Any thoughts on a better approach?

@jreback

Contributor Author

commented Feb 12, 2016

@stevenmanton the way to do this is to add a kw arg to to_datetime.

then, at the point where the conversion happens, use that routine I have above (but call _to_datetime(...)); only do this if the kw is True & it's list-like with a reasonable length, maybe > 100.
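A rough sketch of that wiring; `_to_datetime` here is a stand-in for pandas' internal routine, and the threshold value is illustrative:

```python
import pandas as pd

CACHE_MIN_LEN = 100  # below this, building the cache likely costs more than it saves

def _to_datetime(arg, **kwargs):
    # stand-in for pandas' internal single-pass conversion routine
    return pd.to_datetime(arg, **kwargs)

def to_datetime(arg, cache=True, **kwargs):
    """Illustrative wrapper: only cache for sufficiently long list-likes."""
    if cache and pd.api.types.is_list_like(arg) and len(arg) > CACHE_MIN_LEN:
        uniques = pd.Series(pd.unique(pd.Series(arg)))
        # convert the uniques once, then broadcast via a hash lookup
        mapping = pd.Series(_to_datetime(uniques, **kwargs).values, index=uniques)
        return pd.Series(arg).map(mapping)
    return _to_datetime(arg, **kwargs)
```

Scalars and short inputs fall straight through to the uncached path, which also avoids any recursion concern.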

@jreback jreback added the Prio-medium label Feb 12, 2016

@jreback jreback modified the milestones: 0.18.1, Next Major Release Feb 12, 2016

@GeorgieBanks


commented Mar 19, 2016

@charles-cooper suggested this feature 8 months earlier in #9594, but it was rejected by @jreback. Just to give credit where it's due.

I'm 👍 for this.

@jreback

Contributor Author

commented Mar 19, 2016

@GeorgieBanks

the suggested feature was not parsing the uniques in a vectorized way and broadcasting, but rather memoization of repeated inefficient calls

to be honest I don't care about the credit - I was quite clear on that other issue why it was closed

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 13, 2016

@jreback jreback added Prio-high and removed Prio-medium labels Mar 29, 2017

@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017

@DGrady

Contributor

commented May 22, 2017

Is there still interest in getting this optimization working? When it was originally reported, it looks like it was giving a 5x or better speedup. I'm testing again and seeing pretty consistently a 2x speedup at best, even for very large series. I am willing to tackle it but wanted to check on the status first, since it appears that it's going to make the to_datetime interface a bit more complicated and may give only a modest performance boost at this point.

% ipython
Python 3.6.1 |Anaconda custom (x86_64)| (default, May 11 2017, 13:04:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: def parser(x):
   ...:     uniques = pd.Series(pd.unique(x))
   ...:     d = pd.to_datetime(uniques)
   ...:     d.index = uniques
   ...:     return pd.Series(x).map(d).values
   ...:

In [3]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 4}).to_csv('test.csv', index=True)

In [4]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
10.1 ms ± 96.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
6.69 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 5}).to_csv('test.csv', index=True)

In [7]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
90.2 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
51 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 6}).to_csv('test.csv', index=True)

In [10]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
876 ms ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
508 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 7}).to_csv('test.csv', index=True)

In [13]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
/Users/dgrady/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:395: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
8.5 s ± 554 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
/Users/dgrady/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:395: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
4.46 s ± 56.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@jreback

Contributor Author

commented May 22, 2017

oh i think we should always do this
we could add an option to turn it off i suppose

basically if you have, say, 1000 elements or fewer you can skip it; otherwise it's always worth it (yes, there is a degenerate case with a really long all-unique series, but detecting that is often not worth it)

you can implement it, then we can test a couple of cases and see

fyi we do something similar in pandas.core.util.hashing with the categorize kw (it's simpler there to just hash the uniques), so a similar approach would be good
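For reference, the categorize trick mentioned above can be mirrored with factorize: operate on the uniques, then take the results back through the integer codes. A sketch (convert_via_uniques is not a pandas function, and missing values, which factorize codes as -1, would need extra handling):

```python
import pandas as pd

def convert_via_uniques(values, func):
    # Factorize into integer codes plus the distinct values, mirroring
    # the `categorize` path in pandas.core.util.hashing.
    codes, uniques = pd.factorize(values)
    # Run the expensive conversion once per distinct value only.
    converted = func(pd.Series(uniques)).values
    # Expand back to full length via the codes (assumes no missing values,
    # which factorize would encode as -1).
    return pd.Series(converted.take(codes))
```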

@DGrady

Contributor

commented May 23, 2017

Okay; will dive in to it tomorrow.
