
PERF: add datetime caching kw in to_datetime #11665

Closed
jreback opened this Issue Nov 20, 2015 · 18 comments

@jreback
Contributor

commented Nov 20, 2015

I'll propose a cache_datetime=False keyword as an addition to read_csv and pd.to_datetime.

this would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.

Care must be taken if a format kw is provided (in to_datetime, as the cache will have to be exposed). This would be optional (and default False), as I think if you have unique dates this could modestly slow things down (but can be revisited if needed).

This might also need to accept a list of column names (like parse_dates) to enable per-column caching (e.g. you might want to apply it to a column, but not the index, for example).

possibly we could overload parse_dates='cache' to mean this as well

trivial example

In [1]: pd.DataFrame({'A' : ['20130101 00:00:00']*10000}).to_csv('test.csv',index=True)

In [14]: def parser(x):
   ....:         uniques = pd.Series(pd.unique(x))
   ....:         d = pd.to_datetime(uniques)
   ....:         d.index = uniques
   ....:         return pd.Series(x).map(d).values
   ....: 
In [3]: df1 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'])

In [4]: df2 = pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)

In [17]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'])
1 loops, best of 3: 969 ms per loop

In [18]: %timeit pd.read_csv('test.csv',index_col=0, parse_dates=['A'], date_parser=parser)
100 loops, best of 3: 5.31 ms per loop

In [7]: df1.equals(df2)
Out[7]: True
@stevenmanton


commented Nov 25, 2015

I recently stumbled across the exact same optimization. Converting 5 million strings to datetimes takes 2 seconds instead of 10 minutes:

pd.to_datetime(df['date'])  # takes 10 minutes
df['date'].map({k: pd.to_datetime(k) for k in df['date'].unique()})  # takes 2 seconds

It seems like the caching/optimization strategy should live with to_datetime (or _to_datetime) and not necessarily with read_csv, which calls to_datetime internally anyway. Perhaps to_datetime could internally use the above unique-then-map strategy when it detects an array as input. It might also be possible, though probably more complicated, to cache values in parse_datetime_string in tslib.pyx.

It even seems as though you might want this optimization on by default. The only real overhead is the call to unique, which will make things slower if there aren't many duplicate strings. However, when there are duplicate strings, which I imagine is a very common case, the optimized code is hundreds of times faster.
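The unique-then-map strategy described above can be sketched as a standalone helper (the name `to_datetime_cached` is illustrative, not a pandas API):

```python
import pandas as pd

def to_datetime_cached(s, **kwargs):
    """Parse each distinct string once, then broadcast the results back.

    Illustrative sketch of the strategy above; not part of pandas itself.
    """
    uniques = pd.Series(pd.unique(s))
    # One vectorized parse over the distinct values only...
    mapping = pd.Series(pd.to_datetime(uniques, **kwargs).values, index=uniques)
    # ...then a cheap hash lookup to expand back to full length.
    return s.map(mapping)
```

With many repeated dates, the parse cost then scales with the number of distinct values rather than the length of the input.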

@jreback

Contributor Author

commented Nov 25, 2015

@stevenmanton yes this could be implemented internally in to_datetime, but I think you need a keyword there to avoid recursively calling yourself (and to catch all of the cases), as well as to avoid doing this on a single timestamp conversion.

not saying this needs to be a kw to read_csv, as we already have enough!

@jreback jreback changed the title PERF: add datetime caching kw to read_csv/to_datetime PERF: add datetime caching kw in to_datetime Nov 25, 2015

@stevenmanton


commented Nov 25, 2015

Note that the above method is also about 10x faster than using infer_datetime_format=True:

pd.to_datetime(df['date'], infer_datetime_format=True)  # takes 25 seconds
@stevenmanton


commented Nov 25, 2015

How about a simple recursive wrapper?

def to_datetime(s, cache=True, **kwargs):
    if cache:
        # parse each unique string once, then map the results back onto the input
        return s.map({k: to_datetime(k, cache=False, **kwargs) for k in s.unique()})
    else:
        return pd.to_datetime(s, **kwargs)
@jreback

Contributor Author

commented Nov 25, 2015

you could do this inside to_datetime, which calls _to_datetime

@jreback

Contributor Author

commented Nov 25, 2015

note infer_datetime_format could ALSO be used with this (as this fixes the parsing issue when you have non-ISO separators).

@stevenmanton


commented Nov 25, 2015

It looks like some of the parser tools call _to_datetime so I wonder if the optimization would have to be inside that function, not just in to_datetime?

Not sure what you mean by using infer_datetime_format in conjunction with the optimization. You mean this?

def to_datetime(s, **kwargs):
    return s.map({k: pd.to_datetime(k, **kwargs) for k in s.unique()})

@jreback

Contributor Author

commented Nov 25, 2015

@stevenmanton it could be in either place

what I mean is you prob want cache=True as the default, and people can still pass infer_datetime_format if they want (to be honest this should prob default to True as well, but that's a separate issue), which helps when you have non-ISO formats but limited uniques; it's orthogonal to cache

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Jan 24, 2016

@jreback

Contributor Author

commented Jan 24, 2016

@stevenmanton want to do a PR for this?

@stevenmanton


commented Jan 30, 2016

@jreback I just got back from vacation, but I'll take a look and try to PR something.

@jreback

Contributor Author

commented Jan 30, 2016

great this would be a nice perf enhancement!

@stevenmanton


commented Feb 2, 2016

I took a stab at this, but after a couple of hours of looking through the code I'm not sure of the best way to proceed. The small changes I made are failing for a number of reasons. It seems like a lot of the complexity is due to the number of inputs and outputs (tuple, list, string, Series, Index, etc.) accepted by the to_datetime function.

Any thoughts on a better approach?

@jreback

Contributor Author

commented Feb 12, 2016

@stevenmanton the way to do this is to add a kw arg to to_datetime.

then, at the point where the conversion happens, use that routine I have above (but call _to_datetime(...)); only do this if the kw is True & it's list-like with a reasonable length, maybe > 100.
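A rough sketch of that wiring; `_to_datetime` here is a stand-in for pandas' internal routine, and the threshold value is illustrative:

```python
import pandas as pd

CACHE_MIN_LEN = 100  # below this, building the cache likely costs more than it saves

def _to_datetime(arg, **kwargs):
    # stand-in for pandas' internal single-pass conversion routine
    return pd.to_datetime(arg, **kwargs)

def to_datetime(arg, cache=True, **kwargs):
    """Illustrative wrapper: only cache for sufficiently long list-likes."""
    if cache and pd.api.types.is_list_like(arg) and len(arg) > CACHE_MIN_LEN:
        uniques = pd.Series(pd.unique(pd.Series(arg)))
        # convert the uniques once, then broadcast via a hash lookup
        mapping = pd.Series(_to_datetime(uniques, **kwargs).values, index=uniques)
        return pd.Series(arg).map(mapping)
    return _to_datetime(arg, **kwargs)
```

Scalars and short inputs fall straight through to the uncached path, which also avoids any recursion concern.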

@jreback jreback added the Prio-medium label Feb 12, 2016

@jreback jreback modified the milestones: 0.18.1, Next Major Release Feb 12, 2016

@GeorgieBanks


commented Mar 19, 2016

@charles-cooper suggested this feature 8 months earlier in #9594, but it was rejected by @jreback. Just to give credit where it's due.

I'm 👍 for this.

@jreback

Contributor Author

commented Mar 19, 2016

@GeorgieBanks

the suggested feature was not parsing the uniques in a vectorized way and broadcasting, but rather memoization of repeated inefficient calls

to be honest I don't care about the credit - I was quite clear on that other issue why it was closed

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 13, 2016

@jreback jreback added Prio-high and removed Prio-medium labels Mar 29, 2017

@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017

@DGrady

Contributor

commented May 22, 2017

Is there still interest in getting this optimization working? When it was originally reported, it looks like it was giving a 5x or better speedup. I'm testing again and seeing pretty consistently a 2x speedup at best, even for very large series. I am willing to tackle it but wanted to check on the status first, since it appears that it's going to make the to_datetime interface a bit more complicated and may give only a modest performance boost at this point.

% ipython
Python 3.6.1 |Anaconda custom (x86_64)| (default, May 11 2017, 13:04:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: def parser(x):
   ...:     uniques = pd.Series(pd.unique(x))
   ...:     d = pd.to_datetime(uniques)
   ...:     d.index = uniques
   ...:     return pd.Series(x).map(d).values
   ...:

In [3]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 4}).to_csv('test.csv', index=True)

In [4]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
10.1 ms ± 96.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
6.69 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 5}).to_csv('test.csv', index=True)

In [7]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
90.2 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
51 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 6}).to_csv('test.csv', index=True)

In [10]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
876 ms ± 6.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
508 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: pd.DataFrame({'A' : ['20130101 00:00:00'] * 10 ** 7}).to_csv('test.csv', index=True)

In [13]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'])
/Users/dgrady/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:395: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
8.5 s ± 554 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit pd.read_csv('test.csv', index_col=0, parse_dates=['A'], date_parser=parser)
/Users/dgrady/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:395: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)
4.46 s ± 56.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@jreback

Contributor Author

commented May 22, 2017

oh i think we should always do this
we could add an option to turn it off i suppose

basically if you have, say, 1000 elements or fewer you can skip it; otherwise it's always worth it (yes, there is a degenerate case with a really long all-unique series, but detecting that is often not worth it)

you can implement it, then we can test a couple of cases and see

fyi we do something similar in pandas.core.util.hashing with the categorize kw (it's simpler there to just hash the uniques), so a similar approach would be good
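For reference, the categorize trick mentioned above can be mirrored with factorize: operate on the uniques, then take the results back through the integer codes. A sketch (convert_via_uniques is not a pandas function, and missing values, which factorize codes as -1, would need extra handling):

```python
import pandas as pd

def convert_via_uniques(values, func):
    # Factorize into integer codes plus the distinct values, mirroring
    # the `categorize` path in pandas.core.util.hashing.
    codes, uniques = pd.factorize(values)
    # Run the expensive conversion once per distinct value only.
    converted = func(pd.Series(uniques)).values
    # Expand back to full length via the codes (assumes no missing values,
    # which factorize would encode as -1).
    return pd.Series(converted.take(codes))
```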

@DGrady

Contributor

commented May 23, 2017

Okay; will dive in to it tomorrow.
