datetime optimization #9594

charles-cooper · 2015-03-05T02:05:50Z

Hi,
I noticed that datetime parsing for large sql/csv tables is really slow. Would it be acceptable to use a technique where repeated calculations are cached? For example, instead of:

def parse_date(date_str) :
    return datetime.datetime.strptime(date_str,FMT)
def parse_date_col(str_col) :
    return [parse_date(date_str) for date_str in str_col]

use

def parse_date(date_str) :
    return datetime.datetime.strptime(date_str,FMT)
def parse_date_col(str_col) :
    cache = dict()
    for date_str in str_col :
        if date_str not in cache :
            cache[date_str] = parse_date(date_str)
    return [cache[date_str] for date_str in str_col]

The reason this works is that string hashing / comparison / dictionary insertion is much much faster than strptime.

For tables where dates are repeated many times this can result in orders of magnitude speedup.

Thanks
Charles

azjps · 2015-03-05T04:49:09Z

For reading csvs with pandas.read_csv there's a date_parser argument. If you know a priori that your datetime strings contain lots of duplicates are of a certain standard format, you can pass your own custom date_parser function which checks cached values of strptime. I doubt its worth doing dictionary lookups in the general case (especially when reading time series, where you might have many distinct datetimes being parsed).

jorisvandenbossche · 2015-03-05T09:02:15Z

@charles-cooper Can you give a reproducible example? And is this approach still faster when using pd.to_datetime for string parsing instead of datetime.datetime.strptime?

azjps · 2015-03-05T16:12:19Z

Quick example now that I have a python terminal in front of me:

class memoize:
  def __init__(self, function):
    self.function = function
    self.memoized = {}

  def __call__(self, *args):
    try:
      return self.memoized[args]
    except KeyError:
      self.memoized[args] = self.function(*args)
      return self.memoized[args]

def to_datetime(x):
  return pd.to_datetime(x, '%Y/%m/%d')

def time_read_csv(date_parser, num_rows=50000):
  s = StringIO.StringIO()
  pd.DataFrame([{'d': '2014/01/01', 'a': 0}] * num_rows).to_csv(s, index=False)
  s.seek(0)
  return pd.read_csv(s, parse_dates=['d'], date_parser=date_parser)

In [83]: %timeit time_read_csv(to_datetime)
1 loops, best of 3: 4.59 s per loop

In [84]: %timeit time_read_csv(memoize(to_datetime))
10 loops, best of 3: 185 ms per loop

Looks like pandas read_sql doesn't accept arbitrary date parsers, but you could read the date column as strings and apply a memoized date parser after the initial read.

jreback · 2015-03-05T17:24:36Z

there was a whole discussion of this in #9377 so the simple heuristic is this:
I'll reproduce.

if you have multiple columns that need parsing, use parse_dates=[[....]].
try to infer the format read_csv(..., infer_datetime_format=True)
if you have a format, the use date_parser=lambda x: pd.to_datetime(x, format=.....)
if you have a really non-standard format, finally use date_parser=.....
so a naked date_parser is ALWAYS the last resort (as unless it can handle a vectorized input, its in python space).

With the addendum if you have a format that infer_datetime_format cannot readily infer then you should simply read the column is as object then use pd.to_datetime(...format=....) after. This will be the most performant.

Trying to memoize is fine, but not necessary as that is what infer_datetime_format already does internally.

There was some talk of adding this to the docs (in io.rst/read_csv section). Anyone interested?

charles-cooper · 2015-03-05T18:54:17Z

Thanks for the feedback all,
@jreback I see now that using parse_dates in read_csv results in memoization. What about in read_sql?

Charles

GeorgieBanks · 2016-03-19T02:46:05Z

turns out infer_datetime_format memoizes date formats and not date strings as @jreback claimed, so #11665 is just a dupe of @charles-cooper 's "not necessary" idea.

jreback closed this as completed Mar 5, 2015

jreback added Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Mar 5, 2015

GeorgieBanks mentioned this issue Mar 19, 2016

PERF: add datetime caching kw in to_datetime #11665

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datetime optimization #9594

datetime optimization #9594

charles-cooper commented Mar 5, 2015

azjps commented Mar 5, 2015

jorisvandenbossche commented Mar 5, 2015

azjps commented Mar 5, 2015

jreback commented Mar 5, 2015

charles-cooper commented Mar 5, 2015

GeorgieBanks commented Mar 19, 2016

datetime optimization #9594

datetime optimization #9594

Comments

charles-cooper commented Mar 5, 2015

azjps commented Mar 5, 2015

jorisvandenbossche commented Mar 5, 2015

azjps commented Mar 5, 2015

jreback commented Mar 5, 2015

charles-cooper commented Mar 5, 2015

GeorgieBanks commented Mar 19, 2016