Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datetime optimization #9594

Closed
charles-cooper opened this issue Mar 5, 2015 · 6 comments
Closed

datetime optimization #9594

charles-cooper opened this issue Mar 5, 2015 · 6 comments
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance

Comments

@charles-cooper
Copy link

Hi,
I noticed that datetime parsing for large sql/csv tables is really slow. Would it be acceptable to use a technique where repeated calculations are cached? For example, instead of:

def parse_date(date_str) :
    return datetime.datetime.strptime(date_str,FMT)
def parse_date_col(str_col) :
    return [parse_date(date_str) for date_str in str_col]

use

def parse_date(date_str) :
    return datetime.datetime.strptime(date_str,FMT)
def parse_date_col(str_col) :
    cache = dict()
    for date_str in str_col :
        if date_str not in cache :
            cache[date_str] = parse_date(date_str)
    return [cache[date_str] for date_str in str_col]

The reason this works is that string hashing / comparison / dictionary insertion is much much faster than strptime.

For tables where dates are repeated many times this can result in orders of magnitude speedup.

Thanks
Charles

@azjps
Copy link

azjps commented Mar 5, 2015

For reading csvs with pandas.read_csv there's a date_parser argument. If you know a priori that your datetime strings contain lots of duplicates are of a certain standard format, you can pass your own custom date_parser function which checks cached values of strptime. I doubt its worth doing dictionary lookups in the general case (especially when reading time series, where you might have many distinct datetimes being parsed).

@jorisvandenbossche
Copy link
Member

@charles-cooper Can you give a reproducible example? And is this approach still faster when using pd.to_datetime for string parsing instead of datetime.datetime.strptime?

@azjps
Copy link

azjps commented Mar 5, 2015

Quick example now that I have a python terminal in front of me:

class memoize:
  def __init__(self, function):
    self.function = function
    self.memoized = {}

  def __call__(self, *args):
    try:
      return self.memoized[args]
    except KeyError:
      self.memoized[args] = self.function(*args)
      return self.memoized[args]

def to_datetime(x):
  return pd.to_datetime(x, '%Y/%m/%d')

def time_read_csv(date_parser, num_rows=50000):
  s = StringIO.StringIO()
  pd.DataFrame([{'d': '2014/01/01', 'a': 0}] * num_rows).to_csv(s, index=False)
  s.seek(0)
  return pd.read_csv(s, parse_dates=['d'], date_parser=date_parser)
In [83]: %timeit time_read_csv(to_datetime)
1 loops, best of 3: 4.59 s per loop

In [84]: %timeit time_read_csv(memoize(to_datetime))
10 loops, best of 3: 185 ms per loop

Looks like pandas read_sql doesn't accept arbitrary date parsers, but you could read the date column as strings and apply a memoized date parser after the initial read.

@jreback
Copy link
Contributor

jreback commented Mar 5, 2015

there was a whole discussion of this in #9377 so the simple heuristic is this:
I'll reproduce.

if you have multiple columns that need parsing, use parse_dates=[[....]].
try to infer the format read_csv(..., infer_datetime_format=True)
if you have a format, the use date_parser=lambda x: pd.to_datetime(x, format=.....)
if you have a really non-standard format, finally use date_parser=.....
so a naked date_parser is ALWAYS the last resort (as unless it can handle a vectorized input, its in python space).

With the addendum if you have a format that infer_datetime_format cannot readily infer then you should simply read the column is as object then use pd.to_datetime(...format=....) after. This will be the most performant.

Trying to memoize is fine, but not necessary as that is what infer_datetime_format already does internally.

There was some talk of adding this to the docs (in io.rst/read_csv section). Anyone interested?

@jreback jreback closed this as completed Mar 5, 2015
@jreback jreback added Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Mar 5, 2015
@charles-cooper
Copy link
Author

Thanks for the feedback all,
@jreback I see now that using parse_dates in read_csv results in memoization. What about in read_sql?

Charles

@GeorgieBanks
Copy link

turns out infer_datetime_format memoizes date formats and not date strings as @jreback claimed, so #11665 is just a dupe of @charles-cooper 's "not necessary" idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

5 participants