
PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse #5490

Closed
lexual opened this issue Nov 11, 2013 · 14 comments · Fixed by #6021

Comments

@lexual
Contributor

commented Nov 11, 2013

I had a series containing strings like these:

"November 1, 2013"

Series length was about 500,000

A) Running pd.to_datetime(s) takes just over a minute.
B) Running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!
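For reference, here is a minimal way to reproduce the comparison (the series construction is mine; exact timings will vary by machine):

import pandas as pd

# ~500,000 identically formatted date strings, as in the report above.
s = pd.Series(["November 1, 2013"] * 500000)

# Case A: every element goes through dateutil's general-purpose parser.
slow = pd.to_datetime(s)                      # roughly a minute

# Case B: an explicit format takes the much faster strptime path.
fast = pd.to_datetime(s, format="%B %d, %Y")  # roughly 7 seconds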

My suggestion is to make case A (where the user doesn't specify the format) take about as long as case B (where the user does).

Basically it looks like the code always uses the dateutil parser in case A.

My suggestion is based on the idea that the date strings are very likely all in one consistent format (it's highly unlikely they would be in 500K different formats!).

In a nutshell:

  • figure out the date format of the first entry.
  • try that format against the entire series, using the speedy code in tslib.array_strptime.
  • if that works, we've saved heaps of time; if not, fall back to the current slower behaviour of using dateutil parse on each entry.

Here's some pseudo-code:

datestr1 = s[0]
# I'm assuming dateutil has something like this, that can tell you what the
# format is for a given date string.
date_format = figure_out_datetime_format(datestr1)

try:
    # Use the super-speedy code that pandas uses when you tell it the format.
    dt_series = tslib.array_strptime(s, format=date_format, ...)
except ValueError:
    # Date strings aren't consistent after all. Do it the old slow way.
    dt_series = tslib.array_to_datetime(s, format=None)

return dt_series

@jreback

Contributor

commented Nov 11, 2013

you could allow format='infer' to support this, keeping in mind that dayfirst/yearfirst will then matter (and the format could potentially be interpreted ambiguously)
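To make the ambiguity concrete (example mine): the same string parses to two different dates depending on dayfirst, so any inferred format has to honour those flags.

import dateutil.parser

# "01/02/2013" is January 2nd by default, but February 1st with dayfirst=True;
# an inferred format string must agree with whichever reading was intended.
print(dateutil.parser.parse("01/02/2013"))                 # 2013-01-02 00:00:00
print(dateutil.parser.parse("01/02/2013", dayfirst=True))  # 2013-02-01 00:00:00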

@danbirken

Contributor

commented Jan 10, 2014

This is a really smart idea, and I think it would be a huge speed-up for a common use case. +1

@cancan101

Contributor

commented Jan 10, 2014

You do want to deal with the first entry being NaT / None.
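A sketch of guarding against that (the helper name is mine): infer from the first non-null entry rather than blindly from s[0].

import pandas as pd

def first_non_null(values):
    # Return the first entry usable for format inference, or None.
    for v in values:
        if not pd.isnull(v):
            return v
    return None

print(first_non_null([None, float("nan"), "November 1, 2013"]))  # November 1, 2013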

@lexual

Contributor Author

commented Jan 13, 2014

Perhaps the way to do it would be a "format" parameter (better name suggestions welcome) that is set to either "heterogeneous" or "homogeneous" (i.e. different date formats vs. all date formats the same).

If heterogeneous, the current behaviour occurs; if homogeneous, use the speedy algorithm from above, where once you infer the format of the first date you use it for all datetime entries.

@ghost


commented Jan 18, 2014

This is a promising idea, worth pursuing. So I did, in the context of pd.read_csv,
which helpfully supports a date_parser argument that lets you pass in a
tailored parser in the form of a call to strptime with a hardwired format string.

Disregarding the stealth (#5993) fast-path for iso8601 strings, measurements show
a potential 4x speedup for large data.

dateutil's parser is not amenable to producing a format string, as it consumes
and backtracks through the string, but it does provide a handy lexer function
(_timelex.split(tzstr)) for dt strings. By reverse-mapping the fields of the
returned datetime object onto the tokens, that lexer could be used to reconstruct
a format string, which can then speed up the rest of the process, at least for a
good-sized chunk of cases.
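To illustrate, the lexer alone yields tokens like these (output from my own run; _timelex is a private API and may change):

import dateutil.parser

# Tokens the parsed datetime's fields can be reverse-mapped onto:
tokens = dateutil.parser._timelex.split("2013-12-01T11:58:59")
print(tokens)  # ['2013', '-', '12', '-', '01', 'T', '11', ':', '58', ':', '59']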

PRs welcome.

import datetime as dt
import dateutil.parser
import pandas as pd

N = 10000
r = pd.date_range("2000-01-01", "2013-01-01", freq="H")[:N]
df = pd.DataFrame(range(N), index=r)
df.to_csv("/tmp/1.csv")

def f1(s):
    return dt.datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

def f2(s):
    return dateutil.parser.parse(s)

print "Measuring iso8601"
%timeit f1("2000-01-01 00:01:02")
%timeit f2("2000-01-01 00:01:02")
%timeit x1=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0)
%timeit x2=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f1)
%timeit x3=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f2)
#assert x1==x2==x3

# now try with non-iso8601
with open("/tmp/1.csv", "wb") as f:
    f.write("12/1/2013 00:01:02,1\n" * N)

def f1a(s):
    return dt.datetime.strptime(s, "%d/%m/%Y %H:%M:%S")

print "\nMeasuring an innovative dt format string"
%timeit f1a("12/1/2013 00:01:02")
%timeit f2("12/1/2013 00:01:02")
%timeit x1=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0)
%timeit x2=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f1a)
%timeit x3=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f2)
#assert x1==x2==x3

10000 loops, best of 3: 24.2 µs per loop # strptime is faster than
10000 loops, best of 3: 95.2 µs per loop # dateutil.parse by 4x
100 loops, best of 3: 14.1 ms per loop # but read_csv has a fastpath for iso8601
1 loops, best of 3: 259 ms per loop
1 loops, best of 3: 956 ms per loop

Measuring an innovative dt format string
10000 loops, best of 3: 24 µs per loop # in the general case
10000 loops, best of 3: 96.8 µs per loop # 4x
1 loops, best of 3: 1.1 s per loop # and read_csv falls back to dateutil.parse() (*)
1 loops, best of 3: 256 ms per loop # in which case strptime yields a nice 3-4x speedup.
1 loops, best of 3: 1.17 s per loop # (*)

@jreback

Contributor

commented Jan 18, 2014

FYI there is also a stealth fast-path for the format '%Y%m%d', e.g. '20130101', as it can be converted directly to an int64 and processed that way (there is a commit for it but I don't remember which)

In [4]: s = Series(date_range('20000101',periods=10000)).apply(lambda x: x.strftime('%Y%m%d'))

In [5]: %timeit pd.to_datetime(s)
1 loops, best of 3: 518 ms per loop

In [6]: %timeit pd.to_datetime(s,format='%Y%m%d')
100 loops, best of 3: 10.5 ms per loop
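To see why that path can be so fast: a '%Y%m%d' string decomposes with pure integer arithmetic, no parsing machinery at all. Roughly (an illustration of the idea, not the actual pandas code):

def ymd_from_int(v):
    # Split an integer like 20130101 into (year, month, day) arithmetically.
    year, rem = divmod(v, 10000)
    month, day = divmod(rem, 100)
    return year, month, day

print(ymd_from_int(int("20130101")))  # (2013, 1, 1)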
@ghost


commented Jan 18, 2014

Code reference? I can't see it in tslib.pyx, inference.pyx or datetime.pxd, nor can I grep the string anywhere.
Show me where and I'll update the docs again. Users can't optimize for fast-paths without knowing
what they are.

@danbirken

Contributor

commented Jan 18, 2014

My guess is that if you can get the datetime string format, you could write a heavily optimized C/Cython code path that would provide a healthy speed-up over the simple solution (similar to the iso8601 path). In fact this might already exist in tslib.array_strptime.

@ghost


commented Jan 18, 2014

@danbirken... you know what I'm about to say. :)

@ghost


commented Jan 18, 2014

I'll get the ball rolling. Let's start small.

def infer_dt_format(s):
    import dateutil.parser

    def maybe_int(v):
        try:
            return int(v)
        except ValueError:
            return v

    # strftime directives keyed by the datetime attribute they render
    dmap = {'day': '%d',
            'month': '%m',
            'year': '%Y',
            'hour': '%H',
            'minute': '%M',
            'second': '%S'}

    # lex the string into tokens, turning numeric tokens into ints
    tokens = [maybe_int(t) for t in dateutil.parser._timelex.split(s)]
    token_dict = dict((v, i) for i, v in enumerate(tokens))

    # reverse-map each field of the parsed datetime onto the token it came from
    parsed = dateutil.parser.parse(s)
    for k, fmt in dmap.items():
        val = getattr(parsed, k)
        pos = token_dict.get(val)
        if pos is not None:
            tokens[pos] = fmt

    # only trust the format if it round-trips back to the original string
    try:
        fmt = "".join(map(str, tokens))
        if parsed.strftime(fmt) == s:
            return fmt
    except (ValueError, TypeError):
        pass
    return None

s = "2013-12-01T11:58:59"
infer_dt_format(s)

Out[15]: '%Y-%m-%dT%H:%M:%S'
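And a sketch of how the inferred format could feed a fast path with a per-element fallback (the wrapper is mine, building on infer_dt_format above):

from datetime import datetime
import dateutil.parser

def parse_array(strings):
    # Try one inferred format across the whole array; fall back per element.
    fmt = infer_dt_format(strings[0])
    if fmt is not None:
        try:
            return [datetime.strptime(x, fmt) for x in strings]
        except ValueError:
            pass  # the strings turned out not to share one format
    return [dateutil.parser.parse(x) for x in strings]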

your move.

@jreback

Contributor

commented Jan 18, 2014

#4826

This applies to %Y%m%d for strings or int types

@danbirken

Contributor

commented Jan 18, 2014

I feel a little worried about depending on a non-public interface of dateutil. At the same time, getting the change into dateutil itself (where it theoretically belongs) seems pretty unsatisfactory as well.

Out of left field suggestion: What about a function that just tries 10 or 20 common datetime string formats, and then if one of them works it uses a fast code path to parse them all. If they all fail it just falls back to the slow method.
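That could look something like this (the candidate list and helper name are mine, and deliberately incomplete):

from datetime import datetime

COMMON_FORMATS = [
    "%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y%m%d",
    "%m/%d/%Y %H:%M:%S", "%m/%d/%Y", "%d/%m/%Y",
    "%B %d, %Y", "%d %B %Y",
]

def guess_common_format(sample):
    # Return the first common format the sample parses under, else None.
    # Order matters: earlier entries win for ambiguous strings.
    for fmt in COMMON_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None

print(guess_common_format("November 1, 2013"))  # %B %d, %Y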

@ghost


commented Jan 18, 2014

Re: the non-public interface, very true, but we can always fall back if something raises, or
duplicate the lexer code inline; it's not that much code.

For now, assume that you have a function that reliably returns a format string or None.
The specifics we can sort out later and expand over time.

@danbirken

Contributor

commented Jan 19, 2014

Well I'm pretty familiar with this section of the code, so I'll take a stab at this on Monday provided nobody else does it between now and then :)

@ghost ghost referenced this issue Jan 20, 2014
danbirken added a commit to danbirken/pandas that referenced this issue Jan 24, 2014
PERF: Speed up pd.to_datetime() by optionally inferring dt format pandas-dev#5490

Given an array of strings that represent datetimes, infer_format=True
will attempt to guess the format of the datetimes, and if it can infer
the format, it will use a faster function to convert/import the
datetimes.  In cases where this speed-up can be used, the function
should be about 10x faster.
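As it eventually shipped in pandas, this is exposed as the infer_datetime_format keyword (later deprecated once the inference became default behaviour):

import pandas as pd

s = pd.Series(["November 1, 2013"] * 1000)

# Samples the data, guesses "%B %d, %Y" once, and reuses it for the whole
# array instead of calling dateutil.parser.parse on every element.
result = pd.to_datetime(s, infer_datetime_format=True)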
danbirken added a commit to danbirken/pandas that referenced this issue Jan 24, 2014
PERF: Add infer_datetime_format to read_csv() pandas-dev#5490
This allows read_csv() to attempt to infer the datetime format for any
columns where parse_dates is enabled.  In cases where the datetime
format can be inferred, this should speed up processing datetimes
by ~10x.

Additionally add documentation and benchmarks for read_csv().
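The read_csv() side uses the same keyword (again as it later shipped; deprecated in modern pandas, where the inference is always on):

import pandas as pd

# Infer one format per parsed date column rather than parsing row by row.
df = pd.read_csv("/tmp/1.csv", parse_dates=True, index_col=0,
                 infer_datetime_format=True)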