
Infer datetime format #6021

Merged: 2 commits into pandas-dev:master, Jan 24, 2014

Conversation

@danbirken
Contributor

closes #5490

Basically this attempts to figure out the datetime format if given a list of datetime strings. If the format can be guessed successfully, then to_datetime() attempts to use a faster way of processing the datetime strings.

Here is a gist that shows some timings of this function: https://gist.github.com/danbirken/8533199

If the format can be guessed, this generally speeds up the processing by about 10x. When the guessed format turns out to be an iso8601 string (or a subset of one), the format is not used, because the existing iso8601 fast-path is much faster still (about 30-40x faster).

I took a pretty basic approach to guessing the format, using only dateutil.parse, a simple regex and some simple logic. This doesn't use any private methods of dateutil. It certainly can be improved, but I wanted to get something started before going crazy handling more advanced cases.
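For illustration, a minimal sketch of the intended usage (the keyword name infer_datetime_format is the one the thread eventually settles on; everything else here is assumed):

    import pandas as pd

    # A consistently-formatted, non-ISO column: the format is guessed once
    # from the data and then reused for every element.
    dates = ['30/12/2011 00:00:00', '31/12/2011 00:00:00'] * 50000
    parsed = pd.to_datetime(dates, infer_datetime_format=True)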

@jreback
Contributor

jreback commented Jan 21, 2014

what happens if there are mixed formats in a list of strings?

does it guess based on the 1st? (eg how do u sample what to guess)

I suppose that some formats, if they are mixed, will cause the reader to barf - so maybe have some tests

and prob just need to document this
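As the answers below confirm, a mixed-format input makes the guessed format fail partway through, at which point everything falls back to the ordinary element-wise parse; a sketch of that behavior (keyword name as finally merged):

    import pandas as pd

    # The guess is based on the first element ('%Y-%m-%d' here); the second
    # string doesn't match it, so pandas silently falls back to the slow
    # path and the results are identical, just without the speed-up.
    pd.to_datetime(['2011-12-30', '30/12/2011'], infer_datetime_format=True)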

@ghost

ghost commented Jan 21, 2014

Looking good: unintrusive, covers the major use case and has a good bang/buck ratio.

Notes:

  • avoid different names for the same keyword arg in different places.
  • I agree with basics first; the case where the first line contains a nan for the date is important
    enough to handle, though.
  • The iso8601 fallback is a nice touch; are you sure there are no other major variations?
  • as jreback asked, sane behavior when the consistent format assumption is violated is important
    and deserves a test and some description.
  • Please add a test for a series that contains nan, even if you're reusing existing code paths.

As far as generality goes, I think this hits the balance nearly perfectly. Marking it for 0.14.

a future iteration could tackle the case of multiple date formats in a column:

  • exploiting locality (runs of the same format)
  • or an optimistic cache of seen-so-far format-strings prioritized by hit-counts.

Those are just for fun though, this already delivers what we really care about.

👍

@ghost ghost mentioned this pull request Jan 21, 2014
@ghost

ghost commented Jan 21, 2014

two more:

  • there's another fast path @jreback added in #4826 (PERF: much faster to_datetime performance with a format of '%Y%m%d').

  • The guessing code refines my quick snippet which took a shortcut by not handling
    something like 2011-01-1 00:00:00, where the repeated values defy naive guessing.
    Unfortunately, round times are fairly common and dateutil fills in such a default for missing fields:

    In [3]: parse('2011-02-01')
    Out[3]: datetime.datetime(2011, 2, 1, 0, 0)
    

    both will flubber the hit rate in real usage. Ideas? Perhaps putting sentinel values
    in the defaults used by parse() can disambiguate.
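A quick demonstration of why round times are a problem (dateutil's real parse(); the reading of the output is mine):

    from dateutil.parser import parse

    # Missing fields are filled from a default datetime, so a genuine
    # '00:00:00' is indistinguishable from filled-in defaults: a naive
    # guesser can't tell which '00' token produced which field.
    parse('2011-01-01 00:00:00')  # datetime.datetime(2011, 1, 1, 0, 0)
    parse('2011-01-01')           # also datetime.datetime(2011, 1, 1, 0, 0)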

@ghost

ghost commented Jan 21, 2014

One more:
Using a lexer different from dateutil's has a high prob. of incompatible behavior in some cases.
If you prefer not to call a _foo method in dateutil inside a try/except, dateutil is BSD-licensed
and you can just copy the lexing function and throw a license in LICENSES/.

@danbirken
Contributor Author

Alright, updated version:

  • Switched to _timelex, which allows dates like "30/Dec/2011" to be parsed. If you want, I can be extremely cautious and put the _timelex import in a special section so pandas will survive if that breaks in a future version of dateutil.
  • Added in logic so it will use the dayfirst parameter if you pass that to to_datetime to help infer the right format
  • Changed datetime_format back to format. I wanted to switch away from shadowing a python built-in name, but at this point it really is too late for that, and I see it just adds confusion.
  • Yes, iso8601 (and substrings thereof) are the only fast-path variations -- this happens in array_to_datetime
  • This doesn't interfere with the %Y%m%d fast-path, but it doesn't help that path either, because _guess_datetime_format() can't guess that format as of right now.
  • Changed the logic to base the format on the first non-nan string in the array (using arr[com.notnull(arr).nonzero()[0][0]], which seems right to me, though I can't claim to be confident in it; it seems to work and it is fast).
  • Added a bunch of tests for: dealing with nans, dealing with strings with various formats (which actually was broken before), more strings.

Oddities:

  • _timelex breaks in dateutil 2.2 unless you do the StringIO(str(datetime_string)) thing
  • Any of the tests for parsing month names can fail when the locale is set (because "%b" might be "Dec" in one locale and "Dic" in another), so I made those tests skip if the locale is set (illustrated just below). I think if you have a non-US locale and your datetime strings are of the format "12/Dic/2011", then you won't get the speed-up because dateutil won't properly parse that.
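A small standard-library illustration of the locale point (the Spanish locale name is an assumption about what's installed):

    import datetime
    import locale

    # %b abbreviations are locale-dependent ('Dec' vs 'Dic'), so a format
    # guessed under one locale may not round-trip under another.
    datetime.datetime(2011, 12, 30).strftime('%d/%b/%Y')  # '30/Dec/2011' under C/en_US
    # locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')     # then: '30/dic/2011'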

Other things:

  • Dateutil does fill in some defaults, but I don't think this is that big of a problem. In any situation where the guessed format proves wrong in practice, all results are thrown away and it just falls back to the default (which will be 10% slower or so, but the results will be the same). Also, with dayfirst you do have a little control over helping it infer the right thing (see the sketch just after this list).
  • This doesn't support timezones right now because _timelex doesn't parse them out properly. I think it is a relatively easy fix (good for a future change), but it will require some additional filtering of the _timelex result, which I didn't want to add here (further complicating an already complicated change).
  • As of right now, this inferring feature is set to False for both to_datetime and read_csv, which is certainly the safest thing. Maybe it should be set to True at some point; I really don't know how you evaluate the risk/reward for such a thing. In theory it should almost never be slower, often be significantly faster, and if it produces different values than the default, then your input was technically ambiguous to begin with.
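The dayfirst sketch referenced above (pandas' real keyword; the dates are illustrative):

    import pandas as pd

    # '01/02/2011' is ambiguous; dayfirst steers both the parse and the
    # inferred format ('%d/%m/%Y' instead of '%m/%d/%Y').
    pd.to_datetime(['01/02/2011'], dayfirst=True)   # 2011-02-01
    pd.to_datetime(['01/02/2011'])                  # 2011-01-02 (default)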

@jreback
Contributor

jreback commented Jan 22, 2014

instead of adding infer_datetime_format to read_csv

can u just intercept parse_dates='infer'

and if it's a dict maybe it could work for individual fields (if parse_dates is a list it might be trickier)

I am normally against overloading too much
but in this case it seems natural

@ghost

ghost commented Jan 22, 2014

oh. oh nice. yeah. you're against overloading. sure.

hehe :)

@jreback
Contributor

jreback commented Jan 22, 2014

closet overloader :)

@ghost

ghost commented Jan 22, 2014

I was thinking "filthy liar" but close enough :).

@danbirken

  • fine on punting for locale/tz
  • yes, if we're relying on the dateutil lexer the whole infer machinery should silently fall back
    to the old path. I'm assuming the tests will let us know if dateutil makes a breaking change.
  • Since the %Y%m%d fastpath makes such a difference, I think testing specifically for it
    if the parse() roundtrip doesn't match would be good.
  • I disagree on the dateutil defaults not mattering, since we're inferring the format string
    from a single data point. That will appear as widely divergent perf for basically identical datasets if one
    of them happens to have a bad datum as the first. It's not a blocker, but 0.14 will be in dev for
    a few months, it would be nice to remove that limitation.

On risk/defaults, I suggest we ship this as off by default in 0.14, give it a prominent place
in the ANN, and if it proves stable we can make it the default in a future release.

I'd like to avoid making non-battle-tested code the default for everyone's production
code, even if it's a big perf win. @jreback?

@jreback
Contributor

jreback commented Jan 22, 2014

👍 on 0.14.0 (though it could go in 0.13.1 if it's turned off by default)

@ghost

ghost commented Jan 22, 2014

You're totally right. @danbirken if you can sort out the remaining issues, let's just ship it next
week in 0.13.1. The only blocker from my notes is the handling of _timelex import failures; the rest is just nice to have.

@jreback
Contributor

jreback commented Jan 22, 2014

@danbirken

why don't you just have an import flag in tseries/tools.py, something like `_DATEUTIL_LEXER`

and set it if the function exists

we do this when we want to know at run-time if a module exists, but no reason why you can't do this
for a function (see pandas.computation.expressions.py for the way numexpr is handled)

then your guesser can decide what to do based on the availability of the dateutil function?
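A sketch of that pattern (only the _DATEUTIL_LEXER-style flag comes from this thread; the rest is my assumption about how it might look):

    try:
        # private dateutil API: may disappear in a future release
        from dateutil.parser import _timelex
        _DATEUTIL_LEXER_SPLIT = _timelex.split
    except (ImportError, AttributeError):
        _DATEUTIL_LEXER_SPLIT = None

    def _guess_datetime_format(dt_str):
        """Return a guessed format string, or None to use the slow path."""
        if _DATEUTIL_LEXER_SPLIT is None:
            return None  # lexer unavailable: inference silently disabled
        tokens = list(_DATEUTIL_LEXER_SPLIT(dt_str))
        # ... token/attribute matching would go here ...
        return None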


@danbirken
Contributor Author

  • If _timelex or _timelex.split doesn't exist, pandas will work fine except for this feature, which will just do nothing.
  • I added %Y%m%d support to _guess_datetime_format. Unlike the iso-8601 fast-path, this particular fast-path is actually hard to opt into (I don't think you can do it from read_csv), but now infer_datetime_format will do it for you if enabled. This is about a ~20-30x speedup for those cases.
  • As for overloading parse_dates, I think adding a new field is better for 2 reasons.
  1. parse_dates already supports a wide variety of input formats, which makes squeezing in something else more complicated.
  2. Since this is theoretically going from not-enabled to enabled by default, having it be a separate field is really nice because we can flip the one boolean value and that is that. If it were in parse_dates, then to flip it we would probably have to add in something like parse_dates='not_infer' (and keep supporting parse_dates='infer') for the people who explicitly want to opt-out for whatever reason, which would be really confusing.
  • As for the sentinel values, I thought a lot about this. We can use dateutil.parser.DEFAULTPARSER._parse, which gives the raw values without any defaults (I tried a bunch of ways to trick dateutil.parser.parse into doing it for me, and couldn't find one). However, that means using another private method and essentially repeating existing/tested/good code in dateutil.parser.parse to get the result into a proper datetime. So I considered doing this, but as it turns out, the fact that datetime.datetime puts 0 in as default values is fine. 0 is a nice sentinel value, because 0 is invalid for both month and day, the only two fields it could possibly be confused with (as of right now, this function only supports 4-digit years). So the only potential failure mode is that a datetime like "2011/01/01 00:00" will default to "%Y/%m/%d %H:%M". This is ambiguous, as the time-string could be referring to %H:%M or %M:%S (but it seems incredibly likely it is %H:%M, which is the default guess). It is impossible for something like "2011/01/01" to be mis-guessed as "%Y/%M/%S", because the minute and second come back as the sentinel 0, which a real month or day token can never match, so those fields are immediately ignored.

However, the situation isn't perfect. It will still mess up cases a human wouldn't:

In [4]: tools._guess_datetime_format('01:01 2011/01/01')
Out[4]: '%m:%d %Y/%H/%M'  # wrong!

In [6]: tools._guess_datetime_format('00:00 2011/01/01')
Out[6]: '%H:%M %Y/%m/%d'  # right!

But sentinel values don't actually improve this case; it is just a limitation of the current guessing method. However, this is a pretty rare edge case, as pretty much every standard datetime format puts the Y-m-d information first, which is what the guesser expects.

So in conclusion, I think the sentinel values of 0 are actually perfectly good and I can't think of any case where they cause the guesser to do the wrong thing.
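A toy version of the matching idea (my own illustration, not the PR's code): each numeric token claims the first parsed attribute with the same value, and the 0 defaults for hour/minute/second can never collide with a real month or day.

    import re
    from dateutil.parser import parse

    def toy_guess(dt_str):
        dt = parse(dt_str)  # missing fields default to 0 (the sentinel)
        attrs = [('%Y', dt.year), ('%m', dt.month), ('%d', dt.day),
                 ('%H', dt.hour), ('%M', dt.minute), ('%S', dt.second)]
        out = dt_str
        for tok in re.findall(r'\d+', dt_str):
            for i, (directive, value) in enumerate(attrs):
                if int(tok) == value:
                    out = out.replace(tok, directive, 1)
                    del attrs[i]  # each field can be claimed only once
                    break
        return out

    toy_guess('2011/01/01')        # '%Y/%m/%d' -- the zero defaults never match
    toy_guess('01:01 2011/01/01')  # '%m:%d %Y/%H/%M' -- the wrong case above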


New question:

Assuming everybody is content with adding the infer_datetime_format keyword to read_csv, should I also add this to Series.from_csv and DataFrame.from_csv?

@jreback
Contributor

jreback commented Jan 23, 2014

you can add it to the from_csv methods, though they apparently have parse_dates=True as a default IIRC (and there is an open issue to eventually deprecate these), but until then I guess they should be kept in sync.

separate issue:

if you use the separate option (ok by me then :)) in theory it should be able to take a list of columns, because you might want to only infer on certain columns (annoying but prob needed)

@jreback
Contributor

jreback commented Jan 23, 2014

@danbirken you'll also need an example in v0.13.1.txt and in io.rst (you can use the same example in both, prob taken from a test you already have anyhow)

@ghost

ghost commented Jan 23, 2014

  • If dateutil breaks we can't detect the regression currently, please add a vbench.
  • about overloading parse_dates, you're right that adding it there would make some usage impossible.
    It could be another possible value for date_parser ("infer"), but read_csv is such a monster already that, meh, whatever. Keep the kwarg, however...
  • The new kwarg is still named differently in to_csv vs to_datetime.
  • Please always add new keywords at the tail position of the signature (right before the catch-all **kwds if
    there is one). Yes, it's grouped logically now; no, no-one really uses positional args when there are 30 kwargs.
    This is the general pattern we follow, and we stick to it.

I'm convinced by your points on the sentinels idea.

…das-dev#5490

Given an array of strings that represent datetimes, infer_format=True
will attempt to guess the format of the datetimes, and if it can infer
the format, it will use a faster function to convert/import the
datetimes.  In cases where this speed-up can be used, the function
should be about 10x faster.
@danbirken
Contributor Author

Added a vbench, fixed the kwarg ordering, made the keyword consistent everywhere (infer_datetime_format), added documentation to both io.rst and 13.1.txt, and added the kwarg to the from_csv methods.

Here is a link to just the changes from this round (the "official" pull request has all of these squashed together, which makes it harder to review; hopefully this helps): danbirken/pandas@731120f...6ed08d1

I admit I am saying this partially out of laziness, but making infer_datetime_format support all of the variations that parse_dates supports really is a headache. It opens the possibility of inconsistent/weird input (parse_dates=[[1,4],], infer_datetime_format=[1,]) that would have to be checked somehow. Additionally, I think this is such a rare case that if inferring the datetime format is bad for some, but not all, of your date columns, disabling it for everything is an appropriate solution. Since this is purely about performance, those rare people will not get the optimization, but it won't change their data or cause them harm in any other way. And finally, if there is a large clamoring for it in the future, it can always be added without any backward-compatibility problems.
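For reference, the boolean usage being argued for (real read_csv keywords as of this change):

    import pandas as pd
    from io import StringIO

    data = 'date,value\n30/12/2011 00:00:00,1\n31/12/2011 00:00:00,2\n'
    # One flag for the whole read; per-column control is deliberately punted on.
    df = pd.read_csv(StringIO(data), parse_dates=['date'],
                     infer_datetime_format=True)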

@jreback
Contributor

jreback commented Jan 24, 2014

@danbirken agree with your point about infer_datetime_format just being a boolean

looks good

would add the YYYYMMDD format to the common parsed format section

can u post a run of the new vbenches?

otherwise looks good!

This allows read_csv() to attempt to infer the datetime format for any
columns where parse_dates is enabled.  In cases where the datetime
format can be inferred, this should speed up processing datetimes
by ~10x.

Additionally add documentation and benchmarks for read_csv().
@danbirken
Contributor Author

Updated docs.

Here are the vbench outputs from my machine:

With infer_datetime_format=True:

---------------------------------------------------------
Test name                                    |    #0    |
---------------------------------------------------------
read_csv_infer_datetime_format_custom        |  13.5159 |
read_csv_infer_datetime_format_iso8601       |   2.0467 |
read_csv_infer_datetime_format_ymd           |   2.4966 |
---------------------------------------------------------

When I manually set them to False (I don't know another way to do this;
otherwise I get NaNs because the keyword doesn't exist):

---------------------------------------------------------
Test name                                    |    #0    |
---------------------------------------------------------
read_csv_infer_datetime_format_custom        | 130.3443 |
read_csv_infer_datetime_format_iso8601       |   1.4033 |
read_csv_infer_datetime_format_ymd           |  62.3440 |
---------------------------------------------------------

The iso8601 case is a little slower, which makes sense: you pay the extra time of guessing the format without getting any speedup from it. I don't know what these time units are, but in an absolute real-world sense I don't think the difference is a big deal. Additionally, the bigger your input, the less this difference will be felt.

.. ipython:: python

# Try to infer the format for the index column
df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
Member (inline review comment on the docs example)

foo.csv is removed earlier in the document (under 'Specifying date columns'), so you will have to move that removal below this example.

@ghost

ghost commented Jan 24, 2014

Let's call it after the doc tweaks; it's plenty good enough to merge.

@jreback
Contributor

jreback commented Jan 24, 2014

@y-p merge away

@jreback
Contributor

jreback commented Jan 24, 2014

this is going to close #5490.
@danbirken (you had put "related"... any reason not to close that issue?)

jreback added a commit that referenced this pull request Jan 24, 2014
jreback merged commit 4bfb38b into pandas-dev:master on Jan 24, 2014
@jreback
Contributor

jreback commented Jan 24, 2014

@danbirken thanks for this ! awesome!!!!

pls check docs and behavior.... if a small followup is needed, it can easily go in 0.13.1!

@danbirken
Contributor Author

I'm packing for a flight right now, but I'll get to those doc fixes when I'm on the plane in a few hours. I misunderstood the formatting and that was a case of copy/pasting gone wrong!

And yes, this should close that issue.

@jreback
Contributor

jreback commented Jan 24, 2014

@danbirken gr8! doc updates can come anytime....

thanks again!

@ghost

ghost commented Jan 24, 2014

I'll push a release note thanking @lexual and @danbirken

@jreback
Contributor

jreback commented Jan 24, 2014

sure

Closes #5490: PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse