
PERF: Parse certain dates in Cython instead of falling back to dateutil.parse #25922

Merged
merged 32 commits into pandas-dev:master on Apr 20, 2019

Conversation

@vnlitvinov (Contributor) commented Mar 29, 2019

This tremendously speeds up parsing of certain date patterns: MM/YYYY, MM/DD/YYYY and DD/MM/YYYY (where the separator is one of ' ', '/', '-' and '\'), as previously they were parsed using dateutil.parse, which is implemented in pure Python.
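A rough illustration of the affected code path (a hedged sketch, not taken from the PR itself; the file and column names are hypothetical):

import pandas as pd

# read_csv with parse_dates on a column of '%m/%d/%Y' strings previously fell
# back to dateutil for every value; with this PR such values are handled by
# the Cython fast path instead.
df = pd.read_csv('trips.csv', parse_dates=['pickup_date'])  # hypothetical file/column
print(df['pickup_date'].dtype)  # datetime64[ns]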

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks commented Mar 29, 2019

Hello @vnlitvinov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-19 06:17:53 UTC

@vnlitvinov (Contributor, Author)

Performance benefits, shown by asv continuous -f 1.01 upstream/master parse-slashed-date -e -b io.csv.ReadCSVParseSpecialDate -a sample_time=1 -a warmup_time=1:

      before          after        ratio  test name
    [1d4c89f]      [72a6ed8]
    (master)       (parse-slashed-date)
     721±10ms       44.3±2ms        0.06  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY')
      493±4ms     17.9±0.3ms        0.04  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY')

@codecov (bot) commented Mar 29, 2019

Codecov Report

Merging #25922 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25922      +/-   ##
==========================================
- Coverage    91.8%   91.79%   -0.01%     
==========================================
  Files         174      174              
  Lines       52536    52536              
==========================================
- Hits        48231    48227       -4     
- Misses       4305     4309       +4
Flag       Coverage Δ
#multiple  90.35% <ø> (ø) ⬆️
#single    41.87% <ø> (-0.08%) ⬇️

Impacted Files        Coverage Δ
pandas/io/gbq.py      75% <0%> (-12.5%) ⬇️
pandas/core/frame.py  96.79% <0%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1d4c89f...72a6ed8

@codecov (bot) commented Mar 29, 2019

Codecov Report

Merging #25922 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25922      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52383    52383              
==========================================
- Hits        48188    48185       -3     
- Misses       4195     4198       +3
Flag       Coverage Δ
#multiple  90.54% <ø> (ø) ⬆️
#single    40.74% <ø> (-0.15%) ⬇️

Impacted Files          Coverage Δ
pandas/io/gbq.py        75% <0%> (-12.5%) ⬇️
pandas/core/frame.py    96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py  90.71% <0%> (+0.1%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f53bb06...2cd971a

@mroeschke (Member)

xref #12667

I like the spirit of this PR, but I am a little wary of the creep in scope. pandas does have some date parsing utilities, especially for ISO dates, but I feel that more niche formats should be dispatched to dateutil, i.e. maybe dateutil should inherit this performance improvement.

Thoughts @jbrockmendel?

@jbrockmendel (Member)

I agree with @mroeschke about the scope creep. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users?

@gfyoung added the Timeseries and Performance (Memory or execution speed performance) labels on Mar 29, 2019
@vnlitvinov (Contributor, Author)

I agree with @mroeschke about the scope creep. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users?

As far as I can tell, dateutil is pure-Python by design, so it would be really hard to integrate this there.
I don't see why this date format should be treated differently from the ISO 8601 one (our parsing is definitely way easier to maintain than that).

As for being a pain point: this format is quite widely used in certain real-world data sets, like Mortgage or NYC Taxi Rides, so yes, it is a pain point in some cases (a Mortgage dataset that takes dozens of minutes to load because this format is parsed slowly is a reality I have measured).

@jbrockmendel (Member)

As far as I can tell, dateutil is pure-Python by design, so it would be really hard to integrate this there.

dateutil is pure-Python, but it also has lots of room for performance improvement. Also, IIRC @pganssle has been considering adding C/Cython/Rust backend(s).

I don't see why this date format should be treated differently than ISO8601 one (our parsing is definitely way easier to maintain than that)

That logic can apply to many formats, most of which we don't want to be responsible for internally.

Mortgage dataset that takes dozens of minutes to be loaded because of this format being parsed slowly

I think if you pass the format kwarg to to_datetime it should use array_strptime, which I expect would be performant.
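For reference, the suggestion above looks roughly like this (a hedged sketch with made-up sample values):

import pandas as pd

# With an explicit format, to_datetime can take the strptime-based fast path
# instead of falling back to dateutil for each string.
dates = pd.to_datetime(['03/29/2019', '04/20/2019'], format='%m/%d/%Y')
print(dates)  # DatetimeIndex(['2019-03-29', '2019-04-20'], dtype='datetime64[ns]', freq=None)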

@jreback (Contributor) left a comment

Are there cases where this fails to parse but we still call dateutil.parse? E.g. which are they? (Add to the docstring.)

Certainly OK with parsing common cases, just want to note it.

3 review threads on pandas/_libs/tslibs/parsing.pyx (outdated, resolved)
@vnlitvinov (Contributor, Author)

I think if you pass the format kwarg to to_datetime it should use array_strptime, which I expect would be performant.

There's no direct way of passing a format argument to to_datetime() from read_csv. I have prototyped passing the argument to run a benchmark, though, and measured the performance.
It performs roughly the same (within the margin of error), but it requires the caller of read_csv to know the date format beforehand (including the separator, so if different separators are for some reason used in different rows, this would fail to parse correctly).

date format    format + array_strptime    proposed PR
mY             43.9±0.4ms                 43.5±0.3ms
mdY            16.4±0.08ms                16.5±0.2ms

So yes, using the format keyword here could save us from implementing our speed-ups, but at the expense of the end user having to know the format of the input data, while our PR makes parsing almost equally fast without requiring that knowledge from the user (and keeps the public API unchanged! 😄).
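For completeness, the prototyped workaround looks roughly like this (a hedged sketch; the file and column names are hypothetical, and it relies on read_csv's date_parser hook):

import pandas as pd

# The caller has to know the exact format (including the separator) up front.
df = pd.read_csv(
    'mortgage.csv',                      # hypothetical file
    parse_dates=['origination_date'],    # hypothetical column
    date_parser=lambda col: pd.to_datetime(col, format='%m/%d/%Y'),
)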

@vnlitvinov changed the title from "PERF: Parse certain dates in Cython instead of falling back to dateuril.parse" to "PERF: Parse certain dates in Cython instead of falling back to dateutil.parse" on Apr 1, 2019
@pganssle (Contributor) commented Apr 1, 2019

Does this not need additional tests?

I can't comment on the scoping stuff, but one of the hardest parts of changing any auto-magical interface is that it's very hard to tell whether the things you are doing are going to break something someone is counting on. Even for something like this with a fast path, you should probably be doing a decent amount of testing of the desired behavior and of similar, related behavior, though it's hard to even say what that would look like.

Since you are trying to replace dateutil.parser.parse, this may be an ideal situation to use hypothesis, because you can test the property "this function returns the same thing as dateutil" for all kinds of formats, since you're looking for something that is basically a faster version of dateutil.
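A minimal sketch of such a property-based test (not the test that was eventually added in this PR; it assumes agreement is checked through the public pd.to_datetime):

import datetime

from dateutil.parser import parse as du_parse
from hypothesis import given, strategies as st
import pandas as pd

@given(
    date=st.dates(min_value=datetime.date(1900, 1, 1),
                  max_value=datetime.date(2100, 12, 31)),
    fmt=st.sampled_from(['%m/%d/%Y', '%d/%m/%Y', '%Y%m%d']),
)
def test_matches_dateutil(date, fmt):
    # Render a date in one of the formats discussed in this PR and check that
    # pandas and dateutil agree on how to read it back.
    date_string = date.strftime(fmt)
    dayfirst = fmt.startswith('%d')
    assert (pd.to_datetime(date_string, dayfirst=dayfirst)
            == du_parse(date_string, dayfirst=dayfirst))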

@vnlitvinov (Contributor, Author)

Does this not need additional tests?

Great question; it most likely does need tests, and we'll add some.

@anmyachev (Contributor)

I have added tests and fixed a bug. Thanks @pganssle!

@pganssle (Contributor) commented Apr 3, 2019

Glad that it helped to add tests.

Maybe it's a job for a different PR, but I do still think it may be worth adding hypothesis-based tests checking that dateutil and pandas return the same thing for a wide swathe of dates and formats. It's super easy to change something about datetime parsing logic and break some use case you didn't think about.

@vnlitvinov (Contributor, Author)

testing that dateutil and pandas return the same thing for a wide swathe of dates and formats

Note that this change slightly alters the behaviour for dates like MM/YYYY: dateutil parses MM/YYYY with the day set to the current day, while our patch sets it to 1. I think our approach is more consistent with how other abbreviated dates are parsed, see:

cdef inline object _parse_dateabbr_string(object date_string, object default,
                                          object freq):
    cdef:
        object ret
        int year, quarter = -1, month, mnum, date_len

    # special handling for possibilities eg, 2Q2005, 2Q05, 2005Q1, 05Q1
    assert isinstance(date_string, (str, unicode))

    # len(date_string) == 0
    # should be NaT???
    if date_string in nat_strings:
        return NaT, NaT, ''

    date_string = date_string.upper()
    date_len = len(date_string)

    if date_len == 4:
        # parse year only like 2000
        try:
            ret = default.replace(year=int(date_string))
            return ret, ret, 'year'
        except ValueError:
            pass

    try:
        if 4 <= date_len <= 7:
            i = date_string.index('Q', 1, 6)
            if i == 1:
                quarter = int(date_string[0])
                if date_len == 4 or (date_len == 5
                                     and date_string[i + 1] == '-'):
                    # r'(\d)Q-?(\d\d)')
                    year = 2000 + int(date_string[-2:])
                elif date_len == 6 or (date_len == 7
                                       and date_string[i + 1] == '-'):
                    # r'(\d)Q-?(\d\d\d\d)')
                    year = int(date_string[-4:])
                else:
                    raise ValueError
            elif i == 2 or i == 3:
                # r'(\d\d)-?Q(\d)'
                if date_len == 4 or (date_len == 5
                                     and date_string[i - 1] == '-'):
                    quarter = int(date_string[-1])
                    year = 2000 + int(date_string[:2])
                else:
                    raise ValueError
            elif i == 4 or i == 5:
                if date_len == 6 or (date_len == 7
                                     and date_string[i - 1] == '-'):
                    # r'(\d\d\d\d)-?Q(\d)'
                    quarter = int(date_string[-1])
                    year = int(date_string[:4])
                else:
                    raise ValueError

            if not (1 <= quarter <= 4):
                msg = ('Incorrect quarterly string is given, quarter must be '
                       'between 1 and 4: {dstr}')
                raise DateParseError(msg.format(dstr=date_string))

            if freq is not None:
                # hack attack, #1228
                try:
                    mnum = MONTH_NUMBERS[_get_rule_month(freq)] + 1
                except (KeyError, ValueError):
                    msg = ('Unable to retrieve month information from given '
                           'freq: {freq}'.format(freq=freq))
                    raise DateParseError(msg)

                month = (mnum + (quarter - 1) * 3) % 12 + 1
                if month > mnum:
                    year -= 1
            else:
                month = (quarter - 1) * 3 + 1

            ret = default.replace(year=year, month=month)
            return ret, ret, 'quarter'

    except DateParseError:
        raise
    except ValueError:
        pass

    if date_len == 6 and (freq == 'M' or
                          getattr(freq, 'rule_code', None) == 'M'):
        year = int(date_string[:4])
        month = int(date_string[4:6])
        try:
            ret = default.replace(year=year, month=month)
            return ret, ret, 'month'
        except ValueError:
            pass

    for pat in ['%Y-%m', '%m-%Y', '%b %Y', '%b-%Y']:
        try:
            ret = datetime.strptime(date_string, pat)
            return ret, ret, 'month'
        except ValueError:
            pass

    raise ValueError('Unable to parse {0}'.format(date_string))

It never replaces the day in _DEFAULT_DATETIME, so the day stays 1 there.
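To make the difference concrete (a hedged example; the exact day dateutil returns depends on when it is run):

from dateutil.parser import parse as du_parse
import pandas as pd

# dateutil fills the missing day from "today", so the result changes from run to run.
du_parse('10/2019')        # e.g. datetime.datetime(2019, 10, <today's day>, 0, 0)

# With this patch, pandas pins the missing day to 1, consistent with how other
# abbreviated dates (e.g. '2019-10') are handled.
pd.to_datetime('10/2019')  # Timestamp('2019-10-01 00:00:00')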

@pganssle (Contributor) commented Apr 3, 2019

@vnlitvinov While I think defaulting to "today" was not a good choice in dateutil, the reason I have not changed that is that it would be a backwards-incompatible change.

If you are going to make a backwards-incompatible change to the behavior of the library, that should be very prominently documented. This change is only documented as a performance improvement.

@vnlitvinov (Contributor, Author)

This change is only documented as a performance improvement.

That is because I didn't suspect that dateutil would fill in today's day for a date that doesn't specify one :)
We will update the whatsnew entry with this change.

vnlitvinov and others added 22 commits April 19, 2019 09:16
…eted as a float number, on the other hand 09.2019 can be parsed as date
It turned out that there is no change in behaviour
This pattern is already handled by _parse_delimited_date
…Y%m%d', '%y%m%d' for test_hypothesis_delimited_date; created _helper_hypothesis_delimited_date func
…ar < 1000 not processing in _parse_delimited_date now
@anmyachev (Contributor)

@jreback Done & green

@jreback jreback merged commit 3e90d43 into pandas-dev:master Apr 20, 2019
@jreback (Contributor) commented Apr 20, 2019

Thanks @vnlitvinov and @anmyachev, nice patch! These involved changes can take some time to review (and to write, of course).

Labels: Performance (Memory or execution speed performance), Timeseries
Projects: None yet
9 participants