
PERF: Parse certain dates in Cython instead of falling back to dateutil.parse #25922

Merged
merged 32 commits into pandas-dev:master on Apr 20, 2019

Conversation

@vnlitvinov (Contributor) commented Mar 29, 2019

This tremendously speeds up parsing of certain date patterns: MM/YYYY, MM/DD/YYYY and DD/MM/YYYY (where the separator is one of ' ', '/', '-' and '\'), as previously they were parsed using dateutil.parse, which is implemented in pure Python.
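A rough illustration of the affected code path (a hedged sketch, not taken from the PR itself; the file and column names are hypothetical):

import pandas as pd

# read_csv with parse_dates on a column of '%m/%d/%Y' strings previously fell
# back to dateutil for every value; with this PR such values are handled by
# the Cython fast path instead.
df = pd.read_csv('trips.csv', parse_dates=['pickup_date'])  # hypothetical file/column
print(df['pickup_date'].dtype)  # datetime64[ns]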

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks commented Mar 29, 2019

Hello @vnlitvinov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-19 06:17:53 UTC

@vnlitvinov (Contributor, Author)

Performance benefits, shown by asv continuous -f 1.01 upstream/master parse-slashed-date -e -b io.csv.ReadCSVParseSpecialDate -a sample_time=1 -a warmup_time=1:

      before          after        ratio  test name
    [1d4c89f]      [72a6ed8]
    (master)       (parse-slashed-date)
     721±10ms       44.3±2ms        0.06  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY')
      493±4ms     17.9±0.3ms        0.04  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY')

@codecov (bot) commented Mar 29, 2019

Codecov Report

Merging #25922 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25922      +/-   ##
==========================================
- Coverage    91.8%   91.79%   -0.01%     
==========================================
  Files         174      174              
  Lines       52536    52536              
==========================================
- Hits        48231    48227       -4     
- Misses       4305     4309       +4
Flag       Coverage Δ
#multiple  90.35% <ø> (ø) ⬆️
#single    41.87% <ø> (-0.08%) ⬇️

Impacted Files        Coverage Δ
pandas/io/gbq.py      75% <0%> (-12.5%) ⬇️
pandas/core/frame.py  96.79% <0%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1d4c89f...72a6ed8

@codecov (bot) commented Mar 29, 2019

Codecov Report

Merging #25922 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25922      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52383    52383              
==========================================
- Hits        48188    48185       -3     
- Misses       4195     4198       +3
Flag       Coverage Δ
#multiple  90.54% <ø> (ø) ⬆️
#single    40.74% <ø> (-0.15%) ⬇️

Impacted Files          Coverage Δ
pandas/io/gbq.py        75% <0%> (-12.5%) ⬇️
pandas/core/frame.py    96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py  90.71% <0%> (+0.1%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f53bb06...2cd971a

@mroeschke (Member)

xref #12667

I like the spirit of this PR, but I am a little wary of the creep in scope. pandas does have some date parsing utilities, especially for ISO dates, but I feel that more niche formats should be dispatched to dateutil, i.e. maybe dateutil should inherit this performance improvement.

Thoughts @jbrockmendel?

@jbrockmendel (Member)

I agree with @mroeschke about the scope creep. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users?

@gfyoung added the Timeseries and Performance (Memory or execution speed performance) labels on Mar 29, 2019
@vnlitvinov (Contributor, Author)

I agree with @mroeschke about the scope creep. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users?

As far as I can tell, dateutil is pure-Python by design, so it would be really hard to integrate this there.
I don't see why this date format should be treated differently from the ISO 8601 one (our parsing is definitely way easier to maintain than that).

As for being a pain point: this format is quite widely used in certain real-world data sets, like Mortgage or NYC Taxi Rides, so yes, it is a pain point in some cases (a Mortgage dataset that takes dozens of minutes to load because this format is parsed slowly is a reality I have measured).

@jbrockmendel (Member)

As far as I can tell, dateutil is pure-Python by design, so it would be really hard to integrate this there.

dateutil is pure-Python, but it also has lots of room for performance improvement. Also, IIRC @pganssle has been considering adding C/Cython/Rust backend(s).

I don't see why this date format should be treated differently than ISO8601 one (our parsing is definitely way easier to maintain than that)

That logic can apply to many formats, most of which we don't want to be responsible for internally.

Mortgage dataset that takes dozens of minutes to be loaded because of this format being parsed slowly

I think if you pass the format kwarg to to_datetime it should use array_strptime, which I expect would be performant.
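For reference, the suggestion above looks roughly like this (a hedged sketch with made-up sample values):

import pandas as pd

# With an explicit format, to_datetime can take the strptime-based fast path
# instead of falling back to dateutil for each string.
dates = pd.to_datetime(['03/29/2019', '04/20/2019'], format='%m/%d/%Y')
print(dates)  # DatetimeIndex(['2019-03-29', '2019-04-20'], dtype='datetime64[ns]', freq=None)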

@jreback (Contributor) left a comment

Are there cases where this fails to parse but we still call dateutil.parse? E.g. which are they? (Add to the docstring.)

Certainly OK with parsing common cases, just want to note it.

3 review threads on pandas/_libs/tslibs/parsing.pyx (outdated, resolved)
@vnlitvinov (Contributor, Author)

I think if you pass the format kwarg to to_datetime it should use array_strptime, which I expect would be performant.

There's no direct way of passing a format argument to to_datetime() from read_csv. I have prototyped passing the argument to run a benchmark, though, and measured the performance.
It performs roughly the same (within the margin of error), but it requires the caller of read_csv to know the date format beforehand (including the separator, so if different separators are for some reason used in different rows, this would fail to parse correctly).

date format    format + array_strptime    proposed PR
mY             43.9±0.4ms                 43.5±0.3ms
mdY            16.4±0.08ms                16.5±0.2ms

So yes, using the format keyword here could save us from implementing our speed-ups, but at the expense of the end user having to know the format of the input data, while our PR makes parsing almost equally fast without requiring that knowledge from the user (and keeps the public API unchanged! 😄).
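For completeness, the prototyped workaround looks roughly like this (a hedged sketch; the file and column names are hypothetical, and it relies on read_csv's date_parser hook):

import pandas as pd

# The caller has to know the exact format (including the separator) up front.
df = pd.read_csv(
    'mortgage.csv',                      # hypothetical file
    parse_dates=['origination_date'],    # hypothetical column
    date_parser=lambda col: pd.to_datetime(col, format='%m/%d/%Y'),
)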

@vnlitvinov changed the title from "PERF: Parse certain dates in Cython instead of falling back to dateuril.parse" to "PERF: Parse certain dates in Cython instead of falling back to dateutil.parse" on Apr 1, 2019
@pganssle (Contributor) commented Apr 1, 2019

Does this not need additional tests?

I can't comment on the scoping stuff, but one of the hardest parts of changing any auto-magical interface is that it's very hard to tell whether the things you are doing are going to break something someone is counting on. Even for something like this with a fast path, you should probably be doing a decent amount of testing of the desired behavior and of similar, related behavior, though it's hard to even say what that would look like.

Since you are trying to replace dateutil.parser.parse, this may be an ideal situation to use hypothesis, because you can test the property "this function returns the same thing as dateutil" for all kinds of formats, since you're looking for something that is basically a faster version of dateutil.
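A minimal sketch of such a property-based test (not the test that was eventually added in this PR; it assumes agreement is checked through the public pd.to_datetime):

import datetime

from dateutil.parser import parse as du_parse
from hypothesis import given, strategies as st
import pandas as pd

@given(
    date=st.dates(min_value=datetime.date(1900, 1, 1),
                  max_value=datetime.date(2100, 12, 31)),
    fmt=st.sampled_from(['%m/%d/%Y', '%d/%m/%Y', '%Y%m%d']),
)
def test_matches_dateutil(date, fmt):
    # Render a date in one of the formats discussed in this PR and check that
    # pandas and dateutil agree on how to read it back.
    date_string = date.strftime(fmt)
    dayfirst = fmt.startswith('%d')
    assert (pd.to_datetime(date_string, dayfirst=dayfirst)
            == du_parse(date_string, dayfirst=dayfirst))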

@vnlitvinov (Contributor, Author)

Does this not need additional tests?

Great question; it most likely does need tests, and we'll add some.

@anmyachev (Contributor)

I have added tests and fixed a bug. Thanks @pganssle!

@pganssle (Contributor) commented Apr 3, 2019

Glad that it helped to add tests.

Maybe it's a job for a different PR, but I do still think it may be worth adding hypothesis-based tests checking that dateutil and pandas return the same thing for a wide swathe of dates and formats. It's super easy to change something about datetime parsing logic and break some use case you didn't think about.

@vnlitvinov (Contributor, Author)

testing that dateutil and pandas return the same thing for a wide swathe of dates and formats

Note that this change slightly alters the behaviour for dates like MM/YYYY: dateutil parses MM/YYYY with the day set to the current day, while our patch sets it to 1. I think our approach is more consistent with how other abbreviated dates are parsed, see:

cdef inline object _parse_dateabbr_string(object date_string, object default,
                                          object freq):
    cdef:
        object ret
        int year, quarter = -1, month, mnum, date_len

    # special handling for possibilities eg, 2Q2005, 2Q05, 2005Q1, 05Q1
    assert isinstance(date_string, (str, unicode))

    # len(date_string) == 0
    # should be NaT???
    if date_string in nat_strings:
        return NaT, NaT, ''

    date_string = date_string.upper()
    date_len = len(date_string)

    if date_len == 4:
        # parse year only like 2000
        try:
            ret = default.replace(year=int(date_string))
            return ret, ret, 'year'
        except ValueError:
            pass

    try:
        if 4 <= date_len <= 7:
            i = date_string.index('Q', 1, 6)
            if i == 1:
                quarter = int(date_string[0])
                if date_len == 4 or (date_len == 5
                                     and date_string[i + 1] == '-'):
                    # r'(\d)Q-?(\d\d)')
                    year = 2000 + int(date_string[-2:])
                elif date_len == 6 or (date_len == 7
                                       and date_string[i + 1] == '-'):
                    # r'(\d)Q-?(\d\d\d\d)')
                    year = int(date_string[-4:])
                else:
                    raise ValueError
            elif i == 2 or i == 3:
                # r'(\d\d)-?Q(\d)'
                if date_len == 4 or (date_len == 5
                                     and date_string[i - 1] == '-'):
                    quarter = int(date_string[-1])
                    year = 2000 + int(date_string[:2])
                else:
                    raise ValueError
            elif i == 4 or i == 5:
                if date_len == 6 or (date_len == 7
                                     and date_string[i - 1] == '-'):
                    # r'(\d\d\d\d)-?Q(\d)'
                    quarter = int(date_string[-1])
                    year = int(date_string[:4])
                else:
                    raise ValueError

            if not (1 <= quarter <= 4):
                msg = ('Incorrect quarterly string is given, quarter must be '
                       'between 1 and 4: {dstr}')
                raise DateParseError(msg.format(dstr=date_string))

            if freq is not None:
                # hack attack, #1228
                try:
                    mnum = MONTH_NUMBERS[_get_rule_month(freq)] + 1
                except (KeyError, ValueError):
                    msg = ('Unable to retrieve month information from given '
                           'freq: {freq}'.format(freq=freq))
                    raise DateParseError(msg)

                month = (mnum + (quarter - 1) * 3) % 12 + 1
                if month > mnum:
                    year -= 1
            else:
                month = (quarter - 1) * 3 + 1

            ret = default.replace(year=year, month=month)
            return ret, ret, 'quarter'

    except DateParseError:
        raise
    except ValueError:
        pass

    if date_len == 6 and (freq == 'M' or
                          getattr(freq, 'rule_code', None) == 'M'):
        year = int(date_string[:4])
        month = int(date_string[4:6])
        try:
            ret = default.replace(year=year, month=month)
            return ret, ret, 'month'
        except ValueError:
            pass

    for pat in ['%Y-%m', '%m-%Y', '%b %Y', '%b-%Y']:
        try:
            ret = datetime.strptime(date_string, pat)
            return ret, ret, 'month'
        except ValueError:
            pass

    raise ValueError('Unable to parse {0}'.format(date_string))

It never replaces the day in _DEFAULT_DATETIME, so the day stays 1 there.
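To make the difference concrete (a hedged example; the exact day dateutil returns depends on when it is run):

from dateutil.parser import parse as du_parse
import pandas as pd

# dateutil fills the missing day from "today", so the result changes from run to run.
du_parse('10/2019')        # e.g. datetime.datetime(2019, 10, <today's day>, 0, 0)

# With this patch, pandas pins the missing day to 1, consistent with how other
# abbreviated dates (e.g. '2019-10') are handled.
pd.to_datetime('10/2019')  # Timestamp('2019-10-01 00:00:00')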

@pganssle (Contributor) commented Apr 3, 2019

@vnlitvinov While I think defaulting to "today" was not a good choice in dateutil, the reason I have not changed that is that it would be a backwards-incompatible change.

If you are going to make a backwards-incompatible change to the behavior of the library, that should be very prominently documented. This change is only documented as a performance improvement.

@vnlitvinov (Contributor, Author)

This change is only documented as a performance improvement.

That is because I didn't suspect that dateutil would fill in today's day for a date that doesn't specify one :)
We will update the whatsnew entry with this change.

vnlitvinov and others added 22 commits April 19, 2019 09:16
…eted as a float number, on the other hand 09.2019 can be parsed as date
It turned out that there is no change in behaviour
This pattern is already handled by _parse_delimited_date
…Y%m%d', '%y%m%d' for test_hypothesis_delimited_date; created _helper_hypothesis_delimited_date func
…ar < 1000 not processing in _parse_delimited_date now
@anmyachev (Contributor)

@jreback Done & green

@jreback jreback merged commit 3e90d43 into pandas-dev:master Apr 20, 2019
@jreback (Contributor) commented Apr 20, 2019

Thanks @vnlitvinov and @anmyachev, nice patch! These involved changes can take some time to review (and to write, of course).

Labels: Performance (Memory or execution speed performance), Timeseries
Projects: None yet
9 participants