Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Parse certain dates in Cython instead of falling back to dateutil.parse #25922

Merged
merged 32 commits into from Apr 20, 2019

Conversation

@vnlitvin
Copy link
Contributor

commented Mar 29, 2019

This tremendously speeds up parsing of certain date patterns: MM/YYYY, MM/DD/YYYY and DD/MM/YYYY (where separator is one of , /, -, and \), as previously they were parsed using dateutil.parse which is implemented in Python.

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
@pep8speaks

This comment has been minimized.

Copy link

commented Mar 29, 2019

Hello @vnlitvin! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-19 06:17:53 UTC
@vnlitvin

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2019

Performance benefits, shown by asv continuous -f 1.01 upstream/master parse-slashed-date -e -b io.csv.ReadCSVParseSpecialDate -a sample_time=1 -a warmup_time=1:

before after ratio test name
[1d4c89f] [72a6ed8]
master parse-slashed-date
721±10ms 44.3±2ms 0.06 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY')
493±4ms 17.9±0.3ms 0.04 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY')
@codecov

This comment has been minimized.

Copy link

commented Mar 29, 2019

Codecov Report

Merging #25922 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25922      +/-   ##
==========================================
- Coverage    91.8%   91.79%   -0.01%     
==========================================
  Files         174      174              
  Lines       52536    52536              
==========================================
- Hits        48231    48227       -4     
- Misses       4305     4309       +4
Flag Coverage Δ
#multiple 90.35% <ø> (ø) ⬆️
#single 41.87% <ø> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d4c89f...72a6ed8. Read the comment docs.

@codecov

This comment has been minimized.

Copy link

commented Mar 29, 2019

Codecov Report

Merging #25922 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25922      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52383    52383              
==========================================
- Hits        48188    48185       -3     
- Misses       4195     4198       +3
Flag Coverage Δ
#multiple 90.54% <ø> (ø) ⬆️
#single 40.74% <ø> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.71% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f53bb06...2cd971a. Read the comment docs.

@mroeschke

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

xref #12667

I like the spirit of of this PR, but i am a little weary of the creep in scope. pandas does have some date parsing utility especially for iso dates, but I feel that more niche formats should be dispatched to dateutil i.e. maybe dateutil should inherit this performance improvement

Thoughts @jbrockmendel?

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Mar 29, 2019

I agree with @mroeschke about creep scope. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users?

@vnlitvin

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2019

I agree with @mroeschke about creep scope. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users?

As far as I can tell, dateutil is pure-Python by design, so it would be really hard to integrate this there.
I don't see why this date format should be treated differently than ISO8601 one (our parsing is definitely way easier to maintain than that).

As for being a pain point - this format is quite widely used in certain real-world data sets, like Mortgage or NYC Taxi Rides, so yes, it is a pain point in some cases (Mortgage dataset that takes dozens of minutes to be loaded because of this format being parsed slowly is the reality I measured).

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Mar 30, 2019

As far as I can tell, dateutil is pure-Python by design, so it would be really hard to integrate this there.

dateutil is pure-python, but it also has lots of room for performance improvement. Also IIRC @pganssle has been considering adding C/cython/rust backend(s).

I don't see why this date format should be treated differently than ISO8601 one (our parsing is definitely way easier to maintain than that)

That logic can apply to many formats, most of which we don't want to be responsible for internally.

Mortgage dataset that takes dozens of minutes to be loaded because of this format being parsed slowly

I think if you pass the format kwarg to to_datetime it should use array_strptime, which I expect would be performant.

@jreback
Copy link
Contributor

left a comment

are there cases where this this fails to parse but we still call dateuti.parse? e.g. which are they (add to the doc-string).

certaily ok with parsing common cases, just want to note it.

Show resolved Hide resolved pandas/_libs/tslibs/parsing.pyx Outdated
Show resolved Hide resolved pandas/_libs/tslibs/parsing.pyx Outdated
Show resolved Hide resolved pandas/_libs/tslibs/parsing.pyx Outdated
@vnlitvin

This comment has been minimized.

Copy link
Contributor Author

commented Apr 1, 2019

I think if you pass the format kwarg to to_datetime it should use array_strptime, which I expect would be performant.

There's no direct way of passing format argument to to_datetime() from read_csv. I have prototyped passing the argument to run a benchmark, though, and measured the performance.
It is performing roughly the same (in the margin of error), but it requires the caller of read_csv to know date format beforehand (including knowing the separator, so if different separators are for some reason used in different rows this would fail to parse correctly).

date format format + array_strptime proposed PR
mY 43.9±0.4ms 43.5±0.3ms
mdY 16.4±0.08ms 16.5±0.2ms

So I'm stating that using format keyword here could save from implementing our speed ups, yes, but at expense of end user knowing the format of input data, while our PR makes parsing almost equally fast without requiring knowledge from the user (and keeping public API unchanged! 😄 ).

@vnlitvin vnlitvin force-pushed the anmyachev:parse-slashed-date branch from 0123245 to 8a05e93 Apr 1, 2019

@vnlitvin vnlitvin changed the title PERF: Parse certain dates in Cython instead of falling back to dateuril.parse PERF: Parse certain dates in Cython instead of falling back to dateutil.parse Apr 1, 2019

@pganssle

This comment has been minimized.

Copy link
Contributor

commented Apr 1, 2019

Does this not need additional tests?

I can't comment on the scoping stuff, but one of the hardest parts of changing any auto-magical interface is that it's very hard to tell if the things you are doing are going to break something someone is counting on. Even something like this with a fast-path, you should probably be doing a decent amount of testing of the desired behavior and similar, related behavior - hard to even say what that would do.

Since you are trying to replace dateutil.parser.parse, this may be an ideal situation to use hypothesis, because you can test the property "this function returns the same thing as dateutil" for all kinds of formats, since you're looking for something that is basically a faster version of dateutil.

@vnlitvin

This comment has been minimized.

Copy link
Contributor Author

commented Apr 1, 2019

Does this not need additional tests?

Great question, it most likely does need tests, we'll add some.

@anmyachev

This comment has been minimized.

Copy link
Contributor

commented Apr 2, 2019

I have added tests and fixed bug. Thanks @pganssle

@pganssle

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2019

Glad that it helped to add tests.

Maybe it's a job for a different PR, but I do still think it may be worth adding hypothesis-based tests testing that dateutil and pandas return the same thing for a wide swathe of dates and formats. It's super easy to change something about datetime parsing logic and change some use case you didn't think about.

@vnlitvin

This comment has been minimized.

Copy link
Contributor Author

commented Apr 3, 2019

testing that dateutil and pandas return the same thing for a wide swathe of dates and formats

Note that this change changes the behaviour for dates like MM/YYYY slightly, as dateutil parses MM/YYYY with day set equal to current day while our patch parses it set to 1. I think our approach is more consistent with how other abbreviated dates are parsed, see

cdef inline object _parse_dateabbr_string(object date_string, object default,
object freq):
cdef:
object ret
int year, quarter = -1, month, mnum, date_len
# special handling for possibilities eg, 2Q2005, 2Q05, 2005Q1, 05Q1
assert isinstance(date_string, (str, unicode))
# len(date_string) == 0
# should be NaT???
if date_string in nat_strings:
return NaT, NaT, ''
date_string = date_string.upper()
date_len = len(date_string)
if date_len == 4:
# parse year only like 2000
try:
ret = default.replace(year=int(date_string))
return ret, ret, 'year'
except ValueError:
pass
try:
if 4 <= date_len <= 7:
i = date_string.index('Q', 1, 6)
if i == 1:
quarter = int(date_string[0])
if date_len == 4 or (date_len == 5
and date_string[i + 1] == '-'):
# r'(\d)Q-?(\d\d)')
year = 2000 + int(date_string[-2:])
elif date_len == 6 or (date_len == 7
and date_string[i + 1] == '-'):
# r'(\d)Q-?(\d\d\d\d)')
year = int(date_string[-4:])
else:
raise ValueError
elif i == 2 or i == 3:
# r'(\d\d)-?Q(\d)'
if date_len == 4 or (date_len == 5
and date_string[i - 1] == '-'):
quarter = int(date_string[-1])
year = 2000 + int(date_string[:2])
else:
raise ValueError
elif i == 4 or i == 5:
if date_len == 6 or (date_len == 7
and date_string[i - 1] == '-'):
# r'(\d\d\d\d)-?Q(\d)'
quarter = int(date_string[-1])
year = int(date_string[:4])
else:
raise ValueError
if not (1 <= quarter <= 4):
msg = ('Incorrect quarterly string is given, quarter must be '
'between 1 and 4: {dstr}')
raise DateParseError(msg.format(dstr=date_string))
if freq is not None:
# hack attack, #1228
try:
mnum = MONTH_NUMBERS[_get_rule_month(freq)] + 1
except (KeyError, ValueError):
msg = ('Unable to retrieve month information from given '
'freq: {freq}'.format(freq=freq))
raise DateParseError(msg)
month = (mnum + (quarter - 1) * 3) % 12 + 1
if month > mnum:
year -= 1
else:
month = (quarter - 1) * 3 + 1
ret = default.replace(year=year, month=month)
return ret, ret, 'quarter'
except DateParseError:
raise
except ValueError:
pass
if date_len == 6 and (freq == 'M' or
getattr(freq, 'rule_code', None) == 'M'):
year = int(date_string[:4])
month = int(date_string[4:6])
try:
ret = default.replace(year=year, month=month)
return ret, ret, 'month'
except ValueError:
pass
for pat in ['%Y-%m', '%m-%Y', '%b %Y', '%b-%Y']:
try:
ret = datetime.strptime(date_string, pat)
return ret, ret, 'month'
except ValueError:
pass
raise ValueError('Unable to parse {0}'.format(date_string))

It never replaces day in _DEFAULT_DATETIME so day stays 1 there.

@pganssle

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2019

@vnlitvin While I think defaulting to "today" was not a good choice in dateutil, the reason I have not changed that is that it would be a backwards-incompatible change.

If you are going to make a backwards-incompatible change to the behavior of the library, that should be very prominently documented. This change is only documented as a performance improvement.

@vnlitvin

This comment has been minimized.

Copy link
Contributor Author

commented Apr 3, 2019

This change is only documented as a performance improvement.

That is because I didn't suspect that dateutil would use today's day for some unrelated date :)
We will update whatsnew with this change.

vnlitvin and others added some commits Apr 1, 2019

removed parsing MM.YYYY format, because, for example, 10.2019 interpr…
…eted as a float number, on the other hand 09.2019 can be parsed as date
Remove whatsnew entry for the change
It turned out that there is no change in behaviour
Remove duplicate parsing of MM-YYYY in _parse_dateabbr_string
This pattern is already handled by _parse_delimited_date
removed test_parse_delimited_date; added next formats: '%y %m %d', '%…
…Y%m%d', '%y%m%d' for test_hypothesis_delimited_date; created _helper_hypothesis_delimited_date func
fixed import order; using @settings(deadline=None) now; dates with ye…
…ar < 1000 not processig in _parse_delimited_date now

@anmyachev anmyachev force-pushed the anmyachev:parse-slashed-date branch from dbf26e3 to 2cd971a Apr 19, 2019

@anmyachev

This comment has been minimized.

Copy link
Contributor

commented Apr 19, 2019

@jreback Done & green

@jreback jreback merged commit 3e90d43 into pandas-dev:master Apr 20, 2019

11 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20190419.15 succeeded
Details
pandas-dev.pandas (Checks_and_doc) Checks_and_doc succeeded
Details
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
Details
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
Details
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
Details
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
Details
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
Details
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 20, 2019

thanks @vnlitvin and @anmyachev nice patch! these involved can take some time to review (and write of course).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants
You can’t perform that action at this time.