
V0.7 is much slower to parse large calendar feed than V0.5 #244

Closed
cfhowes opened this issue May 10, 2020 · 11 comments · Fixed by #248

cfhowes commented May 10, 2020

I have the following code:

    resp = requests.get(ical_feed_url)
    if resp.status_code != 200:
        logger.error('> Error retrieving iCal feed!')
        return None

    try:
        print(f'begin parse {datetime.datetime.now()}')
        cal = ics.Calendar(resp.text)
        print(f'end parse {datetime.datetime.now()}')
    except Exception as e:
        logger.error('> Error parsing iCal data ({})'.format(e))
        return None

I recently tried upgrading from 0.5 to 0.7, but found that it is much slower and eats a ton more CPU. See my print logs below, taken a few minutes apart on the same feed and the same computer.

V0.7 timing:

begin parse 2020-05-09 18:08:26.872346
end parse 2020-05-09 18:10:20.411268

V0.5 timing:

begin parse 2020-05-09 18:12:34.763552
end parse 2020-05-09 18:12:38.981032

I am using Python 3.6.8 on Ubuntu 16.04.

Due to privacy concerns I'm not going to include the iCal feed, but it has about 15,000 events in it. I am happy to run some tests or try things out if I get pointed in the right direction.

@make-github-pseudonymous-again
Contributor

Can you check if the same happens with v0.6? That is where TatSu parsing was introduced. See 6b71a49

@N-Coder
Member

N-Coder commented May 10, 2020

Could you check how long calendar_string_to_containers takes when you pass it the resp.text ICS string, compared to calling ics.Calendar? The function calendar_string_to_containers calls TatSu to parse every line of the string into a ContentLine. The ics.Calendar constructor first calls this function, then reads/parses the ContentLines into Python objects and stores them in the fields of the calendar. This second step still has some (known) minor issues in 0.7, which might result in increased runtime or memory consumption for big inputs. TatSu has also caused problems in the past with taking too much time, so comparing the two durations should help us decide which of the two steps causes the issue. Could you also tell us how many bytes and lines the file has?
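
Something like this should work for timing the two steps separately (a quick sketch; assuming you're on 0.7, where calendar_string_to_containers should be importable from ics.grammar.parse):

    import datetime
    import ics
    # assuming the 0.7 module layout; adjust the import if it moved
    from ics.grammar.parse import calendar_string_to_containers

    cal_data_str = open("feed.ics").read()  # or resp.text from your request

    start = datetime.datetime.now()
    containers = calendar_string_to_containers(cal_data_str)  # step 1 only
    middle = datetime.datetime.now()
    cal = ics.Calendar(cal_data_str)  # step 1 + step 2
    end = datetime.datetime.now()
    print(f'containers only: {middle - start}, full Calendar: {end - middle}')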

@cfhowes
Author

cfhowes commented May 10, 2020

Thanks for the info, here are my results:

The cal file:

    $ ls -la maycal.ics
    -rw-rw-r-- 1 cfhowes cfhowes 5551179 May 10 13:28 maycal.ics
    $ more maycal.ics | wc -l
    173189

v0.5:

    print(f'begin string parse {datetime.datetime.now()}')
    cal = parse.string_to_container(cal_data_str)
    print(f'end string parse {datetime.datetime.now()}')

    print(f'begin Calendar parse {datetime.datetime.now()}')
    cal = ics.Calendar(cal_data_str)
    print(f'end Calendar parse {datetime.datetime.now()}')

Output:

    begin string parse 2020-05-10 13:38:18.434401
    end string parse 2020-05-10 13:38:19.102300
    begin Calendar parse 2020-05-10 13:38:19.102346
    end Calendar parse 2020-05-10 13:38:23.615815

v0.6:

    print(f'begin string parse {datetime.datetime.now()}')
    cal = calendar_string_to_containers(cal_data_str)
    print(f'end string parse {datetime.datetime.now()}')

    print(f'begin Calendar parse {datetime.datetime.now()}')
    cal = ics.Calendar(cal_data_str)
    print(f'end Calendar parse {datetime.datetime.now()}')

Output:

    begin string parse 2020-05-10 13:41:12.333714
    end string parse 2020-05-10 13:43:11.287366
    begin Calendar parse 2020-05-10 13:43:11.287407
    end Calendar parse 2020-05-10 13:45:14.424139

v0.7:

    print(f'begin string parse {datetime.datetime.now()}')
    cal = calendar_string_to_containers(cal_data_str)
    print(f'end string parse {datetime.datetime.now()}')

    print(f'begin Calendar parse {datetime.datetime.now()}')
    cal = ics.Calendar(cal_data_str)
    print(f'end Calendar parse {datetime.datetime.now()}')

Output:

    begin string parse 2020-05-10 13:46:19.980577
    end string parse 2020-05-10 13:48:08.468203
    begin Calendar parse 2020-05-10 13:48:08.468254
    end Calendar parse 2020-05-10 13:50:02.013652

So it looks like the TatSu parsing is what takes the time, as the calls to ics.Calendar are only 1-2 seconds longer than the calls to calendar_string_to_containers.

@N-Coder
Member

N-Coder commented May 12, 2020

Hmm, I see: TatSu indeed seems to be the one to blame. Thanks for generating all these stats.
The hope with TatSu was that it would ensure correctness, as it pretty much directly interprets the grammar given in the iCalendar RFC, but this runtime trade-off is not acceptable.
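
For reference, the grammar in question is essentially the ABNF for content lines from RFC 5545, section 3.1 (abridged):

    contentline   = name *(";" param ) ":" value CRLF
    name          = iana-token / x-name
    param         = param-name "=" param-value *("," param-value)
    param-value   = paramtext / quoted-string
    paramtext     = *SAFE-CHAR
    quoted-string = DQUOTE *QSAFE-CHAR DQUOTE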

I drafted a new, hand-written parser with minimal substring lookup and copying that should still be RFC-compliant for the next release, 0.8. I'd probably keep TatSu in the code base as a reference parser, but use the hand-written one by default if it yields the same correct results as TatSu (which will need quite some testing). If you want to try it out with 0.7, simply copy the code over and replace the tokenize_line function:

def tokenize_line(unfolded_lines):
    for nr, line in enumerate(unfolded_lines):
        # nr is the index of the unfolded line within the input
        yield ContentLineParser(line, nr).parse()

To backport the changes to 0.7, you'll probably also need the following additional imports:

import re
import attr
from typing import Iterator, Match, Union
from collections import UserString

class QuotedParamValue(UserString):
    pass

I also created a test case by scraping and combining all the Thunderbird holiday calendars, which yields a calendar with 10361 events, 3812250 bytes and 107862 lines, so roughly two thirds the size of your calendar. There's a pretty big speed difference between TatSu and the new parser on this test data:

import datetime, ics
start = datetime.datetime.now()
print(start)
ics.Calendar(open("holidays.ics").read())
end = datetime.datetime.now()
print(end)
print(end - start)

# new parser:
# 2020-05-12 18:49:11.056358
# 2020-05-12 18:49:23.552765
# 0:00:12.496407

# Tatsu:
# 2020-05-12 18:50:24.984475
# 2020-05-12 18:52:40.815331
# 0:02:15.830856

I'll see whether I can squeeze out a few more seconds with the rewritten parsing logic in 0.8. It might still take some time until that version is ready, though, and it will contain some breaking API changes, so you might want to stay on 0.5 for now if you don't feel like experimenting.

@make-github-pseudonymous-again
Contributor

@N-Coder Why does Tatsu have to be called on every line? Why not a grammar that captures the whole thing? Would that make it faster?

@N-Coder
Member

N-Coder commented May 13, 2020

I guess the issue with TatSu is not some startup-time overhead, but rather that matching different regexes, trying different rules and potentially backtracking can take a lot of time. So splitting the input first at these easy-to-find line boundaries and then passing smaller chunks into the PEG engine should make things better.
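
For context, that line-wise approach boils down to something like this (a simplified sketch; GRAMMAR is the compiled TatSu grammar and contentline its main rule):

    import re

    def parse_line_wise(txt):
        # unfold first: a line break followed by a space or tab continues the previous line
        txt = re.sub("\r?\n[ \t]", "", txt)
        for line in re.split("\r?\n", txt):
            if line:
                # each line is a separate, small input for the PEG engine
                yield GRAMMAR.parse(line, rule_name="contentline")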

To try that out, I added a new rule to the EBNF:

full = {(contentline ?"\r?\n")}+ ;

and then called TatSu on the full input string (without any line splitting or line-wise unfolding):

txt = re.sub("\r?\n[ \t]", "", txt)
ast = GRAMMAR.parse(txt, rule_name='full')

and it didn't change much. (These numbers are not fully comparable to the ones above: those were based on the work-in-progress 0.8 codebase, while the following numbers are based on 0.7.)

# full-input Tatsu
2020-05-13 10:42:21.421509
2020-05-13 10:43:36.632016
0:01:15.210507

# line-wise Tatsu
2020-05-13 10:40:41.859572
2020-05-13 10:41:58.320761
0:01:16.461189

I also "precompiled" the whole grammar and tried again, but that also only yielded very little improvement:

# tatsu --generate-parser ics/grammar/contentline.ebnf  -o ics/grammar/contentline.py
# full-input pregenerated Tatsu
2020-05-13 10:44:57.842085
2020-05-13 10:46:07.254298
0:01:09.412213

and also optimized the EBNF:

# full-input pregenerated optimized Tatsu
2020-05-13 10:53:51.698719
2020-05-13 10:54:53.630510
0:01:01.931791

With all these optimizations (which can be seen in this branch), TatSu still seems to be one order of magnitude slower than my hand-written parser. So we could open an issue in the TatSu repo and ask whether they have any further ideas for optimizations, but I'm honestly somewhat wary of those hard-to-read stack traces. Having both parsers (my hand-written one and TatSu) together with enough test data should help us ensure that both run correctly, while leaving us free to choose whichever is faster and more user-friendly in practice.

@make-github-pseudonymous-again
Contributor

@N-Coder Wow, amazing work again! I think it makes sense to keep both parsers for now. Maybe once things are more stable we can keep TatSu only for testing. What would the API look like for choosing which parser to use?
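
For instance, something along these lines (purely hypothetical; no such parser parameter exists in the ics API, this just illustrates the question):

    import ics

    ics_text = open("feed.ics").read()
    cal = ics.Calendar(ics_text)                   # default: the fast hand-written parser
    cal = ics.Calendar(ics_text, parser="tatsu")   # hypothetical opt-in to the reference parser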

@make-github-pseudonymous-again
Contributor

@N-Coder It also makes sense to ask the TatSu developers.

@N-Coder N-Coder added this to the Version 0.8 milestone May 16, 2020
@N-Coder N-Coder linked a pull request Jun 1, 2020 that will close this issue
@N-Coder
Member

N-Coder commented Oct 18, 2020

@aureooms, thanks for the feedback and sorry for the late reply.

> I think it makes sense to keep both parsers for now. Maybe once things are more stable we can keep TatSu only for testing. What would the API look like for choosing which parser to use?

Being able to choose between different parsers would add a lot of complexity on both sides, and I didn't see any clear advantage to it, so I did as you suggested: I only kept my hand-written parser for production and moved TatSu to the testing code as a reference. See my comment in the PR for details and some performance statistics. I guess we don't need to tweak TatSu any further (or discuss this with its developers), as its performance now only matters for testing anyway.
I guess once the PR is merged, we can close this issue. 🎉
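
As a rough illustration, those reference tests boil down to cross-checks like the following (a sketch only; parse_handwritten and parse_with_tatsu are placeholder names, not the actual entry points in the code base):

    import pytest

    SAMPLE_LINES = [
        'DTSTART;TZID=Europe/Berlin:20200510T180000',
        'ATTENDEE;CN="Doe, John";ROLE=REQ-PARTICIPANT:mailto:john@example.com',
        'SUMMARY:Board meeting',
    ]

    @pytest.mark.parametrize("line", SAMPLE_LINES)
    def test_handwritten_parser_matches_tatsu(line):
        # placeholder entry points: both parsers must produce the same ContentLine
        assert parse_handwritten(line) == parse_with_tatsu(line)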

@make-github-pseudonymous-again
Contributor

> Being able to choose between different parsers would add a lot of complexity on both sides, and I didn't see any clear advantage to it, so I did as you suggested: I only kept my hand-written parser for production and moved TatSu to the testing code as a reference.

Good.

> I guess we don't need to tweak TatSu any further (or discuss this with its developers), as its performance now only matters for testing anyway.

Yes.

> I guess once the PR is merged, we can close this issue.

Yes.

@C4ptainCrunch
Member

This should have been closed automatically when #248 was merged.

I don't know why it did not happen. Closing it manually now.
