
V0.7 is much slower to parse large calendar feed than V0.5 #244

Closed
cfhowes opened this issue May 10, 2020 · 11 comments · Fixed by #248

cfhowes commented May 10, 2020

I have the following code:

    resp = requests.get(ical_feed_url)
    if resp.status_code != 200:
        logger.error('> Error retrieving iCal feed!')
        return None

    try:
        print(f'begin parse {datetime.datetime.now()}')
        cal = ics.Calendar(resp.text)
        print(f'end parse {datetime.datetime.now()}')
    except Exception as e:
        logger.error('> Error parsing iCal data ({})'.format(e))
        return None

I recently tried upgrading from 0.5 to 0.7, but found that it is much slower and eats a ton more CPU. See my print logs below, taken a few minutes apart on the same feed and the same computer.

V0.7 timing:

begin parse 2020-05-09 18:08:26.872346
end parse 2020-05-09 18:10:20.411268

V0.5 timing:

begin parse 2020-05-09 18:12:34.763552
end parse 2020-05-09 18:12:38.981032

I am using Python 3.6.8 on Ubuntu 16.04.

Due to privacy concerns I'm not going to include the iCal feed, but it has about 15,000 events in it. I am happy to run some tests or try things out if I get pointed in the right direction.

@make-github-pseudonymous-again
Contributor

Can you check if the same happens with v0.6? That is where TatSu parsing was introduced. See 6b71a49

@N-Coder
Member

N-Coder commented May 10, 2020

Could you check how long calendar_string_to_containers takes when you pass it the resp.text ICS string, compared to calling ics.Calendar? The function calendar_string_to_containers calls TatSu to parse every line of the string into a ContentLine. The ics.Calendar constructor first calls this function, then reads/parses the ContentLines into Python objects and stores them in the fields of the calendar. This second step still has some (known) minor issues in 0.7, which might result in increased runtime or memory consumption for big inputs. TatSu has also caused problems in the past with taking too much time, so comparing the two durations should help us decide which of the two steps causes the issue. Could you also tell us how many bytes and lines the file has?
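
Something like this should work for timing the two steps separately (a quick sketch; assuming you're on 0.7, where calendar_string_to_containers should be importable from ics.grammar.parse):

    import datetime
    import ics
    # assuming the 0.7 module layout; adjust the import if it moved
    from ics.grammar.parse import calendar_string_to_containers

    cal_data_str = open("feed.ics").read()  # or resp.text from your request

    start = datetime.datetime.now()
    containers = calendar_string_to_containers(cal_data_str)  # step 1 only
    middle = datetime.datetime.now()
    cal = ics.Calendar(cal_data_str)  # step 1 + step 2
    end = datetime.datetime.now()
    print(f'containers only: {middle - start}, full Calendar: {end - middle}')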

@cfhowes
Author

cfhowes commented May 10, 2020

Thanks for the info, here are my results:

The cal file:

    $ ls -la maycal.ics
    -rw-rw-r-- 1 cfhowes cfhowes 5551179 May 10 13:28 maycal.ics
    $ more maycal.ics | wc -l
    173189

v0.5:

    print(f'begin string parse {datetime.datetime.now()}')
    cal = parse.string_to_container(cal_data_str)
    print(f'end string parse {datetime.datetime.now()}')

    print(f'begin Calendar parse {datetime.datetime.now()}')
    cal = ics.Calendar(cal_data_str)
    print(f'end Calendar parse {datetime.datetime.now()}')

Output:

    begin string parse 2020-05-10 13:38:18.434401
    end string parse 2020-05-10 13:38:19.102300
    begin Calendar parse 2020-05-10 13:38:19.102346
    end Calendar parse 2020-05-10 13:38:23.615815

v0.6:

    print(f'begin string parse {datetime.datetime.now()}')
    cal = calendar_string_to_containers(cal_data_str)
    print(f'end string parse {datetime.datetime.now()}')

    print(f'begin Calendar parse {datetime.datetime.now()}')
    cal = ics.Calendar(cal_data_str)
    print(f'end Calendar parse {datetime.datetime.now()}')

Output:

    begin string parse 2020-05-10 13:41:12.333714
    end string parse 2020-05-10 13:43:11.287366
    begin Calendar parse 2020-05-10 13:43:11.287407
    end Calendar parse 2020-05-10 13:45:14.424139

v0.7:

    print(f'begin string parse {datetime.datetime.now()}')
    cal = calendar_string_to_containers(cal_data_str)
    print(f'end string parse {datetime.datetime.now()}')

    print(f'begin Calendar parse {datetime.datetime.now()}')
    cal = ics.Calendar(cal_data_str)
    print(f'end Calendar parse {datetime.datetime.now()}')

Output:

    begin string parse 2020-05-10 13:46:19.980577
    end string parse 2020-05-10 13:48:08.468203
    begin Calendar parse 2020-05-10 13:48:08.468254
    end Calendar parse 2020-05-10 13:50:02.013652

So it looks like the TatSu parsing is what takes the time, as the calls to ics.Calendar are only 1-2 seconds longer than the calls to calendar_string_to_containers.

@N-Coder
Member

N-Coder commented May 12, 2020

Hmm, I see: TatSu indeed seems to be the one to blame. Thanks for generating all these stats.
The hope with TatSu was that it would ensure correctness, as it pretty much directly interprets the grammar given in the iCalendar RFC, but this runtime trade-off is not acceptable.
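
For reference, the grammar in question is essentially the ABNF for content lines from RFC 5545, section 3.1 (abridged):

    contentline   = name *(";" param ) ":" value CRLF
    name          = iana-token / x-name
    param         = param-name "=" param-value *("," param-value)
    param-value   = paramtext / quoted-string
    paramtext     = *SAFE-CHAR
    quoted-string = DQUOTE *QSAFE-CHAR DQUOTE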

I drafted a new, hand-written parser with minimal substring lookup and copying that should still be RFC-compliant for the next release, 0.8. I'd probably keep TatSu in the code base as a reference parser, but use the hand-written one by default if it yields the same correct results as TatSu (which will need quite some testing). If you want to try it out with 0.7, simply copy the code over and replace the tokenize_line function:

def tokenize_line(unfolded_lines):
    for nr, line in enumerate(unfolded_lines):
        # nr is the index of the unfolded line within the input
        yield ContentLineParser(line, nr).parse()

To backport the changes to 0.7, you'll probably also need the following additional imports:

import re
import attr
from typing import Iterator, Match, Union
from collections import UserString

class QuotedParamValue(UserString):
    pass

I also created a test case by scraping and combining all the Thunderbird holiday calendars, which yields a calendar with 10361 events, 3812250 bytes and 107862 lines, so roughly two thirds the size of your calendar. There's a pretty big speed difference between TatSu and the new parser on this test data:

import datetime, ics
start = datetime.datetime.now()
print(start)
ics.Calendar(open("holidays.ics").read())
end = datetime.datetime.now()
print(end)
print(end - start)

# new parser:
# 2020-05-12 18:49:11.056358
# 2020-05-12 18:49:23.552765
# 0:00:12.496407

# Tatsu:
# 2020-05-12 18:50:24.984475
# 2020-05-12 18:52:40.815331
# 0:02:15.830856

I'll see whether I can squeeze out a few more seconds with the rewritten parsing logic in 0.8. It might still take some time until that version is ready, though, and it will contain some breaking API changes, so you might want to stay on 0.5 for now if you don't feel like experimenting.

@make-github-pseudonymous-again
Contributor

@N-Coder Why does Tatsu have to be called on every line? Why not a grammar that captures the whole thing? Would that make it faster?

@N-Coder
Member

N-Coder commented May 13, 2020

I guess the issue with TatSu is not some startup-time overhead, but rather that matching different regexes, trying different rules and potentially backtracking can take a lot of time. So splitting the input first at these easy-to-find line boundaries and then passing smaller chunks into the PEG engine should make things better.
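
For context, that line-wise approach boils down to something like this (a simplified sketch; GRAMMAR is the compiled TatSu grammar and contentline its main rule):

    import re

    def parse_line_wise(txt):
        # unfold first: a line break followed by a space or tab continues the previous line
        txt = re.sub("\r?\n[ \t]", "", txt)
        for line in re.split("\r?\n", txt):
            if line:
                # each line is a separate, small input for the PEG engine
                yield GRAMMAR.parse(line, rule_name="contentline")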

To try that out, I added a new rule to the EBNF:

full = {(contentline ?"\r?\n")}+ ;

and then called TatSu on the full input string (without any line splitting or line-wise unfolding):

txt = re.sub("\r?\n[ \t]", "", txt)
ast = GRAMMAR.parse(txt, rule_name='full')

and it didn't change much. (These numbers are not fully comparable to the ones above: those were based on the work-in-progress 0.8 codebase, while the following numbers are based on 0.7.)

# full-input Tatsu
2020-05-13 10:42:21.421509
2020-05-13 10:43:36.632016
0:01:15.210507

# line-wise Tatsu
2020-05-13 10:40:41.859572
2020-05-13 10:41:58.320761
0:01:16.461189

I also "precompiled" the whole grammar and tried again, but that also only yielded very little improvement:

# tatsu --generate-parser ics/grammar/contentline.ebnf  -o ics/grammar/contentline.py
# full-input pregenerated Tatsu
2020-05-13 10:44:57.842085
2020-05-13 10:46:07.254298
0:01:09.412213

and also optimized the EBNF:

# full-input pregenerated optimized Tatsu
2020-05-13 10:53:51.698719
2020-05-13 10:54:53.630510
0:01:01.931791

With all these optimizations (which can be seen in this branch), TatSu still seems to be one order of magnitude slower than my hand-written parser. So we could open an issue in the TatSu repo and ask whether they have any further ideas for optimizations, but I'm honestly somewhat wary of those hard-to-read stack traces. Having both parsers (my hand-written one and TatSu) together with enough test data should help us ensure that both run correctly, while leaving us free to choose whichever is faster and more user-friendly in practice.

@make-github-pseudonymous-again
Contributor

@N-Coder Wow, amazing work again! I think it makes sense to keep both parsers for now. Maybe once things are more stable we can keep TatSu only for testing. What would the API look like for choosing which parser to use?
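
For instance, something along these lines (purely hypothetical; no such parser parameter exists in the ics API, this just illustrates the question):

    import ics

    ics_text = open("feed.ics").read()
    cal = ics.Calendar(ics_text)                   # default: the fast hand-written parser
    cal = ics.Calendar(ics_text, parser="tatsu")   # hypothetical opt-in to the reference parser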

@make-github-pseudonymous-again
Contributor

@N-Coder It also makes sense to ask the TatSu developers.

@N-Coder N-Coder added this to the Version 0.8 milestone May 16, 2020
@N-Coder N-Coder linked a pull request Jun 1, 2020 that will close this issue
@N-Coder
Member

N-Coder commented Oct 18, 2020

@aureooms, thanks for the feedback and sorry for the late reply.

> I think it makes sense to keep both parsers for now. Maybe once things are more stable we can keep TatSu only for testing. What would the API look like for choosing which parser to use?

Being able to choose between different parsers would add a lot of complexity on both sides, and I didn't see any clear advantage to it, so I did as you suggested: I only kept my hand-written parser for production and moved TatSu to the testing code as a reference. See my comment in the PR for details and some performance statistics. I guess we don't need to tweak TatSu any further (or discuss this with its developers), as its performance now only matters for testing anyway.
I guess once the PR is merged, we can close this issue. 🎉
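
As a rough illustration, those reference tests boil down to cross-checks like the following (a sketch only; parse_handwritten and parse_with_tatsu are placeholder names, not the actual entry points in the code base):

    import pytest

    SAMPLE_LINES = [
        'DTSTART;TZID=Europe/Berlin:20200510T180000',
        'ATTENDEE;CN="Doe, John";ROLE=REQ-PARTICIPANT:mailto:john@example.com',
        'SUMMARY:Board meeting',
    ]

    @pytest.mark.parametrize("line", SAMPLE_LINES)
    def test_handwritten_parser_matches_tatsu(line):
        # placeholder entry points: both parsers must produce the same ContentLine
        assert parse_handwritten(line) == parse_with_tatsu(line)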

@make-github-pseudonymous-again
Contributor

> Being able to choose between different parsers would add a lot of complexity on both sides, and I didn't see any clear advantage to it, so I did as you suggested: I only kept my hand-written parser for production and moved TatSu to the testing code as a reference.

Good.

> I guess we don't need to tweak TatSu any further (or discuss this with its developers), as its performance now only matters for testing anyway.

Yes.

> I guess once the PR is merged, we can close this issue.

Yes.

@C4ptainCrunch
Member

This should have been closed automatically when #248 was merged.

I don't know why it did not happen. Closing it manually now.
