V0.7 is much slower to parse a large calendar feed than V0.5 #244
Comments
Can you check if the same happens with v0.6? That is where TatSu parsing was introduced. See 6b71a49
Could you check how long calendar_string_to_containers takes when you pass it the …
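For reference, a call like that can be timed with a small stopwatch helper. This is a sketch; the import path in the comment reflects the 0.7 layout as far as I can tell and may differ between versions:

```python
import time

def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Assumed usage against ics 0.7 (module path may vary between versions):
# from ics.grammar.parse import calendar_string_to_containers
# containers, seconds = time_call(calendar_string_to_containers, ical_text)
# print("parsed in %.2fs" % seconds)
```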
Thanks for the info, here are my results for the cal file:
v0.5
v0.6
v0.7
So it looks like it's the TatSu parsing that is taking the time, as the calls to …
Hmm, I see, TatSu indeed seems to be the one to blame. Thanks for generating all these stats. I drafted a new, hand-written parser with minimal substring lookup and copying that should still be RFC-compliant for the next release, 0.8. I'd probably let TatSu stay in the code base as a reference parser, but use the hand-written one by default if it still yields the same correct results as TatSu (which will need quite some testing). If you want to try it out with 0.7, simply copy the code over and replace the …
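The hand-written parser itself isn't shown in the thread; as a rough illustration of the approach (a single pass with minimal copying instead of a PEG engine), an already-unfolded RFC 5545 content line can be split like this. All names below are hypothetical, not the actual 0.8 code, and the sketch ignores double-quoted parameter values that may contain ':' or ';':

```python
def parse_content_line(line):
    """Split an unfolded RFC 5545 content line, e.g.
    'NAME;PARAM=a,b:value', into (name, params, value)
    without regexes or backtracking.  Hypothetical sketch;
    quoted parameter values are not handled here."""
    colon = line.find(":")  # the value starts after the first colon
    if colon < 0:
        raise ValueError("no ':' in content line: %r" % line)
    head, value = line[:colon], line[colon + 1:]
    name, _, param_str = head.partition(";")
    params = {}
    for chunk in param_str.split(";") if param_str else []:
        key, _, val = chunk.partition("=")
        params[key] = val.split(",")  # parameter values may be comma-separated lists
    return name, params, value
```

The real parser additionally has to respect quoting rules, which is where most of the remaining complexity lives.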
To backport the changes to 0.7, you'll probably also need the following additional imports:
I also created a test case from scraping and combining all Thunderbird holiday calendars, which yields a calendar with 10,361 events, 3,812,250 bytes and 107,862 lines, so roughly two-thirds of your calendar. There's a pretty big speed difference between TatSu and the new parser with this test data:
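For anyone without access to such a scraped calendar, a synthetic feed of comparable shape can be generated for benchmarking. This is a hypothetical sketch, not the Thunderbird test data:

```python
def make_calendar(n_events):
    """Build a minimal synthetic iCalendar string with n_events VEVENTs.
    Real feeds carry many more properties per event, so absolute timings
    will differ, but it is enough to compare parsers against each other."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//bench//EN"]
    for i in range(n_events):
        lines += [
            "BEGIN:VEVENT",
            "UID:event-%d@example.com" % i,
            "DTSTART:20200101T%02d0000Z" % (i % 24),
            "SUMMARY:Event %d" % i,
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"
```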
I'll see whether I can squeeze out a few more seconds with the rewritten parsing logic in 0.8. But it might still take some time until that version is ready and it will contain some breaking API changes, so you might want to wait for that and stay with 0.5 if you don't feel like experimenting.
@N-Coder Why does TatSu have to be called on every line? Why not a grammar that captures the whole thing? Would that make it faster?
I guess the issue with TatSu is not some start-up time overhead, but more that matching different regexes, trying different rules, and potential backtracking can take a lot of time. So splitting the input first at these easy-to-find line boundaries and then passing smaller chunks into the PEG engine should make it better. To try that out, I added a new rule to the ebnf:
and then called Tatsu on the full input string (without any line splitting or the line-wise unfolding):
and it didn't really change a thing (the numbers are not fully comparable to the ones above: those were based on the work-in-progress 0.8 codebase, while the following numbers are based on 0.7):
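The line splitting mentioned above is cheap because the boundaries are easy to spot: per RFC 5545, a line break followed by a space or tab continues the previous line ("folding"), so unfolding is a single linear pass. An illustrative sketch, not the actual ics.py code:

```python
def unfold_lines(text):
    """RFC 5545 line unfolding: a line break followed by one space or
    tab continues the previous line; the break and the single leading
    whitespace character are removed.  Illustrative sketch only."""
    out = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and out:
            out[-1] += raw[1:]  # continuation: drop the one folding character
        else:
            out.append(raw)
    return out
```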
I also "precompiled" the whole grammar and tried again, but that also only yielded very little improvement:
and also optimized the ebnf:
With all these optimizations (which can be seen in this branch), TatSu still seems to be one order of magnitude slower than my hand-written parser. So we could open an issue in the TatSu repo and ask them if they have any further ideas for optimizations, but I'm honestly somewhat wary of those hard-to-read stack traces. Having both parsers (my hand-written one and TatSu) together with enough test data should help us ensure that both run correctly, while we are then free to choose whichever is faster and more user-friendly in practice.
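Keeping both parsers makes a simple cross-checking harness possible: run every parser over the same documents and fail loudly on any disagreement. A hypothetical sketch of such a test helper (not the actual ics.py test code):

```python
def check_equivalence(parsers, documents):
    """Run every parser over every document and assert that all of them
    produce the same result.  The first parser's output is treated as
    the reference."""
    for doc in documents:
        results = [parse(doc) for parse in parsers]
        reference = results[0]
        for other in results[1:]:
            if other != reference:
                raise AssertionError("parsers disagree on %r" % doc[:40])
```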
@N-Coder Wow, amazing work again! I think it makes sense to keep both parsers for now. Maybe when things are more stable we can keep it only for testing. What would the API look like to choose which parser to use?
@N-Coder Also makes sense to ask the TatSu developers.
@aureooms, thanks for the feedback and sorry for the late reply.
Being able to choose between different parsers would add a lot of complexity on both sides, and I didn't see any clear advantage to it, so I did as you suggested and only kept my hand-written parser for production and moved TatSu to the testing code as a reference. See my comment in the PR for details and some performance statistics. I guess we now don't need to tweak TatSu any further (or discuss this with its developers), as its performance now only matters for testing anyway.
Good.
Yes.
Yes.
This should have been closed automatically when #248 was merged. I don't know why it did not happen. Closing it manually now. |
I have the following code:
I recently tried upgrading from 0.5 to 0.7 but found that it was much slower and eats a ton more CPU. See my print logs, taken a few minutes apart on the same feed and the same computer.
V0.7 timing:
V0.5 timing:
I am using Python 3.6.8 on Ubuntu 16.04.
Due to privacy concerns I'm not going to include the iCal feed, but it has about 15,000 events in it. I am happy to run some tests or try things out if I get pointed in the right direction.