Slow parsing on some filings #56
This is my suggestion: parsing time is 1.2s!
Currently py-xbrl uses ElementTree for parsing XML. At that time I deliberately decided against BeautifulSoup for two reasons:
However, I will take a look at how you achieved the speed-up in parsing time later this week.
Additionally, I do not really understand why you are only searching for the
From what I have read, lxml is the fastest C-based parser. There is also an ElementTree-compatible API for lxml: https://lxml.de/tutorial.html The current ElementTree implementation uses Python-based regexp; based on profiling, that's why it's so slow. I think the speed also comes from not parsing the HTML 'body' (which is huge), since as far as I know there is no ixbrl in it.
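To make the "ElementTree API for lxml" point concrete, here is a minimal, self-contained sketch; the namespace URI is the real inline-XBRL one, but the sample document and element values are made up for illustration and are not taken from py-xbrl:

```python
# lxml exposes an API largely compatible with xml.etree.ElementTree,
# so switching parsers can be close to a drop-in change.
from lxml import etree

# Tiny illustrative document with one inline-XBRL fact.
xml = (b"<root xmlns:ix='http://www.xbrl.org/2013/inlineXBRL'>"
       b"<ix:nonFraction name='Revenue'>42</ix:nonFraction></root>")
root = etree.fromstring(xml)

# C-level tree traversal instead of Python-level regexp scanning.
for el in root.iter('{http://www.xbrl.org/2013/inlineXBRL}nonFraction'):
    print(el.get('name'), el.text)  # Revenue 42
```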
Here is a short explanation of taxonomies and linkbases:
Still very fast, and should be fully compatible with the current code base:
I made some initial progress with integrating lxml, see branch. I got the namespace map and etree root, but it fails at
Another thing I have noticed with the non-optimized etree is that RAM usage jumps up to 500-1000 MB while parsing.
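Not from the thread itself, but a standard mitigation for that RAM spike: both the stdlib ElementTree and lxml support incremental parsing via iterparse with element clearing, which keeps memory roughly flat on large documents. A minimal stdlib sketch (the document here is synthetic):

```python
# Incremental parsing: process each element on its "end" event and
# clear it immediately, so the full tree is never held in memory.
import io
import xml.etree.ElementTree as ET

# Synthetic "large" document with 10000 repeated elements.
doc = io.BytesIO(b"<root>" + b"<item>x</item>" * 10000 + b"</root>")

count = 0
for event, el in ET.iterparse(doc, events=("end",)):
    if el.tag == "item":
        count += 1
        el.clear()  # free the element's children/text as we go
print(count)  # 10000
```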
I have done the integration of lxml. Turns out it isn't the bottleneck :( Is there any way we could eliminate or reduce the number of calls to it? Can't we replace recursion with 136 flat calls?
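To illustrate the recursion-versus-flat-calls idea (the tag names below are hypothetical, not py-xbrl's): ElementTree's iter() walks the whole subtree in a single flat loop, so a recursive Python descent can often be replaced by one iterator pass:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>")

# Recursive descent: one Python call frame per element.
def count_recursive(el, tag):
    return (el.tag == tag) + sum(count_recursive(ch, tag) for ch in el)

# Flat version: a single loop over the built-in subtree iterator.
def count_flat(root, tag):
    return sum(1 for el in root.iter(tag))

print(count_recursive(root, "c"), count_flat(root, "c"))  # 3 3
```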
MSFT filings parse very slowly, e.g. parsing just one of them takes 11s at 100% CPU.
The ixbrl inside the html seems to be valid xml; can't we just cut it out, parse it on its own, and never use regexp?
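A rough sketch of the cut-it-out idea, assuming the inline-XBRL island carries its own namespace declaration so it parses as standalone XML; the markup and markers below are invented for illustration:

```python
# Slice the ixbrl island out of the (huge) HTML body and parse only
# that fragment, instead of regexp-scanning the whole document.
import xml.etree.ElementTree as ET

html = ("<html><body><p>huge narrative text...</p>"
        "<ix:header xmlns:ix='http://www.xbrl.org/2013/inlineXBRL'>"
        "<ix:hidden>facts</ix:hidden></ix:header></body></html>")

start = html.index("<ix:header")
end = html.index("</ix:header>") + len("</ix:header>")
fragment = html[start:end]       # the ixbrl island only
root = ET.fromstring(fragment)   # valid XML on its own
print(root.tag)  # {http://www.xbrl.org/2013/inlineXBRL}header
```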
There are 2120074 regexp calls; it looks like every tag is searched this way.
Downloading the same file and parsing it with bs4 takes only 4s (3s if the lxml mode is used):
import requests
from bs4 import BeautifulSoup

r = requests.get(filing_url)  # filing_url: the filing document URL (hypothetical name, elided above)
soup = BeautifulSoup(r.text, 'html.parser')  # or 'lxml' for the faster C-based mode
python3 -m cProfile -s tottime xbrl_small_test.py > prof.txt
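For reference, the same kind of profile can be collected programmatically; this sketch profiles a toy regexp-heavy loop (a stand-in for the real workload) and prints functions sorted by total time, like the -s tottime flag above:

```python
import cProfile
import io
import pstats
import re

pr = cProfile.Profile()
pr.enable()
for _ in range(1000):
    # Stand-in workload: regexp scanning, the hot spot in the real profile.
    re.findall(r"<(\w+)", "<html><body><ix:nonFraction name='x'>")
pr.disable()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("tottime").print_stats(3)
print(s.getvalue())  # re.findall should appear among the top entries
```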
Profiling result
The call stack to get to the bottleneck: