
Slow parsing on some filings #56

Closed
mrx23dot opened this issue Aug 5, 2021 · 10 comments

@mrx23dot
Contributor

mrx23dot commented Aug 5, 2021

MSFT filings parse very slowly, e.g. parsing just one of them takes 11 seconds at 100% CPU.

The iXBRL embedded in the HTML looks like valid XML; can't we just cut it out, parse it directly, and avoid regexes entirely?
There are 2,120,074 regex calls, so it looks like every tag is searched this way.
Downloading the same file and parsing it with bs4 takes only 4 seconds (3 s if the lxml parser is used):
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # r: the requests response for the filing URL below

python3 -m cProfile -s tottime xbrl_small_test.py > prof.txt

from xbrl.cache import HttpCache
from xbrl.instance import XbrlInstance, XbrlParser

dir = 'cache'
cache = HttpCache(dir)
# !Replace the dummy header with your information! SEC EDGAR requires you to disclose information about your bot! (https://www.sec.gov/privacy.htm#security)
cache.set_headers({'From': 'test@gmail.com', 'User-Agent': 'revenue extractor v1.0'})
cache.set_connection_params(delay=1000/9.9, retries=5, backoff_factor=0.8, logs=True)

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
# same as zip:  https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/0001564590-21-002316-xbrl.zip

inst = XbrlParser(cache).parse_instance(url)

Profiling result

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2120074    5.464    0.000    5.464    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slowest part, 5.5 seconds
  1060027    1.244    0.000    8.874    0.000 uri_helper.py:58(compare_uri)
  2120054    0.861    0.000    7.029    0.000 re.py:214(findall)
531164/2886    0.810    0.000    9.684    0.003 taxonomy.py:170(get_taxonomy)
  2120160    0.703    0.000    0.728    0.000 re.py:286(_compile)
  2160290    0.622    0.000    0.622    0.000 {method 'split' of 'str' objects}
       31    0.193    0.006    0.193    0.006 {method '_parse_whole' of 'xml.etree.ElementTree.XMLParser' objects}
        1    0.139    0.139    0.323    0.323 xml_parser.py:9(parse_file)
      316    0.136    0.000    0.136    0.000 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
     25/1    0.127    0.005    2.553    2.553 taxonomy.py:219(parse_taxonomy)

The call stack to get to the bottleneck:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   10.646   10.646 xbrl_small_test.py:2(<module>)  <-- entry
        1    0.000    0.000   10.318   10.318 instance.py:644(parse_instance)
        1    0.024    0.024   10.318   10.318 instance.py:351(parse_ixbrl_url)
        1    0.016    0.016   10.293   10.293 instance.py:366(parse_ixbrl)
531164/2886    0.799    0.000    9.478    0.003 taxonomy.py:170(get_taxonomy)
  1060027    1.215    0.000    8.679    0.000 uri_helper.py:58(compare_uri)
  2120054    0.847    0.000    6.893    0.000 re.py:214(findall)
  2120074    5.345    0.000    5.345    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slow part

@mrx23dot
Contributor Author

mrx23dot commented Aug 6, 2021

This is my suggestion; parsing time is 1.2 s!

# -*- coding: utf-8 -*-

import requests
requests.packages.urllib3.disable_warnings()

# pip install cchardet lxml beautifulsoup4 requests
# speed up BeautifulSoup simply by installing cchardet
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent":"Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
#print(resp.text)

# requires a utf-8 declaration in the file header
# only parse the parts we need, for speed
target_tags = SoupStrainer('ix:header')
soup = BeautifulSoup(resp.text, 'lxml', parse_only=target_tags)

for i in soup.find_all('xbrli:startDate'.lower()):
  print(i.text)

@manusimidt manusimidt self-assigned this Aug 9, 2021
@manusimidt
Owner

manusimidt commented Aug 10, 2021

Currently py-xbrl uses ElementTree for parsing XML. At the time I deliberately decided against BeautifulSoup for two reasons:

  • It is way slower than lxml or eTree ([1], [2])
  • I did not want py-xbrl to depend on any third-party packages beyond the default Python 3 packages.

However, I will take a look later this week at how you achieved the speed-up in parsing time.
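
For context, here is a minimal stdlib-only sketch along the lines py-xbrl currently takes (illustrative only, not the library's actual code; it assumes a local, well-formed XHTML copy of the instance document under a hypothetical filename):

# Illustrative stdlib-only parse; assumes the SEC iXBRL document is well-formed
# XHTML and has been saved locally as 'msft-10q_20201231.htm' (hypothetical path).
import xml.etree.ElementTree as ET

IX_NS = '{http://www.xbrl.org/2013/inlineXBRL}'  # inline XBRL 1.1 namespace

root = ET.parse('msft-10q_20201231.htm').getroot()
for fact in root.iter(IX_NS + 'nonFraction'):
    print(fact.get('name'), fact.text)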

@manusimidt
Owner

Additionally, I do not really understand why you are only searching for the ix:header XML element. The entire document can contain facts (ix:nonFraction) at any level and on any line. In your code snippet you simply ignore all facts outside the ix:header element.

@mrx23dot
Contributor Author

From what I've read, lxml is the fastest C-based parser.
BeautifulSoup is just an API around lxml (and other parsers).

There is also an eTree-style API in lxml: https://lxml.de/tutorial.html
I'm not sure whether it allows restricting parsing to specific tags.

Based on the profiling, the current eTree-based implementation spends its time in Python-level regexes; that's why it's so slow.

I think the speed also comes from not parsing the HTML 'body' (which is huge), since as far as I know there is no iXBRL in it.

@manusimidt
Owner

manusimidt commented Aug 10, 2021

I think the speed also comes from not parsing the HTML 'body' (which is huge), since as far as I know there is no iXBRL in it.

That's wrong. The majority of the XBRL facts are in the body of the HTML document. The ix:hidden element only contains facts that should not be displayed in the HTML report visible to the normal user. In the case of SEC submissions, the hidden facts are usually the ones tagged with the dei taxonomy; they contain meta information about the document itself.

All other financial XBRL facts (like those from the balance sheet) are scattered across the entire HTML document!
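
A quick way to check this (illustrative only, not py-xbrl's code) is to count how many ix:nonFraction facts sit inside ix:header versus in the whole document; resp is assumed to be the requests response for the MSFT 10-Q URL from the earlier snippet:

# Rough check, reusing the requests/bs4 setup from the earlier snippet.
# resp: requests response for the MSFT 10-Q URL above (assumption).
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, 'lxml')          # the lxml HTML parser lowercases tag names
all_facts = soup.find_all('ix:nonfraction')
header = soup.find('ix:header')
in_header = header.find_all('ix:nonfraction') if header else []
print(len(all_facts), 'facts in total,', len(in_header), 'of them inside ix:header')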

Additionally, you have to consider that py-xbrl not only parses the instance document but also all taxonomy schemas and linkbases that the report depends on.

An example:
If you give the parser the following Instance Document:
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm

The parser will download and parse the following files:
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd
http://www.xbrl.org/2003/xbrl-instance-2003-12-31.xsd
http://www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd
http://www.xbrl.org/2003/xl-2003-12-31.xsd
http://www.xbrl.org/2003/xlink-2003-12-31.xsd
http://www.xbrl.org/2005/xbrldt-2005.xsd
https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd
http://www.xbrl.org/dtr/type/nonNumeric-2009-12-16.xsd
https://xbrl.sec.gov/currency/2020/currency-2020-01-31.xsd
https://xbrl.sec.gov/dei/2019/dei-2019-01-31.xsd
http://www.xbrl.org/dtr/type/numeric-2009-12-16.xsd
https://xbrl.sec.gov/exch/2020/exch-2020-01-31.xsd
http://www.xbrl.org/lrr/arcrole/factExplanatory-2009-12-16.xsd
http://www.xbrl.org/lrr/role/negated-2009-12-16.xsd
http://www.xbrl.org/lrr/role/net-2009-12-16.xsd
https://xbrl.sec.gov/naics/2017/naics-2017-01-31.xsd
https://xbrl.sec.gov/sic/2020/sic-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-2020-01-31.xsd
http://www.xbrl.org/2006/ref-2006-02-27.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-types-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-roles-2020-01-31.xsd
https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-types-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-roles-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-eedm-def-2020-01-31.xml
http://xbrl.fasb.org/srt/2020/elts/srt-eedm1-def-2020-01-31.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_cal.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_def.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_lab.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_pre.xml

Your code example above does not touch these taxonomy schemas and linkbases. This is also one reason why your code executes much faster.
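
As a rough sketch of why so many files get pulled in: each schema references further schemas via xsd:import/schemaLocation, and the parser has to follow those references recursively (illustrative only, not py-xbrl's actual code; the User-Agent value is a placeholder):

# Illustrative: list the schemas imported by the MSFT entry-point schema.
# Each of those schemas can import further schemas, hence the recursion.
from lxml import etree
import requests

XSD_NS = 'http://www.w3.org/2001/XMLSchema'
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd'
resp = requests.get(url, headers={'User-Agent': 'example test@example.com'})  # placeholder identification
root = etree.XML(resp.content)
for imp in root.findall('{%s}import' % XSD_NS):
    print(imp.get('namespace'), '->', imp.get('schemaLocation'))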

@manusimidt
Owner

Here is a short explanation of taxonomies and linkbases:
https://manusimidt.dev/2021-07/xbrl-explained

@mrx23dot
Contributor Author

Still very fast, and should be fully compatible with the current code base:

from lxml import etree
import requests
requests.packages.urllib3.disable_warnings()

# download
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent":"Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
file_in_bytearray = bytes(resp.text, encoding='utf-8')

# parse
root = etree.XML(file_in_bytearray)
for i in root:
  print(i)
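
As a follow-up, namespaced lookups with lxml can use either Clark notation ({uri}localname) or a prefix mapping taken from root.nsmap (a sketch, assuming the xbrli prefix is declared on the document root, as it usually is in SEC iXBRL filings):

# Sketch: namespace-aware lookup with lxml, reusing `root` from the snippet above.
# Assumes the xbrli prefix is declared on the document root.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix}  # findall() rejects a None prefix
for start_date in root.findall('.//xbrli:startDate', namespaces=ns):
  print(start_date.text)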

@mrx23dot
Contributor Author

mrx23dot commented Aug 13, 2021

I made some initial progress with integrating lxml, see branch
https://github.com/mrx23dot/py-xbrl/tree/lxml

I got the namespace map and the etree root, but it fails at
root.find('.//{}schemaRef'.format(LINK_NS))
which returns None.
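
A small sanity check (illustrative, not a fix) is to list the namespaced tags lxml actually sees, to confirm which namespace URI the lookup should use:

# Sanity check: print every element whose local name is schemaRef, together
# with the namespace URI lxml resolved for it. `root` is the lxml root from the branch.
for elem in root.iter():
  if isinstance(elem.tag, str) and elem.tag.endswith('}schemaRef'):
    print(elem.tag)  # e.g. {http://www.xbrl.org/2003/linkbase}schemaRef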

Another thing I have noticed with the non-optimized etree is that RAM usage jumps up to 500-1000 MB while parsing.

@mrx23dot
Contributor Author

mrx23dot commented Aug 18, 2021

I have finished integrating lxml. It turns out lxml isn't the bottleneck :(
It's the simple compare_uri(uri1: str, uri2: str) function.

Is there any way we could eliminate or reduce the number of calls to it?
It's called half a million times for a single filing, and each call runs 2 regexes.
99.9% of the time it returns False, and it is only called with ~136 distinct argument values.
It is called by get_taxonomy(url), which is a deep recursion.

Can't we replace the recursion with 136 flat calls?
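
One possible mitigation (a sketch only, not the fix that was eventually merged): compare_uri is a pure function, so its result can be memoized; with only ~136 distinct argument pairs a cache would collapse the roughly one million regex-backed calls down to ~136:

# Sketch: memoize the pure compare_uri helper so repeated (uri1, uri2) pairs
# skip the regex work. compare_uri refers to the existing function in
# uri_helper.py; the import path and wrapper name are hypothetical.
from functools import lru_cache

from uri_helper import compare_uri  # hypothetical import path

@lru_cache(maxsize=None)
def compare_uri_cached(uri1: str, uri2: str) -> bool:
    # Identical result, but repeated argument pairs skip the regex work entirely.
    return compare_uri(uri1, uri2)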

@mrx23dot
Contributor Author

mrx23dot commented Aug 26, 2021

Done in
#70
#68

Final result: 11 seconds -> 0.907 seconds
