
Slow parsing on some filings #56

Closed
mrx23dot opened this issue Aug 5, 2021 · 10 comments

@mrx23dot
Contributor

mrx23dot commented Aug 5, 2021

MSFT filings parse very slowly, e.g. parsing just one of them takes 11 seconds at 100% CPU.

The iXBRL embedded in the HTML looks like valid XML; can't we just cut it out, parse it directly, and avoid regexes entirely?
There are 2,120,074 regex calls, so it looks like every tag is searched this way.
Downloading the same file and parsing it with bs4 takes only 4 seconds (3 s if the lxml parser is used):
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')  # r: the requests response for the filing URL below

python3 -m cProfile -s tottime xbrl_small_test.py > prof.txt

from xbrl.cache import HttpCache
from xbrl.instance import XbrlInstance, XbrlParser

dir = 'cache'
cache = HttpCache(dir)
# !Replace the dummy header with your information! SEC EDGAR requires you to disclose information about your bot! (https://www.sec.gov/privacy.htm#security)
cache.set_headers({'From': 'test@gmail.com', 'User-Agent': 'revenue extractor v1.0'})
cache.set_connection_params(delay=1000/9.9, retries=5, backoff_factor=0.8, logs=True)

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
# same as zip:  https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/0001564590-21-002316-xbrl.zip

inst = XbrlParser(cache).parse_instance(url)

Profiling result

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2120074    5.464    0.000    5.464    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slowest part, 5.5 seconds
  1060027    1.244    0.000    8.874    0.000 uri_helper.py:58(compare_uri)
  2120054    0.861    0.000    7.029    0.000 re.py:214(findall)
531164/2886    0.810    0.000    9.684    0.003 taxonomy.py:170(get_taxonomy)
  2120160    0.703    0.000    0.728    0.000 re.py:286(_compile)
  2160290    0.622    0.000    0.622    0.000 {method 'split' of 'str' objects}
       31    0.193    0.006    0.193    0.006 {method '_parse_whole' of 'xml.etree.ElementTree.XMLParser' objects}
        1    0.139    0.139    0.323    0.323 xml_parser.py:9(parse_file)
      316    0.136    0.000    0.136    0.000 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
     25/1    0.127    0.005    2.553    2.553 taxonomy.py:219(parse_taxonomy)

The call stack to get to the bottleneck:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   10.646   10.646 xbrl_small_test.py:2(<module>)  <-- entry
        1    0.000    0.000   10.318   10.318 instance.py:644(parse_instance)
        1    0.024    0.024   10.318   10.318 instance.py:351(parse_ixbrl_url)
        1    0.016    0.016   10.293   10.293 instance.py:366(parse_ixbrl)
531164/2886    0.799    0.000    9.478    0.003 taxonomy.py:170(get_taxonomy)
  1060027    1.215    0.000    8.679    0.000 uri_helper.py:58(compare_uri)
  2120054    0.847    0.000    6.893    0.000 re.py:214(findall)
  2120074    5.345    0.000    5.345    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}  <-- slow part

@mrx23dot
Contributor Author

mrx23dot commented Aug 6, 2021

This is my suggestion; parsing time is 1.2 s!

# -*- coding: utf-8 -*-

import requests
requests.packages.urllib3.disable_warnings()

# pip install cchardet lxml beautifulsoup4 requests
# speed up BeautifulSoup simply by installing cchardet
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent":"Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
#print(resp.text)

# requires a utf-8 declaration in the file header
# only parse the parts we need, for speed
target_tags = SoupStrainer('ix:header')
soup = BeautifulSoup(resp.text, 'lxml', parse_only=target_tags)

for i in soup.find_all('xbrli:startDate'.lower()):
  print(i.text)

@manusimidt manusimidt self-assigned this Aug 9, 2021
@manusimidt
Owner

manusimidt commented Aug 10, 2021

Currently py-xbrl uses ElementTree for parsing XML. At the time I deliberately decided against BeautifulSoup for two reasons:

  • It is way slower than lxml or eTree ([1], [2])
  • I did not want py-xbrl to depend on any third-party packages beyond the default Python 3 packages.

However, I will take a look later this week at how you achieved the speed-up in parsing time.
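
For context, here is a minimal stdlib-only sketch along the lines py-xbrl currently takes (illustrative only, not the library's actual code; it assumes a local, well-formed XHTML copy of the instance document under a hypothetical filename):

# Illustrative stdlib-only parse; assumes the SEC iXBRL document is well-formed
# XHTML and has been saved locally as 'msft-10q_20201231.htm' (hypothetical path).
import xml.etree.ElementTree as ET

IX_NS = '{http://www.xbrl.org/2013/inlineXBRL}'  # inline XBRL 1.1 namespace

root = ET.parse('msft-10q_20201231.htm').getroot()
for fact in root.iter(IX_NS + 'nonFraction'):
    print(fact.get('name'), fact.text)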

@manusimidt
Owner

Additionally, I do not really understand why you are only searching for the ix:header XML element. The entire document can contain facts (ix:nonFraction) at any level and on any line. In your code snippet you simply ignore all facts outside the ix:header element.

@mrx23dot
Contributor Author

From what I've read, lxml is the fastest C-based parser.
BeautifulSoup is just an API around lxml (and other parsers).

There is also an eTree-style API in lxml: https://lxml.de/tutorial.html
I'm not sure whether it allows restricting parsing to specific tags.

Based on the profiling, the current eTree-based implementation spends its time in Python-level regexes; that's why it's so slow.

I think the speed also comes from not parsing the HTML 'body' (which is huge), since as far as I know there is no iXBRL in it.

@manusimidt
Owner

manusimidt commented Aug 10, 2021

I think the speed also comes from not parsing the HTML 'body' (which is huge), since as far as I know there is no iXBRL in it.

That's wrong. The majority of the XBRL facts are in the body of the HTML document. The ix:hidden element only contains facts that should not be displayed in the HTML report visible to the normal user. In the case of SEC submissions, the hidden facts are usually the ones tagged with the dei taxonomy; they contain meta information about the document itself.

All other financial XBRL facts (like those from the balance sheet) are scattered across the entire HTML document!
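
A quick way to check this (illustrative only, not py-xbrl's code) is to count how many ix:nonFraction facts sit inside ix:header versus in the whole document; resp is assumed to be the requests response for the MSFT 10-Q URL from the earlier snippet:

# Rough check, reusing the requests/bs4 setup from the earlier snippet.
# resp: requests response for the MSFT 10-Q URL above (assumption).
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, 'lxml')          # the lxml HTML parser lowercases tag names
all_facts = soup.find_all('ix:nonfraction')
header = soup.find('ix:header')
in_header = header.find_all('ix:nonfraction') if header else []
print(len(all_facts), 'facts in total,', len(in_header), 'of them inside ix:header')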

Additionally, you have to consider that py-xbrl not only parses the instance document but also all taxonomy schemas and linkbases that the report depends on.

An example:
If you give the parser the following Instance Document:
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm

The parser will download and parse the following files:
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd
http://www.xbrl.org/2003/xbrl-instance-2003-12-31.xsd
http://www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd
http://www.xbrl.org/2003/xl-2003-12-31.xsd
http://www.xbrl.org/2003/xlink-2003-12-31.xsd
http://www.xbrl.org/2005/xbrldt-2005.xsd
https://xbrl.sec.gov/country/2020/country-2020-01-31.xsd
http://www.xbrl.org/dtr/type/nonNumeric-2009-12-16.xsd
https://xbrl.sec.gov/currency/2020/currency-2020-01-31.xsd
https://xbrl.sec.gov/dei/2019/dei-2019-01-31.xsd
http://www.xbrl.org/dtr/type/numeric-2009-12-16.xsd
https://xbrl.sec.gov/exch/2020/exch-2020-01-31.xsd
http://www.xbrl.org/lrr/arcrole/factExplanatory-2009-12-16.xsd
http://www.xbrl.org/lrr/role/negated-2009-12-16.xsd
http://www.xbrl.org/lrr/role/net-2009-12-16.xsd
https://xbrl.sec.gov/naics/2017/naics-2017-01-31.xsd
https://xbrl.sec.gov/sic/2020/sic-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-2020-01-31.xsd
http://www.xbrl.org/2006/ref-2006-02-27.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-types-2020-01-31.xsd
http://xbrl.fasb.org/srt/2020/elts/srt-roles-2020-01-31.xsd
https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-types-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-roles-2020-01-31.xsd
http://xbrl.fasb.org/us-gaap/2020/elts/us-gaap-eedm-def-2020-01-31.xml
http://xbrl.fasb.org/srt/2020/elts/srt-eedm1-def-2020-01-31.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_cal.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_def.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_lab.xml
https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231_pre.xml

Your code example above does not touch these taxonomy schemas and linkbases. This is also one reason why your code executes much faster.
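
As a rough sketch of why so many files get pulled in: each schema references further schemas via xsd:import/schemaLocation, and the parser has to follow those references recursively (illustrative only, not py-xbrl's actual code; the User-Agent value is a placeholder):

# Illustrative: list the schemas imported by the MSFT entry-point schema.
# Each of those schemas can import further schemas, hence the recursion.
from lxml import etree
import requests

XSD_NS = 'http://www.w3.org/2001/XMLSchema'
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-20201231.xsd'
resp = requests.get(url, headers={'User-Agent': 'example test@example.com'})  # placeholder identification
root = etree.XML(resp.content)
for imp in root.findall('{%s}import' % XSD_NS):
    print(imp.get('namespace'), '->', imp.get('schemaLocation'))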

@manusimidt
Owner

Here is a short explanation of taxonomies and linkbases:
https://manusimidt.dev/2021-07/xbrl-explained

@mrx23dot
Contributor Author

Still very fast, and should be fully compatible with the current code base:

from lxml import etree
import requests
requests.packages.urllib3.disable_warnings()

# download
url = 'https://www.sec.gov/Archives/edgar/data/0000789019/000156459021002316/msft-10q_20201231.htm'
resp = requests.get(url, verify=False, headers={"User-Agent":"Opera browser"})
print(resp.text.count('<xbrli:startDate>'))
file_in_bytearray = bytes(resp.text, encoding='utf-8')

# parse
root = etree.XML(file_in_bytearray)
for i in root:
  print(i)
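
As a follow-up, namespaced lookups with lxml can use either Clark notation ({uri}localname) or a prefix mapping taken from root.nsmap (a sketch, assuming the xbrli prefix is declared on the document root, as it usually is in SEC iXBRL filings):

# Sketch: namespace-aware lookup with lxml, reusing `root` from the snippet above.
# Assumes the xbrli prefix is declared on the document root.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix}  # findall() rejects a None prefix
for start_date in root.findall('.//xbrli:startDate', namespaces=ns):
  print(start_date.text)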

@mrx23dot
Contributor Author

mrx23dot commented Aug 13, 2021

I made some initial progress with integrating lxml, see branch
https://github.com/mrx23dot/py-xbrl/tree/lxml

I got the namespace map and the etree root, but it fails at
root.find('.//{}schemaRef'.format(LINK_NS))
which returns None.
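
A small sanity check (illustrative, not a fix) is to list the namespaced tags lxml actually sees, to confirm which namespace URI the lookup should use:

# Sanity check: print every element whose local name is schemaRef, together
# with the namespace URI lxml resolved for it. `root` is the lxml root from the branch.
for elem in root.iter():
  if isinstance(elem.tag, str) and elem.tag.endswith('}schemaRef'):
    print(elem.tag)  # e.g. {http://www.xbrl.org/2003/linkbase}schemaRef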

Another thing I have noticed with the non-optimized etree is that RAM usage jumps up to 500-1000 MB while parsing.

@mrx23dot
Contributor Author

mrx23dot commented Aug 18, 2021

I have finished integrating lxml. It turns out lxml isn't the bottleneck :(
It's the simple compare_uri(uri1: str, uri2: str) function.

Is there any way we could eliminate or reduce the number of calls to it?
It's called half a million times for a single filing, and each call runs 2 regexes.
99.9% of the time it returns False, and it is only called with ~136 distinct argument values.
It is called by get_taxonomy(url), which is a deep recursion.

Can't we replace the recursion with 136 flat calls?
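
One possible mitigation (a sketch only, not the fix that was eventually merged): compare_uri is a pure function, so its result can be memoized; with only ~136 distinct argument pairs a cache would collapse the roughly one million regex-backed calls down to ~136:

# Sketch: memoize the pure compare_uri helper so repeated (uri1, uri2) pairs
# skip the regex work. compare_uri refers to the existing function in
# uri_helper.py; the import path and wrapper name are hypothetical.
from functools import lru_cache

from uri_helper import compare_uri  # hypothetical import path

@lru_cache(maxsize=None)
def compare_uri_cached(uri1: str, uri2: str) -> bool:
    # Identical result, but repeated argument pairs skip the regex work entirely.
    return compare_uri(uri1, uri2)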

@mrx23dot
Contributor Author

mrx23dot commented Aug 26, 2021

Done in
#70
#68

Final result: 11 seconds -> 0.907 seconds
