Parsing <TEXT> fails #62

mrx23dot · 2021-08-16T21:20:49Z

Parsing of
https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm

raises an exception in XbrlParser(cache).parse_instance(url):
not well-formed (invalid token): line 7, column 2
Other filings from the same company are most likely affected as well.

SEC's response:

Please look at the contents of the link. You will see that like every other one of the millions of HTML documents on the EDGAR site, the first six lines are document metadata in SGML, that a browser ignores. They look like this:

<DOCUMENT>
<TYPE>10-Q
<SEQUENCE>1
<FILENAME>mtcr-10q_20200930.htm
<DESCRIPTION>10-Q
<TEXT>
Programs can start parsing after the <TEXT> line and also ignore the last two lines:

</TEXT>
</DOCUMENT>
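
The advice in the SEC's response can be sketched as follows. This is a minimal illustration, not part of py-xbrl: the helper name `strip_edgar_sgml_wrapper` is hypothetical, and it assumes the wrapper tags each sit on their own line, as in the sample above. Removing the wrapper only avoids the "invalid token" error on line 7; it does not turn a plain HTML filing into an iXBRL document.

```python
def strip_edgar_sgml_wrapper(raw: str) -> str:
    """Return the payload between the <TEXT> and </TEXT> wrapper lines.

    Hypothetical helper: if no <TEXT> line is found, the input is
    returned from its first line; if no </TEXT> line is found, it is
    returned up to its last line.
    """
    lines = raw.splitlines()

    # Skip everything up to and including the first <TEXT> line.
    start = 0
    for i, line in enumerate(lines):
        if line.strip().upper() == "<TEXT>":
            start = i + 1
            break

    # Drop the closing </TEXT> line and everything after it,
    # searching from the end of the file.
    end = len(lines)
    for j in range(len(lines) - 1, start - 1, -1):
        if lines[j].strip().upper() == "</TEXT>":
            end = j
            break

    return "\n".join(lines[start:end])
```

The returned string could then be passed to an XML parser, provided the payload itself is well-formed XHTML.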

Traceback:

  File "C:\python36\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\python36\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1221, in iterator
    yield from pullparser.read_events()
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1296, in read_events
    raise event
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1268, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 2


manusimidt · 2021-08-24T09:07:47Z

Hello, please make sure to parse only documents that follow the XBRL or iXBRL specification.
The document https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm is a plain HTML document without any XBRL tagging.
To extract data with py-xbrl, use the Instance Document of this submission.

mrx23dot · 2021-08-24T09:23:54Z

Ah, so this is just a different error, indicating a non-iXBRL file.

I could add a pre-check in the lxml implementation that would filter this out.
Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid?
Or the check could just print a console warning that the file is possibly invalid.
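
A rough sketch of such a pre-check, under stated assumptions: instead of searching for the lowercase text "ixbrl", it looks for the Inline XBRL namespace URIs (the 2013 one from Inline XBRL 1.1 and the older 2008 one). The function names are hypothetical and not part of py-xbrl, and this is only a heuristic; it says nothing about whether the document is valid iXBRL.

```python
import logging

logger = logging.getLogger(__name__)

# Namespace URIs declared by Inline XBRL 1.1 and 1.0 documents.
IX_NAMESPACES = (
    "http://www.xbrl.org/2013/inlineXBRL",
    "http://www.xbrl.org/2008/inlineXBRL",
)


def looks_like_ixbrl(markup: str) -> bool:
    """Heuristic: True if the markup mentions an Inline XBRL namespace."""
    return any(ns in markup for ns in IX_NAMESPACES)


def warn_if_not_ixbrl(markup: str, source: str) -> bool:
    """Log a warning for files that are probably not iXBRL.

    Returns the result of the heuristic so callers can decide
    whether to skip the file or try parsing it anyway.
    """
    ok = looks_like_ixbrl(markup)
    if not ok:
        logger.warning("%s does not declare an Inline XBRL namespace", source)
    return ok
```

Checking the namespace rather than a bare substring avoids false positives from pages that merely mention the word in their text.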

manusimidt · 2021-09-01T19:03:41Z

I am not entirely sure what you mean by the following statement:

Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid?

For an iXBRL instance document to be valid, it must comply with the iXBRL specification.

This includes many validation rules.
See for example the validation rules for the ix:nonFraction elements:
https://www.xbrl.org/specification/inlinexbrl-part1/rec-2013-11-18/inlinexbrl-part1-rec-2013-11-18.html#d1e5415

manusimidt · 2021-09-01T19:04:10Z

But yes, you are right that it would be nice if the parser could check whether a document contains valid XBRL tags.

manusimidt closed this as completed Aug 24, 2021
