Parsing <TEXT> fails #62

Closed
mrx23dot opened this issue Aug 16, 2021 · 4 comments

@mrx23dot
Contributor

mrx23dot commented Aug 16, 2021

Parsing
https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm

raises an exception in XbrlParser(cache).parse_instance(url):
not well-formed (invalid token): line 7, column 2
Most likely other filings from the same company are affected as well.

SEC's response:

Please look at the contents of the link. You will see that like every other one of the millions of HTML documents on the EDGAR site, the first six lines are document metadata in SGML, that a browser ignores. They look like this:

<DOCUMENT>
<TYPE>10-Q
<SEQUENCE>1
<FILENAME>mtcr-10q_20200930.htm
<DESCRIPTION>10-Q
<TEXT>

Programs can start parsing after the <TEXT> line and also ignore the last two lines:

</TEXT>
</DOCUMENT>
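
The SEC's suggestion above can be sketched in a few lines of Python. This is only an illustration, not part of py-xbrl: the function name `strip_sgml_wrapper` and the sample string are hypothetical, and it assumes the `<TEXT>` and `</TEXT>` markers each sit on their own line, as in the excerpt above.

```python
def strip_sgml_wrapper(raw: str) -> str:
    """Return the content between the <TEXT> and </TEXT> marker lines.

    Hypothetical helper: if no <TEXT> line is found, the input is
    returned unchanged (apart from normalised line endings).
    """
    lines = raw.splitlines()
    start, end = 0, len(lines)

    # Find the first line that is exactly the <TEXT> marker.
    for i, line in enumerate(lines):
        if line.strip().upper() == "<TEXT>":
            start = i + 1
            break

    # Search backwards for the closing </TEXT> marker.
    for i in range(len(lines) - 1, start - 1, -1):
        if lines[i].strip().upper() == "</TEXT>":
            end = i
            break

    return "\n".join(lines[start:end])


# Made-up sample shaped like the header described above.
SAMPLE = (
    "<DOCUMENT>\n<TYPE>10-Q\n<SEQUENCE>1\n<FILENAME>example.htm\n"
    "<DESCRIPTION>10-Q\n<TEXT>\n<html><body>hi</body></html>\n"
    "</TEXT>\n</DOCUMENT>\n"
)
```

Note that stripping the wrapper only removes the SGML metadata. Whether the remaining HTML is well-formed XML, or contains any iXBRL at all, is a separate question.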

Traceback:

  File "C:\python36\lib\site-packages\xbrl\instance.py", line 383, in parse_ixbrl
    root: ET = parse_file(instance_path)
  File "C:\python36\lib\site-packages\xbrl\helper\xml_parser.py", line 19, in parse_file
    for event, elem in ET.iterparse(path, events):
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1221, in iterator
    yield from pullparser.read_events()
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1296, in read_events
    raise event
  File "C:\python36\lib\xml\etree\ElementTree.py", line 1268, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 2
@manusimidt
Owner

Hello, please make sure to only parse documents that follow the XBRL or iXBRL specification.
The document https://www.sec.gov/Archives/edgar/data/0001634379/000156459020053234/mtcr-10q_20200930.htm is a normal HTML document without any XBRL stuff.
Use the Instance Document of this submission for extracting data with py-xbrl.


@mrx23dot
Contributor Author

mrx23dot commented Aug 24, 2021

Ah, so this is just a different error indicating a non-iXBRL file.

I could add a pre-check in the lxml implementation that would filter this out.
Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid?
Or should it just print a warning to the console that the file is possibly invalid?

@manusimidt
Owner

I am not entirely sure what you mean by the following statement:

Can we say that every valid htm (non xml), must contain "ixbrl" lowercase text to be valid?

For an iXBRL instance document to be valid, it must comply with the iXBRL specification.

This includes many validation rules.
See for example the validation rules for the ix:nonFraction elements:
https://www.xbrl.org/specification/inlinexbrl-part1/rec-2013-11-18/inlinexbrl-part1-rec-2013-11-18.html#d1e5415

@manusimidt
Owner

But yes, you are right that it would be nice if the parser could check whether a document contains valid XBRL tags.
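
A rough sketch of the kind of pre-check discussed in this thread is below. It is a cheap heuristic, not validation against the iXBRL specification, and it is not py-xbrl's actual behaviour. The name `looks_like_ixbrl` is hypothetical. The two namespace URIs are, to my understanding, the Inline XBRL 1.1 and 1.0 namespaces, so treat them as an assumption to verify.

```python
# Namespace URIs assumed to identify Inline XBRL documents.
IX_NAMESPACES = (
    "http://www.xbrl.org/2013/inlineXBRL",  # Inline XBRL 1.1 (assumed)
    "http://www.xbrl.org/2008/inlineXBRL",  # Inline XBRL 1.0 (assumed)
)


def looks_like_ixbrl(text: str) -> bool:
    """Return True if the text declares a known Inline XBRL namespace.

    Heuristic only: a document can mention the namespace and still be
    invalid iXBRL, so this merely filters out obviously plain HTML.
    """
    return any(ns in text for ns in IX_NAMESPACES)
```

A caller could run this on the downloaded text before handing it to the parser and emit a warning, rather than an XML parse error, when it returns False.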
