Apply correct look & feel depending on a filing format #41

Closed
kuatroka opened this issue May 27, 2020 · 12 comments
Labels: enhancement (New feature or request)

Comments

@kuatroka

There are multiple formats in EDGAR. For old filings it's all in txt format, but the new ones are XML, even though the original files on EDGAR are still .txt. It means that all the recent filings are unreadable after download because they're all in XML format.

It would be great if this tool could distinguish between a real .txt file (old format) and XML (new format) and apply the appropriate formatting automatically. Maybe it can be done by searching the document for the string of characters />: if it's present, it's an XML doc, and if it isn't, then it's a .txt. Maybe there is a certain year and month starting from which all EDGAR filings are in the XML format.
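In code, that check could be as simple as the sketch below (the helper name is made up, and treating any occurrence of /> as XML is just this heuristic, not a guarantee):

```python
# Minimal sketch of the suggested heuristic: treat a filing that contains
# "/>" anywhere as the new XML-based format. The function name is hypothetical.
def looks_like_xml(filing_text: str) -> bool:
    return "/>" in filing_text
```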

Regards

@jadchaar
Owner

Hey @kuatroka, thanks for bringing this up. SEC EDGAR is kind of a wild west and lacks standardization (e.g., I am not sure the start date of the new format was strictly enforced).

Can you give me two exact examples of filings that demonstrate the new and old format? That way I can put together a solution for this :).

@jadchaar jadchaar added the enhancement New feature or request label May 29, 2020
@jadchaar jadchaar changed the title from "Feature suggestion: Apply correct look & feel depending on a filing format" to "Apply correct look & feel depending on a filing format" May 29, 2020
@kuatroka
Author

kuatroka commented May 29, 2020

I have a conceptual idea on how it might work:

Check this page.

If you filter it to view only documents prior to the date 20000101, and filter to only 13F-HR filings, you will see only .txt files. If you open them, you will find each file is an actual .txt with clearly presented info and a table of all Berkshire's holdings.

Now, if you remove the filter on the date 20000101 and click on the .txt file, you will see gibberish (XML-formatted data), but if instead of the .txt file you open the .html file that contains the word *Table* in its title, you will find a properly formatted table there.

[GIF: filtering_sec]

@jadchaar
Owner

Interesting. It seems like older filings only post a .txt file, but the newer ones have XML and HTML files in addition. I could potentially design a hierarchy that attempts an XML parse if there is more than one submission file and, if not, falls back to the .txt.

Do you know if this is the case with other filing types as well?
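A rough sketch of that hierarchy, assuming a filing's documents have already been downloaded as (filename, content) pairs; none of these names come from the package itself:

```python
# Hedged sketch of the proposed hierarchy: try an XML parse when a filing
# ships more than one document, otherwise fall back to the raw .txt.
import xml.etree.ElementTree as ET

def parse_filing(documents):
    """documents: list of (filename, content) pairs for one filing."""
    if len(documents) > 1:
        for name, content in documents:
            if name.endswith(".xml"):
                try:
                    return ET.fromstring(content)  # parsed XML tree
                except ET.ParseError:
                    pass  # malformed XML: fall through to the .txt below
    # Old-format filing (or no parsable XML): return the raw .txt content.
    return next((c for n, c in documents if n.endswith(".txt")), None)
```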

@kuatroka
Author

kuatroka commented Jun 1, 2020

I checked all the filing types and here is the logic:

  • All filings, new or old, have at least a .txt file.

  • Starting from a certain year, some filings appear in other formats in addition to the always-present .txt.

  • If there is only a .txt file, then it's formatted in a normal way, but if there are other formats besides the .txt, then the .txt file is formatted as XML and not easily readable.

  • Besides .txt, the other main formats are .xml and .htm or .html.

  • To identify whether there are formats other than .txt, the flow is (see the sketch after this list):

    1. Parse the .txt and search for the `<FILENAME>` tag.
    2. If it exists, what comes after the tag is the file name we are interested in. Capture it in a
      variable A.
    3. Capture the current URL and change it by swapping whatever is there after the last / for the
      variable A, as in the GIF below.
      [GIF: sec2]
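In Python, those three steps might look like the sketch below (the helper name is made up, and, as noted later in this thread, the swapped-in URL does not resolve for every filing):

```python
# Sketch of the three-step flow: find the <FILENAME> tag in the full
# submission .txt, capture what follows it ("variable A"), and swap it in
# for everything after the last "/" of the current URL.
import re

def guess_document_url(submission_txt, submission_url):
    # Step 1: search the full submission .txt for the <FILENAME> tag.
    match = re.search(r"<FILENAME>(\S+)", submission_txt)
    if match is None:
        return None  # no tag: old-format filing, only the .txt exists
    # Step 2: what follows the tag is the file name ("variable A").
    filename = match.group(1)
    # Step 3: swap everything after the last "/" of the URL for variable A.
    return submission_url.rsplit("/", 1)[0] + "/" + filename
```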

@jadchaar
Owner

jadchaar commented Jun 7, 2020

Is the HTML file more easily parsable than the TXT file since it is technically XML?

@kuatroka
Author

kuatroka commented Jun 8, 2020

Not really sure. My idea was to have it in an easy format so it can be opened with pandas for further analysis and such. I know pandas has simple .csv and .html read commands. Also, the .html seems to be smaller in size, but in reality it's up to you, as I bet you know better what is easier to parse.
[Screenshot: Annotation 2020-06-08 215649]
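For reference, the pandas route could look like the sketch below; the URL is a placeholder (take the real table .html link from the filing's index page), and SEC.gov expects a descriptive User-Agent header on requests:

```python
# Hedged sketch of reading a filing's HTML table with pandas.
from io import StringIO

import pandas as pd
import requests

# Placeholder URL: substitute the actual table .html link from the index page.
url = "https://www.sec.gov/Archives/edgar/data/<CIK>/<ACCESSION>/<TABLE>.html"
resp = requests.get(url, headers={"User-Agent": "Sample Co. admin@example.com"})
resp.raise_for_status()
tables = pd.read_html(StringIO(resp.text))  # one DataFrame per <table> found
print(tables[0].head())
```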

@jadchaar
Owner

As part of v4 of this package (#52), I am adding the ability to download filing details provided by the EDGAR Search API (HTML or XML file depending on the filing type). That said, the XML file provided by the API is not the one you are looking for, so I will need to expand my search for a solution here. A potential idea I have is to take the accession number and append -index.html to get the index of all downloadable files like in your GIFs: https://www.sec.gov/Archives/edgar/data/0000102909/000110465920125446/0001104659-20-125446-index.html. Then, I will need to obtain this webpage and scrape/parse out all of the document hyperlinks for downloading. This will most likely come in a future release, but I have laid the groundwork in v4.
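For illustration, the scraping step described above might look roughly like this (not the package's actual code; SEC.gov also expects a descriptive User-Agent header):

```python
# Rough sketch of scraping the -index.html page for document links.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

index_url = (
    "https://www.sec.gov/Archives/edgar/data/0000102909/"
    "000110465920125446/0001104659-20-125446-index.html"
)
resp = requests.get(index_url, headers={"User-Agent": "Sample Co. admin@example.com"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Keep only links that point at downloadable filing documents.
doc_urls = [
    urljoin(index_url, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].endswith((".xml", ".htm", ".html", ".txt"))
]
```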

@ake3210

ake3210 commented Aug 13, 2021

Thank you for this package, it helps a lot! Unfortunately, I am currently facing the same problem as kuatroka (the author of this issue). Is there an update regarding this question yet?

If the package allowed for receiving the readable .txt and, later on, .html files, would it be possible to include code that automatically takes the .html files from the folder where they have been stored and converts them into .txt files? As a newbie to Python and GitHub, I only found the option below, which might be worth incorporating into the package.

from bs4 import BeautifulSoup

with open("filing.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")  # explicit parser avoids a warning

print(soup.get_text())

source: https://stackoverflow.com/questions/14694482/converting-html-to-text-with-python
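Extending that snippet to a whole download folder might look like the sketch below; the folder name is an example, and get_text() simply strips tags, so tables lose their layout:

```python
# Hedged sketch: convert every downloaded .htm/.html file to a plain .txt
# next to it. "sec-edgar-filings" is an example folder name.
from pathlib import Path

from bs4 import BeautifulSoup

for html_path in Path("sec-edgar-filings").rglob("*.htm*"):
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    # get_text() strips all tags, so tabular layout is lost in the output.
    html_path.with_suffix(".txt").write_text(soup.get_text(), encoding="utf-8")
```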

Thank you so much for your effort to improve the package!

@kuatroka
Author

> As part of v4 of this package (#52), I am adding the ability to download filing details provided by the EDGAR Search API (HTML or XML file depending on the filing type). […]

Sounds great!

@jadchaar
Owner

Sorry for the delayed response to this request; things have been hectic recently. I am going to keep the issue open so I can give it another look going forward and see what I can do!

@jadchaar jadchaar reopened this Oct 15, 2021
@kuatroka
Author

It's very kind of you, @jadchaar. No worries about the delay. It's your side hustle, which you do for free, so I'm just happy for any help at all!
Nevertheless, let me try to share an idea of why you might want to spend your valuable and busy time on some other feature instead of this one.

The reason I wanted this functionality is that I wanted to attack the filings parsing with a four-stage process:

A - Download all the files for all the years and all the companies.
B - Split all those files into two parts:

1 - the old format (unformatted .txt without any XML tags), and
2 - the new format (.txt with XML inside of it).

C - Select all the XML-based .txt files, find the corresponding .xml or .html, and parse them.
D - Deal with the muuuuuch more complicated old/unformatted .txt (I have started, and it's a pain).

Stage C is what we have been talking about in this thread, but I found a solution that eliminates this need. Please see this notebook and, in general, the entire repo. The code there takes in .txt files (though not downloaded files per se; it uses direct links to them) and parses the hell out of the XML parts of the .txt files themselves, without even needing the corresponding .xml or .html files.

It means that parsing the new, XML-based .txt filings is enough to get the data out of them. What is still needed, though, is some sort of identifier that separates all the filings into two heaps: 1 - old and 2 - new (stage B).

That's the reason you might want to drop this precise feature and instead maybe code stage B (a rough sketch follows below) and incorporate the parsing part from the repo I referenced here. That repo has the parsing logic for four different filing types, and I bet the logic for the rest can be extrapolated.
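A hedged sketch of the stage-B split, using the presence of an embedded <XML> block in the full submission as the old/new test; the marker, function name, and folder names are assumptions, not something the package provides:

```python
# Sort downloaded full-submission .txt files into "old" (plain text) and
# "new" (XML embedded) heaps, per stage B above.
import shutil
from pathlib import Path

def split_filings(src, old_dir="old", new_dir="new"):
    for txt_path in Path(src).rglob("*.txt"):
        text = txt_path.read_text(errors="ignore")
        # Presence of an embedded <XML> block marks the new format.
        heap = Path(new_dir) if "<XML>" in text else Path(old_dir)
        heap.mkdir(exist_ok=True)
        shutil.copy2(txt_path, heap / txt_path.name)
```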

In any case, this is just my idea, of course you might have other plans and features in mind. Thanks!

@jadchaar
Owner

jadchaar commented Jan 9, 2022

I looked into this a bit and it seems like an involved web scraping approach is required since the FILENAME attribute inside of the full submission does not seem sufficient to locate the table files. Take this one for example: https://www.sec.gov/Archives/edgar/data/1067983/000095012321015518/0000950123-21-015518-index.html. Here the filename is listed as <FILENAME>6820.xml, but the file is located at https://www.sec.gov/Archives/edgar/data/1067983/000095012321015518/0000950123-21-015518-6820.xml. This inconsistency makes it hard to generalize the URL without scraping the index page for the accession number. I am trying to avoid scraping at all costs since it is fragile. In v5, I am adopting the official Edgar API which should improve the quality of the downloaded documents, especially XML documents.

Thanks for elaborating on your solution. I hope others find it helpful. I am going to close this issue as I am not sure there is much I can do without adding scraping to the index file for the accession number. Also, the XML is available in the full submission TXT file as you stated, so it can be parsed out if needed.
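A minimal sketch of that parse-out step, assuming the embedded documents are wrapped in <XML>…</XML> tags inside the full submission .txt (an assumption based on the filings discussed here, not verified for every form type):

```python
# Pull the XML documents straight out of a full submission .txt.
import re
import xml.etree.ElementTree as ET

def embedded_xml_roots(submission_txt):
    # Each embedded document is assumed to sit between <XML> and </XML>.
    for block in re.findall(r"<XML>(.*?)</XML>", submission_txt, re.DOTALL):
        yield ET.fromstring(block.strip())
```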

Thanks again!

@jadchaar jadchaar closed this as completed Jan 9, 2022