Apply correct look & feel depending on a filing format #41

Closed
kuatroka opened this issue May 27, 2020 · 12 comments
Labels: enhancement (New feature or request)

Comments

@kuatroka

There are multiple formats in EDGAR. For old filings it's all in txt format, but the new ones are XML, even though the original files on EDGAR are still .txt. It means that all the recent filings are unreadable after download because they're all in XML format.

It would be great if this tool could distinguish between a real .txt file (old format) and XML (new format) and apply the appropriate formatting automatically. Maybe it can be done by searching the document for the string of characters />: if it's present, it's an XML doc, and if it isn't, then it's a .txt. Maybe there is a certain year and month starting from which all EDGAR filings are in the XML format.
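In code, that check could be as simple as the sketch below (the helper name is made up, and treating any occurrence of /> as XML is just this heuristic, not a guarantee):

```python
# Minimal sketch of the suggested heuristic: treat a filing that contains
# "/>" anywhere as the new XML-based format. The function name is hypothetical.
def looks_like_xml(filing_text: str) -> bool:
    return "/>" in filing_text
```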

Regards

@jadchaar
Owner

Hey @kuatroka, thanks for bringing this up. SEC EDGAR is kind of a wild west and lacks standardization (e.g., I am not sure the start date of the new format was strictly enforced).

Can you give me two exact examples of filings that demonstrate the new and old format? That way I can put together a solution for this :).

@jadchaar jadchaar added the enhancement New feature or request label May 29, 2020
@jadchaar jadchaar changed the title from "Feature suggestion: Apply correct look & feel depending on a filing format" to "Apply correct look & feel depending on a filing format" May 29, 2020
@kuatroka
Author

kuatroka commented May 29, 2020

I have a conceptual idea on how it might work:

Check this page.

If you filter it to view only documents prior to the date 20000101, and filter to only 13F-HR filings, you will see only .txt files. If you open them, you will find each file is an actual .txt with clearly presented info and a table of all Berkshire's holdings.

Now, if you remove the filter on the date 20000101 and click on the .txt file, you will see gibberish (XML-formatted data), but if instead of the .txt file you open the .html file that contains the word *Table* in its title, you will find a properly formatted table there.

[GIF: filtering_sec]

@jadchaar
Owner

Interesting. It seems like older filings only post a .txt file, but the newer ones have XML and HTML files in addition. I could potentially design a hierarchy that attempts an XML parse if there is more than one submission file and, if not, falls back to the .txt.

Do you know if this is the case with other filing types as well?
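A rough sketch of that hierarchy, assuming a filing's documents have already been downloaded as (filename, content) pairs; none of these names come from the package itself:

```python
# Hedged sketch of the proposed hierarchy: try an XML parse when a filing
# ships more than one document, otherwise fall back to the raw .txt.
import xml.etree.ElementTree as ET

def parse_filing(documents):
    """documents: list of (filename, content) pairs for one filing."""
    if len(documents) > 1:
        for name, content in documents:
            if name.endswith(".xml"):
                try:
                    return ET.fromstring(content)  # parsed XML tree
                except ET.ParseError:
                    pass  # malformed XML: fall through to the .txt below
    # Old-format filing (or no parsable XML): return the raw .txt content.
    return next((c for n, c in documents if n.endswith(".txt")), None)
```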

@kuatroka
Author

kuatroka commented Jun 1, 2020

I checked all the filing types and here is the logic:

  • All filings, new or old, have at least a .txt file.

  • Starting from a certain year, some filings appear in other formats in addition to the always-present .txt.

  • If there is only a .txt file, then it's formatted in a normal way, but if there are other formats besides the .txt, then the .txt file is formatted as XML and not easily readable.

  • Besides .txt, the other main formats are .xml and .htm or .html.

  • To identify whether there are formats other than .txt, the flow is (see the sketch after this list):

    1. Parse the .txt and search for the `<FILENAME>` tag.
    2. If it exists, what comes after the tag is the file name we are interested in. Capture it in a
      variable A.
    3. Capture the current URL and change it by swapping whatever is there after the last / for the
      variable A, as in the GIF below.
      [GIF: sec2]
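In Python, those three steps might look like the sketch below (the helper name is made up, and, as noted later in this thread, the swapped-in URL does not resolve for every filing):

```python
# Sketch of the three-step flow: find the <FILENAME> tag in the full
# submission .txt, capture what follows it ("variable A"), and swap it in
# for everything after the last "/" of the current URL.
import re

def guess_document_url(submission_txt, submission_url):
    # Step 1: search the full submission .txt for the <FILENAME> tag.
    match = re.search(r"<FILENAME>(\S+)", submission_txt)
    if match is None:
        return None  # no tag: old-format filing, only the .txt exists
    # Step 2: what follows the tag is the file name ("variable A").
    filename = match.group(1)
    # Step 3: swap everything after the last "/" of the URL for variable A.
    return submission_url.rsplit("/", 1)[0] + "/" + filename
```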

@jadchaar
Owner

jadchaar commented Jun 7, 2020

Is the HTML file more easily parsable than the TXT file since it is technically XML?

@kuatroka
Author

kuatroka commented Jun 8, 2020

Not really sure. My idea was to have it in an easy format so it can be opened with pandas for further analysis and such. I know pandas has simple .csv and .html read commands. Also, the .html seems to be smaller in size, but in reality it's up to you, as I bet you know better what is easier to parse.
[Screenshot: Annotation 2020-06-08 215649]
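For reference, the pandas route could look like the sketch below; the URL is a placeholder (take the real table .html link from the filing's index page), and SEC.gov expects a descriptive User-Agent header on requests:

```python
# Hedged sketch of reading a filing's HTML table with pandas.
from io import StringIO

import pandas as pd
import requests

# Placeholder URL: substitute the actual table .html link from the index page.
url = "https://www.sec.gov/Archives/edgar/data/<CIK>/<ACCESSION>/<TABLE>.html"
resp = requests.get(url, headers={"User-Agent": "Sample Co. admin@example.com"})
resp.raise_for_status()
tables = pd.read_html(StringIO(resp.text))  # one DataFrame per <table> found
print(tables[0].head())
```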

@jadchaar
Owner

As part of v4 of this package (#52), I am adding the ability to download filing details provided by the EDGAR Search API (HTML or XML file depending on the filing type). That said, the XML file provided by the API is not the one you are looking for, so I will need to expand my search for a solution here. A potential idea I have is to take the accession number and append -index.html to get the index of all downloadable files like in your GIFs: https://www.sec.gov/Archives/edgar/data/0000102909/000110465920125446/0001104659-20-125446-index.html. Then, I will need to obtain this webpage and scrape/parse out all of the document hyperlinks for downloading. This will most likely come in a future release, but I have laid the groundwork in v4.
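For illustration, the scraping step described above might look roughly like this (not the package's actual code; SEC.gov also expects a descriptive User-Agent header):

```python
# Rough sketch of scraping the -index.html page for document links.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

index_url = (
    "https://www.sec.gov/Archives/edgar/data/0000102909/"
    "000110465920125446/0001104659-20-125446-index.html"
)
resp = requests.get(index_url, headers={"User-Agent": "Sample Co. admin@example.com"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Keep only links that point at downloadable filing documents.
doc_urls = [
    urljoin(index_url, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].endswith((".xml", ".htm", ".html", ".txt"))
]
```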

@ake3210

ake3210 commented Aug 13, 2021

Thank you for this package, it helps a lot! Unfortunately, I am currently facing the same problem as kuatroka (the author of this issue). Is there an update regarding this question yet?

If the package allowed for receiving the readable .txt and, later on, .html files, would it be possible to include code that automatically takes the .html files from the folder where they have been stored and converts them into .txt files? As a newbie to Python and GitHub, I only found the option below, which might be worth incorporating into the package.

from bs4 import BeautifulSoup

with open("filing.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")  # explicit parser avoids a warning

print(soup.get_text())

source: https://stackoverflow.com/questions/14694482/converting-html-to-text-with-python
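Extending that snippet to a whole download folder might look like the sketch below; the folder name is an example, and get_text() simply strips tags, so tables lose their layout:

```python
# Hedged sketch: convert every downloaded .htm/.html file to a plain .txt
# next to it. "sec-edgar-filings" is an example folder name.
from pathlib import Path

from bs4 import BeautifulSoup

for html_path in Path("sec-edgar-filings").rglob("*.htm*"):
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    # get_text() strips all tags, so tabular layout is lost in the output.
    html_path.with_suffix(".txt").write_text(soup.get_text(), encoding="utf-8")
```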

Thank you so much for your effort to improve the package!

@kuatroka
Author

> As part of v4 of this package (#52), I am adding the ability to download filing details provided by the EDGAR Search API (HTML or XML file depending on the filing type). […]

Sounds great!

@jadchaar
Owner

Sorry for the delayed response to this request; things have been hectic recently. I am going to keep the issue open so I can give it another look going forward and see what I can do!

@jadchaar jadchaar reopened this Oct 15, 2021
@kuatroka
Author

It's very kind of you, @jadchaar. No worries about the delay. It's your side hustle, which you do for free, so I'm just happy for any help at all!
Nevertheless, let me try to share an idea of why you might want to spend your valuable and busy time on some other feature instead of this one.

The reason I wanted this functionality is that I wanted to attack the filings parsing with a four-stage process:

A - Download all the files for all the years and all the companies.
B - Split all those files into two parts:

1 - the old format (unformatted .txt without any XML tags), and
2 - the new format (.txt with XML inside of it).

C - Select all the XML-based .txt files, find the corresponding .xml or .html, and parse them.
D - Deal with the muuuuuch more complicated old/unformatted .txt (I have started, and it's a pain).

Stage C is what we have been talking about in this thread, but I found a solution that eliminates this need. Please see this notebook and, in general, the entire repo. The code there takes in .txt files (though not downloaded files per se; it uses direct links to them) and parses the hell out of the XML parts of the .txt files themselves, without even needing the corresponding .xml or .html files.

It means that parsing the new, XML-based .txt filings is enough to get the data out of them. What is still needed, though, is some sort of identifier that separates all the filings into two heaps: 1 - old and 2 - new (stage B).

That's the reason you might want to drop this precise feature and instead maybe code stage B (a rough sketch follows below) and incorporate the parsing part from the repo I referenced here. That repo has the parsing logic for four different filing types, and I bet the logic for the rest can be extrapolated.
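A hedged sketch of the stage-B split, using the presence of an embedded <XML> block in the full submission as the old/new test; the marker, function name, and folder names are assumptions, not something the package provides:

```python
# Sort downloaded full-submission .txt files into "old" (plain text) and
# "new" (XML embedded) heaps, per stage B above.
import shutil
from pathlib import Path

def split_filings(src, old_dir="old", new_dir="new"):
    for txt_path in Path(src).rglob("*.txt"):
        text = txt_path.read_text(errors="ignore")
        # Presence of an embedded <XML> block marks the new format.
        heap = Path(new_dir) if "<XML>" in text else Path(old_dir)
        heap.mkdir(exist_ok=True)
        shutil.copy2(txt_path, heap / txt_path.name)
```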

In any case, this is just my idea, of course you might have other plans and features in mind. Thanks!

@jadchaar
Owner

jadchaar commented Jan 9, 2022

I looked into this a bit and it seems like an involved web scraping approach is required since the FILENAME attribute inside of the full submission does not seem sufficient to locate the table files. Take this one for example: https://www.sec.gov/Archives/edgar/data/1067983/000095012321015518/0000950123-21-015518-index.html. Here the filename is listed as <FILENAME>6820.xml, but the file is located at https://www.sec.gov/Archives/edgar/data/1067983/000095012321015518/0000950123-21-015518-6820.xml. This inconsistency makes it hard to generalize the URL without scraping the index page for the accession number. I am trying to avoid scraping at all costs since it is fragile. In v5, I am adopting the official Edgar API which should improve the quality of the downloaded documents, especially XML documents.

Thanks for elaborating on your solution. I hope others find it helpful. I am going to close this issue as I am not sure there is much I can do without adding scraping to the index file for the accession number. Also, the XML is available in the full submission TXT file as you stated, so it can be parsed out if needed.
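A minimal sketch of that parse-out step, assuming the embedded documents are wrapped in <XML>…</XML> tags inside the full submission .txt (an assumption based on the filings discussed here, not verified for every form type):

```python
# Pull the XML documents straight out of a full submission .txt.
import re
import xml.etree.ElementTree as ET

def embedded_xml_roots(submission_txt):
    # Each embedded document is assumed to sit between <XML> and </XML>.
    for block in re.findall(r"<XML>(.*?)</XML>", submission_txt, re.DOTALL):
        yield ET.fromstring(block.strip())
```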

Thanks again!

@jadchaar jadchaar closed this as completed Jan 9, 2022