Apply correct look & feel depending on a filing format #41
Comments
Hey @kuatroka, thanks for bringing this up. SEC Edgar is kind of a wild west and lacks standardization (e.g. I am not sure if the start date of the new format was strictly enforced). Can you give me two exact examples of filings that demonstrate the new and old format? That way I can put together a solution for this :). |
I have a conceptual idea on how it might work: Check this page. If you filter it to view only documents prior to the date 20000101 Now, if you remove the filter on the date 20000101 and click on the |
Interesting. It seems like older filings only post a txt file, but the newer ones have XML and HTML files in addition. I could potentially design a hierarchy that will attempt an XML parse if there is more than one submission file and if not, fall back to the txt. Do you know if this is the case with other filing types as well? |
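The hierarchy described above could be sketched roughly as follows. This is only an illustration of the idea, not code from the package: `choose_submission_file` and the example file names are hypothetical, and it assumes the submission files are already listed by name.

```python
from pathlib import PurePosixPath


def choose_submission_file(filenames):
    """Pick the file to parse for one filing.

    Hypothetical helper: if the filing has more than one submission file,
    prefer XML, then HTML; otherwise fall back to the TXT file.
    """
    by_suffix = {}
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        # Keep the first file seen for each extension.
        by_suffix.setdefault(suffix, name)

    if len(filenames) > 1:
        for suffix in (".xml", ".html", ".htm"):
            if suffix in by_suffix:
                return by_suffix[suffix]

    # Older filings: only the TXT submission exists (or nothing at all).
    return by_suffix.get(".txt")
```

For example, `choose_submission_file(["filing.txt"])` falls back to the TXT file, while a list that also contains an `.xml` entry returns that entry instead.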
I checked all the filing types and here is the logic:
|
Is the HTML file more easily parsable than the TXT file since it is technically XML? |
As part of v4 of this package (#52), I am adding the ability to download filing details provided by the EDGAR Search API (HTML or XML file depending on the filing type). That said, the XML file provided by the API is not the one you are looking for, so I will need to expand my search for a solution here. A potential idea I have is to take the accession number and append |
Thank you for this package, it helps a lot! Unfortunately, I am currently facing the same problem as kuatroka (the author of this issue). Is there an update on his question? If the package allowed downloading the readable .txt files and, later on, the .html files, would it be possible to include code that automatically takes the .html files from the folder where they are stored and converts them into .txt files? As a newbie to Python and GitHub, I only found this option, which might help to incorporate it into the package: from bs4 import BeautifulSoup (source: https://stackoverflow.com/questions/14694482/converting-html-to-text-with-python). Thank you so much for your effort to improve the package! |
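A minimal sketch of the HTML-to-text conversion asked about above. To keep it dependency-free it swaps BeautifulSoup for the standard library's `html.parser`; the names `html_to_text` and `_TextExtractor` are made up for this example, and the output is plain text with one line per text chunk, not a faithful rendering of the filing layout.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only chunks and anything inside skipped tags.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

To convert a whole download folder, one could loop over the `.html` files with `pathlib.Path.glob`, pass each file's contents to `html_to_text`, and write the result next to it with a `.txt` suffix.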
Sounds great! |
Sorry for the delayed response to this request--things have been hectic recently. I am going to keep the issue open so I can give it another look going forward and see what I can do! |
It's very kind of you, @jadchaar. No worries about the delay. It's your side hustle, which you do for free, so I'm just happy for any help at all! The reason I wanted this functionality is that I wanted to attack the filings parsing with a four-stage process: A - Download all the files for all the years and all the companies
C - Select all the The stage C is what we have been talking about in this thread, but I found a solution that eliminates this need. Please see this notebook and in general the entire repo. The code here takes in It means that, the parsing of the new/xml based That's the reason why you might want to drop this precise feature and instead maybe code the stage B and incorporate the parsing part from the repo I referenced here. That repo has the parsing logic for four different filings and I bet the logic for the rest can be extrapolated. In any case, this is just my idea; of course you might have other plans and features in mind. Thanks! |
I looked into this a bit and it seems like an involved web scraping approach is required since the Thanks for elaborating on your solution. I hope others find it helpful. I am going to close this issue as I am not sure there is much I can do without adding scraping to the index file for the accession number. Also, the XML is available in the full submission TXT file as you stated, so it can be parsed out if needed. Thanks again! |
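As a rough sketch of parsing the XML out of the full submission TXT file: the code below assumes the embedded XML document is wrapped in `<XML>` ... `</XML>` markers inside the text file, which is how XML-based forms typically appear in full submissions. The function names are hypothetical and not part of the package.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each embedded XML document sits between <XML> and </XML> markers.
_XML_BLOCK = re.compile(r"<XML>\s*(.*?)\s*</XML>", re.DOTALL | re.IGNORECASE)


def extract_xml_documents(full_submission_text):
    """Return every embedded XML document as a string (possibly an empty list)."""
    return [m.group(1) for m in _XML_BLOCK.finditer(full_submission_text)]


def parse_first_xml(full_submission_text):
    """Parse the first embedded XML document, or return None if there is none."""
    documents = extract_xml_documents(full_submission_text)
    if not documents:
        return None
    return ET.fromstring(documents[0])
```

Old text-only filings simply yield an empty list, so the caller can fall back to treating the file as plain text.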
There are multiple formats in EDGAR. For old filings it's all in `txt` format, but the new ones are `xml`, even though the original files on EDGAR are still `txt`. It means that all the recent filings are unreadable after download because they're all in `xml` format.

It would be great if this tool could distinguish between a real `txt` file (old format) and the `xml` (new format) and apply appropriate formatting automatically. Maybe it can be done by searching the document for the string of characters `/>`. This would mean it's an `.xml` doc, and if there isn't any, then it's a `.txt`. Maybe there is a certain year and month starting from which all EDGAR filings started to be in the `.xml` format.

Regards
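The detection idea above might look something like this. It is a heuristic sketch only: `detect_filing_format` is a hypothetical name, and besides the `/>` check from the suggestion it also looks for an XML declaration or an `<XML>` wrapper, which is an added assumption about what XML-based filings contain.

```python
def detect_filing_format(text):
    """Guess whether a downloaded filing is XML-based or plain text.

    Returns "xml" or "txt". This is a heuristic and can misfire, for
    example on a text filing that happens to embed self-closing HTML tags.
    """
    lowered = text.lower()
    # Assumed markers of the newer format: an XML declaration or <XML> wrapper.
    if "<?xml" in lowered or "<xml>" in lowered:
        return "xml"
    # The check suggested in the issue: any self-closing tag.
    if "/>" in text:
        return "xml"
    return "txt"
```

A caller could use the result to decide whether to run an XML parser or to keep the file as readable text.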