What is This About? | How It Works | Quick Start | Examples | Limitations | Why This Program? | Acknowledgments | Other Projects
In an era of unified information, delivering financial information page after page in a decentralized fashion is becoming less practical. This package aims to provide dynamic functions to retrieve and centralize financial information from the U.S. Securities and Exchange Commission (SEC) EDGAR database, a web repository that stores reliable filings and track records of all publicly traded companies in the U.S. Think of this as your "librarian": let it know the specific company statement(s) you are looking for, and it will gather, tidy, and deliver them to you. The key advantage is that there is only one place to look, regardless of how many statements you have requested.
The program takes an object-oriented approach by having the user establish a business entity to store a specified type of financial information. The algorithm stands ready as a "librarian" in the entity to process your requests: curating, tidying, then sending them back to you all in one place.

On the retrieving side, the "librarian" parses through the SEC database to look for the company's identity. This is identified by the company's Central Index Key (CIK), similar to a person's ID card number, and then by the accession number of each filing requested, similar to the International Standard Book Number (ISBN) found on the back of a book. These are sufficient for the "librarian" to locate the HTML links to the full reports requested. To extract a specified financial statement within the report, the "librarian" virtually downloads the report section containing the specified statement, typically called "Item 8- Financial Statements and Supplementary Data" in an annual report (10-K) and "Item 1- Financial Statements" in a quarter report (10-Q); the report section will not be stored on your computer. Once downloaded, the "librarian" reads the section, grabs the financial statement table, and curates it into the "shelf" of your computer; the "shelf" is a folder created by the "librarian" to centralize all financial statements you've requested.

On the centralizing side, all your financial statements can be retrieved as many times as you want. You can choose which statements to keep and drop at any time. The "librarian" is also adept at tidying multiple financial statements into one table, so that you only have to look in one place to compare the financial performance of a business over many years instead of jumping between several places. The "librarian" also offers the classic way of report delivery, that is, to deliver specified reports page by page. As long as you have Chrome, an internet browser, on your computer, the rest is taken care of by the "librarian".
This works because once the "librarian" has located the HTML links to the full reports requested, it loads each full report in a separate tab, with all tabs lined up in one browser window. You may then simply click around to compare business performance in that one window.
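As a rough illustration of why a CIK and an accession number are enough to locate a filing, the sketch below builds the address of a filing's index page in the public EDGAR archive. The helper name filing_index_url is hypothetical and not part of this package, and the accession number is a made-up placeholder; only the folder layout follows EDGAR's archive convention.

```python
def filing_index_url(cik, accession_number):
    """Build the EDGAR index-page URL for one filing (illustrative helper)."""
    # EDGAR folder names use the CIK without leading zeros and the
    # accession number without dashes.
    cik_part = str(int(cik))
    folder = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_part}/{folder}/{accession_number}-index.htm"
    )


# Amazon's CIK is 1018724; the accession number below is a placeholder.
url = filing_index_url("0001018724", "0001234567-20-000001")
print(url)
```

How the package itself resolves these links is not shown here; this only demonstrates that the two identifiers pin down a single location.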
- Install Chrome Driver (skip this if you've already done so):
- If you are using Chrome version 85, please download ChromeDriver 85.0.4183.38
- If you are using Chrome version 84, please download ChromeDriver 84.0.4147.30
- If you are using Chrome version 83, please download ChromeDriver 83.0.4103.39
- If your Chrome version is none of the above, select the ChromeDriver release that matches it from the ChromeDriver downloads page.
- Install Python Packages:
pip install bs4
pip install requests
pip install pandas
pip install numpy
pip install selenium
The os, pickle, re, and datetime modules are part of the Python standard library and do not need to be installed with pip.
Begin by importing the module. Make sure your current directory is set to where the "sec_business_scraper.py" is located.
import sec_business_scraper
From the sec_business_scraper module, create a business entity to store a specified type of financial information. This can be done in the form sec_business_scraper.Business(...). Below is an example with Amazon:
amzn_annual=sec_business_scraper.Business(foreign=False, symbol='AMZN', report_type='annual', start_period=20160101, end_period=20191231)
amzn_quarter=sec_business_scraper.Business(foreign=False, symbol='AMZN', report_type='quarter', start_period=20160101, end_period=20191231)
- foreign=False means that our company of interest is a U.S. based business. If you are interested in a foreign based business, for example Alibaba Group in China, then specify foreign=True. In our case Amazon is a U.S. based company, so we set the foreign flag to False.
- symbol='AMZN' means we are specifying the stock ticker symbol of a business. The ticker symbol of any publicly traded company can be found with a quick web search. In Amazon's case, the stock ticker symbol is 'AMZN'.
- report_type='annual' means that we are interested in the annual reports of a company. Quarter reports may be obtained by specifying report_type='quarter'.
- start_period=20160101 means that we are asking the algorithm to retrieve data starting from 01/01/2016. Input the date as a numeric type with a 4-digit year followed by a 2-digit month, then a 2-digit day. There is NO need to format the date with separators such as '/' or '-'. The algorithm checks for leap years and invalid dates, then guides you to input a valid one.
- end_period=20191231 means that we are asking the algorithm to retrieve data until 12/31/2019. The same input format and validity checks apply as for start_period.
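The date format described above can be checked with the standard library alone. The following is a minimal sketch of such a check, not the package's own validation code; the function name parse_period is hypothetical.

```python
from datetime import date


def parse_period(period):
    """Convert a YYYYMMDD integer such as 20160101 into a datetime.date.

    Raises ValueError for malformed or impossible dates (e.g. 20190229).
    """
    text = str(period)
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected 8 digits in YYYYMMDD form, got {period!r}")
    year, month, day = int(text[:4]), int(text[4:6]), int(text[6:])
    # date() rejects out-of-range months and days and accounts for leap years.
    return date(year, month, day)


print(parse_period(20160229))  # 2016 is a leap year, so this date is valid
```

Note that a check like this only catches impossible dates. A mistyped but valid year still passes, so it is worth rereading the period arguments before running a request.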
We have now stored our requested information of Amazon in variables called amzn_annual and amzn_quarter. This should take just a few milliseconds to complete because the algorithm is only initializing the information we've requested. The next step is to send out, metaphorically, a "librarian" to search for our requested information.
It is now time to send our "librarian" to work!
## returned dataframes are stored in their corresponding variables
amzn_annual_income=amzn_annual.ghost_income()
amzn_quarter_balance=amzn_quarter.ghost_balance()
amzn_annual_cashflow=amzn_annual.ghost_cashflow()
- amzn_annual.ghost_income() means that the "librarian" will search through the entire SEC EDGAR database to look for all annual income statements of Amazon between 01/01/2016 and 12/31/2019, and return ONE dataframe with the corresponding income statements put side by side for comparison. This dataframe is designed to contain as few repeated income statement columns as possible. Income statements retrieved between the specified periods are stored in the "statement_pile" folder, which serves as your bookshelf. [Figure: excerpt of the returned dataframe]
- amzn_quarter.ghost_balance() means that the "librarian" will search through the entire SEC EDGAR database to look for all quarter balance sheets of Amazon between 01/01/2016 and 12/31/2019, and return ONE dataframe with the corresponding balance sheets put side by side for comparison. This dataframe is designed to contain as few repeated balance sheet columns as possible. Balance sheets retrieved between the specified periods are stored in the "statement_pile" folder, which serves as your bookshelf. [Figure: excerpt of the returned dataframe]
- amzn_annual.ghost_cashflow() means that the "librarian" will search through the entire SEC EDGAR database to look for all annual cashflow statements of Amazon between 01/01/2016 and 12/31/2019, and return ONE dataframe with the corresponding cashflow statements put side by side for comparison. This dataframe is designed to contain as few repeated cashflow statement columns as possible. Cashflow statements retrieved between the specified periods are stored in the "statement_pile" folder, which serves as your bookshelf. [Figure: excerpt of the returned dataframe]
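To make the "as few repeated columns as possible" idea concrete: consecutive filings usually restate earlier periods, so placing them side by side would otherwise duplicate columns. The sketch below shows one simple way to keep each period once, using plain dictionaries in place of dataframes. It is an assumption about the general technique, not the package's actual merging logic; merge_statements and all figures are made up for illustration.

```python
def merge_statements(statements):
    """Combine statements side by side, keeping each period column once.

    Each statement maps a period label to {line item: value}.
    When a period appears in several statements, the first one wins.
    """
    merged = {}
    for statement in statements:
        for period, items in statement.items():
            merged.setdefault(period, items)
    return merged


# Made-up numbers, for illustration only.
filing_2019 = {"FY2019": {"Net income": 30}, "FY2018": {"Net income": 20}}
filing_2018 = {"FY2018": {"Net income": 20}, "FY2017": {"Net income": 10}}

combined = merge_statements([filing_2019, filing_2018])
print(list(combined))  # ['FY2019', 'FY2018', 'FY2017']
```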



- Quick Update of Current Statements
- Browse Financial Report Pages
- Browse Company Risks
amzn_annual=sec_business_scraper.Business(foreign=False, symbol='AMZN', report_type='annual', start_period=20100214, end_period=20180214)
amzn_quarter=sec_business_scraper.Business(foreign=False, symbol='AMZN', report_type='quarter', start_period=20100214, end_period=20180214)
## Update the annual income statements on shelf
amzn_annual.update_financial_statements(statement_type='income')
## Update the quarter balance sheets on shelf
amzn_quarter.update_financial_statements(statement_type='balance')
## Update the annual cashflow statements on shelf
amzn_annual.update_financial_statements(statement_type='cashflow')
When you are interested in the same company but over a different period, which in this case is from 02/14/2010 to 02/14/2018, .update_financial_statements(...) can be called to update your statement_pile folder to contain statements of the newly specified time range. This will not overwrite the previous statements that the program has retrieved. '...' represents the statement type that you would like to update.
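The "update without overwriting" behaviour can be pictured as a write-if-missing rule on the shelf folder. The sketch below assumes statements are stored as pickle files named after the statement; that storage format, the file naming, and the helper shelve_statement are all assumptions for illustration rather than the package's documented internals.

```python
import os
import pickle


def shelve_statement(shelf_dir, name, statement):
    """Pickle `statement` into shelf_dir unless a file of that name exists.

    Returns True if a new file was written, False if one was already there.
    """
    os.makedirs(shelf_dir, exist_ok=True)
    path = os.path.join(shelf_dir, name + ".pkl")
    if os.path.exists(path):
        return False  # keep the earlier copy untouched
    with open(path, "wb") as handle:
        pickle.dump(statement, handle)
    return True
```

Calling the helper twice with the same name leaves the first copy in place, which mirrors the promise that previously retrieved statements are preserved.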
## exhibiting the annual reports in browser
amzn_annual.financial_statements_exhitbit()
## exhibiting the quarter report in browser
amzn_quarter.financial_statements_exhitbit()
Here's one for analysts who would like to scrutinize full reports page by page. .financial_statements_exhitbit() displays all annual (10-K) or quarter (10-Q) reports of a company over the specified period. Each report is displayed in a separate tab, scrolled as closely as possible to the section containing the financial statements. This is typically the "Item 8- Financial Statements and Supplementary Data" section for annual reports and the "Item 1- Financial Statements" section for quarter reports. All tabs are hosted in a single window, allowing you to locate the correct report from just ONE place.
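For readers curious how one window can hold every report, here is a minimal sketch of the tab-per-report idea using two standard Selenium WebDriver calls, get and execute_script. The helper open_in_tabs is hypothetical, and the sketch leaves out the scrolling to a particular section; it should not be read as the package's implementation.

```python
def open_in_tabs(driver, urls):
    """Load each URL in its own tab of the window controlled by `driver`."""
    for index, url in enumerate(urls):
        if index == 0:
            driver.get(url)  # reuse the tab the browser starts with
        else:
            # Ask the browser to open a new tab for every further report.
            driver.execute_script("window.open(arguments[0], '_blank');", url)


# Typical use (requires Chrome and a matching ChromeDriver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# open_in_tabs(driver, report_urls)  # report_urls: a list of filing links
```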
## exhibiting enterprise (internal) risk
amzn_annual.risk_factors_exhibit(risk_type='enterprise')
## exhibiting market (external) risk
amzn_annual.risk_factors_exhibit(risk_type='market')
Similarly, .risk_factors_exhibit(...) displays all annual (10-K) or quarter (10-Q) reports of a company over the specified period, with each report scrolled as closely as possible to the risk section. '...' represents the risk type of interest. An enterprise risk type includes internal pressures such as changes in management structure and poor employee relationships; a market risk type includes external pressures such as industry competition and customer liability.
Scenario: I would like to examine the annual Net Income and Free Cash Flow of a foreign company (ticker symbol TSM) between 02/14/2010 and 02/14/2015. Net Income is an item in the income statement while Free Cash Flow is an item in the statement of cashflows.
## creating a business entity to store the specified type of information
tsm_annual=sec_business_scraper.Business(foreign=True, symbol='TSM', report_type='annual', start_period=20100214, end_period=20150214)
## requesting the program to gather income statement and combine them into one dataframe
tsm_annual_income=tsm_annual.ghost_income()
## requesting the program to gather statement of cash flows and combine them into one dataframe
tsm_annual_cashflow=tsm_annual.ghost_cashflow()
The ONE place you have to look for historical Net Income will be in the tsm_annual_income dataframe. The ONE place you have to look for historical Free Cash Flow will be in the tsm_annual_cashflow dataframe.
Takeaway: remember to set the foreign flag to True, foreign=True, when analyzing foreign companies.
Scenario: I would like to calculate the Current Ratio of the Company as of the most recent quarter. Current Ratio = Total Current Assets / Total Current Liabilities. Both Total Current Assets and Total Current Liabilities are items in the quarter balance sheet.
## creating a business entity to store the specified type of information
lmt_quarter=sec_business_scraper.Business(foreign=False, symbol='LMT', report_type='quarter', start_period=20191001, end_period=20191201)
## requesting the program to gather balance sheets and combine them into one dataframe
lmt_quarter_balance=lmt_quarter.ghost_balance()
This message is shown because there are no quarter filings within this time range. Try expanding the time range.
lmt_quarter=sec_business_scraper.Business(foreign=False, symbol='LMT', report_type='quarter', start_period=20190901, end_period=20191231)
lmt_quarter_balance=lmt_quarter.ghost_balance()
The time range is now expanded from (10/01/2019,12/01/2019) to (09/01/2019,12/31/2019). This should now work. If it doesn't, keep expanding the time range.
Eyeball the first columns for the Total Current Assets and Total Current Liabilities for the most recent quarter, then apply calculations.
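Once the two figures have been read off the dataframe, the calculation itself is a one-liner. The sketch below wraps the formula from the scenario in a small hypothetical helper; the numbers are placeholders and are not taken from any filing.

```python
def current_ratio(total_current_assets, total_current_liabilities):
    """Current Ratio = Total Current Assets / Total Current Liabilities."""
    if total_current_liabilities == 0:
        raise ValueError("total current liabilities must be non-zero")
    return total_current_assets / total_current_liabilities


# Placeholder figures, not taken from any filing.
ratio = current_ratio(1500.0, 1200.0)
print(round(ratio, 2))  # 1.25
```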
Takeaway: Whenever the program complains about not being able to find filings because the time frame is too narrow, try expanding the time range.
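Expanding the range by hand works, but the arithmetic can also be scripted. This is a small sketch, assuming dates are handled with the standard datetime module; widen_range and as_period are hypothetical helpers, and the 30-day step is an arbitrary choice.

```python
from datetime import date, timedelta


def widen_range(start, end, days=30):
    """Return a (start, end) pair pushed outward by `days` on both sides."""
    step = timedelta(days=days)
    return start - step, end + step


def as_period(day):
    """Format a date as the YYYYMMDD integer used for start_period/end_period."""
    return int(day.strftime("%Y%m%d"))


start, end = widen_range(date(2019, 10, 1), date(2019, 12, 1))
print(as_period(start), as_period(end))  # 20190901 20191231
```

The resulting integers can be passed as start_period and end_period when creating a new business entity.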
- Quarter reports are NOT available for foreign companies.
- This program works mostly for filings after 2009 because statement excerpts were less standardized before 2009.
- Despite combining all information in one dataframe, the data is not ideally formatted and will require some eyeballing to pick out each item from the combined dataframe. This may be improved in the next version.
- A separate business entity is needed for each report type, "report_type='annual'" and "report_type='quarter'". Using a single business entity may be considered by moving the "report_type" parameter to the execution step.
Rather than parsing through each report and extracting the financial statements of interest, this program virtually downloads ready-parsed financial statements from the SEC repository. These financial statements were parsed by SEC officials and stored as multiple tables in an Excel spreadsheet. As a result, financial data extracted from this source is more reliable than what a conventional scraper can achieve. While having a reliably parsed statement is important, having an analyst-friendly output is also crucial. The keyword here is centralize. How can an analyst go through the necessary items year by year, statement by statement, and company by company without losing track? The execution functions .ghost_income(), .ghost_balance(), and .ghost_cashflow() serve not only as a scraper, but also as a centralizer that tidies all relevant data in one place with minimal data loss. The outcome? There is no longer a need to jump frequently between statements and reports to grasp the necessary items during an analysis.