# Scraping EDGAR for Financial Documents

In this notebook, I write code to retrieve 10-K and 10-Q reports for a given company, identified by its ticker or CIK. To run it, you will need to install the sec_edgar_downloader package via pip:

In [1]:
!pip install sec_edgar_downloader



I was able to pull the list of all companies that submitted such documents from the SEC website. However, this was done manually via copy/paste, a step that would need to be automated if I were to extend this to other sectors or industries.

In [13]:
import pandas as pd
import os
from sec_edgar_downloader import Downloader

In [14]:
# Open the file
companies = pd.read_csv("company_list.csv")
companies.head()

Unnamed: 0,CIK,Company,State/Country
0,1053468,ABBOTT GREGORY,CO
1,1295721,ACE Aviation Holdings Inc.,A8
2,1002819,AIR CANADA /QUEBEC/,A8
3,1110452,AIR FRANCE-KLM /FI,I0
4,310454,AIR MIDWEST INC,KS


## Pulling Data From One Company

The company CIK can be used to scrape the data. I will demonstrate a quick example (American Airlines) prior to downloading the full list of companies:

In [7]:
companies[companies['Company'] == 'American Airlines Group Inc.']

Unnamed: 0,CIK,Company,State/Country
19,6201,American Airlines Group Inc.,TX


In [9]:
# Initialize a downloader instance. If no argument is passed
# to the constructor, the package will download filings to
# the current working directory.
dl = Downloader()

In [10]:
# Get all 10-K filings for American Airlines (ticker: AAL) from 2000 onwards
dl.get("10-K", "6201", after="2000-01-01")

19

In [12]:
os.listdir("sec-edgar-filings/")

['AAL', '0000006201']

We can see here that the directory created by the package contains a folder named after the ticker (AAL) as well as one named after the zero-padded CIK (0000006201). Having filings organized by ticker will be handy in the backtesting phase.
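The `0000006201` entry in the listing above is the CIK left-padded with zeros to ten digits, the form EDGAR uses for CIKs. Below is a minimal sketch of converting between the integer CIKs in the CSV and these folder names. The helper names `pad_cik` and `unpad_cik` are hypothetical and not part of sec_edgar_downloader.

```python
def pad_cik(cik) -> str:
    """Return a CIK as a 10-character, zero-padded string."""
    return str(int(cik)).zfill(10)


def unpad_cik(folder_name: str) -> int:
    """Return the integer CIK for a zero-padded folder name."""
    return int(folder_name)


print(pad_cik(6201))
print(unpad_cik("0000006201"))
```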

## Pulling Data From All Companies

Now, we will pull the data from all the companies found in the .csv file.

In [18]:
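# Request the 10-K filings (2000 onwards) for every CIK in the list.
# The CIK is passed as a string, as in the single-company example.
CIKs = companies["CIK"]
for cik in CIKs.values:
    dl.get("10-K", str(cik), after="2000-01-01")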
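A loop like this stops at the first request that raises an error, which can be costly on a long list. The sketch below shows one possible way to keep going and record which CIKs failed. `download_all` is a hypothetical helper, and `fake_get` is a stand-in used only so the example runs without network access. It assumes the download function takes the same arguments as the `dl.get` call above.

```python
def download_all(get_filings, ciks, form="10-K", after="2000-01-01"):
    """Call get_filings once per CIK, collecting results and failures."""
    counts, failures = {}, {}
    for cik in ciks:
        key = str(cik)
        try:
            counts[key] = get_filings(form, key, after=after)
        except Exception as exc:  # keep going if one company fails
            failures[key] = repr(exc)
    return counts, failures


def fake_get(form, cik, after=None):
    """Stand-in for dl.get (assumption): fails for CIK '0'."""
    if cik == "0":
        raise ValueError("unknown CIK")
    return 1


counts, failures = download_all(fake_get, [6201, 0])
print(counts)
print(failures)
```

In the notebook, the call would presumably look like `download_all(dl.get, companies["CIK"])`, after which `failures` can be inspected or retried.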

In [20]:
os.listdir("sec-edgar-filings/")[:5]

['0000100517', '0001351548', '0000101001', 'AAL', '0001405419']

As we can see, the folders are mostly named after zero-padded CIKs rather than tickers, so we will need to rename them to make it clear which company each one belongs to. Luckily, there is a complete mapping that can be downloaded from the following link:

http://rankandfiled.com/#/data/tickers
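Once such a mapping is loaded, the renaming step could be sketched as below. This assumes the mapping has already been turned into a dictionary from integer CIK to ticker. The format of the downloaded file is not shown here, so that conversion is left out. `rename_cik_folders` is a hypothetical helper, and the demo runs in a temporary directory with a made-up ticker (`XYZ`) rather than on real filings.

```python
import os
import tempfile


def rename_cik_folders(root, cik_to_ticker):
    """Rename zero-padded CIK folders under root to their tickers."""
    renamed = []
    for name in sorted(os.listdir(root)):
        src = os.path.join(root, name)
        if not (name.isdigit() and os.path.isdir(src)):
            continue  # not a CIK-named folder
        ticker = cik_to_ticker.get(int(name))
        if ticker is None:
            continue  # no ticker known for this CIK
        dst = os.path.join(root, ticker)
        if os.path.exists(dst):
            continue  # never overwrite an existing ticker folder
        os.rename(src, dst)
        renamed.append((name, ticker))
    return renamed


# Demo on throwaway folders; "XYZ" is a made-up ticker.
with tempfile.TemporaryDirectory() as root:
    for name in ("0000006201", "0000100517", "AAL"):
        os.mkdir(os.path.join(root, name))
    demo_renamed = rename_cik_folders(root, {6201: "AAL", 100517: "XYZ"})
    demo_listing = sorted(os.listdir(root))

print(demo_renamed)
print(demo_listing)
```

Folders whose ticker name already exists (such as `AAL` here) are skipped rather than merged, so those cases would need to be handled by hand.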