In [1]:
%%html
<style>
table {float:left}
</style>

# Download SEC EDGAR Quarterly (10-Q) Filing(s)

## Objective

Navigate through the EDGAR master indices and download the 10-Q financial statement for a given ```(CIK, year, quarter)```.

## Background
There are multiple ways to reach the target 10-Q file for a given year/quarter. One way is to use the master index CSV file (pipe-delimited) for the year/quarter, which is located at the following URL.

```https://www.sec.gov/Archives/edgar/full-index/${YEAR}/${QTR}/master.gz```
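As a minimal sketch, the URL pattern above can be built with a small helper. The function name `master_index_url` is hypothetical (it is not part of this notebook), and it assumes the quarter is given as an integer from 1 to 4.

```python
def master_index_url(year: int, quarter: int) -> str:
    """Return the URL of the quarterly master index (hypothetical helper)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return f"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/master.gz"


print(master_index_url(2006, 3))
```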

Each row in the master CSV gives the location of one filing (for example a 10-Q) for a specific company/CIK in that (year, quarter).

| CIK    | Company Name | Form Type          | Date Filed | Filename   |                                           
|--------|--------------|--------------------|------------|------------|
| 320193 | APPLE COMPUTER INC | 3            | 2006-09-07 | edgar/data/320193/0001181431-06-051734.txt |

## URL
The URL to 10-Q index.html for the ```(CIK, year, quarter)``` is ```https://sec.gov/Archives/edgar/data/${CIK}/${path}/index.html```

To get ```path```, take the file name at the end of the **Filename** field value and remove the ```-``` (hyphen) characters and the ```.txt``` suffix.
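The mapping from a **Filename** value to the index URL can be sketched as below. The helper name `filing_index_url` and the constant `EDGAR_ARCHIVES` are hypothetical names introduced for illustration; the input is the sample row from the table above.

```python
import posixpath

EDGAR_ARCHIVES = "https://sec.gov/Archives"  # assumed base URL, same host as the URL above


def filing_index_url(filename: str) -> str:
    """Map a master-index Filename value to the filing's index.html URL."""
    # 'edgar/data/320193/0001181431-06-051734.txt'
    #   -> directory 'edgar/data/320193', basename '0001181431-06-051734.txt'
    directory, basename = posixpath.split(filename)
    accession, _ext = posixpath.splitext(basename)   # drop the '.txt' suffix
    path = accession.replace("-", "")                # drop the hyphens
    return f"{EDGAR_ARCHIVES}/{directory}/{path}/index.html"


print(filing_index_url("edgar/data/320193/0001181431-06-051734.txt"))
```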

# Setup

In [2]:
import os
import re
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Constant

In [3]:
EDGAR_BASE_URL = "https://sec.gov/Archives"

---
# EDGAR

* Investopedia - [Where Can I Find a Company's Annual Report and Its SEC Filings?](https://www.investopedia.com/ask/answers/119.asp)

> If you want to dig deeper and go beyond the slick marketing version of the annual report found on corporate websites, you'll have to search through required filings made to the Securities and Exchange Commission. All publicly-traded companies in the U.S. must file regular financial reports with the SEC. These filings include the annual report (known as the 10-K), quarterly report (10-Q), and a myriad of other forms containing all types of financial data.

# Quarterly filing indices

* [Accessing EDGAR Data](https://www.sec.gov/os/accessing-edgar-data)

> Using the EDGAR index files  
Indexes to all public filings are available from 1994Q3 through the present and located in the following browsable directories:
> * https://www.sec.gov/Archives/edgar/daily-index/ — daily index files through the current year; (**DO NOT forget the trailing slash '/'**)
> * https://www.sec.gov/Archives/edgar/full-index/ — full indexes offer a "bridge" between quarterly and daily indexes, compiling filings from the beginning of the current quarter through the previous business day. At the end of the quarter, the full index is rolled into a static quarterly index.
> 
> Each directory and all child sub directories contain three files to assist in automated crawling of these directories. Note that these are not visible through directory browsing.
> * index.html (the web browser would normally receive these)
> * index.xml (an XML structured version of the same content)
> * index.json (a JSON structured vision of the same content)
> 
> Four types of indexes are available:
> * company — sorted by company name
> * form — sorted by form type
> * **master** — sorted by CIK number   # This looks to be **the only file in which the fields are delimited**
> * XBRL — list of submissions containing XBRL financial files, sorted by CIK number; these include Voluntary Filer Program submissions
> 
> The EDGAR indexes list the following information for each filing:
> * company name
> * form type
> * central index key (CIK)
> * date filed
> * file name (including folder path)

## Example

Full index files for 2006 QTR 3.
<img src="../image/edgar_full_index_quarter_2006QTR3.png" align="left" width="800"/>

## Download the master indices of 2006 QTR3

For instance, download the file for 2006 QTR3.

In [4]:
YEAR = 2006
QTR = "QTR3"
GZ_PATH = "../data/master.gz"
DATA_PATH = "../data/master"

In [5]:
%%bash -s "$YEAR" "$QTR" "$GZ_PATH"
curl \
--header "User-Agent:Company Name myname@company.com" \
--output $3 \
"https://www.sec.gov/Archives/edgar/full-index/$1/$2/master.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3025k  100 3025k    0     0  4236k      0 --:--:-- --:--:-- --:--:-- 4230k


In [6]:
%%bash -s "$GZ_PATH" "$DATA_PATH"
gunzip -c $1 > $2
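The download and decompression steps can also be done in Python, which avoids the shell dependency. This is a sketch under assumptions: the helper names are hypothetical, it uses the standard library `urllib` instead of `curl`, and decoding as `latin-1` is an assumption chosen because it never raises on unexpected bytes.

```python
import gzip
import urllib.request


def decode_master_gz(raw: bytes) -> str:
    """Decompress the gzip payload and decode it as text."""
    # latin-1 is an assumption: it maps every byte to a character,
    # so decoding cannot fail on an unexpected non-ASCII byte.
    return gzip.decompress(raw).decode("latin-1")


def download_master_index(year: int, qtr: str, user_agent: str) -> str:
    """Fetch master.gz for (year, qtr) and return its text content."""
    url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/{qtr}/master.gz"
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=60) as response:
        return decode_master_gz(response.read())


# Usage (performs a network request, so it is left commented out):
# text = download_master_index(2006, "QTR3", "Company Name myname@company.com")
```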

In [7]:
!head -n 16 $DATA_PATH 

Description:           Master Index of EDGAR Dissemination Feed
Last Data Received:    September 30, 2006
Comments:              webmaster@sec.gov
Anonymous FTP:         ftp://ftp.sec.gov/edgar/
Cloud HTTP:            https://www.sec.gov/Archives/

 
 
 
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000045|NICHOLAS FINANCIAL INC|10-Q|2006-08-14|edgar/data/1000045/0001193125-06-172516.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2006-07-27|edgar/data/1000045/0001193125-06-154794.txt
1000069|TEXAS CAPITAL VALUE FUNDS INC|497|2006-09-12|edgar/data/1000069/0001000069-06-000017.txt
1000069|TEXAS CAPITAL VALUE FUNDS INC|N-PX|2006-08-23|edgar/data/1000069/0001000069-06-000013.txt
1000069|TEXAS CAPITAL VALUE FUNDS INC|N-Q|2006-08-29|edgar/data/1000069/0001000069-06-000015.txt


## Remove the non-csv lines

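Note that in sed, ```\s``` is not a whitespace shorthand inside a bracket expression, so ```[\s]``` does not match a space; it matches the literal characters listed in the brackets (such as ```s```). The command below therefore spells out the space and tab as ```[ \t]```.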

In [8]:
%%bash -s "$DATA_PATH" 
sed -i -e '1,/^[ \t]*$/d; /^[- \t]*$/d' $1

In [9]:
!head -n 6 $DATA_PATH

CIK|Company Name|Form Type|Date Filed|Filename
1000045|NICHOLAS FINANCIAL INC|10-Q|2006-08-14|edgar/data/1000045/0001193125-06-172516.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2006-07-27|edgar/data/1000045/0001193125-06-154794.txt
1000069|TEXAS CAPITAL VALUE FUNDS INC|497|2006-09-12|edgar/data/1000069/0001000069-06-000017.txt
1000069|TEXAS CAPITAL VALUE FUNDS INC|N-PX|2006-08-23|edgar/data/1000069/0001000069-06-000013.txt
1000069|TEXAS CAPITAL VALUE FUNDS INC|N-Q|2006-08-29|edgar/data/1000069/0001000069-06-000015.txt
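For readers without GNU sed, the same clean-up can be sketched in Python. The function name `strip_master_preamble` is hypothetical, and the sample text is a shortened version of the file head shown earlier; the logic mirrors the two sed expressions (delete line 1 through the first blank line, then delete lines made only of hyphens, spaces, or tabs).

```python
import re


def strip_master_preamble(text: str) -> str:
    """Keep only the header row and the data rows of a master index file."""
    lines = text.splitlines()
    # sed '1,/^[ \t]*$/d': drop line 1 through the first blank line after it
    # (through the end of the file if there is no blank line).
    end = next(
        (i for i, line in enumerate(lines) if i > 0 and re.fullmatch(r"[ \t]*", line)),
        len(lines) - 1,
    )
    body = lines[end + 1:]
    # sed '/^[- \t]*$/d': drop lines made only of hyphens, spaces, or tabs.
    kept = [line for line in body if not re.fullmatch(r"[- \t]*", line)]
    return "".join(line + "\n" for line in kept)


sample = (
    "Description:           Master Index of EDGAR Dissemination Feed\n"
    "Last Data Received:    September 30, 2006\n"
    "\n"
    " \n"
    "CIK|Company Name|Form Type|Date Filed|Filename\n"
    "--------------------\n"
    "1000045|NICHOLAS FINANCIAL INC|10-Q|2006-08-14|edgar/data/1000045/0001193125-06-172516.txt\n"
)
print(strip_master_preamble(sample), end="")
```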


---
# Examine the index file (master)

In [10]:
"""
df = pd.read_csv(
    DATA_PATH,
    skiprows=[
        *range(5),    # First N descriptive lines
        10            # -----
    ],
    skip_blank_lines=True,
    header=0,         # The 1st data line after omitting skiprows and blank lines.
    sep='|'
)
"""
df = pd.read_csv(
    DATA_PATH,
    skip_blank_lines=True,
    header=0,         # The 1st data line after omitting skiprows and blank lines.
    sep='|',
    parse_dates=['Date Filed']
)
print(df.dtypes)
df[:5]

CIK                      int64
Company Name            object
Form Type               object
Date Filed      datetime64[ns]
Filename                object
dtype: object


Unnamed: 0,CIK,Company Name,Form Type,Date Filed,Filename
0,1000045,NICHOLAS FINANCIAL INC,10-Q,2006-08-14,edgar/data/1000045/0001193125-06-172516.txt
1,1000045,NICHOLAS FINANCIAL INC,8-K,2006-07-27,edgar/data/1000045/0001193125-06-154794.txt
2,1000069,TEXAS CAPITAL VALUE FUNDS INC,497,2006-09-12,edgar/data/1000069/0001000069-06-000017.txt
3,1000069,TEXAS CAPITAL VALUE FUNDS INC,N-PX,2006-08-23,edgar/data/1000069/0001000069-06-000013.txt
4,1000069,TEXAS CAPITAL VALUE FUNDS INC,N-Q,2006-08-29,edgar/data/1000069/0001000069-06-000015.txt


## Apple Computer 2006 QTR3 Filings

In [11]:
df[df['Company Name'] == 'APPLE COMPUTER INC']

Unnamed: 0,CIK,Company Name,Form Type,Date Filed,Filename
159958,320193,APPLE COMPUTER INC,3,2006-09-07,edgar/data/320193/0001181431-06-051734.txt
159959,320193,APPLE COMPUTER INC,4/A,2006-07-26,edgar/data/320193/0001181431-06-043774.txt
159960,320193,APPLE COMPUTER INC,4,2006-07-06,edgar/data/320193/0001181431-06-040870.txt
159961,320193,APPLE COMPUTER INC,4,2006-07-24,edgar/data/320193/0001181431-06-043325.txt
159962,320193,APPLE COMPUTER INC,4,2006-07-26,edgar/data/320193/0001181431-06-043775.txt
159963,320193,APPLE COMPUTER INC,4,2006-07-28,edgar/data/320193/0001181431-06-044234.txt
159964,320193,APPLE COMPUTER INC,4,2006-08-02,edgar/data/320193/0001181431-06-044978.txt
159965,320193,APPLE COMPUTER INC,4,2006-08-08,edgar/data/320193/0001181431-06-046420.txt
159966,320193,APPLE COMPUTER INC,4,2006-08-08,edgar/data/320193/0001181431-06-046421.txt
159967,320193,APPLE COMPUTER INC,4,2006-08-17,edgar/data/320193/0001181431-06-048402.txt


# Apple Computer quarterly financial statement (10-Q) index HTML

From the index file, get the URL to the 10-Q (quarterly financial statement) file for 2006 QTR3.

In [12]:
# --------------------------------------------------------------------------------
# Create URL to 10-Q index from the master index record.
# NOTE: str.contains("10-Q") is a substring match, so it also selects form types
# such as "10-Q/A" and "NT 10-Q"; use == "10-Q" for an exact match.
# --------------------------------------------------------------------------------
path_to_10Q = df[
    (df['Company Name'] == 'APPLE COMPUTER INC') & (df['Form Type'].str.contains("10-Q"))
]['Filename'].values[0]
print(f"Text index of the 10-Q is {path_to_10Q}")

# Drop the ".txt" suffix and the hyphens to get the filing directory.
path_to_10Q_index_html = os.path.splitext(path_to_10Q)[0].replace("-", "") + "/index.html"
url_to_10Q_index_html = f"{EDGAR_BASE_URL}/{path_to_10Q_index_html}"
print(f"URL to the index of 10-Q is {url_to_10Q_index_html}")

Text index of the 10-Q is edgar/data/320193/0001104659-06-053743.txt
URL to the index of 10-Q is https://sec.gov/Archives/edgar/data/320193/000110465906053743/index.html
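Because the filter above matches on a substring, it can select filings that are related to a 10-Q without being one. The sketch below illustrates the difference with the standard library only; the three rows are made up for illustration and are not taken from the real index.

```python
import csv
import io

# Illustrative rows in the master-index layout (made-up data).
# "10-Q/A" and "NT 10-Q" both contain the substring "10-Q".
sample = """CIK|Company Name|Form Type|Date Filed|Filename
1|EXAMPLE CO|10-Q|2006-08-14|edgar/data/1/0000000001-06-000001.txt
1|EXAMPLE CO|10-Q/A|2006-09-01|edgar/data/1/0000000001-06-000002.txt
2|SAMPLE CORP|NT 10-Q|2006-08-10|edgar/data/2/0000000002-06-000001.txt
"""

rows = list(csv.DictReader(io.StringIO(sample), delimiter="|"))

substring_hits = [r["Form Type"] for r in rows if "10-Q" in r["Form Type"]]
exact_hits = [r["Form Type"] for r in rows if r["Form Type"] == "10-Q"]

print(substring_hits)  # ['10-Q', '10-Q/A', 'NT 10-Q']
print(exact_hits)      # ['10-Q']
```

Which behavior is wanted depends on the use case: an exact match returns only the quarterly report itself, while the substring match also returns amendments and late-filing notifications.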


In [13]:
# --------------------------------------------------------------------------------
# Download the HTML from the URL to 10-Q index.
# --------------------------------------------------------------------------------
headers = {"User-Agent": "Company Name myname@company.com"}
response = requests.get(url_to_10Q_index_html, headers=headers)

if response.status_code == 200:
    content_html = response.content.decode("utf-8")
else:
    # Fail loudly: later cells need content_html, which would otherwise be undefined.
    raise RuntimeError(f"GET {url_to_10Q_index_html} failed with status {response.status_code}")

In [14]:
# --------------------------------------------------------------------------------
# Display the HTML 10-Q index part
# --------------------------------------------------------------------------------
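# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(content_html, "html.parser")
div = BeautifulSoup(
"""<html>
<body>
{}
</body>
</html>
""".format(soup.find("div", {"id": "main-content"})),
    "html.parser"
)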

In [15]:
from IPython.display import HTML
HTML(data=div.prettify())

## Index in JSON format

> Each directory and all child sub directories contain three files to assist in automated crawling of these directories. Note that these are not visible through directory browsing.
> * index.html (the web browser would normally receive these)
> * index.xml (an XML structured version of the same content)
> * index.json (a JSON structured vision of the same content)

For parsing, get the JSON index file.

In [16]:
url_to_10Q_index_json = re.sub(r"\.html$", ".json", url_to_10Q_index_html)

# --------------------------------------------------------------------------------
# Download the JSON from the URL to 10-Q index.
# --------------------------------------------------------------------------------
headers = {"User-Agent": "Company Name myname@company.com"}
response = requests.get(url_to_10Q_index_json, headers=headers)

if response.status_code == 200:
    content_json = response.json()
    print(json.dumps(content_json, indent=4))
else:
    print(f"JSON from {url_to_10Q_index_json} failed with status {response.status_code}")

{
    "directory": {
        "item": [
            {
                "last-modified": "2006-08-10 20:45:42",
                "name": "0001104659-06-053743-index-headers.html",
                "type": "text.gif",
                "size": ""
            },
            {
                "last-modified": "2006-08-10 20:45:42",
                "name": "0001104659-06-053743-index.html",
                "type": "text.gif",
                "size": ""
            },
            {
                "last-modified": "2006-08-10 20:45:42",
                "name": "0001104659-06-053743.txt",
                "type": "text.gif",
                "size": ""
            },
            {
                "last-modified": "2006-08-10 20:45:42",
                "name": "a06-16145_1nt10q.htm",
                "type": "text.gif",
                "size": "33456"
            }
        ],
        "name": "/Archives/edgar/data/320193/000110465906053743",
        "parent-dir": "/Archives/edgar/data/320193"
    }
}


In [17]:
pd.DataFrame(content_json['directory']['item'])

Unnamed: 0,last-modified,name,type,size
0,2006-08-10 20:45:42,0001104659-06-053743-index-headers.html,text.gif,
1,2006-08-10 20:45:42,0001104659-06-053743-index.html,text.gif,
2,2006-08-10 20:45:42,0001104659-06-053743.txt,text.gif,
3,2006-08-10 20:45:42,a06-16145_1nt10q.htm,text.gif,33456.0
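From this listing, the URLs of the filing documents can be derived. The sketch below is a heuristic and an assumption, not an EDGAR rule: it keeps the `.htm`/`.html` files whose names do not contain `-index`. The function name `document_urls` is hypothetical, and the sample dictionary is a trimmed copy of the JSON shown above.

```python
def document_urls(index_json: dict, base: str = "https://sec.gov") -> list:
    """List URLs of the .htm/.html documents in a filing directory (heuristic)."""
    directory = index_json["directory"]
    urls = []
    for item in directory["item"]:
        name = item["name"]
        # Skip the generated index pages; keep the remaining HTML files.
        if name.endswith((".htm", ".html")) and "-index" not in name:
            urls.append(f"{base}{directory['name']}/{name}")
    return urls


sample = {
    "directory": {
        "item": [
            {"name": "0001104659-06-053743-index-headers.html"},
            {"name": "0001104659-06-053743-index.html"},
            {"name": "0001104659-06-053743.txt"},
            {"name": "a06-16145_1nt10q.htm"},
        ],
        "name": "/Archives/edgar/data/320193/000110465906053743",
    }
}
print(document_urls(sample))
```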
