<font size="6"> **Download SEC 10-K Filings** </font>

In [2]:
import numpy as np
import pandas as pd
import pickle
import pprint


from tqdm import tqdm

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
%run ../nb_config.py

In [5]:
import os

import numpy as np
import scipy

from src import utils
from src.load_data import load_sec10k, io_utils

In [6]:
cfg = utils.read_conf()

In [7]:
DOC_TYPE = '10-K'
newest_pricing_data = '2018-01-01'

In [8]:
OUTPATH = os.path.join(io_utils.raw_path, 'sec_10k', '')
OUTFILE1 = 'metadata.pkl'
OUTFILE2 = 'sec_10k.pkl'

# Get 10-Ks
We'll be running NLP analysis on 10-K documents. To do that, we first need to download the documents. For this project, we'll download 10-Ks for a few companies. To look up documents for these companies, we'll use their CIK. If you would like to run this against other stocks, we've provided the dict `additional_cik` for more stocks. However, the more stocks you try, the longer it will take to run.

In [10]:
load_sec10k.cik_lookup

{'AMZN': '0001018724',
 'BMY': '0000014272',
 'CNP': '0001130310',
 'CVX': '0000093410',
 'FL': '0000850209',
 'FRT': '0000034903',
 'HON': '0000773840'}

## Get list of 10-K URLs
The SEC limits the number of calls that can be made to its website per second. The `SecAPI` class caches data from the SEC and prevents you from going over the limit.
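To make the idea concrete, here is a minimal sketch of a client that caches responses and spaces out requests. It is not the actual `SecAPI` implementation: the class name `RateLimitedCache`, the injectable `fetch` callable, and the default `min_interval` of 0.1 seconds are all assumptions made for illustration, and the real limit should be checked against the SEC's current access policy.

```python
import time
from typing import Callable, Dict, Optional


class RateLimitedCache:
    """Hypothetical stand-in for SecAPI: caches responses and spaces out fetches."""

    def __init__(self,
                 fetch: Callable[[str], str],
                 min_interval: float = 0.1,  # assumed spacing between requests, in seconds
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_call: Optional[float] = None

    def get(self, url: str) -> str:
        # Repeated requests are served from memory and never reach the network.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_interval seconds have passed since the last fetch.
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        self._cache[url] = self._fetch(url)
        return self._cache[url]
```

Passing the fetch function in as an argument keeps the sketch runnable without network access. In practice it could wrap an HTTP call that returns the response text.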

In [11]:
sec_api = load_sec10k.SecAPI()

With the class constructed, let's pull a list of filed 10-Ks from the SEC for each company using the `get_sec_data` function, then display some of the results. We'll use Amazon as the example.

In [20]:
sec_data = {}
sec_dates = {}
for ticker, cik in load_sec10k.cik_lookup.items():
    sec_data[ticker] = load_sec10k.get_sec_data(
        sec_api=sec_api, cik=cik, newest_pricing_data=newest_pricing_data, doc_type=DOC_TYPE)
    sec_dates[ticker] = [x[2] for x in sec_data[ticker]]

In [24]:
example_ticker = 'AMZN'
sec_dates[example_ticker]

['2017-02-10',
 '2016-01-29',
 '2015-01-30',
 '2014-01-31',
 '2013-01-30',
 '2012-02-01',
 '2011-02-28',
 '2011-01-28',
 '2010-01-29',
 '2009-01-30',
 '2008-02-11',
 '2007-02-16',
 '2006-02-17',
 '2005-03-11',
 '2004-02-25',
 '2003-02-19',
 '2002-01-24',
 '2001-03-23',
 '2000-09-08',
 '2000-03-29',
 '1999-03-05',
 '1998-03-30']

In [25]:
pprint.pprint(sec_data[example_ticker][:5])

[('https://www.sec.gov/Archives/edgar/data/1018724/000101872417000011/0001018724-17-000011-index.htm',
  '10-K',
  '2017-02-10'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872416000172/0001018724-16-000172-index.htm',
  '10-K',
  '2016-01-29'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872415000006/0001018724-15-000006-index.htm',
  '10-K',
  '2015-01-30'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000101872414000006/0001018724-14-000006-index.htm',
  '10-K',
  '2014-01-31'),
 ('https://www.sec.gov/Archives/edgar/data/1018724/000119312513028520/0001193125-13-028520-index.htm',
  '10-K',
  '2013-01-30')]


## Download 10-Ks
As you can see, this is a list of URLs. Each URL points to a file that contains metadata related to one filing. Since we don't care about the metadata, we'll pull the filing itself by replacing the index URL with the filing URL.
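The URL rewrite can also be written as a small helper, sketched here under the assumption that index URLs always end in `-index.htm` or `-index.html`. The function name `index_url_to_filing_url` is hypothetical and not part of `load_sec10k`. The download cell that follows does the same thing inline with two chained `replace` calls.

```python
def index_url_to_filing_url(index_url: str) -> str:
    """Map a filing index URL to the URL of the full-text submission."""
    # Check the longer suffix first so '-index.html' is not cut short.
    for suffix in ('-index.html', '-index.htm'):
        if index_url.endswith(suffix):
            return index_url[:-len(suffix)] + '.txt'
    raise ValueError('not an index URL: {}'.format(index_url))
```

Raising on an unexpected URL makes a format change visible immediately, rather than silently requesting the wrong address.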

In [26]:
raw_fillings_by_ticker = {}

for ticker, data in sec_data.items():
    raw_fillings_by_ticker[ticker] = {}
    for index_url, file_type, file_date in tqdm(data, desc='Downloading {} Fillings'.format(ticker), unit='filling'):
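        if file_type == DOC_TYPE:
            # '-index.htm' becomes '.txt'; an '-index.html' URL would leave '.txtl', which the second replace fixes.
            file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')

            raw_fillings_by_ticker[ticker][file_date] = sec_api.get(file_url)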

Downloading AMZN Fillings: 100%|██████████| 22/22 [00:04<00:00,  4.62filling/s]
Downloading BMY Fillings: 100%|██████████| 27/27 [00:06<00:00,  4.08filling/s]
Downloading CNP Fillings: 100%|██████████| 19/19 [00:04<00:00,  4.27filling/s]
Downloading CVX Fillings: 100%|██████████| 25/25 [00:06<00:00,  3.64filling/s]
Downloading FL Fillings: 100%|██████████| 22/22 [00:04<00:00,  4.55filling/s]
Downloading FRT Fillings: 100%|██████████| 29/29 [00:08<00:00,  3.44filling/s]
Downloading HON Fillings: 100%|██████████| 25/25 [00:06<00:00,  3.59filling/s]


In [27]:
print('Example Document:\n\n{}...'.format(next(iter(raw_fillings_by_ticker[example_ticker].values()))[:1000]))

Example Document:

<SEC-DOCUMENT>0001018724-17-000011.txt : 20170210
<SEC-HEADER>0001018724-17-000011.hdr.sgml : 20170210
<ACCEPTANCE-DATETIME>20170209175636
ACCESSION NUMBER:		0001018724-17-000011
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		92
CONFORMED PERIOD OF REPORT:	20161231
FILED AS OF DATE:		20170210
DATE AS OF CHANGE:		20170209

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			AMAZON COM INC
		CENTRAL INDEX KEY:			0001018724
		STANDARD INDUSTRIAL CLASSIFICATION:	RETAIL-CATALOG & MAIL-ORDER HOUSES [5961]
		IRS NUMBER:				911646860
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	000-22513
		FILM NUMBER:		17588807

	BUSINESS ADDRESS:	
		STREET 1:		410 TERRY AVENUE NORTH
		CITY:			SEATTLE
		STATE:			WA
		ZIP:			98109
		BUSINESS PHONE:		2062661000

	MAIL ADDRESS:	
		STREET 1:		410 TERRY AVENUE NORTH
		CITY:			SEATTLE
		STATE:			WA
		ZIP:			98109
</SEC-HEADER>
<DOCUMENT>
<TYPE>10-K
<S

# Parse Documents

## Get Documents
Each filing is broken into several associated documents, each delimited in the filing by the tags `<DOCUMENT>` and `</DOCUMENT>`. These documents do not overlap, so each `</DOCUMENT>` tag comes after its `<DOCUMENT>` tag with no other `<DOCUMENT>` tag in between.
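Because the documents never nest, a non-greedy regular expression is enough to split a filing. The following is a minimal sketch of that idea. The name `split_documents` is hypothetical, and the real `load_sec10k.get_documents` may be implemented differently.

```python
import re
from typing import List

# Non-greedy match so each <DOCUMENT> pairs with the nearest </DOCUMENT>.
# re.DOTALL lets '.' span the newlines inside a document.
DOCUMENT_PATTERN = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL)


def split_documents(filing_text: str) -> List[str]:
    """Return the text between each <DOCUMENT> ... </DOCUMENT> pair, in order."""
    return DOCUMENT_PATTERN.findall(filing_text)
```

Text outside the tags, such as the `<SEC-HEADER>` block, is dropped. Each returned string starts right after the opening tag.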


In [28]:
filling_documents_by_ticker = {}

for ticker, raw_fillings in raw_fillings_by_ticker.items():
    filling_documents_by_ticker[ticker] = {}
    for file_date, filling in tqdm(raw_fillings.items(), desc='Getting Documents from {} Fillings'.format(ticker), unit='filling'):
        filling_documents_by_ticker[ticker][file_date] = load_sec10k.get_documents(filling)

Getting Documents from AMZN Fillings: 100%|██████████| 17/17 [00:00<00:00, 73.78filling/s]
Getting Documents from BMY Fillings: 100%|██████████| 23/23 [00:00<00:00, 42.79filling/s]
Getting Documents from CNP Fillings: 100%|██████████| 15/15 [00:00<00:00, 44.10filling/s]
Getting Documents from CVX Fillings: 100%|██████████| 21/21 [00:00<00:00, 43.50filling/s]
Getting Documents from FL Fillings: 100%|██████████| 16/16 [00:00<00:00, 71.92filling/s]
Getting Documents from FRT Fillings: 100%|██████████| 19/19 [00:00<00:00, 67.20filling/s]
Getting Documents from HON Fillings: 100%|██████████| 20/20 [00:00<00:00, 53.62filling/s]


In [29]:
print('\n\n'.join([
    'Document {} Filed on {}:\n{}...'.format(doc_i, file_date, doc[:200])
    for file_date, docs in filling_documents_by_ticker[example_ticker].items()
    for doc_i, doc in enumerate(docs)][:3]))

Document 0 Filed on 2017-02-10:

<TYPE>10-K
<SEQUENCE>1
<FILENAME>amzn-20161231x10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
	<he...

Document 1 Filed on 2017-02-10:

<TYPE>EX-12.1
<SEQUENCE>2
<FILENAME>amzn-20161231xex121.htm
<DESCRIPTION>COMPUTATION OF RATIO OF EARNINGS TO FIXED CHARGES
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http:...

Document 2 Filed on 2017-02-10:

<TYPE>EX-21.1
<SEQUENCE>3
<FILENAME>amzn-20161231xex211.htm
<DESCRIPTION>LIST OF SIGNIFICANT SUBSIDIARIES
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/h...


## Get Document Types
Now that we have all the documents, we want to find the 10-K form within each 10-K filing. The `get_document_type` function returns the type of a given document. The document type is located on a line with the `<TYPE>` tag. For example, a form of type "TEST" would have the line `<TYPE>TEST`. The cell below compares the result against `DOC_TYPE` (`'10-K'`), so the type has to come back in the same case as it appears in the filing.

With the `get_document_type` function, we'll filter out all non-10-K documents.
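A minimal sketch of such a type lookup, assuming the type is everything after `<TYPE>` on that line. The name `extract_document_type` is hypothetical and the real `load_sec10k.get_document_type` may differ. This version returns the type exactly as written, which is what a comparison against `DOC_TYPE` needs. Call `.lower()` on the result if a lowercase form is wanted.

```python
import re

# re.MULTILINE makes '^' match at the start of any line, not only the first.
TYPE_PATTERN = re.compile(r'^<TYPE>([^\n]+)', re.MULTILINE)


def extract_document_type(document: str) -> str:
    """Return the text following the first <TYPE> tag, or '' if there is none."""
    match = TYPE_PATTERN.search(document)
    return match.group(1).strip() if match else ''
```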

In [39]:
ten_ks_by_ticker = {}

for ticker, filling_documents in filling_documents_by_ticker.items():
    ten_ks_by_ticker[ticker] = []
    for file_date, documents in filling_documents.items():
        for document in documents:
            if load_sec10k.get_document_type(document) == DOC_TYPE:
                ten_ks_by_ticker[ticker].append({
                    'cik': load_sec10k.cik_lookup[ticker],
                    'file': document,
                    'file_date': file_date})

In [31]:
load_sec10k.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['cik', 'file', 'file_date'])

[
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2016123...
    file_date: '2017-02-10'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2015123...
    file_date: '2016-01-29'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2014123...
    file_date: '2015-01-30'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>amzn-2013123...
    file_date: '2014-01-31'},
  {
    cik: '0001018724'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>d445434d10k....
    file_date: '2013-01-30'},
]


In [36]:
ten_ks_by_ticker[example_ticker][4]['file_date']

'2013-01-30'

In [37]:
ten_ks_by_ticker[example_ticker][4]['file'][:1000]

'\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>d445434d10k.htm\n<DESCRIPTION>FORM 10-K\n<TEXT>\n<HTML><HEAD>\n<TITLE>Form 10-K</TITLE>\n</HEAD>\n <BODY BGCOLOR="WHITE">\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n\n <P STYLE="line-height:0px;margin-top:0px;margin-bottom:0px;border-bottom:0.5pt solid #000000">&nbsp;</P>\n<P STYLE="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&nbsp;</P> <P STYLE="margin-top:4px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New Roman" SIZE="5"><B>UNITED STATES </B></FONT></P>\n<P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New Roman" SIZE="5"><B>SECURITIES AND EXCHANGE COMMISSION </B></FONT></P> <P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="center"><FONT\nSTYLE="font-family:Times New Roman" SIZE="3"><B>Washington, D.C. 20549 </B></FONT></P> <P STYLE="font-size:6px;margin-top:0px;margin-bottom:0px">&nbsp;</P><center>\n<P STYLE="line-height:6px;

# Write Raw 10-Ks

In [34]:
metadata = {'newest_pricing_data': newest_pricing_data,
            'tickers': load_sec10k.cik_lookup}

In [35]:
with open(os.path.join(OUTPATH, OUTFILE1), 'wb') as file:
    pickle.dump(metadata, file)

In [36]:
with open(os.path.join(OUTPATH, OUTFILE2), 'wb') as file:
    pickle.dump(ten_ks_by_ticker, file)