# Scraping EDGAR for Financial Documents

In this notebook, I write code to retrieve 10-K and 10-Q reports for a company, given its ticker. Running it requires the sec_edgar_downloader package, which can be installed via pip:

In [1]:
#!pip install sec-edgar-downloader

I was able to pull financial documents for all companies that submitted such documents via the SEC website. However, this was done manually via copy/paste, a step that would need to be automated to extend the approach to other sectors or industries.

In [1]:
import pandas as pd
import os
import pickle
import h5py
import re

import requests
import json
import time

from bs4 import BeautifulSoup
from sec_edgar_downloader import Downloader

In [35]:
# Open the file
df = pd.read_hdf('1-rets.h5')
df.head()

Unnamed: 0,A,AABA,AAL,AAMRQ,AAP,AAPL,ABBV,ABC,ABI,ABKFQ,...,XRX,XTO,XYL,YNR,YRCW,YUM,ZBH,ZBRA,ZION,ZTS
1996-01-02,,,,,,0.007843,,,,,...,-0.00365,,,,,,,,,
1996-01-03,,,,,,0.0,,,,,...,-0.000915,,,,,,,,,
1996-01-04,,,,,,-0.017508,,,,,...,-0.021082,,,,,,,,,
1996-01-05,,,,,,0.08515,,,,,...,-0.011236,,,,,,,,,
1996-01-08,,,,,,0.010948,,,,,...,-0.001894,,,,,,,,,


# Pulling Data From All Companies

Now, we will pull the data for all the companies found in the returns file loaded above.

In [10]:
response = json.loads(requests.get("https://www.sec.gov/files/company_tickers.json").text)

In [11]:
mapping = pd.DataFrame(response).T
mapping.head()

Unnamed: 0,cik_str,ticker,title
0,320193,AAPL,Apple Inc.
1,789019,MSFT,MICROSOFT CORP
2,1652044,GOOG,Alphabet Inc.
3,1018724,AMZN,AMAZON COM INC
4,1326801,FB,Facebook Inc
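
The `cik_str` column holds plain integers, while the submissions URLs used later in this notebook expect a 10-digit, zero-padded CIK. As a minimal sketch (the `sample` payload and the `to_ticker_cik` helper are hypothetical, modeled on the rows displayed above), the mapping could be normalized like this:

```python
import pandas as pd

# Toy payload shaped like the rows displayed above (the real file has many more entries).
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
}

def to_ticker_cik(payload):
    """Return a Series mapping ticker -> 10-digit, zero-padded CIK string."""
    frame = pd.DataFrame(payload).T
    padded = frame["cik_str"].astype(int).astype(str).str.zfill(10)
    return pd.Series(padded.values, index=frame["ticker"].values, name="cik")

sec_map = to_ticker_cik(sample)
print(sec_map["AAPL"])
```

Keeping CIKs as strings avoids losing the leading zeros that an integer column would drop.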


In [12]:
ticks = df.columns
len(ticks)

1110

In [13]:
no_matches = [x for x in ticks if x not in mapping['ticker'].values]
len(no_matches)

413

Wow! So 413 of the 1110 tickers have no match in the SEC's own ticker mapping. That is a surprisingly large share. Let's try a dataset provided by rankandfiled, found here: http://rankandfiled.com/#/data/cusips
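
The membership check above can be wrapped in a small helper that reports both sides of the split. This is only a sketch: `split_by_coverage` is a hypothetical name, and the toy lists stand in for `df.columns` and `mapping['ticker']`.

```python
def split_by_coverage(tickers, known_tickers):
    """Split tickers into (matched, unmatched) against a collection of known tickers."""
    known = set(known_tickers)  # set lookup avoids rescanning the array for every ticker
    matched = [t for t in tickers if t in known]
    unmatched = [t for t in tickers if t not in known]
    return matched, unmatched

# Toy inputs; in the notebook these would be df.columns and mapping['ticker'].
matched, unmatched = split_by_coverage(["AAPL", "AABA", "ENRNQ"], ["AAPL", "MSFT"])
print(len(unmatched), "of", len(matched) + len(unmatched), "tickers unmatched")
```

Returning the unmatched tickers as a list makes it easy to feed them into a second mapping source.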

In [14]:
mapping = pd.read_csv("2-cusip_ticker.csv", sep='|')
mapping.head()

Unnamed: 0,Issuer,Ticker,CUSIP,CIK
0,ALCOA INC,AA,013817101,4281.0
1,ALTANA AKTIENGESELLSCHAFT SPON,AAA,02143N103,
2,AAA PUB ADJUSTING GRP INC NEW,AAAA,00249C203,
3,ASIA AUTOMOTIVE ACQUISITION CO,AAAC,04519K101,1332552.0
4,ASIA AUTOMOTIVE ACQUISITION CO,AAACU,04519K200,1332552.0


In [15]:
matches = mapping[mapping['Ticker'].isin(ticks)]
matches.head()

Unnamed: 0,Issuer,Ticker,CUSIP,CIK
55,AMERICAN AIRLINES GROUP INC CO,AAL,02376R102,6201.0
67,AMR CORP,AAMRQ,001765106,
80,ADVANCE AUTO PARTS INC,AAP,00751Y106,1158449.0
84,APPLE INC;COM NPV,AAPL,037833100,320193.0
132,ABBVIE INC COM STK (DE),ABBV,00287Y109,1551152.0


In [16]:
1110 - len(matches)

146

There appear to be fewer tickers missing here than in the SEC's own mapping!
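
Since neither source is complete, one option is to combine them and fall back to the second source only where the first has no CIK. The sketch below assumes both mappings have already been reduced to ticker-indexed Series of CIK strings; `merge_mappings` and the toy values are hypothetical.

```python
import pandas as pd

def merge_mappings(primary, fallback):
    """Prefer CIKs from `primary`; fill gaps (missing or null) from `fallback`."""
    return primary.combine_first(fallback)

# Toy Series indexed by ticker; None marks a ticker without a known CIK.
sec = pd.Series({"AAPL": "0000320193"}, name="cik")
cusip = pd.Series({"AAPL": "0000320193", "AAL": "0000006201", "AAMRQ": None}, name="cik")

combined = merge_mappings(sec, cusip)
print(combined)
```

`combine_first` keeps the union of both indexes, so tickers known to only one source still appear in the result.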

# Other Mapping

Let's try another source to see if we can do even better.

In [44]:
response = requests.get("https://sec.report/Ticker/AABA",
                       headers=heads).text

In [47]:
soup = BeautifulSoup(response, 'html.parser')

In [52]:
soup.find_all('title')

[<title>AABA Stock - Altaba Inc. SEC Filings</title>]

In [58]:
soup.find_all("h2")

[<h2>SEC CIK 0001011006</h2>, <h2>Ticker: AABA</h2>]

In [70]:
start = time.time()
mapping = {}
for counter, ticker in enumerate(tickers, start=1):
    response = requests.get("https://sec.report/Ticker/" + ticker,
                            headers=heads).text
    soup = BeautifulSoup(response, 'html.parser')
    # Store the <h2> headers as a plain string; the CIK is extracted with a regex later
    mapping[ticker] = str(soup.find_all("h2"))

    # Checkpoint every 100 tickers in case the run is interrupted
    if counter % 100 == 0:
        with open('2-ticker_cik_mapping.pickle', 'wb') as handle:
            pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)

end = time.time()

In [71]:
print(end-start)

1512.5129671096802
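
At roughly 1.4 seconds per ticker (about 1512 seconds for 1110 tickers), an interrupted run is expensive to repeat. The following is a sketch of a checkpointed loop with the network call injected as a function, so it can be tried without hitting the site. The names `scrape_with_checkpoints`, `fetch`, `every`, and `delay` are hypothetical and not part of any library.

```python
import pickle
import time

def scrape_with_checkpoints(tickers, fetch, path, every=100, delay=0.0):
    """Call fetch(ticker) for each ticker, pickling partial results every `every` items."""
    results = {}
    for count, ticker in enumerate(tickers, start=1):
        try:
            results[ticker] = fetch(ticker)
        except Exception:
            # Record the failure as None and keep going; failed tickers can be retried later.
            results[ticker] = None
        if count % every == 0:
            with open(path, "wb") as handle:
                pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)
        time.sleep(delay)
    # Final save so tickers after the last checkpoint are not lost.
    with open(path, "wb") as handle:
        pickle.dump(results, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return results
```

In the notebook, `fetch` would wrap the `requests.get` and `BeautifulSoup` calls from the cell above.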


In [89]:
#with open('2-ticker_cik_mapping.pickle', 'wb') as handle:
#            pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [2]:
mapping = pd.read_pickle('2-ticker_cik_mapping.pickle')

In [3]:
list(mapping.values())[0]

'[<h2>SEC CIK 0001090872</h2>, <h2>Ticker: A</h2>]'

In [4]:
# Pull out only the CIKs (entries without a 10-digit match are left unchanged)
for key in mapping.keys():
    match = re.search(r"\d{10}", mapping[key])
    if match:
        mapping[key] = match[0]

In [5]:
map_df = pd.DataFrame([mapping]).T
map_df

Unnamed: 0,0
A,0001090872
AABA,0001011006
AAL,0000006201
AAMRQ,0000006201
AAP,0001158449
...,...
YUM,0001041061
ZBH,0001136869
ZBRA,0000877212
ZION,0000109380


Let's now try to fill in the gaps. This already seems like a good start:

In [220]:
missing = map_df[[len(x) != 10 for x in map_df[0]]].index
missing

Index(['ABX', 'AFS.A', 'AS', 'AZA.A', 'BF.B', 'BHMSQ', 'BKB', 'BLY', 'BMGCA',
       'BOAT', 'CGP', 'CNG', 'COC.B', 'CPQ', 'CYM', 'CYR', 'DALRQ', 'DCNAQ',
       'DEC', 'DGN', 'DI', 'DWD', 'ECH', 'ECO', 'EDS', 'EFU', 'ENRNQ', 'EP',
       'FBO', 'FJ', 'FLMIQ', 'FPC', 'FTL.A', 'GAPTQ', 'GFS.A', 'GIDL', 'GPU',
       'GWF', 'GX', 'HBOC', 'HFS', 'HM', 'HPH', 'HRS', 'IMNX', 'INCLF', 'JOS',
       'KM', 'KWP', 'LDW.B', 'LLX', 'MCIC', 'MIL', 'MKG', 'MOB', 'MST', 'NAE',
       'NGH', 'NLV', 'NMK', 'NYN', 'OAT', 'OK', 'ORX', 'PEL', 'PHB', 'PNU',
       'PPW', 'PWJ', 'Q', 'RAL', 'RATL', 'RBD', 'RDS.A', 'RYC', 'SEG', 'SGID',
       'SHN', 'SK', 'STO', 'SUB', 'TCOMA', 'TDM', 'THY', 'TOS', 'UAWGQ', 'UCC',
       'UCM', 'UMG', 'UPR', 'USBC', 'USHC', 'USW', 'VAT', 'VO', 'WAI', 'WAMUQ',
       'WLA', 'WMX', 'YNR'],
      dtype='object')

In [210]:
df.loc[:,missing].dropna(how='all', axis=1).dropna(how='all')

Unnamed: 0,EP,MCIC,NMK,WAMUQ
1996-01-02,,,0.052632,
1996-01-03,,,-0.012500,
1996-01-04,,,0.000000,
1996-01-05,,,-0.050633,
1996-01-08,,,-0.013333,
...,...,...,...,...
2012-06-06,0.000000,,,
2012-06-07,0.000000,,,
2012-06-08,0.142857,,,
2012-06-11,0.000000,,,


In [6]:
mapping['EP']

'[<h2>SEC CIK 1066107</h2>, <h2>Ticker: EP</h2>]'
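
The `EP` entry shows a CIK without leading zeros, which a strict `\d{10}` pattern cannot match. As a sketch, a more lenient extraction could accept any run of up to ten digits after the "SEC CIK" label and pad it afterwards. The `extract_cik` helper is hypothetical, and it assumes the header text always follows the "SEC CIK <digits>" shape seen above.

```python
import re

# Accept 1 to 10 digits after the "SEC CIK" label seen in the scraped headers.
CIK_PATTERN = re.compile(r"SEC CIK\s+(\d{1,10})")

def extract_cik(raw):
    """Return the zero-padded 10-digit CIK found in `raw`, or None if absent."""
    match = CIK_PATTERN.search(str(raw))
    return match.group(1).zfill(10) if match else None

print(extract_cik('[<h2>SEC CIK 1066107</h2>, <h2>Ticker: EP</h2>]'))
```

Returning `None` for pages without a CIK header keeps the missing tickers easy to filter out.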

In [7]:
map_df.to_csv('2-new_mapping.csv')

Not much else we can do here!

In [18]:
#valid_tickers = matches.dropna(axis=0)

In [19]:
#valid_tickers.to_csv('1-valid_tickers.csv')

In [20]:
#CIKs = valid_tickers['CIK']

In [254]:
#CIKs = map_df[map_df[0] != '[]'][0]
CIKs = map_df[[len(x) == 10 for x in map_df[0]]]
len(CIKs)

1010

In [258]:
CIKs[0].values[0]

'0001090872'

# Download the Necessary Documents

In [168]:
# base URL for the SEC EDGAR browser
endpoint = r"https://www.sec.gov/cgi-bin/browse-edgar"

# define our parameters dictionary
param_dict = {'action':'getcompany',
              'CIK':'1265107',
              'type':'10-k',
              'dateb':'20190101',
              'owner':'exclude',
              'start':'',
              'output':'',
              'count':'100'}

# request the url, and then parse the response.
response = requests.get(url = endpoint, params = param_dict)
response.raise_for_status()  # raise on an HTTP error instead of reporting success
soup = BeautifulSoup(response.content, 'html.parser')

# Let the user know it was successful.
print('Request Successful')
print(response.url)

Request Successful
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=1265107&type=10-k&dateb=20190101&owner=exclude&start=&output=&count=100


In [169]:
# Used for the requests
heads = {#'Host': 'www.sec.gov', 
         #'Connection': 'close',
         #'Accept': 'application/json',#, text/javascript, */*; q=0.01', 
         'X-Requested-With': 'XMLHttpRequest',
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
         }

In [170]:
# requests and json are already imported at the top of the notebook
response = requests.get("https://data.sec.gov/submissions/CIK0000006201.json", headers=heads).json()

In [171]:
selection = pd.DataFrame(response['filings']['recent'])
tenks = selection[selection['form'] == '10-K']
tenks

Unnamed: 0,accessionNumber,filingDate,reportDate,acceptanceDateTime,act,form,fileNumber,filmNumber,items,size,isXBRL,isInlineXBRL,primaryDocument,primaryDocDescription
51,0000006201-21-000014,2021-02-17,2020-12-31,2021-02-17T17:17:57.000Z,34,10-K,001-08400,21646186,,43925703,1,1,aal-20201231.htm,10-K 2020 02.17.21
152,0000006201-20-000023,2020-02-19,2019-12-31,2020-02-19T07:31:30.000Z,34,10-K,001-08400,20627428,,30851334,1,1,a10k123119.htm,10-K 2019 02.19.20
227,0000006201-19-000009,2019-02-25,2018-12-31,2019-02-25T07:31:34.000Z,34,10-K,001-08400,19628071,,30572408,1,0,a10k123118.htm,10-K 2018 02.25.19
317,0000006201-18-000009,2018-02-21,2017-12-31,2018-02-21T08:02:40.000Z,34,10-K,001-08400,18627088,,27914491,1,0,a10k123117.htm,10-K
414,0001193125-17-051216,2017-02-22,2016-12-31,2017-02-22T08:01:43.000Z,34,10-K,001-08400,17627073,,24888480,1,0,d286458d10k.htm,FORM 10-K
540,0001193125-16-474605,2016-02-24,2015-12-31,2016-02-24T08:04:10.000Z,34,10-K,001-08400,161450518,,26170400,1,0,d78287d10k.htm,FORM 10-K
653,0001193125-15-061145,2015-02-25,2014-12-31,2015-02-25T08:02:34.000Z,34,10-K,001-08400,15645918,,39524925,1,0,d829913d10k.htm,FORM 10-K
752,0000006201-14-000004,2014-02-28,2013-12-31,2014-02-28T07:52:16.000Z,34,10-K,001-08400,14651496,,47888955,1,0,aagaa10k-20131231.htm,10-K


In [172]:
# Target: https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm

base_url = 'https://www.sec.gov/Archives/edgar/data/'
cik = response['cik']
accession_num = '0001193125-16-474605'.replace('-', '')
doc = 'd78287d10k.htm'

link = base_url + cik + '/' + accession_num + '/' + doc
link

'https://www.sec.gov/Archives/edgar/data/6201/000119312516474605/d78287d10k.htm'
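
The same URL pattern (base URL, CIK without leading zeros, accession number without dashes, primary document) can be captured in a small function. This is a sketch based only on the example above; `filing_url` is a hypothetical helper name.

```python
BASE_URL = "https://www.sec.gov/Archives/edgar/data/"

def filing_url(cik, accession_number, primary_document):
    """Build the archive URL for one filing document."""
    cik_part = str(int(cik))                            # "0000006201" -> "6201"
    accession_part = accession_number.replace("-", "")  # drop the dashes
    return f"{BASE_URL}{cik_part}/{accession_part}/{primary_document}"

print(filing_url("0000006201", "0001193125-16-474605", "d78287d10k.htm"))
```

Converting the CIK through `int` lets the function accept both the padded and the unpadded form.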

In [176]:
response = requests.get(link, headers=heads)

In [174]:
soup = BeautifulSoup(response.content, 'html.parser')

In [98]:
soup.text[:500]

'\n10-K\n1\nd78287d10k.htm\nFORM 10-K\n\n\nForm 10-K\n\n\nTable of Contents\n\xa0\n\xa0 UNITED STATES SECURITIES AND\nEXCHANGE COMMISSION  Washington, D.C. 20549 \n\xa0 \xa0\nFORM 10-K  \xa0\n\xa0 \xa0\n\n\nþ\n ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Fiscal Year Ended December\xa031, 2015 \n\xa0\n\n\n¨\n TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Transition Period From\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 to \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\n Commission file number 1-8'
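
The extracted text is full of non-breaking spaces (`\xa0`) and runs of blank lines. A minimal cleanup sketch, assuming only whitespace normalization is wanted (`clean_text` is a hypothetical helper):

```python
import re

def clean_text(raw):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    text = raw.replace("\xa0", " ")          # non-breaking space -> regular space
    text = re.sub(r"[ \t]+", " ", text)      # collapse repeated spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)   # collapse blank lines and trim around newlines
    return text.strip()

print(clean_text("FORM 10-K\n\n\n\xa0\n\xa0 UNITED STATES SECURITIES AND\nEXCHANGE COMMISSION  Washington"))
```

Other characters in the filing, such as the checkbox glyphs, are left untouched.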

In [177]:
# Build one link per 10-K. `response` now holds the document request from In [176],
# so reuse the `cik` string saved in In [172] instead of response['cik'].
links = pd.Series([base_url]*len(tenks))
links += cik + "/"
links += list(map(lambda s: s.replace('-' , ''), tenks['accessionNumber']))
links += ["/"]*len(tenks)
links += tenks['primaryDocument'].values
links[7]

'https://www.sec.gov/Archives/edgar/data/6201/000000620114000004/aagaa10k-20131231.htm'

In [178]:
tenks['accessionNumber'].values

array(['0000006201-21-000014', '0000006201-20-000023',
       '0000006201-19-000009', '0000006201-18-000009',
       '0001193125-17-051216', '0001193125-16-474605',
       '0001193125-15-061145', '0000006201-14-000004'], dtype=object)

In [179]:
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
tenks = tenks.copy()
tenks['Link'] = links.values
tenks

Unnamed: 0,accessionNumber,filingDate,reportDate,acceptanceDateTime,act,form,fileNumber,filmNumber,items,size,isXBRL,isInlineXBRL,primaryDocument,primaryDocDescription,Link
51,0000006201-21-000014,2021-02-17,2020-12-31,2021-02-17T17:17:57.000Z,34,10-K,001-08400,21646186,,43925703,1,1,aal-20201231.htm,10-K 2020 02.17.21,https://www.sec.gov/Archives/edgar/data/6201/000000620121000014/aal-20201231.htm
152,0000006201-20-000023,2020-02-19,2019-12-31,2020-02-19T07:31:30.000Z,34,10-K,001-08400,20627428,,30851334,1,1,a10k123119.htm,10-K 2019 02.19.20,https://www.sec.gov/Archives/edgar/data/6201/000000620120000023/a10k123119.htm
227,0000006201-19-000009,2019-02-25,2018-12-31,2019-02-25T07:31:34.000Z,34,10-K,001-08400,19628071,,30572408,1,0,a10k123118.htm,10-K 2018 02.25.19,https://www.sec.gov/Archives/edgar/data/6201/000000620119000009/a10k123118.htm
317,0000006201-18-000009,2018-02-21,2017-12-31,2018-02-21T08:02:40.000Z,34,10-K,001-08400,18627088,,27914491,1,0,a10k123117.htm,10-K,https://www.sec.gov/Archives/edgar/data/6201/000000620118000009/a10k123117.htm
414,0001193125-17-051216,2017-02-22,2016-12-31,2017-02-22T08:01:43.000Z,34,10-K,001-08400,17627073,,24888480,1,0,d286458d10k.htm,FORM 10-K,https://www.sec.gov/Archives/edgar/data/6201/000119312517051216/d286458d10k.htm
540,0001193125-16-474605,2016-02-24,2015-12-31,2016-02-24T08:04:10.000Z,34,10-K,001-08400,161450518,,26170400,1,0,d78287d10k.htm,FORM 10-K,https://www.sec.gov/Archives/edgar/data/6201/000119312516474605/d78287d10k.htm
653,0001193125-15-061145,2015-02-25,2014-12-31,2015-02-25T08:02:34.000Z,34,10-K,001-08400,15645918,,39524925,1,0,d829913d10k.htm,FORM 10-K,https://www.sec.gov/Archives/edgar/data/6201/000119312515061145/d829913d10k.htm
752,0000006201-14-000004,2014-02-28,2013-12-31,2014-02-28T07:52:16.000Z,34,10-K,001-08400,14651496,,47888955,1,0,aagaa10k-20131231.htm,10-K,https://www.sec.gov/Archives/edgar/data/6201/000000620114000004/aagaa10k-20131231.htm


In [267]:
#formatted_CIKs = [str(int(x)).zfill(10) for x in CIKs.values]
formatted_CIKs = CIKs[0].values#[x for x in CIKs.values if x != '[]']
formatted_CIKs

array(['0001090872', '0001011006', '0000006201', ..., '0000877212',
       '0000109380', '0001555280'], dtype=object)

In [268]:
len(formatted_CIKs)

1010

In [248]:
cik

array(['0001090872'], dtype=object)

In [263]:
link_dict = {}
for cik in formatted_CIKs:
    response = requests.get("https://data.sec.gov/submissions/CIK"+cik+".json", headers=heads).json()
    selection = pd.DataFrame(response['filings']['recent'])
    # Copy the filtered rows so adding a column does not trigger SettingWithCopyWarning
    tenks = selection[selection['form'] == '10-K'].copy()

    # base_url / CIK / accession number without dashes / primary document
    links = pd.Series([base_url]*len(tenks))
    links += response['cik'] + "/"
    links += list(map(lambda s: s.replace('-' , ''), tenks['accessionNumber']))
    links += ["/"]*len(tenks)
    links += tenks['primaryDocument'].values

    tenks['Link'] = links.values

    link_dict[cik] = tenks
    time.sleep(0.11)  # stay under the SEC's limit of 10 requests per second


In [264]:
link_dict['0001004434']['Link'].values[10]

'https://www.sec.gov/Archives/edgar/data/1004434/000104746911001624/a2202148z10-k.htm'

In [265]:
with open('2-link_dict.pickle', 'wb') as handle:
    pickle.dump(link_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [266]:
len(link_dict.keys())

985
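
The dictionary has 985 keys although 1010 CIKs were requested. Dictionary keys are unique, so tickers that share a CIK (AAL and AAMRQ both map to 0000006201 in the table above) collapse into a single entry, which likely accounts for the gap. A sketch for listing such shared CIKs, using a hypothetical `shared_ciks` helper and toy values taken from that table:

```python
from collections import Counter

def shared_ciks(ticker_to_cik):
    """Return {cik: [tickers]} for every CIK that more than one ticker maps to."""
    counts = Counter(ticker_to_cik.values())
    return {
        cik: sorted(t for t, c in ticker_to_cik.items() if c == cik)
        for cik, n in counts.items()
        if n > 1
    }

toy = {"AAL": "0000006201", "AAMRQ": "0000006201", "AAP": "0001158449"}
print(shared_ciks(toy))
```

In the notebook, the input would be the ticker-to-CIK dictionary behind `map_df`.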

# Validation

In [159]:
bads = [x for x in link_dict.keys() if link_dict[x].empty]
bads

['0001391461',
 '0001409741',
 '0001181249',
 '0001124074',
 '0001576340',
 '0000031225',
 '0001319183',
 '0001438893',
 '0000806085',
 '0001323206',
 '0000798738',
 '0000833203']

In [160]:
response = json.loads(requests.get("https://data.sec.gov/submissions/CIK"+bads[0]+".json", headers=heads).text)

In [161]:
selection = pd.DataFrame(response['filings']['recent'])

In [162]:
selection

Unnamed: 0,accessionNumber,filingDate,reportDate,acceptanceDateTime,act,form,fileNumber,filmNumber,items,size,isXBRL,isInlineXBRL,primaryDocument,primaryDocDescription
0,0000891804-19-000279,2019-07-31,2019-06-30,2019-07-31T09:24:40.000Z,40,N-PX,811-22022,19987539,,6585,0,0,gugagc-npx.txt,AGC
1,9999999997-18-009099,2018-11-21,,2018-11-21T14:41:50.000Z,40,N-8F ORDR,811-22022,181198004,,74228,0,0,filename1.pdf,
2,9999999997-18-008807,2018-10-29,,2018-10-29T08:50:24.000Z,40,N-8F NTC,811-22022,181142889,,148689,0,0,filename1.pdf,
3,0000891804-18-000451,2018-10-12,,2018-10-12T16:05:19.000Z,40,N-8F/A,811-22022,181120285,,62232,0,0,gug75184-n8fa.htm,AGC
4,0000891804-18-000442,2018-09-28,2018-07-31,2018-09-28T13:39:42.000Z,40,N-Q,811-22022,181093434,,691307,0,0,gug74683-nq.htm,AGC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381,0001341004-07-001216,2007-04-13,,2007-04-13T07:32:29.000Z,4033,N-2/A,"811-22022,333-140951",0776472107764720,,561307,0,0,adventn2a.txt,AMENDMENT NO. 1
382,0001341004-07-001210,2007-04-12,,2007-04-12T19:10:36.000Z,,CORRESP,,,,39470,0,0,filename1.txt,
383,0001341004-07-000847,2007-03-07,,2007-03-07T17:14:29.000Z,40,N-8A/A,811-22022,07678575,,17282,0,0,chi535797.htm,
384,0001341004-07-000769,2007-02-28,,2007-02-28T15:53:22.000Z,4033,N-2,"811-22022,333-140951",0765767307657674,,503474,0,0,n2.txt,FORM N-2


There may not be any 10-K forms among these filings.
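
An exact match on `'10-K'` also skips variants such as 10-K/A or 10-K405. Whether any of the CIKs above filed such variants is not shown here, so the following is only a sketch of how the form types could be tallied; `annual_report_forms` and the toy frame are hypothetical.

```python
import pandas as pd

def annual_report_forms(selection, prefix="10-K"):
    """Count the form types in `selection` whose name starts with `prefix`."""
    forms = selection["form"].astype(str)
    return forms[forms.str.startswith(prefix)].value_counts()

# Toy frame mixing fund filings (as in the output above) with annual-report variants.
toy_selection = pd.DataFrame({"form": ["N-PX", "N-Q", "10-K", "10-K/A", "10-K", "10-K405"]})
print(annual_report_forms(toy_selection))
```

An empty result for a CIK would confirm that it has no annual reports of any variant among its recent filings.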

In [2]:
sum(selection["form"] == '10-K')

NameError: name 'selection' is not defined

In [26]:
response = requests.get("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&output=xml&CIK=AAPL",
                       headers=heads).text

In [27]:
import xml.etree.ElementTree as ET
root = ET.fromstring(response)

In [33]:
root[0][0].text

'0000320193'

In [36]:
tickers = df.columns
tickers

Index(['A', 'AABA', 'AAL', 'AAMRQ', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABI',
       'ABKFQ',
       ...
       'XRX', 'XTO', 'XYL', 'YNR', 'YRCW', 'YUM', 'ZBH', 'ZBRA', 'ZION',
       'ZTS'],
      dtype='object', length=1110)

In [40]:
mapping = {}
for ticker in tickers[:10]:
    response = requests.get("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&output=xml&CIK="+ticker,
                       headers=heads).text
    try:
        root = ET.fromstring(response)
        mapping[ticker] = root[0][0].text
    except ET.ParseError:
        # The response for this ticker was not well-formed XML; record the gap and move on
        mapping[ticker] = None