# Download Filing Dates for Each Entry

In this notebook, we scrape the data needed to map each filing to the date on which it was filed. I will be using the SEC EDGAR API (which I may also use in the future to scrape the necessary data instead of relying on the sec-edgar-downloader package).

You can find out more about how the SEC API works here: https://www.sec.gov/edgar/sec-api-documentation
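
The submissions endpoint used in this notebook takes the CIK zero-padded to 10 digits, which is why a folder name such as `0000100517` can be dropped straight into the URL. As a minimal sketch, the hypothetical helper `submissions_url` (not part of any library) builds the same URL from either a padded string or a plain integer:

```python
def submissions_url(cik):
    """Build the data.sec.gov submissions URL for a CIK (hypothetical helper)."""
    # Accept ints or strings; pad with leading zeros to 10 digits.
    padded = str(cik).strip().zfill(10)
    if not padded.isdigit() or len(padded) != 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return f"https://data.sec.gov/submissions/CIK{padded}.json"


print(submissions_url(100517))
print(submissions_url("0000100517"))
```

Both calls print the same URL, so the helper is safe to use whether the CIK comes from a directory name or from a numeric column.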

In [157]:
import numpy as np
import pandas as pd
import requests
import json
import os
import urllib.request
import time
import pickle


# Used for the requests
heads = {#'Host': 'www.sec.gov', 
         #'Connection': 'close',
         #'Accept': 'application/json',#, text/javascript, */*; q=0.01', 
         #'X-Requested-With': 'XMLHttpRequest',
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
         }
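
The SEC's fair-access guidance asks automated clients to send a User-Agent that identifies the requester, rather than a browser string like the one above. A hedged sketch of what such headers might look like; the name and email address are placeholders, not values from this project:

```python
# Placeholder identity: replace with your own name/organisation and contact email.
CONTACT = "Sample Company Name admin@example.com"

# Alternative to `heads`; pass as headers=declared_heads in requests.get(...)
declared_heads = {
    "User-Agent": CONTACT,
    "Accept-Encoding": "gzip, deflate",
}

print(declared_heads)
```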

In [124]:
sample = os.listdir('sec-edgar-filings')[0]
sample

'0000100517'

In [125]:
url = "https://data.sec.gov/submissions/CIK"+sample+".json"
r = requests.get(url, headers=heads)

In [126]:
r.headers

{'Content-Type': 'application/json', 'x-amzn-RequestId': 'da81f39f-326d-462a-bf7f-a003eaaf89df', 'Access-Control-Allow-Origin': '*', 'x-amz-apigw-id': 'DZPxkFLooAMFcCQ=', 'X-Amzn-Trace-Id': 'Root=1-6106c670-5c3a8319407633fd4e67d885', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Expires': 'Sun, 01 Aug 2021 16:06:08 GMT', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Date': 'Sun, 01 Aug 2021 16:06:08 GMT', 'Content-Length': '26072', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000 ; preload', 'Set-Cookie': 'ak_bmsc=C03D193DEB5CD28B32227180565622C7~000000000000000000000000000000~YAAQr5cwF4We3/l6AQAAlCd3Agxq4Gp7K0jP6+i9L1puRUHJwOdg/sZvxxtLxGiCWrelV7trjTlVR6FLx6Fw9n2mc7qL16NkyRpJtJjZpis07JRrpEKAo6ocJBmfdntPcwdEMPBtjO45ZJZk2QkBTHDNUq8IZBqNCjA0116h+KNsKt6idblksY9yyYA8A0xNjAShWcEXEAweNwNxb4GYu3ELxQWc4HJ/FZjzq1WEsdGbVVWrNQVCNGnfALbGflO4Nt6VStOLxJon8i2ACddKy7V+6KGkhTT4yzrWCg1tmUUmwofD0+weDd+jVXu02bTUcmIc88yEaFqcmF+OJnxXjDshgkEo6Id1p

In [127]:
r.text

'{"cik":"100517","entityType":"operating","sic":"4512","sicDescription":"Air Transportation, Scheduled","insiderTransactionForOwnerExists":1,"insiderTransactionForIssuerExists":1,"name":"United Airlines Holdings, Inc.","tickers":["UAL"],"exchanges":["Nasdaq"],"ein":"362675207","description":"","website":"","investorWebsite":"","category":"Large accelerated filer","fiscalYearEnd":"1231","stateOfIncorporation":"DE","stateOfIncorporationDescription":"DE","addresses":{"mailing":{"street1":"E. ANNA HA - WHQLD","street2":"233 SOUTH WACKER DRIVE","city":"CHICAGO","stateOrCountry":"IL","zipCode":"60606","stateOrCountryDescription":"IL"},"business":{"street1":"E. ANNA HA - WHQLD","street2":"233 SOUTH WACKER DRIVE","city":"CHICAGO","stateOrCountry":"IL","zipCode":"60606","stateOrCountryDescription":"IL"}},"phone":"872-825-4000","flags":"","formerNames":[{"name":"United Continental Holdings, Inc.","from":"2010-07-22T00:00:00.000Z","to":"2019-06-27T00:00:00.000Z"},{"name":"UAL CORP /DE/","from":"1

In [128]:
info = json.loads(r.text)

In [129]:
info

{'cik': '100517',
 'entityType': 'operating',
 'sic': '4512',
 'sicDescription': 'Air Transportation, Scheduled',
 'insiderTransactionForOwnerExists': 1,
 'insiderTransactionForIssuerExists': 1,
 'name': 'United Airlines Holdings, Inc.',
 'tickers': ['UAL'],
 'exchanges': ['Nasdaq'],
 'ein': '362675207',
 'description': '',
 'website': '',
 'investorWebsite': '',
 'category': 'Large accelerated filer',
 'fiscalYearEnd': '1231',
 'stateOfIncorporation': 'DE',
 'stateOfIncorporationDescription': 'DE',
 'addresses': {'mailing': {'street1': 'E. ANNA HA - WHQLD',
   'street2': '233 SOUTH WACKER DRIVE',
   'city': 'CHICAGO',
   'stateOrCountry': 'IL',
   'zipCode': '60606',
   'stateOrCountryDescription': 'IL'},
  'business': {'street1': 'E. ANNA HA - WHQLD',
   'street2': '233 SOUTH WACKER DRIVE',
   'city': 'CHICAGO',
   'stateOrCountry': 'IL',
   'zipCode': '60606',
   'stateOrCountryDescription': 'IL'}},
 'phone': '872-825-4000',
 'flags': '',
 'formerNames': [{'name': 'United Continenta

# Inspect the Sample

Now we will look for a mapping from each filing (its accession number) to its filing date.

In [130]:
info.keys()

dict_keys(['cik', 'entityType', 'sic', 'sicDescription', 'insiderTransactionForOwnerExists', 'insiderTransactionForIssuerExists', 'name', 'tickers', 'exchanges', 'ein', 'description', 'website', 'investorWebsite', 'category', 'fiscalYearEnd', 'stateOfIncorporation', 'stateOfIncorporationDescription', 'addresses', 'phone', 'flags', 'formerNames', 'filings'])

In [131]:
info['tickers']

['UAL']

In [132]:
info['filings'].keys()

dict_keys(['recent', 'files'])

In [133]:
info['filings']['recent'].keys()

dict_keys(['accessionNumber', 'filingDate', 'reportDate', 'acceptanceDateTime', 'act', 'form', 'fileNumber', 'filmNumber', 'items', 'size', 'isXBRL', 'isInlineXBRL', 'primaryDocument', 'primaryDocDescription'])

In [134]:
info['filings']['recent']['accessionNumber'][:20]

['0000100517-21-000055',
 '0000100517-21-000049',
 '0000899243-21-028679',
 '0000100517-21-000043',
 '0000899243-21-027279',
 '0000899243-21-027277',
 '0000899243-21-027278',
 '0001104659-21-087695',
 '0001104659-21-087692',
 '0000100517-21-000038',
 '0001104659-21-086312',
 '0000100517-21-000034',
 '0000899243-21-024212',
 '0000899243-21-024211',
 '0000899243-21-022336',
 '0000899243-21-021386',
 '0000899243-21-021385',
 '0000899243-21-021384',
 '0000899243-21-021383',
 '0000899243-21-021381']

In [135]:
info['filings']['recent']['filingDate'][:20]

['2021-07-22',
 '2021-07-20',
 '2021-07-16',
 '2021-07-09',
 '2021-07-02',
 '2021-07-02',
 '2021-07-02',
 '2021-06-30',
 '2021-06-30',
 '2021-06-29',
 '2021-06-28',
 '2021-06-28',
 '2021-06-16',
 '2021-06-16',
 '2021-06-07',
 '2021-06-01',
 '2021-06-01',
 '2021-06-01',
 '2021-06-01',
 '2021-06-01']

In [136]:
# 'recent' holds parallel lists, so each becomes one column
f2d = pd.DataFrame({'file': info['filings']['recent']['accessionNumber'],
                    'date': info['filings']['recent']['filingDate']})

In [137]:
f2d

Unnamed: 0,file,date
0,0000100517-21-000055,2021-07-22
1,0000100517-21-000049,2021-07-20
2,0000899243-21-028679,2021-07-16
3,0000100517-21-000043,2021-07-09
4,0000899243-21-027279,2021-07-02
...,...,...
995,0001181431-13-034985,2013-06-14
996,0001181431-13-034984,2013-06-14
997,0001181431-13-034983,2013-06-14
998,0001181431-13-034982,2013-06-14
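
The `recent` block is columnar: each key holds a list, and position i across the lists describes the same filing. A minimal sketch of the same mapping as a plain dictionary, using the first five values shown above (the variable names are illustrative):

```python
# First five entries copied from the output above.
recent = {
    "accessionNumber": [
        "0000100517-21-000055",
        "0000100517-21-000049",
        "0000899243-21-028679",
        "0000100517-21-000043",
        "0000899243-21-027279",
    ],
    "filingDate": [
        "2021-07-22",
        "2021-07-20",
        "2021-07-16",
        "2021-07-09",
        "2021-07-02",
    ],
}

# The lists are parallel, so zip pairs each accession number with its date.
date_by_accession = dict(zip(recent["accessionNumber"], recent["filingDate"]))

print(date_by_accession["0000100517-21-000043"])  # 2021-07-09
```

A dictionary gives constant-time lookups by accession number; the DataFrame version is more convenient when the other columns (form type, report date) are needed as well.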


# Map to Filing Dates

In [138]:
files = os.listdir('sec-edgar-filings/'+sample+'/10-K')
files

['0001193125-18-054235',
 '0000100517-04-000007',
 '0000950137-09-001460',
 '0001193125-14-060695',
 '.DS_Store',
 '0001104659-07-019919',
 '0001193125-12-073010',
 '0001193125-16-468479',
 '0000100517-21-000016',
 '0001193125-17-054129',
 '0001193125-13-074391',
 '0000100517-05-000006',
 '0001047469-08-001951',
 '0000100517-01-500026',
 '0000100517-19-000009',
 '0001193125-15-056493',
 '0000100517-06-000021',
 '0000100517-03-000007',
 '0000100517-20-000010',
 '0001193125-11-042335',
 '0001193125-10-041523']

In [139]:
files = [x for x in files if x != '.DS_Store']
files

['0001193125-18-054235',
 '0000100517-04-000007',
 '0000950137-09-001460',
 '0001193125-14-060695',
 '0001104659-07-019919',
 '0001193125-12-073010',
 '0001193125-16-468479',
 '0000100517-21-000016',
 '0001193125-17-054129',
 '0001193125-13-074391',
 '0000100517-05-000006',
 '0001047469-08-001951',
 '0000100517-01-500026',
 '0000100517-19-000009',
 '0001193125-15-056493',
 '0000100517-06-000021',
 '0000100517-03-000007',
 '0000100517-20-000010',
 '0001193125-11-042335',
 '0001193125-10-041523']

In [140]:
sample1 = files[0]

In [141]:
f2d[f2d['file'] == sample1]

Unnamed: 0,file,date
411,0001193125-18-054235,2018-02-22
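
The single lookup above extends naturally to every folder in `files`. One caveat, stated as an assumption: the `recent` table shown earlier only reaches back to about June 2013, so older accession numbers in the folder list will likely not be found there. The `files` key of the response appears to point at additional JSON pages for older filings, which this notebook does not fetch. The sketch below (hypothetical function name, toy data apart from the one pair shown above) separates matched from unmatched folders:

```python
def split_by_match(folders, date_by_accession):
    """Return (matched, missing): dates for known folders, and the rest."""
    matched = {f: date_by_accession[f] for f in folders if f in date_by_accession}
    missing = [f for f in folders if f not in date_by_accession]
    return matched, missing


# One real pair from the lookup above; a real run would use the full table.
date_by_accession = {"0001193125-18-054235": "2018-02-22"}
folders = ["0001193125-18-054235", "0000100517-04-000007", ".DS_Store"]

matched, missing = split_by_match(folders, date_by_accession)
print(matched)
print(missing)
```

Reporting the unmatched names explicitly makes it obvious when a filing silently drops out of the mapping.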


# Build Out

Now we will extend this to every CIK in the directory.

In [149]:
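filing_dates = {}   # CIK -> DataFrame of (file, date)
json_dict = {}      # CIK -> full JSON response, kept for later inspection

for CIK in os.listdir('sec-edgar-filings'):
    # Skip directory entries that are not CIK folders
    if (CIK == '.DS_Store') or (CIK == 'AAL'):
        continue

    url = "https://data.sec.gov/submissions/CIK"+CIK+".json"
    r = requests.get(url, headers=heads)
    r.raise_for_status()  # fail loudly instead of parsing an error page

    # Sleep for 20 seconds between requests for good measure
    time.sleep(20)

    info = json.loads(r.text)

    json_dict[CIK] = info
    # 'recent' holds parallel lists; pair accession numbers with filing dates
    filing_dates[CIK] = pd.DataFrame({'file': info['filings']['recent']['accessionNumber'],
                                      'date': info['filings']['recent']['filingDate']})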
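
The loop skips `.DS_Store` and `AAL` by name. A slightly more general filter, sketched here under the assumption that every valid folder is a 10-digit CIK string, keeps only names of that shape; `is_cik_dir` is a hypothetical helper:

```python
def is_cik_dir(name):
    """True if a directory name looks like a zero-padded 10-digit CIK."""
    return len(name) == 10 and name.isdigit()


entries = ["0000100517", ".DS_Store", "AAL", "0001351548"]
ciks = [e for e in entries if is_cik_dir(e)]
print(ciks)  # ['0000100517', '0001351548']
```

This avoids having to extend the skip list each time a stray file or ticker-named folder shows up in the directory.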

In [150]:
filing_dates.keys()

dict_keys(['0000100517', '0001351548', '0000101001', '0001405419', '0001614436', '0001144331', '0001166291', '0000921929', '0000899394', '0001159154', '0001498710', '0000714560', '0000869187', '0000319687', '0000027904', '0000006201', '0001050715', '0000904020', '0000706270', '0000810332', '0001029863', '0000948845', '0001172222', '0000766421', '0001058033', '0000835768', '0000793733', '0000092380', '0000004515', '0001088734', '0001158463', '0001362468', '0001011696', '0000003202', '0000948846', '0000914397', '0001000578', '0000701345', '0000046205'])

In [155]:
filing_dates['0000100517'].head()

Unnamed: 0,file,date
0,0000100517-21-000055,2021-07-22
1,0000100517-21-000049,2021-07-20
2,0000899243-21-028679,2021-07-16
3,0000100517-21-000043,2021-07-09
4,0000899243-21-027279,2021-07-02


In [158]:
# Export the results
with open('filing_dates.pickle', 'wb') as handle:
    pickle.dump(filing_dates, handle, protocol=pickle.HIGHEST_PROTOCOL)
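
To read the export back in a later notebook, the file is opened in binary mode and passed to `pickle.load`. A minimal round-trip sketch with a toy dictionary standing in for `filing_dates` (the real file holds DataFrames, so pandas must be installed when loading it):

```python
import os
import pickle
import tempfile

# Toy stand-in for filing_dates: one CIK with two (file, date) rows.
toy = {"0000100517": [("0000100517-21-000055", "2021-07-22"),
                      ("0000100517-21-000049", "2021-07-20")]}

# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "filing_dates.pickle")
with open(path, "wb") as handle:
    pickle.dump(toy, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == toy)  # True
```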

# Look for Tickers? 

We were having problems getting a full and accurate ticker list from the earlier sources. Perhaps the tickers field in these responses will let us find more than was previously available?

In [154]:
[json_dict[x]['tickers'] for x in json_dict.keys()]

[['UAL'],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 ['HRBR'],
 [],
 ['SAVE'],
 [],
 [],
 [],
 ['DAL'],
 ['AAL'],
 [],
 [],
 [],
 ['MESA'],
 [],
 [],
 ['HA'],
 ['ALK'],
 [],
 [],
 ['SKYW'],
 ['LUV'],
 [],
 [],
 ['JBLU'],
 ['ALGT'],
 [],
 [],
 [],
 [],
 [],
 [],
 []]

I take it back: only 12 of the 39 CIKs come back with a ticker, so the SEC does not seem to have kept track of company tickers reliably either(?)
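
To see which companies are affected rather than just eyeballing the list, the CIKs can be split by whether their `tickers` list is empty. A small sketch with a toy stand-in for `json_dict`; the three entries pair CIKs with tickers by position in the outputs above, which assumes both listings follow the same order:

```python
# Toy stand-in for json_dict, reduced to the 'tickers' field.
json_dict = {
    "0000100517": {"tickers": ["UAL"]},
    "0001351548": {"tickers": []},
    "0000899394": {"tickers": ["HRBR"]},
}

# An empty list is falsy, so this separates CIKs with and without tickers.
with_ticker = {cik: d["tickers"] for cik, d in json_dict.items() if d.get("tickers")}
without_ticker = [cik for cik, d in json_dict.items() if not d.get("tickers")]

print(f"{len(with_ticker)} of {len(json_dict)} CIKs have a ticker")
print(without_ticker)
```

Run against the full `json_dict`, the `without_ticker` list names the CIKs whose tickers would have to come from another source.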