# Parsing All Airline Filings

In this notebook, I extend the work done in AAL_parse.ipynb to every company in the folder that submitted a filing to the SEC. To check that the methodology is robust, I first look at the first 5 companies in the folder and perform some validation.

In [201]:
import numpy as np
import pandas as pd
import os
import re
import html2text
import pickle

## Subsample

First, let us take a look at the first 5 companies in the folder:

In [202]:
companies = os.listdir("sec-edgar-filings/")
companies[:5]

['0000100517', '0001351548', '0000101001', 'AAL', '0001405419']

For each of these companies, I would like to parse each filing into its items, save the results in a dictionary, and pickle the dictionaries. I also want to record the character length of each section to check that the algorithm is pulling the information correctly (I assume the lengths of a given item should be roughly similar from year to year).

In [243]:
searchstr = '(1(.|a|b|.a.|.b.)|'
for i in range(2,16):
    searchstr+=str(i)+'(.|a|b|.a.|.b.)|'

searchstr+='16(.|a|b|.a.|.b.))'
searchstr

'(1(.|a|b|.a.|.b.)|2(.|a|b|.a.|.b.)|3(.|a|b|.a.|.b.)|4(.|a|b|.a.|.b.)|5(.|a|b|.a.|.b.)|6(.|a|b|.a.|.b.)|7(.|a|b|.a.|.b.)|8(.|a|b|.a.|.b.)|9(.|a|b|.a.|.b.)|10(.|a|b|.a.|.b.)|11(.|a|b|.a.|.b.)|12(.|a|b|.a.|.b.)|13(.|a|b|.a.|.b.)|14(.|a|b|.a.|.b.)|15(.|a|b|.a.|.b.)|16(.|a|b|.a.|.b.))'
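One thing to keep in mind about the pattern above: the `.` is unescaped, so it matches any character, not just a literal period. That is why a later cell in this notebook shows a match such as `'Item\n40'`. A tighter variant might look like the sketch below. The names `ITEM_NUMBER` and `item_heading` are my own, and the sample string is made up for illustration.

```python
import re

# Hypothetical tighter variant of the item-number pattern:
# a number from 1 to 16 (not followed by another digit),
# an optional a/b suffix, and an optional literal period.
ITEM_NUMBER = r'(?:1[0-6]|[1-9])(?!\d)[ab]?\.?'
item_heading = re.compile(r'>ITEM(?:\s|&#160;|&nbsp;)*' + ITEM_NUMBER, re.IGNORECASE)

# Made-up sample: two real headings and one false positive candidate ("Item\n40")
sample = '<b>Item&#160;7A.</b> <p>Item\n40</p> <td>ITEM 10.</td>'
found = [m.group() for m in item_heading.finditer(sample)]
print(found)
```

The negative lookahead `(?!\d)` is what stops "Item 40" from being read as "Item 4". I have not rerun the full parse with this pattern, so the outputs in this notebook still come from the original one.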

In [310]:
parsed = {}
lens = {}
positions = {}

h = html2text.HTML2Text()

for company in companies:
    if company == '.DS_Store':
        continue
        
    pulls = os.listdir("sec-edgar-filings/"+company+"/10-K")
    
    parsed[company] = {}
    positions[company] = {}
    
    a = pd.DataFrame()
    
    
    for year in pulls:

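        # Prefer the HTML filing; fall back to the plain-text version if it is missing
        filing_base = "sec-edgar-filings/"+company+"/10-K/"+year+"/filing-details"
        try:
            f = open(filing_base+".html", "r")
        except FileNotFoundError:
            f = open(filing_base+".txt", "r")
        with f:  # close the file once it has been read
            raw_10k = f.read()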
        
        
        # Raw string so that \/ and \s reach the regex engine unchanged
        matches = re.finditer(r"\/*>ITEM(\s*|&#160;*|&nbsp;*)"+searchstr, raw_10k, re.IGNORECASE)
        #matches = re.finditer(">ITEM(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>ITEM(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>ITEM(\s|&#160;|&nbsp;)"+searchstr+'\.{0,1}', raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>Item(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        locations = [x for x in matches]


        # Create the dataframe
        try:
            test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in locations])

            test_df.columns = ['item', 'start', 'end']
            test_df['item'] = test_df.item.str.lower()
        
        except:
            continue

        # Get rid of unnecessary characters from the dataframe
        test_df.replace('&#160;',' ',regex=True,inplace=True)
        test_df.replace('&nbsp;',' ',regex=True,inplace=True)
        test_df.replace(' ','',regex=True,inplace=True)
        test_df.replace('\.','',regex=True,inplace=True)
        test_df.replace('>','',regex=True,inplace=True)
        #test_df.replace('\(','',regex=True,inplace=True)
        #test_df.replace('\)','',regex=True,inplace=True)
        test_df.replace('\n','',regex=True,inplace=True)
        test_df.replace('\,','',regex=True,inplace=True)
        test_df.replace('\_','',regex=True,inplace=True)
        test_df.replace('\xa0','',regex=True,inplace=True)

        # Drop duplicates, keep last
        pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
        #pos_dat['length'] = len(raw_10k)

        # Set item as the dataframe index
        pos_dat.set_index('item', inplace=True)
        #lens[company][year] = pos_dat

        #starts = [0]+[locations[i].start() for i in range(len(locations))] + [len(raw_10k)]

        sections = {}

        #for i in range(len(starts)-1):
        #    temp = raw_10k[starts[i]:starts[i+1]]
            #tempsoup = BeautifulSoup(temp, 'html.parser')
            #goods = tempsoup.find_all('font')
            #sections[i] = [x.text for x in goods]
        #    sections[i] = h.handle(temp)

        for i in range(len(pos_dat.index)-1):
            # Use .iloc because pos_dat is indexed by item name, not by position
            temp = raw_10k[pos_dat['start'].iloc[i]:pos_dat['start'].iloc[i+1]]
            #sections[pos_dat['item'][i]] = h.handle(temp)
            sections[pos_dat.index[i]] = h.handle(temp)


        positions[company][year[11:13]] = pos_dat
        parsed[company][year[11:13]] = sections
        b = pd.DataFrame([len(x) for x in sections.values()], index = sections.keys(), columns=[year])
        a = pd.concat([a,b], axis=1, join='outer')
    
    lens[company] = a
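
The core of the loop above (keep the last match of each item, then slice the raw text between consecutive start positions) is easier to see on a toy string. The following is only a sketch: `toy` is a made-up document, the regex is simplified, and unlike the loop above it also keeps the final section by appending the document length to the list of starts.

```python
import re
import pandas as pd

# Made-up document: a table of contents followed by the real sections
toy = ">Item 1. >Item 2. (table of contents) >Item 1. Business text here >Item 2. Properties text"

# Normalise each match to a key such as 'item1' and record where it starts
rows = [(m.group().lower().replace('>', '').replace(' ', '').replace('.', ''), m.start())
        for m in re.finditer(r">ITEM\s*\d+\.", toy, re.IGNORECASE)]
df = pd.DataFrame(rows, columns=['item', 'start'])

# Keep only the last occurrence of each item, as in the loop above
last = df.sort_values('start').drop_duplicates(subset=['item'], keep='last').set_index('item')

# Slice between consecutive starts; the extra end point keeps the final section
starts = list(last['start']) + [len(toy)]
toy_sections = {name: toy[starts[i]:starts[i+1]] for i, name in enumerate(last.index)}
print(toy_sections)
```

Here the "keep last" rule works because the table of contents comes first. It would pick the wrong match if an item were referenced again after its real heading.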

In [311]:
parsed.keys()

dict_keys(['0000100517', '0001351548', '0000101001', 'AAL', '0001405419', '0001614436', '0001144331', '0001166291', '0000921929', '0000899394', '0001159154', '0001498710', '0000714560', '0000869187', '0000319687', '0000027904', '0000006201', '0001050715', '0000904020', '0000706270', '0000810332', '0001029863', '0000948845', '0001172222', '0000766421', '0001058033', '0000835768', '0000793733', '0000092380', '0000004515', '0001088734', '0001158463', '0001362468', '0001011696', '0000003202', '0000948846', '0000914397', '0001000578', '0000701345', '0000046205'])

In [313]:
parsed['AAL']['08']['item7']

'>ITEM 7. MANAGEMENT\'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND\nRESULTS OF OPERATIONS  \nForward-Looking Information  \nThe discussions under Business, Risk Factors, Properties and Legal Proceedings\nand the following discussions under Management\'s Discussion and Analysis of\nFinancial Condition and Results of Operations and Quantitative and Qualitative\nDisclosures about Market Risk contain various forward-looking statements\nwithin the meaning of Section 27A of the Securities Act of 1933, as amended,\nand Section 21E of the Securities Exchange Act of 1934, as amended, which\nrepresent the Company\'s expectations or beliefs concerning future events. When\nused in this document and in documents incorporated herein by reference, the\nwords "expects," "plans," "anticipates," "indicates," "believes," "forecast,"\n"guidance," "outlook," "may," "will," "should," and similar expressions are\nintended to identify forward-looking statements. Forward-looking statements\ninclude, wi

In [314]:
# Export the results
with open('parsed.pickle', 'wb') as handle:
    pickle.dump(parsed, handle, protocol=pickle.HIGHEST_PROTOCOL)
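
To read the results back in a later session, something like the following should work. This is a minimal round-trip sketch: `toy_parsed` is a stand-in for `parsed`, and the file is written to a temporary directory so that the real `parsed.pickle` is not touched.

```python
import os
import pickle
import tempfile

# Stand-in for `parsed`, with the same company -> year -> item nesting
toy_parsed = {'AAL': {'08': {'item7': 'example text'}}}
path = os.path.join(tempfile.mkdtemp(), 'parsed.pickle')

# Write the dictionary the same way as the export cell above
with open(path, 'wb') as handle:
    pickle.dump(toy_parsed, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back; pickle.load restores the nested dictionaries
with open(path, 'rb') as handle:
    reloaded = pickle.load(handle)

print(reloaded['AAL']['08'].keys())
```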

In [270]:
matches = re.finditer("ITEM(\s*|&#160;*|&nbsp;*)"+searchstr, raw_10k, re.IGNORECASE)
[x for x in matches]

[<re.Match object; span=(2309, 2316), match='Item\n40'>,
 <re.Match object; span=(3375, 3382), match='Item 1.'>,
 <re.Match object; span=(4267, 4274), match='Item 2.'>,
 <re.Match object; span=(4511, 4518), match='Item 3.'>,
 <re.Match object; span=(4755, 4762), match='Item 4.'>,
 <re.Match object; span=(4920, 4927), match='Item 5.'>,
 <re.Match object; span=(5075, 5082), match='Item 6.'>,
 <re.Match object; span=(5157, 5164), match='Item 7.'>,
 <re.Match object; span=(6039, 6046), match='Item 7A'>,
 <re.Match object; span=(6121, 6128), match='Item 8.'>,
 <re.Match object; span=(6203, 6210), match='Item 9.'>,
 <re.Match object; span=(6356, 6363), match='Item 9A'>,
 <re.Match object; span=(6571, 6578), match='Item 10'>,
 <re.Match object; span=(6653, 6660), match='Item 11'>,
 <re.Match object; span=(6735, 6742), match='Item 12'>,
 <re.Match object; span=(6892, 6899), match='Item 13'>,
 <re.Match object; span=(6974, 6981), match='Item 14'>,
 <re.Match object; span=(7141, 7148), match='It

In [63]:
lens[list(lens.keys())[0]]

Unnamed: 0,0001193125-18-054235,0000100517-04-000007,0000950137-09-001460,0001193125-14-060695,0001104659-07-019919,0001193125-12-073010,0001193125-16-468479,0000100517-21-000016,0001193125-17-054129,0001193125-13-074391,0000100517-05-000006,0001047469-08-001951,0000100517-01-500026,0000100517-19-000009,0001193125-15-056493,0000100517-06-000021,0000100517-03-000007,0000100517-20-000010,0001193125-11-042335,0001193125-10-041523
item4,127,144.0,2440.0,127.0,4666.0,126.0,101.0,60,127,126.0,144.0,75.0,2582.0,72,127.0,1126.0,197.0,72,89.0,151.0
item1,5442,2414.0,50.0,8715.0,75.0,19086.0,3004.0,1026,2893,29237.0,2443.0,21.0,13063.0,180,18244.0,20887.0,22071.0,991,4190.0,47.0
item1a,151,,365.0,197.0,334.0,163.0,161.0,176,152,163.0,,26.0,,192,197.0,332.0,,192,350.0,312.0
item1b,106,,115.0,106.0,72.0,106.0,81.0,44,106,106.0,,39.0,,91,106.0,57.0,,91,51.0,54.0
item2,34577,2810.0,7242.0,5760.0,4606.0,7214.0,36476.0,38829,37852,7626.0,3043.0,23.0,4013.0,21866,6296.0,3253.0,52935.0,35057,5627.0,38.0
item3,3442,4976.0,9385.0,4474.0,10460.0,12507.0,4596.0,3245,4369,11328.0,5769.0,30.0,3117.0,3427,2887.0,7372.0,6568.0,3359,18088.0,12976.0
item5,28,1647.0,2876.0,28.0,5618.0,28.0,28.0,28,28,28.0,1499.0,121.0,1455.0,35,28.0,2555.0,2645.0,35,4078.0,2697.0
item6,394,2188.0,7554.0,74618.0,5202.0,83477.0,65369.0,89,71198,74162.0,3027.0,36.0,2647.0,4764,60181.0,3136.0,2051.0,269,82174.0,8026.0
item7,4104,1823.0,23202.0,8546.0,9233.0,7803.0,1467.0,2741,4453,9167.0,1528.0,71.0,11589.0,3572,9195.0,38.0,6646.0,2737,9960.0,13052.0
item8,1102,905.0,967.0,526.0,1112.0,9763.0,526.0,695,858,9532.0,622.0,808.0,901.0,632,526.0,638.0,905.0,645,10163.0,643.0
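
One way to turn the "lengths should be somewhat alike" assumption into a check is to compare each section length against the median for that item across filings. The sketch below uses three columns copied from the table above. The 10% cut-off is an arbitrary assumption of mine, not something I have tuned.

```python
import pandas as pd

# Three items and three filings copied from the lengths table above
toy_lens = pd.DataFrame(
    {'0001193125-18-054235': [5442, 34577, 4104],
     '0000100517-04-000007': [2414, 2810, 1823],
     '0000950137-09-001460': [50, 7242, 23202]},
    index=['item1', 'item2', 'item7'],
)

# Median length of each item across filings
median_len = toy_lens.median(axis=1)

# Flag sections shorter than 10% of their item's median (arbitrary threshold)
suspicious = toy_lens.lt(0.1 * median_len, axis=0)
flagged = [(item, filing)
           for item in suspicious.index
           for filing in suspicious.columns
           if suspicious.loc[item, filing]]
print(flagged)
```

On this subset, the only flagged entry is item1 in filing 0000950137-09-001460, whose 50 characters look more like a table-of-contents line than a Business section.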


In [64]:
raw_10k[:1000]

'<html><body><document>\n<type>10-K\n<sequence>1\n<filename>gulfstream10k.htm\n<description>ANNUAL REPORT\n<text>\n<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">\n\n<title>Gulfstream - 10-K</title>\n<meta content="C054099" name="keywords"/>\n<meta content="EWH" name="author"/>\n<meta content="04/10/2008" name="date"/>\n<p align="center" style="margin-top:2px; margin-bottom:0px"><br/></p>\n<p align="right" style="margin-top:0px; margin-bottom:2.2px; padding-bottom:4px; border-bottom:4px solid #000000">&#160;</p>\n<p align="right" style="margin:0px; padding-top:4px; border-top:1.333px solid #000000">&#160;</p>\n<p align="center" style="line-height:14pt; margin:0px; font-family:Times New Roman Bold; font-size:12pt"><b>UNITED STATES<br/>\n</b><font style="font-family:Times New Roman"><b>SECURITIES AND EXCHANGE COMMISSION<br/>\nWashington, D.C. 20549</b></font></p>\n<p align="center" style="margin:0px">&#151;&#151;&#151;&#151;&#151;&#151;&#151;</p>\n<p align="center" style="line-height:14pt

In [65]:
sections['item3'][:1000]

'Item 3.**\n\n**Legal Proceedings**\n\nIn January 2006, a former salesman of the Academy formed a business that the\nCompany believes competes directly with the Academy for student pilots.\nThereafter, the former President of the Academy resigned his position at the\nAcademy and the Company believes he became affiliated with the alleged\ncompeting business. The Academy has initiated a lawsuit against these former\nemployees, alleging violation of non-competition and fiduciary obligations.\nThe defendants, including the Academy\x92s former President, subsequently filed a\ncounterclaim against the Academy based upon lost earnings and breach of\ncontract.\n\nFrom time to time, we are involved in litigation relating to claims arising\nout of our operations in the normal course of business. As of the date of this\nannual report, we were not engaged in any other legal proceedings which are\nexpected, individually or in the aggregate, to have a material adverse effect\non us.\n\n**\n\n'

Note to self: I will probably need some sort of k-nearest-neighbours algorithm to ensure that each entry was selected from the correct 'Item' match. For example, right now we assume that the correct Item 1 is the last match in the document, which may not be true. Further investigation is required.

# Item Selection Validation

I will now try to do exactly what I mentioned above: validate that the correct sections were pulled from each of the 10-Ks.

In [73]:
temp = positions[list(positions.keys())[0]]['08'].copy()
temp.sort_values('start')

Unnamed: 0,item,start,end,length
0,item4,11495,11506,2619994
1,item1,21171,21183,2619994
2,item1a,21594,21606,2619994
3,item1b,22023,22035,2619994
4,item2,22465,22476,2619994
...,...,...,...,...
75,item11,2618492,2618499,2619994
76,item12,2618608,2618615,2619994
77,item13,2618796,2618803,2619994
78,item14,2618959,2618966,2619994


Could we use the length of each 'section' to define a point, and then use k-nearest neighbours or a similar algorithm to find the items that are closest together?
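
Before reaching for k-nearest neighbours, a simpler heuristic may be worth trying: table-of-contents entries sit very close together, while real headings are followed by a lot of text. The sketch below is not k-NN. It just keeps, for each item, the occurrence with the largest gap to the next match. The positions and `doc_length` are made up.

```python
import pandas as pd

# Made-up positions: a tightly spaced table of contents, then the real sections
cands = pd.DataFrame({
    'item':  ['item1', 'item2', 'item3', 'item1', 'item2', 'item3'],
    'start': [100, 150, 200, 5000, 42000, 60000],
})
doc_length = 90000  # hypothetical total document length

cands = cands.sort_values('start').reset_index(drop=True)

# Distance from each match to the next one (or to the end of the document)
cands['gap'] = cands['start'].shift(-1, fill_value=doc_length) - cands['start']

# For each item, keep the occurrence that is followed by the most text
best = cands.loc[cands.groupby('item')['gap'].idxmax()].sort_values('start')
print(best)
```

A known weakness: the last table-of-contents entry is followed by a large gap too (4800 characters here), so it could win if the real section for that item happens to be short.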

In [1]:
#positions[list(positions.keys())[0]]

In [68]:
temp['length'][1:]

1     2619994
2     2619994
3     2619994
4     2619994
5     2619994
       ...   
75    2619994
76    2619994
77    2619994
78    2619994
79    2619994
Name: length, Length: 79, dtype: int64

In [69]:
sections.keys()

dict_keys(['item4', 'item1', 'item1a', 'item1b', 'item2', 'item3', 'item5', 'item6', 'item7', 'item8', 'item9', 'item10', 'item11', 'item12', 'item13', 'item14', 'item15'])

# Pivot in Methodology

Let's try matching the exact title for each item instead of trying to figure out where the correct item is. For example:

In [195]:
item = [None]*20
# Raw strings keep the backslashes intact; the resulting patterns are unchanged
item[0]  = r'ITEM(\s|&#160;|&nbsp;)1.{0,20}business'
item[1]  = r'ITEM(\s|&#160;|&nbsp;)1a.{0,20}risk'
item[2]  = r'ITEM(\s|&#160;|&nbsp;)1b.{0,20}unresolved'
item[3]  = r'ITEM(\s|&#160;|&nbsp;)2.{0,20}properties'
item[4]  = r'ITEM(\s|&#160;|&nbsp;)3.{0,20}legal'
item[5]  = r'\/*>ITEM(\s|&#160;|&nbsp;)4.{0,20}mine'
item[6]  = r'\/*>ITEM(\s|&#160;|&nbsp;)5.{0,20}market'
item[7]  = r'\/*>ITEM(\s|&#160;|&nbsp;)6.{0,20}selected'
item[8]  = r'\/*>ITEM(\s|&#160;|&nbsp;)7.{0,20}management'
item[9]  = r'\/*>ITEM(\s|&#160;|&nbsp;)7a.{0,20}quantitative'
item[10] = r'\/*>ITEM(\s|&#160;|&nbsp;)8.{0,20}financial'
item[11] = r'ITEM(\s|&#160;|&nbsp;)9.{0,20}changes'
item[12] = r'\/*>ITEM(\s|&#160;|&nbsp;)9a.{0,20}controls'
item[13] = r'\/*>ITEM(\s|&#160;|&nbsp;)9b.{0,20}other'
item[14] = r'\/*>ITEM(\s|&#160;|&nbsp;)10.{0,30}directors'
item[15] = r'\/*>ITEM(\s|&#160;|&nbsp;)11.{0,30}'
item[16] = r'\/*>ITEM(\s|&#160;|&nbsp;)12.{0,20}security'
item[17] = r'\/*>ITEM(\s|&#160;|&nbsp;)13.{0,20}certain'
item[18] = r'\/*>ITEM(\s|&#160;|&nbsp;)14.{0,20}principal'
item[19] = r'\/*>ITEM(\s|&#160;|&nbsp;)15.{0,20}exhibits'
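
The same kind of pattern could also be generated from a table of (item number, first title word) pairs instead of being typed out by hand. This is a sketch only: `ITEM_TITLES` covers a subset of the items, `title_pattern` is a hypothetical helper, and the added `\b` after the number is my own tightening so that "1" does not also match "1A" or "17".

```python
import re

# (item number, first word of the title); a subset for illustration
ITEM_TITLES = [('1', 'business'), ('1a', 'risk'), ('7', 'management'), ('7a', 'quantitative')]

def title_pattern(number, keyword, max_gap=20):
    """Hypothetical helper: 'ITEM <number>' followed within max_gap characters by keyword."""
    return (r'ITEM(?:\s|&#160;|&nbsp;)' + re.escape(number)
            + r'\b.{0,%d}?' % max_gap + re.escape(keyword))

# One alternation over all items, case-insensitive like the search above
combined = re.compile('|'.join(title_pattern(n, k) for n, k in ITEM_TITLES), re.IGNORECASE)

# Made-up sample; the last entry ("Item 17") should not match any pattern
sample = 'Item 1. Business ... Item&#160;1A. Risk Factors ... Item 7. Management ... Item 17 management'
hits = [m.group() for m in combined.finditer(sample)]
print(hits)
```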

In [1]:
matches = re.finditer("ITEM(\s*|&#160;*|&nbsp;*)1.{0,50}", raw_10k, re.IGNORECASE)
[x for x in matches]

NameError: name 're' is not defined

In [196]:
# Join the 20 item patterns into a single alternation group
searchstr = '(' + '|'.join(item) + ')'

In [197]:
searchstr

'(ITEM(\\s|&#160;|&nbsp;)1.{0,20}business|ITEM(\\s|&#160;|&nbsp;)1a.{0,20}risk|ITEM(\\s|&#160;|&nbsp;)1b.{0,20}unresolved|ITEM(\\s|&#160;|&nbsp;)2.{0,20}properties|ITEM(\\s|&#160;|&nbsp;)3.{0,20}legal|\\/*>ITEM(\\s|&#160;|&nbsp;)4.{0,20}mine|\\/*>ITEM(\\s|&#160;|&nbsp;)5.{0,20}market|\\/*>ITEM(\\s|&#160;|&nbsp;)6.{0,20}selected|\\/*>ITEM(\\s|&#160;|&nbsp;)7.{0,20}management|\\/*>ITEM(\\s|&#160;|&nbsp;)7a.{0,20}quantitative|\\/*>ITEM(\\s|&#160;|&nbsp;)8.{0,20}financial|ITEM(\\s|&#160;|&nbsp;)9.{0,20}changes|\\/*>ITEM(\\s|&#160;|&nbsp;)9a.{0,20}controls|\\/*>ITEM(\\s|&#160;|&nbsp;)9b.{0,20}other|\\/*>ITEM(\\s|&#160;|&nbsp;)10.{0,30}directors|\\/*>ITEM(\\s|&#160;|&nbsp;)11.{0,30}|\\/*>ITEM(\\s|&#160;|&nbsp;)12.{0,20}security|\\/*>ITEM(\\s|&#160;|&nbsp;)13.{0,20}certain|\\/*>ITEM(\\s|&#160;|&nbsp;)14.{0,20}principal|\\/*>ITEM(\\s|&#160;|&nbsp;)15.{0,20}exhibits)'

In [198]:
parsed = {}
lens = {}
positions = {}

h = html2text.HTML2Text()

for company in companies[:1]:
    pulls = os.listdir("sec-edgar-filings/"+company+"/10-K")
    
    parsed[company] = {}
    positions[company] = {}
    
    a = pd.DataFrame()
    
    
    for year in pulls[:1]:

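        # Prefer the HTML filing; fall back to the plain-text version if it is missing
        filing_base = "sec-edgar-filings/"+company+"/10-K/"+year+"/filing-details"
        try:
            f = open(filing_base+".html", "r")
        except FileNotFoundError:
            f = open(filing_base+".txt", "r")
        with f:  # close the file once it has been read
            raw_10k = f.read()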
        
        
        matches = re.finditer(searchstr, raw_10k, re.IGNORECASE)
        
        positions[company][year] = [x for x in matches]

In [199]:
positions

{'0000100517': {'0001193125-18-054235': [<re.Match object; span=(43794, 43817), match='>Item&#160;11.</p></td>'>,
   <re.Match object; span=(48898, 48916), match='Item&#160;1A, Risk'>,
   <re.Match object; span=(114010, 114023), match='Item 3, Legal'>,
   <re.Match object; span=(141953, 141969), match='Item 1, Business'>,
   <re.Match object; span=(145209, 145225), match='Item 1, Business'>,
   <re.Match object; span=(304941, 304955), match='Item 1A., Risk'>,
   <re.Match object; span=(547274, 547288), match='Item 1A., Risk'>,
   <re.Match object; span=(1518986, 1519004), match='Item 2, Properties'>,
   <re.Match object; span=(1667346, 1667369), match='>ITEM&#160;11.</b></td>'>]}}

In [144]:
company

'0000100517'

In [100]:
year

'0001193125-18-054235'