# Parsing All Airline Filings

In this notebook, I extend the work done in the AAL_parse.ipynb file to every company in the folder that submitted a filing to the SEC. To check that the methodology is robust, I first run it on the first 5 companies in the folder and perform some validation.

In [4]:
import numpy as np
import pandas as pd
import os
import re
import html2text
import pickle

## Subsample

First, let us take a look at the first 5 companies in the folder:

In [5]:
companies = os.listdir("sec-edgar-filings/")
companies[:5]

['0000100517', '0001351548', '0000101001', 'AAL', '0001405419']

For each of these companies, I would like to parse each 10-K into its items, save the result into a dictionary, and pickle the dictionaries. I also want to save the character length of every extracted section to validate that the algorithm is pulling the information correctly (I assume the lengths of the same item should be broadly similar across filings).
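The pickling step itself is not shown in this section. A minimal sketch of what it could look like, assuming a nested dictionary shaped like the `parsed` dictionary built further down (company, then two-digit year, then item label). The file name `parsed_10k.pkl` and the toy contents are hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-in for the nested dictionary built later in the notebook:
# company -> two-digit year -> item label -> section text
parsed_example = {"AAL": {"18": {"item1": "Business ...", "item1a": "Risk factors ..."}}}

# Hypothetical output location; the notebook does not name one
out_path = os.path.join(tempfile.gettempdir(), "parsed_10k.pkl")

with open(out_path, "wb") as fh:
    pickle.dump(parsed_example, fh)

with open(out_path, "rb") as fh:
    restored = pickle.load(fh)

# Character length of each section, the quantity used for validation
section_lengths = {item: len(text) for item, text in restored["AAL"]["18"].items()}
```

Pickle preserves the nested structure as-is, so the lengths can be recomputed at any time from the restored object.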

In [19]:
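# Alternation of item numbers 1-16, each meant to allow an optional a/b suffix.
# Caveat 1: the '.' in '1(.|a|b)' is an unescaped wildcard, so "1" plus any character
# matches (items 10-16 are in practice matched through this first alternative).
# Caveat 2: '(|a|b)' tries the empty option first, so "Item 7A" is matched as "Item 7".
searchstr = '(1(.|a|b)|'
for i in range(2,16):
    searchstr+=str(i)+'(|a|b)|'

searchstr+='16(|a|b))'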
searchstr

'(1(.|a|b)|2(|a|b)|3(|a|b)|4(|a|b)|5(|a|b)|6(|a|b)|7(|a|b)|8(|a|b)|9(|a|b)|10(|a|b)|11(|a|b)|12(|a|b)|13(|a|b)|14(|a|b)|15(|a|b)|16(|a|b))'
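To see what this pattern accepts, the sketch below runs it against a few made-up headings. It also compares it with `strict`, which is one possible tightening and not the pattern used in this notebook: it puts the two-digit numbers first and makes the a/b suffix greedy. The sample strings are invented for illustration.

```python
import re

# Rebuild the pattern exactly as in the cell above
searchstr = '(1(.|a|b)|'
for i in range(2, 16):
    searchstr += str(i) + '(|a|b)|'
searchstr += '16(|a|b))'

pattern = re.compile(r"ITEM(|\s|&#160;|&nbsp;)" + searchstr, re.IGNORECASE)

# Hypothetical alternative (an assumption, not the notebook's regex):
# two-digit numbers are tried first and the a/b suffix is consumed when present
strict = re.compile(r"ITEM(?:\s|&#160;|&nbsp;)*(?:1[0-6]|[1-9])[ab]?", re.IGNORECASE)

# Made-up headings for illustration only
samples = [
    "Item 1. Business",
    "ITEM&#160;1A. Risk Factors",
    "Item 7A",
    "Item 10. Directors",
    "Item 1<b>",
]

found = [pattern.search(s).group() for s in samples]
strict_found = [strict.search(s).group() for s in samples]

for s, a, b in zip(samples, found, strict_found):
    print(f"{s!r:32} notebook: {a!r:16} strict: {b!r}")
```

With these samples, the notebook pattern returns `Item 7` for the `Item 7A` heading and `Item 1<` for `Item 1<b>`, because regex alternation takes the first alternative that succeeds rather than the longest one.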

In [52]:
parsed = {}
lens = {}
positions = {}

h = html2text.HTML2Text()

for company in companies[:5]:
    pulls = os.listdir("sec-edgar-filings/"+company+"/10-K")
    
    parsed[company] = {}
    positions[company] = {}
    
    a = pd.DataFrame()
    
    
    for year in pulls:

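        filing_dir = "sec-edgar-filings/"+company+"/10-K/"+year+"/"
        try:
            f = open(filing_dir+"filing-details.html", "r")
        except FileNotFoundError:
            # Fall back to the plain-text filing when there is no HTML version
            f = open(filing_dir+"filing-details.txt", "r")
        with f:
            raw_10k = f.read()
        
        # Raw string so that \s reaches the regex engine unchanged
        matches = re.finditer(r"ITEM(|\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)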
        #matches = re.finditer("\/*>ITEM(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>ITEM(\s|&#160;|&nbsp;)"+searchstr+'\.{0,1}', raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>Item(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        locations = [x for x in matches]


        # Create the dataframe
        test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in locations])

        test_df.columns = ['item', 'start', 'end']
        test_df['item'] = test_df.item.str.lower()

        # Get rid of unnecessary characters (HTML entities, whitespace, punctuation)
        # so that labels look like 'item1a'
        test_df['item'] = test_df['item'].str.replace(r'&#160;|&nbsp;|[ .>()\n,_\xa0]', '', regex=True)

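        # Sort by position. Duplicates are kept here; when the sections dict is filled
        # below, a later match overwrites an earlier one with the same label
        pos_dat = test_df.sort_values('start', ascending=True)#.drop_duplicates(subset=['item'], keep='last')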

        # Set item as the dataframe index
        #pos_dat.set_index('item', inplace=True)
        #lens[company][year] = pos_dat

        #starts = [0]+[locations[i].start() for i in range(len(locations))] + [len(raw_10k)]

        sections = {}

        #for i in range(len(starts)-1):
        #    temp = raw_10k[starts[i]:starts[i+1]]
            #tempsoup = BeautifulSoup(temp, 'html.parser')
            #goods = tempsoup.find_all('font')
            #sections[i] = [x.text for x in goods]
        #    sections[i] = h.handle(temp)

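        # Each section runs from one match to the next; the final match has no
        # following match, so the text after it is not captured
        for i in range(len(pos_dat.index)-1):
            temp = raw_10k[pos_dat['start'].iloc[i]:pos_dat['start'].iloc[i+1]]
            sections[pos_dat['item'].iloc[i]] = h.handle(temp)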


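        # The folder name is an accession number such as 0001193125-18-054235;
        # characters 11-12 are the two-digit filing year
        positions[company][year[11:13]] = pos_dat
        parsed[company][year[11:13]] = sections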
        b = pd.DataFrame([len(x) for x in sections.values()], index = sections.keys(), columns=[year])
        a = pd.concat([a,b], axis=1, join='outer')
    
    lens[company] = a

In [51]:
pos_dat

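| | item | start | end |
|---|---|---|---|
| 0 | item4 | 7033 | 7039 |
| 1 | item1 | 11313 | 11320 |
| 2 | item1a | 11573 | 11580 |
| 3 | item1b | 11817 | 11824 |
| 4 | item2 | 12074 | 12080 |
| 5 | item3 | 12333 | 12339 |
| 6 | item4 | 12599 | 12605 |
| 7 | item5 | 12957 | 12963 |
| 8 | item6 | 13438 | 13444 |
| 9 | item7 | 13703 | 13709 |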


In [46]:
lens[list(lens.keys())[0]]

| | 0001193125-18-054235 | 0000100517-04-000007 | 0000950137-09-001460 | 0001193125-14-060695 | 0001104659-07-019919 | 0001193125-12-073010 | 0001193125-16-468479 | 0000100517-21-000016 | 0001193125-17-054129 | 0001193125-13-074391 | 0000100517-05-000006 | 0001047469-08-001951 | 0000100517-01-500026 | 0000100517-19-000009 | 0001193125-15-056493 | 0000100517-06-000021 | 0000100517-03-000007 | 0000100517-20-000010 | 0001193125-11-042335 | 0001193125-10-041523 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| item4 | 127 | 144.0 | 2440.0 | 127.0 | 4666.0 | 126.0 | 101.0 | 60 | 127 | 126.0 | 144.0 | 75.0 | 2582.0 | 72 | 127.0 | 1126.0 | 197.0 | 72 | 89.0 | 151.0 |
| item1 | 5442 | 2414.0 | 50.0 | 8715.0 | 75.0 | 19086.0 | 3004.0 | 1026 | 2893 | 29237.0 | 2443.0 | 21.0 | 13063.0 | 180 | 18244.0 | 20887.0 | 22071.0 | 991 | 4190.0 | 47.0 |
| item1a | 151 | | 365.0 | 197.0 | 334.0 | 163.0 | 161.0 | 176 | 152 | 163.0 | | 26.0 | | 192 | 197.0 | 332.0 | | 192 | 350.0 | 312.0 |
| item1b | 106 | | 115.0 | 106.0 | 72.0 | 106.0 | 81.0 | 44 | 106 | 106.0 | | 39.0 | | 91 | 106.0 | 57.0 | | 91 | 51.0 | 54.0 |
| item2 | 34577 | 2810.0 | 7242.0 | 5760.0 | 4606.0 | 7214.0 | 36476.0 | 38829 | 37852 | 7626.0 | 3043.0 | 23.0 | 4013.0 | 21866 | 6296.0 | 3253.0 | 52935.0 | 35057 | 5627.0 | 38.0 |
| item3 | 3442 | 4976.0 | 9385.0 | 4474.0 | 10460.0 | 12507.0 | 4596.0 | 3245 | 4369 | 11328.0 | 5769.0 | 30.0 | 3117.0 | 3427 | 2887.0 | 7372.0 | 6568.0 | 3359 | 18088.0 | 12976.0 |
| item5 | 28 | 1647.0 | 2876.0 | 28.0 | 5618.0 | 28.0 | 28.0 | 28 | 28 | 28.0 | 1499.0 | 121.0 | 1455.0 | 35 | 28.0 | 2555.0 | 2645.0 | 35 | 4078.0 | 2697.0 |
| item6 | 394 | 2188.0 | 7554.0 | 74618.0 | 5202.0 | 83477.0 | 65369.0 | 89 | 71198 | 74162.0 | 3027.0 | 36.0 | 2647.0 | 4764 | 60181.0 | 3136.0 | 2051.0 | 269 | 82174.0 | 8026.0 |
| item7 | 4104 | 1823.0 | 23202.0 | 8546.0 | 9233.0 | 7803.0 | 1467.0 | 2741 | 4453 | 9167.0 | 1528.0 | 71.0 | 11589.0 | 3572 | 9195.0 | 38.0 | 6646.0 | 2737 | 9960.0 | 13052.0 |
| item8 | 1102 | 905.0 | 967.0 | 526.0 | 1112.0 | 9763.0 | 526.0 | 695 | 858 | 9532.0 | 622.0 | 808.0 | 901.0 | 632 | 526.0 | 638.0 | 905.0 | 645 | 10163.0 | 643.0 |
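A length table like this one can be screened automatically rather than by eye. The sketch below assumes a DataFrame shaped like `lens[company]` (items as rows, filings as columns) and flags sections that are much shorter than the typical length of the same item. The toy values, the column labels, and the 10% cutoff are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy table shaped like lens[company]: rows are items, columns are filings.
# The column labels and values are made up for illustration.
lengths = pd.DataFrame(
    {
        "filing_a": [5000, 150, 30000],
        "filing_b": [4800, np.nan, 40],
        "filing_c": [50, 160, 31000],
    },
    index=["item1", "item1a", "item2"],
)

# Typical length of each item across filings (NaN entries are skipped)
row_median = lengths.median(axis=1)

# Flag sections shorter than 10% of that item's median.
# The 10% cutoff is an arbitrary assumption, not a value from the notebook.
too_short = lengths.lt(0.1 * row_median, axis=0)

# (item, filing) pairs worth inspecting by hand
suspects = [
    (item, filing)
    for item in too_short.index
    for filing in too_short.columns
    if too_short.loc[item, filing]
]
print(suspects)
```

Missing sections (NaN) are not flagged by this comparison; they would need a separate check such as `lengths.isna()`.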


In [38]:
raw_10k[:10000]

'<html><body><document>\n<type>10-K\n<sequence>1\n<filename>gulfstream10k.htm\n<description>ANNUAL REPORT\n<text>\n<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">\n\n<title>Gulfstream - 10-K</title>\n<meta content="C054099" name="keywords"/>\n<meta content="EWH" name="author"/>\n<meta content="04/10/2008" name="date"/>\n<p align="center" style="margin-top:2px; margin-bottom:0px"><br/></p>\n<p align="right" style="margin-top:0px; margin-bottom:2.2px; padding-bottom:4px; border-bottom:4px solid #000000">&#160;</p>\n<p align="right" style="margin:0px; padding-top:4px; border-top:1.333px solid #000000">&#160;</p>\n<p align="center" style="line-height:14pt; margin:0px; font-family:Times New Roman Bold; font-size:12pt"><b>UNITED STATES<br/>\n</b><font style="font-family:Times New Roman"><b>SECURITIES AND EXCHANGE COMMISSION<br/>\nWashington, D.C. 20549</b></font></p>\n<p align="center" style="margin:0px">&#151;&#151;&#151;&#151;&#151;&#151;&#151;</p>\n<p align="center" style="line-height:14pt

Note to self: I will probably need some additional method (possibly a k-nearest-neighbours approach) to ensure that each section is taken from the correct 'Item' heading. For example, right now the code assumes that the correct Item 1 is the last match in the document, which may not be true. Further investigation is required.
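As a cheaper baseline before trying k-nearest-neighbours, one heuristic (a swapped-in idea, not something this notebook implements) is to keep, for each item label, the match that is followed by the longest stretch of text, on the assumption that table-of-contents entries are followed by very short spans. The sketch assumes a DataFrame with `item` and `start` columns like `pos_dat`. The function name `pick_longest_occurrence` and the `doc_len` argument are hypothetical.

```python
import pandas as pd

def pick_longest_occurrence(pos_dat, doc_len):
    """For each item label, keep the match followed by the longest span of text."""
    df = pos_dat.sort_values("start").reset_index(drop=True)
    # A span runs from this match to the next match (or to the end of the document)
    next_start = df["start"].shift(-1).fillna(doc_len)
    df["span"] = (next_start - df["start"]).astype(int)
    # idxmax returns the row label of the longest span within each item group
    best_rows = df.groupby("item")["span"].idxmax()
    return df.loc[best_rows].sort_values("start").reset_index(drop=True)

# Made-up positions: a short table of contents followed by the real headings
toy = pd.DataFrame({
    "item": ["item1", "item1a", "item2", "item1", "item1a", "item2"],
    "start": [100, 150, 200, 5000, 20000, 26000],
})

chosen = pick_longest_occurrence(toy, doc_len=40000)
print(chosen)
```

One known weakness: the last table-of-contents entry is followed by everything up to the first real heading, so it can win incorrectly when the real section is short. Any use of this heuristic would need a guard for that case.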

# Item Selection Validation

I will now do what I mentioned above: validate that the correct sections were pulled from each of the 10-Ks.

In [60]:
# Put the match tables of every filing of the first company side by side
first_company = list(positions.keys())[0]
pd.concat([positions[first_company][x] for x in positions[first_company]], axis=1)

| | item | start | end | item.1 | start.1 | end.1 | item.2 | start.2 | end.2 | item.3 | ... | end.3 | item.4 | start.3 | end.4 | item.5 | start.4 | end.5 | item.6 | start.5 | end.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | item4 | 19371.0 | 19382.0 | item4 | 4144.0 | 4150.0 | item4 | 16396.0 | 16407.0 | item4 | ... | 5164.0 | item1 | 585002.0 | 585009.0 | item4 | 22407 | 22418 | item4 | 17422.0 | 17433.0 |
| 1 | item1 | 33293.0 | 33305.0 | item1 | 5626.0 | 5633.0 | item1 | 27498.0 | 27505.0 | item1 | ... | 6578.0 | item1a | 586011.0 | 586018.0 | item1 | 38227 | 38239 | item1 | 29833.0 | 29845.0 |
| 2 | item1a | 33835.0 | 33847.0 | item2 | 5814.0 | 5820.0 | item1a | 27838.0 | 27845.0 | item1a | ... | 6765.0 | item1b | 587026.0 | 587033.0 | item1a | 38867 | 38879 | item1a | 30328.0 | 30340.0 |
| 3 | item1b | 34383.0 | 34395.0 | item3 | 6005.0 | 6011.0 | item1b | 28184.0 | 28191.0 | item1b | ... | 6956.0 | item2 | 588054.0 | 588060.0 | item1b | 39513 | 39525 | item1b | 30829.0 | 30841.0 |
| 4 | item2 | 34944.0 | 34955.0 | item4 | 6203.0 | 6209.0 | item2 | 28533.0 | 28539.0 | item2 | ... | 7154.0 | item3 | 589066.0 | 589072.0 | item2 | 40172 | 40183 | item2 | 31343.0 | 31354.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 107 | | | | | | | | | | | ... | | | | | item14 | 4019179 | 4019191 | | | |
| 108 | | | | | | | | | | | ... | | | | | item15 | 4043866 | 4043878 | | | |
| 109 | | | | | | | | | | | ... | | | | | item8 | 4044509 | 4044520 | | | |
| 110 | | | | | | | | | | | ... | | | | | item6 | 4148212 | 4148218 | | | |
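A side-by-side table this wide is hard to read, so a compact complement is to count how often each item label was matched in each filing. The sketch below assumes a dictionary shaped like `positions[company]` (year label mapped to a DataFrame with an `item` column). The name `positions_example` and its values are made up.

```python
import pandas as pd

# Toy stand-in for positions[company]: year label -> DataFrame of matches
positions_example = {
    "18": pd.DataFrame({
        "item": ["item1", "item1a", "item1", "item1a", "item2"],
        "start": [10, 20, 300, 900, 1500],
    }),
    "04": pd.DataFrame({
        "item": ["item1", "item2", "item1", "item2"],
        "start": [5, 9, 200, 700],
    }),
}

# One column per filing, one row per item label, cell = number of matches
counts = pd.DataFrame(
    {year: df["item"].value_counts() for year, df in positions_example.items()}
).fillna(0).astype(int)

print(counts)
```

A filing with far more matches than the others, such as the one with over a hundred rows above, probably contains in-text cross-references to items in addition to the real headings, though that would need to be confirmed against the raw filing.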
