# Parsing All Airline Filings

In this notebook, I extend the work done in the AAL_parse.ipynb file to every company whose SEC filings are in the folder. To check that the methodology is robust, I first run it on the first 5 companies in the folder and validate the results.

In [1]:
import numpy as np
import pandas as pd
import os
import re
import html2text
import pickle

## Subsample

First, let us take a look at the first 5 companies in the folder:

In [6]:
companies = os.listdir("sec-edgar-filings/")
companies[:5]

['0000100517', '0001351548', '0000101001', 'AAL', '0001405419']

For each of these companies, I want to parse each filing into its items, store the result in a dictionary, and pickle the dictionaries. I also want to record the character length of each parsed section to check that the algorithm is extracting the right text (I assume the length of a given item should be roughly similar from one filing to the next).
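As a minimal sketch of the storage layout described above, assuming a nested dictionary keyed by company, then two-digit year, then item label. The file name `parsed_demo.pkl` and the sample strings are placeholders, not outputs of this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical nested layout: company -> two-digit year -> item label -> text
parsed_demo = {
    "AAL": {
        "18": {"item1": "Business ...", "item1a": "Risk Factors ..."},
    }
}

# Record the character length of every parsed section for later validation
lens_demo = {
    company: {year: {item: len(text) for item, text in sections.items()}
              for year, sections in years.items()}
    for company, years in parsed_demo.items()
}

# Round-trip through pickle; the path is a placeholder in a temp directory
path = os.path.join(tempfile.mkdtemp(), "parsed_demo.pkl")
with open(path, "wb") as fh:
    pickle.dump(parsed_demo, fh)
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Keeping the lengths in a separate structure means the validation step can run without loading the full text of every filing.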

In [9]:
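# Alternation over item numbers 1-16. The "." is left unescaped, so it matches
# any single character: "1." also matches "1A", "1B", and the first two characters of "10"
searchstr = '(' + '|'.join(str(i)+'.' for i in range(1,17)) + ')'
searchstr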

'(1.|2.|3.|4.|5.|6.|7.|8.|9.|10.|11.|12.|13.|14.|15.|16.)'
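To see what this pattern actually captures, here is a small sketch that applies it to a few made-up heading strings (the samples are illustrative, not taken from any filing). Because the dot is a wildcard and the alternatives are tried in order, `1.` wins over `10.`, and a cross-reference such as "Item 405 of Regulation S-K" is picked up as "Item 40". That may be one source of unexpected labels such as `item40` in the results.

```python
import re

# Same pattern as in the notebook: "ITEM", an optional separator, then the alternation
searchstr = '(' + '|'.join(f'{i}.' for i in range(1, 17)) + ')'
pattern = re.compile(r"ITEM(|\s|&#160;|&nbsp;)" + searchstr, re.IGNORECASE)

# Made-up sample strings for illustration only
samples = [
    "Item 1. Business",
    "ITEM 1A. Risk Factors",
    "Item&#160;7A. Quantitative and Qualitative Disclosures",
    "Item 10. Directors",
    "see Item 405 of Regulation S-K",
]

found = [pattern.search(s).group() for s in samples]
print(found)
# ['Item 1.', 'ITEM 1A', 'Item&#160;7A', 'Item 10', 'Item 40']
```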

In [42]:
parsed = {}
lens = {}

h = html2text.HTML2Text()

for company in companies[:5]:
    pulls = os.listdir("sec-edgar-filings/"+company+"/10-K")
    
    parsed[company] = {}
    
    a = pd.DataFrame()
    
    
    for year in pulls:

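        # Prefer the HTML filing; fall back to the plain-text version if it is missing
        try:
            f = open("sec-edgar-filings/"+company+"/10-K/"+year+"/filing-details.html", "r")
        except FileNotFoundError:
            f = open("sec-edgar-filings/"+company+"/10-K/"+year+"/filing-details.txt", "r")
        with f:
            raw_10k = f.read()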
        
        matches = re.finditer(r"ITEM(|\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>ITEM(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>ITEM(\s|&#160;|&nbsp;)"+searchstr+'\.{0,1}', raw_10k, re.IGNORECASE)
        #matches = re.finditer("\/*>Item(\s|&#160;|&nbsp;)"+searchstr, raw_10k, re.IGNORECASE)
        locations = [x for x in matches]


        # Create the dataframe
        test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in locations])

        test_df.columns = ['item', 'start', 'end']
        test_df['item'] = test_df.item.str.lower()

        # Get rid of unnecessary characters from the dataframe
        test_df.replace('&#160;',' ',regex=True,inplace=True)
        test_df.replace('&nbsp;',' ',regex=True,inplace=True)
        test_df.replace(' ','',regex=True,inplace=True)
        test_df.replace(r'\.','',regex=True,inplace=True)
        test_df.replace('>','',regex=True,inplace=True)
        test_df.replace(r'\(','',regex=True,inplace=True)
        test_df.replace(r'\)','',regex=True,inplace=True)
        test_df.replace('\n','',regex=True,inplace=True)
        test_df.replace(',','',regex=True,inplace=True)
        test_df.replace('_','',regex=True,inplace=True)
        test_df.replace('\xa0','',regex=True,inplace=True)

        # Drop duplicates, keep last
        pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')

        # Set item as the dataframe index
        pos_dat.set_index('item', inplace=True)
        #lens[company][year] = pos_dat

        #starts = [0]+[locations[i].start() for i in range(len(locations))] + [len(raw_10k)]

        sections = {}

        #for i in range(len(starts)-1):
        #    temp = raw_10k[starts[i]:starts[i+1]]
            #tempsoup = BeautifulSoup(temp, 'html.parser')
            #goods = tempsoup.find_all('font')
            #sections[i] = [x.text for x in goods]
        #    sections[i] = h.handle(temp)

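        # Slice from each item heading to the next; the last heading has no
        # successor, so its section is not stored
        for i in range(len(pos_dat.index)-1):
            temp = raw_10k[pos_dat['start'].iloc[i]:pos_dat['start'].iloc[i+1]]
            sections[pos_dat.index[i]] = h.handle(temp)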


        
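        # Characters 11-12 of the accession number (e.g. "18" in 0001193125-18-054235) are the two-digit year
        parsed[company][year[11:13]] = sections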
        b = pd.DataFrame([len(x) for x in sections.values()], index = sections.keys(), columns=[year])
        a = pd.concat([a,b], axis=1, join='outer')
    
    lens[company] = a
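The central idea of the loop above is to keep only the last occurrence of each item heading and then slice the raw text from one heading to the next. A likely motivation for `keep='last'` is that headings tend to appear twice, once in the table of contents and once in the body. The sketch below reproduces that idea on a toy string. It swaps in a plain dictionary for the pandas `drop_duplicates` call and uses a simplified, hypothetical regex, so it illustrates the technique rather than the notebook's exact code.

```python
import re

# Toy filing: headings appear in a table of contents and again in the body
doc = (
    "Item 1. Business ..... 3\n"
    "Item 2. Properties ..... 9\n"
    "Item 1. Business\nWe operate an airline.\n"
    "Item 2. Properties\nWe lease aircraft.\n"
)

# Later matches overwrite earlier ones, mirroring keep='last'
last_start = {}
for m in re.finditer(r"item\s(\d+[a-z]?)\.", doc, re.IGNORECASE):
    last_start["item" + m.group(1).lower()] = m.start()

# Slice from each heading to the next; the final item has no successor,
# so, like the loop above, this sketch does not emit it
ordered = sorted(last_start.items(), key=lambda kv: kv[1])
sections = {
    label: doc[start:ordered[i + 1][1]]
    for i, (label, start) in enumerate(ordered[:-1])
}
```

Here `sections` holds only `item1`, taken from the body rather than the table of contents. Appending `len(doc)` as a final boundary would be one way to capture the last item as well.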

In [49]:
lens[list(lens.keys())[0]]

item,0001193125-18-054235,0000100517-04-000007,0000950137-09-001460,0001193125-14-060695,0001104659-07-019919,0001193125-12-073010,0001193125-16-468479,0000100517-21-000016,0001193125-17-054129,0001193125-13-074391,0000100517-05-000006,0001047469-08-001951,0000100517-01-500026,0000100517-19-000009,0001193125-15-056493,0000100517-06-000021,0000100517-03-000007,0000100517-20-000010,0001193125-11-042335,0001193125-10-041523
item40,71642.0,144.0,118884.0,64255.0,81818.0,75603.0,71162.0,,69912.0,74666.0,144.0,28353.0,46278.0,94986.0,75742.0,17500.0,74262.0,,120677.0,3029.0
item1,15229.0,45595.0,32037.0,30438.0,25229.0,29312.0,15097.0,1026.0,14413.0,30264.0,43972.0,21.0,13429.0,38624.0,18244.0,20887.0,22071.0,991.0,4190.0,29700.0
item1b,7248.0,,115.0,106.0,102.0,106.0,7996.0,6363.0,7525.0,106.0,,39.0,,5491.0,106.0,57.0,,6389.0,51.0,54.0
item3,3442.0,4976.0,9385.0,4474.0,10531.0,12507.0,4596.0,3245.0,4369.0,11328.0,5769.0,30.0,3117.0,3427.0,2887.0,7372.0,6741.0,3359.0,18088.0,12976.0
item4,30354.0,174.0,2440.0,48352.0,4794.0,6259.0,44176.0,20938.0,33705.0,52119.0,96.0,75.0,2582.0,22120.0,49654.0,90.0,197.0,12432.0,89.0,120101.0
item6,52330.0,74143.0,134162.0,60263.0,83888.0,136908.0,55931.0,51706.0,51450.0,82854.0,79169.0,36.0,2647.0,10903.0,54322.0,48827.0,2051.0,44060.0,117188.0,124752.0
item7,4507.0,1823.0,3309.0,3763.0,3097.0,3945.0,4133.0,185177.0,4139.0,3823.0,1528.0,98.0,97959.0,4719.0,3915.0,38.0,4363.0,150602.0,3894.0,3755.0
item1a,151.0,,365.0,197.0,412.0,163.0,161.0,176.0,152.0,163.0,,26.0,,192.0,197.0,332.0,,192.0,350.0,312.0
item7a,164580.0,132535.0,324798.0,11793.0,256444.0,361686.0,172834.0,1551.0,157988.0,15999.0,129908.0,71.0,11589.0,151674.0,220499.0,159600.0,146120.0,1373.0,386644.0,263371.0
item2,34577.0,2810.0,7242.0,7037.0,4643.0,8828.0,36476.0,38829.0,37852.0,9628.0,3043.0,23.0,4013.0,34913.0,8304.0,3256.0,53410.0,35057.0,8579.0,83.0
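One way to turn the "lengths should be somewhat alike" check into something mechanical is to flag filings whose section is far shorter than the median for that item. The sketch below does this for a handful of `item1` lengths copied from the table above. The helper name `flag_short` and the 10% cutoff are assumptions chosen for illustration, not part of the original analysis.

```python
from statistics import median

# A few item1 lengths copied from the table above, keyed by accession number
item1_lens = {
    "0001193125-18-054235": 15229,
    "0000100517-04-000007": 45595,
    "0000950137-09-001460": 32037,
    "0000100517-21-000016": 1026,
    "0001047469-08-001951": 21,
}

def flag_short(lengths, fraction=0.1):
    """Return filings whose section is shorter than `fraction` of the median."""
    cutoff = fraction * median(lengths.values())
    return sorted(k for k, v in lengths.items() if v < cutoff)

suspect = flag_short(item1_lens)
print(suspect)
# ['0000100517-21-000016', '0001047469-08-001951']
```

The flagged filings are candidates for manual inspection, since a very short section often means the match landed on a cross-reference or a table-of-contents entry instead of the real heading.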


In [22]:
raw_10k[:10000]

'<html><body><document>\n<type>10-K\n<sequence>1\n<filename>form10k.htm\n<description>FRONTIER AIRLINES HOLDINGS, INC. FORM 10-K\n<text>\n<title>\n      Frontier Airlines Holdings, Inc. Form 10-K\n</title>\n<!-- Licensed to: Cenveo-->\n<!-- Document Created using EDGARizer HTML 3.0.4.0 -->\n<!-- Copyright 2006 EDGARfilings, Ltd., an IEC company.-->\n<!-- All rights reserved EDGARfilings.com -->\n<div>\n<hr align="left" noshade="" size="1" style="COLOR: black" width="100%"/>\n</div><br/>\n<div align="center" style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, serif"><strong>UNITED\n      STATES</strong></font></div>\n<div align="center" style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; LINE-HEIGHT: 1.25; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, serif"><strong>SECURITIES\n      AND EXCHANGE COMMISSION</st