# Retrieving Data From Edgar

In this notebook, I explore pulling the necessary data from EDGAR, the SEC's public database of company filings, which includes the quarterly and annual financial reports required by law. First, I look at one company, American Airlines (AAL), and then generalize the code to cover all the companies we need.

We will use the sec_edgar_downloader package, which provides a simple interface for downloading filings from EDGAR. More information about the package can be found here:

https://pypi.org/project/sec-edgar-downloader/

In [1]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import os
import re
import html2text
import pickle
from sec_edgar_downloader import Downloader

## American Airlines Sample

First, we will start with the 10-K filing. My notes on which parts of the 10-K are most valuable are in the 10K analysis file in the GitHub repository.

In [2]:
company = 'American Airlines Inc'
ticker = 'AAL'

# Initialize a downloader instance. If no argument is passed
# to the constructor, the package will download filings to
# the current working directory.
dl = Downloader()

In [3]:
# Get all 10-K filings for American Airlines (ticker: AAL) from 2000 onwards
#dl.get("10-K", ticker, after="2000-01-01")

The above request downloads the specified reports into the working directory: 

In [4]:
pulls = os.listdir("sec-edgar-filings/AAL/10-K")
pulls[:10]

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009']

In [5]:
# Look inside one filing folder: the filing submitted in 2015 (fiscal year 2014)
os.listdir("sec-edgar-filings/AAL/10-K/0001193125-15-061145")

['full-submission.txt', 'filing-details.html']

Each filing folder contains two files: the full submission text and the filing details HTML. Let us look at the HTML file:

In [6]:
# Read the filing submitted in 2015 (fiscal year 2014)
with open("sec-edgar-filings/AAL/10-K/0001193125-15-061145/filing-details.html", "r") as f:
    raw_10k = f.read()

In [7]:
print(raw_10k[:500])

<html><body><document>
<type>10-K
<sequence>1
<filename>d829913d10k.htm
<description>FORM 10-K
<text>
<title>Form 10-K</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>
<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Rom


In [8]:
soup = BeautifulSoup(raw_10k, 'lxml')

In [9]:
cleaned_soup = soup.text
cleaned_soup[:500]

'\n10-K\n1\nd829913d10k.htm\nFORM 10-K\n\nForm 10-K\nTable of Contents\n\xa0\n\xa0 UNITED STATES SECURITIES AND\nEXCHANGE COMMISSION  Washington, D.C. 20549 \n\xa0 \xa0\nFORM 10-K  \xa0\n\xa0 \xa0\n\n\nþ\n ANNUAL REPORT PURSUANT TO SECTION\xa013 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Fiscal Year Ended December\xa031, 2014 \n\xa0\n\n\n¨\n TRANSITION REPORT PURSUANT TO SECTION\xa013 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Transition Period\nFrom\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 to\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\n Commission file number 1-8400 '

In [11]:
cleaned_soup[:500].lstrip()

'10-K\n1\nd829913d10k.htm\nFORM 10-K\n\nForm 10-K\nTable of Contents\n\xa0\n\xa0 UNITED STATES SECURITIES AND\nEXCHANGE COMMISSION  Washington, D.C. 20549 \n\xa0 \xa0\nFORM 10-K  \xa0\n\xa0 \xa0\n\n\nþ\n ANNUAL REPORT PURSUANT TO SECTION\xa013 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Fiscal Year Ended December\xa031, 2014 \n\xa0\n\n\n¨\n TRANSITION REPORT PURSUANT TO SECTION\xa013 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Transition Period\nFrom\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 to\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\n Commission file number 1-8400 '

In [12]:
cleaned_soup = '\n'.join(' '.join(line.split()) for line in cleaned_soup.split('\n'))

In [13]:
cleaned_soup[:500]

'\n10-K\n1\nd829913d10k.htm\nFORM 10-K\n\nForm 10-K\nTable of Contents\n\nUNITED STATES SECURITIES AND\nEXCHANGE COMMISSION Washington, D.C. 20549\n\nFORM 10-K\n\n\n\nþ\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the Fiscal Year Ended December 31, 2014\n\n\n\n¨\nTRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the Transition Period\nFrom to\nCommission file number 1-8400\nAmerican\nAirlines Group Inc. (Exact name of registrant as spe'

In [14]:
# Match any whitespace character except carriage returns and newlines
exp = re.compile(r"[^\S\r\n]")
res = exp.sub('', cleaned_soup)

In [15]:
res[:5000]

'\n10-K\n1\nd829913d10k.htm\nFORM10-K\n\nForm10-K\nTableofContents\n\nUNITEDSTATESSECURITIESAND\nEXCHANGECOMMISSIONWashington,D.C.20549\n\nFORM10-K\n\n\n\nþ\nANNUALREPORTPURSUANTTOSECTION13OR15(d)OFTHESECURITIESEXCHANGEACTOF1934\nFortheFiscalYearEndedDecember31,2014\n\n\n\n¨\nTRANSITIONREPORTPURSUANTTOSECTION13OR15(d)OFTHESECURITIESEXCHANGEACTOF1934\nFortheTransitionPeriod\nFromto\nCommissionfilenumber1-8400\nAmerican\nAirlinesGroupInc.(Exactnameofregistrantasspecifiedinitscharter)\n\n\n\n\n\n\n\n\nDelaware\n\n75-1825172\n\n(Stateorotherjurisdictionof\nincorporationororganization)\n\n(I.R.S.EmployerIdentificationNo.)\n\n4333AmonCarterBlvd.,FortWorth,Texas76155\n\n(817)963-1234\n\n(Addressofprincipalexecutiveoffices,includingzipcode)\n\nRegistrant\x92stelephonenumber,includingareacode\n(Formername,formeraddressand\nformerfiscalyear,ifchangedsincelastreport)SecuritiesregisteredpursuanttoSection12(b)oftheAct:\n\n\n\n\n\n\n\n\n\nNameofExchangeonWhichRegistered\n\nCommonStock,$0.01parvaluep
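The substitution above deletes every horizontal whitespace character, which is why the words in the output run together. A minimal sketch of a gentler variant, assuming the goal is to keep line breaks while collapsing each run of spaces, tabs, and non-breaking spaces into a single space (the helper name `collapse_spaces` is hypothetical and not part of the notebook):

```python
import re

def collapse_spaces(text):
    """Collapse runs of horizontal whitespace to one space, keeping \r and \n."""
    # [^\S\r\n] means "whitespace, but not a carriage return or newline"
    return re.sub(r"[^\S\r\n]+", " ", text)

sample = "FORM\xa0 10-K \n  Table\tof Contents"
cleaned = collapse_spaces(sample)
print(repr(cleaned))
```

Replacing with a space rather than an empty string keeps word boundaries intact, which the later `Item\s7` style patterns rely on.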

In [10]:
cleaned_soup = ' '.join(cleaned_soup.split())
cleaned_soup[:500]

'10-K 1 d829913d10k.htm FORM 10-K Form 10-K Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K þ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended December 31, 2014 ¨ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission file number 1-8400 American Airlines Group Inc. (Exact name of registrant as specified in '

In [17]:
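# Build an alternation over the item numbers 1-16.
# Note: the '.' is not escaped, so in a regex it matches any character.
searchstr = '('
for i in range(1,16):
    searchstr+=str(i)+'.|'

searchstr+='16.)'
searchstr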

'(1.|2.|3.|4.|5.|6.|7.|8.|9.|10.|11.|12.|13.|14.|15.|16.)'
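In the pattern above the `.` is unescaped, so it matches any character and `4.` will happily match `40`. A minimal sketch of a stricter heading pattern, assuming the item labels listed below (the list and the sample string are illustrative, not taken from a filing):

```python
import re

# Item labels as they typically appear in a 10-K table of contents (assumed list)
labels = ["1", "1A", "1B", "2", "3", "4", "5", "6", "7", "7A", "8",
          "9", "9A", "9B", "10", "11", "12", "13", "14", "15", "16"]

# Try longer labels first so that "10" is tested before "1"
alternation = "|".join(re.escape(label) for label in sorted(labels, key=len, reverse=True))

# Require a literal period after the label
item_heading = re.compile(r"\bItem\s+(" + alternation + r")\.", re.IGNORECASE)

sample = "see Item 40 of the plan. Item 1. Business Item 1A. Risk Factors Item 10. Directors"
found = [m.group(1) for m in item_heading.finditer(sample)]
print(found)
```

Requiring the literal period drops headings written without one, so this trades recall for precision.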

In [18]:
matches = re.finditer(r"Item\s"+searchstr, cleaned_soup, re.IGNORECASE)
locations = [x for x in matches]
locations

[<re.Match object; span=(2939, 2946), match='Item 40'>,
 <re.Match object; span=(5199, 5206), match='Item 1.'>,
 <re.Match object; span=(5218, 5225), match='Item 1A'>,
 <re.Match object; span=(5243, 5250), match='Item 1B'>,
 <re.Match object; span=(5281, 5288), match='Item 2.'>,
 <re.Match object; span=(5303, 5310), match='Item 3.'>,
 <re.Match object; span=(5332, 5339), match='Item 4.'>,
 <re.Match object; span=(5375, 5382), match='Item 5.'>,
 <re.Match object; span=(5486, 5493), match='Item 6.'>,
 <re.Match object; span=(5534, 5541), match='Item 7.'>,
 <re.Match object; span=(5631, 5638), match='Item 7A'>,
 <re.Match object; span=(5703, 5710), match='Item 8A'>,
 <re.Match object; span=(5805, 5812), match='Item 8B'>,
 <re.Match object; span=(5902, 5909), match='Item 9.'>,
 <re.Match object; span=(5999, 6006), match='Item 9A'>,
 <re.Match object; span=(6045, 6052), match='Item 10'>,
 <re.Match object; span=(6113, 6120), match='Item 11'>,
 <re.Match object; span=(6149, 6156), match='Ite

In [19]:
item = [None]*20
item[0]  = 'ITEM\s1.{0,20}business'
item[1]  = 'ITEM\s1a.{0,20}risk'
item[2]  = 'ITEM\s1b.{0,20}unresolved'
item[3]  = 'ITEM\s2.{0,20}properties'
item[4]  = 'ITEM\s3.{0,20}legal'
item[5]  = 'ITEM\s4.{0,20}mine'
item[6]  = 'ITEM\s5.{0,20}market'
item[7]  = 'ITEM\s6.{0,20}selected'
item[8]  = 'ITEM\s7.{0,20}management'
item[9]  = 'ITEM\s7a.{0,20}quantitative'
item[10] = 'ITEM\s8.{0,20}financial'
item[11] = 'ITEM\s9.{0,20}changes'
item[12] = 'ITEM\s9a.{0,20}controls'
item[13] = 'ITEM\s9b.{0,20}other'
item[14] = 'ITEM\s10.{0,30}directors'
item[15] = 'ITEM\s11.{0,30}'
item[16] = 'ITEM\s12.{0,20}security'
item[17] = 'ITEM\s13.{0,20}certain'
item[18] = 'ITEM\s14.{0,20}principal'
item[19] = 'ITEM\s15.{0,20}exhibits'

In [20]:
searchstr = '('
for i in range(19):
    searchstr += item[i] + '[a-z]{0,20}\n|'
searchstr += item[19]+')'

In [21]:
searchstr

'(ITEM\\s1.{0,20}business[a-z]{0,20}\n|ITEM\\s1a.{0,20}risk[a-z]{0,20}\n|ITEM\\s1b.{0,20}unresolved[a-z]{0,20}\n|ITEM\\s2.{0,20}properties[a-z]{0,20}\n|ITEM\\s3.{0,20}legal[a-z]{0,20}\n|ITEM\\s4.{0,20}mine[a-z]{0,20}\n|ITEM\\s5.{0,20}market[a-z]{0,20}\n|ITEM\\s6.{0,20}selected[a-z]{0,20}\n|ITEM\\s7.{0,20}management[a-z]{0,20}\n|ITEM\\s7a.{0,20}quantitative[a-z]{0,20}\n|ITEM\\s8.{0,20}financial[a-z]{0,20}\n|ITEM\\s9.{0,20}changes[a-z]{0,20}\n|ITEM\\s9a.{0,20}controls[a-z]{0,20}\n|ITEM\\s9b.{0,20}other[a-z]{0,20}\n|ITEM\\s10.{0,30}directors[a-z]{0,20}\n|ITEM\\s11.{0,30}[a-z]{0,20}\n|ITEM\\s12.{0,20}security[a-z]{0,20}\n|ITEM\\s13.{0,20}certain[a-z]{0,20}\n|ITEM\\s14.{0,20}principal[a-z]{0,20}\n|ITEM\\s15.{0,20}exhibits)'

In [22]:
matches = re.finditer(r'ITEM\s8.{0,20}financial', cleaned_soup, re.IGNORECASE)
locations = [x for x in matches]
locations

[<re.Match object; span=(5703, 5734), match='Item 8A. Consolidated Financial'>,
 <re.Match object; span=(5805, 5836), match='Item 8B. Consolidated Financial'>,
 <re.Match object; span=(428244, 428275), match='ITEM 8A. CONSOLIDATED FINANCIAL'>,
 <re.Match object; span=(650747, 650778), match='ITEM 8B. CONSOLIDATED FINANCIAL'>]

In [23]:
matches = re.compile(r'(item\s(7[\.\s]|8[\.\s])|'
                             r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                             r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

In [24]:
[x for x in matches.finditer(cleaned_soup)]

[<re.Match object; span=(5534, 5541), match='Item 7.'>,
 <re.Match object; span=(5555, 5601), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(5712, 5768), match='Consolidated Financial Statements and Supplementa>,
 <re.Match object; span=(5814, 5870), match='Consolidated Financial Statements and Supplementa>,
 <re.Match object; span=(15306, 15313), match='Item 7.'>,
 <re.Match object; span=(15327, 15373), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(22111, 22118), match='Item 7.'>,
 <re.Match object; span=(22132, 22178), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(23860, 23867), match='Item 7.'>,
 <re.Match object; span=(23881, 23927), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(24012, 24058), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(25916, 25923), match='Item 7.'>,
 <re.Match object; span=(25937, 259

In [25]:
cleaned_soup = ' '.join(cleaned_soup.split())

In [26]:
TenKtext = cleaned_soup

In [33]:
# Set up the regex pattern
matches = re.compile(r'(item\s(7[\.\s]|(8A|8)[\.\s])|'
                     r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                     r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array.head()

Unnamed: 0,0,1
0,Item 7.,5534
1,Discussion and Analysis of Financial Condition,5555
2,Item 8A.,5703
3,Consolidated Financial Statements and Suppleme...,5712
4,Consolidated Financial Statements and Suppleme...,5814


In [28]:
# Set columns in the dataframe
matches_array.columns = ['SearchTerm', 'Start']

# Get the number of rows in the dataframe
Rows = matches_array['SearchTerm'].count()

In [34]:
# Create a new column in 'matches_array' called 'Selection' and add adjacent 
# 'SearchTerm' (i and i+1 rows) text concatenated
count = 0 # Counter to help with row location and iteration

while count < (Rows-1): # Can only iterate to the second last row
    matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
    count += 1

In [35]:
matches_array.head()

Unnamed: 0,0,1,Selection
0,Item 7.,5534,item 7.discussion and analysis of financial co...
1,Discussion and Analysis of Financial Condition,5555,discussion and analysis of financial condition...
2,Item 8A.,5703,item 8a.consolidated financial statements and ...
3,Consolidated Financial Statements and Suppleme...,5712,consolidated financial statements and suppleme...
4,Consolidated Financial Statements and Suppleme...,5814,consolidated financial statements and suppleme...
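The `Selection` column joins each match with the one that follows it, so a heading such as `Item 7.` followed directly by its title can be recognized with a single pattern. The same pairing can be sketched in plain Python, assuming a small hand-written list of matches in place of the DataFrame (the list below is illustrative):

```python
import re

# (matched text, start offset) pairs, standing in for rows of matches_array
found = [
    ("Item 7.", 5534),
    ("Discussion and Analysis of Financial Condition", 5555),
    ("Item 8A.", 5703),
    ("Consolidated Financial Statements and Supplementary Data", 5712),
]

terms = [text for text, _ in found]
# Pair each term with its successor; the last term has no successor and is dropped
selection = [(cur + nxt).lower() for cur, nxt in zip(terms, terms[1:])]

# A heading is a row whose joined text looks like "item 7.discussion ..."
item7_pattern = re.compile(r"item\s7\.discussion\s[a-z]*")
item7_starts = [found[i][1] for i, s in enumerate(selection) if item7_pattern.match(s)]
print(selection[0], item7_starts)
```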


In [30]:
# Set up 'Item 7/8 Search Pattern' regex patterns
matches_item7 = re.compile(r'(item\s7\.discussion\s[a-z]*)')
matches_item8 = re.compile(r'(item\s8(|.)\.(|consolidated\sfinancial|financial)\s[a-z]*)')

# Lists to store the locations of Item 7/8 Search Pattern matches
Start_Loc = []
End_Loc = []

# Find and store the locations of Item 7/8 Search Pattern matches
count = 0 # Set up counter

while count < (Rows-1): # Can only iterate to the second last row

    # Match Item 7 Search Pattern
    if re.match(matches_item7, matches_array.at[count,'Selection']):
        # Item 7 marks the start of the section (column 1 is the 'Start' column in 'matches_array')
        Start_Loc.append(matches_array.iloc[count,1])

    # Match Item 8 Search Pattern
    if re.match(matches_item8, matches_array.at[count,'Selection']):
        End_Loc.append(matches_array.iloc[count,1])

    count += 1

In [31]:
# Extract section of text and store in 'TenKItem7'
TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]

# Clean newly extracted text
TenKItem7 = TenKItem7.strip() # Remove leading/trailing whitespace
TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, e.g. from Windows line endings)
TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace non-breaking spaces (HTML &nbsp;) with a regular space
while '  ' in TenKItem7:
    TenKItem7 = TenKItem7.replace('  ', ' ') # Collapse repeated spaces

# Print first 500 characters of newly extracted text
print(TenKItem7[:500])

Item 7. Managements Discussion and Analysis of Financial Condition and Results of Operations) and in our other filings with the Securities and Exchange Commission (the SEC), and other risks and uncertainties listed from time to time in our filings with the SEC. All of the forward-looking statements are qualified in their entirety by reference to the factors discussed in Part I, Item 1A. Risk Factors and elsewhere in this report. There may be other factors of which we are not currently aware tha


In [40]:
TenKItem7[:500]

'Item 7. Management\x92s Discussion and Analysis of Financial Condition and Results of Operations) and in our other filings with the Securities and Exchange Commission (the SEC), and other risks and uncertainties listed from time to time in our filings with the SEC. All of the forward-looking statements are qualified in their entirety by reference to the factors discussed in Part I, Item 1A. Risk Factors and elsewhere in this report. There may be other factors of which we are not currently aware tha'
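The extracted text above begins at a cross-reference to Item 7 inside the forward-looking statements rather than at the section heading itself. One possible heuristic, which is not what this notebook uses, is to pair every Item 8 match with the closest Item 7 match before it and keep the longest resulting span. A minimal sketch on a toy string (the function name `longest_section` and the sample text are hypothetical):

```python
import re

def longest_section(text, start_pat, end_pat):
    """Pair each end match with the nearest preceding start match; return the longest span."""
    starts = [m.start() for m in re.finditer(start_pat, text, re.IGNORECASE)]
    ends = [m.start() for m in re.finditer(end_pat, text, re.IGNORECASE)]
    best = ""
    for end in ends:
        prior = [s for s in starts if s < end]
        if prior and end - prior[-1] > len(best):
            best = text[prior[-1]:end]
    return best

# Toy document: a table of contents, a cross-reference, then the real sections
doc = ("Item 7. Management's Discussion 10 Item 8. Financial Statements 20 "
       "as described in Item 7. Management's Discussion of this report. "
       "Item 7. Management's Discussion and Analysis. Revenue rose because of demand "
       "and lower fuel costs across the network during the year. "
       "Item 8. Financial Statements and Supplementary Data. Balance sheet follows.")

section = longest_section(doc, r"item\s7\.", r"item\s8\.")
print(section[:60])
```

A cross-reference to Item 8 inside the real section would still cut the span short, so this remains a heuristic that needs checking against each filing.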

# Build Out to All of AAL

This is closer to what we ultimately want. Let us now see whether the same steps generalize to every 10-K for AAL:

In [42]:
pulls[0][11:13]

'08'
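The slice above picks out the two-digit year from the accession number, which has the form filer ID (10 digits), year (2 digits), sequence number (6 digits). A minimal sketch of a helper that turns it into a four-digit year, assuming a pivot of 50 to separate 19xx from 20xx (both the helper name `filing_year` and the pivot are assumptions for illustration):

```python
def filing_year(accession, pivot=50):
    """Return the four-digit year encoded in an accession number like '0000004515-08-000014'."""
    yy = int(accession.split("-")[1])
    # Two-digit years at or above the pivot are read as 19xx, the rest as 20xx
    return 1900 + yy if yy >= pivot else 2000 + yy

folders = ["0000004515-08-000014", "0000006201-20-000023", "0001047469-03-013301"]
years = {name: filing_year(name) for name in folders}
print(years)
```

Sorting the folder names by this value gives a chronological order by filing year, which `os.listdir` does not guarantee.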

In [43]:
os.listdir("sec-edgar-filings/AAL/10-K/")

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009',
 '0001193125-15-061145',
 '0001193125-12-063516',
 '0000006201-21-000014',
 '0001193125-16-474605',
 '0000006201-14-000004',
 '0000950134-07-003888',
 '0000006201-19-000009',
 '0000006201-09-000009',
 '0000006201-13-000023',
 '0001193125-17-051216']

In [42]:
# Initialize dict to save results
item7 = {}

# Loop through each file
for filing in os.listdir("sec-edgar-filings/AAL/10-K/")[1:]:
    
    if filing == '.DS_Store':
        continue
        
    print(filing)
        
#     if int(filing[11:13]) >= 9:
#         continue

    # Read the filing details for this accession number
    with open("sec-edgar-filings/AAL/10-K/"+filing+"/filing-details.html", "r") as f:
        raw_10k = f.read()


    soup = BeautifulSoup(raw_10k, 'lxml')

    cleaned_soup = soup.text

    TenKtext = ' '.join(cleaned_soup.split())

    #Set up the regex pattern
    matches = re.compile(r'(item\s(7[\.\s]|(8A|8)[\.\s])|'
                    r'(|management\x92s\s)discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                    r'(consolidated\sfinancial|financial)\sstatements(|\sand\ssupplementary\sdata))', re.IGNORECASE)
    
#     matches = re.compile(r'Item\s7.{1,10}(Management|Discussion)|'
#                          '|Item\s8.{1,20}(consolidated|financial)',
#                      re.IGNORECASE)

    matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
    matches_array.head()

    # Set columns in the dataframe
    matches_array.columns = ['SearchTerm', 'Start']

    # Get the number of rows in the dataframe
    Rows = matches_array['SearchTerm'].count()

    # Create a new column in 'matches_array' called 'Selection' and add adjacent 
    # 'SearchTerm' (i and i+1 rows) text concatenated
    count = 0 # Counter to help with row location and iteration

    while count < (Rows-1): # Can only iterate to the second last row
        matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
        count += 1

    # Set up 'Item 7/8 Search Pattern' regex patterns
    matches_item7 = re.compile(r'(item\s7\.(management\x92s|discussion)\s[a-z]*)')
    matches_item8 = re.compile(r'(item\s8(a|)\.(|consolidated\sfinancial|financial)\s[a-z]*)')

    # Lists to store the locations of Item 7/8 Search Pattern matches
    Start_Loc = []
    End_Loc = []

    # Find and store the locations of Item 7/8 Search Pattern matches
    count = 0 # Set up counter

    while count < (Rows-1): # Can only iterate to the second last row

        # Match Item 7 Search Pattern
        if re.match(matches_item7, matches_array.at[count,'Selection']):
            # Item 7 marks the start of the section (column 1 is the 'Start' column in 'matches_array')
            Start_Loc.append(matches_array.iloc[count,1])

        # Match Item 8 Search Pattern
        if re.match(matches_item8, matches_array.at[count,'Selection']):
            End_Loc.append(matches_array.iloc[count,1])

        count += 1

    # Extract section of text and store in 'TenKItem7'
    if len(Start_Loc) > 1 and len(End_Loc) >1:
        TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]
    
    else:
        TenKItem7 = TenKtext[Start_Loc[0]:End_Loc[0]]
        
#     elif len(Start_Loc) == 1 and len(End_Loc)  == 1:
#         TenKItem7 = TenKtext[Start_Loc[0]:End_Loc[0]]
        
#     elif len(Start_Loc) > 1 and len(End_Loc)  == 1:
#         TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[0]]
    
#     else:
#         TenKItem7 = TenKtext[Start_Loc[0]:End_Loc[1]]
        
    # Clean newly extracted text
    TenKItem7 = TenKItem7.strip() # Remove leading/trailing whitespace
    TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
    TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, e.g. from Windows line endings)
    TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace non-breaking spaces (HTML &nbsp;) with a regular space
    while '  ' in TenKItem7:
        TenKItem7 = TenKItem7.replace('  ', ' ') # Collapse repeated spaces
        
    item7[filing] = TenKItem7
    

0000006201-20-000023
0000950134-06-003715
0001047469-03-013301
0000950134-05-003726
0000950134-04-002668
0000006201-10-000006
0000950123-11-014726
0000006201-18-000009
0001193125-15-061145
0001193125-12-063516
0000006201-21-000014
0001193125-16-474605
0000006201-14-000004
0000950134-07-003888
0000006201-19-000009
0000006201-09-000009
0000006201-13-000023
0001193125-17-051216


In [43]:
filing

'0001193125-17-051216'

In [44]:
[[x,len(item7[x])] for x in item7.keys()]

[['0000006201-20-000023', 261254],
 ['0000950134-06-003715', 61693],
 ['0001047469-03-013301', 76107],
 ['0000950134-05-003726', 80732],
 ['0000950134-04-002668', 80953],
 ['0000006201-10-000006', 80724],
 ['0000950123-11-014726', 92949],
 ['0000006201-18-000009', 283161],
 ['0001193125-15-061145', 216825],
 ['0001193125-12-063516', 110142],
 ['0000006201-21-000014', 344297],
 ['0001193125-16-474605', 198919],
 ['0000006201-14-000004', 211805],
 ['0000950134-07-003888', 55075],
 ['0000006201-19-000009', 290668],
 ['0000006201-09-000009', 183583],
 ['0000006201-13-000023', 85400],
 ['0001193125-17-051216', 267777]]

In [59]:
matches = re.compile(r'(item\s(7[\.\s]|8(.|)[\.\s])|'
                             r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                             r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

In [78]:
End_Loc

[196890]

In [79]:
Start_Loc

[13306, 112044, 125016]

In [13]:
End_Loc

[]

In [96]:
matches_array

Unnamed: 0,SearchTerm,Start,Selection
0,Item 7.,5894,item 7.managements discussion and analysis of...
1,Managements Discussion and Analysis of Financ...,5902,managements discussion and analysis of financ...
2,Item 8A.,6062,item 8a.consolidated financial statements
3,Consolidated Financial Statements,6071,consolidated financial statementsconsolidated ...
4,Consolidated Financial Statements,6172,consolidated financial statementsitem 7.
...,...,...,...
254,consolidated financial statements,574338,consolidated financial statementsconsolidated ...
255,Consolidated Financial Statements,575011,consolidated financial statementsconsolidated ...
256,consolidated financial statements,575062,consolidated financial statementsconsolidated ...
257,Consolidated Financial Statements,575736,consolidated financial statementsconsolidated ...


In [30]:
list(item7.keys())[12]

'0000006201-14-000004'

In [29]:
item7[list(item7.keys())[12]]



In [49]:
item7['0000004515-08-000014'] == item7['0000006201-21-000014']

True

In [70]:
'\n' in TenKtext

False

Keep trying

In [41]:
filing[11:13]

'06'

In [23]:
int('08')

8

In [37]:
cleaned_soup[:1000]

'\n10-K\n1\nd33303e10vk.htm\nFORM 10-K\n\ne10vk\n\nTable of Contents\n\n\xa0\n\xa0\nUnited States Securities and Exchange Commission\n\nWashington, D.C. 20549\n\xa0\n\nForm\xa010-K\n\n\n\n\n\xa0\n\xa0\n\xa0\n\n\nþ\n\xa0\nAnnual Report Pursuant to Section\xa013 or 15(d) of the Securities Exchange Act of 1934\n\n\n\nFor the fiscal year ended December\xa031, 2005\n\n\n\n\n\n\xa0\n\xa0\n\xa0\n\n\no\n\xa0\nTransition Report Pursuant to Section\xa013 or 15(d) of the Securities Exchange Act of 1934\n\n\n\nCommission File Number: 1-8400\nAMR Corporation\n\n(Exact name of registrant as specified in its charter)\n\n\n\n\n\xa0\n\xa0\n\xa0\n\n\n\n\nDelaware\n\n\xa0\n75-1825172\n\n\n(State or other jurisdiction of\nincorporation or organization)\n\n\xa0\n(IRS Employer\nIdentification Number)\n\n\n\n\n4333 Amon Carter Blvd.\nFort Worth, Texas 76155\n(Address of principal executive offices, including zip code)\n(817)\xa0963-1234\n(Registrant\x92s telephone number, including area code)\n\xa0\nSecuriti

In [38]:
TenKtext



In [47]:
matches = re.compile(r'(item\s(7[\.\s]|(8A|8)[\.\s])|'
                     r'(|management\x92s\s)discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                     r'(consolidated\sfinancial|financial)\sstatements(|\sand\ssupplementary\sdata))', re.IGNORECASE)


matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array

Unnamed: 0,0,1
0,ITEM 7.,3249
1,MANAGEMENTS DISCUSSION AND ANALYSIS OF FINANC...,3257
2,ITEM 8.,3413
3,CONSOLIDATED FINANCIAL STATEMENTS,3421
4,consolidated financial statements,13259
...,...,...
66,financial statements,255308
67,Consolidated Financial Statements,255835
68,consolidated financial statements,256243
69,consolidated financial statements,286307


In [46]:
matches = re.compile(r'CONSOLIDATED\sFINANCIAL\sSTATEMENTS', re.IGNORECASE)
matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array

Unnamed: 0,0,1
0,CONSOLIDATED FINANCIAL STATEMENTS,3421
1,consolidated financial statements,13259
2,consolidated financial statements,33239
3,consolidated financial statements,39846
4,consolidated financial statements,41873
5,consolidated financial statements,72383
6,consolidated financial statements,74732
7,consolidated financial statements,93138
8,consolidated financial statements,93306
9,consolidated financial statements,93682


### Divider

In [12]:
cleaned_soup[:1000]

'10-K 1 d829913d10k.htm FORM 10-K Form 10-K Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K þ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended December 31, 2014 ¨ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission file number 1-8400 American Airlines Group Inc. (Exact name of registrant as specified in its charter) Delaware 75-1825172 (State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identification No.) 4333 Amon Carter Blvd., Fort Worth, Texas 76155 (817) 963-1234 (Address of principal executive offices, including zip code) Registrant\x92s telephone number, including area code (Former name, former address and former fiscal year, if changed since last report) Securities registered pursuant to Section 12(b) of the Act: Name of Exchange on Which Registered Common S

In [13]:
TenKtext = cleaned_soup

In [18]:
matches = re.compile(r'Item\s7.{1,20}(Management|Discussion)|Item\s8.{1,20}(Management|Discussion)|Item\s8.{1,20}(consolidated|financial)',
                     re.IGNORECASE)

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array

Unnamed: 0,0,1
0,Item 7. Managements Discussion,5534
1,Item 8A. Consolidated Financial,5703
2,Item 8B. Consolidated Financial,5805
3,Item 7. Managements Discussion,15306
4,Item 7. Managements Discussion,22111
5,Item 7. Managements Discussion,23860
6,Item 7. Managements Discussion,25916
7,Item 7. Managements Discussion,82989
8,Item 7. Managements Discussion,130271
9,Item 7. Managements Discussion,219223


# XBRL Documents

The SEC began requiring XBRL-tagged financial statements in 2009 (phased in by filer size), so filings from later years may be easier to parse.

In [48]:
os.listdir("sec-edgar-filings/AAL/10-K")

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009',
 '0001193125-15-061145',
 '0001193125-12-063516',
 '0000006201-21-000014',
 '0001193125-16-474605',
 '0000006201-14-000004',
 '0000950134-07-003888',
 '0000006201-19-000009',
 '0000006201-09-000009',
 '0000006201-13-000023',
 '0001193125-17-051216']

In [49]:
with open("sec-edgar-filings/AAL/10-K/0000006201-20-000023/filing-details.html", "r") as f:
    raw_10k = f.read()

In [51]:
soup = BeautifulSoup(raw_10k, 'lxml')

In [58]:
tag_list = soup.find_all()
tag_list[:50]

KeyboardInterrupt: 

In [59]:
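# Note: tag_list holds Tag objects, so this compares the string against tags, not against their text
'Item 7' in tag_list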

False

In [60]:
lstr1 = TenKtext

In [68]:
item7={}
item7[1]="item 7\. managements discussion and analysis"
item7[2]="item 7\.managements discussion and analysis"
item7[3]="item7\. managements discussion and analysis"
item7[4]="item7\.managements discussion and analysis"
item7[5]="item 7\. management discussion and analysis"
item7[6]="item 7\.management discussion and analysis"
item7[7]="item7\. management discussion and analysis"
item7[8]="item7\.management discussion and analysis"
item7[9]="item 7 managements discussion and analysis"
item7[10]="item 7managements discussion and analysis"
item7[11]="item7 managements discussion and analysis"
item7[12]="item7managements discussion and analysis"
item7[13]="item 7 management discussion and analysis"
item7[14]="item 7management discussion and analysis"
item7[15]="item7 management discussion and analysis"
item7[16]="item7management discussion and analysis"
item7[17]="item 7: managements discussion and analysis"
item7[18]="item 7:managements discussion and analysis"
item7[19]="item7: managements discussion and analysis"
item7[20]="item7:managements discussion and analysis"
item7[21]="item 7: management discussion and analysis"
item7[22]="item 7:management discussion and analysis"
item7[23]="item7: management discussion and analysis"
item7[24]="item7:management discussion and analysis"


item8={}
item8[1]="item 8\. financial statements"
item8[2]="item 8\.financial statements"
item8[3]="item8\. financial statements"
item8[4]="item8\.financial statements"
item8[5]="item 8 financial statements"
item8[6]="item 8financial statements"
item8[7]="item8 financial statements"
item8[8]="item8financial statements"
item8[9]="item 8a\. financial statements"
item8[10]="item 8a\.financial statements"
item8[11]="item8a\. financial statements"
item8[12]="item8a\.financial statements"
item8[13]="item 8a financial statements"
item8[14]="item 8afinancial statements"
item8[15]="item8a financial statements"
item8[16]="item8afinancial statements"
item8[17]="item 8\. consolidated financial statements"
item8[18]="item 8\.consolidated financial statements"
item8[19]="item8\. consolidated financial statements"
item8[20]="item8\.consolidated financial statements"
item8[21]="item 8 consolidated  financial statements"
item8[22]="item 8consolidated financial statements"
item8[23]="item8 consolidated  financial statements"
item8[24]="item8consolidated financial statements"
item8[25]="item 8a\. consolidated financial statements"
item8[26]="item 8a\.consolidated financial statements"
item8[27]="item8a\. consolidated financial statements"
item8[28]="item8a\.consolidated financial statements"
item8[29]="item 8a consolidated financial statements"
item8[30]="item 8aconsolidated financial statements"
item8[31]="item8a consolidated financial statements"
item8[32]="item8aconsolidated financial statements"
item8[33]="item 8\. audited financial statements"
item8[34]="item 8\.audited financial statements"
item8[35]="item8\. audited financial statements"
item8[36]="item8\.audited financial statements"
item8[37]="item 8 audited financial statements"
item8[38]="item 8audited financial statements"
item8[39]="item8 audited financial statements"
item8[40]="item8audited financial statements"
item8[41]="item 8: financial statements"
item8[42]="item 8:financial statements"
item8[43]="item8: financial statements"
item8[44]="item8:financial statements"
item8[45]="item 8: consolidated financial statements"
item8[46]="item 8:consolidated financial statements"
item8[47]="item8: consolidated financial statements"
item8[48]="item8:consolidated financial statements"

look={" see ", " refer to ", " included in "," contained in "}

a={}
c={}

lstr1=TenKtext.lower()
for j in range(1,25):
    a[j]=[]
    for m in re.finditer(item7[j], lstr1):
        # Inspect the 20 characters before the match to skip cross-references
        # such as "see item 7"; max() keeps the slice start from going negative
        substr1=lstr1[max(m.start()-20, 0):m.start()]
        if not any(s in substr1 for s in look):
            a[j].append(m.start())

list1=[]
for value in a.values():
    for thing1 in value:
        list1.append(thing1)
list1.sort()
list1.append(len(lstr1))
#print list1

for j in range(1,49):
    c[j]=[]
    for m in re.finditer(item8[j], lstr1):
        # Same cross-reference filter as for Item 7
        substr1=lstr1[max(m.start()-20, 0):m.start()]
        if not any(s in substr1 for s in look):
            c[j].append(m.start())
list2=[]
for value in c.values():
    for thing2 in value:
        list2.append(thing2)
list2.sort()

locations={}
# #if list2==[]:
#     continue
#     #print "NO MD&A"
# #else:
#     #if list1==[]:
#     #    print "NO MD&A"
# #else:
for k0 in range(len(list1)):
    locations[k0]=[]
    locations[k0].append(list1[k0])
for k0 in range(len(locations)):
    for item in range(len(list2)):
        if locations[k0][0]<=list2[item]:
            locations[k0].append(list2[item])
            break
    if len(locations[k0])==1:
        del locations[k0]

In [69]:
locations

{}
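One likely reason for the empty result is the apostrophe: the filing text shown earlier contains `Management\x92s`, while the Item 7 patterns spell `managements` or `management`, so none of them can match. A minimal sketch of a single tolerant pattern that could stand in for the 24 literal variants, assuming the apostrophe forms listed in the character class are the ones that occur (this is an illustrative alternative, not the notebook's method, and it does not replicate the `look` cross-reference filter):

```python
import re

# Optional '.' or ':' after the number, optional spaces, and an optional
# possessive with a straight ('), curly (\u2019) or Windows-1252 (\x92) apostrophe
ITEM7 = re.compile(
    r"item\s*7\s*[.:]?\s*management(?:['\u2019\x92]?s)?\s+discussion\s+and\s+analysis",
    re.IGNORECASE,
)

samples = [
    "Item 7. Management\x92s Discussion and Analysis",
    "ITEM7:Managements discussion and analysis",
    "item 7 management discussion and analysis",
    "Item 70. Management's Discussion and Analysis",
]
hits = [bool(ITEM7.search(s)) for s in samples]
print(hits)
```

The last sample is rejected because the `0` after the `7` cannot be consumed by any part of the pattern before `management`.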