# Retrieving Data From EDGAR

In this notebook, I explore pulling the necessary data from EDGAR, the SEC's public database of company filings, which includes the quarterly and annual financial reports required by law. First, I look at one company, American Airlines (AAL), and then extend the code to cover all the companies we need.

We will use the sec_edgar_downloader package for this, as it is a simple and powerful tool for downloading the necessary filings. More information about the package can be found here:

https://pypi.org/project/sec-edgar-downloader/

In [1]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import os
import re
import html2text
import pickle
from sec_edgar_downloader import Downloader

## American Airlines Sample

First, we will start with the 10-K filing. My opinion on the valuable parts of a 10-K is in the 10K analysis file in the GitHub repository.

In [2]:
company = 'American Airlines Inc'
ticker = 'AAL'

# Initialize a downloader instance. If no argument is passed
# to the constructor, the package will download filings to
# the current working directory.
dl = Downloader()

In [3]:
# Get all 10-K filings for American Airlines (ticker: AAL) from 2000 onwards
#dl.get("10-K", ticker, after="2000-01-01")

The request above downloads the specified reports into a sec-edgar-filings folder inside the working directory:

In [4]:
pulls = os.listdir("sec-edgar-filings/AAL/10-K")
pulls[:10]

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009']
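The listing above includes a `.DS_Store` entry, which macOS creates and which is not a filing. A minimal sketch of a helper that returns only the filing directories follows; `list_filing_dirs` is a hypothetical name introduced here and is not part of sec_edgar_downloader.

```python
import os

def list_filing_dirs(base_dir):
    """Return the sorted names of subdirectories of base_dir, skipping hidden entries."""
    return sorted(
        entry.name
        for entry in os.scandir(base_dir)
        # Keep real directories only; '.DS_Store' and similar entries are dropped
        if entry.is_dir() and not entry.name.startswith(".")
    )
```

Calling `list_filing_dirs("sec-edgar-filings/AAL/10-K")` would then replace the `os.listdir` call and make the later `.DS_Store` check unnecessary.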

In [5]:
# Inspect the filing submitted in 2015 (fiscal year 2014)
os.listdir("sec-edgar-filings/AAL/10-K/0001193125-15-061145")

['full-submission.txt', 'filing-details.html']

The filing directory contains two files. Let us explore them:

In [6]:
# Read the HTML version of the selected filing
with open("sec-edgar-filings/AAL/10-K/0001193125-15-061145/filing-details.html", "r") as f:
    raw_10k = f.read()

In [7]:
print(raw_10k[:500])

<html><body><document>
<type>10-K
<sequence>1
<filename>d829913d10k.htm
<description>FORM 10-K
<text>
<title>Form 10-K</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>
<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Rom


In [8]:
soup = BeautifulSoup(raw_10k, 'lxml')

In [9]:
cleaned_soup = soup.text
cleaned_soup[:500]

'\n10-K\n1\nd829913d10k.htm\nFORM 10-K\n\nForm 10-K\nTable of Contents\n\xa0\n\xa0 UNITED STATES SECURITIES AND\nEXCHANGE COMMISSION  Washington, D.C. 20549 \n\xa0 \xa0\nFORM 10-K  \xa0\n\xa0 \xa0\n\n\nþ\n ANNUAL REPORT PURSUANT TO SECTION\xa013 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Fiscal Year Ended December\xa031, 2014 \n\xa0\n\n\n¨\n TRANSITION REPORT PURSUANT TO SECTION\xa013 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \nFor the Transition Period\nFrom\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 to\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\n Commission file number 1-8400 '

In [10]:
cleaned_soup = '\n'.join(' '.join(line.split()) for line in cleaned_soup.split('\n'))
cleaned_soup[:500]

'\n10-K\n1\nd829913d10k.htm\nFORM 10-K\n\nForm 10-K\nTable of Contents\n\nUNITED STATES SECURITIES AND\nEXCHANGE COMMISSION Washington, D.C. 20549\n\nFORM 10-K\n\n\n\nþ\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the Fiscal Year Ended December 31, 2014\n\n\n\n¨\nTRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the Transition Period\nFrom to\nCommission file number 1-8400\nAmerican\nAirlines Group Inc. (Exact name of registrant as spe'
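The one-liner above works because `str.split()` with no arguments splits on any Unicode whitespace, including the non-breaking space `\xa0`. A small sketch of the same step as a reusable function; the name `collapse_whitespace_per_line` is made up for this illustration.

```python
def collapse_whitespace_per_line(text):
    """Collapse runs of whitespace inside each line while keeping line breaks."""
    # str.split() with no arguments treats '\xa0' as whitespace, so
    # non-breaking spaces are removed along with ordinary spaces and tabs.
    return "\n".join(" ".join(line.split()) for line in text.split("\n"))

sample = "FORM 10-K  \xa0\n\xa0 \xa0\nSECTION\xa013 OR 15(d)"
print(repr(collapse_whitespace_per_line(sample)))
# 'FORM 10-K\n\nSECTION 13 OR 15(d)'
```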

In [11]:
res = list(soup.find_all('b'))
res = [str(x) for x in res]
res_str = '\n'.join(res)

In [12]:
# Every fragment is a raw string so that \s reaches the regex engine unchanged
matches = re.compile(r'(item\s(7[\.\s]|8(A|)[\.\s])|'
                     r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                     r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

In [13]:
[x for x in matches.finditer(res_str)]

[<re.Match object; span=(13992, 13999), match='ITEM\xa07.'>,
 <re.Match object; span=(14013, 14059), match='DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION'>,
 <re.Match object; span=(21020, 21028), match='ITEM\xa08A.'>,
 <re.Match object; span=(21029, 21085), match='CONSOLIDATED FINANCIAL STATEMENTS AND SUPPLEMENTA>,
 <re.Match object; span=(42175, 42231), match='CONSOLIDATED FINANCIAL STATEMENTS AND SUPPLEMENTA>]

In [14]:
cleaned_soup = ' '.join(cleaned_soup.split())
cleaned_soup[:500]

'10-K 1 d829913d10k.htm FORM 10-K Form 10-K Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K þ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended December 31, 2014 ¨ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission file number 1-8400 American Airlines Group Inc. (Exact name of registrant as specified in '

In [15]:
searchstr = '('
for i in range(1,16):
    searchstr+=str(i)+'.|'

searchstr+='16.)'
searchstr

'(1.|2.|3.|4.|5.|6.|7.|8.|9.|10.|11.|12.|13.|14.|15.|16.)'

In [16]:
# One pattern per 10-K item; raw strings keep \s intact for the regex engine
item = [None]*20
item[0]  = r'ITEM\s1.{0,20}business'
item[1]  = r'ITEM\s1a.{0,20}risk'
item[2]  = r'ITEM\s1b.{0,20}unresolved'
item[3]  = r'ITEM\s2.{0,20}properties'
item[4]  = r'ITEM\s3.{0,20}legal'
item[5]  = r'ITEM\s4.{0,20}mine'
item[6]  = r'ITEM\s5.{0,20}market'
item[7]  = r'ITEM\s6.{0,20}selected'
item[8]  = r'ITEM\s7.{0,20}management'
item[9]  = r'ITEM\s7a.{0,20}quantitative'
item[10] = r'ITEM\s8.{0,20}financial'
item[11] = r'ITEM\s9.{0,20}changes'
item[12] = r'ITEM\s9a.{0,20}controls'
item[13] = r'ITEM\s9b.{0,20}other'
item[14] = r'ITEM\s10.{0,30}directors'
item[15] = r'ITEM\s11.{0,30}'
item[16] = r'ITEM\s12.{0,20}security'
item[17] = r'ITEM\s13.{0,20}certain'
item[18] = r'ITEM\s14.{0,20}principal'
item[19] = r'ITEM\s15.{0,20}exhibits'

In [17]:
searchstr = '('
for i in range(19):
    searchstr += item[i] + '[a-z]{0,20}\n|'
searchstr += item[19]+')'

In [18]:
searchstr

'(ITEM\\s1.{0,20}business[a-z]{0,20}\n|ITEM\\s1a.{0,20}risk[a-z]{0,20}\n|ITEM\\s1b.{0,20}unresolved[a-z]{0,20}\n|ITEM\\s2.{0,20}properties[a-z]{0,20}\n|ITEM\\s3.{0,20}legal[a-z]{0,20}\n|ITEM\\s4.{0,20}mine[a-z]{0,20}\n|ITEM\\s5.{0,20}market[a-z]{0,20}\n|ITEM\\s6.{0,20}selected[a-z]{0,20}\n|ITEM\\s7.{0,20}management[a-z]{0,20}\n|ITEM\\s7a.{0,20}quantitative[a-z]{0,20}\n|ITEM\\s8.{0,20}financial[a-z]{0,20}\n|ITEM\\s9.{0,20}changes[a-z]{0,20}\n|ITEM\\s9a.{0,20}controls[a-z]{0,20}\n|ITEM\\s9b.{0,20}other[a-z]{0,20}\n|ITEM\\s10.{0,30}directors[a-z]{0,20}\n|ITEM\\s11.{0,30}[a-z]{0,20}\n|ITEM\\s12.{0,20}security[a-z]{0,20}\n|ITEM\\s13.{0,20}certain[a-z]{0,20}\n|ITEM\\s14.{0,20}principal[a-z]{0,20}\n|ITEM\\s15.{0,20}exhibits)'
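The alternation above is assembled by string concatenation in a loop. As a hedged sketch, the same kind of pattern can be built with `'|'.join` and compiled once. The three patterns below are a subset of the item list, and the trailing `[a-z]{0,20}\n` suffix is left out for brevity, so this is an illustration rather than a drop-in replacement.

```python
import re

# Subset of the item patterns, written as raw strings
item_patterns = [
    r"ITEM\s7.{0,20}management",
    r"ITEM\s7a.{0,20}quantitative",
    r"ITEM\s8.{0,20}financial",
]

# Join the alternatives with '|' and wrap them in a single group
combined = re.compile("(" + "|".join(item_patterns) + ")", re.IGNORECASE)

text = ("Item 7. Management's Discussion ... "
        "Item 7A. Quantitative and Qualitative ... "
        "Item 8. Financial Statements")
hits = [m.group(0) for m in combined.finditer(text)]
print(hits)
# ['Item 7. Management', 'Item 7A. Quantitative', 'Item 8. Financial']
```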

In [19]:
matches = re.finditer(r'ITEM\s8.{0,20}financial', cleaned_soup, re.IGNORECASE)
locations = [x for x in matches]
locations

[<re.Match object; span=(5703, 5734), match='Item 8A. Consolidated Financial'>,
 <re.Match object; span=(5805, 5836), match='Item 8B. Consolidated Financial'>,
 <re.Match object; span=(428244, 428275), match='ITEM 8A. CONSOLIDATED FINANCIAL'>,
 <re.Match object; span=(650747, 650778), match='ITEM 8B. CONSOLIDATED FINANCIAL'>]

In [22]:
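# re.DOTALL replaces the inline (?s) flags: Python 3.11+ rejects a global
# inline flag that is not at the very start of the pattern
matches = re.compile(r'.>IT.{0,20}EM.{1,20}7[^A].{1,400}MANAGEMENT|'
                     r'.>IT.{0,20}EM.{1,20}8([^B]|A).{1,400}(CONSOLIDATED|FINANCIAL)',
                     re.IGNORECASE | re.DOTALL)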

In [24]:
[x for x in matches.finditer(raw_10k)]

[<re.Match object; span=(37051, 37285), match='">Item&#160;7.&#160;&#160;&#160;&#160;&#160;&#160>,
 <re.Match object; span=(38687, 38910), match='">Item&#160;8A.&#160;&#160;&#160;&#160;</font></p>,
 <re.Match object; span=(723647, 723672), match='a>ITEM&#160;7. MANAGEMENT'>,
 <re.Match object; span=(1614422, 1614465), match='a>ITEM&#160;8A.&#160;CONSOLIDATED FINANCIAL'>]

In [25]:
cleaned_soup = ' '.join(cleaned_soup.split())

In [27]:
TenKtext = raw_10k

In [28]:
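# Set up the regex pattern (re.DOTALL replaces the inline (?s) flags, which
# Python 3.11+ only accepts at the very start of a pattern)
matches = re.compile(r'.>IT.{0,20}EM.{1,20}7[^A].{1,400}MANAGEMENT|'
                     r'.>IT.{0,20}EM.{1,20}8([^B]|A).{1,400}(CONSOLIDATED|FINANCIAL)',
                     re.IGNORECASE | re.DOTALL)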

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array.head()

Unnamed: 0,0,1
0,""">Item&#160;7.&#160;&#160;&#160;&#160;&#160;&#...",37051
1,""">Item&#160;8A.&#160;&#160;&#160;&#160;</font>...",38687
2,a>ITEM&#160;7. MANAGEMENT,723647
3,a>ITEM&#160;8A.&#160;CONSOLIDATED FINANCIAL,1614422


In [29]:
# Set columns in the dataframe
matches_array.columns = ['SearchTerm', 'Start']

# Get the number of rows in the dataframe
Rows = matches_array['SearchTerm'].count()

In [30]:
# Create a new column in 'matches_array' called 'Selection' and add adjacent 
# 'SearchTerm' (i and i+1 rows) text concatenated
count = 0 # Counter to help with row location and iteration

while count < (Rows-1): # Can only iterate to the second last row
    matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
    count += 1
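The loop above pairs each search term with the one that follows it. The same pairing can be sketched in plain Python with `zip`, which avoids the manual counter; `pair_adjacent` is a hypothetical helper name and the sample terms are shortened for readability.

```python
def pair_adjacent(terms):
    """Concatenate each term with its successor, lower-cased.

    The last term has no successor, so the result is one element shorter.
    """
    return [(a + b).lower() for a, b in zip(terms, terms[1:])]

terms = ["Item 7.", "Discussion and Analysis", "Item 8A."]
print(pair_adjacent(terms))
# ['item 7.discussion and analysis', 'discussion and analysisitem 8a.']
```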

In [31]:
matches_array.head()

Unnamed: 0,SearchTerm,Start,Selection
0,""">Item&#160;7.&#160;&#160;&#160;&#160;&#160;&#...",37051,""">item&#160;7.&#160;&#160;&#160;&#160;&#160;&#..."
1,""">Item&#160;8A.&#160;&#160;&#160;&#160;</font>...",38687,""">item&#160;8a.&#160;&#160;&#160;&#160;</font>..."
2,a>ITEM&#160;7. MANAGEMENT,723647,a>item&#160;7. managementa>item&#160;8a.&#160;...
3,a>ITEM&#160;8A.&#160;CONSOLIDATED FINANCIAL,1614422,


In [45]:
# Set up 'Item 7/8 Search Pattern' regex patterns
matches_item7 = re.compile(r'(?s).>IT.{0,20}EM.{1,20}7[^A].{1,400}MANAGEMENT', re.IGNORECASE)
matches_item8 = re.compile(r'(?s).>IT.{0,20}EM.{1,20}8([^B]|A).{1,400}(CONSOLIDATED|FINANCIAL)',re.IGNORECASE)

# Lists to store the locations of Item 7/8 Search Pattern matches
Start_Loc = []
End_Loc = []

# Find and store the locations of Item 7/8 Search Pattern matches
count = 0 # Set up counter

while count < (Rows-1): # Can only iterate to the second last row

    # Match Item 7 Search Pattern
    if re.match(matches_item7, matches_array.at[count,'Selection']):
        # Item 7 marks the start of the section (column 1 is the 'Start' column)
        Start_Loc.append(matches_array.iloc[count,1])

    # Match Item 8 Search Pattern
    if re.match(matches_item8, matches_array.at[count,'Selection']):
        End_Loc.append(matches_array.iloc[count+1,1])

    count += 1

In [46]:
print(Start_Loc[1], End_Loc[1])

723647 1614422
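Index `1` is picked by hand here because the first match of each pattern sits in the table of contents. One possible heuristic, offered as an assumption rather than as the method used in this notebook, is to keep the start/end pair that spans the most text, since table-of-contents entries are only a few hundred characters apart. `longest_section` is a hypothetical helper.

```python
def longest_section(starts, ends):
    """Return the (start, end) pair covering the most characters, or None.

    Each start is paired with the first end marker located after it.
    """
    best = None
    for start in starts:
        later_ends = [end for end in ends if end > start]
        if not later_ends:
            continue  # no end marker follows this start
        candidate = (start, min(later_ends))
        if best is None or candidate[1] - candidate[0] > best[1] - best[0]:
            best = candidate
    return best

# Match positions taken from the Item 7 and Item 8A matches printed earlier
print(longest_section([37051, 723647], [38687, 1614422]))
# (723647, 1614422)
```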


In [48]:
# Extract section of text and store in 'TenKItem7'
TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]

# Clean newly extracted text
TenKItem7 = TenKItem7.strip() # Remove starting/ending white spaces
TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, e.g. from Windows line endings)
TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace the non-breaking space character with a regular space
TenKItem7 = TenKItem7.replace('&nbsp;', ' ') # Replace "&nbsp;" (the HTML entity for a non-breaking space) with a space
while '  ' in TenKItem7:
    TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces

# Print first 500 characters of newly extracted text
print(TenKItem7[:500])

a>ITEM&#160;7. MANAGEMENT&#146;S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS </b></font></p> <p style="margin-top:6px;margin-bottom:0px"><font size="2" style="font-family:Times New Roman"><b><u>American Airlines Group </u></b></font></p> <p align="justify" style="margin-top:6px;margin-bottom:0px; text-indent:2%"><font size="2" style="font-family:Times New Roman">As previously discussed, the Merger was consummated on December&#160;9, 2013. Accordingly, our consolidate
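The extracted text still contains HTML tags and character references such as `&#160;` and `&#146;`. A minimal sketch of one way to reduce such a fragment to plain text using only the standard library follows; BeautifulSoup's `get_text()` would be an alternative. The class and function names are invented for this example.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, ignoring all tags."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#160; for us
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_fragment_to_text(fragment):
    """Strip tags, decode character references, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())

fragment = ('a>ITEM&#160;7. MANAGEMENT&#146;S DISCUSSION </b></font></p> '
            '<p style="margin-top:6px"><b><u>American Airlines Group </u></b></p>')
print(html_fragment_to_text(fragment))
```

The leading `a>` survives because the slice begins in the middle of a tag; starting the slice a few characters later would avoid it.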


In [49]:
TenKtext[:500]

'<html><body><document>\n<type>10-K\n<sequence>1\n<filename>d829913d10k.htm\n<description>FORM 10-K\n<text>\n<title>Form 10-K</title>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>\n<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Rom'

In [51]:
TenKItem7[723647:1614422]

AttributeError: 'str' object has no attribute 'text'

## Build Out to All of AAL

This is closer to what we ultimately want. Let us now see whether it generalizes easily to every 10-K for AAL:

In [42]:
pulls[0][11:13]

'08'
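The slice `[11:13]` works because an EDGAR accession number has the form filer ID (10 digits), two-digit year, and sequence number (6 digits), separated by hyphens. A hedged sketch that parses the year with a regular expression and rejects entries such as `.DS_Store`; the `pivot` parameter for expanding two-digit years is an assumption made for this example.

```python
import re

ACCESSION_RE = re.compile(r"^(\d{10})-(\d{2})-(\d{6})$")

def filing_year(accession_number, pivot=70):
    """Return the four-digit year encoded in an accession number, or None."""
    match = ACCESSION_RE.match(accession_number)
    if match is None:
        return None  # not an accession number, e.g. '.DS_Store'
    two_digit = int(match.group(2))
    # Assumption: two-digit years below the pivot belong to the 2000s
    return (2000 if two_digit < pivot else 1900) + two_digit

print(filing_year("0000004515-08-000014"))  # 2008
print(filing_year(".DS_Store"))             # None
```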

In [43]:
os.listdir("sec-edgar-filings/AAL/10-K/")

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009',
 '0001193125-15-061145',
 '0001193125-12-063516',
 '0000006201-21-000014',
 '0001193125-16-474605',
 '0000006201-14-000004',
 '0000950134-07-003888',
 '0000006201-19-000009',
 '0000006201-09-000009',
 '0000006201-13-000023',
 '0001193125-17-051216']

In [141]:
# Initialize dict to save results
item7 = {}

# Loop through each file
for filing in os.listdir("sec-edgar-filings/AAL/10-K/")[1:2]:
    
    if filing == '.DS_Store':
        continue
        
    print(filing)
        
#     if int(filing[11:13]) >= 9:
#         continue

    # Read the HTML version of this filing
    with open("sec-edgar-filings/AAL/10-K/"+filing+"/filing-details.html", "r") as f:
        raw_10k = f.read()


    soup = BeautifulSoup(raw_10k, 'lxml')

    cleaned_soup = soup.text

    TenKtext = ' '.join(cleaned_soup.split())

    #Set up the regex pattern
#     matches = re.compile(r'(item\s(7[\.\s]|(8A|8)[\.\s])|'
#                     '(|management\x92s\s)discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
#                     '(consolidated\sfinancial|financial)\sstatements(|\sand\ssupplementary\sdata))', re.IGNORECASE)
    
    matches = re.compile(r'(item\s(7[\.\s]|8(A|)[\.\s])|'
                         r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                         r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

    matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
    matches_array.head()

    # Set columns in the dataframe
    matches_array.columns = ['SearchTerm', 'Start']

    # Get the number of rows in the dataframe
    Rows = matches_array['SearchTerm'].count()

    # Create a new column in 'matches_array' called 'Selection' and add adjacent 
    # 'SearchTerm' (i and i+1 rows) text concatenated
    count = 0 # Counter to help with row location and iteration

    while count < (Rows-1): # Can only iterate to the second last row
        matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
        count += 1

    # Set up 'Item 7/8 Search Pattern' regex patterns
    matches_item7 = re.compile(r'(item\s7\.(management\x92s|discussion)\s[a-z]*)')
    matches_item8 = re.compile(r'(item\s8(a|)\.(|consolidated\sfinancial|financial)\s[a-z]*)')
    
    # Set up 'Item 7/8 Search Pattern' regex patterns
    #matches_item7 = re.compile(r'(item\s7\.)')
    #matches_item8 = re.compile(r'(item\s8(a|)\.)')

    # Lists to store the locations of Item 7/8 Search Pattern matches
    Start_Loc = []
    End_Loc = []

    # Find and store the locations of Item 7/8 Search Pattern matches
    for i in range(Rows-1): # Can only iterate to the second last row

        # Item 7 marks the start of the section (column 1 is the 'Start' column)
        if re.search(matches_item7, matches_array.at[i,'Selection']):
            Start_Loc.append(matches_array.iloc[i,1])

        # Item 8 marks the end of the section
        if re.search(matches_item8, matches_array.at[i,'Selection']):
            End_Loc.append(matches_array.iloc[i,1])

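    # Extract section of text and store in 'TenKItem7'
    if len(Start_Loc) > 1 and len(End_Loc) > 1:
        TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]
    elif Start_Loc and End_Loc:
        TenKItem7 = TenKtext[Start_Loc[0]:End_Loc[0]]
    else:
        # No usable Item 7/8 pair was found; skip this filing instead of raising IndexError
        continue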
        
#     elif len(Start_Loc) == 1 and len(End_Loc)  == 1:
#         TenKItem7 = TenKtext[Start_Loc[0]:End_Loc[0]]
        
#     elif len(Start_Loc) > 1 and len(End_Loc)  == 1:
#         TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[0]]
    
#     else:
#         TenKItem7 = TenKtext[Start_Loc[0]:End_Loc[1]]
        
    # Clean newly extracted text
    TenKItem7 = TenKItem7.strip() # Remove starting/ending white spaces
    TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
    TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, e.g. from Windows line endings)
    TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace the non-breaking space character with a regular space
    TenKItem7 = TenKItem7.replace('&nbsp;', ' ') # Replace "&nbsp;" (the HTML entity for a non-breaking space) with a space
    while '  ' in TenKItem7:
        TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces
        
    item7[filing] = TenKItem7
    

0000006201-20-000023


In [147]:
item7[filing]



In [143]:
[[x,len(item7[x])] for x in item7.keys()]

[['0000006201-20-000023', 261254]]

In [144]:
matches = re.compile(r'(item\s(7[\.\s]|8(.|)[\.\s])|'
                     r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition\s|'
                     r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)\s', re.IGNORECASE)

In [145]:
[x for x in matches.finditer(TenKtext)]

[<re.Match object; span=(142000, 142008), match='Item 7. '>,
 <re.Match object; span=(142164, 142173), match='Item 8A. '>,
 <re.Match object; span=(142173, 142230), match='Consolidated Financial Statements and Supplementa>,
 <re.Match object; span=(142263, 142272), match='Item 8B. '>,
 <re.Match object; span=(142272, 142329), match='Consolidated Financial Statements and Supplementa>,
 <re.Match object; span=(145175, 145183), match='Item 7. '>,
 <re.Match object; span=(148842, 148850), match='Item 7. '>,
 <re.Match object; span=(160531, 160539), match='Item 7. '>,
 <re.Match object; span=(332192, 332200), match='ITEM 7. '>,
 <re.Match object; span=(342192, 342200), match='Item 7. '>,
 <re.Match object; span=(354907, 354915), match='Item 7. '>,
 <re.Match object; span=(367586, 367594), match='Item 7. '>,
 <re.Match object; span=(406429, 406438), match='ITEM 8A. '>,
 <re.Match object; span=(406438, 406495), match='CONSOLIDATED FINANCIAL STATEMENTS AND SUPPLEMENTA>,
 <re.Match object; span

In [146]:
matches_array

Unnamed: 0,SearchTerm,Start,Selection
0,Item 7.,142000,item 7.discussion and analysis of financial co...
1,Discussion and Analysis of Financial Condition,142021,discussion and analysis of financial condition...
2,Item 8A.,142164,item 8a.consolidated financial statements and ...
3,Consolidated Financial Statements and Suppleme...,142173,consolidated financial statements and suppleme...
4,Consolidated Financial Statements and Suppleme...,142272,consolidated financial statements and suppleme...
5,Item 7.,145175,item 7.discussion and analysis of financial co...
6,Discussion and Analysis of Financial Condition,145196,discussion and analysis of financial condition...
7,Item 7.,148842,item 7.discussion and analysis of financial co...
8,Discussion and Analysis of Financial Condition,148863,discussion and analysis of financial condition...
9,Item 8A,149198,item 8a item 7.


In [119]:
re.search(matches_item7, matches_array.at[0,'Selection'])

<re.Match object; span=(0, 7), match='item\n7.'>

In [120]:
matches_item8 = re.compile(r'(item\s8(a|)\.)')
re.match(matches_item8, matches_array.at[0,'Selection'])

<re.Match object; span=(8, 14), match='item\n8'>
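The difference between `re.match` and `re.search` matters for these checks: `re.match` only succeeds if the pattern matches at the very start of the string, while `re.search` scans the whole string. A small illustration with a made-up `selection` string:

```python
import re

pattern = re.compile(r"item\s8(a|)\.")
selection = "item 7. item 8a."  # hypothetical concatenated selection

# match() is anchored at position 0, where the text is "item 7."
print(pattern.match(selection))          # None

# search() finds the first occurrence anywhere in the string
print(pattern.search(selection).span())  # (8, 16)
```

This is why an Item 8 pattern applied with `re.match` to a selection that begins with "item 7." never fires.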

In [139]:
End_Loc

[149908, 150081, 375989, 414897]

In [140]:
Start_Loc

[]

In [13]:
End_Loc

[]

In [78]:
matches_array

Unnamed: 0,SearchTerm,Start,Selection
0,consolidated financial statements,15737,consolidated financial statementsconsolidated\...
1,consolidated\nfinancial statements,24591,consolidated\nfinancial statementsconsolidated...
2,consolidated financial statements,36523,consolidated financial statementsitem 7
3,Item 7,44244,item 7 discussion and analysis of financial co...
4,Discussion and Analysis of Financial Condition,44266,discussion and analysis of financial condition...
...,...,...,...
65,financial statements,299918,financial statementsfinancial\nstatements
66,financial\nstatements,300340,financial\nstatementsconsolidated financial st...
67,consolidated financial statements,303647,consolidated financial statementsfinancial sta...
68,financial statements,304540,financial statementsconsolidated financial sta...


In [30]:
list(item7.keys())[12]

'0000006201-14-000004'

In [29]:
item7[list(item7.keys())[12]]



In [49]:
item7['0000004515-08-000014'] == item7['0000006201-21-000014']

True

In [70]:
'\n' in TenKtext

False

The start and end locations are still unreliable across filings, so let us keep trying:

In [41]:
filing[11:13]

'06'

In [23]:
int('08')

8

In [37]:
cleaned_soup[:1000]

'\n10-K\n1\nd33303e10vk.htm\nFORM 10-K\n\ne10vk\n\nTable of Contents\n\n\xa0\n\xa0\nUnited States Securities and Exchange Commission\n\nWashington, D.C. 20549\n\xa0\n\nForm\xa010-K\n\n\n\n\n\xa0\n\xa0\n\xa0\n\n\nþ\n\xa0\nAnnual Report Pursuant to Section\xa013 or 15(d) of the Securities Exchange Act of 1934\n\n\n\nFor the fiscal year ended December\xa031, 2005\n\n\n\n\n\n\xa0\n\xa0\n\xa0\n\n\no\n\xa0\nTransition Report Pursuant to Section\xa013 or 15(d) of the Securities Exchange Act of 1934\n\n\n\nCommission File Number: 1-8400\nAMR Corporation\n\n(Exact name of registrant as specified in its charter)\n\n\n\n\n\xa0\n\xa0\n\xa0\n\n\n\n\nDelaware\n\n\xa0\n75-1825172\n\n\n(State or other jurisdiction of\nincorporation or organization)\n\n\xa0\n(IRS Employer\nIdentification Number)\n\n\n\n\n4333 Amon Carter Blvd.\nFort Worth, Texas 76155\n(Address of principal executive offices, including zip code)\n(817)\xa0963-1234\n(Registrant\x92s telephone number, including area code)\n\xa0\nSecuriti

In [38]:
TenKtext



In [70]:
matches = re.compile(r'(item\s(7[\.\s]|(8A|8)[\.\s])|'
                     r'(|management\x92s\s)discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                     r'(consolidated\sfinancial|financial)\sstatements(|\sand\ssupplementary\sdata))', re.IGNORECASE)


matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array

Unnamed: 0,0,1
0,Item 7.,5894
1,Managements Discussion and Analysis of Financ...,5902
2,Item 8A.,6062
3,Consolidated Financial Statements,6071
4,Consolidated Financial Statements,6172
...,...,...
254,consolidated financial statements,574338
255,Consolidated Financial Statements,575011
256,consolidated financial statements,575062
257,Consolidated Financial Statements,575736


In [46]:
matches = re.compile(r'CONSOLIDATED\sFINANCIAL\sSTATEMENTS', re.IGNORECASE)
matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array

Unnamed: 0,0,1
0,CONSOLIDATED FINANCIAL STATEMENTS,3421
1,consolidated financial statements,13259
2,consolidated financial statements,33239
3,consolidated financial statements,39846
4,consolidated financial statements,41873
5,consolidated financial statements,72383
6,consolidated financial statements,74732
7,consolidated financial statements,93138
8,consolidated financial statements,93306
9,consolidated financial statements,93682


### Divider

In [12]:
cleaned_soup[:1000]

'10-K 1 d829913d10k.htm FORM 10-K Form 10-K Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K þ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended December 31, 2014 ¨ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission file number 1-8400 American Airlines Group Inc. (Exact name of registrant as specified in its charter) Delaware 75-1825172 (State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identification No.) 4333 Amon Carter Blvd., Fort Worth, Texas 76155 (817) 963-1234 (Address of principal executive offices, including zip code) Registrant\x92s telephone number, including area code (Former name, former address and former fiscal year, if changed since last report) Securities registered pursuant to Section 12(b) of the Act: Name of Exchange on Which Registered Common S

In [13]:
TenKtext = cleaned_soup

In [18]:
matches = re.compile(r'Item\s7.{1,20}(Management|Discussion)|Item\s8.{1,20}(Management|Discussion)|Item\s8.{1,20}(consolidated|financial)',
                     re.IGNORECASE)

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])
matches_array

Unnamed: 0,0,1
0,Item 7. Managements Discussion,5534
1,Item 8A. Consolidated Financial,5703
2,Item 8B. Consolidated Financial,5805
3,Item 7. Managements Discussion,15306
4,Item 7. Managements Discussion,22111
5,Item 7. Managements Discussion,23860
6,Item 7. Managements Discussion,25916
7,Item 7. Managements Discussion,82989
8,Item 7. Managements Discussion,130271
9,Item 7. Managements Discussion,219223


## XBRL Documents

The SEC began phasing in mandatory XBRL financial statements in 2009, so filings from later years may be easier to parse in general.

In [48]:
os.listdir("sec-edgar-filings/AAL/10-K")

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009',
 '0001193125-15-061145',
 '0001193125-12-063516',
 '0000006201-21-000014',
 '0001193125-16-474605',
 '0000006201-14-000004',
 '0000950134-07-003888',
 '0000006201-19-000009',
 '0000006201-09-000009',
 '0000006201-13-000023',
 '0001193125-17-051216']

In [49]:
with open("sec-edgar-filings/AAL/10-K/0000006201-20-000023/filing-details.html", "r") as f:
    raw_10k = f.read()

In [51]:
soup = BeautifulSoup(raw_10k, 'lxml')

In [58]:
tag_list = soup.find_all()
tag_list[:50]

KeyboardInterrupt: 

In [59]:
'Item 7' in tag_list

False
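The membership test returns `False` because `in` on a list looks for an element equal to `'Item 7'`, not for a substring inside any element's text. A sketch of a substring-style check over a list of strings (for example, the text of each tag); `contains_item` is a hypothetical helper, and the sample strings are invented.

```python
import re

def contains_item(strings, item_number):
    """Return True if any string mentions 'Item <item_number>' (case-insensitive)."""
    # \s also matches the non-breaking space '\xa0' that often follows "Item";
    # \b keeps "Item 7" from matching "Item 7A" or "Item 70".
    pattern = re.compile(r"item\s+" + re.escape(str(item_number)) + r"\b", re.IGNORECASE)
    return any(pattern.search(s) for s in strings)

texts = ["Table of Contents", "ITEM\xa07. MANAGEMENT'S DISCUSSION"]
print("Item 7" in texts)        # False: no element equals "Item 7" exactly
print(contains_item(texts, 7))  # True
```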

In [60]:
lstr1 = TenKtext