# Retrieving Data From Edgar

In this notebook, I explore pulling the necessary data from EDGAR, the SEC's public database of the quarterly and annual financial reports that companies are required by law to file. First, I look at one company, American Airlines (AAL), and then generalize the code to cover all the companies we need.

We will use the sec_edgar_downloader package for this, as it is a simple and powerful tool for downloading the necessary filings. More information about the package can be found here:

https://pypi.org/project/sec-edgar-downloader/

In [2]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import os
import re
import html2text
import pickle
from sec_edgar_downloader import Downloader

## American Airlines Sample

First, we will start with the 10-K filing. My opinion on which parts of a 10-K are valuable is in the 10K analysis file in the GitHub repository.

In [3]:
company = 'American Airlines Inc'
ticker = 'AAL'

# Initialize a downloader instance. If no argument is passed
# to the constructor, the package will download filings to
# the current working directory.
dl = Downloader()

In [5]:
# Get all 10-K filings for American Airlines (ticker: AAL) from 2000 onwards
#dl.get("10-K", ticker, after="2000-01-01")

The above request downloads the specified reports into the working directory: 

In [6]:
pulls = os.listdir("sec-edgar-filings/AAL/10-K")
pulls[:10]

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '.DS_Store',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009']
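The listing above includes a `.DS_Store` entry, which is a macOS Finder artifact rather than a filing. Below is a minimal sketch of one way to keep only the accession-number folders, assuming every folder name follows the `NNNNNNNNNN-YY-NNNNNN` shape seen in the output. The helper name `accession_dirs` is hypothetical and not part of `sec_edgar_downloader`.

```python
import re

# Accession numbers in the listing look like 0000950134-06-003715:
# ten digits, a two-digit year, and a six-digit sequence number.
ACCESSION_RE = re.compile(r"\d{10}-\d{2}-\d{6}")

def accession_dirs(names):
    """Keep only entries shaped like accession numbers (hypothetical helper)."""
    return sorted(name for name in names if ACCESSION_RE.fullmatch(name))

sample = ['0000004515-08-000014', '.DS_Store', '0000950134-06-003715']
print(accession_dirs(sample))
# ['0000004515-08-000014', '0000950134-06-003715']
```

In the notebook this could be applied as `accession_dirs(os.listdir("sec-edgar-filings/AAL/10-K"))`.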

In [7]:
# Inspect the filing submitted in 2015 (accession number 0001193125-15-061145)
os.listdir("sec-edgar-filings/AAL/10-K/0001193125-15-061145")

['full-submission.txt', 'filing-details.html']

We can see here that there are two files. Let us explore these files:

In [None]:
# Read the HTML version of the filing; the context manager closes the file for us
with open("sec-edgar-filings/AAL/10-K/0001193125-15-061145/filing-details.html", "r") as f:
    raw_10k = f.read()

In [None]:
print(raw_10k[:500])

In [None]:
soup = BeautifulSoup(raw_10k, 'lxml')

In [None]:
cleaned_soup = soup.text
cleaned_soup[:500]

In [None]:
cleaned_soup[:500].lstrip()

In [None]:
cleaned_soup = '\n'.join(' '.join(line.split()) for line in cleaned_soup.split('\n'))

In [None]:
cleaned_soup[:500]

In [None]:
# Match horizontal whitespace only (any whitespace character except \r and \n)
exp = re.compile(r"[^\S\r\n]")
res = exp.sub('', cleaned_soup)

In [None]:
res[:5000]

In [21]:
cleaned_soup = ' '.join(cleaned_soup.split())
cleaned_soup[:500]

'10-K 1 d829913d10k.htm FORM 10-K Form 10-K Table of Contents UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K þ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended December 31, 2014 ¨ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission file number 1-8400 American Airlines Group Inc. (Exact name of registrant as specified in '
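The cells above try two whitespace normalizations: one that collapses whitespace within each line and one that collapses everything, newlines included. The sketch below contrasts them on a made-up sample string; the function names are illustrative only. Note that `str.split()` with no argument also treats the non-breaking space (`\xa0`) as whitespace, so both versions turn it into a plain space.

```python
def normalize_lines(text):
    """Collapse runs of whitespace within each line, keeping line breaks."""
    return '\n'.join(' '.join(line.split()) for line in text.split('\n'))

def normalize_all(text):
    """Collapse every run of whitespace, including newlines, to one space."""
    return ' '.join(text.split())

# Made-up sample with a non-breaking space, a tab, and blank lines
sample = "Item\xa07.   Management\n\n  Discussion\tand Analysis "
print(repr(normalize_lines(sample)))
# 'Item 7. Management\n\nDiscussion and Analysis'
print(repr(normalize_all(sample)))
# 'Item 7. Management Discussion and Analysis'
```

Keeping line breaks matters for any pattern that anchors on `\n`; once the text is fully collapsed, such patterns can no longer match.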

In [22]:
searchstr = '('
for i in range(1,16):
    searchstr+=str(i)+'.|'

searchstr+='16.)'
searchstr

'(1.|2.|3.|4.|5.|6.|7.|8.|9.|10.|11.|12.|13.|14.|15.|16.)'

In [48]:
matches = re.finditer(r"Item\s" + searchstr, cleaned_soup, re.IGNORECASE)
locations = [x for x in matches]
locations

[]
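In the pattern built above, each `.` is a regex wildcard rather than a literal period, so `1.` also matches the first two characters of `10` or `1A`. Here is a minimal sketch of a stricter variant, assuming item headings look like `Item 1.` or `Item 1A.`; the name `item_heading` and the sample string are hypothetical.

```python
import re

# 1[0-6]|[1-9] covers item numbers 1 through 16; [AB]? allows sub-items
# such as 1A or 9B; \. requires a literal period after the number.
item_heading = re.compile(r'Item\s+(1[0-6]|[1-9])[AB]?\.', re.IGNORECASE)

sample = "Item 1. Business Item 1A. Risk Factors Item 16. Form 10-K Summary"
print([m.group() for m in item_heading.finditer(sample)])
# ['Item 1.', 'Item 1A.', 'Item 16.']
```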

In [89]:
# One pattern per 10-K item heading: the item number, a short run of
# arbitrary characters, then the first keyword of the item title.
# Raw strings keep \s from being read as a string escape.
item = [
    r'ITEM\s1.{0,20}business',
    r'ITEM\s1a.{0,20}risk',
    r'ITEM\s1b.{0,20}unresolved',
    r'ITEM\s2.{0,20}properties',
    r'ITEM\s3.{0,20}legal',
    r'ITEM\s4.{0,20}mine',
    r'ITEM\s5.{0,20}market',
    r'ITEM\s6.{0,20}selected',
    r'ITEM\s7.{0,20}management',
    r'ITEM\s7a.{0,20}quantitative',
    r'ITEM\s8.{0,20}financial',
    r'ITEM\s9.{0,20}changes',
    r'ITEM\s9a.{0,20}controls',
    r'ITEM\s9b.{0,20}other',
    r'ITEM\s10.{0,30}directors',
    r'ITEM\s11.{0,30}',
    r'ITEM\s12.{0,20}security',
    r'ITEM\s13.{0,20}certain',
    r'ITEM\s14.{0,20}principal',
    r'ITEM\s15.{0,20}exhibits',
]

In [90]:
searchstr = '('
for i in range(19):
    searchstr += item[i] + '[a-z]{0,20}\n|'
searchstr += item[19]+')'

In [95]:
searchstr

'(ITEM\\s1.{0,20}business[a-z]{0,20}\n|ITEM\\s1a.{0,20}risk[a-z]{0,20}\n|ITEM\\s1b.{0,20}unresolved[a-z]{0,20}\n|ITEM\\s2.{0,20}properties[a-z]{0,20}\n|ITEM\\s3.{0,20}legal[a-z]{0,20}\n|ITEM\\s4.{0,20}mine[a-z]{0,20}\n|ITEM\\s5.{0,20}market[a-z]{0,20}\n|ITEM\\s6.{0,20}selected[a-z]{0,20}\n|ITEM\\s7.{0,20}management[a-z]{0,20}\n|ITEM\\s7a.{0,20}quantitative[a-z]{0,20}\n|ITEM\\s8.{0,20}financial[a-z]{0,20}\n|ITEM\\s9.{0,20}changes[a-z]{0,20}\n|ITEM\\s9a.{0,20}controls[a-z]{0,20}\n|ITEM\\s9b.{0,20}other[a-z]{0,20}\n|ITEM\\s10.{0,30}directors[a-z]{0,20}\n|ITEM\\s11.{0,30}[a-z]{0,20}\n|ITEM\\s12.{0,20}security[a-z]{0,20}\n|ITEM\\s13.{0,20}certain[a-z]{0,20}\n|ITEM\\s14.{0,20}principal[a-z]{0,20}\n|ITEM\\s15.{0,20}exhibits)'

In [111]:
matches = re.finditer(r'ITEM\s8.{0,20}financial', cleaned_soup, re.IGNORECASE)
locations = [x for x in matches]
locations

[<re.Match object; span=(5703, 5734), match='Item 8A. Consolidated Financial'>,
 <re.Match object; span=(5805, 5836), match='Item 8B. Consolidated Financial'>,
 <re.Match object; span=(428244, 428275), match='ITEM 8A. CONSOLIDATED FINANCIAL'>,
 <re.Match object; span=(650747, 650778), match='ITEM 8B. CONSOLIDATED FINANCIAL'>]

In [96]:
matches = re.compile(r'(item\s(7[\.\s]|8[\.\s])|'
                     r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                     r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

In [99]:
[x for x in matches.finditer(cleaned_soup)]

[<re.Match object; span=(6148, 6155), match='Item\xa07.'>,
 <re.Match object; span=(6179, 6225), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(6368, 6424), match='Consolidated Financial Statements and Supplementa>,
 <re.Match object; span=(6487, 6543), match='Consolidated Financial Statements and Supplementa>,
 <re.Match object; span=(16174, 16181), match='Item\xa07.'>,
 <re.Match object; span=(16195, 16241), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(23008, 23015), match='Item\xa07.'>,
 <re.Match object; span=(23029, 23075), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(24761, 24768), match='Item\xa07.'>,
 <re.Match object; span=(24782, 24828), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(24913, 24959), match='Discussion and Analysis of Financial Condition'>,
 <re.Match object; span=(26827, 26834), match='Item\xa07.'>,
 <re.Match object; s

In [101]:
cleaned_soup = ' '.join(cleaned_soup.split())

In [1]:
TenKtext = cleaned_soup


In [156]:
# Set up the regex pattern
matches = re.compile(r'(item\s(7[\.\s]|(8A|8)[\.\s])|'
                     r'discussion\sand\sanalysis\sof\s(consolidated\sfinancial|financial)\scondition|'
                     r'(consolidated\sfinancial|financial)\sstatements\sand\ssupplementary\sdata)', re.IGNORECASE)

matches_array = pd.DataFrame([(match.group(), match.start()) for match in matches.finditer(TenKtext)])

# Set columns in the dataframe
matches_array.columns = ['SearchTerm', 'Start']

# Get the number of rows in the dataframe
Rows = matches_array['SearchTerm'].count()

# Create a new column in 'matches_array' called 'Selection' and add adjacent 'SearchTerm' (i and i+1 rows) text concatenated
count = 0 # Counter to help with row location and iteration
while count < (Rows-1): # Can only iterate to the second last row
    matches_array.at[count,'Selection'] = (matches_array.iloc[count,0] + matches_array.iloc[count+1,0]).lower() # Convert to lower case
    count += 1

# Set up 'Item 7/8 Search Pattern' regex patterns
matches_item7 = re.compile(r'(item\s7\.discussion\s[a-z]*)')
matches_item8 = re.compile(r'(item\s8(|.)\.(|consolidated\sfinancial|financial)\s[a-z]*)')

# Lists to store the locations of Item 7/8 Search Pattern matches
Start_Loc = []
End_Loc = []

# Find and store the locations of Item 7/8 Search Pattern matches
count = 0 # Set up counter

while count < (Rows-1): # Can only iterate to the second last row

    # Match Item 7 Search Pattern
    if re.match(matches_item7, matches_array.at[count,'Selection']):
        # Column 1 is the 'Start' column in 'matches_array'
        Start_Loc.append(matches_array.iloc[count,1]) # Item 7 marks the starting location

    # Match Item 8 Search Pattern
    if re.match(matches_item8, matches_array.at[count,'Selection']):
        End_Loc.append(matches_array.iloc[count,1])

    count += 1

# Extract section of text and store in 'TenKItem7'
TenKItem7 = TenKtext[Start_Loc[1]:End_Loc[1]]

# Clean newly extracted text
TenKItem7 = TenKItem7.strip() # Remove starting/ending white spaces
TenKItem7 = TenKItem7.replace('\n', ' ') # Replace \n (new line) with space
TenKItem7 = TenKItem7.replace('\r', '') # Remove \r (carriage returns, e.g. from Windows line endings)
TenKItem7 = TenKItem7.replace('\xa0', ' ') # Replace non-breaking spaces (U+00A0, the character behind HTML's &nbsp;) with regular spaces
TenKItem7 = TenKItem7.replace('&nbsp;', ' ') # Replace any literal "&nbsp;" entities left in the text with spaces
while '  ' in TenKItem7:
    TenKItem7 = TenKItem7.replace('  ', ' ') # Remove extra spaces

# Print first 500 characters of newly extracted text
print(TenKItem7[:500])

Item 7. Managements Discussion and Analysis of Financial Condition and Results of Operations) and in our other filings with the Securities and Exchange Commission (the SEC), and other risks and uncertainties listed from time to time in our filings with the SEC. All of the forward-looking statements are qualified in their entirety by reference to the factors discussed in Part I, Item 1A. Risk Factors and elsewhere in this report. There may be other factors of which we are not currently aware tha


In [159]:
TenKItem7[:500]

'Item 7. Management\x92s Discussion and Analysis of Financial Condition and Results of Operations) and in our other filings with the Securities and Exchange Commission (the SEC), and other risks and uncertainties listed from time to time in our filings with the SEC. All of the forward-looking statements are qualified in their entirety by reference to the factors discussed in Part I, Item 1A. Risk Factors and elsewhere in this report. There may be other factors of which we are not currently aware tha'
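The `\x92` in the output above is most likely the Windows-1252 byte for a right single quotation mark that was read under a different encoding. That would also explain why the printed text shows "Managements" with no apostrophe. Below is a minimal sketch of one way to repair such characters, assuming the stray characters really do come from Windows-1252; the function name is hypothetical.

```python
def fix_cp1252_controls(text):
    """Map C1 control characters (U+0080 to U+009F) to their Windows-1252 meaning."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode('cp1252')
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in Windows-1252; leave them alone.
            pass
    return text.translate(table)

print(fix_cp1252_controls('Management\x92s Discussion'))
# Management’s Discussion
```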

The extracted text is closer to what we ultimately want. Let us now see if we can easily generalize this to every 10-K for AAL:

In [163]:
pulls[0][11:13]

'08'
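The slice above pulls the two-digit year out of the accession number. Here is a minimal sketch of expanding it to a four-digit year, where the century cutoff `pivot` is an assumption and the function name `filing_year` is hypothetical. This is the year the filing was submitted, which can differ from the fiscal year it covers: the `-15-` filing inspected earlier reports on fiscal year 2014.

```python
def filing_year(accession_number, pivot=90):
    """Expand the two-digit year in an accession number to four digits.

    `pivot` is an assumption: values at or above it are read as 19xx,
    everything else as 20xx.
    """
    yy = int(accession_number.split('-')[1])
    return (1900 if yy >= pivot else 2000) + yy

print(filing_year('0000004515-08-000014'))  # 2008
print(filing_year('0001193125-15-061145'))  # 2015
```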