# Retrieving Data From EDGAR

In this notebook, I explore pulling the necessary data from EDGAR, the SEC's public database of the quarterly and annual financial reports that companies are required by law to file. First, I look at one company, American Airlines (AAL), and then generalize the code to cover all the necessary companies.

We will use the sec_edgar_downloader package for this, as it is a simple and powerful tool for downloading the necessary filings. More information about the package can be found here:

https://pypi.org/project/sec-edgar-downloader/

In [480]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import os
import re
from sec_edgar_downloader import Downloader

## American Airlines Sample

First, we start with the 10-K filing. My notes on its most valuable parts are in the 10-K analysis file in the GitHub repository.

In [469]:
company = 'American Airlines Inc'
ticker = 'AAL'

# Initialize a downloader instance. If no argument is passed
# to the constructor, the package will download filings to
# the current working directory.
dl = Downloader()

In [336]:
# Get all 10-K filings for American Airlines (ticker: AAL) from 2000 onwards
dl.get("10-K", ticker, after="2000-01-01")

19
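The same request can be repeated for several companies. The sketch below is a minimal illustration, not the notebook's final code: `download_filings` and `FakeDownloader` are hypothetical names, the ticker list is only an example, and the return value of `dl.get` is assumed to be the number of filings saved (the cell above printed 19). The stand-in class exists only so the loop can run without network access.

```python
from typing import Dict, Iterable


def download_filings(dl, tickers: Iterable[str], form_type: str = "10-K",
                     after: str = "2000-01-01") -> Dict[str, int]:
    """Call dl.get once per ticker and record what each call returns."""
    counts = {}
    for ticker in tickers:
        # Same call shape as the single-company request above.
        counts[ticker] = dl.get(form_type, ticker, after=after)
    return counts


class FakeDownloader:
    """Offline stand-in for Downloader (hypothetical, for demonstration only)."""

    def __init__(self):
        self.calls = []

    def get(self, form_type, ticker, after=None):
        self.calls.append((form_type, ticker, after))
        return 0  # assumption: a real Downloader reports how many filings it saved


fake = FakeDownloader()
result = download_filings(fake, ["AAL", "DAL", "UAL"])
print(result)
```

With a real `Downloader` instance in place of `fake`, each ticker would get its own folder under `sec-edgar-filings`.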

The `dl.get` request downloads the specified reports into a `sec-edgar-filings` folder in the working directory:

In [481]:
pulls = os.listdir("sec-edgar-filings/AAL/10-K")
pulls[:10]

['0000004515-08-000014',
 '0000006201-20-000023',
 '0000950134-06-003715',
 '0001047469-03-013301',
 '0000950134-05-003726',
 '0000950134-04-002668',
 '0000006201-10-000006',
 '0000950123-11-014726',
 '0000006201-18-000009',
 '0001193125-15-061145']
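The folder names are EDGAR accession numbers, whose middle field is a two-digit year. A small helper can use that field to sort the folders or pick the newest one. This is a sketch under assumptions: `accession_year` is a hypothetical helper, the century pivot of 50 is an arbitrary choice, and the year is when the accession number was assigned, which is usually the year after the fiscal year a 10-K covers.

```python
import re

ACCESSION_RE = re.compile(r"^(\d{10})-(\d{2})-(\d{6})$")


def accession_year(accession: str, pivot: int = 50) -> int:
    """Return the four-digit year encoded in the middle field of an accession number."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an accession number: {accession!r}")
    yy = int(match.group(2))
    # Assumption: two-digit years above the pivot belong to the 1900s.
    return (1900 if yy > pivot else 2000) + yy


folders = ["0000004515-08-000014", "0000006201-20-000023", "0001193125-15-061145"]
by_year = sorted(folders, key=accession_year)
latest = by_year[-1]
print(latest, accession_year(latest))
```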

In [482]:
# Inspect one filing folder (accession number 0001193125-15-061145, filed in 2015)
os.listdir("sec-edgar-filings/AAL/10-K/0001193125-15-061145")

['full-submission.txt', 'filing-details.html']

We can see here that there are two files. Let us explore these files:

In [493]:
# Read the main filing document from the 2015 submission
with open("sec-edgar-filings/AAL/10-K/0001193125-15-061145/filing-details.html", "r") as f:
    raw_10k = f.read()

In [494]:
print(raw_10k[:500])

<html><body><document>
<type>10-K
<sequence>1
<filename>d829913d10k.htm
<description>FORM 10-K
<text>
<title>Form 10-K</title>
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>
<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Rom


In [495]:
soup = BeautifulSoup(raw_10k, 'html.parser')

Since the SEC requires filings to follow a standard format with similarly labelled items, we should be able to parse the other documents easily once one is handled. Let us begin to explore the HTML file:

In [496]:
document = soup.html.find_all()

el = ['html',]  # we already include the html tag
for n in document:
    if n.name not in el:
        el.append(n.name)

print(el)

['html', 'body', 'document', 'type', 'sequence', 'filename', 'description', 'text', 'title', 'h5', 'a', 'p', 'font', 'b', 'center', 'table', 'tr', 'td', 'i', 'hr', 'sup', 'br', 'img', 'u']


In [590]:
goods = soup.find_all('b')
[x.text for x in goods[106:110]]

['ITEM\xa01A.\xa0\xa0RISK FACTORS ',
 'Risk Factors Relating to the Company and Industry-Related Risks ',
 'We could experience significant operating losses in the future. ',
 'Downturns in economic conditions adversely affect our business. ']
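The heading text still contains non-breaking spaces (`\xa0`) and trailing blanks, which make string comparisons awkward. A minimal cleanup sketch follows; `clean_heading` is a hypothetical helper name.

```python
def clean_heading(text: str) -> str:
    """Collapse runs of whitespace, including non-breaking spaces, into single spaces."""
    # str.split() with no arguments treats '\xa0' as whitespace in Python 3.
    return " ".join(text.split())


headings = [
    "ITEM\xa01A.\xa0\xa0RISK FACTORS ",
    "Risk Factors Relating to the Company and Industry-Related Risks ",
]
cleaned = [clean_heading(h) for h in headings]
print(cleaned)
```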

This is the information I need for this project. If the other reports follow the same format, the same approach should work across the whole industry. Let us continue:

## Separate & Parse Document

In [755]:
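# Build an alternation of item numbers: '(1.|2.|...|16.)'.
# Note: the '.' is unescaped, so it matches any character (e.g. '1.' also matches '1A').
searchstr = '('
for i in range(1,16):
    searchstr+=str(i)+'.|'

searchstr+='16.)'
searchstr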

'(1.|2.|3.|4.|5.|6.|7.|8.|9.|10.|11.|12.|13.|14.|15.|16.)'
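Because the dots in this pattern are not escaped, each one matches any character. That is why a pattern like `1.` can pick up headings such as `1A`. The comparison below is a sketch of one stricter alternative, not the pattern the notebook uses: it assumes item numbers run from 1 to 16 with an optional `A` or `B` suffix and a literal trailing dot.

```python
import re

# Unescaped: '.' matches any character, so '1.' also matches '1A' or '16'.
loose = re.compile(r"(1.|2.)")
# Escaped alternative: an item number, an optional letter suffix, then a literal dot.
strict = re.compile(r"(?:1[0-6]|[1-9])[AB]?\.")

samples = ["1.", "1A.", "7A.", "16.", "1X"]
loose_hits = [s for s in samples if loose.match(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]
print(loose_hits)
print(strict_hits)
```

Switching to the stricter form would lengthen some matches by one character, since `1A.` then includes its dot, but the start offsets used for slicing would stay the same.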

In [733]:
#test = r'(a>Item(\s|&#160;|&nbsp;)'+searchstr+'\.{0,1})|(>ITEM\s'+searchstr+')'

In [712]:
# Write the regex
#regex = re.compile(r'(a>Item(\s|&#160;|&nbsp;)'+searchstr+'\.{0,1})|(>ITEM\s'+searchstr+')')
#regex = re.compile(r'Overview')

# Use finditer to match the regex
#matches = regex.finditer(raw_10k)

In [764]:
matches = re.finditer(r"/a>Item(\s|&#160;|&nbsp;)" + searchstr, raw_10k, re.IGNORECASE)
locations = [x for x in matches]
locations

[<re.Match object; span=(60523, 60538), match='/a>ITEM&#160;1.'>,
 <re.Match object; span=(261964, 261979), match='/a>ITEM&#160;1A'>,
 <re.Match object; span=(422769, 422784), match='/a>ITEM&#160;1B'>,
 <re.Match object; span=(423734, 423749), match='/a>ITEM&#160;2.'>,
 <re.Match object; span=(552226, 552241), match='/a>ITEM&#160;3.'>,
 <re.Match object; span=(563911, 563926), match='/a>ITEM&#160;4.'>,
 <re.Match object; span=(564782, 564797), match='/a>ITEM&#160;5.'>,
 <re.Match object; span=(598619, 598634), match='/a>ITEM&#160;6.'>,
 <re.Match object; span=(723646, 723661), match='/a>ITEM&#160;7.'>,
 <re.Match object; span=(1568708, 1568723), match='/a>ITEM&#160;7A'>,
 <re.Match object; span=(1614421, 1614436), match='/a>ITEM&#160;8A'>,
 <re.Match object; span=(3215477, 3215492), match='/a>ITEM&#160;8B'>,
 <re.Match object; span=(4331581, 4331596), match='/a>ITEM&#160;9.'>,
 <re.Match object; span=(4331976, 4331991), match='/a>ITEM&#160;9A'>,
 <re.Match object; span=(4353593, 435360
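The match objects carry both the heading label and its offset, so the two can be paired directly. The sketch below runs the same idea on a small hand-written HTML string rather than on the real filing. `ITEM_HEADER` and `sample` are illustrative names, and the capture group for the label is an addition that the pattern above does not have.

```python
import re

ITEM_HEADER = re.compile(
    r"/a>Item(?:\s|&#160;|&nbsp;)(\d{1,2}[AB]?)\.", re.IGNORECASE
)

sample = (
    '<b><a name="s1"></a>ITEM&#160;1.&#160;&#160;BUSINESS</b><p>Overview text.</p>'
    '<b><a name="s2"></a>ITEM&#160;1A.&#160;&#160;RISK FACTORS</b><p>Risk text.</p>'
    '<b><a name="s3"></a>ITEM&#160;1B.&#160;&#160;UNRESOLVED STAFF COMMENTS</b>'
)

# Pair each heading label with the character offset where the match starts.
headers = [(m.group(1).upper(), m.start()) for m in ITEM_HEADER.finditer(sample)]
print(headers)
```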

In [807]:
temp = raw_10k[261964:422769]

In [811]:
tempsoup = BeautifulSoup(temp, 'html.parser')
goods = tempsoup.find_all('p')
[x.text for x in goods]

['Below are certain risk factors that may affect our business, results of operations and financial condition, or the trading\nprice of our common stock or other securities. We caution the reader that these risk factors may not be exhaustive. We operate in a continually changing business environment, and new risks and uncertainties emerge from time to time. Management\ncannot predict such new risks and uncertainties, nor can it assess the extent to which any of the risk factors below or any such new risks and uncertainties, or any combination thereof, may impact our business. ',
 'Risk Factors Relating to the Company and Industry-Related Risks ',
 'We could experience significant operating losses in the future. ',
 'For a number of reasons, including those addressed in these risk factors, we might fail to maintain profitability and might experience significant losses. In particular, the condition of the economy, the\nlevel and volatility of fuel prices, the state of travel demand and in
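For comparison, the paragraph extraction can also be sketched with the standard library's `html.parser` instead of BeautifulSoup. This is a swapped-in technique for illustration only: `ParagraphText` is a hypothetical class, and unlike BeautifulSoup it does not repair unclosed `<p>` tags.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, with entities decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._depth = 0      # greater than 0 while inside a <p>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


parser = ParagraphText()
parser.feed('<p><font size="2"><b>Risk Factors </b></font></p>'
            '<td>ignored</td><p>We could experience&#160;losses. </p>')
parser.close()
print(parser.paragraphs)
```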

Now, we will do the same for every section:

In [789]:
starts = [0] + [m.start() for m in locations] + [len(raw_10k)]
starts

[0,
 60523,
 261964,
 422769,
 423734,
 552226,
 563911,
 564782,
 598619,
 723646,
 1568708,
 1614421,
 3215477,
 4331581,
 4331976,
 4353593,
 4355559,
 4356681,
 4357701,
 4358396,
 4359556,
 4575093]

In [812]:
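# sections[0] holds everything before the first heading;
# sections[i] holds the paragraphs under the i-th matched heading.
sections = {}

for i in range(len(starts)-1):
    temp = raw_10k[starts[i]:starts[i+1]]
    tempsoup = BeautifulSoup(temp, 'html.parser')
    goods = tempsoup.find_all('p')
    sections[i] = [x.text for x in goods]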
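Integer keys make it hard to tell which section is which when another company's filing has a different number of headings. One possible variation keys the slices by item label instead. This is a sketch on a toy string, with assumptions stated: `split_by_item` and `ITEM_HEADER` are hypothetical names, the text before the first heading is dropped, and a repeated label overwrites the earlier slice.

```python
import re
from typing import Dict

ITEM_HEADER = re.compile(
    r"/a>Item(?:\s|&#160;|&nbsp;)(\d{1,2}[AB]?)\.", re.IGNORECASE
)


def split_by_item(raw: str) -> Dict[str, str]:
    """Slice raw filing HTML into {'item1a': '<html slice>', ...}."""
    matches = list(ITEM_HEADER.finditer(raw))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(raw)
        key = "item" + current.group(1).lower()
        # A later heading with the same label overwrites an earlier one.
        sections[key] = raw[current.start():end]
    return sections


sample = (
    'cover page<b><a name="s1"></a>ITEM&#160;1.&#160;&#160;BUSINESS</b><p>Overview.</p>'
    '<b><a name="s2"></a>ITEM&#160;1A.&#160;&#160;RISK FACTORS</b><p>Risks.</p>'
)
sections_by_item = split_by_item(sample)
print(list(sections_by_item))
```

Each slice is still raw HTML, so it would go through the same paragraph extraction as before.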

## Validation

In [828]:
''.join(sections[2])[:5000]

'Below are certain risk factors that may affect our business, results of operations and financial condition, or the trading\nprice of our common stock or other securities. We caution the reader that these risk factors may not be exhaustive. We operate in a continually changing business environment, and new risks and uncertainties emerge from time to time. Management\ncannot predict such new risks and uncertainties, nor can it assess the extent to which any of the risk factors below or any such new risks and uncertainties, or any combination thereof, may impact our business. Risk Factors Relating to the Company and Industry-Related Risks We could experience significant operating losses in the future. For a number of reasons, including those addressed in these risk factors, we might fail to maintain profitability and might experience significant losses. In particular, the condition of the economy, the\nlevel and volatility of fuel prices, the state of travel demand and intense competitio

Yes! This is the Item 1A risk-factor text, so the split works.
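Reading one section by eye works for a single filing, but the same check can be automated once more companies are added. A minimal sketch, assuming each section has already been joined into one string and keyed by item label; `missing_keywords`, the key names, and the 500-character window are illustrative choices.

```python
from typing import Dict, List


def missing_keywords(sections: Dict[str, str], expected: Dict[str, str],
                     window: int = 500) -> List[str]:
    """Return the keys whose section text lacks the expected keyword near its start."""
    problems = []
    for key, keyword in expected.items():
        head = sections.get(key, "")[:window].lower()
        if keyword.lower() not in head:
            problems.append(key)
    return problems


sections_text = {
    "item1a": "Below are certain risk factors that may affect our business...",
    "item1b": "None.",
}
expected = {"item1a": "risk factors", "item7": "management"}
print(missing_keywords(sections_text, expected))
```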

# Scratchpad

Ignore everything below this point.

In [715]:
type(raw_10k)

str

In [710]:
# Write a for loop to print the matches
for match in matches:
    print(match)

In [711]:
raw_10k[30355:30555]

'p style="margin-left:4.50em; text-indent:-4.50em"><font size="2" style="font-family:Times New Roman">Item&#160;1.&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;</font></p></td>\n<td valign="bottom"><f'

In [703]:
raw_10k[60000:60895]

'idth="100%"/>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p align="center" style="margin-top:0px;margin-bottom:0px"><font size="2" style="font-family:Times New Roman"><b><a name="tx829913_1"></a>PART I </b></font></p>\n<p style="font-size:12px;margin-top:0px;margin-bottom:0px">&#160;</p>\n<table border="0" cellpadding="0" cellspacing="0" style="BORDER-COLLAPSE:COLLAPSE" width="100%">\n<tr>\n<td align="left" valign="top" width="9%"><font size="2" style="font-family:Times New Roman"><b><a name="tx829913_2"></a>ITEM&#160;1.&#160;&#160;BUSINESS</b></font></td>\n<td align="left" valign="top"><font size="2" style="font-family:Times New Roman"><b></b></font></td></tr></table> <p style="margin-top:6px;margin-bottom:0px"><font size="2" style="font-family:Times New Roman"><b>Overview </b></font></p>\n<p align="justify" style="margin-top:6px;margin-bottom:0px; text-indent:2%"><font s'

In [604]:
document = {}

# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start, doc_end in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[doc_start:doc_end]

In [605]:
# Display an excerpt of the document
document['10-K'][0:500]

'\n<type>10-K\n<sequence>1\n<filename>d829913d10k.htm\n<description>FORM 10-K\n<text>\n<title>Form 10-K</title>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>\n<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Roman"><b>UNITED STATES S'

In [628]:
doc_types

['10-K']

In [630]:
soup = BeautifulSoup(document['10-K'], 'lxml')

In [634]:
soup.findAll('b')[106:110]

[<b><a name="tx829913_3"></a>ITEM 1A.  RISK FACTORS </b>,
 <b>Risk Factors Relating to the Company and Industry-Related Risks </b>,
 <b><i>We could experience significant operating losses in the future. </i></b>,
 <b><i>Downturns in economic conditions adversely affect our business. </i></b>]

In [638]:
# Write the regex
regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})|(ITEM\s(1A|1B|7A|7|8))')

# Use finditer to match the regex
matches = regex.finditer(document['10-K'])

# Write a for loop to print the matches
for match in matches:
    print(match)

<re.Match object; span=(31194, 31208), match='>Item&#160;1A.'>
<re.Match object; span=(31937, 31951), match='>Item&#160;1B.'>
<re.Match object; span=(37030, 37043), match='>Item&#160;7.'>
<re.Match object; span=(37875, 37889), match='>Item&#160;7A.'>
<re.Match object; span=(38666, 38678), match='>Item&#160;8'>
<re.Match object; span=(39487, 39499), match='>Item&#160;8'>


In [640]:
els = soup.select('a')
asdf = []
for el in els:
    asdf.append(el.get_text())

In [645]:
document['10-K'][31194:32194]

'>Item&#160;1A.&#160;&#160;&#160;&#160;</font></p></td>\n<td valign="bottom"><font size="1">&#160;</font></td>\n<td valign="top"><font size="2" style="font-family:Times New Roman"><a href="#tx829913_3">Risk Factors</a></font></td>\n<td valign="bottom"><font size="1">&#160;&#160;</font></td>\n<td nowrap="" valign="bottom"><font size="2" style="font-family:Times New Roman">&#160;</font></td>\n<td align="right" nowrap="" valign="bottom"><font size="2" style="font-family:Times New Roman">30</font></td>\n<td nowrap="" valign="bottom"><font size="2" style="font-family:Times New Roman">&#160;&#160;</font></td></tr>\n<tr>\n<td nowrap="" valign="top"> <p style="margin-left:4.50em; text-indent:-4.50em"><font size="2" style="font-family:Times New Roman">Item&#160;1B.&#160;&#160;&#160;&#160;&#160;</font></p></td>\n<td valign="bottom"><font size="1">&#160;</font></td>\n<td valign="top"><font size="2" style="font-family:Times New Roman"><a href="#tx829913_4">Unresolved Staff Comments</a></font></td>

In [619]:
# Matches
matches = regex.finditer(document['10-K'])

# Create the dataframe
test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])

test_df.columns = ['item', 'start', 'end']
test_df['item'] = test_df.item.str.lower()

# Display the dataframe
test_df.head()

Unnamed: 0,item,start,end
0,>item&#160;1a.,31194,31208
1,>item&#160;1b.,31937,31951
2,>item&#160;7.,37030,37043
3,>item&#160;7a.,37875,37889
4,>item&#160;8,38666,38678


In [620]:
raw_10k[31216:31230]

'>Item&#160;1A.'

In [621]:
# Get rid of unnecessary characters from the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
test_df.replace(r'\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)

# display the dataframe
test_df.head()

Unnamed: 0,item,start,end
0,item1a,31194,31208
1,item1b,31937,31951
2,item7,37030,37043
3,item7a,37875,37889
4,item8,38666,38678


In [622]:
# Drop duplicates
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')

# Display the dataframe
pos_dat

Unnamed: 0,item,start,end
0,item1a,31194,31208
1,item1b,31937,31951
2,item7,37030,37043
3,item7a,37875,37889
5,item8,39487,39499


In [623]:
# Set item as the dataframe index
pos_dat.set_index('item', inplace=True)

# display the dataframe
pos_dat

Unnamed: 0_level_0,start,end
item,Unnamed: 1_level_1,Unnamed: 2_level_1
item1a,31194,31208
item1b,31937,31951
item7,37030,37043
item7a,37875,37889
item8,39487,39499


In [625]:
# Get Item 1A (named item_1a_raw to match the variable used in the next cell)
item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item1b']]

# Get Item 7
#item_7_raw = raw_10k[pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]

# Get Item 7a
#item_7a_raw = raw_10k[pos_dat['start'].loc['item7a']:pos_dat['start'].loc['item8']]

In [626]:
soup1 = BeautifulSoup(item_1a_raw, 'html.parser')

In [627]:
raw_10k[31216:31959]

'>Item&#160;1A.&#160;&#160;&#160;&#160;</font></p></td>\n<td valign="bottom"><font size="1">&#160;</font></td>\n<td valign="top"><font size="2" style="font-family:Times New Roman"><a href="#tx829913_3">Risk Factors</a></font></td>\n<td valign="bottom"><font size="1">&#160;&#160;</font></td>\n<td nowrap="" valign="bottom"><font size="2" style="font-family:Times New Roman">&#160;</font></td>\n<td align="right" nowrap="" valign="bottom"><font size="2" style="font-family:Times New Roman">30</font></td>\n<td nowrap="" valign="bottom"><font size="2" style="font-family:Times New Roman">&#160;&#160;</font></td></tr>\n<tr>\n<td nowrap="" valign="top"> <p style="margin-left:4.50em; text-indent:-4.50em"><font size="2" style="font-family:Times New Roman"'

In [585]:
goods = soup1.find_all('b')
[x.text for x in goods]

[]

In [550]:
soup.find('div')

In [491]:
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

In [492]:
text[:5000]

'0001193125-15-061145.txt : 20150225\n0001193125-15-061145.hdr.sgml : 20150225\n20150225080234\nACCESSION NUMBER:\t\t0001193125-15-061145\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t20\nCONFORMED PERIOD OF REPORT:\t20141231\nFILED AS OF DATE:\t\t20150225\nDATE AS OF CHANGE:\t\t20150225\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAmerican Airlines Group Inc.\n\t\tCENTRAL INDEX KEY:\t\t\t0000006201\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tAIR TRANSPORTATION, SCHEDULED [4512]\n\t\tIRS NUMBER:\t\t\t\t751825172\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-08400\n\t\tFILM NUMBER:\t\t15645918\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t4333 AMON CARTER BLVD\n\t\tCITY:\t\t\tFORT WORTH\n\t\tSTATE:\t\t\tTX\n\t\tZIP:\t\t\t76155\n\t\tBUSINESS PHONE:\t\t8179631234\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\t4333 AMON CARTER BLVD\n\t\tCITY:\t\t\tFO

We can clearly see that the document is divided based on the standard required elements. Business, Risk Factors, Unresolved Staff Comments, etc. are all standardized titles. This will help in our extrapolation step.

In [475]:
# Select Business as an example:
business = soup.find_all('a')[5]
print(business)

<a href="#i218984ca4fa54d3589b92be28c18f351_22" style="color:#0000ff;font-family:'Arial',sans-serif;font-size:10pt;font-weight:400;line-height:100%;text-decoration:underline">Risk Factors</a>


In [453]:
for filing_document in soup.find_all('DOCUMENT'):
  # The 'type' tag contains the document type
  document_type = filing_document.type.find(text=True, recursive=False).strip()

In [454]:
if document_type == "10-K":  # Once the 10K text body is found

  # Grab and store the 10K text body    
  TenKtext = filing_document.find('text').extract().text

In [455]:
document_type

'ZIP'

In [456]:
TenKtext

In [457]:
TenKtext = filing_document.find('text')
print(TenKtext)

<text>
begin 644 0000006201-21-000014-xbrl.zip
[uuencoded ZIP payload truncated]

In [247]:
print(business.get_text())

Business


In [262]:
first_par = soup.findAll('p')[100]
text_between = first_par.next_sibling

In [263]:
print(text_between)





In [250]:
container = soup.find('div', class_='td-post-content')
for para in container.find_all('p', recursive=False):
    print(para.text)

AttributeError: 'NoneType' object has no attribute 'find_all'

In [213]:
#for i in soup.prettify().split('<!--Persontype-->')[1].split('<a>'):
#    print('<strong>' + ''.join(i))

In [509]:
#for i in soup.prettify().split('</a>')[1:2]:
#    print(''.join(i))

In [215]:
#print(asdf[1])

## Attempting Other Code

https://gist.github.com/anshoomehra/ead8925ea291e233a5aa2dcaa2dc61b2

In [216]:
raw_10k[:1000]

'<html><body><document>\n<type>10-K\n<sequence>1\n<filename>d286458d10k.htm\n<description>FORM 10-K\n<text>\n<title>Form 10-K</title>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p style="line-height:2px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #435363">&#160;</p>\n<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #435363">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font color="#435363" size="2" style="font-family:ARIAL"><b>UNITED STATES SECURITIES\nAND EXCHANGE COMMISSION </b></font></p> <p align="center" style="margin-top:0px;margin-bottom:0px"><font color="#435363" size="2" style="font-family:ARIAL"><b>Washington, D.C. 20549 </b></font></p>\n<p style="font-size:1px;margin-top:0px;margin-bottom:0px">&#160;</p><center> <p style="line-height:6px;margin-top:0px;margin-bottom:2px;border-bottom:1pt solid #435363;width:21%">&#160;</p></center>\n<p align="center" style="margin-top:1px;margin-botto

In [498]:
# Regex to find <DOCUMENT> tags
doc_start_pattern = re.compile(r'<document>')
doc_end_pattern = re.compile(r'</document>')
# Regex to find a <TYPE> tag followed by any characters, terminating at the newline
type_pattern = re.compile(r'<type>[^\n]+')

In [499]:
# Create 3 lists with the span indices for each regex

### There are many <Document> tags in this text file, one per exhibit (10-K, EX-10.17, etc.).
### The first filter gives the end offset of each opening tag and the start offset of each closing tag.
### We use these later to grab the content between the tags.
doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]

### The type filter matches <TYPE> followed by one or more non-newline characters ([^\n]+),
### so each match is the tag plus the document type, e.g. '<type>10-K'.
### Stripping the '<type>' prefix from each match leaves the type names themselves.
doc_types = [x[len('<type>'):] for x in type_pattern.findall(raw_10k)]

In [516]:
doc_types

['10-K']
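Building three separate lists and zipping them assumes every document has exactly one type line in the same order. A variation that reads the type from inside each document block avoids that assumption. The sketch below uses a toy two-document string. `documents_by_type` is a hypothetical helper, and ignoring tag case is an assumption, since the tags may appear in upper or lower case depending on which file is read.

```python
import re

DOC_RE = re.compile(r"<document>(.*?)</document>", re.IGNORECASE | re.DOTALL)
TYPE_RE = re.compile(r"<type>([^\n]+)", re.IGNORECASE)


def documents_by_type(raw: str) -> dict:
    """Map each document type (e.g. '10-K') to the first document body of that type."""
    found = {}
    for body in DOC_RE.findall(raw):
        type_match = TYPE_RE.search(body)
        if type_match:
            # setdefault keeps the first document when a type repeats.
            found.setdefault(type_match.group(1).strip(), body)
    return found


sample = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report body</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-10.17\n<TEXT>exhibit body</TEXT>\n</DOCUMENT>\n"
)
docs = documents_by_type(sample)
print(sorted(docs))
```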

In [517]:
document = {}

# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start, doc_end in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[doc_start:doc_end]

In [518]:
# Display an excerpt of the document
document['10-K'][0:500]

'\n<type>10-K\n<sequence>1\n<filename>d829913d10k.htm\n<description>FORM 10-K\n<text>\n<title>Form 10-K</title>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>\n<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Roman"><b>UNITED STATES S'

In [519]:
# Write the regex
regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})|(ITEM\s(1A|1B|7A|7|8))')

# Use finditer to match the regex
matches = regex.finditer(document['10-K'])

# Write a for loop to print the matches
for match in matches:
    print(match)

<re.Match object; span=(31194, 31208), match='>Item&#160;1A.'>
<re.Match object; span=(31937, 31951), match='>Item&#160;1B.'>
<re.Match object; span=(37030, 37043), match='>Item&#160;7.'>
<re.Match object; span=(37875, 37889), match='>Item&#160;7A.'>
<re.Match object; span=(38666, 38678), match='>Item&#160;8'>
<re.Match object; span=(39487, 39499), match='>Item&#160;8'>


In [520]:
# Matches
matches = regex.finditer(document['10-K'])

# Create the dataframe
test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])

test_df.columns = ['item', 'start', 'end']
test_df['item'] = test_df.item.str.lower()

# Display the dataframe
test_df.head()

Unnamed: 0,item,start,end
0,>item&#160;1a.,31194,31208
1,>item&#160;1b.,31937,31951
2,>item&#160;7.,37030,37043
3,>item&#160;7a.,37875,37889
4,>item&#160;8,38666,38678


In [521]:
# Get rid of unnecessary characters from the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
test_df.replace(r'\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)

# display the dataframe
test_df.head()

Unnamed: 0,item,start,end
0,item1a,31194,31208
1,item1b,31937,31951
2,item7,37030,37043
3,item7a,37875,37889
4,item8,38666,38678
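The five chained replacements can be folded into one small function, which also makes the cleanup easy to test outside a dataframe. A minimal sketch; `normalize_item` is a hypothetical helper, and it decodes HTML entities with `html.unescape` rather than replacing them one by one.

```python
import html
import re


def normalize_item(label: str) -> str:
    """Turn a raw match such as '>Item&#160;1A.' into 'item1a'."""
    text = html.unescape(label)          # '&#160;' and '&nbsp;' become '\xa0'
    text = re.sub(r"[\s>.]", "", text)   # drop whitespace, '>' and '.'
    return text.lower()


raw_labels = [">Item&#160;1A.", ">Item&nbsp;7.", "ITEM 8"]
normalized = [normalize_item(x) for x in raw_labels]
print(normalized)
```

On a dataframe, the same function could be applied with `test_df['item'].map(normalize_item)`.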


In [522]:
# Drop duplicates
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')

# Display the dataframe
pos_dat

Unnamed: 0,item,start,end
0,item1a,31194,31208
1,item1b,31937,31951
2,item7,37030,37043
3,item7a,37875,37889
5,item8,39487,39499
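All of these offsets fall well before the first body heading found earlier (around offset 60,500), which suggests they point at table-of-contents rows rather than at the sections themselves. One plausible reason is that the second branch of the regex, `ITEM\s`, does not accept `&#160;`, so the upper-case body headings never match. The sketch below shows one way to handle this on a toy string: allow the entity after either spelling and keep the last occurrence of each label. `ITEM_RE` and `last_occurrences` are hypothetical names, and a late cross-reference such as "see Item 1A" would also match, so this is not a complete fix.

```python
import re

ITEM_RE = re.compile(r"Item(?:\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\b", re.IGNORECASE)


def last_occurrences(raw: str) -> dict:
    """Map each item label to the offset of its last match in raw."""
    positions = {}
    for m in ITEM_RE.finditer(raw):
        positions["item" + m.group(1).lower()] = m.start()  # later matches win
    return positions


sample = (
    "<td>Item&#160;1A.</td><td>Risk Factors</td>"      # table-of-contents row
    "<td>Item&#160;1B.</td><td>Unresolved</td>"
    "<b>ITEM&#160;1A.&#160;&#160;RISK FACTORS</b>"      # body heading
    "<b>ITEM&#160;1B.&#160;&#160;UNRESOLVED STAFF COMMENTS</b>"
)
positions = last_occurrences(sample)
print(positions)
```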


In [523]:
# Set item as the dataframe index
pos_dat.set_index('item', inplace=True)

# display the dataframe
pos_dat

Unnamed: 0_level_0,start,end
item,Unnamed: 1_level_1,Unnamed: 2_level_1
item1a,31194,31208
item1b,31937,31951
item7,37030,37043
item7a,37875,37889
item8,39487,39499


In [525]:
# Get Item 1a
item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item1b']]

# Get Item 7
item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]

# Get Item 7a
item_7a_raw = document['10-K'][pos_dat['start'].loc['item7a']:pos_dat['start'].loc['item8']]

In [528]:
document['10-K'][:500]

'\n<type>10-K\n<sequence>1\n<filename>d829913d10k.htm\n<description>FORM 10-K\n<text>\n<title>Form 10-K</title>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p style="line-height:4px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #000000">&#160;</p>\n<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font size="2" style="font-family:Times New Roman"><b>UNITED STATES S'

In [529]:
### First convert the raw text we extracted into a BeautifulSoup object
item_1a_content = BeautifulSoup(item_1a_raw, 'lxml')

In [530]:
### Applying .prettify() makes the raw text look organized, since BeautifulSoup
### indents it according to the HTML tag tree structure
print(item_1a_content.prettify()[0:1000])

<html>
 <body>
  <p>
   &gt;Item 1A.
  </p>
  <td valign="bottom">
   <font size="1">
   </font>
  </td>
  <td valign="top">
   <font size="2" style="font-family:Times New Roman">
    <a href="#tx829913_3">
     Risk Factors
    </a>
   </font>
  </td>
  <td valign="bottom">
   <font size="1">
   </font>
  </td>
  <td nowrap="" valign="bottom">
   <font size="2" style="font-family:Times New Roman">
   </font>
  </td>
  <td align="right" nowrap="" valign="bottom">
   <font size="2" style="font-family:Times New Roman">
    30
   </font>
  </td>
  <td nowrap="" valign="bottom">
   <font size="2" style="font-family:Times New Roman">
   </font>
  </td>
  <tr>
   <td nowrap="" valign="top">
    <p style="margin-left:4.50em; text-indent:-4.50em">
     <font size="2" style="font-family:Times New Roman">
     </font>
    </p>
   </td>
  </tr>
 </body>
</html>


In [531]:
### Our goal, though, is to remove the HTML tags and see the content.
### get_text() does this; a separator such as '\n\n' can optionally be passed
### to put blank lines between the pieces of text.
print(item_1a_content.get_text())

>Item 1A.    
 
Risk Factors
  
 
30
  

 


In [203]:
item_1a_raw

'>Item&#160;1A.&#160;&#160;&#160;&#160;</font></p></td>\n<td valign="bottom"><font size="1">&#160;</font></td>\n<td valign="top"><font color="#435363" size="2" style="font-family:ARIAL"><a href="#tx286458_3">Risk Factors</a></font></td>\n<td valign="bottom"><font size="1">&#160;&#160;</font></td>\n<td nowrap="" valign="bottom"><font color="#435363" size="2" style="font-family:ARIAL">&#160;</font></td>\n<td align="right" nowrap="" valign="bottom"><font color="#435363" size="2" style="font-family:ARIAL">16</font></td>\n<td nowrap="" valign="bottom"><font color="#435363" size="2" style="font-family:ARIAL">&#160;</font></td></tr>\n<tr style="page-break-inside:avoid">\n<td nowrap="" valign="top"> <p style="margin-left:4.50em; text-indent:-4.50em"><font color="#435363" size="2" style="font-family:ARIAL"'

In [199]:
document["10-K"][:500]

'\n<type>10-K\n<sequence>1\n<filename>d286458d10k.htm\n<description>FORM 10-K\n<text>\n<title>Form 10-K</title>\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n<p style="line-height:2px;margin-top:0px;margin-bottom:0px;border-bottom:2pt solid #435363">&#160;</p>\n<p style="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #435363">&#160;</p> <p align="center" style="margin-top:1px;margin-bottom:0px"><font color="#435363" size="2" style="font-family:ARIAL"><b>UNITED ST'