# Applying Regexes To 10-Ks

### Introduction

In this notebook you will apply regexes to find useful financial information in 10-Ks. In particular, you will use what you learned in previous lessons to extract text from Items 1A, 7, and 7A.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup

# Getting The HTML File

We will be working with Apple's 2018 10-K. In the code below, we use the `requests` library to download the HTML data for this 10-K directly from the SEC website. We will learn more about the `requests` library in a later lesson. We save the HTML data in a string variable named `raw_10k`, as shown below:

In [2]:
# Import requests
import requests

# Get the HTML data from the 2018 10-K from Apple
r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt')
raw_10k = r.text

In [3]:
type(r)

requests.models.Response

If we print the `raw_10k` string we will see that it has many sections. In the code below, we print part of the `raw_10k` string:

In [4]:
print(raw_10k[0:2000])

<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105
<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105
<ACCEPTANCE-DATETIME>20181105080140
ACCESSION NUMBER:		0000320193-18-000145
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		88
CONFORMED PERIOD OF REPORT:	20180929
FILED AS OF DATE:		20181105
DATE AS OF CHANGE:		20181105

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			APPLE INC
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0930

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		181158788

	BUSINESS ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:	
		FORMER CONFORMED NAME:	APPLE COMPUTER INC
		DATE OF NA

# Regexes for Tags

For our purposes, we are only interested in the sections that contain the 10-K information. All the sections, including the 10-K, are enclosed within `<DOCUMENT>` and `</DOCUMENT>` tags. Each section within the document tags is clearly marked by a `<TYPE>` tag followed by the name of the section. In the code below, write three regular expressions:

1. A regex to find the `<DOCUMENT>` tag

2. A regex to find the `</DOCUMENT>` tag

3. A regex to find all the sections marked by the `<TYPE>` tag

In [5]:
# import re module
import re

# Write regexes
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
type_pattern = re.compile(r'<TYPE>[^\n]+')

# Create Lists with Span Indices

Now that you have the regexes, use the `.finditer()` method to match them against `raw_10k`. In the code below, create 3 lists:

1. A list that holds the `.end()` index of each match of `doc_start_pattern`

2. A list that holds the `.start()` index of each match of `doc_end_pattern`

3. A list that holds the name of the section from each match of `type_pattern`

In [6]:
# Create 3 lists with the span indices for each regex
doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]
doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]

In [7]:
len(doc_start_is), len(doc_end_is), len(doc_types)

(88, 88, 88)
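
As a quick sanity check of how the three patterns and their span indices line up, here is a minimal sketch on a made-up two-section filing string. The `toy_filing` text is an assumption for illustration only, not real SEC data.

```python
import re

# Toy stand-in for raw_10k (assumed data, two sections only)
toy_filing = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report body</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>exhibit body</TEXT>\n</DOCUMENT>\n"
)

doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
type_pattern = re.compile(r'<TYPE>[^\n]+')

# .end() of the opening tag and .start() of the closing tag bracket each section body
starts = [m.end() for m in doc_start_pattern.finditer(toy_filing)]
ends = [m.start() for m in doc_end_pattern.finditer(toy_filing)]
types = [m[len('<TYPE>'):] for m in type_pattern.findall(toy_filing)]

# The three lists should have equal length, otherwise zip() would silently misalign them
assert len(starts) == len(ends) == len(types)

for doc_type, start, end in zip(types, starts, ends):
    print(doc_type, repr(toy_filing[start:end]))
```

Checking that the three lengths agree, as the notebook does with `(88, 88, 88)`, is worthwhile because `zip()` stops at the shortest list without raising an error.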

# Create a Dictionary for the 10-K

In the code below, create a dictionary that has the key `10-K` and, as its value, the contents of the `10-K` section found above. To do this, loop through all the sections found above, and if the section type is `10-K`, save it to the dictionary. Use the indices in `doc_start_is` and `doc_end_is` to slice the `raw_10k` string.

In [8]:
document = {}

# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start_i, doc_end_i in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[doc_start_i:doc_end_i]

# display the document
document.keys()

dict_keys(['10-K'])

In [9]:

def fetch_10ks(raw_10k_doc: str, target_doc: str = '10-K'):
    """
    Extract the target section (by default the 10-K) from a raw SEC filing.

    The filing is split on its <DOCUMENT> ... </DOCUMENT> tags, and each
    section is identified by the name that follows its <TYPE> tag.

    :param raw_10k_doc: raw text of the full filing downloaded from the SEC website
    :param target_doc: section type to keep, e.g. '10-K'
    :return: dict mapping target_doc to that section's text
    """
    doc_start_pattern = re.compile(r'<DOCUMENT>')
    doc_end_pattern = re.compile(r'</DOCUMENT>')
    type_pattern = re.compile(r'<TYPE>[^\n]+')

    doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k_doc)]
    doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k_doc)]
    doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k_doc)]

    document = {}

    # Create a loop to go through each section type and save only the 10-K section in the dictionary
    for doc_type, doc_start_i, doc_end_i in zip(doc_types, doc_start_is, doc_end_is):
        if doc_type == target_doc:
            document[doc_type] = raw_10k_doc[doc_start_i:doc_end_i]

    return document

fetch_10ks(raw_10k)['10-K'][:1000]

'\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>a10-k20189292018.htm\n<DESCRIPTION>10-K\n<TEXT>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n\t<head>\n\t\t<!-- Document created using Wdesk 1 -->\n\t\t<!-- Copyright 2018 Workiva -->\n\t\t<title>Document</title>\n\t</head>\n\t<body style="font-family:Times New Roman;font-size:10pt;">\n<div><a name="s3540C27286EF5B0DA103CC59028B96BE"></a></div><div style="line-height:120%;text-align:center;font-size:10pt;"><div style="padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;"><table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;"><tr><td colspan="1"></td></tr><tr><td style="width:100%;"></td></tr><tr><td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000;border-bottom:1px s

# Find Item 1A, 7, and 7A

Our task now is to use a regular expression to find the items of interest. The items in this `document` can appear in four different patterns. For example, Item 1A can be found in any of the following patterns:

1. `>Item 1A`

2. `>Item&#160;1A` 

3. `>Item&nbsp;1A`

4. `ITEM 1A` 

In the code below write a single regular expression that can match all four patterns for Items 1A, 7, and 7A. Then use the `.finditer()` method to match the regex to `document['10-K']`. Finally create a for loop to print the `matches`.

In [10]:
# Write the regex
regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|7A|7)\.{0,1})|(ITEM\s(1A|7A|7))')

# Use finditer to match the regex
matches = regex.finditer(document['10-K'])

# Write a for loop to print the matches
for match in matches:
    print(match)

<re.Match object; span=(38318, 38327), match='>Item 1A.'>
<re.Match object; span=(46148, 46156), match='>Item 7.'>
<re.Match object; span=(47281, 47290), match='>Item 7A.'>
<re.Match object; span=(119131, 119140), match='>Item 1A.'>
<re.Match object; span=(333318, 333326), match='>Item 7.'>
<re.Match object; span=(729984, 729993), match='>Item 7A.'>


If your regex is written correctly, the only matches above should be those for Items 1A, 7, and 7A. You should also notice that each item is matched twice. This is because each item appears first in the index and then in the corresponding section. We now have to remove the matches that correspond to the index, which we will do using Pandas in the next section.
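
To see how the single regex behaves on each heading style, the following minimal sketch runs it against a handful of made-up snippets. The `samples` strings are assumptions for illustration; only the regex itself comes from the cell above.

```python
import re

regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|7A|7)\.{0,1})|(ITEM\s(1A|7A|7))')

# Made-up snippets: the four heading styles, one unwanted heading,
# and one in-text cross reference that is not preceded by '>'
samples = [
    '<font>Item 1A.</font>',
    '<font>Item&#160;7.</font>',
    '<font>Item&nbsp;7A.</font>',
    'ITEM 7A',
    '<font>Item 8.</font>',
    'see Part II, Item&#160;7, for details',
]

found = []
for sample in samples:
    match = regex.search(sample)
    found.append(match.group() if match else None)

print(found)
```

The leading `>` in the first alternative ties the match to text that directly follows an HTML tag, which is why the in-text cross reference is skipped while the headings are found.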

# Remove Matches that Correspond to the Index

We will remove the matches that correspond to the index using a Pandas Dataframe. We will do this in a couple of steps.

## Create a Pandas DataFrame

In the code below, create a pandas dataframe with the following column names: `'item','start','end'`. In the `item` column save the `match.group()` in lower case letters, in the `start` column save the `match.start()`, and in the `end` column save the `match.end()`.

In [11]:
# import pandas
import pandas as pd

# Matches
matches = regex.finditer(document['10-K'])

# Create the dataframe
test_df = pd.DataFrame([(x.group(),x.start(),x.end()) for x in matches])
test_df.columns = ['item','start','end']
test_df['item'] = test_df.item.str.lower()

# Display the dataframe
test_df

|   | item | start | end |
|---|---|---|---|
| 0 | >item 1a. | 38318 | 38327 |
| 1 | >item 7. | 46148 | 46156 |
| 2 | >item 7a. | 47281 | 47290 |
| 3 | >item 1a. | 119131 | 119140 |
| 4 | >item 7. | 333318 | 333326 |
| 5 | >item 7a. | 729984 | 729993 |


## Eliminate Unnecessary Characters

As we can see, our dataframe, in particular the `item` column, contains some unnecessary characters such as `>` and periods `.`. In some cases, we will also get HTML entities for the non-breaking space, such as `&#160;` and `&nbsp;`. In the code below, use the Pandas dataframe method `.replace()` with the keyword `regex=True` to remove all whitespace, the above mentioned entities, the `>` character, and the periods from our dataframe. We do this because we want to use the `item` column as our dataframe index later on.

In [12]:
# Get rid of unnecessary characters from the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
# Raw string so that the backslash reaches the regex engine unchanged
test_df.replace(r'\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)

# display the dataframe
test_df

|   | item | start | end |
|---|---|---|---|
| 0 | item1a | 38318 | 38327 |
| 1 | item7 | 46148 | 46156 |
| 2 | item7a | 47281 | 47290 |
| 3 | item1a | 119131 | 119140 |
| 4 | item7 | 333318 | 333326 |
| 5 | item7a | 729984 | 729993 |
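
The same clean-up can also be expressed as a single substitution on plain strings. The sketch below is one possible variation, not the notebook's method; `junk_pattern` and `normalize_item_label` are hypothetical names introduced here for illustration.

```python
import re

# One alternation covering the entities, whitespace, periods, and '>'
junk_pattern = re.compile(r'&#160;|&nbsp;|\s|\.|>')

def normalize_item_label(raw_match: str) -> str:
    """Lower-case a raw heading match and strip entities, whitespace, '.' and '>'."""
    return junk_pattern.sub('', raw_match.lower())

raw_matches = ['>Item 1A.', '>Item&#160;7.', '>Item&nbsp;7A.', 'ITEM 7A']
labels = [normalize_item_label(s) for s in raw_matches]
print(labels)
```

Doing the substitution in one pass avoids the intermediate step of turning entities into spaces and then removing the spaces again.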


## Remove Duplicates

Now that we have removed all unnecessary characters from our dataframe, we can go ahead and remove the Item matches that correspond to the index. In the code below, use the Pandas dataframe `.drop_duplicates()` method to keep only the last match of each Item in the dataframe and drop the rest. As a precaution, make sure that the `start` column is sorted in ascending order before dropping the duplicates.

In [13]:
# Drop duplicates
pos_dat=test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'],keep='last')

# Display the dataframe
pos_dat

|   | item | start | end |
|---|---|---|---|
| 3 | item1a | 119131 | 119140 |
| 4 | item7 | 333318 | 333326 |
| 5 | item7a | 729984 | 729993 |
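
To illustrate why sorting by `start` before dropping duplicates matters, here is a small sketch on a toy dataframe. The offsets and the deliberately scrambled row order are assumptions chosen for illustration.

```python
import pandas as pd

# Toy matches: starts 10-30 mimic the index entries, starts 100-300 the real sections
toy_df = pd.DataFrame({
    'item': ['item1a', 'item7', 'item7a', 'item1a', 'item7', 'item7a'],
    'start': [10, 20, 30, 100, 200, 300],
})

# Scramble the rows so that the index entries come last within each item
scrambled = toy_df.iloc[[3, 0, 4, 1, 5, 2]]

# Without sorting, keep='last' keeps whatever row happens to come last
unsorted_kept = scrambled.drop_duplicates(subset=['item'], keep='last')

# With sorting, keep='last' keeps the match with the largest start offset
sorted_kept = scrambled.sort_values('start', ascending=True).drop_duplicates(
    subset=['item'], keep='last')

print(unsorted_kept)
print(sorted_kept)
```

In the notebook the matches already arrive in document order from `.finditer()`, so the sort is a safeguard rather than a strict requirement.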


# Set Item to Index

In the code below, use the Pandas dataframe `.set_index()` method to set the `item` column as the index of our dataframe.

In [14]:
# Set item as the dataframe index
pos_dat.set_index('item', inplace =True)

# display the dataframe
pos_dat

| item | start | end |
|---|---|---|
| item1a | 119131 | 119140 |
| item7 | 333318 | 333326 |
| item7a | 729984 | 729993 |


In [15]:
def fetch_section_idx_df(doc):
    """
    Locate the headings of Items 1A, 7, and 7A in the 10-K section.

    Each heading can appear in any of the following patterns (shown for Item 1A):
        >Item 1A
        >Item&#160;1A
        >Item&nbsp;1A
        ITEM 1A
    Every item is matched twice, first in the index and then in the section itself,
    so .drop_duplicates() is used to keep only the last match of each item.

    :param doc: dict with the 10-K section text stored under the key '10-K'
    :return: Pandas dataframe indexed by item with the columns 'start', 'end', 'next_start'
    """
    target_doc = '10-K'
    re_risk = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|7A|7)\.{0,1})|(ITEM\s(1A|7A|7))')
    text = doc[target_doc]
    matches = re_risk.finditer(text)

    test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])
    test_df.columns = ['item', 'start', 'end']
    test_df['item'] = test_df.item.str.lower()

    # Get rid of unnecessary characters from the dataframe
    test_df.replace('&#160;', ' ', regex=True, inplace=True)
    test_df.replace('&nbsp;', ' ', regex=True, inplace=True)
    test_df.replace(' ', '', regex=True, inplace=True)
    test_df.replace(r'\.', '', regex=True, inplace=True)
    test_df.replace('>', '', regex=True, inplace=True)

    # Keep only the last match of each item (the section itself, not the index entry)
    pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
    pos_dat.set_index('item', inplace=True)
    # 'next_start' holds the END of the next item's heading match, so each slice
    # includes that heading; the last section runs to the end of the text
    pos_dat['next_start'] = pos_dat['end'].shift(-1)
    pos_dat.iloc[-1, -1] = len(text)

    return pos_dat

In [16]:
fetch_section_idx_df(fetch_10ks(raw_10k))

| item | start | end | next_start |
|---|---|---|---|
| item1a | 119131 | 119140 | 333326.0 |
| item7 | 333318 | 333326 | 729993.0 |
| item7a | 729984 | 729993 | 2405867.0 |
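
The `next_start` column is built with `.shift(-1)`, which moves each value up by one row. The following sketch shows the mechanics on a toy dataframe; the offsets and `text_length` are assumed values, and the `next_heading_start` column is a possible variation rather than the notebook's code.

```python
import pandas as pd

toy_pos = pd.DataFrame(
    {'start': [100, 200, 300], 'end': [108, 207, 309]},
    index=pd.Index(['item1a', 'item7', 'item7a'], name='item'),
)
text_length = 1000  # assumed length of the toy document

# As in fetch_section_idx_df: the boundary is the END of the next heading match,
# and the last row (NaN after the shift) is filled with the document length
toy_pos['next_start'] = toy_pos['end'].shift(-1)
toy_pos.iloc[-1, -1] = text_length

# Possible variation (an assumption): stop at the START of the next heading instead
toy_pos['next_heading_start'] = toy_pos['start'].shift(-1, fill_value=text_length)

print(toy_pos)
```

Because the notebook shifts the `end` column, each extracted section ends with the heading of the next item; shifting `start` would stop just before it.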


# Get The Financial Information From Each Item

The above dataframe contains the start and end index of each match for Items 1A, 7, and 7A. In the code below, save all the text from the starting index of `item1a` up to the starting index of `item7` into a variable called `item_1a_raw`. Similarly, save all the text from the starting index of `item7` up to the starting index of `item7a` into a variable called `item_7_raw`. Finally, save all the text from the starting index of `item7a` to the end of `document['10-K']` into a variable called `item_7a_raw`. You can accomplish all of this by making the correct slices of `document['10-K']`.

In [17]:
# Get Item 1a
item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item7']]

# Get Item 7
item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]

# Get Item 7a
item_7a_raw = document['10-K'][pos_dat['start'].loc['item7a']:]
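
The three slices follow one pattern: each section runs from its own start to the start of the next one, and the last runs to the end of the text. A minimal sketch of that pattern on a toy string follows; `toy_text` and the resulting offsets are assumptions for illustration.

```python
# Toy document with three headings (assumed text for illustration)
toy_text = 'toc...>Item 1A. risks >Item 7. discussion >Item 7A. market risk'
starts = {
    'item1a': toy_text.index('>Item 1A.'),
    'item7': toy_text.index('>Item 7.'),
    'item7a': toy_text.index('>Item 7A.'),
}

# Pair each start with the next start; the last section runs to the end of the text
ordered = sorted(starts.items(), key=lambda kv: kv[1])
boundaries = [pos for _, pos in ordered[1:]] + [len(toy_text)]
toy_sections = {
    name: toy_text[begin:stop]
    for (name, begin), stop in zip(ordered, boundaries)
}

for name, section in toy_sections.items():
    print(name, repr(section))
```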

In [18]:
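def fetch_section(document, pos_dat):
    """
    Slice the text of each item out of the 10-K section.

    :param document: dict with the 10-K section text stored under the key '10-K'
    :param pos_dat: dataframe indexed by item with the columns 'end' and 'next_start'
    :return: dict mapping each item label to its raw text
    """
    target_doc = '10-K'
    text = document[target_doc]
    docs = {}
    for idx_section, row in pos_dat.iterrows():
        # Each section runs from the end of its own heading match to 'next_start'
        docs[idx_section] = text[int(row['end']):int(row['next_start'])]

    return docs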

sections = fetch_section(document=fetch_10ks(raw_10k),
              pos_dat=fetch_section_idx_df(fetch_10ks(raw_10k)) )
sections.keys()

dict_keys(['item1a', 'item7', 'item7a'])

In [19]:
sections['item1a'][:1000]

'</font></div></td><td style="vertical-align:top;"><div style="line-height:120%;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;">Risk Factors</font></div></td></tr></table><div style="line-height:120%;padding-top:8px;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;">The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item&#160;7, &#8220;Management&#8217;s Discussion and Analysis of Financial Condition and Results of Operations&#8221; and the consolidated financial statements and related notes in Part II, Item&#160;8, &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K.</font></div><div style="line-height:120%;padding-top:16px;text-align:justify;font-size:9pt;"><font styl

## Display Item 1a

Now that we have each item saved into a separate variable we can view them separately. For illustration purposes we will display Item 1a, but the other items will look similar.

In [20]:
# Display Item 1a
item_1a_raw[:1000]

'>Item 1A.</font></div></td><td style="vertical-align:top;"><div style="line-height:120%;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;">Risk Factors</font></div></td></tr></table><div style="line-height:120%;padding-top:8px;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;">The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item&#160;7, &#8220;Management&#8217;s Discussion and Analysis of Financial Condition and Results of Operations&#8221; and the consolidated financial statements and related notes in Part II, Item&#160;8, &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K.</font></div><div style="line-height:120%;padding-top:16px;text-align:justify;font-size:9pt;"><

In [21]:
sections['item1a'][:1000]

'</font></div></td><td style="vertical-align:top;"><div style="line-height:120%;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;">Risk Factors</font></div></td></tr></table><div style="line-height:120%;padding-top:8px;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;">The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item&#160;7, &#8220;Management&#8217;s Discussion and Analysis of Financial Condition and Results of Operations&#8221; and the consolidated financial statements and related notes in Part II, Item&#160;8, &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K.</font></div><div style="line-height:120%;padding-top:16px;text-align:justify;font-size:9pt;"><font styl

We can see that the items look pretty messy: they contain HTML tags, HTML entities such as `&#160;`, and so on. Before we can do proper Natural Language Processing on these items, we need to clean them up, which means removing all HTML tags, entities, and similar markup. In principle we could do this using regex substitutions as we learned previously, but this can be rather difficult. Luckily, packages that can do the cleaning for us already exist, such as **BeautifulSoup**, which will be the topic of our next lessons.
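
For a sense of what the regex-substitution route looks like, here is a deliberately naive sketch using only the standard library. The function name `naive_clean` and the `snippet` string are assumptions for illustration, and the approach is not a substitute for a real HTML parser.

```python
import html
import re

tag_pattern = re.compile(r'<[^>]+>')

def naive_clean(raw_html: str) -> str:
    """Strip tags with a regex, decode HTML entities, and collapse whitespace."""
    without_tags = tag_pattern.sub(' ', raw_html)
    decoded = html.unescape(without_tags)     # turns &#160; into '\xa0', &#8220; into a curly quote
    decoded = decoded.replace('\xa0', ' ')    # treat non-breaking spaces as plain spaces
    return re.sub(r'\s+', ' ', decoded).strip()

snippet = (
    '<div><font style="font-weight:bold;">Risk Factors</font></div>'
    '<div>See Part II, Item&#160;7, &#8220;Management&#8217;s Discussion&#8221;.</div>'
)
print(naive_clean(snippet))
```

A pattern like `<[^>]+>` breaks on comments, script blocks, and attribute values that contain `>`, which is part of why a dedicated parser is usually the safer choice.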

# Run load pipeline

In [22]:
%load_ext autoreload
%autoreload 2

In [23]:
%run ../nb_config.py

In [24]:
import pprint
from tqdm import tqdm

from src.load_data import load_sec10k
from src.nlp_quant import parse_sec_fillings

In [25]:
DOC_TYPE = '10-K'
newest_pricing_data = '2018-11-05'

In [26]:
example_ticker = 'AAPL'
ex_cik = {example_ticker: f'{320193:010d}'}  #aapl	320193
#ex_cik = {'AMZN': '0001018724'}  # 'AMZN': '0001018724'
ex_cik

{'AAPL': '0000320193'}
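
The format spec `010d` used above pads an integer with leading zeros to a width of ten digits. A small sketch of that step as a helper follows; `pad_cik` is a hypothetical name introduced here.

```python
def pad_cik(cik: int) -> str:
    """Format a numeric CIK as a 10-digit, zero-padded string."""
    return f'{cik:010d}'

print(pad_cik(320193))   # '0000320193', the AAPL value shown above
```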

In [27]:
sec_api = load_sec10k.SecAPI()

In [28]:
sec_data = {}
sec_dates = {}

for ticker, cik in ex_cik.items():
    sec_data[ticker] = load_sec10k.get_sec_data(
        sec_api=sec_api, cik=cik,
        newest_pricing_data=newest_pricing_data,
        doc_type=DOC_TYPE)
    sec_dates[ticker] = [x[2] for x in sec_data[ticker]]

In [29]:
sec_dates

{'AAPL': ['2018-11-05',
  '2017-11-03',
  '2016-10-26',
  '2015-10-28',
  '2014-10-27',
  '2013-10-30',
  '2012-10-31',
  '2011-10-26',
  '2010-10-27',
  '2010-01-25',
  '2009-10-27',
  '2008-11-05',
  '2007-11-15',
  '2006-12-29',
  '2005-12-01',
  '2004-12-03',
  '2003-12-19',
  '2002-12-19',
  '2001-12-21',
  '2000-12-14',
  '1999-12-22',
  '1998-12-23',
  '1998-01-23',
  '1997-12-05',
  '1996-12-19',
  '1995-12-19',
  '1994-12-13']}

In [30]:
pprint.pprint(sec_data[example_ticker][:5])
# r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt')

[('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145-index.htm',
  '10-K',
  '2018-11-05'),
 ('https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/0000320193-17-000070-index.htm',
  '10-K',
  '2017-11-03'),
 ('https://www.sec.gov/Archives/edgar/data/320193/000162828016020309/0001628280-16-020309-index.htm',
  '10-K',
  '2016-10-26'),
 ('https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351-index.htm',
  '10-K',
  '2015-10-28'),
 ('https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437-index.htm',
  '10-K',
  '2014-10-27')]


In [31]:
raw_fillings_by_ticker = {}

for ticker, data in sec_data.items():
    raw_fillings_by_ticker[ticker] = {}
    for index_url, file_type, file_date in tqdm(data, desc='Downloading {} Fillings'.format(ticker), unit='filling'):
        if (file_type == DOC_TYPE):
            file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')            
            
            raw_fillings_by_ticker[ticker][file_date] = sec_api.get(file_url)

Downloading AAPL Fillings: 100%|██████████| 27/27 [00:04<00:00,  6.31filling/s]
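
The two chained `.replace()` calls in the download loop convert an index page URL into the URL of the full text submission. The sketch below isolates that step; `index_url_to_txt_url` is a hypothetical helper name, and the example URL is the first entry of `sec_data` shown earlier.

```python
def index_url_to_txt_url(index_url: str) -> str:
    """Turn an '-index.htm' or '-index.html' URL into the matching '.txt' URL."""
    # An '-index.html' suffix first becomes '.txtl', which the second replace trims to '.txt'
    return index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')

index_url = ('https://www.sec.gov/Archives/edgar/data/320193/'
             '000032019318000145/0000320193-18-000145-index.htm')
print(index_url_to_txt_url(index_url))
```

For the 2018 filing this yields the same `.txt` URL that was passed to `requests.get()` at the start of the notebook.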


In [32]:
print('Example Document:\n\n{}...'.format(next(iter(raw_fillings_by_ticker[example_ticker].values()))[:1000]))

Example Document:

<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105
<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105
<ACCEPTANCE-DATETIME>20181105080140
ACCESSION NUMBER:		0000320193-18-000145
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		88
CONFORMED PERIOD OF REPORT:	20180929
FILED AS OF DATE:		20181105
DATE AS OF CHANGE:		20181105

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			APPLE INC
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0930

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		181158788

	BUSINESS ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:	
		FORMER CONFORMED NAME:	APPLE COMPUT

In [33]:
filling_documents_by_ticker = {}

for ticker, raw_fillings in raw_fillings_by_ticker.items():
    filling_documents_by_ticker[ticker] = {}
    for file_date, filling in tqdm(raw_fillings.items(), desc='Getting Documents from {} Fillings'.format(ticker), unit='filling'):
        filling_documents_by_ticker[ticker][file_date] = load_sec10k.get_documents(filling)

Getting Documents from AAPL Fillings: 100%|██████████| 23/23 [00:00<00:00, 108.27filling/s]
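
The source of `load_sec10k.get_documents` is not shown in this notebook. Judging from its output, it returns the text between each `<DOCUMENT>` and `</DOCUMENT>` pair, so a rough stand-in could look like the sketch below. The name `get_documents_sketch`, the single-regex approach, and the toy filing are all assumptions.

```python
import re

# Non-greedy match with DOTALL so that '.' also spans newlines
document_pattern = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', flags=re.DOTALL)

def get_documents_sketch(text: str) -> list:
    """Return the body of every <DOCUMENT>...</DOCUMENT> block, in order."""
    return document_pattern.findall(text)

toy_filing = (
    "<SEC-DOCUMENT>header\n"
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-10.17\n<TEXT>exhibit</TEXT>\n</DOCUMENT>\n"
)
toy_docs = get_documents_sketch(toy_filing)
print(len(toy_docs), repr(toy_docs[0][:12]))
```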


In [34]:
print('\n\n'.join([
    'Document {} Filed on {}:\n{}...'.format(doc_i, file_date, doc[:200])
    for file_date, docs in filling_documents_by_ticker[example_ticker].items()
    for doc_i, doc in enumerate(docs)][:3]))

Document 0 Filed on 2018-11-05:

<TYPE>10-K
<SEQUENCE>1
<FILENAME>a10-k20189292018.htm
<DESCRIPTION>10-K
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
	<head>
		...

Document 1 Filed on 2018-11-05:

<TYPE>EX-10.17
<SEQUENCE>2
<FILENAME>a10-kexhibit10172018.htm
<DESCRIPTION>EXHIBIT 10.17
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
...

Document 2 Filed on 2018-11-05:

<TYPE>EX-10.18
<SEQUENCE>3
<FILENAME>a10-kexhibit10182018.htm
<DESCRIPTION>EXHIBIT 10.18
<TEXT>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
...


In [35]:
ten_ks_by_ticker = {}
tenk_dates = {}

for ticker, filling_documents in filling_documents_by_ticker.items():
    ten_ks_by_ticker[ticker] = []
    tenk_dates[ticker] = []
    for file_date, documents in filling_documents.items():
        for document in documents:
            if load_sec10k.get_document_type(document) == DOC_TYPE:
                ten_ks_by_ticker[ticker].append({
                    'cik': ex_cik[ticker],
                    'file': document,
                    'file_date': file_date})
                tenk_dates[ticker].append(file_date)
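
Likewise, `load_sec10k.get_document_type` is imported from a module whose code is not shown here. Assuming it reads the name that follows the `<TYPE>` tag, a minimal stand-in might be written as follows; `get_document_type_sketch` is a hypothetical name.

```python
import re

type_line_pattern = re.compile(r'<TYPE>([^\n]+)')

def get_document_type_sketch(doc: str) -> str:
    """Return the text after the first <TYPE> tag, or '' if there is none."""
    match = type_line_pattern.search(doc)
    return match.group(1).strip() if match else ''

print(get_document_type_sketch('\n<TYPE>10-K\n<SEQUENCE>1\n'))
print(get_document_type_sketch('\n<TYPE>EX-10.17\n<SEQUENCE>2\n'))
```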

In [36]:
load_sec10k.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['cik', 'file', 'file_date'])

[
  {
    cik: '0000320193'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>a10-k2018929...
    file_date: '2018-11-05'},
  {
    cik: '0000320193'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>a10-k2017930...
    file_date: '2017-11-03'},
  {
    cik: '0000320193'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>a201610-k924...
    file_date: '2016-10-26'},
  {
    cik: '0000320193'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>d17062d10k.h...
    file_date: '2015-10-28'},
  {
    cik: '0000320193'
    file: '\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>d783162d10k....
    file_date: '2014-10-27'},
]


In [37]:
from src.nlp_quant import parse_sec_fillings

In [38]:
ten_ks_by_ticker[example_ticker][0]['file_date']  #['2018-11-05']

'2018-11-05'

In [39]:
ten_ks_by_ticker[example_ticker][0]['file'][:1000]

'\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>a10-k20189292018.htm\n<DESCRIPTION>10-K\n<TEXT>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<html>\n\t<head>\n\t\t<!-- Document created using Wdesk 1 -->\n\t\t<!-- Copyright 2018 Workiva -->\n\t\t<title>Document</title>\n\t</head>\n\t<body style="font-family:Times New Roman;font-size:10pt;">\n<div><a name="s3540C27286EF5B0DA103CC59028B96BE"></a></div><div style="line-height:120%;text-align:center;font-size:10pt;"><div style="padding-left:0px;text-indent:0px;line-height:normal;padding-top:10px;"><table cellpadding="0" cellspacing="0" style="font-family:Times New Roman;font-size:10pt;margin-left:auto;margin-right:auto;width:100%;border-collapse:collapse;text-align:left;"><tr><td colspan="1"></td></tr><tr><td style="width:100%;"></td></tr><tr><td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;border-top:1px solid #000000;border-bottom:1px s

In [40]:
ten_ks_by_ticker[example_ticker][0]['file'][-1000:]

'tical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;"><div style="overflow:hidden;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">&#160;</font></div></td><td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;"><div style="overflow:hidden;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">&#160;</font></div></td><td style="vertical-align:bottom;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;"><div style="overflow:hidden;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">&#160;</font></div></td></tr></table></div></div><div><br></div><div style="text-align:center;"><div style="line-height:120%;text-align:center;font-size:8pt;"><font style="font-family:Helvetica,sans-serif;font-size:8pt;">Apple Inc. | 2018 Form 10-K | </font><font style="font-family:Helvetica,sans-serif;font-size:8pt;">72</font></div></div>\t</body>\n</html>\n</TE

In [41]:
appl_10k_pos_dat = parse_sec_fillings.get_10k_risk_sections_df(
    text=ten_ks_by_ticker[example_ticker][0]['file'])

In [42]:
appl_10k_pos_dat

| item | start | end | next_start |
|---|---|---|---|
| item1a | 119131 | 119140 | 333326.0 |
| item7 | 333318 | 333326 | 729993.0 |
| item7a | 729984 | 729993 | 2405867.0 |


In [43]:
appl_10k_risk_sections = parse_sec_fillings.get_section_text(
    text=ten_ks_by_ticker[example_ticker][0]['file'],
    pos_dat=appl_10k_pos_dat)

In [44]:
appl_10k_risk_sections.keys()

dict_keys(['item1a', 'item7', 'item7a'])

In [45]:
appl_10k_risk_sections['item1a'][:1000]

'</font></div></td><td style="vertical-align:top;"><div style="line-height:120%;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;font-weight:bold;">Risk Factors</font></div></td></tr></table><div style="line-height:120%;padding-top:8px;text-align:justify;font-size:9pt;"><font style="font-family:Helvetica,sans-serif;font-size:9pt;">The following discussion of risk factors contains forward-looking statements. These risk factors may be important to understanding other statements in this Form 10-K. The following information should be read in conjunction with Part II, Item&#160;7, &#8220;Management&#8217;s Discussion and Analysis of Financial Condition and Results of Operations&#8221; and the consolidated financial statements and related notes in Part II, Item&#160;8, &#8220;Financial Statements and Supplementary Data&#8221; of this Form 10-K.</font></div><div style="line-height:120%;padding-top:16px;text-align:justify;font-size:9pt;"><font styl

In [47]:
from src.nlp_quant import bow_sent
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

word_pattern = re.compile(r'\w+')
wlm = WordNetLemmatizer()
lemma_english_stopwords = bow_sent.lemmatize_words(wlm, stopwords.words('english'))


# Clean html
appl_10k_item1a = bow_sent.clean_text(appl_10k_risk_sections['item1a'])
# word lemmatization
appl_10k_item1a_lemma = bow_sent.lemmatize_words(wlm, word_pattern.findall(appl_10k_item1a))
# Remove stopwords
appl_10k_item1a_lemma = [word for word in appl_10k_item1a_lemma if word not in lemma_english_stopwords]
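
The tokenizing and stop-word steps above depend on `bow_sent` and NLTK, whose internals are not shown here. The sketch below reproduces only the `\w+` tokenization and the stop-word filter with the standard library. The tiny `toy_stopwords` set is an assumption standing in for NLTK's English list, and lemmatization is left out.

```python
import re

word_pattern = re.compile(r'\w+')

# Tiny hand-written stop-word set (assumed; the notebook uses NLTK's English list)
toy_stopwords = {'the', 'of', 'a', 'by', 'can', 'be', 'and'}

def tokenize_without_stopwords(text: str) -> list:
    """Lower-case the text, split it into \\w+ tokens, and drop stop words."""
    tokens = word_pattern.findall(text.lower())
    return [token for token in tokens if token not in toy_stopwords]

sample = 'The business of the Company can be affected by a number of factors.'
print(tokenize_without_stopwords(sample))
```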

In [58]:
appl_10k_item1a[:1000]

'risk factorsthe following discussion of risk factors contains forward-looking statements. these risk factors may be important to understanding other statements in this form 10-k. the following information should be read in conjunction with part ii, item\xa07, “management’s discussion and analysis of financial condition and results of operations” and the consolidated financial statements and related notes in part ii, item\xa08, “financial statements and supplementary data” of this form 10-k.the business, financial condition and operating results of the company can be affected by a number of factors, whether currently known or unknown, including but not limited to those described below, any one or more of which could, directly or indirectly, cause the company’s actual financial condition and operating results to vary materially from past, or from anticipated future, financial condition and operating results. any of these factors, in whole or in part, could materially and adversely affec

In [60]:
appl_10k_item1a[-1000:]

'nd factors that may affect the comparability of the information presented below (in\xa0millions, except number of shares, which are reflected in thousands, and per share amounts).\xa02018\xa02017\xa02016\xa02015\xa02014net sales$265,595\xa0$229,234\xa0$215,639\xa0$233,715\xa0$182,795net income$59,531\xa0$48,351\xa0$45,687\xa0$53,394\xa0$39,510\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0earnings per share:\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0basic$12.01\xa0$9.27\xa0$8.35\xa0$9.28\xa0$6.49diluted$11.91\xa0$9.21\xa0$8.31\xa0$9.22\xa0$6.45\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0cash dividends declared per share$2.72\xa0$2.40\xa0$2.18\xa0$1.98\xa0$1.82\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0shares used in computing earnings per share:\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0basic4,955,377\xa05,217,242\xa05,470,820\xa05,753,421\xa06,085,572diluted5,000,109\xa05,251,692\xa05,500,281\xa05,793,069\xa06,122,663\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0total cash, cash equivalents and marketable securities$