# Capstone Project Data Gathering: Annual Report Extraction

## Scraping SEC-EDGAR Website to Get Annual Reports 

### Step 1: Get the Meta Index from SEC API

In [21]:
# ! pip install sec-api
import requests

from bs4 import BeautifulSoup
import html5lib
import re
import pandas as pd

import json
# json_normalize is used below as pd.json_normalize (the pandas.io.json import path is deprecated)


In [2]:
API = 'YOUR_SEC_API_KEY'  # placeholder: use your own sec-api.io key and keep it out of shared notebooks

Query API: https://api.sec-api.io
Stream API: https://api.sec-api.io:3334/all-filings
Full-Text Search API: https://api.sec-api.io/full-text-search
XBRL-to-JSON Converter API: https://api.sec-api.io/xbrl-to-json
Filing Render API: https://api.sec-api.io/filing-reader

In [55]:
from sec_api import QueryApi

queryApi = QueryApi(api_key=API)

query = {
  "query": { "query_string": { 
      "query": "ticker: NVDA AND filedAt:{1990-01-01 TO 2021-12-31} AND formType:\"10-K\"" 
    } },
  "from": "0",
  "size": "200",
  "sort": [{ "filedAt": { "order": "desc" } }]
}

filings = queryApi.get_filings(query)

print(filings)



{'total': {'value': 23, 'relation': 'eq'}, 'query': {'from': 0, 'size': 200}, 'filings': [{'id': '9c8fb5c7f22482d589640f027647f4b0', 'accessionNo': '0001045810-21-000010', 'cik': '1045810', 'ticker': 'NVDA', 'companyName': 'NVIDIA CORP', 'companyNameLong': 'NVIDIA CORP (Filer)', 'formType': '10-K', 'description': 'Form 10-K - Annual report [Section 13 and 15(d), not S-K Item 405]', 'filedAt': '2021-02-26T17:03:14-05:00', 'linkToTxt': 'https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/0001045810-21-000010.txt', 'linkToHtml': 'https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/0001045810-21-000010-index.htm', 'linkToXbrl': '', 'linkToFilingDetails': 'https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/nvda-20210131.htm', 'entities': [{'companyName': 'NVIDIA CORP (Filer)', 'cik': '1045810', 'irsNo': '943177549', 'stateOfIncorporation': 'DE', 'fiscalYearEnd': '0131', 'type': '10-K', 'act': '34', 'fileNo': '000-23985', 'filmNo': '21690665', 's
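
The query payload above can be parameterised so the same search runs for other tickers or date ranges. The sketch below is a hypothetical helper (`build_10k_query` is not part of `sec-api`); it only builds the dictionary, reusing the field names from the query above, and the result would still be passed to `queryApi.get_filings`.

```python
def build_10k_query(ticker, start_date, end_date, size=200):
    """Return a Query API payload for one ticker's 10-K filings.

    Hypothetical helper: the field names mirror the query used above.
    """
    lucene = (
        f'ticker:{ticker} AND filedAt:{{{start_date} TO {end_date}}} '
        f'AND formType:"10-K"'
    )
    return {
        "query": {"query_string": {"query": lucene}},
        "from": "0",
        "size": str(size),
        "sort": [{"filedAt": {"order": "desc"}}],
    }


query = build_10k_query("NVDA", "1990-01-01", "2021-12-31")
print(query["query"]["query_string"]["query"])
```

Keeping the payload in one function makes it easier to loop over a list of tickers without copying the dictionary by hand.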

from sec_api import FullTextSearchApi

fullTextSearchApi = FullTextSearchApi(api_key=API)

query = {
  "query": '"Advanced Micro Devices"',
  "formTypes": ['8-K', '10-Q', '10-K'],
  "startDate": '2021-01-01',
  "endDate": '2021-06-14',
}

filings = fullTextSearchApi.get_filings(query)

#print(filings)


In [56]:
p1 = filings['filings']

urls = []
for item in p1:
    urls.append(item['linkToFilingDetails'])

In [57]:
df = pd.json_normalize(p1)

In [58]:
df["linkToHtml"][0]

'https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/0001045810-21-000010-index.htm'

### The URL list now contains the links to all annual reports of the specified company

In [59]:
urls = df['linkToTxt']

In [60]:
urls[0]

'https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/0001045810-21-000010.txt'

### Step 2: Obtain an annual report from SEC
SEC EDGAR throttles automated traffic: requests that arrive too quickly, or without a User-Agent header identifying the requester, may be blocked for a while.
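
One way to reduce the chance of being blocked is to declare a User-Agent and pause between downloads. The sketch below uses only the standard library so it runs without extra packages; `USER_AGENT`, `MIN_INTERVAL`, `build_request`, and `fetch_all` are illustrative names and values, not part of the notebook. The same header can be passed to `requests.get(url, headers={"User-Agent": ...})`.

```python
import time
import urllib.request

# Assumption: replace with your own name and e-mail address.
USER_AGENT = "Capstone Project jane.doe@example.com"
MIN_INTERVAL = 0.2  # seconds between requests; an assumed, conservative pause


def build_request(url, user_agent=USER_AGENT):
    """Build a request that declares who is asking."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_all(urls, pause=MIN_INTERVAL):
    """Download each URL in turn, sleeping between requests."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            pages.append(resp.read().decode("utf-8", errors="replace"))
        time.sleep(pause)
    return pages


# Building the request does not touch the network; fetch_all() does.
req = build_request(
    "https://www.sec.gov/Archives/edgar/data/1045810/"
    "000104581021000010/0001045810-21-000010.txt"
)
print(req.get_header("User-agent"))
```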

In [61]:
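# The User-Agent value is a placeholder: use your own name and e-mail address
ar1 = requests.get(urls[0], headers={'User-Agent': 'Your Name your.email@example.com'}).text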

In [62]:
print(ar1[:3000])

<SEC-DOCUMENT>0001045810-21-000010.txt : 20210226
<SEC-HEADER>0001045810-21-000010.hdr.sgml : 20210226
<ACCEPTANCE-DATETIME>20210226170314
ACCESSION NUMBER:		0001045810-21-000010
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		107
CONFORMED PERIOD OF REPORT:	20210131
FILED AS OF DATE:		20210226
DATE AS OF CHANGE:		20210226

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			NVIDIA CORP
		CENTRAL INDEX KEY:			0001045810
		STANDARD INDUSTRIAL CLASSIFICATION:	SEMICONDUCTORS & RELATED DEVICES [3674]
		IRS NUMBER:				943177549
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			0131

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	000-23985
		FILM NUMBER:		21690665

	BUSINESS ADDRESS:	
		STREET 1:		2788 SAN TOMAS EXPRESSWAY
		CITY:			SANTA CLARA
		STATE:			CA
		ZIP:			95051
		BUSINESS PHONE:		408-486-2000

	MAIL ADDRESS:	
		STREET 1:		2788 SAN TOMAS EXPRESSWAY
		CITY:			SANTA CLARA
		STATE:			CA
		ZIP:			95051

	FORMER COMPANY:	
		FORMER CONFORMED NAME:	NV

In [63]:
soup = BeautifulSoup(ar1, 'html.parser')

In [64]:
print(soup.head.title.string)

nvda-20210131


### Step 3: Use Regex to obtain the documents part

### The following part is adapted from
https://gist.github.com/anshoomehra/ead8925ea291e233a5aa2dcaa2dc61b2

In [65]:
# Regex to find <DOCUMENT> tags
doc_start_pattern = re.compile(r'<DOCUMENT>')
doc_end_pattern = re.compile(r'</DOCUMENT>')
# Regex to find a <TYPE> tag followed by any characters, terminating at the newline
type_pattern = re.compile(r'<TYPE>[^\n]+')

In [66]:
raw_10k = ar1

In [67]:
# Create three lists holding the span indices for each regex

### The text file contains many <DOCUMENT> tags, one per exhibit (10-K, EX-10.17, etc.).
### The first two patterns give the end index of each opening <DOCUMENT> tag and the
### start index of each closing </DOCUMENT> tag; the content lies between those indices.
doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]

### The type pattern matches <TYPE> followed by one or more non-newline characters ([^\n]+),
### so each match is '<TYPE>' plus a section name such as '10-K'.
### findall returns the matches as a list of strings; slicing off the '<TYPE>' prefix
### leaves just the section names.
doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]

Create a Dictionary for the 10-K

In the code below, we create a dictionary with the key 10-K whose value is the contents of the 10-K section found above. To do this, we loop through all the sections found above and, if the section type is 10-K, save it to the dictionary. The indices in doc_start_is and doc_end_is are used to slice the raw_10k file.

In [68]:
document = {}

# Create a loop to go through each section type and save only the 10-K section in the dictionary
for doc_type, doc_start, doc_end in zip(doc_types, doc_start_is, doc_end_is):
    if doc_type == '10-K':
        document[doc_type] = raw_10k[doc_start:doc_end]
# Display an excerpt of the document
document['10-K'][0:500]

'\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>nvda-20210131.htm\n<DESCRIPTION>FY2021 10-K\n<TEXT>\n<XBRL>\n<?xml version="1.0" ?><!--XBRL Document Created with Wdesk from Workiva--><!--Copyright 2021 Workiva--><!--r:29cdbe28-ec04-4c19-8ba8-ce8d03a2d73f,g:12419e0b-2dd0-40ae-8bac-6c832405a298,d:ad3cb7415e124471a81b6a191fb95bed--><html xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:nvda="http://www.nvidia.com/20210131" xmlns:xbrli="http://www.xbrl.org/2003/instance"'
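
To see the splitting logic in isolation, the sketch below applies the same three patterns to a tiny synthetic string (not real SEC data). Unlike the cell above, it keeps every section type in the dictionary rather than only the 10-K, which is an illustrative choice.

```python
import re

# Synthetic stand-in for a full-submission text file (not real SEC data).
raw = (
    "<SEC-DOCUMENT>demo\n"
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>annual report body</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>subsidiaries</TEXT>\n</DOCUMENT>\n"
)

# End of each opening tag and start of each closing tag bracket the content.
starts = [m.end() for m in re.finditer(r"<DOCUMENT>", raw)]
ends = [m.start() for m in re.finditer(r"</DOCUMENT>", raw)]
# Section names are whatever follows <TYPE> up to the end of the line.
types = [t[len("<TYPE>"):] for t in re.findall(r"<TYPE>[^\n]+", raw)]

sections = {t: raw[s:e] for t, s, e in zip(types, starts, ends)}
print(types)
print(repr(sections["10-K"]))
```

Note that `<SEC-DOCUMENT>` does not match the `<DOCUMENT>` pattern, so the header line is not mistaken for a section.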

### Step 4: Apply regexes to find Items 1, 1A, 7, and 7A in the 10-K section

In [69]:
# Write the regex
regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1\.|1A|1B|7A|7|8)\.{0,1})|(ITEM\s(1\.|1A|1B|7A|7|8))')

# Use finditer to match the regex
matches = regex.finditer(document['10-K'])

# Write a for loop to print the matches
for match in matches:
    print(match)

<re.Match object; span=(181523, 181531), match='>Item 1.'>
<re.Match object; span=(183446, 183455), match='>Item 1A.'>
<re.Match object; span=(185375, 185384), match='>Item 1B.'>
<re.Match object; span=(197961, 197969), match='>Item 7.'>
<re.Match object; span=(199985, 199994), match='>Item 7A.'>
<re.Match object; span=(201977, 201985), match='>Item 8.'>
<re.Match object; span=(236245, 236252), match='ITEM 1.'>
<re.Match object; span=(323459, 323466), match='ITEM 1A'>
<re.Match object; span=(445321, 445328), match='ITEM 1B'>
<re.Match object; span=(474747, 474753), match='ITEM 7'>
<re.Match object; span=(621724, 621731), match='ITEM 7A'>


In [70]:
document['10-K'][4020613:4027783]

''

In [71]:
# Matches
matches = regex.finditer(document['10-K'])

# Create the dataframe
test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])

test_df.columns = ['item', 'start', 'end']
test_df['item'] = test_df.item.str.lower()

# Get rid of unnecessary characters from the dataframe
test_df.replace('&#160;',' ',regex=True,inplace=True)
test_df.replace('&nbsp;',' ',regex=True,inplace=True)
test_df.replace(' ','',regex=True,inplace=True)
test_df.replace(r'\.','',regex=True,inplace=True)
test_df.replace('>','',regex=True,inplace=True)

# display the dataframe
test_df


     item   start     end
0   item1  181523  181531
1  item1a  183446  183455
2  item1b  185375  185384
3   item7  197961  197969
4  item7a  199985  199994
5   item8  201977  201985
6   item1  236245  236252
7  item1a  323459  323466
8  item1b  445321  445328
9   item7  474747  474753


In [72]:
# Drop duplicates
pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
pos_dat.set_index('item', inplace=True)

# Display the dataframe
pos_dat

         start     end
item
item8   201977  201985
item1   236245  236252
item1a  323459  323466
item1b  445321  445328
item7   474747  474753
item7a  621724  621731


In [73]:
# Get Item 1
item_1_raw = document['10-K'][pos_dat['start'].loc['item1']:pos_dat['start'].loc['item1a']]

# Get Item 1A
item_1a_raw = document['10-K'][pos_dat['start'].loc['item1a']:pos_dat['start'].loc['item1b']]

# Get Item 7
item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]

# Get Item 7A
# Note: in pos_dat above, the only item8 match (201977) is the table-of-contents entry,
# which precedes item7a (621724), so this slice is empty for this filing.
item_7a_raw = document['10-K'][pos_dat['start'].loc['item7a']:pos_dat['start'].loc['item8']]
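
The "keep the last match per item, then slice to the next heading" idea can also be sketched without pandas. The example below runs on synthetic text (a mock table of contents followed by a mock body) and swaps the dataframe for a plain dictionary; the simplified pattern and all names are illustrative assumptions. Each section is cut at the next surviving heading in document order, so a slice cannot come out empty.

```python
import re

# Synthetic text: a table of contents followed by the body (not real filing data).
text = (
    ">Item 1.</a> >Item 1A.</a> >Item 7.</a> "
    "ITEM 1. BUSINESS alpha "
    "ITEM 1A. RISK FACTORS beta "
    "ITEM 7. MD&A gamma"
)
pattern = re.compile(r"(>Item\s(1A|1|7)\.?)|(ITEM\s(1A|1|7))")

# Normalise each label (e.g. "item1a") and let later matches overwrite earlier
# ones, which mirrors drop_duplicates(keep='last') on a start-sorted frame.
last_start = {}
for m in pattern.finditer(text):
    label = re.sub(r"[^0-9a-z]", "", m.group().lower())
    last_start[label] = m.start()

# Order the surviving headings by position, then cut each section at the
# next heading (or at the end of the text for the final one).
ordered = sorted(last_start.items(), key=lambda kv: kv[1])
bounds = ordered[1:] + [("end", len(text))]
item_sections = {
    label: text[start:stop]
    for (label, start), (_, stop) in zip(ordered, bounds)
}
print(item_sections["item1a"])
```

If a heading is matched only in the table of contents, it sorts ahead of the body headings here, and the final body section simply runs to the end of the text.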


In [74]:
print(item_1_raw)

ITEM 1. BUSINESS</span></div><div style="margin-bottom:3pt;margin-top:9pt;text-align:justify"><span style="color:#76b900;font-family:'DIN Next LT Pro Medium',sans-serif;font-size:10pt;font-weight:700;line-height:120%">Our Company</span></div><div style="margin-bottom:9pt;text-align:justify"><span style="color:#000000;font-family:'DIN Next LT Pro Light',sans-serif;font-size:10pt;font-weight:400;line-height:120%">NVIDIA pioneered accelerated computing to help solve the most challenging computational problems. Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields. Fueled by the sustained demand for exceptional 3D graphics and the scale of the gaming market, NVIDIA has leveraged its GPU architecture to create platforms for scientific computing, artificial intelligence, or AI, data science, autonomous vehicles, or AV, robotics, and augmented and virtual reality, or AR and VR.</span></div><div style="margin-bottom:9pt;t

### Step 5: Apply BeautifulSoup to refine the content

In [76]:
### First, convert the raw text we extracted into a BeautifulSoup object
item_1_content = BeautifulSoup(item_1_raw, 'lxml')
print(item_1_content.prettify()[0:1500])

<html>
 <body>
  <p>
   ITEM 1. BUSINESS
  </p>
  <div style="margin-bottom:3pt;margin-top:9pt;text-align:justify">
   <span style="color:#76b900;font-family:'DIN Next LT Pro Medium',sans-serif;font-size:10pt;font-weight:700;line-height:120%">
    Our Company
   </span>
  </div>
  <div style="margin-bottom:9pt;text-align:justify">
   <span style="color:#000000;font-family:'DIN Next LT Pro Light',sans-serif;font-size:10pt;font-weight:400;line-height:120%">
    NVIDIA pioneered accelerated computing to help solve the most challenging computational problems. Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields. Fueled by the sustained demand for exceptional 3D graphics and the scale of the gaming market, NVIDIA has leveraged its GPU architecture to create platforms for scientific computing, artificial intelligence, or AI, data science, autonomous vehicles, or AV, robotics, and augmented and virtual reality, or AR a

In [79]:
### Our goal, though, is to remove the HTML tags and see the content.
### get_text() does exactly that. The "\n\n" separator is optional; it is added here
### to insert blank lines between sections so the text reads more cleanly.
print(item_1_content.get_text("\n\n"))


ITEM 1. BUSINESS

Our Company

NVIDIA pioneered accelerated computing to help solve the most challenging computational problems. Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields. Fueled by the sustained demand for exceptional 3D graphics and the scale of the gaming market, NVIDIA has leveraged its GPU architecture to create platforms for scientific computing, artificial intelligence, or AI, data science, autonomous vehicles, or AV, robotics, and augmented and virtual reality, or AR and VR.

The GPU was initially used to simulate human imagination, enabling the virtual worlds of video games and films. Today, it also simulates human intelligence, enabling a deeper understanding of the physical world. Its parallel processing capabilities, supported by up to thousands of computing cores, are essential to running deep learning algorithms. This form of AI, in which software writes itself by learning from data, can