
In this example we will scrape text data from the SEC's EDGAR system (Electronic Data Gathering, Analysis, and Retrieval).

We are looking for 10-K annual reports. For this example we will use Beautiful Soup and Requests.

In [None]:
import time
import re
import pandas as pd

from bs4 import BeautifulSoup
import requests

# Simple example

Understanding the structure of what we want to scrape will make our task easier so it helps to start with a bit of browsing.

[Searching for identifier](https://www.sec.gov/edgar/searchedgar/companysearch.html)

[Filing location](https://www.sec.gov/Archives/edgar/full-index/)

Here is a simple example:

In [None]:
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
# Obtain the HTML of a single filing document
base_url = 'https://www.sec.gov/Archives/edgar/data/0001318605/000095017022000796/tsla-20211231.htm'
edgar_resp = requests.get(base_url, headers=headers)
print(edgar_resp)
edgar_str = edgar_resp.text
soup = BeautifulSoup(edgar_str, 'html.parser')
soup.text

<Response [200]>




A successful request yields status code 200.

In some cases you will not be able to scrape a site without a [header](https://www.sec.gov/os/accessing-edgar-data), and sites often publish their access procedures. A rejected requests.get(url) call typically returns status code 403, whereas a successful one returns 200.
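
The status check described above can be sketched as a small helper. This is a minimal sketch, not the notebook's own method: `make_headers`, `fetch_text`, and the `get` parameter are hypothetical names introduced here, and the name and e-mail address are placeholders. It assumes, following the SEC's access guidance linked above, that the User-Agent should identify the requester instead of imitating a browser.

```python
def make_headers(contact_name, contact_email):
    """Build a User-Agent header that identifies the requester (placeholder values)."""
    return {"User-Agent": f"{contact_name} {contact_email}"}


def fetch_text(url, headers, get=None):
    """Return the response body, or raise if the status code is not 200."""
    if get is None:
        import requests  # imported lazily so the helper can be exercised offline
        get = requests.get
    resp = get(url, headers=headers)
    if resp.status_code != 200:
        # e.g. 403 when the server rejects the request
        raise RuntimeError(f"{url} returned status {resp.status_code}")
    return resp.text


# Example usage (performs a real request, so it is left commented out):
# headers = make_headers("Jane Doe", "jane.doe@example.com")
# html = fetch_text(base_url, headers)
```

Raising on any non-200 status makes a blocked request fail loudly at the point of the call, before later parsing steps receive an error page.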

# Extracting all filing urls from the Master Index

In [None]:
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
company = 'Apple Inc' #or CIK#
filing = '10-K'
year = '2020'
quarter = ['QTR1','QTR2','QTR3','QTR4']

In [None]:
#This requests the master index for one quarter and returns its lines as a list
def get_index(QTR):
  base_url = f'https://www.sec.gov/Archives/edgar/full-index/{year}/{QTR}/master.idx'
  edgar_resp = requests.get(base_url, headers=headers)
  print(base_url)
  print(edgar_resp)
  return edgar_resp.text.lower().split('\n')

From the master index we find all the relevant filings for the company we are interested in. Each entry is a long pipe-delimited string whose components let us rebuild the filing's URL. We isolate those components and assemble the URL.
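
To make the structure of an index entry concrete, here is a minimal sketch that parses pipe-delimited lines into named fields and filters them by company and form type. `IndexEntry`, `parse_index_line`, and `find_filings` are hypothetical names, and the sample rows are illustrative (the "Example Corp" row and the dates are made up); only the Apple file path is taken from this notebook's output.

```python
from typing import List, NamedTuple, Optional


class IndexEntry(NamedTuple):
    cik: str
    company: str
    form_type: str
    date_filed: str
    filename: str


def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Parse one 'CIK|Company Name|Form Type|Date Filed|Filename' line."""
    parts = line.strip().split('|')
    if len(parts) != 5 or not parts[0].isdigit():
        return None  # header, separator, or blank line
    return IndexEntry(*parts)


def find_filings(lines, company: str, form_type: str) -> List[IndexEntry]:
    """Keep entries whose company name contains `company` and whose form matches."""
    hits = []
    for line in lines:
        entry = parse_index_line(line)
        if (entry is not None
                and company.lower() in entry.company.lower()
                and entry.form_type.lower() == form_type.lower()):
            hits.append(entry)
    return hits


# Illustrative rows only; the second and third data rows are invented.
sample = [
    'CIK|Company Name|Form Type|Date Filed|Filename',
    '-' * 80,
    '320193|Apple Inc.|10-K|2020-10-30|edgar/data/320193/0000320193-20-000096.txt',
    '1234567|Example Corp|10-K|2020-02-14|edgar/data/1234567/0001234567-20-000001.txt',
    '1234567|Example Corp|8-K|2020-01-15|edgar/data/1234567/0001234567-20-000002.txt',
]
for entry in find_filings(sample, 'Apple Inc', '10-K'):
    print(entry.date_filed, entry.filename)
```

Comparing the form type against its own field, instead of searching the whole line, avoids accidental matches such as a company whose name happens to contain "10-K".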

In [None]:
#This loops over all quarters, assembles one large list, and retains only the relevant company
def fetch_filing_url():
  edgar_str = []
  for q in quarter:
    edgar_str = edgar_str + get_index(q)
    time.sleep(0.1)  # brief pause between requests
  url1 = None
  for item in edgar_str:
    if company.lower() in item:
      # Fields: CIK | company name | form type | date filed | file path
      url = item.strip().split('|')
      if url[2] == filing.lower():
        url1 = url[-1]  # keeps the last matching filing found
  if url1 is None:
    raise ValueError(f'No {filing} filing found for {company} in {year}')
  #Isolate URL elements for the .txt and .htm links
  url2 = url1.split('-')
  url2 = url2[0] + url2[1] + url2[2]
  url2 = (url2).split('.txt')[0]
  to_get_to_html_site = 'https://www.sec.gov/Archives/' + url1
  print(to_get_to_html_site)
  data = requests.get(to_get_to_html_site,headers=headers)
  print(data)
  # The text after the first FILENAME> tag is the name of the first document in the submission
  data = data.content.decode('utf-8').split('FILENAME>')
  data = data[1].split('\n')[0]
  url_to_use = 'https://www.sec.gov/Archives/' + url2 + '/' + data
  return url_to_use



Call the function and print the links and responses


In [None]:
filing_url = fetch_filing_url()
print(filing_url)

https://www.sec.gov/Archives/edgar/full-index/2020/QTR1/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR2/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR3/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR4/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt
<Response [200]>
https://www.sec.gov/Archives/edgar/data/320193/000032019320000096/aapl-20200926.htm
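
The last two URLs printed above differ in a predictable way: the folder that holds the individual documents is the accession number from the `.txt` path with its dashes removed. A minimal sketch of that mapping follows; `filing_folder`, `submission_url`, and `document_url` are hypothetical helper names, and the paths are the ones shown in the output above.

```python
BASE = 'https://www.sec.gov/Archives/'


def filing_folder(index_path: str) -> str:
    """Map 'edgar/data/<cik>/<accession>.txt' to 'edgar/data/<cik>/<accession without dashes>'."""
    directory, _, filename = index_path.rpartition('/')
    accession = filename.split('.txt')[0]  # e.g. '0000320193-20-000096'
    return f"{directory}/{accession.replace('-', '')}"


def submission_url(index_path: str) -> str:
    """URL of the full-text submission (.txt)."""
    return BASE + index_path


def document_url(index_path: str, document_name: str) -> str:
    """URL of one document inside the filing folder."""
    return f"{BASE}{filing_folder(index_path)}/{document_name}"


path = 'edgar/data/320193/0000320193-20-000096.txt'
print(submission_url(path))
print(document_url(path, 'aapl-20200926.htm'))
```

Removing dashes from the file name alone, and leaving the directory part untouched, keeps the helper independent of how many dashes appear elsewhere in the path.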


In [None]:
r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt',headers=headers)
raw_10k = r.text

Let's have a look

In [None]:
print(raw_10k[:1000])

<SEC-DOCUMENT>0000320193-18-000145.txt : 20181105
<SEC-HEADER>0000320193-18-000145.hdr.sgml : 20181105
<ACCEPTANCE-DATETIME>20181105080140
ACCESSION NUMBER:		0000320193-18-000145
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		88
CONFORMED PERIOD OF REPORT:	20180929
FILED AS OF DATE:		20181105
DATE AS OF CHANGE:		20181105

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			APPLE INC
		CENTRAL INDEX KEY:			0000320193
		STANDARD INDUSTRIAL CLASSIFICATION:	ELECTRONIC COMPUTERS [3571]
		IRS NUMBER:				942404110
		STATE OF INCORPORATION:			CA
		FISCAL YEAR END:			0930

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-36743
		FILM NUMBER:		181158788

	BUSINESS ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014
		BUSINESS PHONE:		(408) 996-1010

	MAIL ADDRESS:	
		STREET 1:		ONE APPLE PARK WAY
		CITY:			CUPERTINO
		STATE:			CA
		ZIP:			95014

	FORMER COMPANY:	
		FORMER CONFORMED NAME:	APPLE COMPUTER INC
		DATE OF NA
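
The header printed above is plain `KEY: value` text, so its fields can be read with simple string handling. The following is a minimal sketch under that assumption; `parse_header_fields` is a hypothetical helper, and it keeps only the first occurrence of a repeated key such as `STREET 1`.

```python
def parse_header_fields(header_text):
    """Collect 'KEY: value' pairs from a submission header, skipping tag lines and empty values."""
    fields = {}
    for line in header_text.splitlines():
        if line.lstrip().startswith('<'):
            continue  # SGML tag lines such as <SEC-DOCUMENT>
        key, sep, value = line.partition(':')
        key, value = key.strip(), value.strip()
        if sep and key and value:
            fields.setdefault(key, value)  # keep the first occurrence only
    return fields


# A few lines copied from the header shown above (tabs written as \t).
sample_header = (
    "<ACCEPTANCE-DATETIME>20181105080140\n"
    "ACCESSION NUMBER:\t\t0000320193-18-000145\n"
    "CONFORMED SUBMISSION TYPE:\t10-K\n"
    "FILER:\n"
    "\tCOMPANY DATA:\t\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLE INC\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0000320193\n"
)
fields = parse_header_fields(sample_header)
print(fields['CONFORMED SUBMISSION TYPE'], fields['COMPANY CONFORMED NAME'])
```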

Now that we have our links, we can clean the raw data by stripping out the noise and discarding unnecessary information.

In [None]:
def clean(raw_10k,item_sec):
  # Ordered section names, used to look up the item that follows item_sec
  item_list = ['item1a',
                'item1b',
                'item7',
                'item7a',
                'item8',
                'item1a',
                'item1b',
                'item7',
                'item7a',
                'item8'
                ]
  
  # Regex to find <DOCUMENT> tags
  doc_start_pattern = re.compile(r'<DOCUMENT>')
  doc_end_pattern = re.compile(r'</DOCUMENT>')
  # Regex to find the <TYPE> tag followed by any characters, terminating at a newline
  type_pattern = re.compile(r'<TYPE>[^\n]+')
  
  # Create 3 lists with the span indices for each regex
  
  ### There are many <DOCUMENT> tags in this text file, each holding a specific document or exhibit such as 10-K, EX-10.17, etc.
  ### The first filter gives us the end position of each opening <DOCUMENT> tag and the start position of each closing </DOCUMENT> tag
  ### We will use these later to grab the content in between the tags
  doc_start_is = [x.end() for x in doc_start_pattern.finditer(raw_10k)]
  doc_end_is = [x.start() for x in doc_end_pattern.finditer(raw_10k)]
  
  ### The type filter matches <TYPE> followed by one or more characters that are not a newline ([^\n]+),
  ### so each match stops at the end of the line. This gives us <TYPE> followed by the section name, e.g. '10-K'.
  ### findall returns a list of strings; the line below strips the leading '<TYPE>' from each one,
  ### leaving just the section names
  doc_types = [x[len('<TYPE>'):] for x in type_pattern.findall(raw_10k)]
  
  document = {}

  # Create a loop to go through each section type and save only the 10-K section in the dictionary
  for doc_type, doc_start, doc_end in zip(doc_types, doc_start_is, doc_end_is):
      if doc_type == '10-K':
          document[doc_type] = raw_10k[doc_start:doc_end]
      
  # Display an excerpt of the document
  #document['10-K'][0:500]

  # Write the regex
  regex = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})|(ITEM\s(1A|1B|7A|7|8))')
  
  # Use finditer to match the regex
  matches = regex.finditer(document['10-K'])

  # Write a for loop to print the matches
  #for match in matches:
  #   print(match)
  
  # Re-create the matches iterator (the loop above would exhaust it if uncommented)
  matches = regex.finditer(document['10-K'])
  
  # Create the dataframe
  test_df = pd.DataFrame([(x.group(), x.start(), x.end()) for x in matches])
  
  test_df.columns = ['item', 'start', 'end']
  test_df['item'] = test_df.item.str.lower()
  
  # Display the dataframe
  #test_df.head()

  # Get rid of unnecessary characters from the dataframe
  test_df.replace('&#160;',' ',regex=True,inplace=True)
  test_df.replace('&nbsp;',' ',regex=True,inplace=True)
  test_df.replace(' ','',regex=True,inplace=True)
  test_df.replace(r'\.','',regex=True,inplace=True)
  test_df.replace('>','',regex=True,inplace=True)
  
  # display the dataframe
  #test_df.head()

  # Drop duplicates
  pos_dat = test_df.sort_values('start', ascending=True).drop_duplicates(subset=['item'], keep='last')
  
  # Display the dataframe
  #pos_dat
  
  # Set item as the dataframe index
  pos_dat.set_index('item', inplace=True)
  
  # display the dataframe
  #pos_dat
  
  # Get Item 7
  #item_7_raw = document['10-K'][pos_dat['start'].loc['item7']:pos_dat['start'].loc['item7a']]
  
  # Get Item_sec raw content
  next_item_sec = item_list[item_list.index(item_sec)+1]
  item_sec_raw = document['10-K'][pos_dat['start'].loc[item_sec]:pos_dat['start'].loc[next_item_sec]]

  ### First convert the raw text we have extracted to a BeautifulSoup object 
  item_sec_content = BeautifulSoup(item_sec_raw, 'lxml')

  ### By just applying .prettify() we see that the raw text starts to look organized, as BeautifulSoup
  ### applies indentation according to the HTML tag tree structure


  ### Our goal, though, is to remove the HTML tags and see the content
  ### Method get_text() is what we need; the "\n\n" separator is optional, added only to make the text 
  ### easier to read: it puts blank lines between the extracted pieces of text.



  #print(item_sec_content.get_text("\n\n")[0:5000])
  return item_sec_content.get_text("\n\n")[0:5000]
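
The first step of `clean` is easier to follow on a tiny input. The sketch below applies the same three regexes to a made-up two-document submission; `split_documents` is a hypothetical helper name and the sample text is synthetic, not taken from a real filing.

```python
import re


def split_documents(raw):
    """Map each <TYPE> value to the text between its <DOCUMENT> and </DOCUMENT> tags."""
    starts = [m.end() for m in re.finditer(r'<DOCUMENT>', raw)]
    ends = [m.start() for m in re.finditer(r'</DOCUMENT>', raw)]
    types = [t[len('<TYPE>'):] for t in re.findall(r'<TYPE>[^\n]+', raw)]
    # If a type appears more than once, the later document overwrites the earlier one.
    return {t: raw[s:e] for t, s, e in zip(types, starts, ends)}


# Synthetic submission with a main document and one exhibit.
sample_submission = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>\nannual report body\n</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-21.1\n<TEXT>\nlist of subsidiaries\n</TEXT>\n</DOCUMENT>\n"
)
docs = split_documents(sample_submission)
print(list(docs))
```

The pairing relies on `zip`, so it assumes every `<DOCUMENT>` block contains exactly one `<TYPE>` line and that the tags appear in matching order.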

In [None]:
clean(raw_10k, 'item7')


'>Item 7.\n\nManagement’s Discussion and Analysis of Financial Condition and Results of Operations\n\nThis section and other parts of this Annual Report on Form 10-K (“Form 10-K”) contain forward-looking statements, within the meaning of the Private Securities Litigation Reform Act of 1995, that involve risks and uncertainties. Forward-looking statements provide current expectations of future events based on certain assumptions and include any statement that does not directly relate to any historical or current fact. Forward-looking statements can also be identified by words such as “future,”  “anticipates,”  “believes,”  “estimates,”  “expects,”  “intends,”  “plans,”  “predicts,”  “will,”  “would,”  “could,”  “can,”  “may,”  and similar terms. Forward-looking statements are not guarantees of future performance and the Company’s actual results may differ significantly from the results discussed in the forward-looking statements. Factors that might cause such differences include, but ar
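
Each item heading usually appears at least twice in a filing, first in the table of contents and again at the start of the section itself, which is presumably why `clean` keeps the last occurrence of each label. The sketch below reproduces that idea with the standard library only; `normalize`, `last_positions`, and `section` are hypothetical names, and the sample HTML is synthetic. It also shows why the output above begins with `>`: the regex match includes the `>` that closes the preceding tag.

```python
import re

ITEM_RE = re.compile(r'(>Item(\s|&#160;|&nbsp;)(1A|1B|7A|7|8)\.{0,1})|(ITEM\s(1A|1B|7A|7|8))')


def normalize(label):
    """Reduce a matched heading such as '>Item 7A.' to 'item7a'."""
    for junk in ('&#160;', '&nbsp;', ' ', '.', '>'):
        label = label.replace(junk, '')
    return label.lower()


def last_positions(text):
    """Return the start offset of the last occurrence of each item label."""
    positions = {}
    for m in ITEM_RE.finditer(text):
        positions[normalize(m.group())] = m.start()  # later matches overwrite earlier ones
    return positions


def section(text, item, next_item):
    """Slice the text from the last `item` heading up to the last `next_item` heading."""
    pos = last_positions(text)
    return text[pos[item]:pos[next_item]]


# Synthetic document: a table of contents followed by the actual sections.
sample_doc = (
    "<td>Item 7.</td><td>30</td><td>Item 7A.</td><td>45</td>"
    "<p>Item 7. Discussion text.</p><p>Item 7A. Risk text.</p>"
)
print(section(sample_doc, 'item7', 'item7a'))
```

This heuristic can fail if a heading is repeated after the section body, for example in a cross-reference, so the extracted text is worth checking by eye.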

Now that we know how to extract specific sections, let's loop over all the companies contained in a list.

In [None]:
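company_list = {'tesla':'', 'apple inc':'', 'twitter':''}
for name in company_list:
  # Note: fetch_filing_url() reads the global `company`, not `name`
  filing_url = fetch_filing_url()
  # Note: this URL is hard-coded to one Apple submission, so every entry receives the same text
  r = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/0000320193-18-000145.txt',headers=headers)
  raw_10k = r.text
  company_list[name] = clean(raw_10k, 'item7')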

print(company_list['apple inc'][:5000])

https://www.sec.gov/Archives/edgar/full-index/2020/QTR1/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR2/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR3/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR4/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR1/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR2/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR3/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR4/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR1/master.idx
<Response [200]>
https://www.sec.gov/Archives/edgar/full-index/2020/QTR2/master.idx

We now have 