# Parsing Company 10-Ks From the SEC

# Import the libraries

This module requires five libraries: requests for making the URL requests; bs4 for parsing the files and content; pandas for giving our cleaned data structure; re for regular expressions; and urllib for fetching, reading, opening, and downloading from URLs.

In [1]:
# import our libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import urllib.request  # urlretrieve lives in the urllib.request submodule

# Web Scraping the SEC Query Page

# Section One: Define the Parameters of the Search
To create a search, we need to build a URL that takes us to a valid results page. This means taking our base endpoint and attaching parameters that narrow down the search. I'll do my best to explain how each of these parameters works, but unfortunately there is no formal documentation for them.

Endpoint: The endpoint for our EDGAR query is https://www.sec.gov/cgi-bin/browse-edgar. If you go to this link without any additional parameters, the request is invalid.

# Parameters

action: (required) By default should be set to getcompany.
CIK: (required) The CIK number of the company you are searching for.
type: (optional) Filters by form type. For example, if set to 10-K, only the 10-K filings are returned.
dateb: (optional) Returns only the filings before a given date. The format is YYYYMMDD.
datea: (optional) Used in the code below as the start of the date range, in the same YYYYMMDD format.
owner: (required) Specifies ownership and is set to exclude by default. You may also set it to include or only.
start: (optional) The starting index of the results. For example, if I have 100 results but want to start at 45 of 100, I would pass 45.
state: (optional) The company's state.
filenum: (optional) The filing number.
sic: (optional) The company's SIC (Standard Industrial Classification) identifier.
output: (optional) Defines the returned data structure as either xml (atom) or normal html.
count: (optional) The number of results you want to see with your request. The maximum is 100, and if not set it defaults to 40.

Now that we understand all the parameters, let's make a request by defining our endpoint and then a dictionary of our parameters, where each key is a parameter name and each value is the value we want to set for that parameter. Once we've defined these two components, we can make our request and parse the response using BeautifulSoup.
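
As a rough illustration of how the endpoint and the parameters combine, the sketch below builds a query URL with the standard library only and makes no network request. The helper name build_query_url and every parameter value (including the CIK) are placeholders chosen for this example, not values from the notebook.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_query_url(params):
    """Join the endpoint and a parameter dictionary into one query URL."""
    return ENDPOINT + "?" + urlencode(params)


# Placeholder values for illustration only.
example_params = {
    "action": "getcompany",
    "CIK": "1234567",
    "type": "10-K",
    "dateb": "20190101",
    "owner": "exclude",
    "count": "100",
}

example_url = build_query_url(example_params)
print(example_url)
```

Passing the same dictionary through the params argument of requests.get builds the query string in much the same way, so the dictionary form is usually easier to read and change than a hand-assembled URL.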

In [None]:
# base URL for the SEC EDGAR browser
endpoint = r"https://www.sec.gov/cgi-bin/browse-edgar"
base_url= r"https://www.sec.gov"


#Adding Input Box to enter the SIC Code as per client's needs 
listx = input("Enter SIC number(s) separated by space: ")
x = listx.split()
x.sort()
print("SIC numbers: ", x)


#INPUT BOX FOR 10-K and 10-Q as per client's need
listflg = input("Enter the type of filing: ")
type = listflg.split()
print(type," filing")


# Input for the time period of the required filings
yeari = input("Starting period (YYYYMMDD): ")
yearf = input("Ending period (YYYYMMDD): ")

#Looping for the given SIC code to extract the list of companies and their CIK Codes
try:
    for i in range(0,len(x)):
        # Define our parameters dictionary.
        # Parameters: action, owner, SIC code, and count (100 shows 100 companies per page).
        param_dict = {'action':'getcompany',
                      'owner':'exclude',
                      'SIC': x[i],
                      'count': '100'
                     }
        try:
            y = x[i]
            print("SIC Code- " + y)
            # request the url, and then parse the response.
            response = requests.get(url = endpoint, params = param_dict)
            soup = BeautifulSoup(response.content, "html.parser")

            # Let the user know it was successful.
            print('List of Companies from SIC Code '+x[i]+' below-')
            print(response.url)
            print('Request Successful')
            print("-"*110)
            company_links= []
            company_names=[]

            # Parsing the next page:
            # a broad search such as a whole SIC code can return more than 100 entries,
            # so the results are spread over several pages. While the page still contains
            # a button input (the check used by the loop below), we request the next
            # 100 results by increasing `start`.

            start = 100
            first_page= [ base_url + tr.td.a['href'] for tr in soup.find_all('tr')[1:] ]

            # Add the first page of links, skipping duplicates.
            for j in first_page:
                if j not in company_links:
                    company_links.append(j)

            while soup.find_all('input',{'type':'button'})!= []:

                # Build the link to the next page of results; `start` is the result offset.
                str1 = '/cgi-bin/browse-edgar?action=getcompany&SIC='
                str2 = '&owner=include&match=&start='
                str3 = '&count=100&hidefilings=0'

                next_page_link = ''.join(map(str, (base_url, str1, y, str2, start, str3)))
                start = start + 100

                # Request the next page.
                response = requests.get(url = next_page_link)
                soup = BeautifulSoup(response.content, 'html.parser')
                more_companies_list = [ base_url + tr.td.a['href'] for tr in soup.find_all('tr')[1:] ]
                if not more_companies_list:
                    # Stop if a page comes back without company rows, to avoid looping forever.
                    break
                # Add only the links we have not seen yet.
                for k in more_companies_list:
                    if k not in company_links:
                        company_links.append(k)

            # Remove any repeated companies, which can appear on the first and last pages,
            # so that the same files are not downloaded more than once.
            final_companies = [] 
            for f in company_links: 
                if f not in final_companies: 
                    final_companies.append(f)


        # Get the name of the company in consideration and append it to the list of all companies.

            all_company_names=[]

            for company_link in final_companies:
                company_page=requests.get(company_link)


                soup= BeautifulSoup(company_page.content, 'html.parser')
                company_name=soup.find('span',{'class':'companyName'}).text
                print(company_name)
                if company_name not in all_company_names: 
                    all_company_names.append(company_name)


                print('Below is the link to all your selected Filings')
                for o in range(0, len(type)):

                    # define our parameters dictionary
                    param_dict = {'type' : type[o],
                                  'dateb': yearf,
                                  'datea': yeari,
                                  }
                    #Above parameters specifies only the format and time-period of filings as required by customer.

                    #This block of code will print the company-wise page-url but 
                    #with the parameters of format and time-period incorporated.
                    # request the url, and then parse the response.

                    filing_page = requests.get(url = company_link.replace('&hidefilings=0','').replace('&count=40',''), params = param_dict)
                    soup = BeautifulSoup(filing_page.content, 'html.parser')
                    print(filing_page.url)

                    r= soup.find_all('a', {'id' : 'interactiveDataBtn'}) 
                    document_links= [ base_url + a['href'] for a in r ]
                    print('Below are links to your specified form type')


                    a=0
                    for document_link in document_links:
                        document_page=requests.get(document_link)
                        soup = BeautifulSoup(document_page.content, 'html.parser')
                        print(document_page.url)


                        # Download the Excel file of the specified filing type.
                        # The second 'xbrlviewer' link is taken to be the Excel download.
                        download_file = [base_url + a['href'] for a in soup.find_all('a', {'class': "xbrlviewer"})[:2]]
                        if len(download_file) < 2:
                            # No Excel link on this page; skip this filing.
                            continue
                        final_download_file = download_file[1]

                        # Build a file name from the company name and drop characters that are unsafe in file names.
                        outfile_name = company_name + " " + str(a) + ".xls"
                        outfile_name = outfile_name.replace(',','').replace('#','').replace(':','').replace('(see all company filings)','').replace('/','')
                        print(outfile_name)
                        print(final_download_file)
                        print('-'*50)
                        urllib.request.urlretrieve(final_download_file, outfile_name)
                        a = a + 1

                   
                print("-"*110)
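        except Exception as exc:
            # Report the failure for this SIC code instead of silently skipping it.
            print("Skipping SIC code " + str(x[i]) + ": " + repr(exc))
except Exception as exc:
    print("Search stopped early: " + repr(exc))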
print("SEARCH AND EXTRACTION COMPLETE.")
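
One practical caveat when running the cell above: the SEC's published fair-access guidance asks automated clients to identify themselves with a User-Agent header and to keep request rates modest, and requests that do not may be refused. The sketch below shows one possible way to attach such a header using only the standard library. The contact string, the HEADERS dictionary, and the polite_pause helper are placeholders introduced for this example and are not part of the original notebook.

```python
import time
import urllib.request

# Placeholder identity: replace with your own name and contact address.
USER_AGENT = "Example Researcher researcher@example.com"

# Install a global opener so that urllib.request.urlopen and
# urllib.request.urlretrieve send the header on every request.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", USER_AGENT)]
urllib.request.install_opener(opener)

# The same header in the dictionary form accepted by requests.get(..., headers=HEADERS).
HEADERS = {"User-Agent": USER_AGENT}


def polite_pause(seconds=0.2):
    """Sleep briefly between requests to keep the request rate low."""
    time.sleep(seconds)
```

In the cell above, this would mean passing headers=HEADERS to each requests.get call and calling polite_pause() inside the loops. The 0.2 second default is an arbitrary choice for this sketch.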

In [None]:
#See a list of all companies from one SIC Code.
# In case of multiple SIC inputs, this is the list of companies for the last SIC code, in the sorted list.

for i in range(len(all_company_names)):
    all_company_names[i] = all_company_names[i].replace('(see all company filings)','')
    print(all_company_names[i])
print('-'*50)

## Downloading the list of companies into excel file
The following block of code is a small exercise in downloading the list shown above as an Excel file. It saves only the companies under one SIC number; if several SIC numbers were entered, the companies under the last one (in sorted order) are saved. To get every list, enter each SIC number individually and download the lists one after another.

The Excel file has to be processed before it can be used as raw data. Select the column with the list and click the "Data" tab in Excel. Choose "Text to Columns", select "Delimited", and click Next. Under "Delimiters", enter ":" in the "Other" box and click Next. In the "Data Preview" area, click the second column, which holds the CIK numbers (it turns black when selected), select "Text" under "Column Data Format", and click Finish.

To clean the company column, press Ctrl+F or open the "Find and Replace" dialog. Search for "CIK#" and press "Find All". In the "Replace" tab, leave the replacement box empty and press "Replace All".

You now have the companies with their corresponding CIK numbers as usable data.

The files are saved in your current working directory (the one in which the Jupyter notebook is running).
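
The manual clean-up described above could also be done in Python before the file is written. The sketch below is one possible approach, assuming each entry has the "NAME CIK#: NUMBER" shape that the steps above rely on. The sample entries, the output file name company_list.csv, and the helper split_name_and_cik are made up for illustration.

```python
import csv
import re

# Made-up entries in the "NAME CIK#: NUMBER" shape described above.
sample_names = [
    "EXAMPLE PAINT CO CIK#: 0000123456 (see all company filings)",
    "SAMPLE COATINGS INC CIK#: 0000654321 ",
]

# Capture the name before "CIK#:" and the digits after it.
PATTERN = re.compile(r"^(?P<name>.*?)\s*CIK#:\s*(?P<cik>\d+)")


def split_name_and_cik(entry):
    """Return (company name, CIK string), or (entry, "") if the pattern is absent."""
    match = PATTERN.search(entry)
    if match is None:
        return entry.strip(), ""
    return match.group("name").strip(), match.group("cik")


rows = [split_name_and_cik(entry) for entry in sample_names]

# Write a two-column CSV; the CIK is kept as a string so leading zeros survive in the file.
with open("company_list.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Company", "CIK"])
    writer.writerows(rows)
```

A spreadsheet program may still drop the leading zeros when it opens the CSV unless the CIK column is imported as text, so the "Text" column format step remains relevant.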

In [None]:
#Uncomment the following segment of code to download the list of companies as an MS Excel file.

# import xlwt
# book = xlwt.Workbook()
# sheet1 = book.add_sheet('sheet1')

# # Write each company name into the second column (column index 1).
# for i, e in enumerate(all_company_names):
#     sheet1.write(i, 1, e)

# name = "Company_List_2890.xls"
# book.save(name)

## Closing Remarks
The EDGAR query system allows us to quickly filter the companies we want to grab filings for and makes finding the forms we need intuitive. With our knowledge of Python and the request system that EDGAR uses, we can gain access to a tremendous amount of financial data that is free for public use.