<b>This project focuses on writing a function that will open and read the SEC url. This will allow me to access the SEC server and count the number of filings given the three arguments of the function: year, quarter and pattern. OS, RE, urllib modules are used. Regular Expressions are also used to test the function. 

<b>Project 1: </b>Write a function <b>SEC_filing_count()</b> that has three arguments: $year$, $quarter$, and $pattern$. The goal is to count the number of filings in the SEC master.idx for that year and quarter for which the 'Company Name' entry satisfies the regular expression pattern.  This is perhaps best illustrated with an example.

If I enter <b>SEC_filing_count(2012, 3, '\d\d')</b>, then I want the function to tell me how many of the Company Names in the 2012Q3 index contain two consecutive digits.

- If ./Projects/master_2012Q3.txt does not already exist, you should use <b>urllib()</b> to 'try' to download 'https://www.sec.gov/Archives/edgar/full-index/2012/QTR3/master.idx' to './Projects/master_2012_Q3.txt'. (I recommend using absolute paths, but not hardcoding the absolute paths. For example, after you have used $year$ and $quarter$ to construct $destfile$, you can write something like: 'url_destfile = os.path.abspath(os.path.join('./', destfile))'.)
- If './Projects/master_2012_Q3.txt' already exists, you should NOT ask <b>urllib()</b> to download it again. We want to be nice to the SEC servers.
- When reading './Projects/master_2012Q3.txt' into Python as a list, you are welcome to create a list that only contains the 'Company Name' entry from each line. 
- Regardless of how much information you choose to retain from each line, please convert all letters in 'Company Name' to uppercase before performing the regex search.
- In the example above, you would then use either <b>re.search()</b> or <b>findall()</b> to identify all company names that contain at least two consecutive digits.
- If the number of regex matches is positive, return this number.
- If there are no matches, return the number zero.
- If the user requests an index that does not exist, such as <b>SEC_filing_count(1776, 4, '\d\d')</b>, return the number -1.

In [1]:
import os
import re
import urllib.request

In [2]:
def SEC_filing_count(year, quarter, pattern): 
    os.getcwd() 
    absdir = os.getcwd() 
    os.chdir(absdir) 
    frompath = 'https://www.sec.gov/Archives/edgar/full-index/' + str(year) + '/QTR' + str(quarter) + '/master.idx'
   
    destpath = os.path.join(absdir,'Downloads')
    try: 
        path = urllib.request.urlretrieve(frompath, destpath)  
    except: 
        return -1
    x = open(destpath, 'r')
    lines = x.readlines() 
    x.close() 
    lines = lines[9:10] + lines[11:]
    idx = []
    for ele in lines: 
        (cik, name, form, date, filename) = ele.split('|')
        idx.append((cik, name, form, date, filename)) 
        
    n_list = [ele[1].replace('/','').upper() for ele in idx] 
    names = ', '.join(n_list)
    matches = len(re.findall(pattern, names)) 
    if matches > 0: 
        return matches 
    if matches == 0: 
        return 0 
    

In [3]:
print(SEC_filing_count(1900, 4, '\d'))                  # This should return -1 because the year is wrong
print(SEC_filing_count(2015, 2, '[a-z]{5}'))            # This should return 0 because Company Name is stored as uppercase
print(SEC_filing_count(2012, 3, '(\d{1,}\W\w{1,})'))    # This may or may not return a positive number of filings
print(SEC_filing_count(2015, 2, 'GOLDMAN'))             # This should return a positive number of filings
print(SEC_filing_count(2012, 3, 'LEHMAN'))              # This may or may not return a positive number of filings

-1
0
2629
821
52
