The first goal is to find an endpoint for Form 4 filings.

https://www.sec.gov/developer

https://www.sec.gov/os/accessing-edgar-data

In [60]:
import requests
import time
import datetime
import pandas as pd
import os

run_once_override = False

In [50]:
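HEADERS = {
    'User-Agent': 'Personal User khalidelassaad@gmail.com',
    'Accept-Encoding': 'gzip, deflate'
    # No 'Host' entry: requests sets it from each URL, and a fixed
    # 'www.sec.gov' would also be sent to other hosts such as data.sec.gov
}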

def SEC_API_sleep():
    # Be a good citizen! Limit requests to less than 10 per second
    time.sleep(0.11)
    

Fetch the list of CIKs from here https://www.sec.gov/Archives/edgar/cik-lookup-data.txt

In [53]:
def get_url_to_outfilename(url, outfilename, run_once_override=run_once_override):
    if not run_once_override:
        return -1
    response = requests.get(url, headers=HEADERS)
    print("Status Code:", response.status_code)
    with open(outfilename, "w", encoding="utf-8") as outfile:
        outfile.write(response.text)
    SEC_API_sleep()
    return response

In [4]:
get_url_to_outfilename(
    "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt",
    "cik_list.txt"
)

-1

In [5]:
get_url_to_outfilename(
    "https://www.sec.gov/files/company_tickers.json",
    "company_tickers.json"
)

-1

In [6]:
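def generate_interested_companies_dict():
    interested_company_names = []
    return_dict = dict()
    with open("interested_company_names.txt","r") as infile:
        for line in infile:
            # Skip blank lines: an empty name would match every company
            if line.strip():
                interested_company_names.append(line.strip())
    # cik_list.txt was written as UTF-8, so read it back the same way
    with open("cik_list.txt","r", encoding="utf-8") as infile:
        for line in infile:
            for name in interested_company_names:
                if name in line:
                    # Lines look like "NAME:CIK:"; a name can itself contain
                    # colons, so split from the right
                    cik_name, cik_number = line.strip().rstrip(":").rsplit(":", 1)
                    return_dict[int(cik_number)] = {"name": cik_name}
    return return_dict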

In [7]:
interested_companies_dict = generate_interested_companies_dict()

We can get information about a specific CIK's filings by querying this resource:

https://data.sec.gov/submissions/CIK##########.json
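The `##########` placeholder stands for the CIK left-padded with zeros to ten digits. As a minimal sketch of that padding (the helper name `submissions_url` is hypothetical and not part of this notebook, and the CIK passed in is just an example value):

```python
def submissions_url(cik):
    """Build the submissions URL for a CIK given as an int or a digit string."""
    # The path uses exactly ten digits, left-padded with zeros
    return "https://data.sec.gov/submissions/CIK{:010d}.json".format(int(cik))

print(submissions_url(320193))
```

Converting through `int` first means an already padded string such as `"0000320193"` produces the same URL.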

In [52]:
def save_interested_companies_indices(interested_companies_dict, run_once_override=run_once_override):
    if not run_once_override:
        return -1
    for cik in interested_companies_dict:
        url = "https://data.sec.gov/submissions/CIK{:010d}.json".format(cik)
        # The response is JSON, so save it with a .json extension
        get_url_to_outfilename(
            url,
            "{}.json".format(cik),
            run_once_override=run_once_override
        )
    return 0

The above URL is not fetching. Let's try something else.

https://www.sec.gov/os/accessing-edgar-data

According to the above, we can use https://www.sec.gov/Archives/edgar/full-index/ to get a per-quarter index of all filings from all companies. Let's write a function to grab the latest index and parse out relevant links for our `interested_companies_dict`.

We want the `master.idx` file, which contains data in this format: `CIK|Company Name|Form Type|Date Filed|Filename`

The full path to this file looks like:

`https://www.sec.gov/Archives/edgar/full-index/{YYYY}/QTR{1-4}/master.idx`

Data extends from 1993 QTR1 to 2023 QTR1.
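To make the pipe-delimited format concrete, here is a rough sketch of parsing a single line. The names `IndexRow` and `parse_master_idx_line` are hypothetical, the sample line is made up for illustration, and the sketch assumes that any line without five fields and a numeric CIK is header material that can be skipped.

```python
from collections import namedtuple

IndexRow = namedtuple("IndexRow", "cik company form_type date_filed filename")

def parse_master_idx_line(line):
    """Return an IndexRow for a data line, or None for any other line."""
    parts = line.rstrip("\n").split("|")
    # Data rows have exactly five fields and a numeric CIK; anything else
    # (descriptive header text, the column-name row, dashes) is skipped
    if len(parts) != 5 or not parts[0].isdigit():
        return None
    return IndexRow(int(parts[0]), *parts[1:])

# Made-up sample line in the documented format
sample = "1000045|EXAMPLE CORP|4|2023-01-05|edgar/data/1000045/0000000000-23-000001.txt"
row = parse_master_idx_line(sample)
print(row.form_type, row.filename)
```

Storing the CIK as an `int` keeps it comparable with the integer keys of `interested_companies_dict`.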

In [24]:
def get_master_index_for_year_and_quarter(year, quarter, run_once_override=run_once_override):
    if not run_once_override:
        return -1
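    # Make sure the output directory exists before writing into it
    os.makedirs("indices", exist_ok=True)
    get_url_to_outfilename(
        "https://www.sec.gov/Archives/edgar/full-index/{}/QTR{}/master.idx".format(year, quarter),
        "indices/{}Q{}_master.idx".format(year, quarter),
        run_once_override=run_once_override
    )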
    return 0

In [35]:
get_master_index_for_year_and_quarter(2023, 1)

-1

Let's grab all indices since 1993 QTR1, then we can aggregate them together for one unified master index that we can parse for companies we're interested in.

First we'll grab all indices.

In [54]:
def get_all_master_indices(run_once_override=run_once_override):
    if not run_once_override:
        return -1
    start_year = 1993
    end_year = pd.Timestamp(datetime.datetime.now()).year
    end_quarter = pd.Timestamp(datetime.datetime.now()).quarter
    for year in range(start_year, end_year + 1):
        end_quarter_range = 4 if year != end_year else end_quarter
        for quarter in range(1, end_quarter_range + 1):
            print("Fetching index for {}Q{}".format(year, quarter))
            get_master_index_for_year_and_quarter(year, quarter, run_once_override)
    return 0

In [56]:
get_all_master_indices()

-1

Let's also write a function to fetch only the latest index (for updates while the current quarter is ongoing).

In [57]:
def get_latest_master_index(run_once_override=run_once_override):
    if not run_once_override:
        return -1
    year = pd.Timestamp(datetime.datetime.now()).year
    quarter = pd.Timestamp(datetime.datetime.now()).quarter
    print("Fetching index for {}Q{}".format(year, quarter))
    get_master_index_for_year_and_quarter(year, quarter, run_once_override)
    return 0

In [59]:
get_latest_master_index()

-1

Nice, now to read data. Here's what I want to do:

1. Read all index files.
2. Grab rows for CIKs that I'm interested in, based on `interested_companies_dict`.
3. Output rows as a pandas DataFrame.
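The three steps above could be sketched roughly as follows. The function name `load_index_rows` and the `COLUMNS` list are hypothetical, and the sketch assumes the index files were saved as UTF-8 under names ending in `_master.idx`, with header lines recognisable because they lack a numeric CIK in the first field.

```python
import os
import pandas as pd

COLUMNS = ["CIK", "Company Name", "Form Type", "Date Filed", "Filename"]

def load_index_rows(index_dir, interested_ciks):
    """Collect rows for the given CIKs from every *_master.idx in index_dir."""
    rows = []
    for filename in sorted(os.listdir(index_dir)):
        if not filename.endswith("_master.idx"):
            continue
        path = os.path.join(index_dir, filename)
        with open(path, "r", encoding="utf-8", errors="replace") as infile:
            for line in infile:
                parts = line.rstrip("\n").split("|")
                # Skip header lines: data rows have five fields and a numeric CIK
                if len(parts) != 5 or not parts[0].isdigit():
                    continue
                if int(parts[0]) in interested_ciks:
                    rows.append(parts)
    df = pd.DataFrame(rows, columns=COLUMNS)
    # Store CIKs as integers so they compare equal to the dict keys
    df["CIK"] = df["CIK"].astype(int)
    return df
```

A call such as `load_index_rows("indices", set(interested_companies_dict))` would then return one row per matching filing, and `df[df["Form Type"] == "4"]` would narrow that down to Form 4 filings.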

In [62]:
index_files = sorted(os.listdir("indices"))  # os.listdir order is not guaranteed
index_files

['1993Q1_master.idx',
 '1993Q2_master.idx',
 '1993Q3_master.idx',
 '1993Q4_master.idx',
 '1994Q1_master.idx',
 '1994Q2_master.idx',
 '1994Q3_master.idx',
 '1994Q4_master.idx',
 '1995Q1_master.idx',
 '1995Q2_master.idx',
 '1995Q3_master.idx',
 '1995Q4_master.idx',
 '1996Q1_master.idx',
 '1996Q2_master.idx',
 '1996Q3_master.idx',
 '1996Q4_master.idx',
 '1997Q1_master.idx',
 '1997Q2_master.idx',
 '1997Q3_master.idx',
 '1997Q4_master.idx',
 '1998Q1_master.idx',
 '1998Q2_master.idx',
 '1998Q3_master.idx',
 '1998Q4_master.idx',
 '1999Q1_master.idx',
 '1999Q2_master.idx',
 '1999Q3_master.idx',
 '1999Q4_master.idx',
 '2000Q1_master.idx',
 '2000Q2_master.idx',
 '2000Q3_master.idx',
 '2000Q4_master.idx',
 '2001Q1_master.idx',
 '2001Q2_master.idx',
 '2001Q3_master.idx',
 '2001Q4_master.idx',
 '2002Q1_master.idx',
 '2002Q2_master.idx',
 '2002Q3_master.idx',
 '2002Q4_master.idx',
 '2003Q1_master.idx',
 '2003Q2_master.idx',
 '2003Q3_master.idx',
 '2003Q4_master.idx',
 '2004Q1_master.idx',
 '2004Q2_m