# Using SEC EDGAR RESTful data APIs

This notebook shows how to retrieve information reported by regulated entities to the U.S. Securities and Exchange Commission (SEC).

The SEC maintains the EDGAR system, which holds information about all regulated entities (companies, funds, individuals). Accessing the data is free, and there are [several ways to access it](https://www.sec.gov/os/accessing-edgar-data).

"data.sec.gov" was created to host RESTful data Application Programming Interfaces (APIs) delivering JSON-formatted data to external customers and to web pages on SEC.gov. These APIs do not require any authentication or API keys to access.

The APIs currently include the submissions history by filer and the XBRL data from financial statements (forms 10-Q, 10-K, 8-K, 20-F, 40-F, 6-K, and their variants).

The JSON structures are updated throughout the day, in real time, as submissions are disseminated.

pip install -r requirements.txt

!jupyter nbextension enable --py widgetsnbextension

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import unicodedata
from bs4 import BeautifulSoup as bs
import requests
from tqdm.notebook import tqdm
import os
import warnings
import boto3
warnings.filterwarnings("ignore")

## Finding CIK of company

EDGAR assigns to filers a unique numerical identifier, known as a Central Index Key (CIK), when they sign up to make filings to the SEC. CIK numbers remain unique to the filer; they are not recycled. 

A list of all CIKs matched with entity names is available for download [(13 MB, text file)](https://www.sec.gov/Archives/edgar/cik-lookup-data.txt). Note that this list includes funds and individuals and is historically cumulative for company names. Thus a given CIK may be associated with multiple names in the case of company or fund name changes, and the list contains some entities that no longer file with the SEC.

We will use a smaller (611 kB) JSON [Kaggle dataset](https://www.kaggle.com/datasets/svendaj/sec-edgar-cik-ticker-exchange), which sources its data directly from EDGAR and serves as input for this notebook. This dataset contains only company names, CIKs, tickers and the associated stock exchange.
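As a rough illustration of why the file needs reshaping after loading: the output shown in this notebook suggests the JSON is an object keyed by row number, with one small record per company. The two-record sample below is an assumption based on that output, not a copy of the real file, and `pd.DataFrame.from_dict(..., orient="index")` is only one possible alternative to transposing the result of `pd.read_json`.

```python
import json

import pandas as pd

# Hypothetical two-record sample mimicking the assumed layout of company_tickers.json.
sample_json = """
{
  "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
  "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"}
}
"""

records = json.loads(sample_json)

# orient="index" turns each outer key into a row, so no transpose is needed.
sample_df = pd.DataFrame.from_dict(records, orient="index")
print(sample_df)
```

With this orientation the numeric column keeps an integer dtype, which can be convenient when the CIK is later formatted into a URL.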

In [3]:
# Load the CIK/ticker JSON into a pandas DataFrame.
# read_json puts one company per column here, so transpose to get one row per company.
CIK_df=pd.read_json("/home/jupyter/SEC_exctractor/company_tickers.json").T

In [4]:
CIK_df.head()

Unnamed: 0,cik_str,ticker,title
0,320193,AAPL,Apple Inc.
1,789019,MSFT,MICROSOFT CORP
2,1652044,GOOGL,Alphabet Inc.
3,1018724,AMZN,AMAZON COM INC
4,1045810,NVDA,NVIDIA CORP


In [5]:
CIK_df.rename(columns={'cik_str': 'cik', 'title':'name'}, inplace=True)


### Finding a company by its registered name

In [6]:
# finding companies containing substring in company name
substring = "Tech"
CIK_df[CIK_df["name"].str.contains(substring, case=False)]

Unnamed: 0,cik,ticker,name
69,101829,RTX,RAYTHEON TECHNOLOGIES CORP
129,723125,MU,MICRON TECHNOLOGY INC
136,1543151,UBER,"Uber Technologies, Inc"
201,1835632,MRVL,"Marvell Technology, Inc."
233,882835,ROP,ROPER TECHNOLOGIES INC
...,...,...,...
9052,1855631,AWINW,AERWINS Technologies Inc.
9071,1847416,ORIAW,Orion Biotech Opportunities Corp.
9165,1872964,MTEKW,Maris Tech Ltd.
9197,1070050,APCXW,AppTech Payments Corp.


## Entity’s current filing history

Each entity’s current filing history is available at the following URL:

* https://data.sec.gov/submissions/CIK##########.json

Where the ########## is the entity’s 10-digit Central Index Key (CIK), including leading zeros.
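As a small illustration of the zero padding described above, the sketch below builds the URL from a CIK given either as an integer or as a string. The helper name `submissions_url` is hypothetical and not part of any SEC library.

```python
def submissions_url(cik):
    """Build the submissions URL for a CIK given as an int or a string."""
    # int() drops any existing leading zeros; :010d pads back to exactly 10 digits.
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


print(submissions_url(320193))        # https://data.sec.gov/submissions/CIK0000320193.json
print(submissions_url("0000789019"))  # https://data.sec.gov/submissions/CIK0000789019.json
```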

This JSON data structure contains metadata such as current name, former names, and the stock exchanges and ticker symbols of publicly traded companies. The `filings.recent` property contains at least one year of filings or the 1,000 most recent filings (whichever is more) in a compact columnar data array. If the entity has additional filings, the `files` property contains an array of additional JSON files and the date range of the filings each one contains.
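To make the columnar layout concrete, here is a minimal sketch that works on a hand-written sample rather than a live response. The sample values, the key names inside each `files` entry, the helper names, and the assumption that the additional files are served from the same `submissions/` path are all illustrative and should be checked against a real response.

```python
# Hypothetical, heavily trimmed response; real responses carry many more keys and columns.
sample_response = {
    "name": "Example Corp",
    "filings": {
        "recent": {
            "accessionNumber": ["0000000000-25-000002", "0000000000-25-000001"],
            "filingDate": ["2025-02-14", "2025-01-10"],
            "form": ["10-K", "8-K"],
        },
        "files": [
            {
                "name": "CIK0000000000-submissions-001.json",
                "filingCount": 2,
                "filingFrom": "2001-01-01",
                "filingTo": "2010-12-31",
            }
        ],
    },
}


def columnar_to_rows(columns):
    """Turn {"col": [v0, v1], ...} into [{"col": v0, ...}, {"col": v1, ...}]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


def additional_file_urls(response):
    """List URLs of the extra JSON files, assuming they sit next to the main file."""
    files = response["filings"].get("files", [])
    return [f"https://data.sec.gov/submissions/{entry['name']}" for entry in files]


for row in columnar_to_rows(sample_response["filings"]["recent"]):
    print(row)
print(additional_file_urls(sample_response))
```

`pd.DataFrame(columns)` performs the same columnar-to-rows conversion in one step; the explicit version is shown only to spell out the layout.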

In [7]:
# Headers for requests to data.sec.gov; the User-Agent must identify the requester
header_full = {
    "User-Agent": "harshit harshit.gola.off@gmail.com",
    "Accept-Encoding": "gzip, deflate",
    "Host": "data.sec.gov"
}




In [8]:

header = {
    "User-Agent": "harshit harshit.gola.off@gmail.com",
}

## Select the ticker of the company used in this example

Subsequent information retrieval uses the selected `ticker` and its associated CIK.
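One way to pick a single company by ticker could look like the sketch below. It runs on a small stand-in DataFrame with the renamed columns (`cik`, `ticker`, `name`); the helper `cik_for_ticker` and the stand-in data are hypothetical and only illustrate the lookup.

```python
import pandas as pd

# Hypothetical stand-in for CIK_df after the rename to "cik", "ticker", "name".
example_df = pd.DataFrame(
    {
        "cik": [320193, 789019],
        "ticker": ["AAPL", "MSFT"],
        "name": ["Apple Inc.", "MICROSOFT CORP"],
    }
)


def cik_for_ticker(df, ticker):
    """Return the CIK for an exact, case-insensitive ticker match."""
    matches = df[df["ticker"].str.upper() == ticker.upper()]
    if matches.empty:
        raise KeyError(f"ticker {ticker!r} not found")
    return int(matches.iloc[0]["cik"])


ticker = "MSFT"
cik = cik_for_ticker(example_df, ticker)
print(ticker, cik)
```

Matching on the exact ticker avoids the false positives that a substring search on the company name can return.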

In [9]:
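# fetch the submissions JSON for one entity and return its recent filings as a DataFrame

def get_current_filing_history(url, header):
    response = requests.get(url, headers=header)
    response.raise_for_status()  # fail early on 403/404 instead of parsing an error page
    company_filings = response.json()
    company_filings_df = pd.DataFrame(company_filings["filings"]["recent"])
    return company_filings_df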
    

## Reading from RESTful API

EDGAR requires that HTTP requests identify themselves with a proper [User-Agent header and comply with the fair use policy (currently max. 10 requests per second)](https://www.sec.gov/os/accessing-edgar-data). At minimum you need to supply your own e-mail address in the User-Agent field (otherwise you will get a 403 Forbidden error). If you provide a Host field, be sure to use the data.sec.gov server and not www.sec.gov as mentioned in the example (that would result in a 404 Not Found error).
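A simple way to stay under the request limit in a single-threaded loop is to space calls at least 1/10 of a second apart. The class below is a minimal sketch of that idea, not an official client: the name `MinIntervalThrottle` and its parameters are hypothetical, and the `clock` and `sleep` arguments exist only so the behaviour can be checked without waiting.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive calls to wait() are at least `interval` seconds apart."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = MinIntervalThrottle(max_per_second=10)
for _ in range(3):
    throttle.wait()  # call this right before each requests.get(...)
```

This only spaces out sequential calls; code that issues requests from several threads or processes would need a shared limiter instead.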

## Creating DataFrame with submitted filings

`company_filings["filings"]["recent"]` contains the most recent filings (at least one year of filings or the latest 1,000, whichever is more), sorted from latest to oldest.

In [10]:
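def pull_all_history(df, header):
    # collect one DataFrame per company and concatenate once at the end,
    # which avoids re-copying the growing result on every iteration
    frames = []
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        CIK = row['cik']
        # the API expects the CIK zero-padded to 10 digits
        url = f"https://data.sec.gov/submissions/CIK{str(CIK).zfill(10)}.json"
        company_filings_df = get_current_filing_history(url, header)
        company_filings_df['ticker']=row['ticker']
        company_filings_df['cik']=row['cik']
        frames.append(company_filings_df)
    if not frames:
        return pd.DataFrame()
    # reversed so that the most recently fetched company comes first
    return pd.concat(frames[::-1])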

In [12]:
df_history = pull_all_history(CIK_df[:100], header_full)

In [13]:
df_history.head()

Unnamed: 0,accessionNumber,filingDate,reportDate,acceptanceDateTime,act,form,fileNumber,filmNumber,items,core_type,size,isXBRL,isInlineXBRL,primaryDocument,primaryDocDescription,ticker,cik
0,0000018230-25-000009,2025-02-14,,2025-02-14T09:38:46.000Z,34.0,IRANNOTICE,001-00768,25623988.0,,IRANNOTICE,76041,0,0,catirannotice-202410xk.htm,IRANNOTICE,CAT,18230
1,0000018230-25-000008,2025-02-14,2024-12-31,2025-02-14T09:36:30.000Z,34.0,10-K,001-00768,25623971.0,,XBRL,33945394,1,1,cat-20241231.htm,10-K,CAT,18230
2,0001104659-25-012855,2025-02-13,2025-02-11,2025-02-13T14:41:51.000Z,,4,,,,4,8873,0,0,xslF345X05/tm255985-10_4seq1.xml,OWNERSHIP DOCUMENT,CAT,18230
3,0001104659-25-012854,2025-02-13,2025-02-11,2025-02-13T14:40:49.000Z,,4,,,,4,6563,0,0,xslF345X05/tm255985-9_4seq1.xml,OWNERSHIP DOCUMENT,CAT,18230
4,0001104659-25-012853,2025-02-13,2025-02-11,2025-02-13T14:39:50.000Z,,4,,,,4,6513,0,0,xslF345X05/tm255985-8_4seq1.xml,OWNERSHIP DOCUMENT,CAT,18230


In [14]:
df_history.shape

(142253, 17)

In [15]:
df_history.to_csv('/home/jupyter/SEC_exctractor/data/history.csv', index=False)

## Accessing specific filing document

Let's download the latest Annual Report (10-K). Files are stored in a browsable directory structure keyed by CIK and accession number (with its dashes removed):
* https://www.sec.gov/Archives/edgar/data/{CIK}/{accession-number}/
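The path pattern above can be sketched as a small helper. The function name `filing_document_url` is hypothetical; the example values come from the 10-K row in the `df_history` preview shown earlier.

```python
def filing_document_url(cik, accession_number, primary_document):
    """Build the archive URL for one document of a filing."""
    # The directory name is the accession number with its dashes removed.
    accession_dir = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{accession_dir}/{primary_document}"
    )


# Values taken from the 10-K row in the df_history preview.
print(filing_document_url(18230, "0000018230-25-000008", "cat-20241231.htm"))
```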

The function below builds the URL of each filing's primary document and loops over all matching filings to download the corresponding htm file.

In [16]:
def download_all_forms(df, form, header):
    # keep only filings of the requested form type, e.g. "10-K"
    df_ = df[df.form == form]
    for index, row in tqdm(df_.iterrows(), total=df_.shape[0]):
        url = f"https://www.sec.gov/Archives/edgar/data/{row['cik']}/{row['accessionNumber'].replace('-', '')}/{row['primaryDocument']}"
        req_content = requests.get(url, headers=header).content.decode("utf-8")
        # one sub-directory per ticker; exist_ok avoids a separate existence check
        directory = f"data/{row['ticker']}"
        os.makedirs(directory, exist_ok=True)

        # write with the same encoding that was used to decode the response
        with open(f"{directory}/{row['primaryDocument']}", "w", encoding="utf-8") as f:
            f.write(req_content)

This step downloads the primary htm document of every 10-K filing found in the recent filing history of the first 100 companies into the `data` folder.
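If only the latest annual report per company is needed, the history could be reduced before downloading. The sketch below shows one way to do that on a miniature stand-in for `df_history`; the helper `latest_per_ticker` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of df_history with only the columns used here.
example_history = pd.DataFrame(
    {
        "ticker": ["CAT", "CAT", "CAT", "AAPL"],
        "form": ["10-K", "10-K", "4", "10-K"],
        "filingDate": ["2025-02-14", "2024-02-16", "2025-02-13", "2024-11-01"],
    }
)


def latest_per_ticker(df, form):
    """Keep only the most recent filing of the given form for each ticker."""
    subset = df[df["form"] == form]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    subset = subset.sort_values("filingDate", ascending=False)
    return subset.drop_duplicates(subset="ticker", keep="first")


latest_10k = latest_per_ticker(example_history, "10-K")
print(latest_10k)
```

The reduced DataFrame could then be passed to `download_all_forms` in place of `df_history`, which would cut the number of requests considerably.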

In [17]:
download_all_forms(df_history, '10-K', header)
    

  0%|          | 0/540 [00:00<?, ?it/s]