# Using SEC EDGAR RESTful data APIs

This notebook shows how to retrieve information reported by regulated entities to U.S. Securities and Exchange Commision (SEC).

SEC is maintainig EDGAR system with information about all regulated enties (companies, funds, individuals). Accessing the data is free and there is number of [various ways how to access the data](https://www.sec.gov/os/accessing-edgar-data).

"data.sec.gov" was created to host RESTful data Application Programming Interfaces (APIs) delivering JSON-formatted data to external customers and to web pages on SEC.gov. These APIs do not require any authentication or API keys to access.

Currently included in the APIs are the submissions history by filer and the XBRL data from financial statements (forms 10-Q, 10-K,8-K, 20-F, 40-F, 6-K, and their variants).

The JSON structures are updated throughout the day, in real time, as submissions are disseminated.

pip install -r requirements.txt

!jupyter nbextension enable --py widgetsnbextension

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import unicodedata
from bs4 import BeautifulSoup as bs
import requests
from tqdm.notebook import tqdm
import os
import warnings
import boto3
warnings.filterwarnings("ignore")

## Finding CIK of company

EDGAR assigns to filers a unique numerical identifier, known as a Central Index Key (CIK), when they sign up to make filings to the SEC. CIK numbers remain unique to the filer; they are not recycled. 

List of all CIKs matched with entity name is available for download [(13 MB, text file)](https://www.sec.gov/Archives/edgar/cik-lookup-data.txt). Note that this list includes funds and individuals and is historically cumulative for company names. Thus a given CIK may be associated with multiple names in the case of company or fund name changes, and the list contains some entities that no longer file with the SEC.

We will be using smaller (611 kB) JSON [kaggle dataset](https://www.kaggle.com/datasets/svendaj/sec-edgar-cik-ticker-exchange), which is sourcing data directly at EDGAR and is input for this notebook. This dataset contains only companies names, CIK, ticker and associated stock exchange.

In [4]:
# Let's convert CIK JSON to pandas DataFrame
# First load the data into python dictionary


CIK_df=pd.read_json("/home/jupyter/SEC_exctractor/company_tickers.json").T

In [5]:
CIK_df.head()

Unnamed: 0,cik_str,ticker,title
0,320193,AAPL,Apple Inc.
1,789019,MSFT,MICROSOFT CORP
2,1652044,GOOGL,Alphabet Inc.
3,1018724,AMZN,AMAZON COM INC
4,1045810,NVDA,NVIDIA CORP


In [6]:
CIK_df.rename(columns={'cik_str': 'cik', 'title':'name'}, inplace=True)


### Finding a particular company based upon the Name they are registered with

In [7]:
# finding companies containing substring in company name
substring = "Tech"
CIK_df[CIK_df["name"].str.contains(substring, case=False)]

Unnamed: 0,cik,ticker,name
69,101829,RTX,RAYTHEON TECHNOLOGIES CORP
129,723125,MU,MICRON TECHNOLOGY INC
136,1543151,UBER,"Uber Technologies, Inc"
201,1835632,MRVL,"Marvell Technology, Inc."
233,882835,ROP,ROPER TECHNOLOGIES INC
...,...,...,...
9052,1855631,AWINW,AERWINS Technologies Inc.
9071,1847416,ORIAW,Orion Biotech Opportunities Corp.
9165,1872964,MTEKW,Maris Tech Ltd.
9197,1070050,APCXW,AppTech Payments Corp.


# Entity’s current filing history

Each entity’s current filing history is available at the following URL:

* https://data.sec.gov/submissions/CIK##########.json

Where the ########## is the entity’s 10-digit Central Index Key (CIK), including leading zeros.

This JSON data structure contains metadata such as current name, former name, and stock exchanges and ticker symbols of publicly-traded companies. The object’s property path contains at least one year’s of filing or to 1,000 (whichever is more) of the most recent filings in a compact columnar data array. If the entity has additional filings, files will contain an array of additional JSON files and the date range for the filings each one contains.

In [9]:
# read response from REST API with `requests` library and format it as python dict

import requests
header_full = {
    "User-Agent": "harshit harshit.gola.off@gmail.com",
    "Accept-Encoding": "gzip, deflate",
    "Host": "data.sec.gov"
}




In [10]:

header = {
    "User-Agent": "harshit harshit.gola.off@gmail.com",
}

## Select the ticker of company used in this example

Subsequent information retrieval will be using selected `ticker` and associated CIK

In [11]:
# finding company row with given ticker

def get_current_filing_history(url, header):
    company_filings = requests.get(url, headers=header).json()
    company_filings_df = pd.DataFrame(company_filings["filings"]["recent"])
    return company_filings_df
    

## Reading from RESTful API

EDGAR requires that HTTP requests will be identified with proper [UserAgent in header and comply with fair use policy (currently max. 10 requests per second)](https://www.sec.gov/os/accessing-edgar-data). At minimum you need to supply your own e-mail adress in User-Agent field (otherwise you will get 403/Forbiden error). If you will provide Host field, please be sure use data.sec.gov server and not www.sec.gov as mentioned in example (this would result in 404/Not Found error).

## Creating DataFrame with submitted filings

`company_filings["filings"]["recent"]` contains up to 1000 last submitted filings sorted from latest to oldest.

In [12]:
def pull_all_history(df, header):
    df_=pd.DataFrame()
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        CIK = row['cik']
        url = f"https://data.sec.gov/submissions/CIK{str(CIK).zfill(10)}.json"
        company_filings_df = get_current_filing_history(url, header)
        company_filings_df['ticker']=row['ticker']
        company_filings_df['cik']=row['cik']
        df_ = pd.concat([company_filings_df, df_])
    return df_

In [13]:
df_history = pull_all_history(CIK_df[:100], header_full)

  0%|          | 0/100 [00:00<?, ?it/s]

In [14]:
df_history.head()

Unnamed: 0,accessionNumber,filingDate,reportDate,acceptanceDateTime,act,form,fileNumber,filmNumber,items,core_type,size,isXBRL,isInlineXBRL,primaryDocument,primaryDocDescription,ticker,cik
0,0000018230-25-000009,2025-02-14,,2025-02-14T09:38:46.000Z,34.0,IRANNOTICE,001-00768,25623988.0,,IRANNOTICE,76041,0,0,catirannotice-202410xk.htm,IRANNOTICE,CAT,18230
1,0000018230-25-000008,2025-02-14,2024-12-31,2025-02-14T09:36:30.000Z,34.0,10-K,001-00768,25623971.0,,XBRL,33945394,1,1,cat-20241231.htm,10-K,CAT,18230
2,0001104659-25-012855,2025-02-13,2025-02-11,2025-02-13T14:41:51.000Z,,4,,,,4,8873,0,0,xslF345X05/tm255985-10_4seq1.xml,OWNERSHIP DOCUMENT,CAT,18230
3,0001104659-25-012854,2025-02-13,2025-02-11,2025-02-13T14:40:49.000Z,,4,,,,4,6563,0,0,xslF345X05/tm255985-9_4seq1.xml,OWNERSHIP DOCUMENT,CAT,18230
4,0001104659-25-012853,2025-02-13,2025-02-11,2025-02-13T14:39:50.000Z,,4,,,,4,6513,0,0,xslF345X05/tm255985-8_4seq1.xml,OWNERSHIP DOCUMENT,CAT,18230


In [15]:
df_history.to_csv('/home/jupyter/SEC_exctractor/data/history.csv', index=False)

## Accessing specific filing document

Let's download latest Annual Report (10-K). Files are stored in browsable directory structure for CIK and accession-number: 
* https://www.sec.gov/Archives/edgar/data/{CIK}/{accession-number}/

Creating a function to create a url and run loop for all the items to download each of the filing htm file

In [16]:
def download_all_forms(df, form, header):
    df_ = df[df.form == form]
    for index, row in tqdm(df_.iterrows(), total=df_.shape[0]):
        url = f"https://www.sec.gov/Archives/edgar/data/{row['cik']}/{row['accessionNumber'].replace('-', '')}/{row['primaryDocument']}"
        req_content = requests.get(url, headers=header).content.decode("utf-8")
        directory = f"data/{row['ticker']}"
        if not os.path.exists(directory):
            os.makedirs(directory)

        with open(f"{directory}/{row['primaryDocument']}", "w") as f:
            f.write(req_content)

This step is to download all the 10K htm files for the 100 most recent filings into the data folder.

In [17]:
download_all_forms(df_history, '10-K', header)
    

  0%|          | 0/540 [00:00<?, ?it/s]

In [18]:
#!/usr/bin/env python
# coding: utf-8

# # 10-K form
# ## Business, Risk, and MD&A
# The function *parse_10k_filing()* parses 10-K forms to extract the following sections: business description, business risk, and management discussioin and analysis. The function takes two arguments, a link and a number indicating the section, and returns a list with the requested sections. Current options are **0(All), 1(Business), 2(Risk), 4(MDA).**
# 
# Caveats:
# The function *parse_10k_filing()* is a parser. You need to feed a SEC text link into it. There are many python and r packages to get a direct link to the fillings.
# 
import re
import unicodedata
import pandas as pd
from bs4 import BeautifulSoup as bs

def parse_10k_filing(file_path, section):
    
    if section not in [0, 1, 2, 3]:
        print("Not a valid section")
        sys.exit()
    
    def get_text(file_path):
        with open(file_path, 'r') as file:
            content = file.read()
        html = bs(content, 'html.parser')
        text = html.get_text()
        text = unicodedata.normalize("NFKD", text).encode('ascii', 'ignore').decode('utf8')
        text = text.split("\n")
        text = " ".join(text)
        return text
    
    def extract_text(text, item_start, item_end):
        item_start = item_start
        item_end = item_end
        starts = [i.start() for i in item_start.finditer(text)]
        ends = [i.start() for i in item_end.finditer(text)]
        positions = list()
        for s in starts:
            control = 0
            for e in ends:
                if control == 0:
                    if s < e:
                        control = 1
                        positions.append([s,e])
        item_length = 0
        item_position = list()
        for p in positions:
            if (p[1]-p[0]) > item_length:
                item_length = p[1]-p[0]
                item_position = p

        item_text = text[item_position[0]:item_position[1]]

        return item_text

    text = get_text(file_path)
        
    if section == 1 or section == 0:
        try:
            item1_start = re.compile("item\s*[1][\.\;\:\-\_]*\s*\\b", re.IGNORECASE)
            item1_end = re.compile("item\s*1a[\.\;\:\-\_]\s*Risk|item\s*2[\.\,\;\:\-\_]\s*Prop", re.IGNORECASE)
            businessText = extract_text(text, item1_start, item1_end)
        except:
            businessText = "Something went wrong!"
        
    if section == 2 or section == 0:
        try:
            item1a_start = re.compile("(?<!,\s)item\s*1a[\.\;\:\-\_]\s*Risk", re.IGNORECASE)
            item1a_end = re.compile("item\s*2[\.\;\:\-\_]\s*Prop|item\s*[1][\.\;\:\-\_]*\s*\\b", re.IGNORECASE)
            riskText = extract_text(text, item1a_start, item1a_end)
        except:
            riskText = "Something went wrong!"
            
    if section == 3 or section == 0:
        try:
            item7_start = re.compile("item\s*[7][\.\;\:\-\_]*\s*\\bM", re.IGNORECASE)
            item7_end = re.compile("item\s*7a[\.\;\:\-\_]\sQuanti|item\s*8[\.\,\;\:\-\_]\s*", re.IGNORECASE)
            mdaText = extract_text(text, item7_start, item7_end)
        except:
            mdaText = "Something went wrong!"
    
    if section == 0:
        data = [businessText, riskText, mdaText]
    elif section == 1:
        data = [businessText]
    elif section == 2:
        data = [riskText]
    elif section == 3:
        data = [mdaText]
    
    return data



In [19]:
def parse_all_forms(df, form, header):
    df_ = df[df.form == form]
    df__ = pd.DataFrame()
    for index, row in tqdm(df_.iterrows(), total=df_.shape[0]):
        directory = f"data/{row['ticker']}"
        file_path = 'data/AAPL/a10-k20179302017.htm'
        section = 0
        
        # Parse the 10-K filing and store the results in a DataFrame
        text_data = parse_10k_filing(file_path, section)
        df_text = pd.DataFrame({'Text': text_data})
        df_text['ticker'] = row['ticker']
        df_text['filepath'] = file_path
        df__ = pd.concat([df_text, df__])

    return df__

In [20]:
df_text = parse_all_forms(df_history , '10-K', header)

  0%|          | 0/540 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [57]:
df_text

Unnamed: 0,Id,Text,ticker,filepath
0,1,Item 1. BusinessCompany BackgroundThe Company ...,AAPL,data/AAPL/a10-k20179302017.htm
1,2,Item 1A. Risk FactorsThe following discussion ...,AAPL,data/AAPL/a10-k20179302017.htm
2,3,Item 7. Managements Discussion and Analysis of...,AAPL,data/AAPL/a10-k20179302017.htm
3,4,Item 1. BusinessCompany BackgroundThe Company ...,AAPL,data/AAPL/a10-k20179302017.htm
4,5,Item 1A. Risk FactorsThe following discussion ...,AAPL,data/AAPL/a10-k20179302017.htm
...,...,...,...,...
1609,1610,Item 1A. Risk FactorsThe following discussion ...,CAT,data/AAPL/a10-k20179302017.htm
1610,1611,Item 7. Managements Discussion and Analysis of...,CAT,data/AAPL/a10-k20179302017.htm
1611,1612,Item 1. BusinessCompany BackgroundThe Company ...,CAT,data/AAPL/a10-k20179302017.htm
1612,1613,Item 1A. Risk FactorsThe following discussion ...,CAT,data/AAPL/a10-k20179302017.htm


In [None]:
## Exporting the resulted dataframe in a csv format to AWS S3

import boto3


In [56]:
## Exporting the resulted dataframe in a csv format.

file_path='./All Reports Data.csv'



#Arranging the index and adding Id column
df_text = df_text.reset_index(drop=True)
df_text['Id'] = df_text.index+1

# Pop the column from its current position
new_column = df_text.pop('Id')

# Insert the column at the front of the DataFrame
df_text.insert(0, 'Id', new_column)

print(df_text)
df_text.to_csv(file_path, index=False)


        Id                                               Text ticker   
0        1  Item 1. BusinessCompany BackgroundThe Company ...   AAPL  \
1        2  Item 1A. Risk FactorsThe following discussion ...   AAPL   
2        3  Item 7. Managements Discussion and Analysis of...   AAPL   
3        4  Item 1. BusinessCompany BackgroundThe Company ...   AAPL   
4        5  Item 1A. Risk FactorsThe following discussion ...   AAPL   
...    ...                                                ...    ...   
1609  1610  Item 1A. Risk FactorsThe following discussion ...    CAT   
1610  1611  Item 7. Managements Discussion and Analysis of...    CAT   
1611  1612  Item 1. BusinessCompany BackgroundThe Company ...    CAT   
1612  1613  Item 1A. Risk FactorsThe following discussion ...    CAT   
1613  1614  Item 7. Managements Discussion and Analysis of...    CAT   

                            filepath  
0     data/AAPL/a10-k20179302017.htm  
1     data/AAPL/a10-k20179302017.htm  
2     data/AAPL/a1

#testing the parse text for a single file

In [58]:
file_path = 'data/AAPL/a10-k20179302017.htm'
section = 0

# Parse the 10-K filing and store the results in a DataFrame
text_data = parse_10k_filing(file_path, section)
df = pd.DataFrame({'Text': text_data})

# Print the DataFrame
print(df)





                                                Text
0  Item 1. BusinessCompany BackgroundThe Company ...
1  Item 1A. Risk FactorsThe following discussion ...
2  Item 7. Managements Discussion and Analysis of...


In [59]:
type(content)

str

In [14]:
import pandas as pd

# Read the CSV file into a DataFrame
df_read = pd.read_csv('All Reports Data.csv')

print(df_read)
print(df_read.shape)

        Id                                               Text ticker   
0        1  Item 1. BusinessCompany BackgroundThe Company ...   AAPL  \
1        2  Item 1A. Risk FactorsThe following discussion ...   AAPL   
2        3  Item 7. Managements Discussion and Analysis of...   AAPL   
3        4  Item 1. BusinessCompany BackgroundThe Company ...   AAPL   
4        5  Item 1A. Risk FactorsThe following discussion ...   AAPL   
...    ...                                                ...    ...   
1609  1610  Item 1A. Risk FactorsThe following discussion ...    CAT   
1610  1611  Item 7. Managements Discussion and Analysis of...    CAT   
1611  1612  Item 1. BusinessCompany BackgroundThe Company ...    CAT   
1612  1613  Item 1A. Risk FactorsThe following discussion ...    CAT   
1613  1614  Item 7. Managements Discussion and Analysis of...    CAT   

                            filepath  
0     data/AAPL/a10-k20179302017.htm  
1     data/AAPL/a10-k20179302017.htm  
2     data/AAPL/a1

## Code to put all the texts from the latest 10k in a single row to be used for doing further appropriate cleaning.
### Mostly all the financial statements contains declarations and disclosures that are specific to their nature of work and Industry they are serving. So, this file can act as a source to perform some further data processing and cleaning that can be later used to map some of the internal processes that only the business is aware about. 

## Adding a Column for categorizing the Sections of the forms

In [13]:
column_names = df_read.columns
print(column_names)

Index(['Id', 'Text', 'ticker', 'filepath', 'Category'], dtype='object')


In [15]:
import pandas as pd

def assign_category(df):
    # Create a new column 'Category'
    df['Category'] = ''

    # Define the keywords and their corresponding categories
    keywords = {
        'Item 1.': 'Business Overview',
        'Item 1A.': 'Risk Factors',
        'Item 7.': 'MD&A'
    }

    # Iterate over each row
    for index, row in df.iterrows():
        text = row['Text']

        # Check if the 'Text' column contains any of the keywords
        for keyword, category in keywords.items():
            if keyword in text:
                df.at[index, 'Category'] = category
                break

    # Return the updated DataFrame
    return df

# Example usage
# Assuming your input DataFrame is called 'df_read'

# Assign categories based on keywords
df_with_category = assign_category(df_read)

# Print the updated DataFrame
print(df_with_category)


        Id                                               Text ticker   
0        1  Item 1. BusinessCompany BackgroundThe Company ...   AAPL  \
1        2  Item 1A. Risk FactorsThe following discussion ...   AAPL   
2        3  Item 7. Managements Discussion and Analysis of...   AAPL   
3        4  Item 1. BusinessCompany BackgroundThe Company ...   AAPL   
4        5  Item 1A. Risk FactorsThe following discussion ...   AAPL   
...    ...                                                ...    ...   
1609  1610  Item 1A. Risk FactorsThe following discussion ...    CAT   
1610  1611  Item 7. Managements Discussion and Analysis of...    CAT   
1611  1612  Item 1. BusinessCompany BackgroundThe Company ...    CAT   
1612  1613  Item 1A. Risk FactorsThe following discussion ...    CAT   
1613  1614  Item 7. Managements Discussion and Analysis of...    CAT   

                            filepath           Category  
0     data/AAPL/a10-k20179302017.htm  Business Overview  
1     data/AAPL/a10

In [16]:
## Exporting the resulted dataframe in a csv format.

file_path='./All Reports Data.csv'

df_with_category.to_csv(file_path, index=False)

In [None]:
bucket_name = 'glue-sec-etl'  # Replace with your bucket name
csv_file_key = 'All Reports Data.csv'  # Replace with your desired key for the CSV filenpm install -g npm

In [None]:
s3 = boto3.resource('s3')
s3.meta.client.upload_file(csv_file_key, bucket_name, csv_file_key)
