# GPT-4 API: Extract Structured Data from SEC Filings (Material Event Disclosures)

Data sources used:
- SEC-API.io
- AlgoSeek
- WRDS Fundamental Data

Logic:

- Query API to download metadata of 8-Ks with Item 4.02 => create file `8K-filing-metadata.csv`
- Extractor API to extract and download Item 4.02 sections => save to `./data/edgar/8K-item-4.02/CIK/ACCESSION-NO-item-4-2.txt`
- Pre-select 100 filings that deal with revenue recognition errors, split evenly between filings where the impact is material (50%) and not material (50%).
- For each item section, extract structured data using GPT-4 and save to `./data/edgar/8K-item-4.02/CIK/ACCESSION-NO-item-4-2-structured-data.json`
- For each CIK and `filedAt`, read `Years since IPO` and `market cap` from `fundamentals`
- For each filing, calculate the 1/2/3/5/10/20-day returns after the filing was disclosed
  - Plot return distributions for each day
  - Display descriptive stats

More:
- https://sec-api.io/resources/analyze-8-k-filings-and-material-event-disclosure-activity

## Step 1: Locate URLs of 8-K Filings with Item 4.02

In [1]:
# Load the OPENAI_API_KEY and SEC_API_KEY values from the .env file
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
!pip -q install sec-api



In [87]:
import os
from sec_api import QueryApi
import pandas as pd

SEC_API_KEY = os.getenv("SEC_API_KEY")

queryApi = QueryApi(api_key=SEC_API_KEY)

payload = {
    "query": 'formType:"8-K" AND items:"4.02" AND filedAt:[2022-01-01 TO 2022-12-31]',
    "from": "0",
    "size": "50",
    "sort": [{"filedAt": {"order": "desc"}}],
}

response = queryApi.get_filings(payload)

# Convert JSON array to DataFrame
metadata_sample = pd.DataFrame(response["filings"])

columns_of_interest = [
    "formType",
    "filedAt",
    "accessionNo",
    "ticker",
    "cik",
    "companyName",
    "items",
    "linkToFilingDetails",
]

metadata_sample[columns_of_interest].head(5)

| | formType | filedAt | accessionNo | ticker | cik | companyName | items | linkToFilingDetails |
|---|---|---|---|---|---|---|---|---|
| 0 | 8-K | 2022-12-27T16:20:13-05:00 | 0001213900-22-082892 | POLCQ | 1810140 | Polished.com Inc. | [Item 4.01: Changes in Registrant's Certifying... | https://www.sec.gov/Archives/edgar/data/181014... |
| 1 | 8-K | 2022-12-22T16:30:53-05:00 | 0001493152-22-036320 | GLLI | 1888734 | GLOBALINK INVESTMENT INC. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/188873... |
| 2 | 8-K | 2022-12-21T17:25:38-05:00 | 0001213900-22-081850 | CLRC | 1903392 | ClimateRock | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/190339... |
| 3 | 8-K | 2022-12-16T16:17:36-05:00 | 0001185185-22-001426 | FEIM | 39020 | FREQUENCY ELECTRONICS INC | [Item 2.02: Results of Operations and Financia... | https://www.sec.gov/Archives/edgar/data/39020/... |
| 4 | 8-K | 2022-12-16T16:05:27-05:00 | 0001493152-22-035733 | VTRO | 793171 | Vitro Biopharma, Inc. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/793171... |


In [88]:
import calendar


# Get first 50 filings of each month in 2022 (for demonstration purposes only)
# In practice, you would want to get all filings for the year (or a longer period of time)
def get_filings_metadata(year=2022):
    metadata_list = []

    for month in range(1, 13):
        # Use the real last day of the month so the date range is always valid
        last_day = calendar.monthrange(year, month)[1]
        # Iterate over "from" in increments of 50 to page through all filings of the month
        # for i in range(0, 10000, 50): # Uncomment to get all filings for the month
        for i in range(
            0, 50, 50
        ):  # For demonstration purposes, only get first 50 filings of each month
            payload = {
                "query": f'formType:"8-K" AND items:"4.02" AND filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-{last_day}]',
                "from": i,
                "size": "50",
                "sort": [{"filedAt": {"order": "desc"}}],
            }
            response = queryApi.get_filings(payload)
            if len(response["filings"]) == 0:
                break
            metadata_list.append(response["filings"])

        print(f"✅ Month {month:02d} completed")

    # Flatten the list of lists: [[1, 2], [3, 4]] -> [1, 2, 3, 4]
    metadata_list = [item for sublist in metadata_list for item in sublist]
    print(f"✅ Done. Total filings: {len(metadata_list)}")
    # Convert JSON array to DataFrame
    metadata_df = pd.DataFrame(metadata_list)

    return metadata_df[
        [
            "formType",
            "filedAt",
            "accessionNo",
            "ticker",
            "cik",
            "companyName",
            "items",
            "linkToFilingDetails",
        ]
    ]
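
The commented-out loop above pages through results by stepping `from` in increments of 50. A minimal sketch of that pattern as a reusable generator follows. `fetch_page` is a hypothetical callable standing in for the API call, and stopping on a short page is an assumption about how the end of the result set shows up, not documented API behavior.

```python
from typing import Callable, Dict, Iterator, List


def paginate(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 50,
    max_results: int = 10_000,
) -> Iterator[Dict]:
    """Yield items page by page until a page comes back empty or short."""
    for offset in range(0, max_results, page_size):
        page = fetch_page(offset, page_size)
        yield from page
        # An empty or partially filled page means there is nothing left to fetch
        if len(page) < page_size:
            break


# Stand-in for the real API call: serves slices of 120 fake filings
fake_filings = [{"accessionNo": f"fake-{i:04d}"} for i in range(120)]


def fake_fetch(offset: int, size: int) -> List[Dict]:
    return fake_filings[offset : offset + size]


all_filings = list(paginate(fake_fetch))
print(len(all_filings))
```

With the real client, `fetch_page` could wrap `queryApi.get_filings` with `"from"` set to the offset and return `response["filings"]`.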

In [89]:
metadata_2022 = get_filings_metadata(2022)

✅ Month 01 completed
✅ Month 02 completed
✅ Month 03 completed
✅ Month 04 completed
✅ Month 05 completed
✅ Month 06 completed
✅ Month 07 completed
✅ Month 08 completed
✅ Month 09 completed
✅ Month 10 completed
✅ Month 11 completed
✅ Month 12 completed
✅ Done. Total filings: 305


In [90]:
metadata_2022.head(5)

| | formType | filedAt | accessionNo | ticker | cik | companyName | items | linkToFilingDetails |
|---|---|---|---|---|---|---|---|---|
| 0 | 8-K | 2022-01-31T21:35:24-05:00 | 0001829126-22-002268 | | 1740742 | TransparentBusiness, Inc. | [Item 4.01: Changes in Registrant's Certifying... | https://www.sec.gov/Archives/edgar/data/174074... |
| 1 | 8-K | 2022-01-31T08:50:59-05:00 | 0001104659-22-009285 | OEPW | 1824677 | One Equity Partners Open Water I Corp. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/182467... |
| 2 | 8-K | 2022-01-31T07:03:59-05:00 | 0001564590-22-003120 | BIOCQ | 1044378 | BIOCEPT INC | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/104437... |
| 3 | 8-K | 2022-01-28T17:40:28-05:00 | 0001213900-22-004297 | SCAQ | 1821812 | Stratim Cloud Acquisition Corp. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/182181... |
| 4 | 8-K | 2022-01-28T17:02:06-05:00 | 0001193125-22-021724 | DTRT | 1865537 | DTRT Health Acquisition Corp. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/186553... |


In [91]:
# Save the DataFrame to a CSV file ./data/edgar/8K-filing-metadata.csv
os.makedirs("./data/edgar", exist_ok=True)  # create the output directory if it is missing
metadata_2022.to_csv("./data/edgar/8K-filing-metadata.csv", index=False)

## Step 2: Download Item 4.02 Sections

- Use `pandarallel` to parallelize downloading of Item 4.02 sections and speed up the process.

In [94]:
from sec_api import ExtractorApi
import time

extractorApi = ExtractorApi(SEC_API_KEY)


def extract_section_4_02_and_save_to_file(df_row, retry_count=0):
    cik = df_row["cik"]
    accessionNo = df_row["accessionNo"]
    filingUrl = df_row["linkToFilingDetails"]
    itemId = "4-2"
    try:
        # Check if output directory exists, if not create it
        output_dir = f"./data/edgar/8K-item-4.02/{cik}"
        os.makedirs(output_dir, exist_ok=True)
        # Save extracted text to a file ./data/edgar/8K-item-4.02/{cik}/{accessionNo}-item-{itemId}.txt
        file_name = f"{accessionNo}-item-{itemId}.txt"
        file_path = os.path.join(output_dir, file_name)
        # Skip if the file already exists
        if os.path.exists(file_path):
            return

        section_content = extractorApi.get_section(filingUrl, itemId, "text")
        
        with open(file_path, "w") as f:
            f.write(section_content)
    except Exception as e:
        # If e contains 429 (rate limit), retry the request with exponential
        # backoff: wait 0.3s, then 0.6s, then 1.2s
        if "429" in str(e) and retry_count < 3:
            print(f"Retrying for CIK {cik} at {filingUrl}")
            time.sleep(0.3 * 2**retry_count)
            return extract_section_4_02_and_save_to_file(df_row, retry_count + 1)
        else:
            print(f"Failed to extract item {itemId} for CIK {cik} at {filingUrl}")
            print(e)
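
The retry branch above re-issues the request after a pause when the API answers with HTTP 429. A minimal sketch of exponential backoff with jitter as a standalone helper follows. `with_backoff`, `base_delay`, and the injectable `sleep` parameter are hypothetical names for illustration. Matching on "429" in the exception text mirrors the check used above rather than any documented error type.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on rate-limit errors with growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as e:
            # Re-raise anything that is not a rate limit, or when out of retries
            if "429" not in str(e) or attempt == max_retries:
                raise
            # Delay doubles each attempt (0.3s, 0.6s, 1.2s) plus up to 10% jitter
            delay = base_delay * 2**attempt
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise RuntimeError("max_retries must be non-negative")


# Demo with a fake call that is rate limited twice before succeeding
calls = {"count": 0}
waits = []


def flaky_call() -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RuntimeError("HTTP 429 Too Many Requests")
    return "section text"


result = with_backoff(flaky_call, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep=waits.append` records the delays instead of waiting, which keeps the demo fast. The jitter spreads out retries when several parallel workers hit the rate limit at the same moment.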

In [95]:
# Test
single = metadata_2022.iloc[0]
extract_section_4_02_and_save_to_file(single)

In [58]:
!pip -q install pandarallel ipywidgets

In [96]:
from pandarallel import pandarallel

number_of_parallel_downloads = 10

pandarallel.initialize(
    progress_bar=True, nb_workers=number_of_parallel_downloads, verbose=1
)

metadata_2022.parallel_apply(extract_section_4_02_and_save_to_file, axis=1)

print("✅ All sections extracted and saved to files")

✅ All sections extracted and saved to files


## Step 3: Use GPT-4 API to Extract Structured Data

In [98]:
from IPython.display import display, HTML

# Load an extracted section from a sample filing
sample_filing = metadata_2022.iloc[0]
cik = sample_filing["cik"]
accessionNo = sample_filing["accessionNo"]
filingUrl = sample_filing["linkToFilingDetails"]
file_name = f"{accessionNo}-item-4-2.txt"
file_path = f"./data/edgar/8K-item-4.02/{cik}/{file_name}"

with open(file_path, "r") as f:
    sample_section = f.read()

print(f"📄 Sample section from {filingUrl}")
display(HTML(sample_section))

📄 Sample section from https://www.sec.gov/Archives/edgar/data/1740742/000182912622002268/transparentbusiness_8k.htm


In [102]:
print(sample_section)

 Item 4.02 Non-Reliance on Previously Issued Financial Statements or a Related Audit Report or Completed Interim Review. &#160;

On April 30, 2021, the Company filed with the SEC its General Form for Registration of Securities on Form 10-12G, and an amendment to Form 10-12G filed on August 9, 2021, together with all exhibits thereto, which included its Financial Statements as of December 31, 2020 and 2019 . In August 2021 the Former Auditor brought to the attention of the Company potential errors in the accounting related to the Company&#8217;s deferred tax liabilities and certain income tax disclosures. On January 26, 2022, after internal analysis and consultation with its technical accountants and counsel, the management of the Company determined that the Financial Statements should no longer be relied upon because the Company concluded that there was an error related to accounting for deferred tax liabilities. The error is deemed material to the financial statements for the year end
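
The printed section still contains HTML character references such as `&#160;` and `&#8217;`. One option, sketched below with Python's standard `html.unescape`, is to decode them and collapse repeated spaces before placing the text into a prompt. The helper name `clean_section_text` is hypothetical, and whether this step changes extraction quality is untested here.

```python
import html
import re


def clean_section_text(raw: str) -> str:
    """Decode HTML character references and normalize whitespace."""
    text = html.unescape(raw)
    # Non-breaking spaces (from &#160;) become regular spaces
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs, keep paragraph breaks intact
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = " Item 4.02 Non-Reliance &#160;\n\nthe Company&#8217;s deferred tax liabilities"
cleaned = clean_section_text(raw)
print(cleaned)
```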

In [105]:
from openai import OpenAI

openai_client = OpenAI()

prompt = f"""Task: Given the following 8-K filing section 4.02, extract the key components of the disclosure, the identified issue (or issues), affected reporting periods, whether a restatement is necessary or not, the reasons for a restatement, the impact of the error, whether the impact is material or not, the company's auditor, and the event classification. Return the extracted structured data as a JSON object. Only respond with the JSON object, and do not respond with anything else.

Structure of JSON object:
'''
{{
  "keyComponents": "...", // string: key components of the disclosure
  "identifiedIssue": ["..."], // array of strings: identified issue (or issues)
  "affectedReportingPeriods": ["..."], // array of strings: affected reporting periods in format "Q1 2023", "Q2 2023", etc. or "FY 2023"
  "restatementIsNecessary": true|false, // boolean: whether a restatement is necessary or not
  "reasonsForRestatement": ["..."], // array of strings: reasons for a restatement
  "impactOfError": "...", // string: impact of the error
  "impactIsMaterial": true|false, // boolean: whether the impact is material or not
  "auditors": ["..."], // array of strings: company's auditor or auditors. If no auditor is mentioned, return an empty array. If multiple auditors are mentioned, return an array with all auditors.
  "eventClassification": "..." // string: event classification, such as "Financial Restatement Due to Revenue Recognition Errors"
}}
'''

Input text:
'''
{sample_section}
'''

Response:
"""

response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
)

In [110]:
response_string = response.choices[0].message.content
print(response_string)

{
  "keyComponents": "Non-Reliance on Previously Issued Financial Statements due to errors in deferred tax liabilities and income tax disclosures, involving discussions with Former Auditor and a new auditor team, leading to an impending restatement of the financials for FY 2020.",
  "identifiedIssue": [
    "Errors in accounting for deferred tax liabilities",
    "Incorrect netting of deferred tax liability against deferred tax asset due to non-consolidated tax filing status with ITSQuest"
  ],
  "affectedReportingPeriods": [
    "FY 2020"
  ],
  "restatementIsNecessary": true,
  "reasonsForRestatement": [
    "Material errors identified in the accounting for income taxes and deferred tax liabilities",
    "Understatement of deferred tax liabilities and Goodwill by $969,940"
  ],
  "impactOfError": "Material understatement of deferred tax liabilities and Goodwill by $969,940",
  "impactIsMaterial": true,
  "auditors": [
    "Former Auditor",
    "Paris Kreit & Chiu"
  ],
  "eventClassi

In [111]:
import json

# Remove the leading "```json" and trailing "```" markers from GPT's response
response_string = response_string.replace("```json\n", "").replace("```", "").replace("\n", "")
# Convert to JSON object
response_json = json.loads(response_string)
response_json

{'keyComponents': 'Non-Reliance on Previously Issued Financial Statements due to errors in deferred tax liabilities and income tax disclosures, involving discussions with Former Auditor and a new auditor team, leading to an impending restatement of the financials for FY 2020.',
 'identifiedIssue': ['Errors in accounting for deferred tax liabilities',
  'Incorrect netting of deferred tax liability against deferred tax asset due to non-consolidated tax filing status with ITSQuest'],
 'affectedReportingPeriods': ['FY 2020'],
 'restatementIsNecessary': True,
 'reasonsForRestatement': ['Material errors identified in the accounting for income taxes and deferred tax liabilities',
  'Understatement of deferred tax liabilities and Goodwill by $969,940'],
 'impactOfError': 'Material understatement of deferred tax liabilities and Goodwill by $969,940',
 'impactIsMaterial': True,
 'auditors': ['Former Auditor', 'Paris Kreit & Chiu'],
 'eventClassification': 'Financial Restatement Due to Revenue Re
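
Stripping the code fence with chained `replace` calls works for the response above, but a reply with extra prose around the JSON would break `json.loads`. A more defensive sketch: take the text between the first `{` and the last `}`, parse it, and check the keys and types requested in the prompt. `parse_model_json` and `EXPECTED_TYPES` are hypothetical names, and the sample reply is made up for the demo.

```python
import json
from typing import Any, Dict

# Keys and Python types expected from the prompt's JSON structure
EXPECTED_TYPES = {
    "keyComponents": str,
    "identifiedIssue": list,
    "affectedReportingPeriods": list,
    "restatementIsNecessary": bool,
    "reasonsForRestatement": list,
    "impactOfError": str,
    "impactIsMaterial": bool,
    "auditors": list,
    "eventClassification": str,
}


def parse_model_json(reply: str) -> Dict[str, Any]:
    """Extract and validate the JSON object embedded in a model reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start : end + 1])
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key} should be {expected.__name__}")
    return data


# Made-up reply wrapped in a Markdown fence, as chat models often return
sample = {
    "keyComponents": "Example disclosure",
    "identifiedIssue": ["Example issue"],
    "affectedReportingPeriods": ["FY 2020"],
    "restatementIsNecessary": True,
    "reasonsForRestatement": ["Example reason"],
    "impactOfError": "Example impact",
    "impactIsMaterial": True,
    "auditors": [],
    "eventClassification": "Example classification",
}
fence = "`" * 3
sample_reply = f"{fence}json\n{json.dumps(sample, indent=2)}\n{fence}"
parsed = parse_model_json(sample_reply)
print(parsed["affectedReportingPeriods"])
```

Failing loudly on a missing or mistyped key makes it easier to spot and re-run bad extractions when looping over many filings.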

In [112]:
# Load the extracted structured data into a DataFrame
structured_data = pd.json_normalize(response_json)

structured_data

| | keyComponents | identifiedIssue | affectedReportingPeriods | restatementIsNecessary | reasonsForRestatement | impactOfError | impactIsMaterial | auditors | eventClassification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Non-Reliance on Previously Issued Financial St... | [Errors in accounting for deferred tax liabili... | [FY 2020] | True | [Material errors identified in the accounting ... | Material understatement of deferred tax liabil... | True | [Former Auditor, Paris Kreit & Chiu] | Financial Restatement Due to Revenue Recogniti... |


## Step 4: Calculate X-Day Returns for each Filing

- Load daily stock prices for each company from `./data/historical-prices/CIK.csv`
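
A minimal sketch of the return calculation, using only the standard library. It assumes each price file reduces to a date-sorted list of `(date, close)` pairs and that the base price is the last close on or before the filing date. Both are assumptions, since the column layout of `CIK.csv` and the exact return convention are not specified here. `forward_returns` is a hypothetical helper name.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple

HORIZONS = (1, 2, 3, 5, 10, 20)


def forward_returns(
    prices: List[Tuple[date, float]],
    filed_on: date,
    horizons: Tuple[int, ...] = HORIZONS,
) -> Dict[int, Optional[float]]:
    """Return simple returns N trading days after the filing date.

    prices must be sorted by date. The base price is the last close on or
    before filed_on; horizons count trading days (rows), not calendar days.
    """
    dates = [d for d, _ in prices]
    base_idx = bisect_right(dates, filed_on) - 1
    if base_idx < 0:
        # No price available on or before the filing date
        return {h: None for h in horizons}
    base_close = prices[base_idx][1]
    returns: Dict[int, Optional[float]] = {}
    for h in horizons:
        idx = base_idx + h
        # None marks horizons that run past the end of the price history
        returns[h] = prices[idx][1] / base_close - 1 if idx < len(prices) else None
    return returns


# Synthetic example: four trading days of closes around a filing on 2022-01-04
prices = [
    (date(2022, 1, 3), 100.0),
    (date(2022, 1, 4), 100.0),
    (date(2022, 1, 5), 90.0),
    (date(2022, 1, 6), 95.0),
]
example_returns = forward_returns(prices, date(2022, 1, 4), horizons=(1, 2, 3))
print(example_returns)
```

Filings submitted after the market close (many `filedAt` timestamps above are after 16:00 Eastern) are not reflected in that day's close, while pre-market filings are, so the choice of base day deserves a deliberate decision in a real analysis.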

## Step 5: Enrich Data with Fundamental Data

- Load market cap, years since IPO, sector, industry, and other fundamental data for each CIK from WRDS in `./data/fundamentals/fundamentals.csv`
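
One way to attach fundamentals to a filing is an as-of lookup: for the filing's CIK, take the most recent fundamentals record dated on or before `filedAt`. The sketch below assumes hypothetical field names (`cik`, `asOf`, `marketCap`, `ipoDate`), because the actual columns of `fundamentals.csv` are not shown in this notebook. All values are made up.

```python
from datetime import date
from typing import Dict, List, Optional


def fundamentals_as_of(
    records: List[Dict], cik: str, filed_on: date
) -> Optional[Dict]:
    """Pick the latest record for cik dated on or before filed_on."""
    candidates = [
        r for r in records if r["cik"] == cik and r["asOf"] <= filed_on
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda r: r["asOf"])
    # Years since IPO measured at the filing date, in fractional years
    years_since_ipo = (filed_on - latest["ipoDate"]).days / 365.25
    return {**latest, "yearsSinceIpo": round(years_since_ipo, 2)}


# Made-up records for a single placeholder company
ipo = date(2020, 1, 31)
records = [
    {"cik": "TEST-CIK", "asOf": date(2021, 9, 30), "marketCap": 1.0e8, "ipoDate": ipo},
    {"cik": "TEST-CIK", "asOf": date(2021, 12, 31), "marketCap": 1.2e8, "ipoDate": ipo},
    {"cik": "TEST-CIK", "asOf": date(2022, 3, 31), "marketCap": 0.9e8, "ipoDate": ipo},
]
row = fundamentals_as_of(records, "TEST-CIK", date(2022, 1, 31))
print(row["asOf"], row["marketCap"], row["yearsSinceIpo"])
```

Restricting the lookup to records dated on or before the filing avoids look-ahead bias, since later fundamentals would already reflect the market's reaction to the disclosure.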

## Step 6: Analyze and Visualize Data
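
A minimal sketch of the descriptive-statistics part of this step, using Python's `statistics` module. It assumes each filing has been reduced to a dict holding the extracted `impactIsMaterial` flag and a hypothetical `returns` mapping from horizon (in days) to return. The numbers below are synthetic and only illustrate the shape of the output.

```python
from statistics import mean, median, stdev
from typing import Dict, List


def describe_returns(filings: List[Dict], horizon: int) -> Dict[bool, Dict[str, float]]:
    """Summarize returns at one horizon, split by the impactIsMaterial flag."""
    groups: Dict[bool, List[float]] = {True: [], False: []}
    for f in filings:
        r = f["returns"].get(horizon)
        if r is not None:  # skip filings without a price at this horizon
            groups[f["impactIsMaterial"]].append(r)
    summary = {}
    for flag, values in groups.items():
        if len(values) < 2:
            continue  # stdev needs at least two observations
        summary[flag] = {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "std": stdev(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Synthetic returns, not taken from any real filing
filings = [
    {"impactIsMaterial": True, "returns": {1: -0.08, 5: -0.12}},
    {"impactIsMaterial": True, "returns": {1: -0.02, 5: None}},
    {"impactIsMaterial": True, "returns": {1: -0.05, 5: -0.06}},
    {"impactIsMaterial": False, "returns": {1: 0.01, 5: 0.02}},
    {"impactIsMaterial": False, "returns": {1: -0.01, 5: 0.00}},
]
stats_1d = describe_returns(filings, horizon=1)
print(stats_1d[True]["count"], stats_1d[False]["count"])
```

Running the same function for each of the horizons 1, 2, 3, 5, 10, and 20 gives one summary per day, which can then feed a histogram per horizon.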

## End Result

![Return Histogram](assets/returns-histogram-4.02.png)

![Descriptive Stats](assets/desc-stats-returns-4.02.png)