# GPT-4 API: Extract Structured Data from SEC Filings (Material Event Disclosures)

Data sources used:
- SEC-API.io
- AlgoSeek
- WRDS Fundamental Data

Logic:

- Query API to download metadata of 8-Ks with Item 4.02 => create file `8K-filing-metadata.csv`
- Extractor API to extract and download Item 4.02 sections => save to `./data/edgar/8K-item-4.02/CIK/ACCESSION-NO-item-4-2.txt`
- Pre-select 100 filings that deal with revenue recognition errors, split evenly between material impact (50%) and non-material impact (50%).
- For each item section, extract structured data using GPT4 and save to `./data/edgar/8K-item-4.02/CIK/ACCESSION-NO-item-4-2-structured-data.json`
- For each CIK and `filedAt`, read `Years since IPO` and `market cap` from `fundamentals`
- For each filing, calculate 1/2/3/5/10/20 day return after filing was disclosed
  - Plot return distributions for each day
  - Display descriptive stats
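
The pre-selection step in the list above could look roughly like the sketch below. It assumes the structured data is already in a DataFrame with the `eventClassification` and `impactIsMaterial` columns that appear later in this notebook. The helper name `select_balanced_sample` and the keyword match on "revenue recognition" are illustrative assumptions, not necessarily how the 100 filings were actually chosen.

```python
import pandas as pd


def select_balanced_sample(df, n_total=100, seed=42):
    """Pick up to n_total revenue-recognition filings, half material, half not."""
    # Keep only filings whose classification mentions revenue recognition
    mask = df["eventClassification"].str.contains(
        "revenue recognition", case=False, na=False
    )
    candidates = df[mask]
    per_group = n_total // 2
    parts = []
    for is_material in (True, False):
        group = candidates[candidates["impactIsMaterial"] == is_material]
        # Sample without replacement; take the whole group if it is too small
        parts.append(group.sample(n=min(per_group, len(group)), random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


# Tiny synthetic example (not real filing data)
demo = pd.DataFrame(
    {
        "eventClassification": [
            "Financial Restatement Due to Revenue Recognition Errors",
            "Financial Restatement Due to Revenue Recognition Errors",
            "Financial Restatement Due to Errors in Tax Accounting",
            "Financial Restatement Due to Revenue Recognition Errors",
        ],
        "impactIsMaterial": [True, False, True, True],
    }
)
balanced = select_balanced_sample(demo, n_total=2)
print(balanced)
```

If one group has fewer than 50 candidates, this sketch returns fewer than 100 rows rather than oversampling.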

More:
- https://sec-api.io/resources/analyze-8-k-filings-and-material-event-disclosure-activity

## Step 1: Locate URLs of 8-K Filings with Item 4.02

In [3]:
# load OPENAI_API_KEY value from .env file
from dotenv import load_dotenv

load_dotenv()

True

In [4]:
!pip -q install sec-api

In [5]:
import os
from sec_api import QueryApi
import pandas as pd

SEC_API_KEY = os.getenv("SEC_API_KEY")

queryApi = QueryApi(api_key=SEC_API_KEY)

payload = {
    "query": 'formType:"8-K" AND items:"4.02" AND filedAt:[2022-01-01 TO 2022-12-31]',
    "from": "0",
    "size": "50",
    "sort": [{"filedAt": {"order": "desc"}}],
}

response = queryApi.get_filings(payload)

# Convert JSON array to DataFrame
metadata_sample = pd.DataFrame(response["filings"])

columns_of_interest = [
    "formType",
    "filedAt",
    "accessionNo",
    "ticker",
    "cik",
    "companyName",
    "items",
    "linkToFilingDetails",
]

metadata_sample[columns_of_interest].head(5)

| | formType | filedAt | accessionNo | ticker | cik | companyName | items | linkToFilingDetails |
|---|---|---|---|---|---|---|---|---|
| 0 | 8-K | 2022-12-27T16:20:13-05:00 | 0001213900-22-082892 | POLCQ | 1810140 | Polished.com Inc. | [Item 4.01: Changes in Registrant's Certifying... | https://www.sec.gov/Archives/edgar/data/181014... |
| 1 | 8-K | 2022-12-22T16:30:53-05:00 | 0001493152-22-036320 | GLLI | 1888734 | GLOBALINK INVESTMENT INC. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/188873... |
| 2 | 8-K | 2022-12-21T17:25:38-05:00 | 0001213900-22-081850 | CLRC | 1903392 | ClimateRock | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/190339... |
| 3 | 8-K | 2022-12-16T16:17:36-05:00 | 0001185185-22-001426 | FEIM | 39020 | FREQUENCY ELECTRONICS INC | [Item 2.02: Results of Operations and Financia... | https://www.sec.gov/Archives/edgar/data/39020/... |
| 4 | 8-K | 2022-12-16T16:05:27-05:00 | 0001493152-22-035733 | VTRO | 793171 | Vitro Biopharma, Inc. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/793171... |


In [10]:
# Get first 50 filings of each month in 2022 (for demonstration purposes only)
# In practice, you would want to get all filings for the year (or a longer period of time)
def get_filings_metadata(year=2022):
    metadata_list = []

    for month in range(1, 13):
        # Iterate over "from" in 50 increments to get all filings for the year
        # for i in range(0, 10000, 50): # Uncomment to get all filings for the year
        for i in range(
            0, 50, 50
        ):  # For demonstration purposes, only get first 50 filings of each month
            payload = {
                "query": f'formType:"8-K" AND items:"4.02" AND filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-31]',
                "from": i,
                "size": "50",
                "sort": [{"filedAt": {"order": "desc"}}],
            }
            response = queryApi.get_filings(payload)
            if len(response["filings"]) == 0:
                break
            metadata_list.append(response["filings"])

        print(f"✅ Month {month:02d} completed")

    # Flatten the list of lists: [[1, 2], [3, 4]] -> [1, 2, 3, 4]
    metadata_list = [item for sublist in metadata_list for item in sublist]
    # Convert JSON array to DataFrame
    metadata_df = pd.DataFrame(metadata_list)
    # Drop duplicates 
    metadata_df = metadata_df.drop_duplicates(subset=["accessionNo"])

    print(f"✅ Done. Total filings: {len(metadata_list)}")

    return metadata_df[
        [
            "formType",
            "filedAt",
            "accessionNo",
            "ticker",
            "cik",
            "companyName",
            "items",
            "linkToFilingDetails",
        ]
    ]

In [11]:
metadata_2022 = get_filings_metadata(2022)

✅ Month 01 completed
✅ Month 02 completed
✅ Month 03 completed
✅ Month 04 completed
✅ Month 05 completed
✅ Month 06 completed
✅ Month 07 completed
✅ Month 08 completed
✅ Month 09 completed
✅ Month 10 completed
✅ Month 11 completed
✅ Month 12 completed
✅ Done. Total filings: 305
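
The query in `get_filings_metadata` uses day 31 as the upper bound for every month, including months that have fewer days. A variant that computes each month's real last day with `calendar.monthrange` avoids sending dates that do not exist. `build_monthly_query` is a hypothetical helper name; the query syntax is copied from the cell above.

```python
import calendar


def build_monthly_query(year, month):
    """Build the 8-K Item 4.02 query string for one calendar month."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year}-{month:02d}-01"
    end = f"{year}-{month:02d}-{last_day:02d}"
    return f'formType:"8-K" AND items:"4.02" AND filedAt:[{start} TO {end}]'


print(build_monthly_query(2022, 2))
# formType:"8-K" AND items:"4.02" AND filedAt:[2022-02-01 TO 2022-02-28]
```

The returned string could be passed as the `"query"` value of the payload in place of the inline f-string.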


In [16]:
metadata_2022.head(5)

| | formType | filedAt | accessionNo | ticker | cik | companyName | items | linkToFilingDetails |
|---|---|---|---|---|---|---|---|---|
| 0 | 8-K | 2022-01-31T21:35:24-05:00 | 0001829126-22-002268 | | 1740742 | TransparentBusiness, Inc. | [Item 4.01: Changes in Registrant's Certifying... | https://www.sec.gov/Archives/edgar/data/174074... |
| 1 | 8-K | 2022-01-31T08:50:59-05:00 | 0001104659-22-009285 | OEPW | 1824677 | One Equity Partners Open Water I Corp. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/182467... |
| 2 | 8-K | 2022-01-31T07:03:59-05:00 | 0001564590-22-003120 | BIOCQ | 1044378 | BIOCEPT INC | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/104437... |
| 3 | 8-K | 2022-01-28T17:40:28-05:00 | 0001213900-22-004297 | SCAQ | 1821812 | Stratim Cloud Acquisition Corp. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/182181... |
| 4 | 8-K | 2022-01-28T17:02:06-05:00 | 0001193125-22-021724 | DTRT | 1865537 | DTRT Health Acquisition Corp. | [Item 4.02: Non-Reliance on Previously Issued ... | https://www.sec.gov/Archives/edgar/data/186553... |


In [17]:
# Save the DataFrame to a CSV file ./data/edgar/8K-filing-metadata.csv
metadata_2022.to_csv("./data/edgar/8K-filing-metadata.csv", index=False)

## Step 2: Download Item 4.02 Sections

- Use `pandarallel` to parallelize downloading of Item 4.02 sections and speed up the process.

In [18]:
from sec_api import ExtractorApi
import time

extractorApi = ExtractorApi(SEC_API_KEY)


def extract_section_4_02_and_save_to_file(df_row, retry_count=0):
    cik = df_row["cik"]
    accessionNo = df_row["accessionNo"]
    filingUrl = df_row["linkToFilingDetails"]
    itemId = "4-2"
    try:
        # Check if output directory exists, if not create it
        output_dir = f"./data/edgar/8K-item-4.02/{cik}"
        os.makedirs(output_dir, exist_ok=True)
        # Save extracted text to ./data/edgar/8K-item-4.02/{cik}/{accessionNo}-item-{itemId}.txt
        file_name = f"{accessionNo}-item-{itemId}.txt"
        file_path = os.path.join(output_dir, file_name)
        # Skip if the file already exists
        if os.path.exists(file_path):
            return

        section_content = extractorApi.get_section(filingUrl, itemId, "text")
        
        with open(file_path, "w") as f:
            f.write(section_content)
    except Exception as e:
        # If the error mentions HTTP 429 (rate limit), retry after an exponentially
        # growing delay: 0.3 s, 0.6 s, then 1.2 s
        if "429" in str(e) and retry_count < 3:
            print(f"Retrying for CIK {cik} at {filingUrl}")
            time.sleep(0.3 * 2**retry_count)
            return extract_section_4_02_and_save_to_file(df_row, retry_count + 1)
        else:
            print(f"Failed to extract item {itemId} for CIK {cik} at {filingUrl}")
            print(e)

In [19]:
# Test
single = metadata_2022.iloc[0]
extract_section_4_02_and_save_to_file(single)

In [58]:
!pip -q install pandarallel ipywidgets

In [20]:
from pandarallel import pandarallel

number_of_parallel_downloads = 10

pandarallel.initialize(
    progress_bar=True, nb_workers=number_of_parallel_downloads, verbose=1
)

metadata_2022.parallel_apply(extract_section_4_02_and_save_to_file, axis=1)

print("✅ All sections extracted and saved to files")

✅ All sections extracted and saved to files
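
Because `extract_section_4_02_and_save_to_file` only prints failures, it can help to check afterwards which filings still have no section file. The sketch below assumes the file layout used above (`{cik}/{accessionNo}-item-4-2.txt`). `find_missing_sections` is a hypothetical helper, and the demo runs against a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path


def find_missing_sections(filings, base_dir="./data/edgar/8K-item-4.02"):
    """Return (cik, accessionNo) pairs whose Item 4.02 text file is absent or empty."""
    missing = []
    for cik, accession_no in filings:
        path = Path(base_dir) / str(cik) / f"{accession_no}-item-4-2.txt"
        if not path.exists() or path.stat().st_size == 0:
            missing.append((cik, accession_no))
    return missing


# Demo: one filing has a saved section, the other does not
with tempfile.TemporaryDirectory() as tmp:
    ok_dir = Path(tmp) / "1740742"
    ok_dir.mkdir()
    (ok_dir / "0001829126-22-002268-item-4-2.txt").write_text("Item 4.02 ...")
    demo_filings = [
        (1740742, "0001829126-22-002268"),
        (1824677, "0001104659-22-009285"),
    ]
    missing_demo = find_missing_sections(demo_filings, base_dir=tmp)

print(missing_demo)
# [(1824677, '0001104659-22-009285')]
```

With the notebook's data, the input could be built as `zip(metadata_2022["cik"], metadata_2022["accessionNo"])`.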


## Step 3: Use GPT-4 API to Extract Structured Data

In [21]:
from IPython.display import display, HTML

# Load an extracted section from a sample filing
sample_filing = metadata_2022.iloc[0]
cik = sample_filing["cik"]
accessionNo = sample_filing["accessionNo"]
filingUrl = sample_filing["linkToFilingDetails"]
file_name = f"{accessionNo}-item-4-2.txt"
file_path = f"./data/edgar/8K-item-4.02/{cik}/{file_name}"

with open(file_path, "r") as f:
    sample_section = f.read()

print(f"📄 Sample section from {filingUrl}")
display(HTML(sample_section))

📄 Sample section from https://www.sec.gov/Archives/edgar/data/1740742/000182912622002268/transparentbusiness_8k.htm


In [22]:
from openai import OpenAI

openai_client = OpenAI()

prompt = f"""Task: Given the following 8-K filing section 4.02, extract the key components of the disclosure, the identified issue (or issues), affected reporting periods, whether a restatement is necessary or not, the reasons for a restatement, the impact of the error, whether the impact is material or not, the company's auditor, and the event classification. Return the extracted structured data as a JSON object. Only respond with the JSON object, and do not respond with anything else.

Structure of JSON object:
'''
{{
  "keyComponents": "...", // string: key components of the disclosure
  "identifiedIssue": ["..."], // array of strings: identified issue (or issues)
  "affectedReportingPeriods": ["..."], // array of strings: affected reporting periods in format "Q1 2023", "Q2 2023", etc. or "FY 2023"
  "restatementIsNecessary": true|false, // boolean: whether a restatement is necessary or not
  "reasonsForRestatement": ["..."], // array of strings: reasons for a restatement
  "impactOfError": "...", // string: impact of the error
  "impactIsMaterial": true|false, // boolean: whether the impact is material or not
  "auditors": ["..."], // array of strings: company's auditor or auditors. If no auditor is mentioned, return an empty array. If multiple auditors are mentioned, return an array with all auditors.
  "eventClassification": "..." // string: event classification, such as "Financial Restatement Due to Revenue Recognition Errors"
}}
'''

Input text:
'''
{sample_section}
'''

Response:
"""

response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
)

In [23]:
response_string = response.choices[0].message.content
print(response_string)

{
  "keyComponents": "Non-reliance on previously issued financial statements due to material accounting errors, need for restatement identified",
  "identifiedIssue": [
    "Incorrect accounting for deferred tax liabilities",
    "Errors in certain income tax disclosures"
  ],
  "affectedReportingPeriods": [
    "FY 2020"
  ],
  "restatementIsNecessary": true,
  "reasonsForRestatement": [
    "Material error in accounting for deferred tax liabilities",
    "Understated deferred tax liabilities and Goodwill"
  ],
  "impactOfError": "Financial statements for the fiscal year 2020 are materially misstated, affecting tax liabilities and Goodwill accounting.",
  "impactIsMaterial": true,
  "auditors": [
    "Former Auditor",
    "Paris Kreit & Chiu"
  ],
  "eventClassification": "Financial Restatement Due to Errors in Tax Accounting"
}


In [24]:
import json

# Remove the leading "```json" and trailing "```" markers from the GPT response
response_string = response_string.replace("```json\n", "").replace("```", "").replace("\n", "")
# Convert to JSON object
response_json = json.loads(response_string)
response_json

{'keyComponents': 'Non-reliance on previously issued financial statements due to material accounting errors, need for restatement identified',
 'identifiedIssue': ['Incorrect accounting for deferred tax liabilities',
  'Errors in certain income tax disclosures'],
 'affectedReportingPeriods': ['FY 2020'],
 'restatementIsNecessary': True,
 'reasonsForRestatement': ['Material error in accounting for deferred tax liabilities',
  'Understated deferred tax liabilities and Goodwill'],
 'impactOfError': 'Financial statements for the fiscal year 2020 are materially misstated, affecting tax liabilities and Goodwill accounting.',
 'impactIsMaterial': True,
 'auditors': ['Former Auditor', 'Paris Kreit & Chiu'],
 'eventClassification': 'Financial Restatement Due to Errors in Tax Accounting'}
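
The chained `replace` calls above work for this response, but they also strip every newline and would remove backticks inside string values. A slightly more defensive sketch is shown below: it strips only a leading and trailing code fence, then checks that the keys requested in the prompt are present. The helper names `parse_json_response` and `missing_keys` are hypothetical.

```python
import json
import re

# Keys requested in the prompt's JSON structure
EXPECTED_KEYS = {
    "keyComponents",
    "identifiedIssue",
    "affectedReportingPeriods",
    "restatementIsNecessary",
    "reasonsForRestatement",
    "impactOfError",
    "impactIsMaterial",
    "auditors",
    "eventClassification",
}


def parse_json_response(text):
    """Parse a model reply that may wrap its JSON object in a Markdown code fence."""
    cleaned = text.strip()
    # Drop an opening fence such as ```json and a closing ``` if present
    cleaned = re.sub(r"^```[a-zA-Z]*\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)


def missing_keys(obj):
    """Return the expected keys that the parsed object lacks, sorted by name."""
    return sorted(EXPECTED_KEYS - set(obj))


fenced_reply = '```json\n{"impactIsMaterial": true, "auditors": ["KPMG LLP"]}\n```'
parsed = parse_json_response(fenced_reply)
print(parsed)
print(missing_keys(parsed))
```

`json.loads` still raises `json.JSONDecodeError` on malformed output, so a batch run would need to catch that and log or retry the filing.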

In [26]:
# Load the extracted structured data into a DataFrame
structured_data_sample = pd.json_normalize(response_json)
structured_data_sample

| | keyComponents | identifiedIssue | affectedReportingPeriods | restatementIsNecessary | reasonsForRestatement | impactOfError | impactIsMaterial | auditors | eventClassification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Non-reliance on previously issued financial st... | [Incorrect accounting for deferred tax liabili... | [FY 2020] | True | [Material error in accounting for deferred tax... | Financial statements for the fiscal year 2020 ... | True | [Former Auditor, Paris Kreit & Chiu] | Financial Restatement Due to Errors in Tax Acc... |
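
To run the extraction for every downloaded section and save each result next to its source file (the `...-item-4-2-structured-data.json` path from the outline at the top), a loop along the following lines could be used. This is a sketch under assumptions: `extract_and_save` is a hypothetical helper, and `extract_fn` stands in for a function that sends the prompt to GPT-4 and returns the parsed dictionary. The demo uses a stub so that it runs without an API key.

```python
import json
import tempfile
from pathlib import Path


def extract_and_save(cik, accession_no, extract_fn, base_dir="./data/edgar/8K-item-4.02"):
    """Run extract_fn on one saved Item 4.02 section and cache the result as JSON."""
    folder = Path(base_dir) / str(cik)
    section_path = folder / f"{accession_no}-item-4-2.txt"
    output_path = folder / f"{accession_no}-item-4-2-structured-data.json"
    # Reuse a cached result so reruns do not repeat paid API calls
    if output_path.exists():
        return json.loads(output_path.read_text())
    structured = extract_fn(section_path.read_text())
    output_path.write_text(json.dumps(structured, indent=2))
    return structured


def fake_extract(section_text):
    """Stand-in for the GPT-4 request; returns a fixed-shape dictionary."""
    return {"impactIsMaterial": True, "sectionLength": len(section_text)}


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "1740742"
    folder.mkdir()
    (folder / "0001829126-22-002268-item-4-2.txt").write_text("Item 4.02 text")
    first = extract_and_save(1740742, "0001829126-22-002268", fake_extract, base_dir=tmp)
    # The second call is served from the cached JSON file
    second = extract_and_save(1740742, "0001829126-22-002268", fake_extract, base_dir=tmp)

print(first == second)
# True
```

Caching per filing means an interrupted run can resume without reprocessing files that already have a JSON result.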


In [40]:
# To speed up things, I prepared a CSV file with the extracted structured data
# for all 8-K filings in 2022 with Item 4.02. Let's load it into a DataFrame.
from ast import literal_eval

structured_data_2022 = pd.read_csv(
    "./data/edgar/8K-4.02-structured-data-and-metadata-2022.csv",
    parse_dates=["filedAt"],
    # Convert the stringified lists back to lists. literal_eval only accepts
    # Python literals, so unlike eval it cannot execute arbitrary code.
    converters={
        "identifiedIssue": literal_eval,
        "affectedReportingPeriods": literal_eval,
        "reasonsForRestatement": literal_eval,
        "auditors": literal_eval,
    },
)
structured_data_2022.head()

| | keyComponents | identifiedIssue | affectedReportingPeriods | restatementIsNecessary | reasonsForRestatement | impactOfError | impactIsMaterial | auditors | eventClassification | cik | accessionNo | ticker | filedAt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Identification of errors in financial statemen... | [Misclassification of certain expenses and rec... | [Q1 2022, FY 2021] | True | [To correct classification of certain expenses... | Material weakness in design and operation of e... | True | [Ernst & Young LLP] | Financial Restatement Due to Misclassification... | 1005286 | 0001005286-22-000049 | LFCR | 2022-09-13 21:45:06-04:00 |
| 1 | The Board of Intellicheck, Inc. has determined... | [Misclassification of certain option awards as... | [Q3 2020, Q1 2021, Q2 2021, Q3 2021, FY 2020, ... | True | [Change in classification of option awards, Ad... | The errors are expected to increase accrued li... | True | [Independent Registered Public Accounting Firm] | Financial Restatement Due to Misclassification... | 1040896 | 0001493152-22-014655 | IDN | 2022-05-20 17:26:31-04:00 |
| 2 | Non-Reliance on Previously Issued Financial St... | [Failure to accrue for certain expenses incurred] | [Q3 2021] | True | [Failure to accrue for expenses estimated betw... | Underreported expenses by approximately $1.0 t... | True | [Mayer Hoffman McCann P.C.] | Financial Restatement Due to Expense Recogniti... | 1044378 | 0001564590-22-003120 | BIOCQ | 2022-01-31 07:03:59-05:00 |
| 3 | Identification of an accounting error in previ... | [Error in the application of Accounting Standa... | [Q1 2022, Q2 2022] | True | [To correct misstatements in financial stateme... | Overstatement of both fleet new vehicle revenu... | True | [KPMG LLP] | Financial Restatement Due to Revenue Recogniti... | 1043509 | 0001043509-22-000016 | SAH | 2022-10-28 17:00:49-04:00 |
| 4 | Non-Reliance on Previously Issued Financial St... | [Financial records of Human Brands were defici... | [Q2 2021] | True | [Deficiencies in the financial records of Huma... | Non-compliance with Regulation S-X requirement... | True | [B.F Borgers C.P.A.] | Financial Restatement Due to Non-Compliance wi... | 1058330 | 0001903596-22-000275 | ROAG | 2022-05-10 16:53:49-04:00 |


### Auditors Involved in Material Financial Restatements

In [48]:
# Quick look at auditors involved in material restatements
auditors = (
    structured_data_2022[
        structured_data_2022["restatementIsNecessary"]
        & structured_data_2022["impactIsMaterial"]
    ]["auditors"]
    .explode()
    .fillna("NaN")
)

# Remove "Independent registered public accounting firm", "Not specified" and others from list
auditors = auditors[
    ~auditors.isin(
        [
            "Independent registered public accounting firm",
            "Not specified",
            "Not explicitly mentioned",
            "Independent accountant",
            "NaN",
        ]
    )
    # Or if auditor name includes phrase "independent registered public"
    & ~auditors.str.contains("independent registered public", case=False)
    & ~auditors.str.contains("Not explicitly mentioned", case=False)
].reset_index(drop=True)

auditors_count = auditors.value_counts().to_frame().reset_index()
auditors_count["pct"] = auditors_count["count"] / auditors_count["count"].sum() * 100
auditors_count["pct"] = auditors_count["pct"].round(2)

print("🔍 Top 10 auditors involved in material restatements in 2022:")
auditors_count.head(10)

🔍 Top 10 auditors involved in material restatements in 2022:


| | auditors | count | pct |
|---|---|---|---|
| 0 | WithumSmith+Brown, PC | 39 | 16.88 |
| 1 | Marcum LLP | 35 | 15.15 |
| 2 | BDO USA, LLP | 18 | 7.79 |
| 3 | Ernst & Young LLP | 13 | 5.63 |
| 4 | PricewaterhouseCoopers LLP | 8 | 3.46 |
| 5 | WithumSmith+Brown, P.C. | 7 | 3.03 |
| 6 | KPMG LLP | 6 | 2.6 |
| 7 | Friedman LLP | 6 | 2.6 |
| 8 | UHY LLP | 5 | 2.16 |
| 9 | RSM US LLP | 5 | 2.16 |
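
The table lists "WithumSmith+Brown, PC" and "WithumSmith+Brown, P.C." as separate auditors because GPT-4 returns names as they are written in each filing. One possible fix is to normalize names before counting. The rule below (lowercase, drop periods and commas, collapse whitespace) is an assumption chosen for illustration; it would not merge variants that differ in wording.

```python
import re
from collections import Counter


def normalize_auditor_name(name):
    """Collapse punctuation and case differences so name variants compare equal."""
    key = name.lower()
    key = re.sub(r"[.,]", "", key)  # "P.C." becomes "pc", commas are dropped
    key = re.sub(r"\s+", " ", key).strip()
    return key


raw_names = ["WithumSmith+Brown, PC", "WithumSmith+Brown, P.C.", "Marcum LLP"]
counts = Counter(normalize_auditor_name(n) for n in raw_names)
print(counts.most_common())
# [('withumsmith+brown pc', 2), ('marcum llp', 1)]
```

In the notebook, the same function could be applied with `auditors.map(normalize_auditor_name)` before calling `value_counts()`.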


## Step 4: Calculate X-Day Returns for each Filing

- Load daily stock prices for each company from `./data/historical-prices/CIK.csv`

In [None]:
# Load CSV ./gpt-4-api-extract-data-from-sec-filings/data/historical-prices/daily-price-vol-all-4.02.csv
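
The return calculation itself is not shown in this notebook. The sketch below illustrates one way to compute 1/2/3/5/10/20-day returns, assuming a per-company price table with a sorted `DatetimeIndex` and a `close` column; the real layout of the price files may differ. The function name and the choice of the first trading day on or after the filing date as the base price are assumptions.

```python
import pandas as pd

HORIZONS = [1, 2, 3, 5, 10, 20]


def forward_returns(prices, filed_date, horizons=HORIZONS):
    """Return close-to-close returns N trading days after the filing date.

    prices: DataFrame with a sorted DatetimeIndex and a "close" column.
    """
    closes = prices["close"]
    # Position of the first trading day on or after the filing date
    start = closes.index.searchsorted(pd.Timestamp(filed_date), side="left")
    result = {}
    for n in horizons:
        if start + n < len(closes):
            result[f"ret_{n}d"] = closes.iloc[start + n] / closes.iloc[start] - 1
        else:
            result[f"ret_{n}d"] = float("nan")  # not enough price history
    return result


# Synthetic prices: 30 business days, closing price rises by 1 per day from 100
dates = pd.bdate_range("2022-01-03", periods=30)
demo_prices = pd.DataFrame({"close": [100.0 + i for i in range(30)]}, index=dates)
demo_returns = forward_returns(demo_prices, "2022-01-03")
print(demo_returns)
```

Many 8-Ks are filed after the market closes, so a real analysis may prefer a different base day, for example the last close before the market could react.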

## Step 5: Enrich Data with Fundamental Data

- Load market cap, years since IPO, sector, industry, and other fundamental data for each CIK from WRDS in `./data/fundamentals/fundamentals.csv`
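
Matching each filing to the most recent fundamentals record on or before its `filedAt` date could be done with `pandas.merge_asof`. The sketch below uses made-up values, and the column names `asOfDate` and `marketCap` are assumptions about the layout of `fundamentals.csv`.

```python
import pandas as pd

filings = pd.DataFrame(
    {
        "cik": [1044378, 1043509],
        "filedAt": pd.to_datetime(["2022-01-31", "2022-10-28"]),
    }
)
# Synthetic fundamentals: column names are assumptions, values are made up
fundamentals = pd.DataFrame(
    {
        "cik": [1044378, 1044378, 1043509],
        "asOfDate": pd.to_datetime(["2021-12-31", "2022-03-31", "2022-09-30"]),
        "marketCap": [50.0, 40.0, 2000.0],
    }
)

# merge_asof requires both frames to be sorted by their time key
enriched = pd.merge_asof(
    filings.sort_values("filedAt"),
    fundamentals.sort_values("asOfDate"),
    left_on="filedAt",
    right_on="asOfDate",
    by="cik",
    direction="backward",  # latest record on or before the filing date
)
print(enriched)
```

The `filedAt` values in this notebook carry a UTC offset, so they would need to be converted to the same timezone-naive or timezone-aware dtype as the fundamentals dates before merging.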

## Step 6: Analyze and Visualize Data
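
A minimal sketch of the descriptive statistics for this step, assuming the per-filing returns sit in columns such as `ret_1d` and `ret_5d` (hypothetical names) next to `impactIsMaterial`. The numbers below are synthetic and are not results from this analysis.

```python
import pandas as pd

# Synthetic returns for three filings (made-up values)
returns = pd.DataFrame(
    {
        "ret_1d": [-0.10, 0.02, -0.04],
        "ret_5d": [-0.15, 0.01, -0.02],
        "impactIsMaterial": [True, False, True],
    }
)

return_cols = ["ret_1d", "ret_5d"]
# Count, mean, standard deviation, and quantiles per horizon
overall_stats = returns[return_cols].describe()
# Mean return per horizon, split by materiality
mean_by_materiality = returns.groupby("impactIsMaterial")[return_cols].mean()

print(overall_stats)
print(mean_by_materiality)
```

With matplotlib installed, `returns[return_cols].hist(bins=50)` would draw one histogram per horizon.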

## End Result

![Return Histogram](assets/returns-histogram-4.02.png)

![Descriptive Stats](assets/desc-stats-returns-4.02.png)