# 🧠 Introduction to Automated Data Extraction from SEC Filings with Python and OpenAI
Welcome to this hands-on coding tutorial designed for research fellows who are new to Python programming and AI-driven data extraction. In this notebook, we will walk through how to write a Python script that leverages large language models to extract structured information from **SEC Form 3 filings**, a common document in finance and regulatory research.

## 🎯 Objective
Our goal is to automatically extract key details such as the insider’s name, their role(s), company information, and the filing date from a raw Form 3 text file. We’ll use OpenAI’s API to interpret unstructured text, and validate the results using a strict schema defined with Python’s pydantic library.

## 🛠️ What You'll Learn
This notebook will help you:

- Load and read a raw SEC filing text file.

- Securely access API keys using environment variables.

- Use the openai Python client to make structured API calls.

- Define and validate response schemas using pydantic.

- Build system prompts to guide a language model for information extraction.

- Parse and handle structured responses from a language model.

## 📦 Key Python Concepts Covered
- **File handling:** Reading text data from a file.

- **Environment management:** Loading sensitive credentials with dotenv.

- **API usage:** Sending structured requests to OpenAI's language models.

- **Data validation:** Using pydantic to enforce structured output.

- **Typing and classes:** Building robust and type-safe data models.

## 📄 About SEC Form 3
Form 3 is the initial statement of beneficial ownership filed with the U.S. Securities and Exchange Commission (SEC). Corporate insiders, such as directors, officers, and large shareholders, file it to disclose the stake they hold in a publicly traded company. These forms are important for legal compliance, research, and transparency.

## 💡 Why Use Language Models?
SEC filings are often written in inconsistent formats, mixing structured XML and freeform text. Traditional parsers struggle with this ambiguity. By using a language model, we can extract structured fields from messy text without writing dozens of custom rules.

# 📓 Jupyter Notebook Cheatsheet

### ▶️ Running Code
- **Shift + Enter**: Run the current cell and move to the next
- **Ctrl + Enter**: Run the current cell and stay there
- **Alt + Enter**: Run current cell and insert a new one below

---

### ➕ Adding / Deleting Cells
- **A**: Add cell above (in command mode)
- **B**: Add cell below
- **D, D**: Delete cell (press 'D' twice)
- **M**: Convert to Markdown
- **Y**: Convert to Code

(Ensure you're in **Command Mode** — press `Esc` if unsure)

---

### 🧠 Markdown Basics (for text cells)
- `# Header 1`, `## Header 2`, etc.
- `**bold**`, `_italic_`, `code` (inline)
- Triple backticks (```) for code blocks

---

### 💡 Tips
- Use `Tab` for autocompletion
- Use `Shift + Tab` for docstrings and help
- Interrupt a long-running cell: **Kernel → Interrupt**

---

### 🔁 Restarting / Resetting
- **Restart Kernel**: Clears memory & variables (under Kernel menu)
- Useful if things break or slow down

# Imports


In Python, it's good practice to keep all your imports in one place — typically at the **top of the notebook or script**. This makes it easier to see what libraries are required and helps avoid unexpected bugs or missing dependencies.

That said, there are different styles:
- Some developers (including NVIDIA's ML lead!) prefer importing libraries *right before they use them*, to make each section more self-contained.
- Others (like me) prefer importing everything at the top for **clarity and reproducibility**.

👉 **There’s no strict rule — choose the style that helps you and your team stay organized.** For this tutorial, we’ll put them all at the top.

In [1]:
import os
from openai import OpenAI
from pydantic import BaseModel
from typing import List
from dotenv import load_dotenv

ModuleNotFoundError: No module named 'openai'
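
The `ModuleNotFoundError` above means the `openai` package is not installed in the environment the kernel is running in. The following is a minimal sketch, using only the standard library, for checking which of this notebook's third-party imports are missing. The helper name `find_missing_packages` is hypothetical and not part of the original notebook.

```python
import importlib.util


def find_missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names used in this notebook.
# Note: the PyPI package that provides `dotenv` is called `python-dotenv`.
required = ["openai", "pydantic", "dotenv"]
missing = find_missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything reported as missing can usually be installed from a terminal with `pip install`, after which the kernel should be restarted.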

# Config

This section stores configuration variables, values that do not change during the program’s execution, such as file paths, model names, and environment settings. 

In [4]:
# Path to your filing
filing_path = "/zfs/data/NODR/EDGAR_HTTPS/edgar/data/1656998/0000950103-24-000077.txt"

# Load your environment variables; this should return True
load_dotenv("../.env")

True

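#### **Oh no!** Why is this False?

`load_dotenv` returns `False` when it cannot find the file or finds no variables in it. If you see `False` here, there is no usable `.env` file at `../.env` yet. Exercise 1 fixes that.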

# Exercise 1: Handling Private Keys and Variables

1. Create a `.env` file.
2. Modify the `.env` file to include `OPENAI_API_KEY='The given key'`.
3. Retry `load_dotenv`.

<details>
  <summary>1. Hint</summary>
  Try using the bash command <code>touch .env</code> to create a new, empty file.
</details>

<details>
  <summary>2. Hint</summary>
  Edit the file with your favorite text editor or directly in JupyterHub. Raise a red card if you need help.
</details>
<details>
  <summary>3. Hint</summary>
  Re-run the config cell, or copy it and try again!
</details>
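
A relative path such as `"../.env"` is resolved from the kernel's current working directory, which is not always the folder you expect. The sketch below, standard library only, prints where that path actually points and whether a file exists there. The helper `describe_env_file` is a hypothetical name introduced for illustration.

```python
from pathlib import Path


def describe_env_file(relative_path, working_dir=None):
    """Return (absolute_path, exists) for a path resolved from working_dir.

    If working_dir is None, the current working directory is used.
    """
    base = Path(working_dir) if working_dir is not None else Path.cwd()
    resolved = (base / relative_path).resolve()
    return resolved, resolved.is_file()


resolved, exists = describe_env_file("../.env")
print(f"Looking for: {resolved}")
print(f"File exists: {exists}")
```

If the printed location is not where you created the file, either move the file or change the path passed to `load_dotenv`.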



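### Creating the client

A client is the piece of code on your machine that sends requests to a server, here OpenAI's API, and receives its responses. The `OpenAI` object below is that client; it carries your API key so the server can authenticate each request.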

In [6]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
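
If the key was never loaded, it helps to fail early with a message that points back to the `.env` file. The following is a small sketch of such a check; `require_env` is a hypothetical helper, not something the `openai` or `dotenv` packages provide.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Check that your .env file exists and was loaded."
        )
    return value
```

With this helper, the cell above could be written as `client = OpenAI(api_key=require_env("OPENAI_API_KEY"))`.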

# Exercise 2: Shaping up LLMs for Research

### LLM Prep

#### 🧠 OpenAI API Prompting Cheat Sheet

**System Prompt**  
Sets behavior or persona. Use once at the start.  
➡️ `"You are a helpful scientific assistant."`

**User Prompt**  
The user's question or task. Can appear multiple times.  
➡️ `"Explain how Python virtual environments work."`

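**Templates**  
Reusable prompt formats with placeholders. Use one for inputs:

```python
# A plain string with {placeholders}; fill it in later with .format()
input_template = '''
You are an assistant.

Context: {context}
Question: {question}
Answer:
'''

prompt = input_template.format(context="...", question="...")
```

and one for outputs:

```python
# Not an f-string, so the JSON braces need no escaping
output_template = '''
Please return the following JSON:
{
  "company_name": "<company name>",
  "mission_statement": "We like to do business"
}
'''
```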
**Temperature**  
Controls randomness in output.  
- `0.0` → Deterministic (use for logic, code)  
- `0.5` → Balanced  
- `1.0+` → Creative, varied  

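**Example API Call (Python)**  
```python
# Uses the `client` object created earlier (openai>=1.0 interface)
chat_response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a concise Python tutor."},
        {"role": "user", "content": "How do I create a virtual environment?"},
    ],
)
print(chat_response.choices[0].message.content)
```
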
### Objective

Extract key information from a Form 3 filing, including the insider’s name, their role(s), the company name, CIK, and filing date — and return it in a structured, standardized format (e.g., JSON or dictionary).

This will make the data easy to validate, analyze, and store for downstream use (like building a dataset or running queries).

#### Develop a system prompt to help pull this information from the Form 3

In [8]:
system_prompt='Write your prompt here'
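
If you are unsure where to start, here is one possible shape for such a prompt, built from the fields named in the objective. It is a sketch rather than a reference answer; the wording, the `FIELDS` list, and the helper `build_system_prompt` are all assumptions introduced for illustration.

```python
# Fields to extract, taken from the objective above.
FIELDS = [
    "insider name",
    "insider role(s)",
    "company name",
    "company CIK",
    "filing date",
]


def build_system_prompt(fields):
    """Assemble a system prompt that lists the fields to extract."""
    bullet_list = "\n".join(f"- {field}" for field in fields)
    return (
        "You are an assistant that extracts structured data from SEC Form 3 filings.\n"
        "From the filing text supplied by the user, extract the following fields:\n"
        f"{bullet_list}\n"
        "Use only information stated in the filing. "
        "If a field is not present, return an empty string for it."
    )


example_system_prompt = build_system_prompt(FIELDS)
print(example_system_prompt)
```

Keeping the field list in one place makes it easier to keep the prompt and the pydantic model in sync as you add fields.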

#### Get the text from the filing to be the user prompt

In [11]:
# Read the text from the file path
# with open(...,'r') as f
# Write the code here

filing_text=''
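
For reference, reading a whole text file in Python usually follows the pattern sketched below. The function name `read_filing` is hypothetical, and the choice of `errors="replace"` is an assumption made so that stray non-UTF-8 bytes in a filing do not stop the read.

```python
def read_filing(path):
    """Read a filing's full text.

    Bytes that are not valid UTF-8 are replaced with a placeholder
    character instead of raising an error.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```

In the cell above this would be used as `filing_text = read_filing(filing_path)`. The `with` statement closes the file automatically, even if reading fails.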

In [12]:
user_prompt=filing_text

#### Finally, use OpenAI's built-in pydantic support to structure the output

In [13]:
class Form3Filing(BaseModel):
    insider_name:str
    #.....
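
One way the remaining fields might be filled in is sketched below, assuming pydantic v2 (the notebook already uses `model_dump()`). The class name `Form3FilingSketch`, the field names beyond `insider_name`, and the choice of plain strings for the CIK and date are assumptions, and the example values are placeholders rather than data from a real filing.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Form3FilingSketch(BaseModel):
    insider_name: str
    insider_roles: List[str]  # an insider can hold several roles
    company_name: str
    company_cik: str          # kept as a string so leading zeros survive
    filing_date: str          # left as text here; a date type is another option


# Placeholder values only, not taken from a real filing.
example = Form3FilingSketch(
    insider_name="Jane Doe",
    insider_roles=["Director"],
    company_name="Example Corp",
    company_cik="0000000000",
    filing_date="2024-01-01",
)
print(example.model_dump())

# Leaving out required fields raises a ValidationError.
try:
    Form3FilingSketch(insider_name="Jane Doe")
except ValidationError as err:
    print(f"Validation failed with {len(err.errors())} errors")
```

The model acts as a contract: any response that does not supply every field with the right type is rejected instead of silently passing through.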

#### Call the API without structure

In [14]:
response = client.responses.parse(
    model="o4-mini",
    input=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
)


### Understand the Response

In [15]:
response

ParsedResponse[NoneType](id='resp_685475c8ef3481a2b3abdb576d121d940da816bcc5cf2279', created_at=1750365640.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='o4-mini-2025-04-16', object='response', output=[ResponseReasoningItem(id='rs_685475c97c5481a2a126e442be7713160da816bcc5cf2279', summary=[], type='reasoning', encrypted_content=None, status=None), ParsedResponseOutputMessage[NoneType](id='msg_685475cb078481a2a2e4060a9fd420450da816bcc5cf2279', content=[ParsedResponseOutputText[NoneType](annotations=[], text='Hello! It seems there’s no question or request. How can I assist you today?', type='output_text', logprobs=None, parsed=None)], role='assistant', status='completed', type='message')], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=None, previous_response_id=None, prompt=None, reasoning=Reasoning(effort='medium', generate_summary=None, summary=None), service_tier='default', status

### Call the API with Structure

In [16]:
response = client.responses.parse(
    model="o4-mini",
    input=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    text_format=Form3Filing,
)


In [18]:
response.output_parsed.model_dump()

{'insider_name': 'Hello! How can I assist you today?'}
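
The model's reply ends up in `insider_name` because `filing_text` is still an empty string, so no filing was ever sent. A small guard can catch this before any API call is made. This is a sketch; `build_messages` is a hypothetical helper name.

```python
def build_messages(system_prompt, user_prompt):
    """Build the message list for the API call, refusing an empty user prompt."""
    if not user_prompt.strip():
        raise ValueError(
            "user_prompt is empty; load the filing text before calling the API."
        )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

The result can be passed as `input=build_messages(system_prompt, user_prompt)` in the calls above.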

# Exercise 3: Development

Put everything into a `.py` file and run it from the terminal.
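
A possible skeleton for such a script is sketched below. It covers only argument handling and file reading, and marks where the notebook's API code would go. The function names, the `--model` option, and its default are assumptions made for illustration.

```python
import argparse
import sys


def parse_args(argv):
    """Parse command-line arguments for the extraction script."""
    parser = argparse.ArgumentParser(
        description="Extract fields from an SEC Form 3 filing."
    )
    parser.add_argument("filing_path", help="Path to the raw Form 3 text file")
    parser.add_argument("--model", default="o4-mini", help="Model name to request")
    return parser.parse_args(argv)


def main(argv):
    """Read the filing and return an exit code (0 on success, 1 on failure)."""
    args = parse_args(argv)
    try:
        with open(args.filing_path, "r", encoding="utf-8", errors="replace") as f:
            filing_text = f.read()
    except FileNotFoundError:
        print(f"File not found: {args.filing_path}", file=sys.stderr)
        return 1

    # The notebook's client setup and client.responses.parse(...) call go here.
    print(f"Read {len(filing_text)} characters; would request model {args.model}")
    return 0


# In the saved script, finish with:
# if __name__ == "__main__":
#     sys.exit(main(sys.argv[1:]))
```

Saved under a name of your choice, for example `extract_form3.py`, it would be run as `python extract_form3.py path/to/filing.txt`.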