# RAG System for SEC 10-K Financial Question Answering

This notebook implements a complete Retrieval-Augmented Generation (RAG) system that answers financial questions from Apple's FY2024 and Tesla's FY2023 10-K filings.

**Colab-Ready**: Click "Run all" or execute cells sequentially.

---

## Setup & Installation

### Step 1: Clone Repository

In [1]:
import os
import subprocess
import sys

# Check if running in Colab
try:
    from google.colab import drive
    IN_COLAB = True
    print("✓ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("✓ Running locally (not Colab)")

# Set working directory
if IN_COLAB:
    REPO_DIR = "/content/SecRAG-10K"
else:
    REPO_DIR = os.getcwd()

print(f"Working directory: {REPO_DIR}")

✓ Running in Google Colab
Working directory: /content/SecRAG-10K


In [2]:
# Clone GitHub repository (Colab only)
if IN_COLAB:
    os.chdir("/content")
    if not os.path.exists(REPO_DIR):
        print("Cloning repository from GitHub...")
        subprocess.run(
            ["git", "clone", "https://github.com/kalpeshdahake/SecRAG-10K.git"],
            check=True
        )
    os.chdir(REPO_DIR)
    print(f"✓ Repository ready at {REPO_DIR}")
else:
    print("✓ Local mode - using existing directory")

# Verify structure
required_dirs = ["data", "ingestion", "embeddings", "retrieval", "llm", "pipeline"]
for dir_name in required_dirs:
    if os.path.exists(dir_name):
        print(f"  ✓ {dir_name}/")
    else:
        print(f"  ✗ {dir_name}/ MISSING")

Cloning repository from GitHub...
✓ Repository ready at /content/SecRAG-10K
  ✓ data/
  ✓ ingestion/
  ✓ embeddings/
  ✓ retrieval/
  ✓ llm/
  ✓ pipeline/


### Step 2: Install Dependencies

In [3]:
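# Install requirements and report the outcome based on pip's exit code
print("Installing dependencies...")
install = subprocess.run(
    [sys.executable, "-m", "pip", "install", "-r", "requirements.txt", "--quiet"],
    check=False
)
if install.returncode == 0:
    print("✓ Dependencies installed")
else:
    print(f"✗ pip exited with code {install.returncode}; see the output above")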

Installing dependencies...
✓ Dependencies installed


### Step 3: Verify Imports

In [4]:
# Test imports and record failures so the final message is accurate
print("Testing imports...")

failed_imports = []

try:
    import pypdf
    print("  ✓ pypdf")
except ImportError as e:
    failed_imports.append("pypdf")
    print(f"  ✗ pypdf: {e}")

try:
    from sentence_transformers import SentenceTransformer
    print("  ✓ sentence-transformers")
except ImportError as e:
    failed_imports.append("sentence-transformers")
    print(f"  ✗ sentence-transformers: {e}")

try:
    import chromadb
    print("  ✓ chromadb")
except ImportError as e:
    failed_imports.append("chromadb")
    print(f"  ✗ chromadb: {e}")

try:
    import torch
    print(f"  ✓ torch (GPU available: {torch.cuda.is_available()})")
except ImportError as e:
    failed_imports.append("torch")
    print(f"  ✗ torch: {e}")

try:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    print("  ✓ transformers")
except ImportError as e:
    failed_imports.append("transformers")
    print(f"  ✗ transformers: {e}")

if failed_imports:
    print(f"\n✗ Failed imports: {', '.join(failed_imports)}")
else:
    print("\n✓ All imports successful")

Testing imports...
  ✓ pypdf
  ✓ sentence-transformers
  ✓ chromadb
  ✓ torch (GPU available: True)
  ✓ transformers

✓ All imports successful


---

## PDF Indexing Pipeline

### Step 4: Load & Parse PDFs

In [5]:
# Import ingestion modules (sys and REPO_DIR are defined in the first cell)
sys.path.insert(0, REPO_DIR)

from ingestion.pdf_loader import load_pdf
from ingestion.section_parser import assign_items
from ingestion.chunker import chunk_text

print("Loading PDFs...")

# Load Apple 10-K
apple_pages = load_pdf(
    "data/10-Q4-2024-As-Filed.pdf",
    company="Apple",
    document="Apple 10-K"
)
apple_pages = assign_items(apple_pages)
apple_chunks = chunk_text(apple_pages)

print(f"Apple 10-K: {len(apple_pages)} pages → {len(apple_chunks)} chunks")
print(f"  Sample metadata: {apple_chunks[0]['metadata']}")

# Load Tesla 10-K
tesla_pages = load_pdf(
    "data/tsla-20231231-gen.pdf",
    company="Tesla",
    document="Tesla 10-K"
)
tesla_pages = assign_items(tesla_pages)
tesla_chunks = chunk_text(tesla_pages)

print(f"\nTesla 10-K: {len(tesla_pages)} pages → {len(tesla_chunks)} chunks")
print(f"  Sample metadata: {tesla_chunks[0]['metadata']}")

print(f"\n✓ Total chunks: {len(apple_chunks) + len(tesla_chunks)}")

Loading PDFs...
Apple 10-K: 121 pages → 605 chunks
  Sample metadata: {'company': 'Apple', 'document': 'Apple 10-K', 'page': 1, 'item': 'Unknown'}

Tesla 10-K: 130 pages → 647 chunks
  Sample metadata: {'company': 'Tesla', 'document': 'Tesla 10-K', 'page': 1, 'item': 'Unknown'}

✓ Total chunks: 1252
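
The source of `assign_items` and `chunk_text` lives in the repository and is not shown in this notebook. As a rough illustration of what this step does, the sketch below labels each page with the most recent "Item N." heading seen so far and then splits pages into overlapping character windows that inherit the page metadata. The helper names (`label_items`, `split_into_chunks`), the regular expression, and the window sizes are all assumptions made for this example, not the repository's actual implementation.

```python
import re

# Matches headings such as "Item 1." or "ITEM 7A." at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[a-c]?)\s*[.:]", re.IGNORECASE | re.MULTILINE
)


def label_items(pages):
    """Tag each page with the latest Item heading seen so far (hypothetical helper)."""
    current = "Unknown"  # pages before the first heading stay "Unknown"
    labeled = []
    for page in pages:
        matches = ITEM_HEADING.findall(page["text"])
        if matches:
            # Simplification: the whole page takes the last heading found on it.
            current = f"Item {matches[-1].upper()}"
        labeled.append(
            {"text": page["text"], "metadata": {**page["metadata"], "item": current}}
        )
    return labeled


def split_into_chunks(pages, size=800, overlap=100):
    """Split each page into overlapping character windows (assumed sizes)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for page in pages:
        text = page["text"]
        for start in range(0, max(len(text), 1), step):
            piece = text[start:start + size]
            if piece.strip():
                chunks.append({"text": piece, "metadata": dict(page["metadata"])})
            if start + size >= len(text):
                break  # the current window already reaches the end of the page
    return chunks


demo_pages = [
    {"text": "Cover page", "metadata": {"company": "Demo", "page": 1}},
    {"text": "Item 1. Business\n" + "x" * 1000, "metadata": {"company": "Demo", "page": 2}},
]
demo_chunks = split_into_chunks(label_items(demo_pages))
for chunk in demo_chunks:
    print(chunk["metadata"], len(chunk["text"]))
```

A cover page with no heading keeps the `'Unknown'` label under this scheme, which matches the sample metadata printed above for page 1 of each filing.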


### Step 5: Generate Embeddings & Index

In [6]:
from embeddings.embedder import Embedder
from embeddings.vector_store import VectorStore

print("Initializing embedding model & vector store...")

# Initialize embedder and vector store
embedder = Embedder()
vector_store = VectorStore(persist_dir="vector_db")

# Create collections
apple_collection = vector_store.get_or_create_collection("apple_10k")
tesla_collection = vector_store.get_or_create_collection("tesla_10k")

print("✓ Collections created")
print("\nGenerating Apple embeddings...")

# Embed Apple chunks
apple_embeddings = embedder.embed_texts(
    [chunk["text"] for chunk in apple_chunks]
)
vector_store.add_chunks(apple_collection, apple_chunks, apple_embeddings)

print(f"✓ Apple indexed: {len(apple_chunks)} chunks")

print("\nGenerating Tesla embeddings...")

# Embed Tesla chunks
tesla_embeddings = embedder.embed_texts(
    [chunk["text"] for chunk in tesla_chunks]
)
vector_store.add_chunks(tesla_collection, tesla_chunks, tesla_embeddings)

print(f"✓ Tesla indexed: {len(tesla_chunks)} chunks")

print("\n✓ Indexing complete! Ready for inference.")

Initializing embedding model & vector store...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✓ Collections created

Generating Apple embeddings...


Batches:   0%|          | 0/19 [00:00<?, ?it/s]

✓ Apple indexed: 605 chunks

Generating Tesla embeddings...


Batches:   0%|          | 0/21 [00:00<?, ?it/s]

✓ Tesla indexed: 647 chunks

✓ Indexing complete! Ready for inference.
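
Conceptually, the index built in this step supports nearest-neighbour search over chunk embeddings. The sketch below shows that idea in plain Python, with toy three-dimensional vectors standing in for real model embeddings. The `cosine_similarity` and `top_k` helpers and the toy data are hypothetical; the repository's `VectorStore` presumably delegates this work to Chroma rather than scanning vectors by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the k (score, chunk) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for vec, chunk in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (embedding, chunk) pairs. Texts and vectors are made up.
toy_index = [
    ([1.0, 0.0, 0.0], {"text": "revenue table"}),
    ([0.0, 1.0, 0.0], {"text": "risk factors"}),
    ([0.7, 0.7, 0.0], {"text": "overview"}),
]

for score, chunk in top_k([0.9, 0.1, 0.0], toy_index, k=2):
    print(f"{score:.3f}  {chunk['text']}")
```

The batch counts logged above (19 for 605 chunks, 21 for 647) equal `math.ceil(n / 32)`, which suggests a batch size of 32, although the `Embedder` source is not shown here.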


---

## Inference & Evaluation

Run the RAG pipeline on all 13 test questions.

### Step 6: Test Question Set

In [7]:
import json

# 13 test questions from assignment
test_questions = [
    {"question_id": 1, "question": "What was Apple's total revenue for the fiscal year ended September 28, 2024?"},
    {"question_id": 2, "question": "How many shares of common stock were issued and outstanding as of October 18, 2024?"},
    {"question_id": 3, "question": "What is the total amount of term debt (current + non-current) reported by Apple as of September 28, 2024?"},
    {"question_id": 4, "question": "On what date was Apple's 10-K report for 2024 signed and filed with the SEC?"},
    {"question_id": 5, "question": "Does Apple have any unresolved staff comments from the SEC as of this filing? How do you know?"},
    {"question_id": 6, "question": "What was Tesla's total revenue for the year ended December 31, 2023?"},
    {"question_id": 7, "question": "What percentage of Tesla's total revenue in 2023 came from Automotive Sales (excluding Leasing)?"},
    {"question_id": 8, "question": "What is the primary reason Tesla states for being highly dependent on Elon Musk?"},
    {"question_id": 9, "question": "What types of vehicles does Tesla currently produce and deliver?"},
    {"question_id": 10, "question": "What is the purpose of Tesla's 'lease pass-through fund arrangements'?"},
    {"question_id": 11, "question": "What is Tesla's stock price forecast for 2025?"},
    {"question_id": 12, "question": "Who is the CFO of Apple as of 2025?"},
    {"question_id": 13, "question": "What color is Tesla's headquarters painted?"}
]

print(f"Loaded {len(test_questions)} test questions")
for q in test_questions[:3]:
    print(f"  Q{q['question_id']}: {q['question'][:60]}...")
print(f"  ... and {len(test_questions) - 3} more")

Loaded 13 test questions
  Q1: What was Apple's total revenue for the fiscal year ended Sep...
  Q2: How many shares of common stock were issued and outstanding ...
  Q3: What is the total amount of term debt (current + non-current...
  ... and 10 more


### Step 7: Run RAG Pipeline

In [8]:
from pipeline.rag_pipeline import answer_question

print("Running RAG pipeline on test questions...\n")

# Map company keywords to their collections for routing
combined_collection = {
    "apple": apple_collection,
    "tesla": tesla_collection
}

# Helper: route to the collection whose company name appears in the question.
# Questions that name neither company (e.g. Q2) are refused without retrieval.
def answer_with_routing(query):
    q_lower = query.lower()
    for company, collection in combined_collection.items():
        if company in q_lower:
            return answer_question(query, collection)
    return {
        "answer": "This question cannot be answered based on the provided documents.",
        "sources": []
    }

# Run on all questions
results = []

for q in test_questions:
    qid = q["question_id"]
    query = q["question"]

    result = answer_with_routing(query)

    output = {
        "question_id": qid,
        "answer": result["answer"],
        "sources": result["sources"]
    }

    results.append(output)

    print(f"Q{qid}: {query[:70]}...")
    print(f"  Answer: {result['answer'][:80]}...")
    print(f"  Sources: {result['sources']}\n")

print(f"✓ All {len(results)} questions processed")

Running RAG pipeline on test questions...



`torch_dtype` is deprecated! Use `dtype` instead!


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Q1: What was Apple's total revenue for the fiscal year ended September 28,...
  Answer: $391,035 million...
  Sources: ['Apple 10-K, Item 8, p. 38']

Q2: How many shares of common stock were issued and outstanding as of Octo...
  Answer: This question cannot be answered based on the provided documents....
  Sources: []



Q3: What is the total amount of term debt (current + non-current) reported...
  Answer: $121,983 million...
  Sources: ['Apple 10-K, Item 8, p. 45']



Q4: On what date was Apple's 10-K report for 2024 signed and filed with th...
  Answer: November 1, 2024...
  Sources: ['Apple 10-K, Item 601, p. 118']



Q5: Does Apple have any unresolved staff comments from the SEC as of this ...
  Answer: Not specified in the document....
  Sources: []



Q6: What was Tesla's total revenue for the year ended December 31, 2023?...
  Answer: 96,773 million...
  Sources: ['Tesla 10-K, Item 9A, p. 51']



Q7: What percentage of Tesla's total revenue in 2023 came from Automotive ...
  Answer: 19.4%...
  Sources: ['Tesla 10-K, Item 9A, p. 51']



Q8: What is the primary reason Tesla states for being highly dependent on ...
  Answer: Not specified in the document....
  Sources: []



Q9: What types of vehicles does Tesla currently produce and deliver?...
  Answer: Model S and Model X, Model 3 and Model Y, Cybertruck, Tesla Semi...
  Sources: ['Tesla 10-K, Item 1, p. 5']



Q10: What is the purpose of Tesla's 'lease pass-through fund arrangements'?...
  Answer: Under these arrangements, our wholly owned subsidiaries finance the cost of sola...
  Sources: ['Tesla 10-K, Item 9A, p. 82']

Q11: What is Tesla's stock price forecast for 2025?...
  Answer: This question cannot be answered based on the provided documents....
  Sources: []

Q12: Who is the CFO of Apple as of 2025?...
  Answer: This question cannot be answered based on the provided documents....
  Sources: []

Q13: What color is Tesla's headquarters painted?...
  Answer: This question cannot be answered based on the provided documents....
  Sources: []

✓ All 13 questions processed
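
The keyword router above refuses any question that names neither company, which is why Q2 returns a refusal with no sources even though it concerns Apple's filing. One possible alternative, sketched below, falls back to trying every collection when no company is named and keeps the first answer that is not a refusal. `route_question` and `fake_answer` are hypothetical names for this example; `answer_fn` stands in for `answer_question`, and the stub collections are plain lists rather than Chroma collections.

```python
REFUSAL = "This question cannot be answered based on the provided documents."


def route_question(query, collections, answer_fn):
    """Route by company keyword; with no keyword, try every collection."""
    q_lower = query.lower()
    named = [name for name in collections if name in q_lower]
    # Fall back to all collections when the question names no company.
    candidates = named or list(collections)
    for name in candidates:
        result = answer_fn(query, collections[name])
        if result["answer"] != REFUSAL:
            return result
    return {"answer": REFUSAL, "sources": []}


def fake_answer(query, collection):
    """Stand-in for answer_question: matches query words against stub documents."""
    words = query.lower().split()
    hits = [doc for doc in collection if any(word in doc for word in words)]
    if hits:
        return {"answer": hits[0], "sources": ["stub"]}
    return {"answer": REFUSAL, "sources": []}


stub_collections = {
    "apple": ["shares outstanding figure"],
    "tesla": ["vehicle lineup"],
}
print(route_question("How many shares were issued?", stub_collections, fake_answer))
```

The trade-off is cost: an unnamed question triggers one pipeline call per filing, and an answer found in the wrong filing would be returned without warning, so the fallback is only reasonable when the pipeline itself refuses reliably.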


### Step 8: Export Results

In [9]:
# Save results to JSON
output_file = "results.json"

with open(output_file, "w") as f:
    json.dump(results, f, indent=2)

print(f"✓ Results saved to {output_file}")

# Display summary
print("\n" + "="*60)
print("EVALUATION SUMMARY")
print("="*60)

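# Note: this check only looks for the refusal phrase, so responses such as
# "Not specified in the document." are counted as answered.
answered = sum(1 for r in results if "cannot be answered" not in r["answer"].lower())
refused = len(results) - answered

print(f"\nQuestions Answered: {answered}/{len(results)}")
print(f"Questions Refused: {refused}/{len(results)}")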

print("\nAnswered:")
for r in results:
    if "cannot be answered" not in r["answer"].lower():
        print(f"  Q{r['question_id']}: {r['answer'][:60]}... [{r['sources']}]")

print("\nRefused (Out-of-Scope):")
for r in results:
    if "cannot be answered" in r["answer"].lower():
        print(f"  Q{r['question_id']}: {r['answer'][:60]}...")

print("\n" + "="*60)

✓ Results saved to results.json

EVALUATION SUMMARY

Questions Answered: 9/13
Questions Refused: 4/13

Answered:
  Q1: $391,035 million... [['Apple 10-K, Item 8, p. 38']]
  Q3: $121,983 million... [['Apple 10-K, Item 8, p. 45']]
  Q4: November 1, 2024... [['Apple 10-K, Item 601, p. 118']]
  Q5: Not specified in the document.... [[]]
  Q6: 96,773 million... [['Tesla 10-K, Item 9A, p. 51']]
  Q7: 19.4%... [['Tesla 10-K, Item 9A, p. 51']]
  Q8: Not specified in the document.... [[]]
  Q9: Model S and Model X, Model 3 and Model Y, Cybertruck, Tesla ... [['Tesla 10-K, Item 1, p. 5']]
  Q10: Under these arrangements, our wholly owned subsidiaries fina... [['Tesla 10-K, Item 9A, p. 82']]

Refused (Out-of-Scope):
  Q2: This question cannot be answered based on the provided docum...
  Q11: This question cannot be answered based on the provided docum...
  Q12: This question cannot be answered based on the provided docum...
  Q13: This question cannot be answered based on the provided docum...
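
The summary treats every response without the phrase "cannot be answered" as answered, so the two "Not specified in the document." responses (Q5 and Q8) appear under Answered. A minimal sketch of a three-way classification is shown below. The `classify_answer` and `summarize` helpers and the category names are assumptions for this example; the marker strings are taken from the outputs above.

```python
from collections import Counter

REFUSAL_MARKER = "cannot be answered"
NO_ANSWER_MARKER = "not specified in the document"


def classify_answer(answer):
    """Label a response as 'refused', 'not_found', or 'answered'."""
    text = answer.lower()
    if REFUSAL_MARKER in text:
        return "refused"
    if NO_ANSWER_MARKER in text:
        return "not_found"
    return "answered"


def summarize(results):
    """Count how many results fall into each category."""
    return Counter(classify_answer(r["answer"]) for r in results)


# Three responses copied from the run above, one per category.
sample = [
    {"question_id": 1, "answer": "$391,035 million"},
    {"question_id": 2, "answer": "This question cannot be answered based on the provided documents."},
    {"question_id": 5, "answer": "Not specified in the document."},
]
counts = summarize(sample)
print(dict(counts))
```

Applied to the 13 results listed above, this scheme would report 7 answered, 2 not found, and 4 refused.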



---

## Interactive Query Mode

Ask custom questions about the 10-K filings.

In [10]:
# Custom query function
def query_rag(question):
    """
    Query the RAG system with a custom question.

    Args:
        question (str): Your question about Apple or Tesla 10-K

    Returns:
        dict: {"answer": str, "sources": list}
    """
    result = answer_with_routing(question)
    return result

# Example custom queries
print("Custom Query Examples:")
print("="*60)

custom_queries = [
    "What are Apple's main business segments?",
    "What risks does Tesla face?"
]

for query in custom_queries:
    result = query_rag(query)
    print(f"Q: {query}")
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources']}")
    print()

Custom Query Examples:


Q: What are Apple's main business segments?
A: The risk that new products and services may have quality or other defects or deficiencies. New products, services and technologies may replace or supersede existing offerings and may produce lower revenues and lower profit margins, which can materially adversely impact the Company’s business, results of operations and financial condition. There can be no assurance the Company will successfully manage future introductions and transitions of products and services.
Sources: ['Apple 10-K, Item 8, p. 49']



Q: What risks does Tesla face?
A: The risks described below are not the only risks facing our company. Risks and uncertainties not currently known to us or that we currently deem to be immaterial also may materially adversely affect our business, financial condition and future results. In particular, Tesla’s products, business, results of operations, and statements and actions of Tesla and its management are subject to significant amounts of commentary by a range of third parties. Such attention can include criticism, which may be exaggerated or unfounded, such as speculation regarding the sufficiency or stability of our management team. Any such negative perceptions, whether caused by us or not, may harm our business and make it more difficult to raise additional funds if needed. We may be unable to effectively grow, or manage the compliance, residual value, financing and credit risks related to, our various financing programs.
Sources: ['Tesla 10-K, Item 1A, p. 15']



---

## Download Results

Download `results.json` to your local machine (Colab only). In local mode the file is already saved in the working directory.

In [11]:
if IN_COLAB:
    from google.colab import files
    print("Downloading results.json...")
    files.download("results.json")
    print("✓ Download started")
else:
    print("Local mode: results saved to ./results.json")

Downloading results.json...


✓ Download started
