# Problem definition

SEC filings are long, full of boilerplate language, and often difficult to parse, even for an experienced analyst. Suppose I have the following question:

> "What was the annual revenue for TSLA in 2024, and what was media sentiment after the SEC 10-K filings were released?"

For the sake of simplicity, let's assume the answer can be obtained with the following two sources:
- [TSLA 2024 10-K SEC filing](https://www.sec.gov/Archives/edgar/data/1318605/000162828025003063/tsla-20241231.htm#ie9fbbc0a99a6483f9fc1594c1ef72807_157), about 347,000 tokens
- [Yahoo Finance article](https://finance.yahoo.com/news/tesla-inc-tsla-q4-2024-072241602.html), about 5,000 tokens

To implement an architecture that can answer questions like this one, we have the following options. There are many LLMs we could use, but let's assume we're using GPT-4o at commercial API usage rates.

- GPT-4o with unfiltered context: Manually upload the full context to the GPT API.
- GPT-4o with web search: Ask the question directly, without providing any context.
- GPT-4o with keyword search: Use BM25 search to fetch the top-N chunks by matching query terms against article metadata, then send those chunks to the API.
- RAG with semantic search (GPT-4o): Search a vector database for relevant context, then send it to GPT-4o.
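
To make the keyword-search option concrete, here is a minimal, self-contained sketch of Okapi BM25 ranking. It is an illustration under stated assumptions rather than the implementation used later: it scores chunk text directly (not metadata), uses a naive whitespace tokenizer, and the `bm25_rank` helper, the `k1` and `b` defaults, and the sample chunks are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def bm25_rank(query: str, chunks: list[str], top_n: int = 2,
              k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the top_n chunks by Okapi BM25."""
    if not chunks:
        return []
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization penalizes chunks longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]


# Hypothetical chunks, for illustration only.
chunks = [
    "total revenues increased compared to the prior year",
    "risk factors include supply chain disruption",
    "automotive revenues and energy revenues are reported separately",
]
top = bm25_rank("total revenues", chunks, top_n=2)
print(top)
```

Only the top-ranked chunks would then be sent to the model, which is what keeps the input token count low for this option.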

Using the two sources above as a baseline, and assuming that each method can answer the question accurately, here are the daily cost extrapolations for a platform with the following levels of activity:

- The platform has 1,000 concurrent users at any given moment.
- These users use the platform for 10 hours per day.
- Each user asks 1 question every 3 minutes, each similar to the question above (39 tokens).

At 20 questions per user per hour, this comes to 1,000 × 10 × 20 = 200,000 queries per day.

| Scenario                           | Input Tokens | Output Tokens | Cost per Query (USD) | Total Daily Cost (USD) |
|------------------------------------|--------------|----------------|-----------------------|-------------------------|
| GPT-4o with unfiltered context     | 347,000      | 200            | $1.7390               | $347,800.00             |
| GPT-4o with web search             | 7,500        | 200            | $0.0415               | $8,300.00               |
| GPT-4o with keyword search         | 3,000        | 200            | $0.0190               | $3,800.00               |
| RAG with semantic search (GPT-4o)  | 2,000        | 200            | $0.0140               | $2,800.00               |
| RAG with semantic search (GPT-3.5) | 2,000        | 200            | $0.0013               | $260.00                 |

Our API costs would be 66% lower using RAG with semantic search than using web search, with GPT-4o in both cases. If we optimize the semantic retrieval methodology for accuracy and use GPT-3.5 instead of GPT-4o, costs would be 97% lower than with web search.
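
The figures in the cost table can be reproduced with a few lines of arithmetic. The sketch below assumes per-token rates back-calculated from the table itself ($5 per million input tokens and $20 per million output tokens for GPT-4o; $0.50 and $1.50 for GPT-3.5). These are inferred values, not quoted pricing, and the `RATES`, `SCENARIOS`, and `cost_per_query` names are illustrative.

```python
# Activity assumptions taken from the text.
USERS = 1_000
HOURS_PER_DAY = 10
QUERIES_PER_HOUR = 60 // 3  # one question every 3 minutes
QUERIES_PER_DAY = USERS * HOURS_PER_DAY * QUERIES_PER_HOUR

# ASSUMPTION: USD per 1M tokens, inferred from the table rather than from a price list.
RATES = {
    "gpt-4o": {"input": 5.00, "output": 20.00},
    "gpt-3.5": {"input": 0.50, "output": 1.50},
}

# Scenario name -> (model, input tokens per query), as listed in the table.
SCENARIOS = {
    "GPT-4o with unfiltered context": ("gpt-4o", 347_000),
    "GPT-4o with web search": ("gpt-4o", 7_500),
    "GPT-4o with keyword search": ("gpt-4o", 3_000),
    "RAG with semantic search (GPT-4o)": ("gpt-4o", 2_000),
    "RAG with semantic search (GPT-3.5)": ("gpt-3.5", 2_000),
}


def cost_per_query(model: str, input_tokens: int, output_tokens: int = 200) -> float:
    """Cost in USD of one query at the assumed per-million-token rates."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


daily = {}
for name, (model, input_tokens) in SCENARIOS.items():
    per_query = cost_per_query(model, input_tokens)
    daily[name] = per_query * QUERIES_PER_DAY
    print(f"{name:<36} ${per_query:.4f}  ${daily[name]:,.2f}")

# Savings relative to the web-search scenario.
baseline = daily["GPT-4o with web search"]
for name in ("RAG with semantic search (GPT-4o)", "RAG with semantic search (GPT-3.5)"):
    print(f"{name}: {1 - daily[name] / baseline:.0%} cheaper than web search")
```

Because cost scales linearly with input tokens, the retrieval step that trims the context is where nearly all of the savings come from.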

| Scenario                           | High Accuracy | Cost Efficient | Low Latency | Easy to Implement | Scalable | Curated Sources |
|------------------------------------|----------------|----------------|-------------|-------------------|----------|---------------------|
| GPT-4o with unfiltered context     | ✅             | ❌             | ❌          | ✅                | ❌       | ✅                  |
| GPT-4o with web search             | ✅             | ✅             | ✅          | ✅                | ✅       | ❌                  |
| RAG with semantic search (GPT-4o)  | ✅             | ✅             | ✅          | ❌                | ✅       | ✅                  |
| RAG with semantic search (GPT-3.5) | ✅             | ✅             | ✅          | ❌                | ✅       | ✅                  |

# Evaluation Criteria

We'll implement all four scenarios above, and evaluate them on the following criteria:

- Tier 1: Single-source QA accuracy. *How well does our engine answer questions about the 10-K filings?*
    - Evaluation data set: [financial-qa-10k](https://huggingface.co/datasets/virattt/financial-qa-10K)
    - Evaluation metrics:
        - ROUGE
        - Recall@k for chunk retrieval
- Tier 2 (if time permits): Multi-source QA accuracy. *How well does our engine answer questions about the 10-K filings AND related media?*
    
    - Evaluation metrics:
        - 
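
For reference, the two Tier 1 metrics can be sketched in plain Python. This is a simplified illustration, not the evaluation code itself: `recall_at_k` assumes each question comes with a set of relevant chunk ids, and `rouge1_f1` computes only unigram-overlap F1 with whitespace tokenization (a full ROUGE implementation also covers variants such as ROUGE-2 and ROUGE-L). Both function names and the sample inputs are hypothetical.

```python
from collections import Counter


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant chunk ids that appear among the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 (ROUGE-1) using naive whitespace tokenization."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical retrieval result: one of the two relevant chunks is in the top 2.
print(recall_at_k(["c3", "c1", "c7"], {"c1", "c2"}, k=2))
# Hypothetical answer pair sharing two of three unigrams.
print(rouge1_f1("net revenue rose", "net revenue fell"))
```

Recall@k measures the retrieval step in isolation, while ROUGE scores the generated answer against the labeled one, so the two can diverge when retrieval succeeds but generation does not.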

# Datasets to use

- **SEC 10-K filings** for the years 2020-2023 for AAPL, NVDA, TSLA, and GOOGL. We'll scrape this data with the [EDGAR-CRAWLER](https://github.com/lefterisloukas/edgar-crawler) project.
- **Evaluation set**: Question-answer pairs from the [financial-qa-10k](https://huggingface.co/datasets/virattt/financial-qa-10K) labeled dataset on Hugging Face.

In [4]:
from bs4 import BeautifulSoup
from transformers import GPT2TokenizerFast

# Load the document
with open("/Users/jon/Downloads/tsla-20241231.mhtml", "r", encoding="utf-8") as file:
    mhtml_content = file.read()

# Parse the MHTML content using BeautifulSoup
soup = BeautifulSoup(mhtml_content, "html.parser")

# Extract text content
text = soup.get_text()

# Tokenize using GPT-2 tokenizer (approximate for ChatGPT models)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokens = tokenizer.encode(text)

# Count the number of tokens
num_tokens = len(tokens)
num_tokens





346343
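
A count of this size is what makes the unfiltered-context option expensive, while the retrieval-based options send only a few chunks per query. As a minimal sketch of how a token sequence like `tokens` could be split into retrievable chunks, the hypothetical helper below uses fixed-size windows with overlap. The `chunk_size` and `overlap` values are assumptions for illustration, not settings taken from this project.

```python
def chunk_tokens(token_ids: list[int], chunk_size: int = 512, overlap: int = 64) -> list[list[int]]:
    """Split token ids into windows of chunk_size that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not token_ids:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + chunk_size])
        # Stop once the current window reaches the end of the sequence.
        if start + chunk_size >= len(token_ids):
            break
        start += step
    return chunks


# Toy example: 10 token ids, windows of 4 with an overlap of 2.
example = chunk_tokens(list(range(10)), chunk_size=4, overlap=2)
print(example)
```

The overlap keeps a sentence that straddles a window boundary fully visible in at least one chunk, at the cost of storing some tokens twice.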

# Load data

In [2]:
import sys
from pathlib import Path
from pprint import pprint

import pandas as pd

# Make the project root importable so that `src` resolves from the notebook directory
sys.path.append("..")

from src.download_data import download_data

# Loading through the HF library hit bugs with the parquet files, so we download the data manually
download_data()

# Path to the downloaded data
data_path = Path("..") / "data"

# List the parquet files
parquet_files = list(data_path.glob("*.parquet"))
print(parquet_files)

# Load the validation split by name; glob order is not guaranteed, so avoid indexing into the list
val = pd.read_parquet(data_path / "val.parquet")
val.head()

[PosixPath('../data/train.parquet'), PosixPath('../data/test.parquet'), PosixPath('../data/val.parquet')]


|   | Context | Question | Program | Answer |
|---|---------|----------|---------|--------|
| 0 | Table\tof\tContents\n\n# UNITED\tSTATES\tSECUR... | what is the average payment volume per transac... | | 127.40 |
| 1 | # UNITED\tSTATES\n\n# SECURITIES\tAND\tEXCHANG... | what was the percentage cumulative total retur... | | 93.5% |
| 2 | Table\tof\tContents\n\n# UNITED\tSTATES\tSECUR... | what percentage of the total oil and gas mmboe... | | 24.69% |
| 3 | # UNITED\tSTATES\n\n# SECURITIES\tAND\tEXCHANG... | in 2010 what was the net change in net revenue... | | 18.6 |
| 4 | # UNITED\tSTATES\n\n# SECURITIES\tAND\tEXCHANG... | what are the deferred fuel cost revisions as a... | | 60.3% |


In [3]:
pprint(val['Context'][0])

('Table\tof\tContents\n'
 '\n'
 '# UNITED\tSTATES\tSECURITIES\tAND\tEXCHANGE\tCOMMISSION\n'
 '\n'
 '# ANNUAL\tREPORT\tPURSUANT\tTO\tSECTION\t13\tOR\t15(d)\tOF\tTHE\tSECURITIES\t'
 'EXCHANGE\tACT\tOF 1934\n'
 '\n'
 '# For\tthe\tfiscal\tyear\tended\tSeptember\t30,\t2008\n'
 '\n'
 '¨ TRANSITION\tREPORT\tPURSUANT\tTO\tSECTION\t13\tOR\t15(d)\tOF\tTHE\t'
 'SECURITIES\tEXCHANGE\tACT\tOF\n'
 '\n'
 '# VISA\tINC.\n'
 '\n'
 '(Exact\tname\tof\tRegistrant\tas\tspecified\tin\tits\tcharter)\n'
 '\n'
 'Delaware\n'
 '\n'
 '(State\tor\tother\tjurisdiction of\tincorporation\tor\torganization)\n'
 '\n'
 'P.O.\tBox\t8999 San\tFrancisco,\tCalifornia (Address\tof\tprincipal\t'
 'executive\toffices)\n'
 '\n'
 '(IRS\tEmployer Identification\tNo.)\n'
 '\n'
 'Securities\tregistered\tpursuant\tto\tSection\t12(b)\tof\tthe\tAct:\n'
 '\n'
 'Securities\tregistered\tpursuant\tto\tSection\t12(g)\tof\tthe\tAct:\t\t\t\t'
 'NONE\n'
 '\n'
 'Indicate\tby\tcheck\tmark\tif\tthe\tregistrant\tis\ta\twell-known\tseasoned\t'
 'is