
## 📝 **Summary and Project Scope**

* **Target Companies:** MAG7 (AAPL, MSFT, AMZN, GOOGL, META, NVDA, TSLA)
* **Target Filings:** 10-K and 10-Q
* **Intended Years:** 2015–2025

### **SEC Data Constraints**

* I attempted to collect filings for 2015–2025, but for filings dated after **2019-01-30** the SEC serves most documents through its **XBRL Viewer**, which delivers the data primarily in XML/XBRL format (managed by U.S. government departments) rather than as plain HTML.
* This notebook’s HTML download-and-parse pipeline therefore processes filings dated **2015 through 2018**, matching the `start_year=2015, end_year=2018` filter used in the downloader cell.
* The pipeline works fully for **AAPL, NVDA, and TSLA**. For the other MAG7 stocks, there were data access or metadata issues which limited HTML document availability.

---

## 💡 **Tech Stack and Techniques Used**

* **All tools used are free and open-source, with no paid dependencies.**
* **LLMs:**

  * **Groq Llama-3-70B-8192** (via free Groq API) is the **primary LLM** used for retrieval-augmented question answering (RAG).
  * Also **experimented with OpenAI GPT-4.1-mini** (free/demo tier) to compare answer quality, especially in cases where the Groq Llama model was prone to hallucinations or less reliable responses.
* **Embeddings:** Used HuggingFace’s `all-MiniLM-L6-v2` model for generating semantic vector embeddings.
* **Vector Database:** Used FAISS (Facebook AI Similarity Search) for efficient vector storage and retrieval.
* **Document Splitting:** Used LangChain’s `RecursiveCharacterTextSplitter` for smart, overlapping chunking of filing documents (preserving context across chunks).
* **Metadata & Preprocessing:** Automatically parsed, cleaned, and attached rich metadata (company, filing type, fiscal period, accession, etc.) to each chunk.
* **Agentic Tooling:** Built an agent that retrieves, grounds, and answers complex, multi-hop financial questions from filings using a fully open-source pipeline.
* **Prompting and RAG:** Structured retrieval-augmented prompts with chunked context to LLMs for answer generation. (Note: ReAct-style reasoning and advanced chaining are also possible as a next step.)
* **Cohere Embeddings:** Tried Cohere’s free embedding API, but the quota (approx. 500 calls per free user) was too limited for large-scale ingestion, so all main results are with HuggingFace embeddings.

> **Note:** The primary LLM for Q&A is Groq Llama-3 (free tier), but I also tested OpenAI’s GPT-4.1-mini for higher answer reliability and to compare hallucination rates.
> All code supports either backend with simple configuration changes.
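
As a rough illustration of what such a configuration switch could look like, here is a minimal sketch. The model names come from this notebook, but the `LLM_BACKEND` variable, the `BACKENDS` table, and `choose_backend` are hypothetical names introduced only for this example; the notebook itself switches backends by commenting lines in and out.

```python
import os

# Hypothetical lookup table: model names are the ones used in this notebook,
# the environment variable names for the keys are assumptions.
BACKENDS = {
    "groq": {"model": "llama3-70b-8192", "key_var": "GROQ_API_KEY"},
    "openai": {"model": "gpt-4.1-mini", "key_var": "OPENAI_API_KEY"},
}

def choose_backend(env=None):
    """Return (backend_name, model_name) based on environment settings."""
    env = os.environ if env is None else env
    # LLM_BACKEND is an illustrative variable; Groq is the default backend.
    name = env.get("LLM_BACKEND", "groq").lower()
    if name not in BACKENDS:
        raise ValueError(f"Unknown backend: {name!r}")
    cfg = BACKENDS[name]
    if not env.get(cfg["key_var"]):
        raise ValueError(f"{cfg['key_var']} is not set")
    return name, cfg["model"]

# Pass a plain dict instead of os.environ to try it without real keys.
print(choose_backend({"LLM_BACKEND": "openai", "OPENAI_API_KEY": "sk-demo"}))
```

The returned model name could then be handed to `ChatGroq` or `ChatOpenAI`, whichever matches the chosen backend.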



# Imports & Configuration
This cell sets up the environment for SEC filings processing and vector storage, loading dependencies, configuration, and API keys for downstream use.


In [18]:
# === Imports & Configuration ===
# Standard library imports
import glob        # For finding files matching a pattern
import json        # For handling JSON data
import os          # For file and environment management

# Third-party libraries
from dotenv import load_dotenv         # Load environment variables from .env file
from bs4 import BeautifulSoup          # HTML/XML parser for extracting text from filings

# LangChain & Vector DB imports
from langchain.docstore.document import Document    # Document wrapper for vector store
from langchain.text_splitter import RecursiveCharacterTextSplitter  # For smart text chunking
from langchain.embeddings import CohereEmbeddings                  # Cohere embeddings integration (free-tier API calls are limited)
from langchain.vectorstores import FAISS                           # FAISS vector store
from langchain.embeddings import HuggingFaceEmbeddings             # HuggingFace embeddings (I have used all-MiniLM-L6-v2)
import faiss                                                      # Facebook AI Similarity Search library (backend for FAISS)

# Load environment variables (API keys, config) from .env file if present
load_dotenv()

# === Paths and API Keys ===
METADATA_PATH = 'mag7_filing_metadata.json'   # JSON metadata for filings
INPUT_DIR = 'filings'                         # Directory where SEC filings are stored (raw)
OUTPUT_DIR = 'output'                         # Directory for processed outputs
COHERE_API_KEY = os.getenv('COHERE_API_KEY')  # Cohere API key (for embeddings)
COHERE_USER_AGENT = os.getenv('COHERE_USER_AGENT', 'langchain')  # User-Agent override if needed

# Ensure the output directory exists (create if missing)
os.makedirs(OUTPUT_DIR, exist_ok=True)
from langchain_groq import ChatGroq   # Groq chat model wrapper (os is already imported above)

def get_llm():
    """
    Returns a Groq language model instance if GROQ_API_KEY is set.
    Raises ValueError if not set.
    """
    groq_api_key = os.getenv("GROQ_API_KEY")
    if groq_api_key:
        print("🔷 Using Groq (Llama-3-70B-8192) for LLM Q&A")
        return ChatGroq(model="llama3-70b-8192", api_key=groq_api_key)
    else:
        raise ValueError(
            "❌ GROQ_API_KEY is not set. Please set it as an environment variable or pass with Docker using -e GROQ_API_KEY=your_key"
        )

# Usage
llm = get_llm()


🔷 Using Groq (Llama-3-70B-8192) for LLM Q&A


# SEC MAG7 Filings Downloader and Cleaner (2015–2018)
This notebook script downloads all 10-K and 10-Q filings (2015–2018) for the "Magnificent 7" companies from the SEC EDGAR database, saving the HTML files under `filings/{ticker}/`.  
It also includes a function to clean and extract plain text from the HTML filings, ready for downstream chunking and embedding.
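
To make the URL layout concrete before the full cell, here is a small sketch of how the filing-index URL is assembled from a CIK and an accession number. It mirrors the string handling inside `download_filing_html`; the helper name `build_index_url` is my own and does not appear in the notebook.

```python
def build_index_url(cik: str, accession_raw: str) -> str:
    """Build the EDGAR filing-index URL that the downloader requests."""
    cik_clean = str(int(cik))                          # drop zero padding: "0000320193" -> "320193"
    accession_nodash = accession_raw.replace("-", "")  # folder name has no dashes
    base_url = f"https://www.sec.gov/Archives/edgar/data/{cik_clean}/{accession_nodash}/"
    # The index page keeps the dashed accession number in its file name.
    return base_url + accession_raw + "-index.html"

print(build_index_url("0000320193", "0000320193-18-000145"))
```

The downloader then scans the "Document Format Files" table on that index page for the first `.htm` or `.txt` link.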


In [19]:
import requests, os, time, json, re
from bs4 import BeautifulSoup

# --- Config ---
USER_AGENT = "im-----0@gmail.com" # SEC requires a real email address here!
MAG7_TICKERS = ["AAPL", "MSFT", "AMZN", "GOOGL", "META", "NVDA", "TSLA"]

# === 1. Get CIKs for Each Ticker ===
def get_cik_from_ticker(ticker: str) -> str:
    """
    Get CIK (Central Index Key) for a ticker.
    """
    url = "https://www.sec.gov/files/company_tickers.json"
    headers = {"User-Agent": USER_AGENT}
    response = requests.get(url, headers=headers)
    data = response.json()
    for entry in data.values():
        if entry['ticker'].lower() == ticker.lower():
            return str(entry['cik_str']).zfill(10)
    return None

mag7_ciks = {ticker: get_cik_from_ticker(ticker) for ticker in MAG7_TICKERS}
print("CIKs:", mag7_ciks)

# === 2. Get Filings Metadata for Each CIK (2015–2018, 10-K and 10-Q) ===
def get_filings_for_cik(cik, form_types=("10-K", "10-Q"), start_year=2015, end_year=2018):
    """
    Get filings for CIK, filter by form type and year range.
    """
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    headers = {"User-Agent": USER_AGENT}
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch for CIK {cik}")
        return []
    data = response.json()
    recent = data.get("filings", {}).get("recent", {})
    results = []
    for i in range(len(recent.get("form", []))):
        form = recent["form"][i]
        if form not in form_types: continue
        date = recent["filingDate"][i]
        year = int(date.split("-")[0])
        if year < start_year or year > end_year: continue
        accession_raw = recent["accessionNumber"][i]
        results.append({
            "form": form, "date": date, "year": year,
            "accession": accession_raw.replace("-", ""),
            "accession_raw": accession_raw
        })
    return results

# --- Get all filings metadata for the MAG7 in 2015–2018 ---
mag7_filings = {}
for ticker, cik in mag7_ciks.items():
    print(f"Fetching filings for {ticker}...")
    mag7_filings[ticker] = get_filings_for_cik(cik, start_year=2015, end_year=2018)
    time.sleep(0.5)  # Be nice to SEC!

# --- Save metadata for reference ---
with open("mag7_filing_metadata.json", "w") as f:
    json.dump(mag7_filings, f, indent=2)

# === 3. Download HTML Filings ===
def download_filing_html(cik, accession_raw, save_dir, user_agent=USER_AGENT):
    """
    Download main HTML filing if present.
    """
    cik_clean = str(int(cik))
    accession_nodash = accession_raw.replace("-", "")
    base_url = f"https://www.sec.gov/Archives/edgar/data/{cik_clean}/{accession_nodash}/"
    index_url = base_url + accession_raw + "-index.html"
    headers = {"User-Agent": user_agent}
    resp = requests.get(index_url, headers=headers)
    if resp.status_code != 200:
        print(f"❌ Failed to load index page for {accession_raw}")
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    table = soup.find("table", class_="tableFile", summary="Document Format Files")
    if table is None:
        print(f"⚠️ No document table found for {accession_raw}")
        return None
    main_doc_link = None
    for link in table.find_all("a"):
        href = link.get("href", "")
        if href.endswith(".htm") or href.endswith(".txt"):
            main_doc_link = href
            break
    if not main_doc_link:
        print(f"⚠️ No main document found in index for {accession_raw}")
        return None
    full_doc_url = "https://www.sec.gov" + main_doc_link
    os.makedirs(save_dir, exist_ok=True)
    file_path = os.path.join(save_dir, f"{accession_raw}.html")
    doc_resp = requests.get(full_doc_url, headers=headers)
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(doc_resp.text)
    print(f"✅ Saved filing to {file_path}")
    return file_path

# --- Download filings (HTML only) for all tickers ---
for ticker in mag7_ciks:
    print(f"\n📥 Downloading filings for {ticker}")
    filings = mag7_filings[ticker]
    for filing in filings:
        download_filing_html(
            cik=mag7_ciks[ticker],
            accession_raw=filing["accession_raw"],
            save_dir=f"filings/{ticker}"
        )
        time.sleep(0.5)

# --- Clean HTML to Extract Text ---
def clean_filing_text(html_path):
    with open(html_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, "html.parser")
    for tag in soup(["script", "style", "table", "noscript"]):
        tag.decompose()
    raw_text = soup.get_text(separator="\n")
    lines = [line.strip() for line in raw_text.splitlines()]
    clean_lines = [line for line in lines if line]
    text = "\n".join(clean_lines)
    text = re.sub(r'\n[A-Z\s]{10,}\n', '\n', text)  # Remove most all-caps lines (tables)
    return text


##----- Optional: convert each cleaned filing to JSON and save it to a `processed` folder in the same directory --------##

# def process_and_save_filing(ticker, cik, filing, raw_dir="filings", out_dir="processed"):
#     in_path = os.path.join(raw_dir, ticker, f"{filing['accession_raw']}.html")
#     if not os.path.exists(in_path):
#         print(f"Missing: {in_path}")
#         return
#     try:
#         content = clean_filing_text(in_path)
#         output = {
#             "ticker": ticker,
#             "cik": cik,
#             "form": filing["form"],
#             "date": filing["date"],
#             "accession": filing["accession_raw"],
#             "text": content
#         }
#         os.makedirs(out_dir, exist_ok=True)
#         out_path = os.path.join(out_dir, f"{ticker}_{filing['accession_raw']}.json")
#         with open(out_path, "w", encoding="utf-8") as f:
#             json.dump(output, f, indent=2)
#         print(f"✅ Saved: {out_path}")
#     except Exception as e:
#         print(f"❌ Error processing {in_path}: {e}")

# for ticker, filings in mag7_filings.items():
#     cik = mag7_ciks[ticker]
#     print(f"\n🧼 Processing filings for {ticker}...")
#     for filing in filings:
#         process_and_save_filing(ticker, cik, filing)


CIKs: {'AAPL': '0000320193', 'MSFT': '0000789019', 'AMZN': '0001018724', 'GOOGL': '0001652044', 'META': '0001326801', 'NVDA': '0001045810', 'TSLA': '0001318605'}
Fetching filings for AAPL...
Fetching filings for MSFT...
Fetching filings for AMZN...
Fetching filings for GOOGL...
Fetching filings for META...
Fetching filings for NVDA...
Fetching filings for TSLA...

📥 Downloading filings for AAPL
✅ Saved filing to filings/AAPL\0000320193-18-000145.html
✅ Saved filing to filings/AAPL\0000320193-18-000100.html
✅ Saved filing to filings/AAPL\0000320193-18-000070.html
✅ Saved filing to filings/AAPL\0000320193-18-000007.html
✅ Saved filing to filings/AAPL\0000320193-17-000070.html
✅ Saved filing to filings/AAPL\0000320193-17-000009.html
✅ Saved filing to filings/AAPL\0001628280-17-004790.html
✅ Saved filing to filings/AAPL\0001628280-17-000717.html
✅ Saved filing to filings/AAPL\0001628280-16-020309.html
✅ Saved filing to filings/AAPL\0001628280-16-017809.html
✅ Saved filing to filings/AAPL\0
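
The all-caps filter at the end of `clean_filing_text` is easy to misread, so here is a minimal sketch that isolates it on a toy string. The pattern is copied from the cell above; the wrapper name `drop_all_caps_lines` and the sample text are illustrative assumptions.

```python
import re

def drop_all_caps_lines(text: str) -> str:
    # Same pattern as in clean_filing_text: a newline, then 10 or more characters
    # that are uppercase letters or whitespace, then another newline.
    return re.sub(r'\n[A-Z\s]{10,}\n', '\n', text)

# Toy input: a short heading, a long all-caps heading, and a sentence.
sample = "PART II\nCONSOLIDATED BALANCE SHEETS\nRevenue grew during the year."
print(drop_all_caps_lines(sample))
```

Two things to keep in mind: headings that contain digits or punctuation (for example `ITEM 1A. RISK FACTORS`) do not match and are kept, and because `\s` also matches newlines, several consecutive all-caps lines can be removed in a single match.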

# Load Filing Metadata & Extract Raw Filing Text
This section loads the SEC filings' metadata and extracts clean text from each raw HTML filing, attaching relevant metadata (company, accession, form, year, date, etc.) to each document.
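
The metadata match relies only on the folder layout `filings/{ticker}/{accession}.html`. A minimal sketch of that lookup, using toy metadata in the same shape as `mag7_filing_metadata.json` (the helper name `find_metadata` and the toy values are assumptions for illustration):

```python
import os

def find_metadata(path, filing_metadata):
    """Derive (company, accession) from a filing path and look up its metadata."""
    company = os.path.basename(os.path.dirname(path))            # parent folder, e.g. "AAPL"
    accession_raw = os.path.splitext(os.path.basename(path))[0]  # file name without ".html"
    entries = filing_metadata.get(company, [])
    # First entry whose accession number matches, or None if nothing matches.
    match = next((e for e in entries if e["accession_raw"] == accession_raw), None)
    return company, accession_raw, match

# Toy metadata; the values are illustrative, not taken from a real run.
toy_metadata = {
    "AAPL": [{"form": "10-K", "date": "2018-11-05", "year": 2018,
              "accession_raw": "0000320193-18-000145"}]
}

print(find_metadata("filings/AAPL/0000320193-18-000145.html", toy_metadata))
```

A file with no matching entry yields `None` for the third element, which is the case the loader reports with its "No metadata" warning.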


In [20]:
# === Load Filing Metadata & Raw Text ===

# Load filing metadata from the provided JSON file.
with open(METADATA_PATH, 'r', encoding='utf-8') as f:
    filing_metadata = json.load(f)

def load_filings_text(input_dir: str) -> list[Document]:
    """
    Loads SEC filing HTML files, extracts clean text, and attaches metadata.

    Args:
        input_dir (str): Path to directory containing HTML filings.

    Returns:
        list[Document]: List of LangChain Document objects with text and metadata.
    """
    docs = []
    # Recursively find all .html filing files
    for path in glob.glob(f"{input_dir}/**/*.html", recursive=True):
        company = os.path.basename(os.path.dirname(path))
        accession_raw = os.path.splitext(os.path.basename(path))[0]
        # Read HTML content
        with open(path, 'r', encoding='utf-8') as hf:
            html = hf.read()
        soup = BeautifulSoup(html, 'html.parser')
        # Remove unwanted elements
        for tag in soup(['script', 'style']):
            tag.decompose()
        # Extract visible text
        text = soup.get_text(separator=' ')
        # Assemble metadata
        meta = {
            'source': path,
            'company': company,
            'accession_raw': accession_raw
        }
        # Match and add extra metadata from filing_metadata
        entries = filing_metadata.get(company, [])
        match = next((e for e in entries if e['accession_raw'] == accession_raw), None)
        if match:
            meta.update({
                'form': match['form'],
                'date': match['date'],
                'year': match['year']
            })
        else:
            print(f"⚠️ No metadata for {company} {accession_raw}")
        # Create LangChain Document with text and metadata
        docs.append(Document(page_content=text, metadata=meta))
    return docs

# Load all filings as Document objects with attached metadata
documents = load_filings_text(INPUT_DIR)
print(f"Loaded {len(documents)} documents.")


Loaded 30 documents.


# Chunking Documents & Saving to JSONL
Now we split each filing into overlapping text chunks (for better retrieval) and save the resulting chunks—with metadata—to a `.jsonl` file for easy reuse.
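
To show what overlapping chunks look like, here is a deliberately simplified sketch that uses fixed-size character windows. It is not LangChain's `RecursiveCharacterTextSplitter` algorithm, which additionally prefers to break at the configured separators; the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Tiny sizes make the overlap visible: each chunk repeats the previous two characters.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.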


In [21]:
# === Chunking Documents ===

# Initialize LangChain's RecursiveCharacterTextSplitter for smart, overlapping chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # Max characters per chunk (length is measured with len() by default)
    chunk_overlap=200,       # Overlap for context continuity
    separators=["\n\n", "\n", " "]  # Prioritize splitting at paragraph or line breaks
)

# Split all documents into chunks
chunks = splitter.split_documents(documents)
print(f"Generated {len(chunks)} chunks.")

# === Cell 4: Save Chunks to JSONL ===

# Save all chunks as JSONL (one JSON object per line, easy for downstream processing)
jsonl_path = os.path.join(OUTPUT_DIR, 'mag7_chunks_2015_2018.jsonl')
with open(jsonl_path, 'w', encoding='utf-8') as out_f:
    for chunk in chunks:
        # Each chunk gets its text and metadata combined into a record
        record = {'text': chunk.page_content, **chunk.metadata}
        out_f.write(json.dumps(record) + '\n')
print(f"Saved chunks to {jsonl_path}")

Generated 9303 chunks.
Saved chunks to output\mag7_chunks_2015_2018.jsonl


# Build FAISS Vector Store with Embeddings
This section converts all text chunks into embeddings (using HuggingFace's MiniLM model) and saves them into a FAISS vector store for fast semantic search.
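
Conceptually, semantic search compares the query embedding with every stored embedding and returns the closest ones. The sketch below does this by brute force with Euclidean distance on toy 2-D vectors, which stand in for the 384-dimensional MiniLM embeddings. It only illustrates the idea and makes no claim about how FAISS is implemented internally; all names are illustrative.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, vectors, k=2):
    """Return indices of the k stored vectors closest to query_vec."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query_vec, vectors[i]))
    return ranked[:k]

# Toy "index": three stored vectors; the query is identical to the first one.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_vectors, k=2))
# → [0, 2]
```

A dedicated vector library matters once there are thousands of chunks, as in this notebook, because a pure-Python loop over every vector becomes slow.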


In [22]:
# === Build FAISS Vector Store ===

# Choose an embedding provider: uncomment ONE block at a time, or loop through all (add the API key to .env to use your own embedding model)

## --- HuggingFace (local, no API key needed) ---
# Initialize the embedding model (MiniLM is fast, free, and good for small datasets)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create FAISS vector store directly from all document chunks
vector_store = FAISS.from_documents(chunks, embeddings)


# Save FAISS index and associated data for downstream retrieval use
faiss_output_dir = os.path.join(OUTPUT_DIR, 'faiss_mag7')
vector_store.save_local(faiss_output_dir)
print(f"✅ FAISS index built and saved to '{faiss_output_dir}'")

## --- OpenAI (requires OPENAI_API_KEY) ---
# openai_embeddings = OpenAIEmbeddings(
#     model="text-embedding-ada-002",         # Or other supported models
#     openai_api_key=os.getenv('OPENAI_API_KEY')
# )
# openai_vector_store = FAISS.from_documents(chunks, openai_embeddings)
# openai_vector_store.save_local(os.path.join(OUTPUT_DIR, 'faiss_mag7_openai'))
# print("✅ FAISS index (OpenAI) saved to 'output/faiss_mag7_openai'")

## --- Cohere (requires COHERE_API_KEY) ---
# cohere_embeddings = CohereEmbeddings(
#     model="embed-english-v3.0",  # Or whichever Cohere model you want
#     cohere_api_key=COHERE_API_KEY
# )
# cohere_vector_store = FAISS.from_documents(chunks, cohere_embeddings)
# cohere_vector_store.save_local(os.path.join(OUTPUT_DIR, 'faiss_mag7_cohere'))
# print("✅ FAISS index (Cohere) saved to 'output/faiss_mag7_cohere'")


✅ FAISS index built and saved to 'output\faiss_mag7'


# Query the FAISS Vector Store (HuggingFace Embeddings)
This cell demonstrates how to load your FAISS vector store and retrieve the top relevant chunks for a given question using semantic search.


In [23]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Re-initialize the embeddings (must match the model used during indexing)
embeddings = HuggingFaceEmbeddings(
    model_name='all-MiniLM-L6-v2'
)

# Load the FAISS index from disk. allow_dangerous_deserialization=True is needed because
# the saved docstore is pickled; only enable it for an index you built yourself.
vector_store = FAISS.load_local(
    os.path.join(OUTPUT_DIR, 'faiss_mag7'),
    embeddings,
    allow_dangerous_deserialization=True
)

# Define your sample query
query = "What was Apple's return on equity in their 2018 10-K?"

# Perform similarity search to get top 5 most relevant chunks
hits = vector_store.similarity_search(query, k=5)

# Display results with metadata and snippet preview
for idx, hit in enumerate(hits, start=1):
    print(f"--- Hit {idx} ---")
    print("Company:", hit.metadata.get('company'))
    print("Form:", hit.metadata.get('form'))
    print("Date:", hit.metadata.get('date'))
    print("Snippet:", hit.page_content[:500], "...\n")

--- Hit 1 ---
Company: AAPL
Form: 10-K
Date: 2017-11-03
Snippet: total size of the program from $200 billion to $250 billion through March 2018. This included increasing its share repurchase authorization from $140 billion to $175 billion and raising its quarterly dividend from $0.52 to $0.57 per share beginning in May 2016. During 2016, the Company spent $29.0 billion to repurchase shares of its common stock and paid dividends and dividend equivalents of $12.2 billion. Additionally, the Company issued $23.9 billion of U.S. dollar-denominated term debt and A$ ...

--- Hit 2 ---
Company: AAPL
Form: 10-K
Date: 2015-10-28
Snippet: $ 
 183 
    
    
 $ 
 183 
    
 
     Apple Inc. | 2015 Form 10-K | 21
 
 
 
 Table of Contents 
 
 
 Item 6. 
 Selected Financial Data    The information set forth below for the five years ended
September 26, 2015, is not necessarily indicative of results of future operations, and should be read in conjunction with Part II, Item 7, “Management’s Discussion a

# Hybrid Search Pipeline: BM25 Keyword Retrieval + Semantic Embedding Re-Ranking
This cell demonstrates a hybrid retrieval approach:  
1. **BM25** is used for fast keyword-based candidate selection.  
2. **Embeddings** are then used to re-rank these candidates by semantic similarity to the query.
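
The re-ranking step boils down to cosine similarity between the query embedding and each candidate embedding. A minimal, dependency-free sketch with toy 2-D vectors (the chunk ids and vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rerank(query_emb, candidate_embs):
    """Sort candidate ids by cosine similarity to the query, best first."""
    scored = [(cid, cosine_similarity(query_emb, emb)) for cid, emb in candidate_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy candidates, as if they were the top BM25 hits.
toy_candidates = {"chunk_a": [0.0, 1.0], "chunk_b": [1.0, 1.0], "chunk_c": [1.0, 0.0]}
for cid, score in rerank([1.0, 0.0], toy_candidates):
    print(cid, round(score, 4))
```

Because BM25 picks the candidate pool first, a chunk that is semantically relevant but shares no keywords with the query never reaches this step. That is the main trade-off of this two-stage design.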


In [24]:
from rank_bm25 import BM25Okapi
import numpy as np
import pandas as pd

# Load the chunks into DataFrame
df = pd.read_json(jsonl_path, lines=True)

# --- BM25 Keyword Search ---
# Tokenize each document for BM25 scoring
tokenized_texts = [text.split() for text in df['text']]
bm25 = BM25Okapi(tokenized_texts)

# Tokenize the query
tokenized_query = query.split()

# Compute BM25 scores for all chunks
bm25_scores = bm25.get_scores(tokenized_query)

# Get indices of the top-N BM25 candidates (high to low)
top_n = 10
bm25_indices = np.argsort(bm25_scores)[::-1][:top_n]

# --- Semantic Re-ranking using Embeddings ---
# Generate embedding for the query (embed_query is the dedicated method for single queries)
query_emb = embeddings.embed_query(query)

candidates = []
for i in bm25_indices:
    # Generate embedding for each candidate document
    doc_emb = embeddings.embed_documents([df.loc[i, 'text']])[0]
    # Compute cosine similarity between query and document
    cosine_similarity = np.dot(query_emb, doc_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb))
    candidates.append((i, cosine_similarity))

# Sort candidates by highest cosine similarity
candidates.sort(key=lambda x: x[1], reverse=True)

# --- Display the Top Results ---
print("Top 5 Hybrid BM25+Embedding Results:")
for idx, score in candidates[:5]:
    meta = df.loc[idx, ['company', 'form', 'date']].to_dict()
    snippet = df.loc[idx, 'text'][:200].replace('\n', ' ')
    print(f"{meta} | Score: {score:.4f}\nSnippet: {snippet}...\n")

Top 5 Hybrid BM25+Embedding Results:
{'company': 'NVDA', 'form': '10-Q', 'date': Timestamp('2017-08-23 00:00:00')} | Score: 0.5750
Snippet: year  2018  was $688 million, up 117% from a year earlier and up 24% sequentially. Net income and net income per diluted share for the  second quarter of fiscal  year  2018  were $583 million and $0.9...

{'company': 'NVDA', 'form': '10-Q', 'date': Timestamp('2017-11-21 00:00:00')} | Score: 0.5187
Snippet: and improved gross and operating margins.  During the  first nine months of fiscal  year  2018 , we returned to shareholders $909 million in share repurchases and $250 million in cash dividends. For f...

{'company': 'AAPL', 'form': '10-Q', 'date': Timestamp('2018-08-01 00:00:00')} | Score: 0.4260
Snippet: the lower 2018 blended U.S. tax rate as a result of the Act and the reduction in its provisional tax expense estimate, partially offset by higher taxes on foreign earnings during the third quarter of ...

{'company': 'NVDA', 'form': '10-K', 'da

#  RAG: Generate Answers with LLM Using Retrieved Context
This cell demonstrates a typical RAG (Retrieval-Augmented Generation) workflow:  
1. Retrieve relevant context chunks using FAISS vector search.  
2. Construct a prompt with the context and the user query.  
3. Use a language model (e.g., GPT-4) to generate a grounded answer.
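
Step 2 can be sketched without any LLM library: truncate each retrieved chunk, join the snippets, and fill the template. The template text is the one used in the cells below; `build_prompt` and the toy chunks are illustrative names, and plain `str.format` stands in for LangChain's `PromptTemplate`.

```python
PROMPT_TEMPLATE = """
Here are some excerpts from financial documents related to your query:

{context}

Based on the above context, please answer the following question:
{query}
"""

def build_prompt(query, chunk_texts, snippet_len=500):
    """Join truncated chunks into a context block and fill the template."""
    context = "\n".join(text[:snippet_len] for text in chunk_texts)
    return PROMPT_TEMPLATE.format(context=context, query=query)

# Toy chunks: the second one is longer than snippet_len and gets truncated.
toy_chunks = ["Net sales increased during the quarter.", "x" * 800]
prompt_text = build_prompt("What drove net sales?", toy_chunks)
print(prompt_text)
```

Truncating every chunk to 500 characters keeps the prompt short, but it can also cut off the very figure the question asks about. That may be one reason the answers below report that the revenue number is missing from the excerpts.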


## ChatOpenAI

In [34]:
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Define the model
model = ChatOpenAI(model="gpt-4.1-mini")  # Use the appropriate model
# model = get_llm()

# Define the prompt template to structure the query and context
prompt_template = """
Here are some excerpts from financial documents related to your query:

{context}

Based on the above context, please answer the following question:
{query}
"""

# Initialize LLMChain with the model and the prompt template
prompt = PromptTemplate(input_variables=["query", "context"], template=prompt_template)
llm_chain = LLMChain(llm=model, prompt=prompt)

# Function to generate an answer from the query and the retrieved context
def get_answer(query):
    # Perform similarity search to retrieve relevant document chunks
    hits = vector_store.similarity_search(query, k=5)
    
    # Prepare the context by concatenating snippets from the top k documents
    context = "\n".join([hit.page_content[:500] for hit in hits])  # You can adjust the snippet length as needed
    
    # Get the answer from the model using LLMChain
    answer = llm_chain.run({"query": query, "context": context})
    return answer

# Example query
query = "What was Apple's revenue for Q1 2018?"
answer = get_answer(query)

print("Answer:", answer)


Answer: The excerpts provided do not explicitly state Apple's revenue for Q1 2018. However, there are some relevant financial figures and notes related to services revenue and general expenses, but no direct total revenue number for Q1 2018 is mentioned.

To accurately answer your question:

- The Q1 2018 Form 10-Q is referenced, but the exact revenue figure for Q1 2018 is not included in the text.
- The excerpts mention details about services revenue, one-time items, and expenses but no consolidated revenue total.

**Conclusion:**  
Based on the provided excerpts, Apple’s total revenue for Q1 2018 is not specifically stated, so I cannot provide the exact revenue number for that quarter from the given information.


## Groq

In [37]:
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Define the model
# model = ChatOpenAI(model="gpt-4.1-mini")  # Use the appropriate model
model = get_llm()

# Define the prompt template to structure the query and context
prompt_template = """
Here are some excerpts from financial documents related to your query:

{context}

Based on the above context, please answer the following question:
{query}
"""

# Initialize LLMChain with the model and the prompt template
prompt = PromptTemplate(input_variables=["query", "context"], template=prompt_template)
llm_chain = LLMChain(llm=model, prompt=prompt)

# Function to generate an answer from the query and the retrieved context
def get_answer(query):
    # Perform similarity search to retrieve relevant document chunks
    hits = vector_store.similarity_search(query, k=5)
    
    # Prepare the context by concatenating snippets from the top k documents
    context = "\n".join([hit.page_content[:500] for hit in hits])  # You can adjust the snippet length as needed
    
    # Get the answer from the model using LLMChain
    answer = llm_chain.run({"query": query, "context": context})
    return answer

# Example query
query = "What was Apple's revenue for Q1 2018?"
answer = get_answer(query)

print("Answer:", answer)


🔷 Using Groq (Llama-3-70B-8192) for LLM Q&A
Answer: Based on the provided excerpts, Apple's revenue for Q1 2018 is not explicitly stated. However, the revenue for Q2 2018 and Q3 2017 are mentioned, but not Q1 2018.


## ChatOpenAI

In [38]:
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Define the model
model = ChatOpenAI(model="gpt-4.1-mini")  # Use the appropriate model
# model = get_llm()

# Define the prompt template to structure the query and context
prompt_template = """
Here are some excerpts from financial documents related to your query:

{context}

Based on the above context, please answer the following question:
{query}
"""

# Initialize LLMChain with the model and the prompt template
prompt = PromptTemplate(input_variables=["query", "context"], template=prompt_template)
llm_chain = LLMChain(llm=model, prompt=prompt)

# Function to generate an answer from the query and the retrieved context
def get_answer(query):
    # Perform similarity search to retrieve relevant document chunks
    hits = vector_store.similarity_search(query, k=5)
    
    # Prepare the context by concatenating snippets from the top k documents
    context = "\n".join([hit.page_content[:500] for hit in hits])  # You can adjust the snippet length as needed
    
    # Get the answer from the model using LLMChain
    answer = llm_chain.run({"query": query, "context": context})
    return answer

# Example query
query = "What was Apple's revenue for Q1 2018?"
answer = get_answer(query)

print("Answer:", answer)


Answer: The excerpts provided do not explicitly state Apple's total revenue for Q1 2018. They mention certain details about services net sales, one-time items, and some expenses, but the total revenue figure for Q1 2018 is not included in the text.

If you need the exact revenue for Q1 2018, it would typically be found in the full Apple Q1 2018 Form 10-Q or Form 10-K filing. Based on publicly available information, Apple's reported total revenue for Q1 2018 was approximately **$88.3 billion**. However, this figure is not present in the excerpts you provided.



### Conversational Q&A Agent for SEC Filings (Manual JSON Parsing)

This code defines a function that answers financial queries using a vector store (FAISS) and an LLM while maintaining chat history. It returns answers in JSON format with context citations.
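
Manual JSON parsing is the fragile part of this approach, because models often wrap the JSON in extra prose or code fences. Here is a minimal sketch of a more forgiving parser, assuming the reply is already a plain string. The function name `parse_llm_json` is hypothetical, and unlike the cell below it reports an unknown confidence as `None` instead of a fixed number.

```python
import json

def parse_llm_json(raw_text, fallback_sources):
    """Parse an LLM reply as JSON; fall back to wrapping the raw text."""
    # Keep only the outermost {...} span so surrounding prose is ignored.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    candidate = raw_text[start:end + 1] if start != -1 and end > start else raw_text
    try:
        parsed = json.loads(candidate)
        if isinstance(parsed, dict) and "answer" in parsed:
            return parsed
    except json.JSONDecodeError:
        pass
    # No usable JSON: return the raw text and mark the confidence as unknown.
    return {"answer": raw_text, "sources": fallback_sources, "confidence": None}

reply = 'Sure, here it is: {"answer": "42", "sources": [], "confidence": 0.9}'
print(parse_llm_json(reply, fallback_sources=[]))
```

Returning `None` in the fallback avoids presenting a made-up confidence score when the model's output could not be parsed.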



## Groq

In [39]:
# === Setup: Chat history buffer ===
chat_history = []

# === Core function: conversational_get_answer ===
def conversational_get_answer(query, k=5, snippet_len=500):
    hits = vector_store.similarity_search(query, k=k)
    context = "\n".join([hit.page_content[:snippet_len] for hit in hits])
    sources = []
    for hit in hits:
        meta = hit.metadata.copy()
        sources.append({
            "company": meta.get("company"),
            "filing": meta.get("form"),
            "period": meta.get("date") or meta.get("year"),
            "snippet": hit.page_content[:snippet_len],
            "url": meta.get("url", meta.get("source"))
        })

    # Prepare chat history string for the prompt (up to last 5 exchanges)
    chat_history_str = ""
    for q, a in chat_history[-5:]:
        chat_history_str += f"User: {q}\nSystem: {a}\n"

    # Build prompt with chat history
    prompt_text = f"""
You are a financial analyst AI.

Chat history:
{chat_history_str}

Context:
{context}

Step by step, answer the following question. If comparison or trend analysis is needed, break down the reasoning and show calculations. Cite each source as you use it.

{query}

Respond ONLY with a valid JSON of the form:
{{
    "answer": "...your answer...",
    "sources": [{{"company": "...", "filing": "...", "period": "...", "snippet": "...", "url": "..."}}],
    "confidence": 0.92
}}
"""

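    # Get LLM output (direct call, NOT LLMChain with memory).
    # `model` is whichever chat model the most recently run cell assigned.
    # invoke() returns a message object, so parse its .content string.
    result = model.invoke(prompt_text).content
    try:
        parsed = json.loads(result)
        if "answer" not in parsed:
            # 0.85 is a hard-coded placeholder, not a measured confidence
            parsed = {"answer": result, "sources": sources, "confidence": 0.85}
    except (json.JSONDecodeError, TypeError):
        parsed = {"answer": result, "sources": sources, "confidence": 0.85}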

    # Add this turn to history
    chat_history.append((query, parsed['answer']))
    return parsed

# CLI Loop
while True:
    user_query = input("User: ")
    if user_query.lower() in ["exit", "quit"]:
        break
    response = conversational_get_answer(user_query)
    print("\nSystem:", response["answer"])
    print("Sources:", response["sources"])
    print("Confidence:", response.get("confidence", "N/A"), "\n")



System: content='{\n  "answer": "Hello! How can I assist you with your financial analysis today?",\n  "sources": [],\n  "confidence": 1.0\n}' additional_kwargs={} response_metadata={'token_usage': {'completion_tokens': 34, 'prompt_tokens': 125, 'total_tokens': 159, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-mini', 'system_fingerprint': 'fp_6f2eabb9a5', 'finish_reason': 'stop', 'logprobs': None} id='run--534ed81e-f597-4d09-b65a-296f29ba1179-0'
Sources: [{'company': 'TSLA', 'filing': '10-Q', 'period': '2018-11-02', 'snippet': '18', 'url': 'filings\\TSLA\\0001564590-18-026353.html'}, {'company': 'TSLA', 'filing': '10-Q', 'period': '2018-08-06', 'snippet': '22', 'url': 'filings\\TSLA\\0001564590-18-019254.html'}, {'company': 'TSLA', 'filing': '10-Q', 'period': '2017-05-10', 'snippet': '41', 'url': 'filings

## Conversational Financial Q&A Agent (Basic Version) with the desired schema via `StructuredOutputParser`

This cell defines a function for a retrieval-augmented generation (RAG) pipeline using LangChain.
It retrieves the most relevant chunks of SEC filings from the vector store, builds a prompt (with optional chat history), and queries a language model.
The answer is returned as a JSON object with `answer`, `sources`, and `confidence` fields.  
This version uses a generic prompt and does not maintain multi-turn chat history by default.
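
As a rough illustration of the response shape described above, the following sketch turns a raw model reply into a dict with `answer`, `sources`, and `confidence`. It is not the notebook's implementation: `parse_llm_json` is a hypothetical helper, the handling of Markdown code fences and the fallback confidence of 0.0 are assumptions, and only the standard library is used.

```python
import json
import re

# Built programmatically so this example contains no literal fence markers.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_llm_json(raw_text, fallback_sources=None):
    """Parse a model reply into {'answer', 'sources', 'confidence'}.

    Hypothetical helper for illustration; not part of LangChain.
    """
    fallback_sources = fallback_sources or []
    # Models often wrap JSON in a Markdown code fence; keep only the inside.
    match = FENCED_JSON.search(raw_text)
    candidate = match.group(1) if match else raw_text
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict) or "answer" not in data:
        # Assumption: unparseable replies are returned verbatim, confidence 0.
        return {"answer": raw_text, "sources": fallback_sources, "confidence": 0.0}
    # Fill in missing sources and clamp confidence into [0, 1].
    data["sources"] = data.get("sources") or fallback_sources
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    data["confidence"] = min(1.0, max(0.0, confidence))
    return data
```

Passing the retrieved sources as `fallback_sources` keeps citations tied to what the vector store actually returned whenever the model omits them.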


## ChatOpenAI

In [40]:
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Define the model
model = ChatOpenAI(model="gpt-4.1-mini")  # Use the appropriate model
# model = get_llm()

# Define the prompt template to structure the query and context
prompt_template = """
Here are some excerpts from financial documents related to your query:

{context}

Based on the above context, please answer the following question:
{query}
"""

# Initialize LLMChain with the model and the prompt template
prompt = PromptTemplate(input_variables=["query", "context"], template=prompt_template)
llm_chain = LLMChain(llm=model, prompt=prompt)

# Function to generate an answer from the query and the retrieved context
def get_answer(query):
    # Perform similarity search to retrieve relevant document chunks
    hits = vector_store.similarity_search(query, k=5)
    
    # Prepare the context by concatenating snippets from the top k documents
    context = "\n".join([hit.page_content[:500] for hit in hits])  # You can adjust the snippet length as needed
    
    # Get the answer from the model using LLMChain
    answer = llm_chain.run({"query": query, "context": context})
    return answer

# Example query
query = "What was Apple's revenue for Q1 2018?"
answer = get_answer(query)

print("Answer:", answer)


Answer: The excerpts provided do not explicitly state Apple's total revenue for Q1 2018. They include references to various revenue components and notes about changes in accounting standards and one-time items, but no direct figure for total revenue in Q1 2018.

To answer your question precisely: **The provided excerpts do not contain Apple's total revenue for Q1 2018.**

If you need the exact revenue figure, I recommend consulting Apple's official Q1 2018 Form 10-Q or earnings release, which would have the detailed financial statements including total revenue.


## Groq

In [41]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.chat_models import ChatOpenAI

# 1. Define the desired output schema
answer_schema = ResponseSchema(
    name="answer",
    description="The answer to the user's financial question, grounded in context."
)
sources_schema = ResponseSchema(
    name="sources",
    description="A list of source dictionaries with 'company', 'filing', 'period', 'snippet', and 'url' fields."
)
confidence_schema = ResponseSchema(
    name="confidence",
    description="A float between 0 and 1 expressing the model's confidence in its answer."
)

# 2. Setup the parser
output_parser = StructuredOutputParser.from_response_schemas(
    [answer_schema, sources_schema, confidence_schema]
)

# 3. Create prompt template
prompt_template = PromptTemplate(
    input_variables=["chat_history", "context", "query"],
    template="""
You are a financial analyst AI assistant.

Chat history:
{chat_history}

Context:
{context}

Answer the following question as thoroughly as possible, showing reasoning step by step if needed. 
Return your response as a JSON with fields: 'answer', 'sources', and 'confidence'.

{format_instructions}

User question: {query}
""",
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

# 4. Create the model and function
# model = ChatOpenAI(model="gpt-4.1-mini")
model = get_llm()

def conversational_get_answer(query, k=5, snippet_len=500):
    # Retrieve context (same as before)
    hits = vector_store.similarity_search(query, k=k)
    context = "\n".join([hit.page_content[:snippet_len] for hit in hits])
    sources = []
    for hit in hits:
        meta = hit.metadata.copy()
        sources.append({
            "company": meta.get("company"),
            "filing": meta.get("form"),
            "period": meta.get("date") or meta.get("year"),
            "snippet": hit.page_content[:snippet_len],
            "url": meta.get("url", meta.get("source"))
        })

    # Manual chat history management (see previous cell)
    chat_history_str = ""  # Build this string from your turns as before

    # Build prompt
    prompt = prompt_template.format(
        chat_history=chat_history_str,
        context=context,
        query=query,
    )

    # 5. Invoke the model and parse the structured output (raises if the reply is malformed)
    raw_output = model.invoke(prompt)
    if hasattr(raw_output, "content"):
        raw_output = raw_output.content  # For ChatOpenAI and other chat models
    parsed = output_parser.parse(raw_output)

    if not parsed.get("sources"):  # Fill sources if model leaves it blank
        parsed["sources"] = sources
    return parsed

# Example usage:
query = "What was Apple's revenue for Q1 2018?"
response = conversational_get_answer(query)
print("Answer:", response["answer"])
print("Sources:", response["sources"])
print("Confidence:", response["confidence"])


🔷 Using Groq (Llama-3-70B-8192) for LLM Q&A
Answer: $61,137 million
Sources: [{'company': 'Apple Inc.', 'filing': 'Q1 2018 Form 10-Q', 'period': 'Q1 2018', 'snippet': 'Net sales $ 61,137 $ 52,578', 'url': 'https://www.sec.gov/Archives/edgar/data/0000320193/000032019301000004/a10-q2018.htm'}]
Confidence: 0.95
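
Because `conversational_get_answer` only falls back to the retrieved `sources` when the model leaves that field empty, the sources printed above are whatever the model wrote, not the retrieval metadata. One possible safeguard is sketched below, under the assumption that a citation is trustworthy only if its snippet text appears inside a retrieved chunk. `split_sources_by_grounding` is a hypothetical helper, not part of the notebook or LangChain.

```python
def split_sources_by_grounding(model_sources, retrieved_sources):
    """Split model-cited sources into (grounded, ungrounded) lists.

    A cited source counts as grounded when its snippet occurs inside at
    least one retrieved snippet, ignoring case and whitespace differences.
    """
    def norm(text):
        # Lowercase and collapse runs of whitespace to single spaces.
        return " ".join(str(text or "").lower().split())

    retrieved_texts = [norm(src.get("snippet")) for src in retrieved_sources]
    grounded, ungrounded = [], []
    for src in model_sources:
        snippet = norm(src.get("snippet"))
        if snippet and any(snippet in text for text in retrieved_texts):
            grounded.append(src)
        else:
            # Empty or unmatched snippets are treated as unverified.
            ungrounded.append(src)
    return grounded, ungrounded
```

In the function above, this could run right after parsing, with `parsed["sources"]` as `model_sources` and the locally built `sources` list as `retrieved_sources`. Ungrounded citations could then be dropped or flagged.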


## ChatOpenAI

In [42]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.chat_models import ChatOpenAI

# 1. Define the desired output schema
answer_schema = ResponseSchema(
    name="answer",
    description="The answer to the user's financial question, grounded in context."
)
sources_schema = ResponseSchema(
    name="sources",
    description="A list of source dictionaries with 'company', 'filing', 'period', 'snippet', and 'url' fields."
)
confidence_schema = ResponseSchema(
    name="confidence",
    description="A float between 0 and 1 expressing the model's confidence in its answer."
)

# 2. Setup the parser
output_parser = StructuredOutputParser.from_response_schemas(
    [answer_schema, sources_schema, confidence_schema]
)

# 3. Create prompt template
prompt_template = PromptTemplate(
    input_variables=["chat_history", "context", "query"],
    template="""
You are a financial analyst AI assistant.

Chat history:
{chat_history}

Context:
{context}

Answer the following question as thoroughly as possible, showing reasoning step by step if needed. 
Return your response as a JSON with fields: 'answer', 'sources', and 'confidence'.

{format_instructions}

User question: {query}
""",
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

# 4. Create the model and function
model = ChatOpenAI(model="gpt-4.1-mini")
# model = get_llm()

def conversational_get_answer(query, k=5, snippet_len=500):
    # Retrieve context (same as before)
    hits = vector_store.similarity_search(query, k=k)
    context = "\n".join([hit.page_content[:snippet_len] for hit in hits])
    sources = []
    for hit in hits:
        meta = hit.metadata.copy()
        sources.append({
            "company": meta.get("company"),
            "filing": meta.get("form"),
            "period": meta.get("date") or meta.get("year"),
            "snippet": hit.page_content[:snippet_len],
            "url": meta.get("url", meta.get("source"))
        })

    # Manual chat history management (see previous cell)
    chat_history_str = ""  # Build this string from your turns as before

    # Build prompt
    prompt = prompt_template.format(
        chat_history=chat_history_str,
        context=context,
        query=query,
    )

    # 5. Invoke the model and parse the structured output (raises if the reply is malformed)
    raw_output = model.invoke(prompt)
    if hasattr(raw_output, "content"):
        raw_output = raw_output.content  # For ChatOpenAI and other chat models
    parsed = output_parser.parse(raw_output)

    if not parsed.get("sources"):  # Fill sources if model leaves it blank
        parsed["sources"] = sources
    return parsed

# Example usage:
query = "What was Apple's revenue for Q1 2018?"
response = conversational_get_answer(query)
print("Answer:", response["answer"])
print("Sources:", response["sources"])
print("Confidence:", response["confidence"])


Answer: The provided context does not explicitly state Apple's revenue for Q1 2018. However, the context references Apple's Q1 2018 Form 10-Q and mentions various financial metrics and discussions but does not give a direct revenue figure for that quarter. To find the exact revenue for Q1 2018, one would need to refer directly to Apple's Q1 2018 Form 10-Q filing or official earnings release for that period.
Sources: [{'company': 'Apple Inc.', 'filing': 'Q1 2018 Form 10-Q', 'period': 'Q1 2018', 'snippet': 'The Company will adopt the new revenue standards in its first quarter of 2019 utilizing the full retrospective transition method. The new revenue standards are not expected to have a material impact on the amount and timing of revenue recognized in the Company’s consolidated financial statements.', 'url': 'https://www.sec.gov/Archives/edgar/data/320193/000032019318000070/a10q201831.htm'}]
Confidence: 0.6


## Conversational Financial Q&A Agent (Robust, MAG7-specific, Multi-turn)

This cell defines an advanced RAG pipeline for MAG7 financial Q&A using LangChain, with the desired output schema enforced by `StructuredOutputParser`.
It uses a robust prompt with explicit instructions, tracks recent chat history for multi-turn conversations, and includes error handling.
Answers are always returned as structured JSON with answer, sources, and confidence fields.
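
The multi-turn behaviour described above depends on feeding only the most recent turns back into the prompt. A minimal sketch of that idea follows, assuming a window of six turns (the value the cell below slices with). `ChatHistory` is a hypothetical class, not part of LangChain, and the turns are placeholder strings.

```python
from collections import deque


class ChatHistory:
    """Keep the last `max_turns` (question, answer) pairs for prompt building."""

    def __init__(self, max_turns=6):
        # A deque with maxlen silently discards the oldest turn once full.
        self._turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self._turns.append((question, answer))

    def as_prompt_text(self):
        # "User:/Agent:" layout, one line per speaker, oldest turn first.
        return "".join(f"User: {q}\nAgent: {a}\n" for q, a in self._turns)


# Placeholder turns; with max_turns=2 the first pair is dropped.
history = ChatHistory(max_turns=2)
history.add("first question", "first answer")
history.add("second question", "second answer")
history.add("third question", "third answer")
print(history.as_prompt_text())
```

A bounded window keeps the prompt size predictable as the conversation grows, at the cost of forgetting earlier turns.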


## Groq

## ChatOpenAI

In [32]:
import json
from langchain.prompts import PromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.chat_models import ChatOpenAI

# 1. Output schema for structured JSON
answer_schema = ResponseSchema(
    name="answer",
    description="The final answer to the user's question, with step-by-step reasoning if needed."
)
sources_schema = ResponseSchema(
    name="sources",
    description="A list of dicts with keys: company, filing, period, snippet, url, showing where info came from."
)
confidence_schema = ResponseSchema(
    name="confidence",
    description="A float between 0 and 1 expressing confidence in the answer."
)
output_parser = StructuredOutputParser.from_response_schemas(
    [answer_schema, sources_schema, confidence_schema]
)

# 2. Prompt template (with all instructions)
prompt_template = PromptTemplate(
    input_variables=["chat_history", "context", "query"],
    template="""
You are a financial Q&A AI agent for the Magnificent 7 (AAPL, MSFT, AMZN, GOOGL, META, NVDA, TSLA).
Your instructions:
- Accept and answer all user queries, including complex, comparative, or trend questions.
- Auto-correct obvious typos (e.g., 'Q1 20218' → 'Q1 2018'). If unsure, clarify.
- Use the chat history to interpret follow-ups.
- Only answer using the provided context (from SEC filings); never guess or hallucinate data.
- For comparative, trend, or multi-step queries, reason step by step in the 'answer' field.
- Always cite your sources in a list of dicts with company, filing, period, snippet, and url.
- If context is missing, state so clearly.
- Output ONLY valid JSON with these fields: answer, sources, confidence.

{format_instructions}

Chat history:
{chat_history}

Context from SEC filings:
{context}

User question: {query}
""",
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

# 3. LLM setup (OpenAI, or your chosen model)
model = ChatOpenAI(model="gpt-4.1-mini")  # Or your OpenAI API/model of choice

# 4. Conversation memory
chat_history = []

def conversational_get_answer(query, k=5, snippet_len=500):
    # Vector search to get relevant context
    hits = vector_store.similarity_search(query, k=k)
    context = "\n".join([hit.page_content[:snippet_len] for hit in hits])
    sources = []
    for hit in hits:
        meta = hit.metadata.copy()
        sources.append({
            "company": meta.get("company"),
            "filing": meta.get("form"),
            "period": meta.get("date") or meta.get("year"),
            "snippet": hit.page_content[:snippet_len],
            "url": meta.get("url", meta.get("source"))
        })

    # Format chat history for prompt
    chat_history_str = ""
    for q, a in chat_history[-6:]:
        chat_history_str += f"User: {q}\nAgent: {a}\n"

    # Compose the prompt
    prompt = prompt_template.format(
        chat_history=chat_history_str,
        context=context,
        query=query,
    )

    # LLM call and output parsing
    raw_output = model.invoke(prompt)
    if hasattr(raw_output, "content"):
        raw_output = raw_output.content

    # Try parsing as structured JSON
    try:
        parsed = output_parser.parse(raw_output)
        if not parsed.get("sources"):
            parsed["sources"] = sources
    except Exception:
        parsed = {"answer": raw_output, "sources": sources, "confidence": 0.7}

    # Save to chat history for follow-ups
    chat_history.append((query, parsed["answer"]))
    return parsed

# 5. CLI loop
print("🔷 MAG7 Conversational Agent (multi-turn, typo-tolerant, step-by-step). Type 'exit' to quit.\n")
while True:
    user_query = input("User: ")
    if user_query.strip().lower() in ["exit", "quit"]:
        break
    print(f"\nUser: {user_query}\n")
    print("System: Searching relevant sections from 10-K/Q filings...\n")
    response = conversational_get_answer(user_query)
    print("Response:")
    print(json.dumps(response, indent=2, ensure_ascii=False))
    print("\n---\n")


🔷 MAG7 Conversational Agent (multi-turn, typo-tolerant, step-by-step). Type 'exit' to quit.


User: 

System: Searching relevant sections from 10-K/Q filings...

Response:
{
  "answer": "There is no user question provided to answer. Please provide a specific question regarding the Magnificent 7 companies (AAPL, MSFT, AMZN, GOOGL, META, NVDA, TSLA) for me to assist you.",
  "sources": [
    {
      "company": "TSLA",
      "filing": "10-Q",
      "period": "2017-08-04",
      "snippet": "20",
      "url": "filings\\TSLA\\0001564590-17-015705.html"
    },
    {
      "company": "TSLA",
      "filing": "10-Q",
      "period": "2018-08-06",
      "snippet": "17",
      "url": "filings\\TSLA\\0001564590-18-019254.html"
    },
    {
      "company": "TSLA",
      "filing": "10-Q",
      "period": "2018-11-02",
      "snippet": "18",
      "url": "filings\\TSLA\\0001564590-18-026353.html"
    },
    {
      "company": "TSLA",
      "filing": "10-Q",
      "period": "2017-05-10",
      "snippe