# Part 1: The Data Factory

## Objective
Transform `2024_Annual_Report.pdf` into a fine-tuning dataset of Question/Answer pairs.

## Workflow
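1. **Ingestion**: Parse the PDF with LlamaParse and save the result as Markdown and JSON.
2. **Chunking**: Split the text into chunks of up to 1,500 characters with a 150-character overlap, and save them to JSON.
3. **Generation**:
    - LLM A: Generate 10 questions per chunk (Hard Facts, Strategic Summaries, Stylistic/Creative).
    - LLM B: Answer each question using only the chunk it came from.
4. **Storage**: Shuffle, split 80/20 into train/test, and save as JSONL.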
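
To make the target format concrete, here is a minimal sketch of one dataset record and how it could be serialized as a single JSONL line. The field names match the dictionaries built in the generation loop later in this notebook; the `QARecord` type, the `to_jsonl_line` helper, and the placeholder `chunk_id` are illustrative assumptions, not part of the project code.

```python
import json
from typing import TypedDict


class QARecord(TypedDict):
    chunk_id: str  # ID of the chunk the question was generated from
    question: str
    answer: str
    context: str   # full text of the source chunk


def to_jsonl_line(record: QARecord) -> str:
    """Serialize one record as a single JSONL line (no trailing newline)."""
    return json.dumps(record, ensure_ascii=False)


# Illustrative record; the chunk_id is a placeholder, not a real UUID from the run.
example: QARecord = {
    "chunk_id": "chunk_0",
    "question": "What is Uber's mission as stated in the 2024 Annual Report?",
    "answer": "We reimagine the way the world moves for the better.",
    "context": "We reimagine the way the world moves for the better",
}

line = to_jsonl_line(example)
print(line)
```

Keeping `context` in every record makes each line self-contained, at the cost of repeating the chunk text once per question.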

In [11]:
import os
import sys
import json
import uuid
import random
import asyncio
import nest_asyncio
from pathlib import Path
from dotenv import load_dotenv

# Apply nest_asyncio for LlamaParse async loops in Jupyter
nest_asyncio.apply()

# Add project root to path
notebook_dir = Path.cwd()
project_root = notebook_dir.parent if notebook_dir.name == "notebooks" else notebook_dir
sys.path.insert(0, str(project_root))

from src.services.llm_services import (
    load_config,
    get_llm,
    get_pdf_parser,
    print_config_summary,
    load_pdf_and_save,
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Load Environment & Config
load_dotenv()
config = load_config(str(project_root / "src/config/config.yaml"))
print_config_summary(config)

 Config loaded:
  LLM: groq / llama-3.3-70b-versatile
  Embeddings: sbert / sentence-transformers/all-MiniLM-L6-v2
  Temperature: 0.2
  Artifacts: ./artifacts


## 1. Ingestion (LlamaParse)

In [12]:
pdf_path = project_root / "data/pdfs/2024_Annual_Report.pdf"

print(f"Loading PDF from: {pdf_path}")

parser = get_pdf_parser(config)
# load_pdf_and_save returns the full parsed text as a single string
# and also writes .md and .json copies to output_dir.
full_text = load_pdf_and_save(
    pdf_path=str(pdf_path),
    parser=parser,
    output_dir=str(project_root / "data/interim")
)
# full_text is already a single string, so no page joining is needed.
print(f"Total Characters: {len(full_text)}")
print("Sample content:")
print(full_text[:500])

Loading PDF from: c:\Development\financial-intelligence-engine\data\pdfs\2024_Annual_Report.pdf
Loading PDF from: c:\Development\financial-intelligence-engine\data\pdfs\2024_Annual_Report.pdf
Started parsing the file under job_id 286346c6-e9dd-4447-b045-d0b1da092425
Loaded 142 pages/documents.
Saved parsed Markdown to: c:\Development\financial-intelligence-engine\data\interim\2024_Annual_Report_parsed.md
Saved parsed JSON to: c:\Development\financial-intelligence-engine\data\interim\2024_Annual_Report_parsed.json
Total Characters: 703039
Sample content:
# On Our Way

# 2024 ANNUAL REPORT



# Uber’s Mission

We reimagine the way the world moves for the better

We are Uber. The go-getters. The kind of people who are relentless about our mission to help people go anywhere and get anything and earn their way.

Movement is what we power. It’s our lifeblood. It runs through our veins. It’s what gets us out of bed each morning. It pushes us to constantly reimagine how we can move better. For

## 2. Chunking

In [13]:
# Split into chunks of up to 1,500 characters with a 150-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.create_documents([full_text])

# Assign IDs and Save Chunks
chunk_data = []
for chunk in chunks:
    chunk.metadata["chunk_id"] = str(uuid.uuid4())
    chunk_data.append({
        "chunk_id": chunk.metadata["chunk_id"],
        "content": chunk.page_content,
        "metadata": chunk.metadata
    })

chunks_save_path = project_root / "data/interim/chunks.json"
with open(chunks_save_path, "w", encoding="utf-8") as f:
    json.dump(chunk_data, f, indent=2)

print(f"Created {len(chunks)} chunks and saved {len(chunk_data)} items to {chunks_save_path}")

Created 654 chunks and saved 654 items to c:\Development\financial-intelligence-engine\data\interim\chunks.json
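
As a rough sanity check on the chunk count, the sketch below estimates how many chunks a 703,039-character document would produce if every chunk were exactly 1,500 characters long and overlapped its predecessor by exactly 150 characters. This is an idealization, and `estimate_chunk_count` is a hypothetical helper. `RecursiveCharacterTextSplitter` prefers to break on paragraph and line boundaries, so real chunks are often shorter than the maximum, which is consistent with the observed 654 chunks exceeding the estimate.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the chunk count assuming full-size chunks with a fixed overlap."""
    stride = chunk_size - chunk_overlap  # new characters contributed by each later chunk
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    # The first chunk covers chunk_size characters; each later chunk adds `stride` more.
    return 1 + math.ceil((total_chars - chunk_size) / stride)


estimate = estimate_chunk_count(total_chars=703_039, chunk_size=1500, chunk_overlap=150)
print(f"Idealized estimate: {estimate} chunks (observed: 654)")
```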


## 3. The Generation Loop (Q/A Generation)

In [14]:
# Initialize one LLM per task.
# Use the task-specific config section if present; otherwise fall back to the global config.
question_config = config.get("question_generation", config)
answer_config = config.get("answer_generation", config)

print(f"Question Generation Model: {question_config.get('llm_model')}")
print(f"Answer Generation Model: {answer_config.get('llm_model')}")

question_llm = get_llm(question_config)
answer_llm = get_llm(answer_config)

# --- PROMPT A: Question Generation ---
question_gen_system = """
You are an expert financial analyst creating a fine-tuning dataset.
Your task is to generate 10 diverse questions based STRICTLY on the provided text chunk.

The questions must cover these three categories:
1. **Hard Facts**: Specific numbers, dates, names, or metrics found in the text.
2. **Strategic Summaries**: High-level strategic goals, risks, or performance overviews.
3. **Stylistic/Creative**: Questions about the tone, style, or specific phrasing used.

Output format must be a raw JSON list of strings, e.g.:
["Question 1", "Question 2", ...]
"""

question_prompt = ChatPromptTemplate.from_messages([
    ("system", question_gen_system),
    ("human", "Context Chunk:\n{chunk}\n\nGenerate 10 questions:")
])

# --- PROMPT B: Answer Generation ---
answer_gen_system = """
You are an expert financial analyst.
Answer the following question based STRICTLY and ONLY on the provided context chunk.
If the answer is not in the chunk, say "Information not found in context."
Be concise and professional.
"""

answer_prompt = ChatPromptTemplate.from_messages([
    ("system", answer_gen_system),
    ("human", "Context Chunk:\n{context}\n\nQuestion: {question}\n\nAnswer:")
])

# Chains
question_chain = question_prompt | question_llm | JsonOutputParser()
answer_chain = answer_prompt | answer_llm

Question Generation Model: gemini-2.5-flash
Answer Generation Model: openai/gpt-oss-20b
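
The question prompt asks for a raw JSON list of strings, but a model can still return duplicates, empty strings, or non-string items. A minimal sketch of a validation step that could run on the parsed output before the answer stage is shown below. `normalize_questions` is a hypothetical helper, not part of the project code, and the case-insensitive deduplication rule is an assumption.

```python
from typing import Any


def normalize_questions(raw: Any, max_questions: int = 10) -> list[str]:
    """Return unique, non-empty question strings, capped at max_questions.

    Raises ValueError if `raw` is not a list.
    """
    if not isinstance(raw, list):
        raise ValueError(f"expected a list of questions, got {type(raw).__name__}")

    seen: set[str] = set()
    cleaned: list[str] = []
    for item in raw:
        if not isinstance(item, str):
            continue  # skip non-string entries such as numbers or nested objects
        question = item.strip()
        key = question.lower()  # case-insensitive duplicate detection
        if question and key not in seen:
            seen.add(key)
            cleaned.append(question)
    return cleaned[:max_questions]


sample = ["What is X?", " what is x? ", "", 42, "Why Y?"]
print(normalize_questions(sample))
```

Raising on a non-list keeps the failure visible, so the caller can log the chunk and move on instead of iterating over an unexpected structure.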


In [None]:
import time
from tqdm.notebook import tqdm

qa_dataset = []

print(f"Processing {len(chunks)} chunks...")

for i, chunk in enumerate(tqdm(chunks)):
    chunk_text = chunk.page_content
    chunk_id = chunk.metadata.get("chunk_id", f"chunk_{i}")
    
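    # 1. Generate questions. JsonOutputParser normally returns a parsed list;
    #    if a raw JSON string comes back instead, parse it here.
    try:
        questions = question_chain.invoke({"chunk": chunk_text})
        if isinstance(questions, str):
            questions = json.loads(questions)
        if not isinstance(questions, list):
            raise ValueError(f"expected a list of questions, got {type(questions).__name__}")
    except Exception as e:
        print(f"Error generating questions for chunk {i}: {e}")
        continue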
        
    # 2. Generate Answers for each question
    for q in questions:
        try:
            answer_response = answer_chain.invoke({
                "context": chunk_text,
                "question": q
            })
            answer_text = answer_response.content
            print(f"question : {q}")
            qa_dataset.append({
                "chunk_id": chunk_id,
                "question": q,
                "answer": answer_text,
                "context": chunk_text
            })
            print(f"Answer : {answer_text}")
        except Exception as e:
            print(f"Error generating answer for Q: {q[:20]}... : {e}")

print(f"Total items generated: {len(qa_dataset)}")

Processing 654 chunks...


  0%|          | 0/654 [00:00<?, ?it/s]

question : What is Uber's mission as stated in the 2024 Annual Report?
Answer : Uber’s mission, as stated in the 2024 Annual Report, is: **“We reimagine the way the world moves for the better.”**
question : What is the fiscal year end date mentioned in the Annual Report?
Answer : December 31, 2024.
question : What is the Commission File Number for Uber Technologies, Inc.?
Answer : Commission File Number: 001-38902
question : What is the I.R.S. Employer Identification Number for Uber Technologies, Inc.?
Answer : The I.R.S. Employer Identification Number for Uber Technologies, Inc. is **45‑2647441**.
question : Where are the principal executive offices of Uber located?
Answer : San Francisco, California 94158.
question : What specific section of the Securities Exchange Act of 1934 does Uber's Annual Report comply with?
Answer : The report is filed **“pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934.”**
question : How does Uber describe its commitment to movement in 
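
The loop makes roughly 11 LLM calls per chunk (one for questions, up to 10 for answers), and a failed call is logged and skipped, so transient errors such as rate limits silently shrink the dataset. One possible mitigation is to retry with exponential backoff. The sketch below is a generic helper under that assumption; `call_with_retry`, its parameters, and the simulated `flaky` function are hypothetical and not part of the project code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last exception once max_attempts is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a function that fails twice before succeeding.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays: list[float] = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, max_attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the loop, each request could be wrapped in a zero-argument lambda, for example `call_with_retry(lambda: answer_chain.invoke({"context": chunk_text, "question": q}))`.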

## 4. Storage (Split and Save as JSONL)

In [None]:
# Shuffle the dataset to ensure random distribution
random.shuffle(qa_dataset)

# Split 80/20 train/test
split_idx = int(len(qa_dataset) * 0.8)
train_data = qa_dataset[:split_idx]
test_data = qa_dataset[split_idx:]

processed_dir = project_root / "data/processed"
processed_dir.mkdir(parents=True, exist_ok=True)

# Helper to save JSONL
def save_jsonl(data, filename):
    path = processed_dir / filename
    with open(path, "w", encoding="utf-8") as f:
        for entry in data:
            json.dump(entry, f)
            f.write("\n")
    print(f"Saved {len(data)} items to {path}")

save_jsonl(train_data, "train.jsonl")
save_jsonl(test_data, "golden_test_set.jsonl")

# Also save the full raw dataset for backup
qa_save_path = processed_dir / "qa_dataset_full.json"
with open(qa_save_path, "w", encoding="utf-8") as f:
    json.dump(qa_dataset, f, indent=2)

print(f"Total items generated: {len(qa_dataset)}")
print(f"Full dataset saved to {qa_save_path}")
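
Each chunk yields up to 10 QA pairs that share the same `context`, so shuffling individual pairs can place questions about one chunk in both the training file and the test file. If that overlap matters for evaluation, one option is to split by `chunk_id` instead. The sketch below is an alternative to the pair-level split above, not the notebook's method; `split_by_chunk`, the seed value, and the toy data are assumptions.

```python
import random
from typing import Any


def split_by_chunk(
    records: list[dict[str, Any]],
    train_fraction: float = 0.8,
    seed: int = 42,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split records so every chunk_id lands entirely in train or entirely in test."""
    chunk_ids = sorted({r["chunk_id"] for r in records})  # sorted for determinism
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    rng.shuffle(chunk_ids)

    cut = int(len(chunk_ids) * train_fraction)
    train_ids = set(chunk_ids[:cut])

    train = [r for r in records if r["chunk_id"] in train_ids]
    test = [r for r in records if r["chunk_id"] not in train_ids]
    return train, test


# Toy data: 5 chunks with 2 questions each.
toy = [
    {"chunk_id": f"c{i}", "question": f"q{i}-{j}"}
    for i in range(5)
    for j in range(2)
]
train, test = split_by_chunk(toy)
print(len(train), len(test))
```

The 80/20 ratio here applies to chunks, so the pair counts can drift slightly from 80/20 when chunks produce different numbers of answered questions. The fixed seed also makes the split reproducible across runs.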