# Part 1: The Data Factory

## Objective
Transform `2024_Annual_Report.pdf` into a fine-tuning dataset of Question/Answer pairs.

## Workflow
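1. **Ingestion**: Parse the PDF to markdown with LlamaParse.
2. **Chunking**: Split the text into chunks of at most 1,500 characters with a 150-character overlap.
3. **Generation**: 
    - Prompt A: Generate 10 questions per chunk (Hard Facts, Strategic Summaries, Stylistic/Creative).
    - Prompt B: Answer each question using only its source chunk.
4. **Storage**: Shuffle, split 80/20 into train/test, and write each split as JSONL.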

In [1]:
import os
import sys
import json
import asyncio
import nest_asyncio
from pathlib import Path
from dotenv import load_dotenv

# Allow nested event loops so LlamaParse's async calls can run inside Jupyter
nest_asyncio.apply()

# Add project root to path
notebook_dir = Path.cwd()
project_root = notebook_dir.parent if notebook_dir.name == "notebooks" else notebook_dir
sys.path.insert(0, str(project_root))

from src.services.llm_services import (
    load_config,
    get_llm,
    get_pdf_parser,
    print_config_summary
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Load Environment & Config
load_dotenv()
config = load_config(str(project_root / "src/config/config.yaml"))
print_config_summary(config)


 Config loaded:
  LLM: openrouter (openai/gpt-4o-mini)
  Embeddings: sbert / sentence-transformers/all-MiniLM-L6-v2
  Temperature: 0.2
  Artifacts: ./artifacts


## 1. Ingestion (LlamaParse)

In [2]:
pdf_path = project_root / "data/pdfs/2024_Annual_Report.pdf"

print(f"Loading PDF from: {pdf_path}")

parser = get_pdf_parser(config)

# Parse the document - LlamaParse returns a list of Document objects
documents = parser.load_data(str(pdf_path))

print(f"Loaded {len(documents)} pages/documents.")

# Combine into full text for chunking
full_text = "\n\n".join([doc.text for doc in documents])
print(f"Total Characters: {len(full_text)}")
print("Sample content:")
print(full_text[:500])

Loading PDF from: c:\Development\financial-intelligence-engine\data\pdfs\2024_Annual_Report.pdf
Started parsing the file under job_id 7c539b01-6381-4e7b-8894-3f99ac695873
Loaded 142 pages/documents.
Total Characters: 703039
Sample content:
# On Our Way

# 2024 ANNUAL REPORT



# Uber’s Mission

We reimagine the way the world moves for the better

We are Uber. The go-getters. The kind of people who are relentless about our mission to help people go anywhere and get anything and earn their way.

Movement is what we power. It’s our lifeblood. It runs through our veins. It’s what gets us out of bed each morning. It pushes us to constantly reimagine how we can move better. For you. For all the places you want to go. For all the things 


## 2. Chunking

In [3]:
# Split into chunks of at most 1,500 characters, with 150 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.create_documents([full_text])
print(f"Created {len(chunks)} chunks.")

Created 654 chunks.
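
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window sketch using only the standard library. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph breaks, so its chunks are often shorter than the limit. That is consistent with the 654 chunks above, where a pure fixed window over 703,039 characters would give 521. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1500, chunk_overlap=150):
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
    print(demo)  # each chunk starts with the last character of the previous one
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.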


## 3. The Generation Loop (Q/A Generation)

In [4]:
# Initialize the LLM (the same model serves both chains)
llm = get_llm(config)

# --- PROMPT A: Question Generation ---
question_gen_system = """
You are an expert financial analyst creating a fine-tuning dataset.
Your task is to generate 10 diverse questions based STRICTLY on the provided text chunk.

The questions must cover these three categories:
1. **Hard Facts**: Specific numbers, dates, names, or metrics found in the text.
2. **Strategic Summaries**: High-level strategic goals, risks, or performance overviews.
3. **Stylistic/Creative**: Questions about the tone, style, or specific phrasing used.

Output format must be a raw JSON list of strings, e.g.:
["Question 1", "Question 2", ...]
"""

question_prompt = ChatPromptTemplate.from_messages([
    ("system", question_gen_system),
    ("human", "Context Chunk:\n{chunk}\n\nGenerate 10 questions:")
])

# --- PROMPT B: Answer Generation ---
answer_gen_system = """
You are an expert financial analyst.
Answer the following question based STRICTLY and ONLY on the provided context chunk.
If the answer is not in the chunk, say "Information not found in context."
Be concise and professional.
"""

answer_prompt = ChatPromptTemplate.from_messages([
    ("system", answer_gen_system),
    ("human", "Context Chunk:\n{context}\n\nQuestion: {question}\n\nAnswer:")
])

# Chains
question_chain = question_prompt | llm | JsonOutputParser()
answer_chain = answer_prompt | llm
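
The question prompt asks for a raw JSON list of strings, but a model can still wrap the list in a markdown fence, return an object, or repeat itself. The sketch below is one way to normalize such output before the answer step. It uses only the standard library; `coerce_question_list` is a hypothetical helper, not part of LangChain.

```python
import json
import re

# Matches a leading or trailing markdown code fence (three backticks, optional "json").
_FENCE_RE = re.compile(r"^`{3}(?:json)?\s*|\s*`{3}$", re.IGNORECASE)


def coerce_question_list(raw, max_questions=10):
    """Return a clean list of unique, non-empty question strings (possibly empty)."""
    if isinstance(raw, str):
        cleaned = _FENCE_RE.sub("", raw.strip())
        try:
            raw = json.loads(cleaned)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, list):
        return []

    seen = set()
    questions = []
    for item in raw:
        if not isinstance(item, str):
            continue  # drop numbers, dicts, nulls
        question = item.strip()
        if question and question not in seen:
            seen.add(question)
            questions.append(question)
    return questions[:max_questions]
```

Applied to the parser's result, an empty return value would signal that the chunk should be skipped.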

In [5]:
import time
from tqdm.notebook import tqdm

generated_dataset = []

# Processing loop
# Set LIMIT_CHUNKS to None for a full run, or to a small number (e.g. 5) for a quick test
LIMIT_CHUNKS = None

chunks_to_process = chunks[:LIMIT_CHUNKS] if LIMIT_CHUNKS else chunks

print(f"Processing {len(chunks_to_process)} chunks...")

for i, chunk in tqdm(enumerate(chunks_to_process), total=len(chunks_to_process)):
    chunk_text = chunk.page_content
    
    # Step A: Generate Questions
    try:
        questions = question_chain.invoke({"chunk": chunk_text})
        
        # Ensure we have a list
        if not isinstance(questions, list):
            print(f"Warning: Chunk {i} generated non-list questions. Skipping.")
            continue
            
        # Step B: Generate Answers for each question
        for q in questions:
            try:
                answer_response = answer_chain.invoke({"context": chunk_text, "question": q})
                answer = answer_response.content
                
                # Add to dataset
                generated_dataset.append({
                    "input": q,
                    "output": answer,
                    "context": chunk_text
                })
            except Exception as e:
                # str(q) keeps this handler safe if the parsed list contains a non-string item
                print(f"Error generating answer for Q: {str(q)[:30]}... : {e}")

    except Exception as e:
        print(f"Error generating questions for chunk {i}: {e}")
        # Back off briefly after a failed call (e.g. rate limiting on a free tier)
        time.sleep(1)

print(f"Generated {len(generated_dataset)} Q/A pairs.")

Processing 654 chunks...



Error generating answer for Q: What is the exclusive forum fo... : Error code: 402 - {'error': {'message': 'Insufficient credits. This account never purchased credits. Make sure your key is on the correct account or org, and if so, purchase more at https://openrouter.ai/settings/credits', 'code': 402}}
Error generating answer for Q: What contingency is mentioned ... : Error code: 402 - {'error': {'message': 'Insufficient credits. This account never purchased credits. Make sure your key is on the correct account or org, and if so, purchase more at https://openrouter.ai/settings/credits', 'code': 402}}
Error generating answer for Q: What has the Delaware Supreme ... : Error code: 402 - {'error': {'message': 'Insufficient credits. This account never purchased credits. Make sure your key is on the correct account or org, and if so, purchase more at https://openrouter.ai/settings/credits', 'code': 402}}
Error generating answer for Q: How does the text categorize a... : Error code: 402 - {'e
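
The log above shows the same 402 "Insufficient credits" error repeated for each question, because the loop treats every exception alike and moves on. A possible refinement is to retry transient failures with exponential backoff and stop immediately on errors that retrying cannot fix. The sketch below assumes the raised exception exposes an integer `status_code` attribute; check that against the client library in use. `call_with_retry`, `FatalAPIError`, and `FATAL_STATUS_CODES` are hypothetical names.

```python
import time


class FatalAPIError(RuntimeError):
    """Raised when retrying cannot help (for example, HTTP 402)."""


# Assumption: these statuses mean bad credentials or no credits, so retries are pointless.
FATAL_STATUS_CODES = {401, 402, 403}


def call_with_retry(fn, *, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); retry transient errors with exponential backoff, abort on fatal ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)  # assumed attribute
            if status in FATAL_STATUS_CODES:
                raise FatalAPIError(f"Non-retryable error (status {status})") from exc
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

In the loop, the answer call could become `call_with_retry(lambda: answer_chain.invoke({"context": chunk_text, "question": q}))`, with `FatalAPIError` caught outside the chunk loop so the run ends instead of logging hundreds of identical failures.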

## 4. Storage & Splitting

In [6]:
import random

# Shuffle dataset
random.seed(42)
random.shuffle(generated_dataset)

# 80/20 Split
split_idx = int(len(generated_dataset) * 0.8)
train_set = generated_dataset[:split_idx]
test_set = generated_dataset[split_idx:]

data_dir = project_root / "data"

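def save_jsonl(data, path):
    """Write one JSON object per line (JSONL) to `path`."""
    with open(path, 'w', encoding='utf-8') as f:
        for entry in data:
            # ensure_ascii=False keeps characters such as curly quotes readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + '\n')
    print(f"Saved {len(data)} records to {path}")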

save_jsonl(train_set, data_dir / "train.jsonl")
save_jsonl(test_set, data_dir / "golden_test_set.jsonl")

Saved 2098 records to c:\Development\financial-intelligence-engine\data\train.jsonl
Saved 525 records to c:\Development\financial-intelligence-engine\data\golden_test_set.jsonl
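
Because the shuffle operates on individual Q/A pairs, questions generated from the same chunk can end up in both `train.jsonl` and `golden_test_set.jsonl`, which may inflate test scores. One alternative, sketched here with the standard library only, is to split by chunk so that every `context` appears on exactly one side. `split_by_context` is a hypothetical helper, and the split is 80/20 over chunks rather than over records.

```python
import random
from collections import defaultdict


def split_by_context(records, test_fraction=0.2, seed=42):
    """Split records so that no `context` value appears in both train and test."""
    groups = defaultdict(list)
    for record in records:
        groups[record["context"]].append(record)

    contexts = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(contexts)

    n_test = round(len(contexts) * test_fraction)
    if len(contexts) > 1:
        # Keep at least one chunk on each side.
        n_test = min(max(n_test, 1), len(contexts) - 1)
    test_contexts = set(contexts[:n_test])

    train, test = [], []
    for context in contexts:
        target = test if context in test_contexts else train
        target.extend(groups[context])
    return train, test
```

With overlapping chunks, neighbouring contexts still share up to 150 characters, so this reduces leakage rather than eliminating it.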


In [7]:
# Preview
print("Train Set Example:")
print(json.dumps(train_set[0], indent=2))
print("\nTest Set Example:")
print(json.dumps(test_set[0], indent=2))

Train Set Example:
{
  "input": "What specific risks are associated with the transition to renewable forms of energy mentioned in the text?",
  "output": "The specific risks associated with the transition to renewable forms of energy mentioned in the text include market shifts toward electric vehicles (EVs) and lower carbon business models, potential increased energy costs, and the risk of losing customers or facing criticism if the business fails to keep up with changes in consumer preferences and evolving stakeholder expectations.",
  "context": "# Climate Change Risks\n\nWe are subject to climate change risks, including physical and transitional risks, and if we are unable to manage such risks, our business may be adversely impacted.\n\nWe face climate change-related physical and transition risks, which include risks associated with market shifts toward more sustainable or renewable forms of energy and energy conservation. In the context of our business, this includes market shifts 