# LAB 2.3: MULTI-DOCUMENT CONTEXT MANAGEMENT

**Course:** Advanced Prompt Engineering Training  
**Session:** Session 2 - Advanced Context Engineering  
**Duration:** 50 minutes  
**Type:** Hands-on Multi-Source Information Synthesis

## LAB OVERVIEW

This lab focuses on **managing context from multiple documents** for comprehensive analysis. You'll learn to:

- Chunk and index multiple documents efficiently
- Track cross-document references
- Rank document relevance for specific queries
- Synthesize information from disparate sources
- Build production multi-document analysis systems

**Scenario:** You're building a commercial loan underwriting system. Each application consists of multiple documents:
- **Application Form** (5 pages) - Basic applicant info
- **Tax Returns** (3 years, 60 pages total) - Income verification
- **Bank Statements** (6 months, 30 pages) - Cash flow analysis
- **Business Plan** (20 pages) - Growth projections
- **Property Appraisal** (15 pages) - Collateral assessment
- **Credit Report** (10 pages) - Credit history

**Total:** 140 pages across 6 distinct documents

**Challenge:** Answer questions that require information from multiple documents while staying within token limits.
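
A rough budget estimate shows why the full package cannot simply be pasted into one prompt. The sketch below uses the page counts from the scenario. The figure of 500 tokens per page and the 8,000-token context budget are illustrative assumptions, not measurements of these documents or limits of any particular model.

```python
# Page counts come from the scenario above.
PAGE_COUNTS = {
    "application_form": 5,
    "tax_returns": 60,
    "bank_statements": 30,
    "business_plan": 20,
    "property_appraisal": 15,
    "credit_report": 10,
}

# ASSUMPTION: average tokens per page; real documents vary widely.
TOKENS_PER_PAGE = 500
# ASSUMPTION: hypothetical budget reserved for document context in one prompt.
CONTEXT_BUDGET = 8_000


def estimate_tokens(pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Rough token estimate for a document of the given length."""
    return pages * tokens_per_page


total_pages = sum(PAGE_COUNTS.values())
total_estimate = estimate_tokens(total_pages)

print(f"Total pages: {total_pages}")
print(f"Estimated tokens: {total_estimate:,}")
print(f"Fits in budget: {total_estimate <= CONTEXT_BUDGET}")
```

Under these assumptions the package is several times larger than the budget, which is why the rest of the lab selects chunks instead of loading whole documents.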

## LEARNING OBJECTIVES

By the end of this lab, you will be able to:

✓ Chunk documents while preserving source metadata  
✓ Build cross-document reference systems  
✓ Rank document relevance for queries  
✓ Synthesize information from multiple sources  
✓ Handle conflicting information across documents  
✓ Build production multi-document systems

## SETUP INSTRUCTIONS

### Step 1: Import Libraries

In [None]:
# Lab 2.3: Multi-Document Context Management
# Advanced Prompt Engineering Training - Session 2

import os
import json
from openai import OpenAI
import tiktoken
import pandas as pd
import numpy as np
from typing import Dict, List, Any, Optional, Tuple
from datetime import datetime
from collections import defaultdict
import hashlib

print("✓ Libraries imported")

### Step 2: Configure OpenAI Client

In [None]:
# Check if API key exists
if not os.environ.get("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found. Set it in your environment, or load your .env file before running this cell")

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Configuration
MODEL = os.getenv("MODEL_NAME")
TEMPERATURE = 0  # Minimizes sampling randomness for BFSI applications (outputs are not guaranteed to be identical across runs)

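if not MODEL:
    raise ValueError("MODEL_NAME not found. Set it in your environment, or load your .env file before running this cell")

# tiktoken raises KeyError for model names it does not recognize;
# fall back to a general-purpose encoding so token counts remain usable (approximate)
try:
    encoding = tiktoken.encoding_for_model(MODEL)
except KeyError:
    encoding = tiktoken.get_encoding("cl100k_base")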

def count_tokens(text: str) -> int:
    """Count tokens in text"""
    return len(encoding.encode(text))

print(f"✓ Model: {MODEL}")
print(f"✓ Tokenizer: {encoding.name}")

### Step 3: Create Helper Functions

In [None]:
def call_gpt4(
    prompt: str,
    system_prompt: str = "You are a helpful AI assistant.",
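    temperature: float = TEMPERATURE
) -> Dict:
    """
    Call the chat completions API with the configured MODEL
    
    Args:
        prompt (str): User prompt
        system_prompt (str): System prompt
        temperature (float): Sampling temperature (defaults to TEMPERATURE)
    
    Returns:
        Dict: 'content', token usage counts, and 'success'. On failure,
            'content' is empty and 'error' holds the exception message.
    """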
    try:
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature
        )
        
        return {
            "content": response.choices[0].message.content,
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
            "success": True
        }
    except Exception as e:
        return {
            "content": "",
            "error": str(e),
            "success": False
        }

def generate_chunk_id(content: str) -> str:
    """Generate unique ID for chunk"""
    return hashlib.md5(content.encode()).hexdigest()[:12]

print("✓ Helper functions created")

### Step 4: Load Sample Document Collection

In [None]:
# Condensed sample of the loan application package described in the scenario:
# summary text only, with two of the three tax years included

loan_documents = {
    "application_form": {
        "title": "Commercial Loan Application",
        "type": "application",
        "date": "2024-02-01",
        "pages": 5,
        "content": """
COMMERCIAL LOAN APPLICATION

Applicant Information:
Name: TechStart Solutions, LLC
Business Type: Limited Liability Company
Industry: Software Development (NAICS 541511)
Years in Business: 3.5 years
Federal Tax ID: 98-7654321

Primary Contact:
Name: Jennifer Martinez
Title: CEO & Founder
Phone: (555) 123-4567
Email: jennifer@techstartsolutions.com

Loan Request:
Requested Amount: $750,000
Purpose: Commercial real estate acquisition and renovation
Property Address: 1234 Innovation Drive, Austin, TX 78701
Property Type: Office building
Purchase Price: $1,200,000
Down Payment: $450,000 (37.5%)
Loan-to-Value Ratio: 62.5%

Financial Summary:
Annual Revenue (2023): $2,400,000
Annual Revenue (2022): $1,800,000
Annual Revenue (2021): $1,200,000
Current Employees: 18 full-time
Monthly Operating Expenses: $180,000
Current Business Debt: $350,000 (equipment loans)
"""
    },
    
    "tax_returns_2023": {
        "title": "Business Tax Return - 2023",
        "type": "tax_return",
        "date": "2024-04-15",
        "pages": 22,
        "content": """
FORM 1120 - U.S. CORPORATION INCOME TAX RETURN
Tax Year: 2023
Business Name: TechStart Solutions, LLC

INCOME:
Gross Receipts: $2,580,000
Returns and Allowances: ($180,000)
Net Receipts: $2,400,000
Cost of Goods Sold: $960,000
Gross Profit: $1,440,000

DEDUCTIONS:
Salaries and Wages: $840,000
Rent: $120,000
Taxes and Licenses: $48,000
Interest: $28,000
Depreciation: $65,000
Other Deductions: $189,000
Total Deductions: $1,290,000

TAXABLE INCOME: $150,000
Tax Liability: $31,500
Payments: $35,000
Refund Due: $3,500

BALANCE SHEET (End of Year):
Assets:
  Cash: $280,000
  Accounts Receivable: $420,000
  Equipment: $385,000
  Other Assets: $115,000
  Total Assets: $1,200,000

Liabilities:
  Accounts Payable: $180,000
  Notes Payable: $350,000
  Other Liabilities: $70,000
  Total Liabilities: $600,000

Equity: $600,000
"""
    },
    
    "tax_returns_2022": {
        "title": "Business Tax Return - 2022",
        "type": "tax_return",
        "date": "2023-04-15",
        "pages": 20,
        "content": """
FORM 1120 - U.S. CORPORATION INCOME TAX RETURN
Tax Year: 2022

INCOME:
Gross Receipts: $1,950,000
Returns and Allowances: ($150,000)
Net Receipts: $1,800,000
Cost of Goods Sold: $720,000
Gross Profit: $1,080,000

DEDUCTIONS:
Salaries and Wages: $630,000
Rent: $108,000
Taxes and Licenses: $36,000
Interest: $22,000
Depreciation: $55,000
Other Deductions: $134,000
Total Deductions: $985,000

TAXABLE INCOME: $95,000
Tax Liability: $19,950

BALANCE SHEET (End of Year):
Total Assets: $980,000
Total Liabilities: $480,000
Equity: $500,000
"""
    },
    
    "bank_statements": {
        "title": "Business Bank Statements (Last 6 Months)",
        "type": "bank_statement",
        "date": "2024-02-01",
        "pages": 30,
        "content": """
BUSINESS CHECKING ACCOUNT SUMMARY
Account: TechStart Solutions Operating Account
Period: August 2023 - January 2024

MONTHLY SUMMARY:

January 2024:
Beginning Balance: $285,000
Deposits: $245,000 (client payments, contracts)
Withdrawals: $195,000 (payroll $140k, rent $10k, vendors $45k)
Ending Balance: $335,000

December 2023:
Beginning Balance: $220,000
Deposits: $280,000
Withdrawals: $215,000 (payroll $145k, rent $10k, vendors $50k, bonuses $10k)
Ending Balance: $285,000

November 2023:
Beginning Balance: $195,000
Deposits: $235,000
Withdrawals: $210,000
Ending Balance: $220,000

October 2023:
Beginning Balance: $180,000
Deposits: $255,000
Withdrawals: $240,000
Ending Balance: $195,000

September 2023:
Beginning Balance: $165,000
Deposits: $220,000
Withdrawals: $205,000
Ending Balance: $180,000

August 2023:
Beginning Balance: $155,000
Deposits: $210,000
Withdrawals: $200,000
Ending Balance: $165,000

AVERAGE MONTHLY:
Average Deposits: $240,833
Average Withdrawals: $210,833
Average Ending Balance: $230,000
Net Monthly Cash Flow: +$30,000
"""
    },
    
    "business_plan": {
        "title": "TechStart Solutions - Business Plan & Projections",
        "type": "business_plan",
        "date": "2024-01-15",
        "pages": 20,
        "content": """
TECHSTART SOLUTIONS LLC
STRATEGIC BUSINESS PLAN 2024-2026

EXECUTIVE SUMMARY:
TechStart Solutions is a rapidly growing software development firm specializing in
enterprise cloud solutions. Founded in 2020, we have grown from 4 employees to 18,
serving Fortune 500 clients.

MARKET OPPORTUNITY:
The enterprise cloud solutions market is projected to grow 25% annually through 2028.
Our niche focus on healthcare and financial services positions us in high-growth,
high-margin sectors.

FINANCIAL PROJECTIONS:

2024 Projected Revenue: $3,200,000 (33% growth)
  - Existing Contracts: $2,400,000
  - New Business Pipeline: $800,000
  
2025 Projected Revenue: $4,200,000 (31% growth)
2026 Projected Revenue: $5,400,000 (29% growth)

PROFITABILITY TARGETS:
2024 Net Margin: 12% ($384,000 profit)
2025 Net Margin: 14% ($588,000 profit)
2026 Net Margin: 16% ($864,000 profit)

STAFFING PLAN:
Current: 18 employees
2024: 24 employees (+6)
2025: 32 employees (+8)
2026: 40 employees (+8)

REAL ESTATE ACQUISITION:
Purpose: Purchase and renovate 1234 Innovation Drive
  - Current office rent: $10,000/month ($120,000/year)
  - Mortgage payment (projected): $5,800/month ($69,600/year)
  - Annual savings: $50,400
  - Additional benefit: Equity building, tax deductions, asset appreciation

DEBT SERVICE COVERAGE:
Projected Monthly Loan Payment: $5,800
Projected Monthly EBITDA (2024): $40,000
Debt Service Coverage Ratio: 6.9x (Excellent)
"""
    },
    
    "property_appraisal": {
        "title": "Commercial Property Appraisal Report",
        "type": "appraisal",
        "date": "2024-01-20",
        "pages": 15,
        "content": """
COMMERCIAL REAL ESTATE APPRAISAL

Property Address: 1234 Innovation Drive, Austin, TX 78701
Appraisal Date: January 20, 2024
Appraiser: Austin Commercial Appraisers, LLC (License #TX-2024-8877)

PROPERTY DESCRIPTION:
Building Type: Two-story office building
Year Built: 2015
Total Square Footage: 8,500 sq ft
Lot Size: 0.45 acres
Zoning: Commercial Office (CO-1)
Parking: 24 spaces

CONDITION ASSESSMENT:
Overall Condition: Good to Excellent
Exterior: Well-maintained brick and glass facade
Interior: Modern finishes, open floor plan
Recent Updates: HVAC system upgraded 2022, roof replaced 2021
Deferred Maintenance: Minimal (<$15,000 estimated)

MARKET ANALYSIS:
Comparable Sale 1: 1100 Tech Blvd - $138/sq ft
Comparable Sale 2: 1500 Innovation Lane - $142/sq ft
Comparable Sale 3: 1890 Commerce Ave - $135/sq ft
Average Comparable: $138/sq ft

VALUATION:
Approach 1 - Sales Comparison: $1,173,000 (8,500 sq ft × $138/sq ft)
Approach 2 - Income Capitalization: $1,210,000
Approach 3 - Cost Approach: $1,195,000

FINAL APPRAISED VALUE: $1,200,000

MARKETABILITY: Good - Austin tech corridor, high demand
RECOMMENDED LOAN-TO-VALUE: Maximum 75% ($900,000)
"""
    },
    
    "credit_report": {
        "title": "Business Credit Report",
        "type": "credit_report",
        "date": "2024-02-01",
        "pages": 10,
        "content": """
BUSINESS CREDIT REPORT
Business Name: TechStart Solutions, LLC
Report Date: February 1, 2024
Reporting Bureau: Equifax Business

CREDIT SCORE: 782 (Excellent)
Risk Rating: Low Risk
Delinquency Score: 95/100

CREDIT SUMMARY:
Total Trade Lines: 8
Active Accounts: 6
Closed Accounts: 2
Total Credit Limit: $485,000
Total Balance: $350,000
Credit Utilization: 72%

PAYMENT HISTORY (24 Months):
On-Time Payments: 100%
30-Day Late: 0
60-Day Late: 0
90+ Day Late: 0
Collections: 0
Public Records: 0

ACTIVE CREDIT ACCOUNTS:

1. Equipment Loan - Regional Bank
   Balance: $180,000
   Monthly Payment: $5,200
   Status: Current, never late
   
2. Equipment Loan - Tech Finance Co
   Balance: $170,000
   Monthly Payment: $4,800
   Status: Current, never late

3. Business Credit Card - Chase
   Limit: $50,000
   Balance: $18,000
   Status: Current
   
4. Business Credit Card - AmEx
   Limit: $35,000
   Balance: $12,000
   Status: Current

5. Line of Credit - Community Bank
   Limit: $100,000
   Balance: $0 (unused)
   Status: Current

OWNER PERSONAL GUARANTEE:
Jennifer Martinez - Personal FICO: 745 (Good)
Personal DTI: 28% (Low)

RECOMMENDED LENDING DECISION:
Business demonstrates excellent credit management
Payment history perfect over 24 months
Low risk for commercial lending up to $1,000,000
"""
    }
}

print(f"✓ Sample documents loaded: {len(loan_documents)} documents")
for doc_id, doc in loan_documents.items():
    tokens = count_tokens(doc['content'])
    print(f"  - {doc['title']}: {tokens} tokens ({doc['pages']} pages)")

total_tokens = sum(count_tokens(doc['content']) for doc in loan_documents.values())
print(f"\n✓ Total document collection: {total_tokens} tokens")

## CHALLENGE 1: DOCUMENT CHUNKING & INDEXING

**Time:** 10 minutes  
**Objective:** Chunk documents while preserving metadata

### Background

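Each document is split into chunks small enough to retrieve selectively. Every chunk carries metadata (document ID, title, type, and position within the document) so that an answer can be traced back to its source.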
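
One way to picture this is a chunk record that carries its source alongside the text, plus a splitter that repeats a little trailing text at the start of the next chunk. The sketch below is a simplified, self-contained illustration: it counts whitespace-separated words instead of model tokens, and the names `SimpleChunk` and `chunk_with_overlap` are hypothetical, not part of the lab solution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleChunk:
    """Hypothetical minimal chunk record: text plus where it came from."""
    doc_id: str
    chunk_index: int
    content: str


def chunk_with_overlap(doc_id: str, text: str, chunk_size: int = 50,
                       overlap: int = 10) -> List[SimpleChunk]:
    """Split text into word windows; consecutive windows share `overlap` words.

    ASSUMPTION: words stand in for tokens to keep the example dependency-free.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[SimpleChunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(SimpleChunk(doc_id, len(chunks), " ".join(window)))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(120))
chunks = chunk_with_overlap("application_form", sample, chunk_size=50, overlap=10)
for c in chunks:
    print(c.doc_id, c.chunk_index, len(c.content.split()))
```

Overlap trades a few repeated tokens for a lower chance that a fact is cut in half at a chunk boundary.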

### Implementation

In [None]:
# SOLUTION: Document Chunking System

class DocumentChunk:
    """
    Single chunk from a document with full metadata
    """
    
    def __init__(
        self,
        chunk_id: str,
        doc_id: str,
        doc_title: str,
        doc_type: str,
        content: str,
        chunk_index: int,
        total_chunks: int,
        metadata: Optional[Dict] = None
    ):
        self.chunk_id = chunk_id
        self.doc_id = doc_id
        self.doc_title = doc_title
        self.doc_type = doc_type
        self.content = content
        self.chunk_index = chunk_index  # Position in document (0-indexed)
        self.total_chunks = total_chunks
        self.metadata = metadata or {}
        self.token_count = count_tokens(content)
    
    def __repr__(self):
        return f"Chunk({self.chunk_id}, {self.doc_title}, {self.chunk_index+1}/{self.total_chunks}, {self.token_count} tokens)"
    
    def to_dict(self):
        """Export chunk to dictionary"""
        return {
            "chunk_id": self.chunk_id,
            "doc_id": self.doc_id,
            "doc_title": self.doc_title,
            "doc_type": self.doc_type,
            "content": self.content,
            "chunk_index": self.chunk_index,
            "total_chunks": self.total_chunks,
            "token_count": self.token_count,
            "metadata": self.metadata
        }

class DocumentChunker:
    """
    Chunk documents while preserving source metadata
    """
    
    def __init__(self, chunk_size: int = 800, overlap: int = 100):
        """
        Initialize chunker
        
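        Args:
            chunk_size (int): Target tokens per chunk
            overlap (int): Intended overlapping tokens between chunks.
                Stored for reference only; chunk_document() splits on
                paragraph boundaries and does not apply it.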
        """
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.all_chunks = []
    
    def chunk_document(self, doc_id: str, document: Dict) -> List[DocumentChunk]:
        """
        Chunk a document into pieces
        
        Args:
            doc_id (str): Document identifier
            document (Dict): Document data with 'content', 'title', 'type', etc.
        
        Returns:
            List[DocumentChunk]: Document chunks
        """
        content = document['content']
        
        # Split by paragraphs first (preserves structure)
        paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
        
        chunks = []
        current_chunk = []
        current_tokens = 0
        
        for paragraph in paragraphs:
            para_tokens = count_tokens(paragraph)
            
            # If single paragraph exceeds chunk size, split it
            if para_tokens > self.chunk_size:
                # Save current chunk if exists
                if current_chunk:
                    chunks.append('\n\n'.join(current_chunk))
                    current_chunk = []
                    current_tokens = 0
                
                # Split large paragraph by sentences
                sentences = paragraph.split('. ')
                for sentence in sentences:
                    sent_tokens = count_tokens(sentence)
                    if current_tokens + sent_tokens > self.chunk_size:
                        if current_chunk:
                            chunks.append('. '.join(current_chunk) + '.')
                        current_chunk = [sentence]
                        current_tokens = sent_tokens
                    else:
                        current_chunk.append(sentence)
                        current_tokens += sent_tokens
                
                if current_chunk:
                    chunks.append('. '.join(current_chunk))
                    current_chunk = []
                    current_tokens = 0
            
            # Normal paragraph processing
            elif current_tokens + para_tokens > self.chunk_size:
                # Save current chunk and start new one
                if current_chunk:
                    chunks.append('\n\n'.join(current_chunk))
                current_chunk = [paragraph]
                current_tokens = para_tokens
            else:
                current_chunk.append(paragraph)
                current_tokens += para_tokens
        
        # Don't forget last chunk
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))
        
        # Create DocumentChunk objects
        total_chunks = len(chunks)
        chunk_objects = []
        
        for i, chunk_content in enumerate(chunks):
            chunk_id = f"{doc_id}_chunk_{i}"
            
            chunk_obj = DocumentChunk(
                chunk_id=chunk_id,
                doc_id=doc_id,
                doc_title=document['title'],
                doc_type=document['type'],
                content=chunk_content,
                chunk_index=i,
                total_chunks=total_chunks,
                metadata={
                    'date': document.get('date'),
                    'pages': document.get('pages')
                }
            )
            
            chunk_objects.append(chunk_obj)
            self.all_chunks.append(chunk_obj)
        
        return chunk_objects
    
    def chunk_all_documents(self, documents: Dict) -> Dict[str, List[DocumentChunk]]:
        """
        Chunk all documents in collection
        
        Args:
            documents (Dict): Document collection
        
        Returns:
            Dict[str, List[DocumentChunk]]: Chunked documents by doc_id
        """
        chunked_docs = {}
        
        for doc_id, document in documents.items():
            chunks = self.chunk_document(doc_id, document)
            chunked_docs[doc_id] = chunks
        
        return chunked_docs
    
    def get_all_chunks(self) -> List[DocumentChunk]:
        """Get all chunks across all documents"""
        return self.all_chunks
    
    def get_chunk_stats(self) -> Dict:
        """Get chunking statistics"""
        if not self.all_chunks:
            return {"total_chunks": 0}
        
        token_counts = [chunk.token_count for chunk in self.all_chunks]
        doc_types = defaultdict(int)
        for chunk in self.all_chunks:
            doc_types[chunk.doc_type] += 1
        
        return {
            "total_chunks": len(self.all_chunks),
            "total_tokens": sum(token_counts),
            "avg_tokens_per_chunk": np.mean(token_counts),
            "min_tokens": min(token_counts),
            "max_tokens": max(token_counts),
            "chunks_by_type": dict(doc_types)
        }

print("✓ DocumentChunk and DocumentChunker classes defined")

### Test Document Chunker

In [None]:
# Test document chunker
print("DOCUMENT CHUNKING TEST:")
print("=" * 80)

chunker = DocumentChunker(chunk_size=600, overlap=100)

# Chunk all documents
chunked_collection = chunker.chunk_all_documents(loan_documents)

print(f"\nCHUNKING RESULTS:")
print("-" * 80)
for doc_id, chunks in chunked_collection.items():
    print(f"\n{doc_id}:")
    print(f"  Document: {chunks[0].doc_title}")
    print(f"  Chunks created: {len(chunks)}")
    for chunk in chunks:
        print(f"    - {chunk.chunk_id}: {chunk.token_count} tokens")

# Get overall statistics
stats = chunker.get_chunk_stats()
print("\n" + "=" * 80)
print("OVERALL STATISTICS:")
print("-" * 80)
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.1f}")
    else:
        print(f"  {key}: {value}")

print("\n" + "=" * 80)

## CHALLENGE 2: CROSS-DOCUMENT REFERENCE TRACKING

**Time:** 10 minutes  
**Objective:** Track which documents reference the same entities

### Background

Documents often reference the same facts (revenue, dates, amounts). Tracking cross-references helps identify discrepancies or confirm consistency.
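
As a small illustration of the idea, the sketch below builds an inverted index from dollar amounts to the documents that mention them, then reports amounts shared by two or more documents. The three snippets are short excerpts of the sample loan documents used in this lab. The function name `index_amounts` and the regular expression are illustrative assumptions, not the lab's solution.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Short excerpts from the sample loan documents.
snippets = {
    "application_form": "Annual Revenue (2023): $2,400,000\nCurrent Business Debt: $350,000",
    "tax_returns_2023": "Net Receipts: $2,400,000\nNotes Payable: $350,000",
    "credit_report": "Total Balance: $350,000\nTotal Credit Limit: $485,000",
}

# Dollar sign, digits with optional thousands separators, optional cents.
AMOUNT_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def index_amounts(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each dollar amount to the set of document IDs that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for amount in AMOUNT_PATTERN.findall(text):
            index[amount].add(doc_id)
    return index


amount_index = index_amounts(snippets)
shared = {amt: sorted(ids) for amt, ids in amount_index.items() if len(ids) >= 2}

for amount, doc_ids in sorted(shared.items()):
    print(f"{amount}: {', '.join(doc_ids)}")
```

A shared amount is only a weak signal of consistency: the same figure can describe different things in different documents, so a match is a prompt for review, not proof.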

In [None]:
# SOLUTION: Cross-Document Reference Tracker

class CrossDocumentTracker:
    """
    Track entities and facts across multiple documents
    """
    
    def __init__(self):
        self.entity_index = defaultdict(list)  # entity -> list of mention dicts (doc_id, doc_title, chunk_id, doc_type)
        self.doc_entities = defaultdict(set)  # doc_id -> set of entities
    
    def extract_entities_from_chunk(self, chunk: DocumentChunk) -> List[str]:
        """
        Extract key entities from chunk (simplified version)
        
        Args:
            chunk (DocumentChunk): Document chunk
        
        Returns:
            List[str]: Extracted entities
        """
        import re
        
        entities = []
        text = chunk.content
        
        # Extract monetary amounts
        money_pattern = r'\$[\d,]+(?:\.\d{2})?'
        money_entities = re.findall(money_pattern, text)
        entities.extend([f"AMOUNT:{m}" for m in money_entities])
        
        # Extract percentages
        pct_pattern = r'\d+(?:\.\d+)?%'
        pct_entities = re.findall(pct_pattern, text)
        entities.extend([f"PERCENTAGE:{p}" for p in pct_entities])
        
        # Extract years
        year_pattern = r'\b(20\d{2})\b'
        year_entities = re.findall(year_pattern, text)
        entities.extend([f"YEAR:{y}" for y in year_entities])
        
        # Extract key financial terms
        financial_terms = [
            'revenue', 'profit', 'income', 'assets', 'liabilities',
            'loan', 'credit', 'debt', 'equity', 'payment'
        ]
        for term in financial_terms:
            if term.lower() in text.lower():
                entities.append(f"TERM:{term.upper()}")
        
        return list(set(entities))  # Remove duplicates
    
    def index_chunk(self, chunk: DocumentChunk) -> None:
        """
        Index entities from a chunk
        
        Args:
            chunk (DocumentChunk): Chunk to index
        """
        entities = self.extract_entities_from_chunk(chunk)
        
        for entity in entities:
            self.entity_index[entity].append({
                'doc_id': chunk.doc_id,
                'doc_title': chunk.doc_title,
                'chunk_id': chunk.chunk_id,
                'doc_type': chunk.doc_type
            })
            self.doc_entities[chunk.doc_id].add(entity)
    
    def index_all_chunks(self, chunks: List[DocumentChunk]) -> None:
        """Index all chunks"""
        for chunk in chunks:
            self.index_chunk(chunk)
    
    def find_cross_document_entities(self, min_docs: int = 2) -> Dict[str, List]:
        """
        Find entities mentioned in multiple documents
        
        Args:
            min_docs (int): Minimum number of documents
        
        Returns:
            Dict[str, List]: Entities and their document sources
        """
        cross_doc_entities = {}
        
        for entity, mentions in self.entity_index.items():
            # Get unique documents
            unique_docs = set(m['doc_id'] for m in mentions)
            
            if len(unique_docs) >= min_docs:
                cross_doc_entities[entity] = mentions
        
        return cross_doc_entities
    
    def find_related_documents(self, doc_id: str, min_shared: int = 3) -> List[Tuple[str, int]]:
        """
        Find documents related to given document
        
        Args:
            doc_id (str): Source document ID
            min_shared (int): Minimum shared entities
        
        Returns:
            List[Tuple[str, int]]: Related docs with shared entity count
        """
        if doc_id not in self.doc_entities:
            return []
        
        source_entities = self.doc_entities[doc_id]
        related = defaultdict(int)
        
        # Count each shared entity once per other document, even when it
        # appears in several chunks of that document
        for entity in source_entities:
            other_docs = {m['doc_id'] for m in self.entity_index[entity]}
            other_docs.discard(doc_id)
            for other_doc in other_docs:
                related[other_doc] += 1
        
        # Filter and sort
        related_docs = [(doc, count) for doc, count in related.items() if count >= min_shared]
        related_docs.sort(key=lambda x: x[1], reverse=True)
        
        return related_docs
    
    def get_stats(self) -> Dict:
        """Get cross-document statistics"""
        cross_doc = self.find_cross_document_entities(min_docs=2)
        
        return {
            "total_entities": len(self.entity_index),
            "cross_document_entities": len(cross_doc),
            "documents_indexed": len(self.doc_entities),
            "avg_entities_per_doc": np.mean([len(entities) for entities in self.doc_entities.values()])
        }

print("✓ CrossDocumentTracker class defined")

### Test Cross-Document Tracker

In [None]:
# Test cross-document tracker
print("CROSS-DOCUMENT REFERENCE TRACKING:")
print("=" * 80)

tracker = CrossDocumentTracker()

# Index all chunks
all_chunks = chunker.get_all_chunks()
tracker.index_all_chunks(all_chunks)

# Get statistics
stats = tracker.get_stats()
print("\nTRACKING STATISTICS:")
print("-" * 80)
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.1f}")
    else:
        print(f"  {key}: {value}")

# Find cross-document entities
print("\n" + "=" * 80)
print("CROSS-DOCUMENT ENTITIES (mentioned in 2+ documents):")
print("-" * 80)

cross_doc_entities = tracker.find_cross_document_entities(min_docs=2)

# Show some examples
for entity, mentions in list(cross_doc_entities.items())[:10]:
    doc_titles = set(m['doc_title'] for m in mentions)
    print(f"\n{entity}:")
    print(f"  Appears in {len(doc_titles)} documents:")
    for title in list(doc_titles)[:3]:
        print(f"    - {title}")

# Find documents related to application form
print("\n" + "=" * 80)
print("DOCUMENTS RELATED TO APPLICATION FORM:")
print("-" * 80)

related = tracker.find_related_documents('application_form', min_shared=3)
for doc_id, shared_count in related:
    doc_title = loan_documents[doc_id]['title']
    print(f"  {doc_title}: {shared_count} shared entities")

print("=" * 80)

## CHALLENGE 3: DOCUMENT RELEVANCE RANKING

**Time:** 10 minutes  
**Objective:** Rank documents by relevance to a query

### Background

Given a query, determine which documents are most relevant and should be loaded into context.
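
A minimal sketch of the idea: score each candidate by the fraction of query words it contains, then add candidates in score order, skipping any that would exceed a token budget. The candidate texts, the token counts, and the helper names `overlap_score` and `select_within_budget` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import string
from typing import Dict, List, Tuple


def words(text: str) -> set:
    """Lowercase words with surrounding punctuation removed."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return {w for w in stripped if w}


def overlap_score(query: str, content: str) -> float:
    """Fraction of distinct query words that also occur in the content."""
    q = words(query)
    return len(q & words(content)) / len(q) if q else 0.0


def select_within_budget(query: str, candidates: Dict[str, Tuple[str, int]],
                         max_tokens: int) -> List[str]:
    """Pick candidate IDs in descending score order while they fit the budget.

    candidates maps an ID to (text, token_count); token counts are supplied
    directly here instead of being computed with a tokenizer.
    """
    ranked = sorted(candidates,
                    key=lambda cid: overlap_score(query, candidates[cid][0]),
                    reverse=True)
    selected, used = [], 0
    for cid in ranked:
        tokens = candidates[cid][1]
        if used + tokens <= max_tokens:
            selected.append(cid)
            used += tokens
    return selected


# Hypothetical candidates: (text, token_count)
candidates = {
    "credit_report": ("Credit score 782, payment history perfect", 400),
    "appraisal": ("Final appraised property value $1,200,000", 300),
    "business_plan": ("Projected revenue growth and staffing plan", 500),
}

query = "What is the property value?"
print(select_within_budget(query, candidates, max_tokens=700))
```

This sketch fills the remaining budget even with zero-score candidates. A practical ranker would also drop anything below a minimum score before packing.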

In [None]:
# SOLUTION: Document Relevance Ranker

class DocumentRelevanceRanker:
    """
    Rank documents and chunks by relevance to query
    """
    
    def __init__(self, chunks: List[DocumentChunk]):
        """
        Initialize ranker with chunks
        
        Args:
            chunks (List[DocumentChunk]): All available chunks
        """
        self.chunks = chunks
        self.doc_types_keywords = {
            'application': ['applicant', 'request', 'loan amount', 'down payment', 'ltv'],
            'tax_return': ['income', 'revenue', 'deduction', 'profit', 'tax', 'receipts'],
            'bank_statement': ['balance', 'deposit', 'withdrawal', 'cash flow', 'monthly'],
            'business_plan': ['projection', 'growth', 'strategy', 'forecast', 'plan'],
            'appraisal': ['property', 'value', 'appraisal', 'comparable', 'market'],
            'credit_report': ['credit', 'score', 'payment', 'delinquency', 'trade']
        }
    
    def calculate_keyword_score(self, query: str, chunk: DocumentChunk) -> float:
        """
        Calculate keyword overlap score
        
        Args:
            query (str): User query
            chunk (DocumentChunk): Document chunk
        
        Returns:
            float: Keyword score (0-1)
        """
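        import string
        
        def normalize(text: str) -> set:
            # Strip surrounding punctuation so "revenue?" matches "Revenue:"
            stripped = (w.strip(string.punctuation) for w in text.lower().split())
            return {w for w in stripped if w}
        
        query_words = normalize(query)
        content_words = normalize(chunk.content)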
        
        if not query_words:
            return 0.0
        
        overlap = len(query_words & content_words)
        return overlap / len(query_words)
    
    def calculate_doc_type_score(self, query: str, chunk: DocumentChunk) -> float:
        """
        Calculate document type relevance
        
        Args:
            query (str): User query
            chunk (DocumentChunk): Document chunk
        
        Returns:
            float: Type relevance score (0-1)
        """
        query_lower = query.lower()
        doc_type = chunk.doc_type
        
        if doc_type not in self.doc_types_keywords:
            return 0.5  # Neutral score
        
        keywords = self.doc_types_keywords[doc_type]
        matches = sum(1 for kw in keywords if kw in query_lower)
        
        if matches > 0:
            return min(1.0, matches / 3)  # Max score at 3 keyword matches
        
        return 0.3  # Small baseline score
    
    def calculate_relevance_score(
        self,
        query: str,
        chunk: DocumentChunk,
        keyword_weight: float = 0.6,
        type_weight: float = 0.4
    ) -> float:
        """
        Calculate overall relevance score
        
        Args:
            query (str): User query
            chunk (DocumentChunk): Document chunk
            keyword_weight (float): Weight for keyword matching
            type_weight (float): Weight for document type
        
        Returns:
            float: Overall relevance score (0-1)
        """
        keyword_score = self.calculate_keyword_score(query, chunk)
        type_score = self.calculate_doc_type_score(query, chunk)
        
        return (keyword_weight * keyword_score) + (type_weight * type_score)
    
    def rank_chunks(
        self,
        query: str,
        top_k: int = 5,
        min_score: float = 0.1
    ) -> List[Tuple[DocumentChunk, float]]:
        """
        Rank chunks by relevance to query
        
        Args:
            query (str): User query
            top_k (int): Number of top chunks to return
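            min_score (float): Minimum relevance score threshold. With the
                default weights every chunk scores at least 0.12 from the
                document-type baseline, so thresholds at or below 0.12 filter nothing.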
        
        Returns:
            List[Tuple[DocumentChunk, float]]: Ranked chunks with scores
        """
        scored_chunks = []
        
        for chunk in self.chunks:
            score = self.calculate_relevance_score(query, chunk)
            if score >= min_score:
                scored_chunks.append((chunk, score))
        
        # Sort by score (descending)
        scored_chunks.sort(key=lambda x: x[1], reverse=True)
        
        return scored_chunks[:top_k]
    
    def get_relevant_documents(
        self,
        query: str,
        max_tokens: int = 3000
    ) -> List[DocumentChunk]:
        """
        Get relevant chunks within token budget
        
        Args:
            query (str): User query
            max_tokens (int): Maximum tokens to return
        
        Returns:
            List[DocumentChunk]: Relevant chunks within budget
        """
        # Get ranked chunks
        ranked = self.rank_chunks(query, top_k=20)  # Get more candidates
        
        # Select chunks within budget
        selected = []
        current_tokens = 0
        
        for chunk, _score in ranked:
            # Skip chunks that would overflow the budget; a smaller,
            # lower-ranked chunk may still fit in the remaining space
            if current_tokens + chunk.token_count > max_tokens:
                continue
            selected.append(chunk)
            current_tokens += chunk.token_count
        
        return selected

print("✓ DocumentRelevanceRanker class defined")
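
The budget-constrained selection in `get_relevant_documents` is a greedy pass over the ranked chunks. The standalone sketch below illustrates one design choice in that pass: whether to stop at the first chunk that does not fit, or to skip it and keep looking for smaller ones. It uses hypothetical `(name, token_count)` tuples instead of `DocumentChunk` objects, and the names and sizes are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical ranked candidates: (chunk name, token count), best match first.
ranked = [
    ("tax_2023", 900),
    ("bank_q1", 1500),
    ("credit_summary", 400),
    ("application", 600),
]


def select_within_budget(
    candidates: List[Tuple[str, int]],
    max_tokens: int,
    stop_at_first_miss: bool,
) -> List[str]:
    """Greedily pick candidates in rank order without exceeding max_tokens."""
    selected: List[str] = []
    used = 0
    for name, tokens in candidates:
        if used + tokens <= max_tokens:
            selected.append(name)
            used += tokens
        elif stop_at_first_miss:
            break  # give up as soon as one chunk does not fit
        # otherwise: skip this chunk and keep looking for smaller ones
    return selected


stop_result = select_within_budget(ranked, 2000, stop_at_first_miss=True)
skip_result = select_within_budget(ranked, 2000, stop_at_first_miss=False)
print("stop at first miss:", stop_result)
print("skip and continue: ", skip_result)
```

Stopping early keeps only the highest-ranked material, while skipping fills more of the budget at the cost of admitting lower-ranked chunks. Which is preferable depends on how much the ranking scores can be trusted.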

### Test Document Ranker

In [None]:
# Test document ranker
print("DOCUMENT RELEVANCE RANKING:")
print("=" * 80)

ranker = DocumentRelevanceRanker(all_chunks)

# Test queries
test_queries = [
    "What is the applicant's annual revenue?",
    "Does the applicant have good credit?",
    "What is the property value?",
    "Can they afford the loan payment?"
]

for query in test_queries:
    print(f"\n{'='*80}")
    print(f"QUERY: {query}")
    print("=" * 80)
    
    # Get top 3 relevant chunks
    ranked = ranker.rank_chunks(query, top_k=3, min_score=0.15)
    
    print(f"\nTop {len(ranked)} relevant chunks:")
    print("-" * 80)
    for i, (chunk, score) in enumerate(ranked, 1):
        print(f"\n{i}. {chunk.doc_title} (Score: {score:.3f})")
        print(f"   Chunk: {chunk.chunk_id}")
        print(f"   Type: {chunk.doc_type}")
        print(f"   Tokens: {chunk.token_count}")
        print(f"   Preview: {chunk.content[:150]}...")

# Test budget-constrained retrieval
print("\n" + "=" * 80)
print("BUDGET-CONSTRAINED RETRIEVAL:")
print("=" * 80)

query = "Analyze the company's financial health and ability to repay the loan."
max_tokens = 2000

relevant_chunks = ranker.get_relevant_documents(query, max_tokens=max_tokens)

print(f"\nQuery: {query}")
print(f"Token Budget: {max_tokens}")
print(f"Chunks Selected: {len(relevant_chunks)}")
print(f"Total Tokens: {sum(c.token_count for c in relevant_chunks)}")

print("\nSelected Documents:")
for chunk in relevant_chunks:
    print(f"  - {chunk.doc_title} ({chunk.token_count} tokens)")

print("=" * 80)

## CHALLENGE 4: SYNTHESIS FROM MULTIPLE SOURCES

**Time:** 10 minutes  
**Objective:** Answer questions requiring information from multiple documents

### Background

Many underwriting questions cannot be answered from a single document. They require synthesizing information from two or more documents, with each claim attributed to its source.

In [None]:
# SOLUTION: Multi-Document Synthesizer

class MultiDocumentSynthesizer:
    """
    Synthesize answers from multiple document sources
    """
    
    def __init__(self, ranker: DocumentRelevanceRanker):
        """
        Initialize synthesizer
        
        Args:
            ranker (DocumentRelevanceRanker): Document ranker
        """
        self.ranker = ranker
    
    def answer_query(
        self,
        query: str,
        max_context_tokens: int = 3000,
        include_sources: bool = True
    ) -> Dict:
        """
        Answer query using multiple documents
        
        Args:
            query (str): User query
            max_context_tokens (int): Maximum tokens for context
            include_sources (bool): Whether to cite sources
        
        Returns:
            Dict: Answer with metadata
        """
        # Get relevant chunks
        relevant_chunks = self.ranker.get_relevant_documents(query, max_context_tokens)
        
        if not relevant_chunks:
            return {
                "success": False,
                "error": "No relevant documents found",
                "answer": ""
            }
        
        # Build context from chunks
        context_parts = []
        sources_used = []
        
        for chunk in relevant_chunks:
            source_label = f"[{chunk.doc_type.upper()}: {chunk.doc_title}]"
            context_parts.append(f"{source_label}\n{chunk.content}")
            
            sources_used.append({
                'doc_id': chunk.doc_id,
                'doc_title': chunk.doc_title,
                'doc_type': chunk.doc_type,
                'chunk_id': chunk.chunk_id
            })
        
        context = "\n\n---\n\n".join(context_parts)
        
        # Build prompt
        if include_sources:
            prompt = f"""
Based on the following document excerpts, answer the question.
Cite your sources by referencing the document type and title.

DOCUMENTS:
{context}

QUESTION: {query}

Provide a clear answer with source citations.
"""
        else:
            prompt = f"""
Based on the following information, answer the question concisely.

INFORMATION:
{context}

QUESTION: {query}
"""
        
        # Get answer from LLM
        response = call_gpt4(
            prompt,
            "You are a loan underwriting analyst. Provide accurate answers based on document evidence."
        )
        
        if response['success']:
            return {
                "success": True,
                "answer": response['content'],
                "sources": sources_used,
                "chunks_used": len(relevant_chunks),
                "context_tokens": count_tokens(context),
                "total_tokens": response['total_tokens']
            }
        else:
            return {
                "success": False,
                "error": response.get('error'),
                "answer": ""
            }
    
    def verify_consistency(
        self,
        entity: str,
        expected_value: Optional[str] = None
    ) -> Dict:
        """
        Check if an entity has consistent values across documents
        
        Args:
            entity (str): Entity to check (e.g., "annual revenue")
            expected_value (Optional[str]): Optional expected value (not yet used by this implementation)
        
        Returns:
            Dict: Consistency check results
        """
        query = f"What is the {entity}?"
        
        # Get answer
        result = self.answer_query(query, max_context_tokens=2000)
        
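        if result['success']:
            # Distinct document titles, in retrieval order (several chunks
            # may come from the same document)
            sources = list(dict.fromkeys(s['doc_title'] for s in result['sources']))
            
            return {
                "entity": entity,
                "value_found": result['answer'],
                "sources": sources,
                "source_count": len(sources),
                # Proxy only: True when more than one document was retrieved;
                # the values themselves are not compared here
                "consistent": len(sources) > 1
            }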
        
        return {
            "entity": entity,
            "error": "Could not verify"
        }

print("✓ MultiDocumentSynthesizer class defined")

### Test Multi-Document Synthesizer

In [None]:
# Test multi-document synthesizer
print("MULTI-DOCUMENT SYNTHESIS:")
print("=" * 80)

synthesizer = MultiDocumentSynthesizer(ranker)

# Test queries requiring multiple documents
synthesis_queries = [
    "Does the applicant's stated annual revenue in the application match their tax returns?",
    "What is the debt service coverage ratio based on the business plan and loan request?",
    "Is the property value from the appraisal consistent with the loan-to-value ratio in the application?"
]

for query in synthesis_queries:
    print(f"\n{'='*80}")
    print(f"QUERY: {query}")
    print("=" * 80)
    
    result = synthesizer.answer_query(query, max_context_tokens=2500)
    
    if result['success']:
        print(f"\nANSWER:")
        print(result['answer'])
        
        print(f"\nSOURCES USED ({result['chunks_used']} chunks, {result['context_tokens']} tokens):")
        sources_seen = set()
        for source in result['sources']:
            source_key = f"{source['doc_type']}: {source['doc_title']}"
            if source_key not in sources_seen:
                print(f"  - {source_key}")
                sources_seen.add(source_key)
        
        print(f"\nTOKENS: {result['total_tokens']} total")
    else:
        print(f"\nERROR: {result['error']}")

### Test Consistency Verification

In [None]:
# Test consistency verification
print("CONSISTENCY VERIFICATION:")
print("=" * 80)

entities_to_verify = [
    "annual revenue for 2023",
    "total business debt",
    "credit score"
]

for entity in entities_to_verify:
    result = synthesizer.verify_consistency(entity)
    
    print(f"\n{entity.upper()}:")
    if 'error' not in result:
        print(f"  Value: {result['value_found'][:100]}...")
        print(f"  Found in {result['source_count']} source(s)")
        print(f"  Consistent: {'Yes' if result['consistent'] else 'Needs verification'}")
    else:
        print(f"  Error: {result['error']}")

print("\n" + "=" * 80)
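
The `consistent` flag returned by `verify_consistency` reflects how many distinct documents were retrieved, not whether the values in those documents agree. A value-level check could look like the sketch below. It assumes the per-document values have already been extracted as strings, uses the hypothetical helpers `parse_amount` and `values_agree`, and parses dollar amounts with a deliberately naive regular expression. The sample figures are made up for illustration and are not taken from the lab's documents.

```python
import re
from typing import Dict, Optional


def parse_amount(text: str) -> Optional[float]:
    """Return the first dollar amount in text (e.g. '$1,250,000'), or None."""
    match = re.search(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def values_agree(values_by_doc: Dict[str, str], tolerance: float = 0.01) -> Dict:
    """Compare parsed amounts across documents within a relative tolerance."""
    amounts = {doc: parse_amount(value) for doc, value in values_by_doc.items()}
    found = {doc: amt for doc, amt in amounts.items() if amt is not None}
    if len(found) < 2:
        # Fewer than two parsed values: nothing to compare against
        return {"consistent": None, "amounts": found}
    low, high = min(found.values()), max(found.values())
    return {"consistent": (high - low) <= tolerance * high, "amounts": found}


# Made-up figures for illustration only.
check = values_agree({
    "Application Form": "Annual revenue: $1,250,000",
    "Tax Return": "Gross receipts of $1,248,500",
    "Business Plan": "Revenue of $1,400,000 projected",
})
print("Consistent:", check["consistent"])
for doc, amount in check["amounts"].items():
    print(f"  {doc}: ${amount:,.0f}")
```

Returning `None` when fewer than two values are found keeps "not enough evidence" distinct from "values disagree", which matters when the result feeds an underwriting decision.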

## CHALLENGE 5: PRODUCTION MULTI-DOCUMENT MANAGER

**Time:** 10 minutes  
**Objective:** Build a complete production system

### Implementation

In [None]:
# SOLUTION: Production Multi-Document Manager

class ProductionDocumentManager:
    """
    Production-ready multi-document management system
    """
    
    def __init__(
        self,
        documents: Dict,
        chunk_size: int = 600,
        max_context_tokens: int = 3000
    ):
        """
        Initialize document manager
        
        Args:
            documents (Dict): Document collection
            chunk_size (int): Target chunk size in tokens
            max_context_tokens (int): Maximum context tokens
        """
        self.documents = documents
        self.max_context_tokens = max_context_tokens
        
        # Initialize subsystems
        self.chunker = DocumentChunker(chunk_size=chunk_size)
        self.chunked_docs = self.chunker.chunk_all_documents(documents)
        self.all_chunks = self.chunker.get_all_chunks()
        
        self.tracker = CrossDocumentTracker()
        self.tracker.index_all_chunks(self.all_chunks)
        
        self.ranker = DocumentRelevanceRanker(self.all_chunks)
        self.synthesizer = MultiDocumentSynthesizer(self.ranker)
        
        self.query_count = 0
        self.cache = {}  # Simple query cache
    
    def query(
        self,
        question: str,
        use_cache: bool = True,
        include_sources: bool = True
    ) -> Dict:
        """
        Answer question using document collection
        
        Args:
            question (str): User question
            use_cache (bool): Whether to use cached results
            include_sources (bool): Whether to include source citations
        
        Returns:
            Dict: Answer with metadata
        """
        self.query_count += 1
        
        # include_sources changes the prompt, so it must be part of the cache key
        cache_key = (question, include_sources)
        
        # Check cache
        if use_cache and cache_key in self.cache:
            cached = self.cache[cache_key].copy()
            cached['from_cache'] = True
            cached['query_number'] = self.query_count
            return cached
        
        # Get answer
        result = self.synthesizer.answer_query(
            question,
            max_context_tokens=self.max_context_tokens,
            include_sources=include_sources
        )
        
        # Add query metadata
        result['query_number'] = self.query_count
        result['from_cache'] = False
        
        # Cache a copy so later changes to the returned dict don't alter the cache
        if use_cache and result['success']:
            self.cache[cache_key] = result.copy()
        
        return result
    
    def get_document_summary(self, doc_id: str) -> str:
        """
        Get summary of a specific document
        
        Args:
            doc_id (str): Document identifier
        
        Returns:
            str: Document summary
        """
        if doc_id not in self.documents:
            return f"Document {doc_id} not found"
        
        doc = self.documents[doc_id]
        chunks = self.chunked_docs.get(doc_id, [])
        
        return f"""
Document: {doc['title']}
Type: {doc['type']}
Pages: {doc['pages']}
Date: {doc['date']}
Chunks: {len(chunks)}
Total Tokens: {sum(c.token_count for c in chunks)}
"""
    
    def list_documents(self) -> pd.DataFrame:
        """
        List all documents in collection
        
        Returns:
            pd.DataFrame: Document listing
        """
        doc_list = []
        
        for doc_id, doc in self.documents.items():
            chunks = self.chunked_docs.get(doc_id, [])
            doc_list.append({
                'doc_id': doc_id,
                'title': doc['title'],
                'type': doc['type'],
                'pages': doc['pages'],
                'date': doc['date'],
                'chunks': len(chunks),
                'tokens': sum(c.token_count for c in chunks)
            })
        
        return pd.DataFrame(doc_list)
    
    def find_related_docs(self, doc_id: str) -> List[str]:
        """
        Find documents related to given document
        
        Args:
            doc_id (str): Source document ID
        
        Returns:
            List[str]: Related document IDs
        """
        related = self.tracker.find_related_documents(doc_id, min_shared=3)
        # Use a distinct loop variable so the doc_id parameter is not shadowed
        return [related_id for related_id, _ in related]
    
    def get_stats(self) -> Dict:
        """Get comprehensive system statistics"""
        # Guard against an empty collection, where np.mean([]) would return nan
        token_counts = [c.token_count for c in self.all_chunks]
        return {
            "total_documents": len(self.documents),
            "total_chunks": len(self.all_chunks),
            "total_tokens": sum(token_counts),
            "queries_processed": self.query_count,
            "cache_size": len(self.cache),
            "avg_tokens_per_chunk": float(np.mean(token_counts)) if token_counts else 0.0,
            **self.tracker.get_stats()
        }

print("✓ ProductionDocumentManager class defined")
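
The query cache in `ProductionDocumentManager` matches on the question text exactly as typed, so trivially different phrasings (extra spaces, different capitalization) miss the cache. One possible refinement, sketched below, is to normalize the question before building the key. `make_cache_key` is a hypothetical helper, not part of the lab's classes, and the normalization rules (lowercasing, collapsing whitespace) are an assumption about what counts as "the same question".

```python
import hashlib


def make_cache_key(question: str, include_sources: bool = True) -> str:
    """Build a stable cache key from a normalized question and its options."""
    # Lowercase and collapse runs of whitespace so cosmetic differences match
    normalized = " ".join(question.lower().split())
    payload = f"{normalized}|sources={include_sources}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = make_cache_key("What is the requested loan amount?")
key_b = make_cache_key("  what is the   REQUESTED loan amount? ")
key_c = make_cache_key("What is the requested loan amount?", include_sources=False)

print("Same key for cosmetic variants:", key_a == key_b)
print("Different key when options differ:", key_a != key_c)
```

Hashing keeps keys a fixed length regardless of question size. Normalization this aggressive would be wrong for case-sensitive queries, so treat it as a starting point.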

### Test Production Document Manager

In [None]:
# Test production document manager
print("PRODUCTION MULTI-DOCUMENT MANAGER:")
print("=" * 80)

# Initialize manager
manager = ProductionDocumentManager(
    loan_documents,
    chunk_size=600,
    max_context_tokens=2500
)

# List documents
print("\nDOCUMENT COLLECTION:")
print("-" * 80)
print(manager.list_documents().to_string(index=False))

### Test Sample Queries

In [None]:
# Test queries
print("SAMPLE QUERIES:")
print("=" * 80)

test_questions = [
    "What is the requested loan amount and purpose?",
    "Is the company profitable? Show evidence from tax returns.",
    "What are the monthly cash deposits based on bank statements?"
]

for question in test_questions:
    print(f"\n{'='*80}")
    print(f"Q: {question}")
    print("=" * 80)
    
    result = manager.query(question)
    
    if result['success']:
        print(f"\nA: {result['answer'][:300]}...")
        print(f"\nMetadata:")
        print(f"  Sources: {result['chunks_used']} chunks")
        print(f"  Tokens: {result['total_tokens']}")
        print(f"  Query #: {result['query_number']}")
    else:
        print(f"Error: {result['error']}")

### System Statistics

In [None]:
# System statistics
print("\nSYSTEM STATISTICS:")
print("=" * 80)
stats = manager.get_stats()
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.2f}")
    else:
        print(f"  {key}: {value}")

print("=" * 80)

## LAB SUMMARY

### Token Efficiency Analysis

```
140-page document collection = ~70,000 tokens

Single Query ("What is the revenue?"):
├─ Naive (load all docs): 70,000 tokens
├─ Basic (load one doc): ~10,000 tokens (86% savings)
├─ Chunked (load relevant chunks): ~1,500 tokens (98% savings)
└─ Optimized (rank + select): ~800 tokens (99% savings)

Multi-Source Query ("Does revenue match tax returns?"):
├─ Naive: 70,000 tokens
├─ Load 2 full docs: ~20,000 tokens (71% savings)
└─ Optimized (relevant chunks): ~1,800 tokens (97% savings)
```
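
The savings percentages in the breakdown above follow from a single formula: `1 - tokens_used / baseline`. A minimal sketch that recomputes them, taking the approximate token counts from the breakdown as given (`savings_pct` is a hypothetical helper name):

```python
TOTAL_TOKENS = 70_000  # approximate size of the full 140-page collection


def savings_pct(tokens_used: int, baseline: int = TOTAL_TOKENS) -> float:
    """Percentage of tokens saved relative to loading the whole collection."""
    return 100 * (1 - tokens_used / baseline)


# Approximate per-strategy token counts, copied from the breakdown
strategies = {
    "Naive (load all docs)": 70_000,
    "Basic (load one doc)": 10_000,
    "Chunked (relevant chunks)": 1_500,
    "Optimized (rank + select)": 800,
    "Multi-source, 2 full docs": 20_000,
    "Multi-source, optimized": 1_800,
}

for name, tokens in strategies.items():
    print(f"{name:<28} {tokens:>7,} tokens  {savings_pct(tokens):5.1f}% saved")
```

Swapping in the `context_tokens` values reported by `answer_query` for your own queries gives measured savings in place of these estimates.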

### Production Checklist

- [x] Implement document chunking with metadata
- [x] Build cross-document entity tracker
- [x] Create relevance ranking system
- [x] Develop multi-source synthesis
- [x] Add query caching
- [x] Implement source citation
- [x] Test consistency verification
- [x] Monitor token usage per query
- [x] Handle missing/conflicting information
- [x] Document system architecture

