# Reading Large Financial Documents with LLMs

In this notebook, we'll explore techniques for efficiently processing large financial documents such as annual reports, SEC filings, and earnings call transcripts using Large Language Models. 

These documents often contain critical information for valuation, but their length and complexity make them challenging to analyze manually. We'll demonstrate how LLMs can help extract, summarize, and analyze this information effectively.

## 1. Introduction: The Challenge of Large Financial Documents

Financial documents present several unique challenges:

- **Length**: Annual reports and SEC filings often exceed 100 pages
- **Structure**: Mix of narrative text, tables, charts, and footnotes
- **Technical language**: Industry-specific terminology and financial jargon
- **Information dispersal**: Important details scattered throughout the document
- **Contextual understanding**: Requires understanding relationships between different sections

Traditional approaches like keyword searching or rule-based extraction often miss important context or nuances. LLMs, with their ability to understand natural language and maintain context, offer a promising solution.

## 2. Setting Up the Environment

First, let's install and import the required libraries:

In [None]:
# Install required libraries
!pip install -q openai pandas numpy tiktoken PyPDF2 requests beautifulsoup4 nltk transformers torch langchain matplotlib scikit-learn

In [None]:
# Import required libraries
import os
import pandas as pd
import numpy as np
import re
import json
import requests
import PyPDF2
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize
import tiktoken
import torch
from transformers import AutoTokenizer, AutoModel
from openai import OpenAI
import matplotlib.pyplot as plt
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# If you're using your own API key, uncomment and replace with your key:
# client = OpenAI(api_key="your-api-key-here")

print("Libraries imported successfully!")

## 3. Document Acquisition and Loading

Let's start by downloading Tesla's 2022 Annual Report (10-K) as our example document.

In [None]:
# Define a function to download and load the 10-K filing
def download_sec_filing(ticker, filing_type="10-K", year=2022):
    """
    Download an SEC filing for a given ticker, filing type and year
    """
    print(f"Downloading {filing_type} for {ticker} ({year})...")
    
    # This is a simplified example - in a real application, you'd use the SEC EDGAR API
    # For demonstration, we'll use a direct link to Tesla's 2022 10-K
    
    # In a production environment, you would use:
    # - SEC EDGAR API
    # - Financial data provider APIs (Bloomberg, Refinitiv, FactSet)
    # - Web scraping with proper rate limiting and user agents
    
    if ticker == "TSLA" and filing_type == "10-K" and year == 2022:
        url = "https://www.sec.gov/Archives/edgar/data/1318605/000095017023001409/tsla-20221231.htm"
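        # SEC.gov rejects requests that lack a descriptive User-Agent; replace the
        # placeholder contact below with your own name and email address
        headers = {"User-Agent": "Your Name your.email@example.com"}
        response = requests.get(url, headers=headers, timeout=30)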
        
        if response.status_code == 200:
            # Save the file locally
            with open("tesla_10k_2022.html", "w", encoding="utf-8") as f:
                f.write(response.text)
            
            # Parse HTML content
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Extract text (this is simplified - proper extraction would require more processing)
            text = soup.get_text()
            
            # Clean up the text
            text = re.sub(r'\s+', ' ', text)
            
            print(f"Successfully downloaded filing ({len(text)} characters)")
            return text
        else:
            print(f"Failed to download filing. Status code: {response.status_code}")
            return None
    else:
        print("For this demo, only Tesla's 2022 10-K is supported.")
        return None

# Download Tesla's 2022 10-K
tesla_10k = download_sec_filing("TSLA", "10-K", 2022)

# Display the first 1000 characters to verify the content
if tesla_10k:
    print("\nPreview of the document:")
    print(tesla_10k[:1000])
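
The hard-coded URL in `download_sec_filing` follows EDGAR's archive path layout: the company's CIK, then the accession number with its dashes removed, then the filename of the primary document. The sketch below shows one way to assemble such a URL. `build_edgar_url` is a hypothetical helper (not part of any library), and the dashed accession number is inferred from the URL used above rather than looked up.

```python
def build_edgar_url(cik, accession_number, filename):
    """Build the archive URL for one document inside an EDGAR filing."""
    # EDGAR archive paths use the CIK without leading zeros
    cik_part = str(int(cik))
    # Accession numbers are usually written with dashes (0000950170-23-001409),
    # but the archive path uses the same 18 digits with the dashes removed
    accession_part = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_part}/{accession_part}/{filename}"
    )


# Reproduce the Tesla URL used in this notebook
tesla_url = build_edgar_url("0001318605", "0000950170-23-001409", "tsla-20221231.htm")
print(tesla_url)
```

Finding the accession number and filename for an arbitrary ticker and year still requires a lookup step, which this sketch does not cover.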

## 4. Understanding Document Length Constraints

Every LLM has a fixed context window, measured in tokens, that caps how much text it can process in a single request. Let's examine how to handle large documents that exceed these limits.

In [None]:
# Function to count tokens in a text
def count_tokens(text, model="gpt-3.5-turbo"):
    """Count the number of tokens in a text string"""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Count tokens in the 10-K
if tesla_10k:
    token_count = count_tokens(tesla_10k)
    print(f"The Tesla 10-K contains approximately {token_count:,} tokens")
    
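    # Common model context windows (the count above uses OpenAI's tokenizer,
    # so the comparison is only approximate for the Claude and Llama models)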
    context_windows = {
        "gpt-3.5-turbo": 16_385,
        "gpt-4": 8_192,
        "gpt-4-turbo": 128_000,
        "claude-2": 100_000,
        "llama-2-70b": 4_096
    }
    
    # Create a dataframe to visualize token limits
    models_df = pd.DataFrame({
        'Model': list(context_windows.keys()),
        'Context Window (tokens)': list(context_windows.values()),
        'Can Process Full Document?': [
            'Yes' if window >= token_count else 'No' 
            for window in context_windows.values()
        ]
    })
    
    print("\nModel Context Windows vs. Document Size:")
    print(models_df)
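
A context window has to hold the prompt and the model's response as well as the document text, so the usable space per request is smaller than the headline figure. The sketch below estimates how many requests a document would need under that constraint. `estimate_chunk_count` is a hypothetical helper, and the default reserves of 500 prompt tokens and 1,000 response tokens are illustrative assumptions, not recommendations.

```python
import math


def estimate_chunk_count(doc_tokens, context_window,
                         prompt_tokens=500, response_tokens=1000,
                         overlap_tokens=0):
    """Estimate how many chunks are needed to cover a document."""
    # Space left for document text after reserving prompt and response tokens
    usable = context_window - prompt_tokens - response_tokens
    if usable <= overlap_tokens:
        raise ValueError("Context window too small for the requested reserves")
    if doc_tokens <= usable:
        return 1
    # Every chunk after the first repeats `overlap_tokens` of the previous one,
    # so it only contributes (usable - overlap_tokens) new tokens
    step = usable - overlap_tokens
    return 1 + math.ceil((doc_tokens - usable) / step)


# Example with made-up numbers: a 100,000-token document, a 16,385-token window
print(estimate_chunk_count(100_000, 16_385, overlap_tokens=200))
```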

## 5. Chunking Strategies for Large Documents

When documents exceed token limits, we need effective chunking strategies that preserve context and coherence.

In [None]:
# Define different chunking strategies

def chunk_by_characters(text, chunk_size=4000, overlap=200):
    """Split text into chunks of approximately equal character length with overlap"""
    chunks = []
    start = 0
    
    while start < len(text):
        # Find the end of the chunk
        end = min(start + chunk_size, len(text))
        
        # If we're not at the end of the text, try to break at a sentence boundary
        if end < len(text):
            # Look for the last period followed by a space within the overlap window
            last_period = text.rfind('. ', end - overlap, end)
            if last_period != -1:
                end = last_period + 2  # Include the period and space
                
        # Extract the chunk and add it to our list
        chunk = text[start:end]
        chunks.append(chunk)
        
        # Move the start pointer, ensuring overlap
        start = end - overlap if end < len(text) else end
    
    return chunks

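def chunk_by_sections(text, section_patterns=(r'Item \d+[A-C]?\.', r'PART [IVX]+\.')):
    """Split text by regulatory document sections (Items and Parts)"""
    chunks = []
    
    # Combine all patterns into a single regex
    # (the optional letter lets "Item 1A." and "Item 7A." match as well)
    combined_pattern = '|'.join(f'({pattern})' for pattern in section_patterns)
    
    # Find all section headers
    matches = list(re.finditer(combined_pattern, text))
    
    # Without any recognizable section headers, fall back to character-based chunks
    if not matches:
        return chunk_by_characters(text)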
    
    # Process each section
    for i, match in enumerate(matches):
        # Determine section start and end
        start = match.start()
        end = matches[i+1].start() if i < len(matches) - 1 else len(text)
        
        # Extract the section
        section = text[start:end]
        
        # Further chunk if section is too large
        if count_tokens(section) > 8000:  # Assuming a target of 8000 tokens per chunk
            subsections = chunk_by_characters(section, 4000, 200)
            chunks.extend(subsections)
        else:
            chunks.append(section)
    
    return chunks

def chunk_by_semantic_content(text, max_tokens=8000):
    """
    Split text by trying to preserve semantic units (paragraphs, sections)
    This is a simplified version - a production implementation would be more sophisticated
    """
    # Split by paragraphs (double newlines). download_sec_filing() collapses all
    # whitespace, so fall back to sentence boundaries when no paragraph breaks remain.
    paragraphs = re.split(r'\n\n+', text)
    separator = "\n\n"
    if len(paragraphs) == 1:
        paragraphs = re.split(r'(?<=[.!?])\s+', text)
        separator = " "
    
    chunks = []
    current_chunk = ""
    current_tokens = 0
    
    for para in paragraphs:
        para_tokens = count_tokens(para)
        
        # If adding this unit would exceed the limit, start a new chunk
        if current_tokens + para_tokens > max_tokens and current_chunk:
            chunks.append(current_chunk)
            current_chunk = para
            current_tokens = para_tokens
        else:
            # Add to the current chunk
            if current_chunk:
                current_chunk += separator + para
            else:
                current_chunk = para
            current_tokens += para_tokens
    
    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

# Apply different chunking strategies to our document
if tesla_10k:
    # Character-based chunking
    char_chunks = chunk_by_characters(tesla_10k)
    
    # Section-based chunking
    section_chunks = chunk_by_sections(tesla_10k)
    
    # Semantic chunking
    semantic_chunks = chunk_by_semantic_content(tesla_10k)
    
    # Compare the results
    chunking_results = pd.DataFrame({
        'Chunking Method': ['Character-based', 'Section-based', 'Semantic-based'],
        'Number of Chunks': [len(char_chunks), len(section_chunks), len(semantic_chunks)],
        'Avg. Tokens per Chunk': [
            int(sum(count_tokens(chunk) for chunk in char_chunks) / len(char_chunks)),
            int(sum(count_tokens(chunk) for chunk in section_chunks) / len(section_chunks)),
            int(sum(count_tokens(chunk) for chunk in semantic_chunks) / len(semantic_chunks))
        ],
        'Max Tokens in a Chunk': [
            max(count_tokens(chunk) for chunk in char_chunks),
            max(count_tokens(chunk) for chunk in section_chunks),
            max(count_tokens(chunk) for chunk in semantic_chunks)
        ]
    })
    
    print("Comparison of Chunking Strategies:")
    print(chunking_results)
    
    # Let's choose the section-based approach for further analysis
    chunks = section_chunks
    print(f"\nWe'll proceed with {len(chunks)} section-based chunks for our analysis.")
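
Character counts are only a proxy for tokens, so a chunk sized in characters can still overshoot a token budget. One alternative is to slide a fixed-size window over the token sequence itself. The sketch below does this for any sequence; `sliding_windows` is a hypothetical helper, and in this notebook you would presumably pass it the output of `encoding.encode(text)` and decode each window back to text.

```python
def sliding_windows(tokens, size, overlap):
    """Split a token sequence into windows of at most `size` items,
    where consecutive windows share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("Require size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a redundant tail window
        if start + size >= len(tokens):
            break
    return windows


# Toy example: integers stand in for token ids
for window in sliding_windows(list(range(10)), size=4, overlap=1):
    print(window)
```

Unlike `chunk_by_characters`, this version ignores sentence boundaries, so it trades readability of chunk edges for a hard guarantee on chunk size.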

## 6. Retrieval Augmented Generation (RAG) for Document Analysis

Now that we've chunked our document, we can use Retrieval Augmented Generation (RAG) to analyze specific aspects of the filing.

In [None]:
# First, let's create embeddings for each chunk to enable semantic search
# For production use, consider using:
# - OpenAI's text-embedding-ada-002
# - HuggingFace's all-MiniLM-L6-v2
# - Other embedding models specialized for financial text

# For this example, we'll compute embeddings locally with all-MiniLM-L6-v2
def create_embeddings(chunks):
    """
    Create one embedding vector per chunk with all-MiniLM-L6-v2.
    Only the first 512 tokens of each chunk are embedded (see the truncation below).
    """
    try:
        # Initialize a pre-trained model for embeddings
        tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        
        embeddings = []
        
        # Process chunks one at a time to keep memory use low
        for chunk in chunks:
            # Tokenize and get model outputs
            inputs = tokenizer(chunk, padding=True, truncation=True, 
                              return_tensors="pt", max_length=512)
            
            with torch.no_grad():
                outputs = model(**inputs)
            
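            # Mean-pool the token embeddings, weighted by the attention mask;
            # all-MiniLM-L6-v2 is trained with mean pooling rather than the CLS token
            mask = inputs["attention_mask"].unsqueeze(-1).float()
            summed = (outputs.last_hidden_state * mask).sum(dim=1)
            embedding = (summed / mask.sum(dim=1).clamp(min=1e-9)).numpy()
            embeddings.append(embedding[0])  # Add the embedding vector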
            
        print(f"Created embeddings for {len(chunks)} chunks successfully!")
        return np.array(embeddings)
    
    except Exception as e:
        print(f"Error creating embeddings: {e}")
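        # Fall back to a simple TF-IDF approach. Caveat: each call fits its own
        # vocabulary, so a query vector from a later call is not comparable with
        # chunk vectors from an earlier call; treat this path as a degraded mode.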
        from sklearn.feature_extraction.text import TfidfVectorizer
        
        vectorizer = TfidfVectorizer()
        embeddings = vectorizer.fit_transform(chunks).toarray()
        print(f"Created TF-IDF vectors for {len(chunks)} chunks as fallback.")
        return embeddings

# Create embeddings for our chunks
if 'chunks' in locals():
    chunk_embeddings = create_embeddings(chunks)
    print(f"Embedding shape: {chunk_embeddings.shape}")
    
    # Function to find the most relevant chunks for a query
    def find_relevant_chunks(query, chunks, embeddings, top_n=3):
        """Find the most relevant chunks for a given query"""
        # Create query embedding
        query_embedding = create_embeddings([query])[0]
        
        # Calculate similarity scores
        similarities = np.dot(embeddings, query_embedding) / (
            np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
        )
        
        # Get indices of top_n most similar chunks
        top_indices = np.argsort(similarities)[-top_n:][::-1]
        
        # Return the top chunks and their similarity scores
        return [(chunks[i], similarities[i]) for i in top_indices]
    
    # Example: Find chunks related to "risk factors"
    risk_chunks = find_relevant_chunks("risk factors", chunks, chunk_embeddings)
    
    print("\nTop chunks related to 'risk factors':")
    for i, (chunk, score) in enumerate(risk_chunks):
        print(f"\nChunk {i+1} (similarity score: {score:.2f}):")
        print(f"{chunk[:300]}...\n")
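
Retrieval is only half of RAG: the retrieved chunks still have to be packed into a prompt that fits the model's budget. The sketch below shows one way to do that with the `(chunk, score)` pairs that `find_relevant_chunks` returns. `build_context` is a hypothetical helper, and the character budget is an arbitrary stand-in for a token budget.

```python
def build_context(scored_chunks, max_chars=12000):
    """Join retrieved chunks, best first, until the character budget is used up."""
    parts = []
    used = 0
    for rank, (chunk, score) in enumerate(scored_chunks, start=1):
        # Label each excerpt so the model (and the reader) can refer back to it
        block = f"[Excerpt {rank} | similarity {score:.2f}]\n{chunk.strip()}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block) + 2  # account for the blank line between excerpts
    return "\n\n".join(parts)


# Toy example with made-up chunks and scores
demo_chunks = [("Text of the first excerpt.", 0.82), ("Text of the second excerpt.", 0.61)]
print(build_context(demo_chunks, max_chars=200))
```

The resulting string can be placed in the user message ahead of the question, in the same way the extraction cells pass a single chunk.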

## 7. Extracting Structured Information from Financial Documents

Now let's use an LLM to extract specific structured information from our document chunks.

In [None]:
# Function to extract structured information using an LLM
def extract_information(chunk, extraction_prompt, model="gpt-3.5-turbo"):
    """Extract structured information from a document chunk using an LLM"""
    
    # Create a system prompt for the extraction task
    system_prompt = """You are a financial analyst specializing in SEC filings analysis.
    Extract the requested information from the provided text from an SEC filing.
    Provide your response in JSON format according to the schema specified.
    If information is not found, indicate with "Not found" or appropriate null values."""
    
    # Check if we're exceeding token limits
    combined_text = system_prompt + extraction_prompt + chunk
    token_count = count_tokens(combined_text)
    
    # If too large, truncate the chunk
    if token_count > 15000:  # Leave room for response
        print(f"Warning: Truncating chunk to fit within token limit (current: {token_count})")
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(chunk)
        # Truncate to approximately 10000 tokens
        truncated_tokens = tokens[:10000]
        chunk = encoding.decode(truncated_tokens)
        print(f"New token count: {count_tokens(system_prompt + extraction_prompt + chunk)}")
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{extraction_prompt}\n\nTEXT FROM SEC FILING:\n{chunk}"}
            ],
            temperature=0.0,  # Low temperature for factual extraction
            response_format={"type": "json_object"}  # Request JSON output
        )
        
        # Extract and parse the JSON response
        result = json.loads(response.choices[0].message.content)
        return result
    
    except Exception as e:
        print(f"Error during information extraction: {e}")
        return {"error": str(e)}

# Example: Extract risk factors from relevant chunks
if 'risk_chunks' in locals():
    # Extraction prompt for risk factors
    risk_factors_prompt = """
    Extract the top risk factors mentioned in the filing text.
    For each risk factor:
    1. Identify the title or main category
    2. Provide a brief summary
    3. Note any quantitative impacts mentioned (if any)
    4. Assess the severity (High, Medium, Low) based on the language used
    
    Return the information in the following JSON format:
    {
        "risk_factors": [
            {
                "title": "Risk factor title/category",
                "summary": "Brief summary of the risk",
                "quantitative_impact": "Any numbers/percentages mentioned",
                "severity": "High/Medium/Low assessment"
            }
        ]
    }
    
    Limit your response to the top 5 most significant risk factors.
    """
    
    # Extract risk factors from the most relevant chunk
    most_relevant_chunk = risk_chunks[0][0]  # First chunk from our risk_chunks list
    risk_factors = extract_information(most_relevant_chunk, risk_factors_prompt)
    
    # Display the extracted risk factors
    print("\nExtracted Risk Factors:")
    if "risk_factors" in risk_factors:
        for i, factor in enumerate(risk_factors["risk_factors"]):
            print(f"\n{i+1}. {factor['title']}")
            print(f"   Summary: {factor['summary']}")
            print(f"   Quantitative Impact: {factor['quantitative_impact']}")
            print(f"   Severity: {factor['severity']}")
    else:
        print("No risk factors were extracted or there was an error.")
        print(risk_factors)
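
JSON mode guarantees syntactically valid JSON, but not that the object matches the schema requested in the prompt. A light validation pass before using the result can catch missing keys or unexpected severity labels. The sketch below checks the risk-factor schema used above; `validate_risk_factors` is a hypothetical helper, and the sample payload is made up for illustration.

```python
REQUIRED_KEYS = ("title", "summary", "quantitative_impact", "severity")
ALLOWED_SEVERITIES = {"High", "Medium", "Low"}


def validate_risk_factors(payload):
    """Return (valid_items, problems) for an extraction result."""
    items = payload.get("risk_factors") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return [], ["missing or non-list 'risk_factors'"]

    valid, problems = [], []
    for idx, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {idx}: not an object")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append(f"item {idx}: missing keys {missing}")
            continue
        # Normalize casing so "high" and "HIGH" are both accepted
        severity = str(item["severity"]).strip().capitalize()
        if severity not in ALLOWED_SEVERITIES:
            problems.append(f"item {idx}: unexpected severity {item['severity']!r}")
            continue
        valid.append({**item, "severity": severity})
    return valid, problems


# Made-up payload: the first item is usable, the second is incomplete
sample = {"risk_factors": [
    {"title": "Supply chain", "summary": "Example summary",
     "quantitative_impact": None, "severity": "high"},
    {"title": "Incomplete item"},
]}
valid_items, issues = validate_risk_factors(sample)
print(valid_items)
print(issues)
```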

## 8. Document Summarization Techniques

Different sections of financial documents require different summarization approaches. Let's demonstrate how to create tailored summaries.

In [None]:
# Function to generate different types of summaries
def generate_summary(chunk, summary_type, model="gpt-3.5-turbo"):
    """
    Generate different types of summaries from document chunks
    
    summary_type options:
    - 'executive': High-level overview for executives
    - 'detailed': Comprehensive summary with key details
    - 'comparative': Summary comparing to previous periods
    - 'implications': Focus on business implications
    """
    
    # Define prompts for different summary types
    prompts = {
        'executive': """
        Create a concise executive summary of this SEC filing excerpt in 3-5 bullet points.
        Focus on the most important information that a C-suite executive would need to know.
        Highlight material changes, strategic implications, and key risks or opportunities.
        """,
        
        'detailed': """
        Create a detailed yet concise summary of this SEC filing excerpt.
        Include:
        - Key financial metrics and changes
        - Important business developments
        - Significant risks and contingencies
        - Management's strategic focus areas
        - Regulatory issues or concerns
        
        Your summary should be comprehensive but focused on material information.
        """,
        
        'comparative': """
        Summarize this SEC filing excerpt with a focus on year-over-year or quarter-over-quarter changes.
        Highlight:
        - Growth or decline in key metrics
        - Changing trends in the business
        - Evolution of risks or opportunities
        - Shifts in management focus or strategy
        
        Specifically note what has changed compared to previous periods.
        """,
        
        'implications': """
        Analyze this SEC filing excerpt and summarize the business implications.
        Focus on:
        - What these developments mean for the company's future
        - Potential impacts on competitive positioning
        - Implications for stakeholders (investors, customers, employees)
        - Forward-looking indicators
        
        Your summary should focus on "what it means" rather than just "what happened."
        """
    }
    
    # Default to executive summary if an invalid type is provided
    if summary_type not in prompts:
        print(f"Warning: Unknown summary type '{summary_type}'. Using 'executive' instead.")
        summary_type = 'executive'
    
    system_prompt = """You are a financial analyst specializing in SEC filings analysis.
    Summarize the provided text from an SEC filing according to the specific requirements.
    Focus only on factual information present in the text. Do not add speculative content.
    """
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"{prompts[summary_type]}\n\nTEXT FROM SEC FILING:\n{chunk}"}
            ],
            temperature=0.2,  # Low temperature for factual summary
        )
        
        # Extract the summary
        summary = response.choices[0].message.content
        return summary
    
    except Exception as e:
        print(f"Error during summary generation: {e}")
        return f"Error generating summary: {str(e)}"

# Example: Generate different types of summaries for the MD&A section
if 'chunks' in locals():
    # Find the MD&A section (Management Discussion and Analysis)
    mda_chunks = find_relevant_chunks("Management's Discussion and Analysis", chunks, chunk_embeddings, top_n=1)
    
    if mda_chunks:
        mda_chunk = mda_chunks[0][0]
        
        # Generate different types of summaries
        summary_types = ['executive', 'detailed', 'implications']
        summaries = {}
        
        for summary_type in summary_types:
            print(f"\nGenerating {summary_type} summary...")
            summaries[summary_type] = generate_summary(mda_chunk[:5000], summary_type)  # Using first 5000 chars for brevity
        
        # Display the summaries
        for summary_type, summary in summaries.items():
            print(f"\n--- {summary_type.upper()} SUMMARY ---")
            print(summary)
    else:
        print("Could not find MD&A section in the document.")
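
The cell above summarizes only the first 5,000 characters of the MD&A chunk. A common alternative for longer sections is map-reduce summarization: summarize each piece, then summarize the summaries until one remains. The sketch below takes the summarizer as a callable so it can run without an API call; `map_reduce_summary` is a hypothetical helper, and in this notebook the callable could presumably wrap `generate_summary`.

```python
def map_reduce_summary(pieces, summarize, batch_size=5):
    """Summarize each piece, then repeatedly merge summaries in batches."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each round shrinks the list")
    if not pieces:
        return ""

    # Map step: one summary per piece
    summaries = [summarize(piece) for piece in pieces]

    # Reduce step: merge batches of summaries until a single one is left
    while len(summaries) > 1:
        summaries = [
            summarize("\n\n".join(summaries[i:i + batch_size]))
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]


# Stand-in summarizer for demonstration: keeps the first word of its input
def first_word(text):
    words = text.split()
    return words[0] if words else ""


print(map_reduce_summary(["alpha one", "beta two", "gamma three"], first_word, batch_size=2))
```

Each reduce round costs additional LLM calls, so the batch size is a trade-off between the number of requests and the length of each merged prompt.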

## 9. Extracting Financial Metrics and KPIs

Financial documents contain numerous metrics and KPIs that are crucial for valuation. Let's extract these in a structured format.

In [None]:
# Function to extract financial metrics
def extract_financial_metrics(chunks, model="gpt-3.5-turbo"):
    """Extract key financial metrics from document chunks"""
    
    # Find chunks likely to contain financial information
    financial_chunks = find_relevant_chunks("financial statements revenue profit margin", 
                                           chunks, chunk_embeddings, top_n=2)
    
    # Extract metrics from each chunk
    metrics_prompt = """
    Extract key financial metrics and KPIs from the provided SEC filing text.
    Include:
    1. Revenue figures (total and by segment if available)
    2. Profit metrics (gross profit, operating income, net income)
    3. Margin percentages (gross margin, operating margin, net margin)
    4. Growth rates (YoY or QoQ changes)
    5. Any other key performance indicators mentioned
    
    Return the information in the following JSON format:
    {
        "time_period": "The time period these metrics refer to",
        "revenue": {
            "total": "Total revenue figure with units",
            "segments": [
                {"name": "Segment name", "value": "Segment revenue with units"}
            ]
        },
        "profit": {
            "gross_profit": "Figure with units",
            "operating_income": "Figure with units",
            "net_income": "Figure with units"
        },
        "margins": {
            "gross_margin": "Percentage",
            "operating_margin": "Percentage",
            "net_margin": "Percentage"
        },
        "growth": {
            "revenue_growth": "Percentage",
            "income_growth": "Percentage"
        },
        "other_kpis": [
            {"name": "KPI name", "value": "KPI value", "description": "Brief description"}
        ]
    }
    
    If any information is not available, use null or appropriate placeholder.
    """
    
    all_metrics = []
    
    for i, (chunk, score) in enumerate(financial_chunks):
        print(f"\nExtracting metrics from financial chunk {i+1}...")
        chunk_metrics = extract_information(chunk, metrics_prompt, model=model)
        all_metrics.append(chunk_metrics)
    
    # For simplicity, keep only the metrics from the most relevant chunk.
    # In a real application, you would merge, deduplicate, and validate across chunks
    combined_metrics = all_metrics[0] if all_metrics else {}
    
    return combined_metrics

# Extract financial metrics
if 'chunks' in locals():
    financial_metrics = extract_financial_metrics(chunks)
    
    # Display the extracted metrics
    print("\nExtracted Financial Metrics:")
    print(json.dumps(financial_metrics, indent=2))
    
    # Visualize some key metrics if available
    if financial_metrics and 'revenue' in financial_metrics and 'segments' in financial_metrics['revenue']:
        # Extract segment revenues for visualization
        segments = financial_metrics['revenue']['segments']
        if segments and len(segments) > 1:
            segment_names = [seg['name'] for seg in segments if 'name' in seg and 'value' in seg]
            
            # Extract values and convert to numeric (removing non-numeric characters)
            segment_values = []
            for seg in segments:
                if 'value' in seg:
                    # Extract numeric value from string like "$10.2 billion"
                    # (str() guards against null values; commas are thousands separators)
                    value_str = str(seg['value'])
                    numeric_value = re.findall(r'\d+(?:\.\d+)?', value_str.replace(',', ''))
                    if numeric_value:
                        # Convert to float and scale based on units
                        value = float(numeric_value[0])
                        if 'billion' in value_str.lower():
                            value *= 1000
                        segment_values.append(value)
                    else:
                        segment_values.append(0)
            
            # Create a simple bar chart
            if segment_names and segment_values and len(segment_names) == len(segment_values):
                plt.figure(figsize=(10, 6))
                plt.bar(segment_names, segment_values)
                plt.title('Revenue by Segment (in millions USD)')
                plt.ylabel('Revenue (USD Millions)')
                plt.xticks(rotation=45, ha='right')
                plt.tight_layout()
                plt.show()
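
Extracted figures arrive as free-form strings, so converting them to a common unit before charting or comparing them helps avoid order-of-magnitude mistakes. The sketch below normalizes a few common formats to USD millions. `parse_usd_millions` is a hypothetical helper, the default unit of millions is an assumption, and the set of formats it handles is deliberately small.

```python
import re

# Multipliers that convert each unit word into millions of USD
_UNIT_TO_MILLIONS = {"thousand": 1e-3, "million": 1.0, "billion": 1e3, "trillion": 1e6}


def parse_usd_millions(value, default_unit="million"):
    """Convert a string such as "$10.2 billion" to a float in USD millions.

    Returns None when the input is not a string or contains no number.
    """
    if not isinstance(value, str):
        return None
    cleaned = value.replace(",", "").strip()
    number = re.search(r"\d+(?:\.\d+)?", cleaned)
    if number is None:
        return None
    unit = re.search(r"(thousand|million|billion|trillion)", cleaned, re.IGNORECASE)
    scale = _UNIT_TO_MILLIONS[unit.group(1).lower() if unit else default_unit]
    amount = float(number.group()) * scale
    # Filings often show negative amounts in parentheses, e.g. "(1.5) billion"
    is_negative = cleaned.lstrip("$ ").startswith(("-", "("))
    return -amount if is_negative else amount


for raw in ["$10.2 billion", "81,462 million", "(1.5) billion", "Not found"]:
    print(raw, "->", parse_usd_millions(raw))
```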

## 10. Identifying and Analyzing Forward-Looking Statements

Forward-looking statements in financial documents provide crucial insights for forecasting future performance.

In [None]:
# Function to identify and analyze forward-looking statements
def extract_forward_looking_statements(chunks, model="gpt-3.5-turbo"):
    """Extract and analyze forward-looking statements from document chunks"""
    
    # Find chunks likely to contain forward-looking statements
    fls_chunks = find_relevant_chunks("forward-looking statements future outlook guidance", 
                                     chunks, chunk_embeddings, top_n=3)
    
    # Extract forward-looking statements from each chunk
    fls_prompt = """
    Identify and analyze forward-looking statements in the provided SEC filing text.
    
    For each significant forward-looking statement:
    1. Extract the exact statement or a close paraphrase
    2. Categorize it (e.g., Revenue Projection, Product Development, Market Expansion)
    3. Note any timeframes mentioned
    4. Identify any quantitative targets or metrics
    5. Assess the confidence level based on the language used (High, Medium, Low)
    
    Return the information in the following JSON format:
    {
        "forward_looking_statements": [
            {
                "statement": "The extracted statement",
                "category": "Category of the statement",
                "timeframe": "Mentioned timeframe if any",
                "quantitative_targets": "Any specific numbers/percentages",
                "confidence_level": "High/Medium/Low",
                "analysis": "Brief analysis of implications"
            }
        ]
    }
    
    Limit your response to the top 5-7 most significant forward-looking statements.
    """
    
    all_statements = []
    
    for i, (chunk, score) in enumerate(fls_chunks):
        print(f"\nExtracting forward-looking statements from chunk {i+1}...")
        chunk_statements = extract_information(chunk, fls_prompt, model=model)
        
        if 'forward_looking_statements' in chunk_statements:
            all_statements.extend(chunk_statements['forward_looking_statements'])
    
    # Deduplicate statements (simplified approach)
    unique_statements = []
    statement_texts = set()
    
    for statement in all_statements:
        # Use statement text as a deduplication key
        if 'statement' in statement and statement['statement'] not in statement_texts:
            statement_texts.add(statement['statement'])
            unique_statements.append(statement)
    
    return {"forward_looking_statements": unique_statements}

# Extract forward-looking statements
if 'chunks' in locals():
    forward_looking = extract_forward_looking_statements(chunks)
    
    # Display the extracted statements
    print("\nExtracted Forward-Looking Statements:")
    if 'forward_looking_statements' in forward_looking:
        for i, statement in enumerate(forward_looking['forward_looking_statements']):
            print(f"\n{i+1}. {statement.get('statement', 'N/A')}")
            print(f"   Category: {statement.get('category', 'N/A')}")
            print(f"   Timeframe: {statement.get('timeframe', 'N/A')}")
            print(f"   Targets: {statement.get('quantitative_targets', 'N/A')}")
            print(f"   Confidence: {statement.get('confidence_level', 'N/A')}")
            print(f"   Analysis: {statement.get('analysis', 'N/A')}")
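
Exact-match deduplication misses statements that differ only in casing, punctuation, or a few words, which is likely when overlapping chunks are paraphrased by the model. The sketch below applies a fuzzy comparison using the standard library's `difflib`. `dedupe_statements` is a hypothetical helper, the 0.9 threshold is an arbitrary starting point, and the sample sentences are invented for illustration.

```python
import re
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe_statements(statements, threshold=0.9):
    """Keep the first occurrence of each statement, dropping near-duplicates."""
    kept, kept_norms = [], []
    for item in statements:
        text = item.get("statement")
        if not text:
            continue
        norm = _normalize(text)
        # A statement is a duplicate if it closely matches any kept statement
        is_duplicate = any(
            SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norms
        )
        if not is_duplicate:
            kept.append(item)
            kept_norms.append(norm)
    return kept


# Invented sample statements: the first two differ only in casing and punctuation
sample_statements = [
    {"statement": "We expect to increase vehicle production in 2023."},
    {"statement": "we expect to increase  vehicle production in 2023"},
    {"statement": "Capital expenditures may exceed prior-year levels."},
]
unique = dedupe_statements(sample_statements)
print(len(unique))
```

Pairwise comparison is quadratic in the number of statements, which is acceptable for the handful of statements extracted here.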

## 11. Best Practices for Financial Document Analysis with LLMs

Based on our exploration in this notebook, here are key best practices for using LLMs to analyze large financial documents:

1. **Effective Chunking**:
   - Use semantically meaningful chunks (sections, topics) when possible
   - Maintain appropriate overlap between chunks to preserve context
   - Consider document structure (items, parts, sections) for SEC filings

2. **Retrieval Strategy**:
   - Use embedding-based retrieval for finding relevant sections
   - Implement hybrid retrieval (combining semantic and keyword search)
   - Consider document metadata for improved retrieval

3. **Prompt Engineering**:
   - Use specific, detailed prompts for financial analysis tasks
   - Request structured output formats (JSON) for consistency
   - Include financial domain knowledge in system prompts

4. **Verification and Validation**:
   - Cross-check extracted information across multiple chunks
   - Validate numerical data against primary sources
   - Use multiple extraction approaches for critical information

5. **Context Management**:
   - Provide relevant historical context for time-sensitive analysis
   - Include document metadata (filing type, date, company) in prompts
   - Maintain awareness of reporting period boundaries

6. **Handling Numerical Data**:
   - Request specific units for financial figures
   - Validate consistency of numerical extractions
   - Check for order-of-magnitude errors in extracted values
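
As an illustration of the hybrid retrieval practice listed above, the sketch below blends a semantic similarity score with a simple keyword-overlap score. `keyword_score` and `hybrid_rank` are hypothetical helpers, the 0.7 weighting is an arbitrary assumption, and the semantic scores are passed in as plain numbers (in this notebook they would presumably come from the cosine similarities computed in `find_relevant_chunks`).

```python
import re


def keyword_score(query, chunk):
    """Fraction of distinct query terms that appear in the chunk."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not query_terms:
        return 0.0
    chunk_terms = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(query_terms & chunk_terms) / len(query_terms)


def hybrid_rank(query, chunks, semantic_scores, alpha=0.7, top_n=3):
    """Rank chunks by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = [
        (alpha * semantic + (1 - alpha) * keyword_score(query, chunk), index)
        for index, (chunk, semantic) in enumerate(zip(chunks, semantic_scores))
    ]
    # Highest combined score first; ties keep the earlier chunk first
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(chunks[index], score) for score, index in scored[:top_n]]


# Toy example: the keyword match lifts the first chunk above the second
demo_chunks = ["Item 1A. Risk Factors and related uncertainties", "Revenue grew during the year"]
ranked = hybrid_rank("risk factors", demo_chunks, semantic_scores=[0.50, 0.55])
print(ranked[0][0])
```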

## 12. Conclusion and Next Steps

In this notebook, we've explored techniques for efficiently processing large financial documents using LLMs. We've covered:

- Document acquisition and preprocessing
- Chunking strategies for handling large texts
- Retrieval-augmented generation for targeted analysis
- Extracting structured information from unstructured text
- Summarization techniques for different purposes
- Extraction of financial metrics and forward-looking statements

These capabilities provide a foundation for more advanced financial analysis tasks, such as:

- **Forecasting cash flows**: Using extracted metrics and forward-looking statements
- **Comparable company analysis**: Identifying peer companies mentioned in documents
- **Risk assessment**: Analyzing risk factors and uncertainties
- **Valuation model inputs**: Extracting key parameters for DCF and multiples-based valuation

In the next notebook, we'll explore how to use these extracted insights to generate financial forecasts for valuation purposes.