# Week 1: PDF Summarizer - Day 1
## Complete Pipeline with LLM Integration

**Location**: `week1/examples/day1.ipynb`

This notebook demonstrates an end-to-end PDF summarization pipeline using:
- **PDF Parsing**: Extract text from PDF files
- **Token-Based Chunking**: Respect LLM context window limits
- **Multi-Stage Summarization**: Chunk-level + final synthesis
- **LLM Integration**: Local models via Ollama (TinyLlama)


## Section 1: Setup & Dependencies
Import the required libraries and the modules from the `pdf_summarizer` package.

In [3]:
import os
import sys
from pathlib import Path
from datetime import datetime

# For Jupyter notebooks - get parent directory
current_dir = Path.cwd()
week1_path = str(current_dir.parent)  # Go up one level from examples/
sys.path.insert(0, week1_path)

# Verify
print(f"Current directory: {current_dir}")
print(f"Adding to path: {week1_path}")
print(f"Path contents: {os.listdir(week1_path)}")

import tiktoken
from openai import OpenAI

from pdf_summarizer.pdf_parser import pdf_parser_func
from pdf_summarizer.model_constants import MODEL_CONFIGS, DEFAULT_MODEL, DEFAULT_SAFETY_FACTOR

print("✓ All dependencies imported successfully")

Current directory: /Users/nexonsamuel/Documents/data_engg_tutor/week1/examples
Adding to path: /Users/nexonsamuel/Documents/data_engg_tutor/week1
Path contents: ['.DS_Store', 'requirements.txt', 'tests', 'output', 'ollama.sh', 'README.md', 'examples', 'pdf_summarizer', 'data']
✓ All dependencies imported successfully
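
The path setup above assumes the notebook's working directory is `week1/examples/`. A slightly more defensive variant, sketched here on the assumption that the package folder is named `pdf_summarizer` as in the directory listing, walks up from the current directory until it finds that folder. The helper names `find_package_root` and `add_to_sys_path` are hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def find_package_root(start, package_name="pdf_summarizer"):
    """Return the first directory at or above `start` that contains `package_name`.

    Hypothetical helper for illustration; raises FileNotFoundError if no
    ancestor directory contains the package folder.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / package_name).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{package_name}' directory found above {start}")


def add_to_sys_path(root):
    """Insert `root` at the front of sys.path once, so re-running the cell adds no duplicates."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str
```

In the notebook this could be called as `add_to_sys_path(find_package_root(Path.cwd()))`, which would work from `week1/`, `week1/examples/`, or any deeper folder.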


## Section 2: Configuration & Constants
Centralized settings for LLM, chunking, and model parameters.

In [None]:
# ==================== LLM Configuration ====================
# Using configurations from pdf_summarizer.model_constants

LLM_CONFIG = MODEL_CONFIGS.get(DEFAULT_MODEL, MODEL_CONFIGS['tinyllama'])

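# ==================== Chunking Configuration ====================
# Token-based chunking parameters.
# Note: tiktoken encodings are OpenAI tokenizers, so token counts are only
# approximate for local models such as TinyLlama.

CHUNKING_CONFIG = {
    'max_tokens': 500,                    # Maximum tokens per chunk
    'encoding': LLM_CONFIG.get('encoding', 'cl100k_base')  # tiktoken encoding name from the model config
}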

# ==================== Prompt Templates ====================
# Define system prompts for different stages of processing

PROMPTS = {
    'chunk_summarizer': (
        'You are evaluating a Data Engineer candidate. '
        'Extract only: work experience, technical skills, cloud platforms, and key achievements. '
        'Be factual and concise.'
    ),
    'final_synthesizer': (
        'You are an AI Head of a data engineering team evaluating a candidate\'s resume '
        'for a Data Engineer position. Your job is to provide a crisp, professional evaluation '
        'based on the information provided. Assess their: 1) Relevant experience, 2) Technical skills, '
        '3) Cloud platform expertise, 4) Data pipeline/ETL knowledge, 5) Overall fit for the role. '
        'Be objective and constructive.'
    )
}

# ==================== Output Configuration ====================
OUTPUT_DIR = Path('../output')  # Save to week1/output/
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

print("Configuration loaded:")
print(f"  LLM Model: {LLM_CONFIG['name']}")
print(f"  Provider: {LLM_CONFIG['provider']}")
print(f"  Max Tokens per Chunk: {CHUNKING_CONFIG['max_tokens']}")
print(f"  Output Directory: {OUTPUT_DIR.absolute()}")
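
The contents of `pdf_summarizer/model_constants.py` are not shown in this notebook. Judging only from the keys the notebook reads (`name`, `provider`, `base_url`, `api_key`, `encoding`), a compatible module might look like the sketch below. Every value is an assumption: `http://localhost:11434/v1` is Ollama's default OpenAI-compatible endpoint, the API key is a placeholder that the OpenAI client requires but Ollama does not check, and `get_model_config` and `REQUIRED_KEYS` are hypothetical names.

```python
# Hypothetical stand-in for pdf_summarizer/model_constants.py; every value is an assumption
MODEL_CONFIGS = {
    "tinyllama": {
        "name": "tinyllama",
        "provider": "ollama",
        "base_url": "http://localhost:11434/v1",  # Ollama's default OpenAI-compatible endpoint
        "api_key": "ollama",                      # placeholder; required by the client, unused by Ollama
        "encoding": "cl100k_base",
    },
}
DEFAULT_MODEL = "tinyllama"
REQUIRED_KEYS = {"name", "provider", "base_url", "api_key"}


def get_model_config(model_name):
    """Look up a model config, falling back to DEFAULT_MODEL, and check required keys."""
    config = MODEL_CONFIGS.get(model_name, MODEL_CONFIGS[DEFAULT_MODEL])
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"Model config is missing keys: {sorted(missing)}")
    return config


print(get_model_config("tinyllama")["base_url"])
```

Validating the keys at lookup time would surface a misconfigured entry here rather than later, inside `initialize_llm_client`.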

## Section 3: Utility Functions
Helper functions for text processing and LLM interaction.

In [None]:
def chunk_text_by_tokens(text, max_tokens=500, encoding='cl100k_base'):
    """
    Split text into chunks based on token count.
    
    This function respects LLM context window limits by splitting large documents
    into smaller chunks that can be processed independently.
    
    Args:
        text (str): Input text to chunk
        max_tokens (int): Maximum tokens per chunk (default: 500)
        encoding (str): Tiktoken encoding name (default: 'cl100k_base')
    
    Returns:
        list: List of text chunks
    """
    # Initialize tokenizer
    encoder = tiktoken.get_encoding(encoding)
    
    # Tokenize the entire text
    tokens = encoder.encode(text)
    total_tokens = len(tokens)
    
    print(f"Total tokens in document: {total_tokens}")
    print(f"Max tokens per chunk: {max_tokens}")
    
    # Split tokens into chunks
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = encoder.decode(chunk_tokens)
        chunks.append(chunk_text)
    
    print(f"Split into {len(chunks)} chunks")
    return chunks


def initialize_llm_client(config=None):
    """
    Initialize OpenAI client configured for local Ollama server.
    
    Args:
        config (dict): LLM configuration dictionary
    
    Returns:
        OpenAI: Configured OpenAI client instance
    """
    if config is None:
        config = LLM_CONFIG
    
    client = OpenAI(
        base_url=config['base_url'],
        api_key=config['api_key']
    )
    print(f"LLM client initialized (Model: {config['name']})")
    return client


def call_llm(client, messages, model_name=None):
    """
    Send messages to LLM and get response.
    
    Args:
        client (OpenAI): Initialized OpenAI client
        messages (list): List of message dicts with 'role' and 'content'
        model_name (str): Model name to use
    
    Returns:
        str: Response text from LLM
    """
    if model_name is None:
        model_name = LLM_CONFIG['name']
    
    response = client.chat.completions.create(
        model=model_name,
        messages=messages
    )
    return response.choices[0].message.content


print("✓ Utility functions defined")
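
`chunk_text_by_tokens` cuts at fixed token offsets, so a sentence that straddles a boundary is split between two chunks. One common mitigation is to let consecutive chunks overlap by a few tokens. The sketch below works on a plain list of token IDs so that it runs without `tiktoken`; the function name `chunk_token_ids` and the `overlap` parameter are illustrative additions, not part of the notebook.

```python
def chunk_token_ids(tokens, max_tokens=500, overlap=0):
    """Split a list of token IDs into windows of at most `max_tokens`.

    Consecutive windows share `overlap` tokens. With overlap=0 this matches
    the fixed-stride slicing used by chunk_text_by_tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    stride = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end, so no tail-only duplicate is emitted
        if start + max_tokens >= len(tokens):
            break
    return chunks


token_ids = list(range(1200))  # stand-in for encoder.encode(text)
plain = chunk_token_ids(token_ids, max_tokens=500)
overlapped = chunk_token_ids(token_ids, max_tokens=500, overlap=50)
print([len(c) for c in plain])       # [500, 500, 200]
print([len(c) for c in overlapped])  # [500, 500, 300]
```

With a real encoder, each window would be turned back into text with `encoder.decode(window)`, as the notebook already does. Overlap raises the total number of tokens sent to the model, so it trades cost for continuity.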

## Section 4: Pipeline Functions
Core functions that orchestrate the summarization pipeline.

In [None]:
def summarize_pdf(pdf_path, max_tokens=None):
    """
    Complete PDF summarization pipeline:
    1. Parse PDF and extract text
    2. Chunk text by tokens
    3. Summarize each chunk independently
    4. Combine chunk summaries
    5. Create final synthesis
    
    Args:
        pdf_path (str): Path to PDF file
        max_tokens (int): Maximum tokens per chunk
    
    Returns:
        dict: Result dictionary containing final_summary, chunk_summaries, etc.
    """
    if max_tokens is None:
        max_tokens = CHUNKING_CONFIG['max_tokens']
    
    print("\n" + "="*80)
    print("STARTING PDF SUMMARIZATION PIPELINE")
    print("="*80)
    
    # ==================== Step 1: Initialize Client ====================
    print("\n[STEP 1] Initializing LLM client...")
    client = initialize_llm_client()
    
    # ==================== Step 2: Parse PDF ====================
    print("\n[STEP 2] Parsing PDF...")
    try:
        pdf_text = pdf_parser_func(pdf_path)
    except Exception as e:
        # Chain the original exception so the parser traceback is preserved
        raise ValueError(f"Failed to parse PDF: {e}") from e
    print("✓ PDF parsed successfully")
    print(f"  Document size: {len(pdf_text)} characters")
    
    # ==================== Step 3: Chunk Text ====================
    print("\n[STEP 3] Chunking text by tokens...")
    chunks = chunk_text_by_tokens(
        text=pdf_text,
        max_tokens=max_tokens,
        encoding=CHUNKING_CONFIG['encoding']
    )
    
    # ==================== Step 4: Summarize Each Chunk ====================
    print("\n[STEP 4] Summarizing each chunk...")
    chunk_summaries = []
    
    for idx, chunk in enumerate(chunks, 1):
        print(f"  Processing chunk {idx}/{len(chunks)}...", end=" ")
        
        # Prepare messages for this chunk
        messages = [
            {
                'role': 'system',
                'content': PROMPTS['chunk_summarizer']
            },
            {
                'role': 'user',
                'content': f"Summarize this text:\n\n{chunk}"
            }
        ]
        
        # Call LLM and collect summary
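        try:
            summary = call_llm(client, messages)
            chunk_summaries.append(summary)
            print("✓")
        except Exception as e:
            print(f"✗ Error: {e}")
            chunk_summaries.append(f"[Error processing chunk {idx}]")
    
    # Count placeholders so the status line reflects failed chunks
    failed = sum(
        1 for i, s in enumerate(chunk_summaries, 1)
        if s == f"[Error processing chunk {i}]"
    )
    print(f"✓ {len(chunks) - failed}/{len(chunks)} chunks summarized")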
    
    # ==================== Step 5: Combine Summaries ====================
    print("\n[STEP 5] Combining chunk summaries...")
    combined_summary = "\n\n".join([
        f"[CHUNK {idx}]\n{summary}"
        for idx, summary in enumerate(chunk_summaries, 1)
    ])
    print("✓ Summaries combined")
    
    # ==================== Step 6: Create Final Summary ====================
    print("\n[STEP 6] Creating final synthesis...")
    final_messages = [
        {
            'role': 'system',
            'content': PROMPTS['final_synthesizer']
        },
        {
            'role': 'user',
            'content': f"Based on this candidate's information, provide a concise hiring evaluation for a Data Engineer role:\n\n{combined_summary}"
        }
    ]
    
    try:
        final_summary = call_llm(client, final_messages)
        print("✓ Final summary created")
    except Exception as e:
        print(f"✗ Error creating final summary: {str(e)}")
        final_summary = combined_summary  # Fallback
    
    # ==================== Step 7: Prepare Results ====================
    print("\n[STEP 7] Preparing results...")
    results = {
        'final_summary': final_summary,
        'chunk_summaries': chunk_summaries,
        'num_chunks': len(chunks),
        'timestamp': datetime.now().isoformat(),
        'pdf_path': pdf_path,
        'model': LLM_CONFIG['name']
    }
    
    print("="*80)
    print("PIPELINE COMPLETE")
    print("="*80)
    
    return results


def save_results(results, output_filename=None):
    """
    Save summarization results to file with metadata.
    
    Args:
        results (dict): Results dictionary from summarize_pdf()
        output_filename (str): Optional custom filename (without extension)
    
    Returns:
        Path: Path to saved file
    """
    # Generate filename if not provided
    if output_filename is None:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_filename = f"summary_{timestamp}.txt"
    else:
        output_filename = f"{output_filename}.txt"
    
    output_path = OUTPUT_DIR / output_filename
    
    # Prepare output content with metadata
    output_content = f"""
{'='*80}
PDF SUMMARIZATION RESULT
{'='*80}

Generated: {results['timestamp']}
Model: {results['model']}
PDF: {results['pdf_path']}
Chunks Processed: {results['num_chunks']}

{'='*80}
FINAL SUMMARY
{'='*80}

{results['final_summary']}


{'='*80}
CHUNK SUMMARIES
{'='*80}

"""
    
    # Add chunk summaries
    for idx, chunk_summary in enumerate(results['chunk_summaries'], 1):
        output_content += f"\n[CHUNK {idx}]\n{chunk_summary}\n\n"
    
    # Write to file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(output_content)
    
    print(f"\n✓ Results saved to: {output_path}")
    return output_path


print("✓ Pipeline functions defined")
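
Steps 4 to 6 of `summarize_pdf` follow a map-reduce pattern: summarize each chunk, join the partial summaries, then summarize the joined text. Because the function calls the LLM directly, it cannot be exercised without a running Ollama server. The sketch below isolates the same control flow behind injected callables; `map_reduce_summarize` and `fake_llm` are hypothetical names used only for this illustration.

```python
def map_reduce_summarize(chunks, summarize_fn, synthesize_fn):
    """Summarize each chunk, then synthesize the labeled partial summaries.

    summarize_fn and synthesize_fn are any callables that take a string and
    return a string (for example, thin wrappers around call_llm).
    """
    partials = []
    failed = []
    for idx, chunk in enumerate(chunks, 1):
        try:
            partials.append(summarize_fn(chunk))
        except Exception:
            failed.append(idx)
            partials.append(f"[Error processing chunk {idx}]")

    combined = "\n\n".join(f"[CHUNK {i}]\n{s}" for i, s in enumerate(partials, 1))
    try:
        final = synthesize_fn(combined)
    except Exception:
        final = combined  # same fallback as the notebook
    return {"final_summary": final, "chunk_summaries": partials, "failed_chunks": failed}


def fake_llm(text):
    """Deterministic stand-in for an LLM call: returns the first three words."""
    if "boom" in text:
        raise RuntimeError("simulated LLM failure")
    return " ".join(text.split()[:3])


result = map_reduce_summarize(
    ["alpha beta gamma delta", "boom", "one two three four"],
    summarize_fn=fake_llm,
    synthesize_fn=lambda combined: f"{combined.count('[CHUNK')} chunks reviewed",
)
print(result["failed_chunks"])   # [2]
print(result["final_summary"])   # 3 chunks reviewed
```

Returning the indices of failed chunks alongside the summaries lets a caller decide whether a partial result is acceptable.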

## Section 5: Test with Sample Data
Run the complete pipeline on your resume.

In [None]:
# ==================== Execute Pipeline ====================
# Process your resume PDF

# Use relative path from week1/examples/ to week1/data/
PDF_FILE = '../data/Nexon_Samuel.pdf'

# Verify file exists
pdf_path = Path(PDF_FILE)
if not pdf_path.exists():
    print(f"⚠️  PDF file not found at {pdf_path.absolute()}")
    print("Place your PDF in: week1/data/")
else:
    # Run the complete pipeline
    results = summarize_pdf(PDF_FILE)
    
    # Display results
    print("\n" + "="*80)
    print("FINAL SUMMARY")
    print("="*80)
    print(results['final_summary'])

## Section 6: Save & Display Output
Persist results and show formatted output.

In [None]:
# Save the results (only if results exist)
if 'results' in locals():
    output_path = save_results(results, output_filename='nexon_samuel_summary')
    
    # Display summary statistics
    print("\n" + "="*80)
    print("EXECUTION SUMMARY")
    print("="*80)
    print(f"PDF File: {results['pdf_path']}")
    print(f"Chunks Processed: {results['num_chunks']}")
    print(f"Model Used: {results['model']}")
    print(f"Timestamp: {results['timestamp']}")
    print(f"Output File: {output_path}")
    print("="*80)
else:
    print("No results to save - run Section 5 first")

## Section 7: Format Output Display
Display the final summary one sentence at a time, separated by blank lines.

In [None]:
# Display the summary one sentence at a time
if 'results' in locals():
    print("\n" + "="*80)
    print("FORMATTED SUMMARY OUTPUT")
    print("="*80 + "\n")
    
    # Split on '. ' so each sentence is printed as its own block
    summary_text = results['final_summary']
    sentences = summary_text.split('. ')
    
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:
            # Re-add the period removed by split(), unless punctuation is already present
            if sentence[-1] not in '.!?':
                sentence += '.'
            print(sentence)
            print()  # Add blank line between sentences
else:
    print("No results to display - run Section 5 first")
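
Splitting on `'. '` discards the period it splits on and ignores sentences that end in `?` or `!`. A regular expression with a lookbehind keeps the terminating punctuation attached to each sentence. This is a rough sketch rather than a full sentence tokenizer: it still splits after abbreviations such as `e.g.`, and `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "Strong ETL background. Is cloud experience recent? Yes, on AWS and GCP."
for sentence in split_sentences(sample):
    print(sentence)
    print()  # blank line between sentences
```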

## Section 8: Advanced Options
Optional functions for batch processing and custom configurations.

In [None]:
def process_multiple_pdfs(pdf_directory):
    """
    Process multiple PDF files in a directory.
    
    Args:
        pdf_directory (str): Path to directory containing PDF files
    
    Returns:
        list: List of result dictionaries
    
    Example:
        results = process_multiple_pdfs('../data/')
    """
    pdf_dir = Path(pdf_directory)
    pdf_files = sorted(pdf_dir.glob('*.pdf'))  # Sort for a deterministic processing order
    
    print(f"\nFound {len(pdf_files)} PDF files in {pdf_directory}")
    
    all_results = []
    
    for idx, pdf_file in enumerate(pdf_files, 1):
        print(f"\n[{idx}/{len(pdf_files)}] Processing: {pdf_file.name}")
        
        try:
            results = summarize_pdf(str(pdf_file))
            
            # Save results
            output_name = pdf_file.stem + '_summary'
            save_results(results, output_filename=output_name)
            
            all_results.append(results)
            
        except Exception as e:
            print(f"✗ Error processing {pdf_file.name}: {str(e)}")
    
    print(f"\n✓ Processed {len(all_results)} PDFs successfully")
    return all_results


print("✓ Advanced functions defined")
print("\nExample usage:")
print("  # Process all PDFs in week1/data/ folder")
print("  all_results = process_multiple_pdfs('../data/')")
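
A `*.pdf` glob pattern is case-sensitive on case-sensitive file systems, so names such as `resume.PDF` are skipped. The sketch below shows one way to collect PDFs regardless of suffix case and in a stable order; `list_pdfs` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return PDF files in `directory`, sorted by name, matching the suffix case-insensitively."""
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder}")
    return sorted(
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )
```

Inside `process_multiple_pdfs`, the result of `list_pdfs(pdf_directory)` could stand in for the glob call without changing the rest of the loop.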

## Summary

### Pipeline Workflow
```
PDF File (week1/data/)
   ↓
Parse PDF → Extract Text
   ↓
Chunk by Tokens → Respect context window
   ↓
Summarize Each Chunk → LLM processing
   ↓
Combine Summaries → Merge results
   ↓
Final Synthesis → Create comprehensive summary
   ↓
Save Output → Generate timestamped file (week1/output/)
```

### Project Structure
```
week1/
├── examples/
│   └── day1.ipynb              # This notebook
├── pdf_summarizer/
│   ├── __init__.py
│   ├── pdf_parser.py           # ← Imported
│   ├── chunker.py
│   ├── model_constants.py      # ← Imported
│   └── summarizer.py
├── data/                       # Input PDFs
└── output/                     # Generated summaries
```

### How to Use
1. **Place PDFs** in the `week1/data/` folder
2. **Set the input** in Section 5 by changing `PDF_FILE`, or use `process_multiple_pdfs` from Section 8 for batch processing
3. **Run Sections 1-7** in order to execute the pipeline
4. **Check results** in the `week1/output/` folder

### Customization
- **Change model** in Section 2: `LLM_CONFIG = MODEL_CONFIGS['mistral']` (the key must exist in `model_constants.py`)
- **Adjust chunk size** in Section 2: `'max_tokens': 1000`
- **Modify prompts** in Section 2: `PROMPTS['chunk_summarizer'] = "...."`
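
When raising `max_tokens`, the chunk, the system prompt, and the generated summary all have to fit in the model's context window. The notebook imports `DEFAULT_SAFETY_FACTOR` but does not show how it is applied, so the formula below is only one plausible way to derive a chunk size. The function name, the 2048-token window, the prompt and response budgets, and the 0.8 factor are all assumptions for illustration.

```python
def max_chunk_tokens(context_window, prompt_tokens, response_tokens, safety_factor=0.8):
    """Estimate how many document tokens fit in one request.

    The safety factor leaves headroom for tokenizer mismatch, since tiktoken
    counts only approximate the tokenizer of a local model.
    """
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    available = context_window - prompt_tokens - response_tokens
    if available <= 0:
        raise ValueError("prompt and response budgets exceed the context window")
    return int(available * safety_factor)


# (2048 - 100 - 300) * 0.8 = 1318.4, truncated to 1318
budget = max_chunk_tokens(context_window=2048, prompt_tokens=100, response_tokens=300)
print(budget)
```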
