## Advanced RAG - Data Extraction Pipeline
### Extract PDFs to Markdown, Images, and Tables

**Learning Objectives:**
- Extract PDF content to markdown format
- Save all figures as PNG images
- Extract tables together with the 2 paragraphs that precede them
- Organize extracted content systematically

**The Complete Workflow:**

This is the **first step** in our Multi-Agent Deep RAG pipeline:

**Step 1: Data Extraction (This Notebook - 06-01)**
- Extract raw PDFs into structured formats
- Convert PDFs to markdown with page breaks
- Save full page images when large figures (wider and taller than 500 pixels) are detected
- Extract tables with 2 paragraphs of context for better understanding
- Organize by company and document

**Step 2: Data Ingestion (06-02)**
- Load extracted markdown files
- Split by page breaks for granular chunks
- Add metadata (company, doc_type, fiscal_year, fiscal_quarter, page)
- Create embeddings (dense + sparse for hybrid search)
- Store in Qdrant vector database with deduplication

**Step 3: Retrieval (07)**
- Dynamic filter extraction from user queries using LLM
- Hybrid search (dense semantic + sparse keyword matching)
- Reranking with cross-encoder for relevance
- Return top results with metadata for context

**Why This Approach?**
- **Markdown**: Preserves structure, easy to chunk, searchable
- **Page Images**: Full context for charts/diagrams with titles/headers
- **Tables with Context**: Include descriptions and captions
- **Metadata**: Enables precise filtering (company, year, quarter, doc type)

**Output Structure:**
- Markdown: `data/rag-data/rag-markdown/{company}/{filename}.md`
- Images: `data/rag-data/rag-images/{company}/{filename}/page_X.png`
- Tables: `data/rag-data/rag-tables/{company}/{filename}/table_X.md`

## Multi-Agent Deep RAG Pipeline - Complete Workflow

**Overview:**
This course demonstrates a production-ready RAG system with three distinct stages: Extraction, Ingestion, and Retrieval.

---

### **Notebook 06-01: Data Extraction (Current)**
**Purpose:** Convert raw PDFs into structured, searchable formats

**What We Do:**
1. **PDF → Markdown Conversion**
   - Extract full document text with preserved structure
   - Insert page breaks (`<!-- page break -->`) for chunking
   - Save as `.md` files organized by company

2. **Intelligent Image Extraction**
   - Detect large images (both width and height above 500 pixels) in PDFs
   - Save the **entire page** as PNG when a large image is found
   - Preserves titles, headers, and context around charts/diagrams
   - Avoids cropping issues with individual image extraction

3. **Table Extraction with Context**
   - Identify markdown tables in text
   - Extract 2 paragraphs **before** each table
   - Ensures table titles and descriptions are included
   - Save as separate `.md` files for targeted retrieval

**Output:**
```
data/rag-data/rag-markdown/{company}/{document}.md
data/rag-data/rag-images/{company}/{document}/page_5.png
data/rag-data/rag-tables/{company}/{document}/table_1.md
```

---

### **Notebook 06-02: Data Ingestion**
**Purpose:** Load extracted data into vector database for semantic search

**What We Do:**
1. **Load Markdown Files**
   - Read extracted markdown from 06-01
   - Split by page breaks for page-level chunks

2. **Metadata Enrichment**
   - Extract from filename: company, doc_type, fiscal_year, fiscal_quarter
   - Add page numbers, file hash (for deduplication)
   - Attach metadata to each chunk

3. **Hybrid Embeddings**
   - **Dense vectors**: Gemini embeddings (semantic understanding)
   - **Sparse vectors**: BM25 (keyword matching)
   - Store both in Qdrant for hybrid retrieval

4. **Deduplication**
   - SHA-256 hash-based duplicate detection
   - Skip already processed files

**Output:**
- Qdrant collection with 1000+ page chunks
- Each chunk has dense + sparse embeddings
- Rich metadata for filtering
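
The hash-based deduplication step can be sketched as follows. This is a minimal illustration, not the 06-02 implementation: `file_sha256`, `filter_new_files`, and the in-memory `seen_hashes` set are hypothetical names, and a real pipeline would more likely look the hash up in the stored chunk metadata.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def filter_new_files(paths, seen_hashes: set) -> list:
    """Keep only files whose content hash is not yet in seen_hashes.

    seen_hashes is a hypothetical stand-in for wherever processed
    hashes are recorded; it is updated in place.
    """
    new_files = []
    for path in paths:
        file_hash = file_sha256(path)
        if file_hash in seen_hashes:
            continue  # identical content was already ingested
        seen_hashes.add(file_hash)
        new_files.append((path, file_hash))
    return new_files
```

Hashing the file content rather than its name means a renamed copy of the same report is still skipped.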

---

### **Notebook 07: Retrieval**
**Purpose:** Intelligent search with dynamic filtering and reranking

**What We Do:**
1. **Dynamic Filter Extraction**
   - User query: "Amazon Q1 2024 revenue"
   - LLM extracts: `{company: "amazon", fiscal_year: 2024, fiscal_quarter: "q1"}`
   - Automatic company name mapping (AMZN → amazon)
   - Document type mapping (annual report → 10-k)

2. **Hybrid Search**
   - Search filtered subset (e.g., only Amazon Q1 2024)
   - Dense + sparse vectors combined with Reciprocal Rank Fusion (RRF) or Distribution-Based Score Fusion (DBSF)
   - Fetch top-k candidates (e.g., 10 results)

3. **Reranking**
   - Cross-encoder (BAAI/bge-reranker-base) scores relevance
   - Reorder candidates by the cross-encoder's query-passage relevance score
   - Return top-n (e.g., 5 best matches)

4. **Context-Rich Results**
   - Return page text with metadata
   - Include page images if available
   - Reference tables with context

**Why This Works:**
- **Filters reduce search space** → faster, more precise
- **Hybrid search** → catches both semantic + keyword matches
- **Reranking** → moves the most relevant candidates to the top
- **Metadata** → enables multi-dimensional filtering
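
The name and document-type mapping described above can be sketched as a small normalization step applied to whatever the LLM returns. The alias tables and the `normalize_filters` helper are hypothetical; only the AMZN → amazon and annual report → 10-k mappings come from this notebook, and the remaining entries are assumptions added for illustration.

```python
# Hypothetical alias tables; extend them to match the companies and
# document types that actually exist in the collection.
COMPANY_ALIASES = {"amzn": "amazon", "amazon.com": "amazon"}
DOC_TYPE_ALIASES = {"annual report": "10-k", "quarterly report": "10-q"}


def normalize_filters(raw: dict) -> dict:
    """Map LLM-extracted values onto canonical metadata values.

    Missing or empty fields are dropped so they do not become filters.
    """
    filters = {}

    company = raw.get("company")
    if company:
        key = str(company).strip().lower()
        filters["company"] = COMPANY_ALIASES.get(key, key)

    doc_type = raw.get("doc_type")
    if doc_type:
        key = str(doc_type).strip().lower()
        filters["doc_type"] = DOC_TYPE_ALIASES.get(key, key)

    year = raw.get("fiscal_year")
    if year is not None:
        filters["fiscal_year"] = int(year)  # stored as an integer at extraction time

    quarter = raw.get("fiscal_quarter")
    if quarter:
        filters["fiscal_quarter"] = str(quarter).strip().lower()

    return filters
```

Normalizing after the LLM call keeps the prompt simple and makes the mapping easy to test without a model in the loop.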

---

### **Key Design Decisions**

**Why Page-Level Chunking?**
- Financial docs have page-specific info (page numbers cited in discussions)
- Easier to reference and verify sources
- Natural boundary for context
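
Page-level chunking can be sketched as a split on the page-break placeholder written during extraction. The `split_into_pages` helper is hypothetical, and it assumes the placeholder appears exactly once between consecutive pages, so the position of a chunk gives its page number.

```python
PAGE_BREAK = "<!-- page break -->"


def split_into_pages(markdown_text: str, page_break: str = PAGE_BREAK) -> list:
    """Return one dict per non-empty page, numbered from 1 in document order."""
    chunks = []
    for page_number, page_text in enumerate(markdown_text.split(page_break), start=1):
        text = page_text.strip()
        if text:  # skip blank pages but keep the original page numbering
            chunks.append({"page": page_number, "text": text})
    return chunks
```

Blank pages are skipped without renumbering, so a chunk's `page` value still points at the right page of the source PDF under the assumption above.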

**Why Save Full Page Images?**
- Charts/tables often lack context when cropped individually
- Titles and captions usually appear above/beside figures
- Full page preserves visual layout

**Why Extract Tables Separately?**
- Tables are high-value structured data
- With context (2 paragraphs), they're self-contained
- Can be used for targeted table-based RAG

**Why Hybrid Search?**
- Dense: "What's the company's profitability?" (semantic)
- Sparse: "EBITDA margin Q3" (exact terms)
- Together: semantic matches and exact-term matches merged into one ranked list
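
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of page IDs. It uses the textbook formula with the conventional constant `k = 60`, which is an assumption here; the vector database's built-in fusion may differ in its details, and the page IDs are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["p12", "p7", "p3"]    # hypothetical semantic ranking
sparse_hits = ["p7", "p40", "p12"]  # hypothetical keyword ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])  # ['p7', 'p12', 'p40', 'p3']
```

Pages that appear in both lists rise to the top, even when neither retriever ranked them first.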

---

### Setup and Imports

In [None]:
from pathlib import Path
from typing import List, Tuple

from docling_core.types.doc import PictureItem
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

### Configuration

In [None]:
# Paths
DATA_DIR = "data/rag-data/rag-pdf"
OUTPUT_MD_DIR = "data/rag-data/rag-markdown"
OUTPUT_IMAGES_DIR = "data/rag-data/rag-images"
OUTPUT_TABLES_DIR = "data/rag-data/rag-tables"

### Extract Metadata from Filename

In [None]:
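def extract_metadata_from_filename(filename: str) -> dict:
    """
    Extract metadata from a space-separated filename.

    Expected pattern: "<company> <doc_type> [<quarter>] <year>.pdf"
    """
    name = Path(filename).stem  # Drops the extension regardless of case
    parts = name.split()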
    
    metadata = {}
    
    # Handle different filename patterns
    if len(parts) >= 3:
        metadata['company_name'] = parts[0]
        metadata['doc_type'] = parts[1]
        
        if len(parts) == 4:
            metadata['fiscal_quarter'] = parts[2]
            metadata['fiscal_year'] = int(parts[3])
        elif len(parts) == 3:
            metadata['fiscal_quarter'] = None
            metadata['fiscal_year'] = int(parts[2])
        else:
            # More than 4 parts: third token is the quarter, last token is the year
            metadata['fiscal_quarter'] = parts[2]
            metadata['fiscal_year'] = int(parts[-1])
    else:
        # Fallback for non-standard filenames
        metadata['company_name'] = parts[0] if parts else 'unknown'
        metadata['doc_type'] = parts[1] if len(parts) > 1 else 'unknown'
        metadata['fiscal_quarter'] = None
        metadata['fiscal_year'] = None
    
    return metadata

### Extract Tables with Context

In [None]:
def extract_tables_with_context(markdown_text: str) -> List[Tuple[str, str]]:
    """
    Extract tables with 2 paragraphs of context before each table.
    
    Returns:
        List of (context + table, table_name) tuples, where table_name
        is "table_1", "table_2", ... in document order.
    """
    # Scan line by line; markdown table rows start with |
    lines = markdown_text.split('\n')
    
    tables = []
    i = 0
    table_num = 1
    
    while i < len(lines):
        line = lines[i]
        
        # Detect table start (line with multiple |)
        if line.strip().startswith('|') and line.count('|') >= 2:
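            # Walk backwards to collect the 2 paragraphs before the table
            context_lines = []
            para_count = 0
            j = i - 1
            
            while j >= 0 and para_count < 2:
                if lines[j].strip():  # Non-empty line belongs to the current paragraph
                    context_lines.insert(0, lines[j])
                elif context_lines and context_lines[0]:  # First blank line after text closes a paragraph
                    para_count += 1
                    context_lines.insert(0, '')  # Keep the break; repeated blank lines count once
                j -= 1
            
            # Drop the leading blank left over from the last paragraph break
            while context_lines and not context_lines[0]:
                context_lines.pop(0)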
            
            # Extract full table
            table_lines = []
            while i < len(lines) and (lines[i].strip().startswith('|') or not lines[i].strip()):
                if lines[i].strip():  # Skip empty lines within table
                    table_lines.append(lines[i])
                i += 1
                if i < len(lines) and lines[i].strip() and not lines[i].strip().startswith('|'):
                    break
            
            # Combine context + table
            full_content = ('\n'.join(context_lines) + '\n\n' + '\n'.join(table_lines)).strip()
            tables.append((full_content, f"table_{table_num}"))
            table_num += 1
        else:
            i += 1
    
    return tables

### Extract PDF Content

In [None]:
def extract_pdf_content(pdf_path: Path):
    """Extract PDF to markdown, images, and tables."""
    print(f"Processing: {pdf_path.name}")
    
    # Get metadata and create directories
    metadata = extract_metadata_from_filename(pdf_path.name)
    company = metadata['company_name']
    filename_stem = pdf_path.stem
    
    md_dir = Path(OUTPUT_MD_DIR) / company
    images_dir = Path(OUTPUT_IMAGES_DIR) / company / filename_stem
    tables_dir = Path(OUTPUT_TABLES_DIR) / company / filename_stem
    
    for dir_path in [md_dir, images_dir, tables_dir]:
        dir_path.mkdir(parents=True, exist_ok=True)
    
    # Configure and convert
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.generate_page_images = True
    
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert(str(pdf_path))
    
    # Save markdown
    page_break = "<!-- page break -->"
    markdown_text = result.document.export_to_markdown(page_break_placeholder=page_break)
    (md_dir / f"{filename_stem}.md").write_text(markdown_text, encoding='utf-8')
    print("  ✓ Markdown saved")
    
    # Find pages with large images and save them
    pages_to_save = set()
    for element, _ in result.document.iterate_items():
        if isinstance(element, PictureItem):
            image = element.get_image(result.document)
            # get_image can return None when no image data is available
            if image is not None and image.size[0] > 500 and image.size[1] > 500:
                page_no = element.prov[0].page_no if element.prov else None
                if page_no:
                    pages_to_save.add(page_no)
    
    # Save page images
    for page_no in pages_to_save:
        page = result.document.pages[page_no]
        page.image.pil_image.save(images_dir / f"page_{page_no}.png", "PNG")
    
    if pages_to_save:
        print(f"  ✓ Saved {len(pages_to_save)} page images")
    
    # Save tables with context
    tables = extract_tables_with_context(markdown_text)
    for table_content, table_name in tables:
        (tables_dir / f"{table_name}.md").write_text(table_content, encoding='utf-8')
    
    if tables:
        print(f"  ✓ Saved {len(tables)} tables")
    
    print("  [DONE]\n")

### Process All PDFs

In [None]:
# Find all PDF files
data_path = Path(DATA_DIR)
pdf_files = sorted(data_path.rglob("*.pdf"))  # Sorted for a reproducible processing order
print(f"Found {len(pdf_files)} PDF files\n")

# Process each PDF
for pdf_path in pdf_files:
    extract_pdf_content(pdf_path)
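
The loop above stops at the first PDF that raises, for example a filename whose last token is not a year. One possible way to keep the batch going is sketched below; `process_all` is a hypothetical helper, not part of the notebook or of Docling.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def process_all(
    paths: Iterable[Path],
    process_one: Callable[[Path], None],
) -> List[Tuple[Path, str]]:
    """Run process_one on every path and collect failures instead of stopping."""
    failures = []
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # broad on purpose: every failure is recorded per file
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

In this notebook it would be called as `failures = process_all(pdf_files, extract_pdf_content)`, and the returned list can then be printed or retried after fixing the offending files.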