# Lab 1: Document Ingestion Pipeline

## Learning Objectives
By the end of this lab, you will:
- Evaluate document ingestion frameworks and their trade-offs
- Implement a dispatcher pattern for multi-format document handling
- Build a text cleaning pipeline for PDF and web content
- Create a standardized Document data model for your RAG system

## Setup
Run the cell below to install required libraries.

In [None]:
!uv pip install beautifulsoup4 requests chardet pandas -q


---
## Part 1: Framework Selection

Before writing any code, you face the most important decision in a RAG ingestion pipeline: **choosing the right document processing framework**. Each framework has distinct strengths depending on your document types, accuracy requirements, and deployment constraints.

Here are the four major frameworks we'll compare:

| Framework | Approach | Key Strength |
|-----------|----------|-------------|
| **Docling** | Local ML models for layout analysis | Best table extraction accuracy (97.9%) |
| **Unstructured** | Multi-format connectors + partitioning | Widest format support, enterprise connectors |
| **LlamaParse** | Cloud API with LLM-powered parsing | LLM-optimized markdown output |
| **PyMuPDF** | Direct PDF rendering library | Fastest raw text extraction |

Let's build a comparison matrix to see them side by side.

In [None]:
import pandas as pd

frameworks = {
    "Framework": ["Docling", "Unstructured", "LlamaParse", "PyMuPDF"],
    "Table Accuracy": ["97.9%", "~75%", "High", "Basic"],
    "Speed": ["~6s/page", "Variable", "API latency", "<1ms/page"],
    "Deployment": ["Local (GPU/CPU)", "Local / API", "Cloud API", "Local"],
    "License": ["MIT", "Apache 2.0", "Commercial", "AGPL"],
    "Best For": ["Research papers", "Enterprise multi-format", "LLM-optimized output", "High-throughput simple PDFs"]
}

df = pd.DataFrame(frameworks)
print(df.to_string(index=False))
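
Deployment constraints are often the first filter to apply. The sketch below is one hedged way to shortlist frameworks that can run locally, using plain Python. The `FRAMEWORKS` list and the `can_run_locally` helper are hypothetical names introduced for illustration; the values are copied from the comparison table above.

```python
# Deployment and license values copied from the comparison table above.
FRAMEWORKS = [
    {"name": "Docling", "deployment": "Local (GPU/CPU)", "license": "MIT"},
    {"name": "Unstructured", "deployment": "Local / API", "license": "Apache 2.0"},
    {"name": "LlamaParse", "deployment": "Cloud API", "license": "Commercial"},
    {"name": "PyMuPDF", "deployment": "Local", "license": "AGPL"},
]

def can_run_locally(framework: dict) -> bool:
    """Hypothetical helper: True if the deployment column mentions a local option."""
    return "local" in framework["deployment"].lower()

local_options = [fw["name"] for fw in FRAMEWORKS if can_run_locally(fw)]
print(local_options)  # ['Docling', 'Unstructured', 'PyMuPDF']
```

A filter like this only narrows the field. Accuracy and speed still have to be weighed for the remaining candidates.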


### Exercise 1.1: Choose Your Framework

Given the requirements below, which framework would you choose? Fill in your answers.

In [None]:
# TODO: For each scenario, assign the best framework name
# Choices: "Docling", "Unstructured", "LlamaParse", "PyMuPDF"

scenario_1 = ""  # Processing 50,000 simple PDF invoices as fast as possible
scenario_2 = ""  # Extracting tables from scientific papers with 2-column layouts
scenario_3 = ""  # Building a connector for emails, PowerPoints, and PDFs
scenario_4 = ""  # Data must stay on-premise, complex table extraction needed

# Validation
from tests import checks
checks.check_lab_1_1(scenario_1, scenario_2, scenario_3, scenario_4)


---
## Part 2: The Document Model

Every ingestion pipeline needs a **standardized data model** — a common structure that all extractors produce, regardless of the source format. This is critical because:

1. **Downstream consistency**: Chunkers, embedders, and retrievers all expect the same shape of data.
2. **Metadata tracking**: We need to know where content came from, when it was ingested, and what type it is.
3. **Testability**: A clear contract makes it easy to validate extractor output.

Let's define our `Document` dataclass.

In [None]:
from dataclasses import dataclass, field
from typing import Optional, Dict
from datetime import datetime, timezone

@dataclass
class Document:
    """Standardized representation of an ingested document."""
    content: str
    source: str
    title: Optional[str] = None
    doc_type: str = "unknown"
    author: Optional[str] = None
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    word_count: int = 0
    extra_metadata: Dict = field(default_factory=dict)

    def __post_init__(self):
        if self.content and self.word_count == 0:
            self.word_count = len(self.content.split())

    def to_dict(self) -> dict:
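        return {
            "content": self.content,
            "source": self.source,
            "metadata": {
                # Extras go first so the core fields below always win on key clashes
                **self.extra_metadata,
                "title": self.title,
                "type": self.doc_type,
                "author": self.author,
                "ingested_at": self.ingested_at,
                "word_count": self.word_count,
            }
        }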

# Test it
doc = Document(
    content="Transformers use self-attention mechanisms to process sequences in parallel.",
    source="arxiv:1706.03762",
    title="Attention Is All You Need",
    doc_type="pdf"
)
print(f"Title: {doc.title}")
print(f"Word count: {doc.word_count}")
print(f"Dict keys: {list(doc.to_dict().keys())}")
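
The testability point can be made concrete with a small contract check. This is a minimal sketch, assuming extractor output has already been serialized into the shape that `to_dict()` returns. The function name `find_contract_violations` and the two key tuples are hypothetical, not part of the lab's `tests` package.

```python
from typing import List

# Keys that every serialized document is assumed to carry.
REQUIRED_TOP_LEVEL = ("content", "source", "metadata")
REQUIRED_METADATA = ("title", "type", "author", "word_count")

def find_contract_violations(payload: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the payload passes."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in payload:
            problems.append(f"missing key: {key}")
    metadata = payload.get("metadata") or {}
    for key in REQUIRED_METADATA:
        if key not in metadata:
            problems.append(f"missing metadata key: {key}")
    content = payload.get("content")
    if not isinstance(content, str) or not content.strip():
        problems.append("content is empty")
    return problems

good_payload = {
    "content": "Transformers use self-attention.",
    "source": "arxiv:1706.03762",
    "metadata": {"title": "Attention Is All You Need", "type": "pdf",
                 "author": None, "word_count": 3},
}
bad_payload = {"content": "   ", "source": "notes.md", "metadata": {"title": None}}

print(find_contract_violations(good_payload))  # []
print(find_contract_violations(bad_payload))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with an extractor in a single test run.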


---
## Part 3: Building Extractors

Now let's build our first real extractor — a **web page extractor** using BeautifulSoup. The key challenges in web extraction are:

- **Boilerplate removal**: Navigation, footers, ads, and scripts are noise.
- **Content detection**: Finding the `<main>` or `<article>` tag where the real content lives.
- **Graceful degradation**: Not every page has clean semantic HTML.

In [None]:
import requests
from bs4 import BeautifulSoup

def extract_web_page(url: str) -> Document:
    """Extract clean text content from a web page."""
    resp = requests.get(
        url,
        headers={"User-Agent": "ResearchAssistant/1.0"},
        timeout=10
    )
    resp.raise_for_status()
    
    soup = BeautifulSoup(resp.text, "html.parser")
    
    # Remove boilerplate elements
    for tag in soup(["script", "style", "nav", "footer", "header", "form"]):
        tag.decompose()
    
    # Find main content area
    content = soup.find("main") or soup.find("article") or soup.body
    text = content.get_text(separator="\n", strip=True) if content else ""
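    # .string is None when <title> is empty or has nested tags, so use get_text()
    title = (soup.title.get_text(strip=True) if soup.title else "") or url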
    
    return Document(
        content=text,
        source=url,
        title=title,
        doc_type="web"
    )

# Test with a real page
try:
    doc = extract_web_page("https://en.wikipedia.org/wiki/Retrieval-augmented_generation")
    print(f"Title: {doc.title}")
    print(f"Word count: {doc.word_count}")
    print(f"Content preview: {doc.content[:300]}...")
except Exception as e:
    print(f"Could not fetch page (network issue): {e}")
    print("This is expected in offline environments.")
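
If the network is unavailable, the boilerplate-removal idea can still be tried on an inline HTML string. The sketch below deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it runs with no extra dependencies. It is not the extractor above: it only skips boilerplate subtrees and does not look for `<main>` or `<article>`. `TextCollector`, `html_to_text`, and `sample_html` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as boilerplate (same list as the extractor above).
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "form"}

class TextCollector(HTMLParser):
    """Collect visible text while ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def html_to_text(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample_html = """
<html><head><style>p {color: red}</style></head>
<body>
  <nav><a href="/">Home</a></nav>
  <main><h1>RAG basics</h1><p>Retrieval grounds the model.</p></main>
  <footer>Copyright notice</footer>
</body></html>
"""
offline_text = html_to_text(sample_html)
print(offline_text)
```

The navigation link, the stylesheet, and the footer are dropped, and only the heading and paragraph inside `<main>` remain.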


---
## Part 4: Text Cleaning Pipeline

Raw extracted text is almost never ready for embedding. Common issues include:

- **Encoding artifacts**: Smart quotes, em-dashes, and other Unicode oddities
- **Excessive whitespace**: Triple newlines, tab characters, trailing spaces
- **PDF artifacts**: Page numbers ("Page 3 of 10"), hyphenated line breaks

Let's build a cleaning pipeline that handles all of these.

In [None]:
import re
import unicodedata

def clean_text(text: str) -> str:
    """Master cleaning function for extracted text."""
    if not text:
        return ""
    
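    # 1. Fix common encoding artifacts (smart quotes, em-dashes)
    text = unicodedata.normalize("NFC", text)
    for fancy, plain in (("\u2018", "'"), ("\u2019", "'"),
                         ("\u201c", '"'), ("\u201d", '"'), ("\u2014", "-")):
        text = text.replace(fancy, plain)
    
    # 2. Remove common PDF artifacts first, so the gaps they leave
    #    are collapsed by the whitespace step below
    text = re.sub(r"Page \d+ of \d+", "", text)
    text = re.sub(r"-\n(\w)", r"\1", text)  # Fix hyphenated line breaks
    
    # 3. Normalize whitespace
    text = re.sub(r"[ \t]+", " ", text)      # Collapse spaces/tabs
    text = re.sub(r" *\n *", "\n", text)     # Drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # Max 2 consecutive newlines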
    
    return text.strip()

# Test
dirty_text = """This is   a   messy     text.


Page 3 of 10

It has    extra     spaces and    too many


newlines.    The end-
ing of words gets broken."""

cleaned = clean_text(dirty_text)
print("BEFORE:")
print(repr(dirty_text[:200]))
print("\nAFTER:")
print(repr(cleaned[:200]))
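
Rule order matters in a pipeline like this. The sketch below isolates two of the rules to show why: removing a page marker leaves a gap, so newline collapsing only helps if it runs afterwards. The helper names `collapse_newlines` and `drop_page_markers` are hypothetical and exist only for this illustration.

```python
import re

def collapse_newlines(text: str) -> str:
    """Limit runs of newlines to at most two."""
    return re.sub(r"\n{3,}", "\n\n", text)

def drop_page_markers(text: str) -> str:
    """Remove markers such as 'Page 3 of 10'."""
    return re.sub(r"Page \d+ of \d+", "", text)

sample = "First paragraph.\n\nPage 3 of 10\n\nSecond paragraph."

collapse_then_drop = drop_page_markers(collapse_newlines(sample))
drop_then_collapse = collapse_newlines(drop_page_markers(sample))

print(repr(collapse_then_drop))  # 'First paragraph.\n\n\n\nSecond paragraph.'
print(repr(drop_then_collapse))  # 'First paragraph.\n\nSecond paragraph.'
```

Running removals before whitespace normalization means a single normalization pass cleans up after every earlier rule.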


### Exercise 4.1: Extend the Cleaner

The base `clean_text` function handles the most common issues, but real-world data has more noise. Add three new cleaning rules to handle emails, URLs, and standalone page numbers.

In [None]:
def clean_text_extended(text: str) -> str:
    """Extended cleaning with additional rules."""
    # Start with base cleaning
    text = clean_text(text)
    
    # TODO: Add a regex to remove email addresses from the text
    # Hint: Use re.sub with a pattern like r'\S+@\S+\.\S+'
    
    # TODO: Add a regex to remove URLs (http:// or https://)
    # Hint: Use re.sub with a pattern like r'https?://\S+'
    
    # TODO: Remove lines that are just numbers (e.g., page numbers)
    # Hint: Use re.sub with r'^\d+$' and re.MULTILINE flag
    
    return text.strip()

# Test your implementation
test_text = """
Contact us at info@example.com for more details.
Visit https://www.example.com/research for the full paper.
42
This is the actual content we want to keep.
"""

result = clean_text_extended(test_text)
print(f"Result: {repr(result)}")

# Validation
from tests import checks
checks.check_lab_1_4(result)


---
## Part 5: The Dispatcher Pattern

In a production RAG system, documents arrive in many formats. Rather than writing `if/elif` chains everywhere, we use a **dispatcher** — a single entry point that routes to the correct extractor based on the source type.

This pattern provides:
- **Single entry point**: Callers don't need to know which extractor to use.
- **Easy extensibility**: Adding a new format means adding one branch.
- **Consistent output**: Every extractor returns the same `Document` object.

In [None]:
from pathlib import Path

def extract_document(source: str) -> Document:
    """Route to the correct extractor based on source type."""
    if source.startswith(("http://", "https://")):
        return extract_web_page(source)
    
    ext = Path(source).suffix.lower()
    
    if ext in (".md", ".txt"):
        text = Path(source).read_text(encoding="utf-8")
        return Document(content=text, source=source, doc_type="text")
    elif ext == ".pdf":
        # In production: use Docling here
        # from docling.document_converter import DocumentConverter
        raise NotImplementedError(
            "PDF extraction requires Docling. "
            "Install with: pip install docling"
        )
    else:
        raise ValueError(f"Unsupported format: {ext}")

# Demonstrate the dispatcher
print("Dispatcher routing:")
for source in ["https://example.com/paper", "report.pdf", "notes.md", "data.csv"]:
    # Don't actually call the extractors; just mirror the routing logic above
    if source.startswith(("http://", "https://")):
        target = "extract_web_page()"
    elif source.endswith(".pdf"):
        target = "extract_pdf() [Docling]"
    elif source.endswith((".md", ".txt")):
        target = "extract_text_file()"
    else:
        target = "ValueError: Unsupported format"
    print(f"  {source} -> {target}")
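
As the number of formats grows, one alternative to a longer `if/elif` chain is a registry that maps file suffixes to extractor functions. This is a hedged sketch of that variation, not the dispatcher used in this lab. `EXTRACTORS`, `register`, `read_plain_text`, and `resolve_extractor` are hypothetical names, and the extractor returns a plain string instead of a `Document` so the block stays self-contained.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a lowercase file suffix to the function that handles it.
EXTRACTORS: Dict[str, Callable[[str], str]] = {}

def register(*suffixes: str):
    """Decorator that registers an extractor for one or more suffixes."""
    def decorator(func):
        for suffix in suffixes:
            EXTRACTORS[suffix.lower()] = func
        return func
    return decorator

@register(".txt", ".md")
def read_plain_text(path: str) -> str:
    """Illustrative extractor: returns raw text rather than a Document."""
    return Path(path).read_text(encoding="utf-8")

def resolve_extractor(source: str) -> Callable[[str], str]:
    """Look up the extractor for a source path, or raise ValueError."""
    ext = Path(source).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}") from None

print(resolve_extractor("notes.MD").__name__)  # read_plain_text
```

With this design, supporting a new format means writing one decorated function, and the lookup code never changes.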


### Exercise 5.1: Add Markdown Support

Implement a dedicated markdown extractor that goes beyond plain text reading. It should:
1. Read the file content
2. Extract the title from the first `#` heading (if present)
3. Return a `Document` with `doc_type="markdown"`

In [None]:
# TODO: Implement a markdown extractor and integrate it into the dispatcher
def extract_markdown(file_path: str) -> Document:
    """Extract content from a markdown file with basic metadata."""
    # TODO: Read the file content
    # TODO: Extract the title from the first # heading (if present)
    # TODO: Return a Document object with doc_type="markdown"
    
    # Hint: Use Path(file_path).read_text(encoding="utf-8") and check lines starting with "# "
    pass

# Test with an inline markdown string (simulating a file)
import tempfile
import os

sample_md = """# My Research Notes

## Introduction
This is a sample markdown document about RAG systems.

## Key Findings
- Chunking strategy matters most
- Overlap prevents context loss
"""

# Write to temp file for testing
with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False, encoding='utf-8') as f:
    f.write(sample_md)
    temp_path = f.name

try:
    doc = extract_markdown(temp_path)
    from tests import checks
    checks.check_lab_1_5(doc)
finally:
    os.unlink(temp_path)


---
## Reflection Questions

1. **Trade-offs**: Why might you use PyMuPDF for a first pass and Docling as a fallback? What metric would you use to decide when to escalate?
2. **Data Model**: What additional metadata fields would you add to the Document class for a legal document processing system?
3. **Cleaning**: What risks come with aggressive text cleaning? Can you think of a case where removing "noise" actually removes important information?

*Your answers here:*

1. ...
2. ...
3. ...