<a href="https://colab.research.google.com/github/GiovanniPasq/agentic-rag-for-dummies/blob/main/pdf_to_md.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF to Markdown Conversion for RAG Systems

## Introduction

Converting PDF documents to Markdown is often the **most critical step** in building an effective RAG (Retrieval-Augmented Generation) system. Markdown strikes an optimal balance: it preserves the structural hierarchy of the original document (headers, lists, tables, code, formulas) while remaining lightweight and directly consumable by LLMs without additional preprocessing.

**Why Markdown over plain text or JSON?**

While plain text extraction is fast and JSON provides structured data, **Markdown is superior for RAG systems** because it:

- **Preserves document structure**: Headers, subheaders, lists, and formatting hierarchy remain intact, helping LLMs understand the logical flow and relationships between sections
- **Maintains semantic meaning**: Bold, italic, code blocks, and other formatting cues provide context that aids comprehension
- **Handles complex elements naturally**: Tables, code blocks, mathematical formulas, and blockquotes are represented in a standardized, readable format
- **Human and machine readable**: Unlike JSON's rigid key-value structure or plain text's lack of hierarchy, Markdown balances readability for both humans and LLMs
- **Optimal for chunking**: Clear structural boundaries (headers, paragraphs) make intelligent document chunking straightforward for retrieval

Plain text loses all formatting and structure, making it difficult for retrieval systems to distinguish between titles, body text, and metadata. JSON can capture structure but often requires custom schemas and is verbose for text-heavy documents. Markdown naturally bridges this gap.
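To make the chunking point concrete, here is a minimal sketch of header-based splitting. The function name `split_by_headers` and the chunk dictionary layout are illustrative assumptions, not part of any library; a production pipeline would likely also cap chunk size and carry parent headers along.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per header section (hypothetical helper)."""
    sections = []
    current = {"header": None, "level": 0, "lines": []}
    in_fence = False

    def has_content(section: dict) -> bool:
        return section["header"] is not None or any(ln.strip() for ln in section["lines"])

    for line in markdown.splitlines():
        # Lines inside fenced code blocks are never treated as headers
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {
                "header": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        sections.append(current)

    return [
        {"header": s["header"], "level": s["level"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Each chunk keeps its header and nesting level, so a retriever can store them as metadata next to the chunk text.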

**Why is this conversion so important?**

The quality of your RAG system is fundamentally constrained by the quality of your extracted data. Poorly extracted or "dirty" data (text with broken formatting, missing tables, garbled formulas, or lost context) leads directly to inaccurate retrieval and hallucinated responses. Before implementing any extraction pipeline, you must ask yourself:

- **What type of content do my PDFs contain?** Plain text only? Images? Tables? Mathematical formulas?
- **How complex is the layout?** Single-column? Multi-column? Mixed layouts with sidebars?
- **Are there visual elements?** Diagrams, charts, photographs that carry semantic meaning?
- **Are the documents scanned?** Scanned PDFs require OCR capabilities regardless of layout complexity.

Based on these questions, you can categorize your PDFs into three complexity tiers:

---

## PDF Complexity Classification

**🟢 Simple PDFs (Category 1)**
- Text-only documents with standard layouts
- Digital PDFs with selectable text
- Examples: Reports, articles, plain books, documentation
- **Note:** If scanned, move to Category 2 (OCR required)

**🟡 Medium Complexity PDFs (Category 2)**
- Documents with tables and basic formatting
- Scanned documents (even if simple layout)
- PDFs with occasional images
- Multi-column layouts
- Examples: Academic papers, business reports, scanned books

**🔴 Complex PDFs (Category 3)**
- Image-heavy documents where visuals carry critical information
- Complex charts, diagrams, and infographics
- Mixed content types with spatial relationships
- Examples: Scientific papers with diagrams, medical reports, technical manuals, presentations
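
The three tiers above can be written down as a small decision function. This is only a sketch: the `PdfTraits` fields and the `classify_pdf` name are hypothetical, and the traits themselves still have to come from manual inspection or an automated check.

```python
from dataclasses import dataclass


@dataclass
class PdfTraits:
    """Answers to the screening questions (hypothetical structure)."""
    is_scanned: bool = False
    has_tables: bool = False
    multi_column: bool = False
    has_images: bool = False
    visuals_critical: bool = False  # charts/diagrams that carry meaning


def classify_pdf(traits: PdfTraits) -> int:
    """Return the complexity category (1, 2, or 3) for a PDF."""
    # Category 3: visuals carry critical information
    if traits.visuals_critical:
        return 3
    # Category 2: scanned pages, tables, multi-column layouts, occasional images
    if traits.is_scanned or traits.has_tables or traits.multi_column or traits.has_images:
        return 2
    # Category 1: digital, text-only, standard layout
    return 1
```

Checking the most demanding condition first means a scanned document with critical diagrams lands in Category 3 rather than stopping at Category 2.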

---

## Extraction Methods Overview

| Complexity Level | Recommended Tools | Key Capability |
|-----------------|-------------------|----------------|
| **Simple (Digital Text)** | PyMuPDF4LLM, PDFPlumber, MarkItDown | Fast text extraction |
| **Medium (Tables/Scanned)** | Docling, Marker, PaddleOCR | OCR + Table structure |
| **Complex (Image-Heavy)** | Vision-Language Models (VLMs) | Visual understanding |

---

## Category 1: Simple PDFs - Fast Text Extraction

**Use these tools when:** Your PDFs are digital (not scanned), contain primarily text with simple formatting, and have no critical visual elements.

**Available Tools:**
- **PyMuPDF4LLM** - https://github.com/pymupdf/PyMuPDF4LLM (Optimized for LLM consumption)
- **PDFPlumber** - https://github.com/jsvine/pdfplumber (Fine-grained control with table detection)
- **MarkItDown** - https://github.com/microsoft/markitdown (Zero-configuration, multiple formats)

### Example Implementation: PyMuPDF4LLM

In [None]:
!pip install pymupdf4llm

In [None]:
import pymupdf4llm
import pathlib

def convert_simple_pdfs_pymupdf4llm(pdf_folder: str, output_folder: str):
    """
    Convert simple text-based PDFs to Markdown using PyMuPDF4LLM.

    Args:
        pdf_folder: Path to folder containing PDF files
        output_folder: Path to output folder for Markdown files
    """
    pdf_path = pathlib.Path(pdf_folder)
    output_path = pathlib.Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)

    for pdf_file in pdf_path.glob("*.pdf"):
        try:
            # Extract text as Markdown
            md_text = pymupdf4llm.to_markdown(str(pdf_file))

            # Save to file
            output_file = output_path / f"{pdf_file.stem}.md"
            output_file.write_text(md_text, encoding='utf-8')

            print(f"✓ Converted: {pdf_file.name}")

        except Exception as e:
            print(f"✗ Error processing {pdf_file.name}: {e}")

    print(f"\nConversion complete! Output in '{output_folder}'")

# Example usage
convert_simple_pdfs_pymupdf4llm("./simple_pdfs", "./md_output")

---

## Category 2: Medium Complexity PDFs - OCR + Structure Recognition

**Use these tools when:** You have scanned documents, tables that need structure preservation, or multi-column layouts.

**Available Tools:**
- **Docling** - https://github.com/DS4SD/docling (OCR + table structure + optional VLM integration)
- **Marker** - https://github.com/VikParuchuri/marker (Fast conversion with excellent layout preservation)
- **PaddleOCR** - https://github.com/PaddlePaddle/PaddleOCR (Multilingual OCR, 80+ languages, optional VLM integration)

### Example Implementation: Docling
**Reference:** [Docling Documentation](https://docling-project.github.io/docling/)


In [None]:
!pip install docling

In [None]:
from pathlib import Path
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

def convert_medium_pdfs_docling(pdf_folder: str, output_folder: str):
    """
    Convert medium-complexity PDFs using Docling with OCR and table extraction.

    Args:
        pdf_folder: Path to folder containing PDF files
        output_folder: Path to output folder for Markdown files
    """
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_table_structure = True  # Extract table structures
    pipeline_options.do_ocr = True  # Enable OCR for scanned content
    pipeline_options.images_scale = 2.0  # Higher quality image extraction
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    output_path = Path(output_folder)
    output_path.mkdir(parents=True, exist_ok=True)

    pdf_files = list(Path(pdf_folder).glob("*.pdf"))

    for pdf_file in pdf_files:
        try:
            result = converter.convert(str(pdf_file))
            markdown_content = result.document.export_to_markdown()

            output_file = output_path / f"{pdf_file.stem}.md"
            output_file.write_text(markdown_content, encoding='utf-8')

            print(f"✓ Converted: {pdf_file.name}")

        except Exception as e:
            print(f"✗ Error processing {pdf_file.name}: {e}")

    print(f"\nConversion complete! Output in '{output_folder}'")

# Example usage
convert_medium_pdfs_docling("./medium_pdfs", "./md_output")

---

## Category 3: Complex PDFs - Vision-Language Models (VLMs)

**Use VLMs when:** Your PDFs contain critical visual information: charts, diagrams, complex layouts, or image-heavy content where visual elements must be accurately described and contextualized.

### The VLM Approach

The VLM approach works by converting each PDF page into a high-resolution image and sending it to a vision-language model with instructions to extract and convert the content into structured Markdown. This method leverages the model's visual understanding capabilities to:

1. **Recognize text** in any layout or orientation
2. **Interpret visual elements** like charts, diagrams, and images
3. **Understand spatial relationships** between document elements
4. **Preserve document structure** through proper Markdown formatting
5. **Generate descriptions** for non-text elements

**How it works:**
- Each PDF page is rendered as a high-resolution image (typically 300 DPI)
- The image is sent to the VLM along with a detailed system prompt
- The model analyzes the visual content and outputs structured Markdown
- Pages are processed sequentially and combined into a single document

This approach is particularly powerful because the model "sees" the document as a human would, understanding context, layout, and visual meaning that traditional parsers might miss.
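
As a quick sanity check on the 300 DPI rendering step: PDF page sizes are measured in points (72 per inch), so the zoom factor is `dpi / 72`. The helper below is an illustrative sketch (the function name is an assumption) that shows how large the rendered image becomes.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


print(rendered_size(612, 792))  # US Letter at 300 DPI -> (2550, 3300)
```

A Letter page at 300 DPI is roughly 8.4 megapixels, which is worth keeping in mind for upload size and processing time. Lower DPI values trade legibility of small text for speed.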

### Why Choose VLMs?

- **Best for:** Scientific papers, infographics, medical reports, technical diagrams
- **Pros:** Highest accuracy, excellent visual element description, preserves spatial relationships, handles any layout complexity
- **Cons:** Slower processing, requires API costs (or significant compute for local models)

### Cloud vs Local Deployment

**Cloud VLMs (Recommended for most users):**
- Google Gemini 2.0 Flash / 1.5 Pro (cost-effective: ~$0.075-0.30 per 1M tokens)
- OpenAI GPT-4o / GPT-4o-mini
- Anthropic Claude 3.5 Sonnet / Haiku

**Local VLMs (For privacy/offline requirements):**
- Performance depends heavily on model size and hardware
- Larger models = better accuracy (e.g., Qwen2-VL 72B > Qwen2-VL 7B)
- Examples: Qwen2-VL, LLaVA, BakLLaVA, CogVLM
- Requires significant GPU memory (24GB+ for 7B models, 80GB+ for 70B+ models)
- Can be run via Ollama, vLLM, or Hugging Face Transformers

**Cost Note:** Google Gemini is particularly cost-effective for PDF conversion tasks, offering excellent quality-to-price ratio with Gemini 2.0 Flash being the most economical option for production workloads.
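
A back-of-the-envelope estimate helps before committing to a large batch. The sketch below is an assumption-laden calculator: the per-page token counts are placeholders to be measured on your own documents, and the prices are taken from the range quoted above purely for illustration, so check the provider's current pricing.

```python
def estimate_vlm_cost(
    num_pages: int,
    input_tokens_per_page: int,
    output_tokens_per_page: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate total cost in USD for converting num_pages pages."""
    input_cost = num_pages * input_tokens_per_page * input_price_per_m / 1_000_000
    output_cost = num_pages * output_tokens_per_page * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 1,000 pages, 1,000 input and 500 output tokens per page
cost = estimate_vlm_cost(1000, 1000, 500, input_price_per_m=0.075, output_price_per_m=0.30)
print(f"Estimated cost: ${cost:.3f}")
```

Measuring real token usage on a handful of pages and plugging those numbers in gives a far more reliable figure than the placeholders used here.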

**Alternative Tool:**
- **Dolphin** - https://github.com/bytedance/Dolphin (Specialized VLM for PDF to Markdown conversion)

### Custom System Prompt for VLM Conversion

In [None]:
# Customize this system prompt based on your PDF type (e.g., academic, technical, legal).
# This template is a reasonable starting point for most documents; tweak the rules as needed for your use case.
SYSTEM_PROMPT = """You are an expert document parser specializing in converting PDF pages to markdown format.

**Your task:**
Extract ALL content from the provided page image and return it as clean, well-structured markdown.

**Text Extraction Rules:**
1. Preserve the EXACT text as written (including typos, formatting, special characters)
2. Maintain the logical reading order (top-to-bottom, left-to-right)
3. Preserve hierarchical structure using appropriate markdown headers (#, ##, ###)
4. Keep paragraph breaks and line spacing as they appear
5. Use markdown lists (-, *, 1.) for bullet points and numbered lists
6. Preserve text emphasis: **bold**, *italic*, `code`
7. For multi-column layouts, extract left column first, then right column

**Tables:**
- Convert all tables to markdown table format
- Preserve column alignment and structure
- Use | for columns and - for headers

**Mathematical Formulas:**
- Convert to LaTeX format: inline `$formula$`, display `$$formula$$`
- If LaTeX conversion is uncertain, describe the formula clearly

**Images, Diagrams, Charts:**
- Insert markdown image placeholder: `![Description](image)`
- Provide a detailed, informative description including:
  * Type of visual (photo, diagram, chart, graph, illustration)
  * Main subject or purpose
  * Key elements, labels, or data points
  * Colors, patterns, or notable visual features
  * Context or relationship to surrounding text
- For charts/graphs: mention axes, data trends, and key values
- For diagrams: describe components and their relationships

**Special Elements:**
- Footnotes: Use markdown footnote syntax `[^1]`
- Citations: Preserve as written
- Code blocks: Use triple backticks with language specification
- Quotes: Use `>` for blockquotes
- Links: Preserve as `[text](url)` if visible

**Quality Guidelines:**
- DO NOT add explanations, comments, or meta-information
- DO NOT skip or summarize content
- DO NOT invent or hallucinate text not present in the image
- DO NOT include "Here is the markdown..." or similar preambles
- Output ONLY the markdown content, nothing else

**Output Format:**
Return raw markdown with no wrapper, no code blocks, no explanations.
Start immediately with the page content.
""".strip()

### Implementation with Google Gemini

This implementation demonstrates the page-by-page conversion approach.

**Reference:** [Gemini API: Image Understanding](https://ai.google.dev/gemini-api/docs/image-understanding)



In [None]:
!pip install PyMuPDF google-genai

In [None]:
import os
import fitz  # PyMuPDF
from google import genai
from google.genai import types

def convert_complex_pdfs_vlm(pdf_path: str, api_key: str, model: str = "gemini-2.0-flash"):
    """
    Convert a single PDF using VLM (Vision-Language Model).

    Args:
        pdf_path: Path to PDF file
        api_key: Google Gemini API key
        model: Model name (gemini-2.0-flash, gemini-1.5-pro, etc.)

    Returns:
        Dictionary mapping page numbers to markdown content
    """
    client = genai.Client(api_key=api_key)
    pdf_document = fitz.open(pdf_path)
    markdown_pages = {}

    for page_num in range(pdf_document.page_count):
        try:
            page = pdf_document[page_num]

            # Convert page to high-resolution image (300 DPI)
            pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))
            img_data = pix.tobytes("png")

            # Prepare image for VLM
            image = types.Part.from_bytes(data=img_data, mime_type="image/png")

            # Generate markdown from image
            response = client.models.generate_content(
                config=types.GenerateContentConfig(
                    system_instruction=SYSTEM_PROMPT,
                    temperature=0.1  # Low temperature for consistent output
                ),
                model=model,
                contents=[
                    "Convert this PDF page to clean, structured markdown. "
                    "Extract all text, describe images, and preserve the layout.",
                    image
                ],
            )

            markdown_pages[page_num + 1] = response.text
            print(f"✓ Processed page {page_num + 1}/{pdf_document.page_count}")

        except Exception as e:
            print(f"✗ Error on page {page_num + 1}: {e}")
            markdown_pages[page_num + 1] = f"<!-- Error processing page: {e} -->"

    pdf_document.close()
    return markdown_pages


def batch_convert_complex_pdfs(pdf_folder: str, output_folder: str, api_key: str):
    """
    Batch convert all PDFs in a folder using VLM.

    Args:
        pdf_folder: Path to folder containing PDF files
        output_folder: Path to output folder for Markdown files
        api_key: Google Gemini API key
    """
    os.makedirs(output_folder, exist_ok=True)

    for filename in os.listdir(pdf_folder):
        if filename.lower().endswith('.pdf'):
            print(f"\n📄 Processing: {filename}")
            pdf_path = os.path.join(pdf_folder, filename)
            pdf_name = os.path.splitext(filename)[0]

            # Convert PDF
            markdown_pages = convert_complex_pdfs_vlm(pdf_path, api_key)

            # Combine pages into single markdown file
            combined_markdown = "\n\n---\n\n".join([
                f"# Page {page_num}\n\n{content}"
                for page_num, content in markdown_pages.items()
            ])

            # Save to file
            output_path = os.path.join(output_folder, f"{pdf_name}.md")
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(combined_markdown)

            print(f"✓ Saved: {output_path}")

    print(f"\n🎉 Batch conversion complete! Output in '{output_folder}'")

# Example usage
batch_convert_complex_pdfs("./complex_pdfs", "./md_output", "your-gemini-api-key")

---

## Comparison Matrix

| Feature | PyMuPDF4LLM | PDFPlumber | MarkItDown | Docling | Marker | PaddleOCR | VLM (Gemini) |
|---------|------------|------------|------------|---------|---------|-----------|--------------|
| **Speed** | ⚡⚡⚡ | ⚡⚡⚡ | ⚡⚡⚡ | ⚡⚡ | ⚡⚡ | ⚡⚡ | ⚡ |
| **Accuracy** | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Table Extraction** | ❌ | ✅ | ✅ | ✅ | ✅ | ⚠️ | ✅ |
| **Image Description** | ❌ | ❌ | ❌ | ⚠️ | ⚠️ | ❌ | ✅ |
| **OCR Support** | ❌ | ❌ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| **Complex Layouts** | ❌ | ❌ | ❌ | ⚠️ | ✅ | ⚠️ | ✅ |
| **Scanned PDFs** | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Multilingual** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Cost** | Free | Free | Free | Free | Free | Free | ~$0.075/1M tokens |
| **Best For** | Digital text | Text + tables | Quick conversion | Scanned + tables | Fast + layout | Asian languages | Visual content |

---

## Recommended Workflow
```
┌─────────────────────────────┐
│  Analyze PDFs               │
│  1. Is it scanned?          │
│  2. Are images critical?    │
│  3. Complex layout?         │
└──────────┬──────────────────┘
           │
           ▼
    ┌──────────────────┐
    │ Digital + Simple?│───Yes──► PyMuPDF4LLM / PDFPlumber / MarkItDown
    └──────┬───────────┘
           │No
           ▼
    ┌────────────────────────────────┐
    │ Scanned / Tables only          │
    │ (no critical visual content)?  │───Yes──► Docling / Marker / PaddleOCR
    └──────┬─────────────────────────┘
           │No
           ▼
    ┌──────────────────────────────┐
    │ Complex layouts, formulas,   │
    │ charts, diagrams,            │
    │ or visual content            │
    │ requiring interpretation?    │───Yes──► VLM (Gemini / Claude / Local)
    └──────────────────────────────┘
```

---

## Special Considerations for Scanned Documents

**Important:** Scanned PDFs always require OCR, regardless of layout complexity. Even if a scanned PDF has a simple layout, it must be processed with Category 2 tools (Docling, Marker, PaddleOCR) because the text is not digitally selectable.

**How to identify scanned PDFs:**
1. Try to select text in the PDF viewer - if you can't, it's scanned
2. Check file size - scanned PDFs are typically much larger
3. Look for image artifacts or slightly rotated text
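
The first check can be automated by measuring how many pages yield almost no extractable text. The sketch below assumes PyMuPDF is installed; the function names, the 50-character cutoff, and the 0.8 threshold are illustrative assumptions to tune on your own corpus.

```python
def scanned_page_ratio(page_texts: list[str], min_chars: int = 50) -> float:
    """Fraction of pages whose extracted text is shorter than min_chars."""
    if not page_texts:
        return 0.0
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts)


def looks_scanned(pdf_path: str, threshold: float = 0.8, min_chars: int = 50) -> bool:
    """Heuristic: treat a PDF as scanned if most pages have little or no text layer."""
    import fitz  # PyMuPDF; imported lazily so the pure helper above has no dependency

    with fitz.open(pdf_path) as doc:
        texts = [page.get_text() for page in doc]
    return scanned_page_ratio(texts, min_chars) >= threshold
```

Looking at every page rather than only the first avoids misclassifying documents that open with a cover image or a nearly blank title page.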

**Recommended tools by scenario:**
- **English scanned documents:** Docling or Marker
- **Multilingual scanned documents:** PaddleOCR (supports 80+ languages)
- **Low-quality scans:** VLM approach for best accuracy
- **High-volume scanned documents:** Marker (fastest processing)

---

## Best Practices

### 1. Always Test on Sample Documents
Before processing your entire corpus, test 3-5 representative PDFs from each category to validate extraction quality.
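
One simple way to pick the test set is a reproducible random sample per folder. This is a minimal sketch; `sample_pdfs` is a hypothetical helper, and random sampling should be complemented by hand-picking any known edge cases.

```python
import random
from pathlib import Path


def sample_pdfs(pdf_folder: str, k: int = 5, seed: int = 0) -> list[Path]:
    """Return up to k PDFs from a folder, chosen reproducibly."""
    pdfs = sorted(Path(pdf_folder).glob("*.pdf"))  # sort so the seed gives stable picks
    rng = random.Random(seed)
    return rng.sample(pdfs, min(k, len(pdfs)))
```

Fixing the seed means the same files are re-tested after each change to the pipeline, which makes before/after comparisons meaningful.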

### 2. Implement Quality Checks

In [None]:
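import re
from pathlib import Path

def validate_markdown_quality(md_file: Path) -> dict:
    """Compute rough heuristics for Markdown conversion quality."""
    content = md_file.read_text(encoding="utf-8")
    words = content.split()

    return {
        "word_count": len(words),
        # Match actual Markdown constructs rather than any stray "#", "|", or "$"
        "has_headers": bool(re.search(r"^#{1,6}\s", content, re.MULTILINE)),
        "has_tables": bool(re.search(r"^\s*\|.*\|\s*$", content, re.MULTILINE)),
        "has_formulas": bool(re.search(r"\$[^$\n]+\$", content)),
        "avg_line_length": len(content) / max(content.count("\n"), 1),
        "empty_ratio": content.count("\n\n") / max(content.count("\n"), 1)
    }

# Example usage (assumes output.md exists)
quality_metrics = validate_markdown_quality(Path("output.md"))
print(f"Quality metrics: {quality_metrics}")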

### 3. Cost Management for VLMs

For large-scale conversion projects:
- Start with Gemini 2.0 Flash (most cost-effective)
- Use selective VLM processing: Category 1-2 tools for most pages, VLM only for critical visual pages
- Implement caching to avoid reprocessing
- Consider local VLMs for sensitive documents (deploy Qwen2-VL or LLaVA)
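
Caching can be as simple as keying the result on a hash of the rendered page image. The sketch below is one possible approach, not a prescribed one: `cached_convert`, the `.vlm_cache` directory, and the `version` tag are all hypothetical names, and `convert` stands in for whatever function calls the VLM.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_convert(
    image_bytes: bytes,
    convert: Callable[[bytes], str],
    cache_dir: str = ".vlm_cache",
    version: str = "v1",
) -> str:
    """Return cached Markdown for a page image, calling convert only on a miss."""
    # Include a version tag so changing the prompt or model invalidates old entries
    key = hashlib.sha256(version.encode("utf-8") + image_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.md"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    markdown = convert(image_bytes)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

With this in place, re-running a batch after a crash or a partial failure only pays for the pages that were not converted yet.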

### 4. Handling Mixed Document Types

In [None]:
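from typing import Optional

import fitz  # PyMuPDF
import pymupdf4llm
from docling.document_converter import DocumentConverter

def smart_convert_pdf(pdf_path: str, api_key: Optional[str] = None) -> str:
    """
    Choose a conversion method based on PDF characteristics and return Markdown.

    Only the first page is sampled, so treat the routing as a rough heuristic.
    """
    # Quick analysis
    doc = fitz.open(pdf_path)
    sample_page = doc[0]

    # Check if text is selectable
    text = sample_page.get_text()
    is_scanned = len(text.strip()) < 50  # Likely scanned if very little text

    # Check for images
    image_count = len(sample_page.get_images())
    has_images = image_count > 2

    doc.close()

    # Route to the appropriate tool. The batch helpers defined earlier expect
    # folders, so a single file is converted directly here.
    if is_scanned:
        print("→ Using Docling (scanned document)")
        # Uses Docling's default pipeline options; pass PdfPipelineOptions
        # as in the Category 2 example to customize OCR and table handling.
        return DocumentConverter().convert(pdf_path).document.export_to_markdown()
    elif has_images:
        if not api_key:
            raise ValueError("api_key is required for the VLM route")
        print("→ Using VLM (image-heavy)")
        pages = convert_complex_pdfs_vlm(pdf_path, api_key)
        return "\n\n".join(pages.values())
    else:
        print("→ Using PyMuPDF4LLM (simple digital PDF)")
        return pymupdf4llm.to_markdown(pdf_path)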

---

## Conclusion

Choosing the right PDF-to-Markdown converter directly impacts your RAG system's performance. Remember:

- 📄 **Simple digital PDFs** → PyMuPDF4LLM, PDFPlumber, or MarkItDown
- 📊 **Scanned PDFs or tables** → Docling, Marker, or PaddleOCR
- 🖼️ **Image-heavy complex PDFs** → VLM (Gemini 2.0 Flash for cloud, Qwen3-VL for local)

Start with the simplest tool that meets your needs. Upgrade to more sophisticated methods only when quality demands it. Your extraction strategy should match your document characteristics and project requirements.

For production systems, consider implementing a hybrid approach that automatically routes PDFs to the appropriate conversion tool based on their characteristics. This maximizes both quality and cost-efficiency.

Happy converting! 🚀