[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ovaccarelli/LLM-RAG/blob/main/notebooks/llm_rag_Open_Source_AI_Workshop_2.ipynb)


# 🔧 Setup

In [None]:
# Install all required Python packages for this workshop

!pip install langchain langchain-community wget pypdf pymupdf4llm langchain-docling rich

In [None]:
import os, time
from pathlib import Path
import wget

from rich.console import Console
from rich.markdown import Markdown
from langchain_community.document_loaders import PyPDFLoader
import pymupdf4llm
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

console = Console()

In [None]:
# Create the "data" folder (its path differs between Colab and a local checkout)
try:
    import google.colab  # only exists on Colab
    DATA_FOLDER = Path("data")         # Colab path
except ImportError:
    DATA_FOLDER = Path("../data")      # Local path
os.makedirs(DATA_FOLDER, exist_ok=True)

---

## 2. Extract Text from a Single PDF

In this step, we’ll load one PDF file and convert its pages into plain text (or Markdown) using three different methods:

- **PyPDFLoader** (LangChain): A straightforward loader that splits the PDF into page-level `Document` objects.  
- **PyMuPDF4LLM**: A fast, native extractor that generates Markdown-formatted text with optional page-wise chunking.  
- **Docling**: A robust parser that preserves layout and exports content either as chunks (`ExportType.DOC_CHUNKS`, the default) or as one Markdown document per file (`ExportType.MARKDOWN`).

You will see how to:

1. Read the PDF from disk.  
2. Extract every page’s text into a structured format.  
3. Time each method to compare performance.  
4. Preview a specific page for verification.
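
The four steps above follow one pattern: call an extractor, record how long it took, and look at one page. Below is a minimal sketch of that pattern using only the standard library. The names `timed` and `fake_extractor` are hypothetical and introduced only for illustration; the stand-in extractor takes the place of a real PDF loader.

```python
import time


def timed(extract, *args, **kwargs):
    """Run extract(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    result = extract(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_extractor(path):
    """Stand-in for a real loader: returns one string per 'page'."""
    return [f"Text of page {i + 1} from {path}" for i in range(3)]


pages, seconds = timed(fake_extractor, "example.pdf")
print(f"Extracted {len(pages)} pages in {seconds:.4f} seconds")
print(pages[0][:40])  # preview the first 40 characters of page index 0
```

The cells below use `time.time()` directly; wrapping the call in a helper like this is just one way to keep the three measurements consistent.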

### 📁 Set Up Paths & Download the Test PDFs

In [None]:
# Create the "data/sample_pdf" folder if it doesn't exist
SAMPLE_PDF_DIR = DATA_FOLDER/"sample_pdf"
os.makedirs(SAMPLE_PDF_DIR, exist_ok=True)

# URLs of the PDFs to test
urls = [
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/sample_pdf/2312.10997.pdf",
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/sample_pdf/2312.10997_page13.pdf",
]

# Download each PDF unless it is already present
for url in urls:
    name = url.split("/")[-1]
    target = SAMPLE_PDF_DIR / name
    if not target.is_file():
        # Save into SAMPLE_PDF_DIR so the path is correct on Colab and locally
        wget.download(url, str(target))
console.print("PDF files are ready.", style="bold green")
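
The download loop derives each local file name from the last URL segment and skips files that already exist. Here is a sketch of that skip rule as a small helper, so it can be checked without touching the network. `missing_downloads` is a hypothetical name, and the `example.com` URLs are placeholders.

```python
from pathlib import Path
from urllib.parse import urlparse
import tempfile


def missing_downloads(urls, folder):
    """Return (url, target_path) pairs for files not yet present in folder."""
    folder = Path(folder)
    todo = []
    for url in urls:
        name = Path(urlparse(url).path).name  # last path segment, query string ignored
        target = folder / name
        if not target.is_file():
            todo.append((url, target))
    return todo


with tempfile.TemporaryDirectory() as tmp:
    demo_urls = ["https://example.com/files/a.pdf", "https://example.com/files/b.pdf"]
    (Path(tmp) / "a.pdf").write_bytes(b"%PDF-")  # pretend a.pdf was already downloaded
    todo = missing_downloads(demo_urls, tmp)
    names = [target.name for _, target in todo]
    print(names)
```

Each returned pair could then be passed to `wget.download(url, str(target))`.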

### PyPDFLoader

In [None]:
pdf_path = SAMPLE_PDF_DIR/"2312.10997.pdf"  

# Load the PDF with PyPDFLoader (returns a list of Document objects, one per page)
start = time.time()
loader = ...
docs_pypdf = ...             
end = time.time()

print(f"Using file: {pdf_path.name}")
print(f"🕒 PyPDFLoader loaded {len(docs_pypdf)} pages in {end - start:.2f} seconds")
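
One possible way to complete the loading step is sketched below. It assumes `langchain-community` and `pypdf` are installed. The helper name `load_with_pypdf` and the sample path are illustrative assumptions, not part of the workshop code.

```python
import time
from pathlib import Path


def load_with_pypdf(pdf_path):
    """Return (documents, elapsed_seconds); PyPDFLoader yields one Document per page."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No PDF found at {pdf_path}")
    # Imported lazily so the path check works even if LangChain is not installed.
    from langchain_community.document_loaders import PyPDFLoader

    start = time.time()
    docs = PyPDFLoader(str(pdf_path)).load()
    return docs, time.time() - start


# Assumed location; adjust it to match your SAMPLE_PDF_DIR.
sample = Path("data/sample_pdf/2312.10997.pdf")
if sample.is_file():
    docs, seconds = load_with_pypdf(sample)
    print(f"Loaded {len(docs)} pages in {seconds:.2f} seconds")
else:
    print(f"Skipping demo: {sample} not found")
```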

In [None]:
# --- Preview the PDF contents ---
# Pages are indexed starting from 0

page_to_print = ...  # Page index to preview (0-based)
max_num_characters = ...  # Maximum number of characters to print

# Now preview the chosen page:

if 0 <= page_to_print < len(docs_pypdf):
    content = docs_pypdf[page_to_print].page_content
    print(f"--- 📄 Page {page_to_print + 1} / {len(docs_pypdf)} ---\n")
    print(content[:max_num_characters])
else:
    print(f"Page index {page_to_print} is out of range (valid: 0 to {len(docs_pypdf) - 1})")
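
The same preview logic is repeated for each extractor, so it can be factored into a helper that works on a plain list of page strings. This is a sketch; `preview_page` and the sample strings are hypothetical.

```python
def preview_page(pages, index, max_chars=500):
    """Return a short preview of pages[index], or a message if index is invalid."""
    if not pages:
        return "No pages to preview"
    if not 0 <= index < len(pages):
        return f"Page index {index} is out of range (valid: 0 to {len(pages) - 1})"
    header = f"--- Page {index + 1} / {len(pages)} ---\n"
    return header + pages[index][:max_chars]


sample_pages = ["First page text.", "Second page text, somewhat longer."]
print(preview_page(sample_pages, 1, max_chars=11))
print(preview_page(sample_pages, 5))
```

With PyPDFLoader output, the list would be `[d.page_content for d in docs_pypdf]`; with PyMuPDF4LLM page dicts, `[p["text"] for p in docs_pymupdf]`.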

### PyMuPDF4LLM

In [None]:
# Load the PDF with PyMuPDF4LLM (returns a list of page dicts)
start = time.time()
docs_pymupdf = ...       
end = time.time()

print(f"Using file: {pdf_path.name}")
print(f"🕒 PyMuPDF4LLM extracted {len(docs_pymupdf)} pages in {end - start:.2f} seconds\n")
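
One possible way to complete the extraction step is sketched below, assuming `pymupdf4llm` is installed. Passing `page_chunks=True` to `to_markdown` asks for a list with one dict per page instead of a single Markdown string. The helper name `load_with_pymupdf4llm` and the sample path are illustrative assumptions.

```python
import time
from pathlib import Path


def load_with_pymupdf4llm(pdf_path):
    """Return (page_dicts, elapsed_seconds); each dict holds the page Markdown under 'text'."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No PDF found at {pdf_path}")
    # Imported lazily so the path check works even if pymupdf4llm is not installed.
    import pymupdf4llm

    start = time.time()
    pages = pymupdf4llm.to_markdown(str(pdf_path), page_chunks=True)
    return pages, time.time() - start


# Assumed location; adjust it to match your SAMPLE_PDF_DIR.
sample = Path("data/sample_pdf/2312.10997.pdf")
if sample.is_file():
    pages, seconds = load_with_pymupdf4llm(sample)
    print(f"Extracted {len(pages)} pages in {seconds:.2f} seconds")
else:
    print(f"Skipping demo: {sample} not found")
```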

In [None]:
# --- Preview the PDF contents ---
# Pages are indexed starting from 0

page_to_print = ...  # Page index to preview (0-based)
max_num_characters = ...  # Maximum number of characters to print

# Now preview the chosen page:

if 0 <= page_to_print < len(docs_pymupdf):
    md = docs_pymupdf[page_to_print]["text"]
    print(f"--- 📄 Page {page_to_print + 1} / {len(docs_pymupdf)} ---\n")
    print(md[:max_num_characters])
else:
    print(f"Page index {page_to_print} is out of range (valid: 0 to {len(docs_pymupdf) - 1})")

### Docling

In [None]:
pdf_path_docling = SAMPLE_PDF_DIR/"2312.10997_page13.pdf"  # Just pick one page for testing

# Load the PDF with Docling
start = time.time()
loader_docling = ...
docs_docling = ...
end = time.time()

print(f"Using file: {pdf_path_docling.name}")
print(f"🕒 Docling loaded {len(docs_docling)} document(s) in {end - start:.2f} seconds")
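
One possible way to complete the Docling step is sketched below, assuming `langchain-docling` is installed. `ExportType.MARKDOWN` is chosen here so that each input file comes back as a single Markdown document; the default, `ExportType.DOC_CHUNKS`, would return one document per chunk. The helper name `load_with_docling` and the sample path are illustrative assumptions, and the first run may take noticeably longer while Docling prepares its models.

```python
import time
from pathlib import Path


def load_with_docling(pdf_path):
    """Return (documents, elapsed_seconds) with one Markdown document per input file."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No PDF found at {pdf_path}")
    # Imported lazily so the path check works even if Docling is not installed.
    from langchain_docling import DoclingLoader
    from langchain_docling.loader import ExportType

    start = time.time()
    loader = DoclingLoader(file_path=str(pdf_path), export_type=ExportType.MARKDOWN)
    docs = loader.load()
    return docs, time.time() - start


# Assumed location; adjust it to match your SAMPLE_PDF_DIR.
sample = Path("data/sample_pdf/2312.10997_page13.pdf")
if sample.is_file():
    docs, seconds = load_with_docling(sample)
    print(f"Loaded {len(docs)} document(s) in {seconds:.2f} seconds")
else:
    print(f"Skipping demo: {sample} not found")
```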

In [None]:
# --- Preview the PDF contents ---

# Print the full extracted text of each returned document
for doc in docs_docling:
    print(f"\n--- 📄 PDF Document: {pdf_path_docling.name} ---\n")
    print(doc.page_content)

---