[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ovaccarelli/LLM-RAG/blob/main/notebooks/llm_rag_Open_Source_AI_Workshop_2.ipynb)


# 🔧 Setup

In [1]:
# Install all required Python packages for this workshop

!pip install langchain langchain-community wget pypdf pymupdf4llm langchain-docling

Collecting langchain-community
  Downloading langchain_community-0.3.23-py3-none-any.whl.metadata (2.5 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Collecting pymupdf4llm
  Downloading pymupdf4llm-0.0.22-py3-none-any.whl.metadata (4.7 kB)
Collecting langchain-docling
  Downloading langchain_docling-0.2.0-py3-none-any.whl.metadata (2.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting pymupdf>=1.25.5 (from pymupdf4llm)
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.many

In [2]:
import os, time
from pathlib import Path
import wget

from rich.console import Console
from rich.markdown import Markdown
from langchain_community.document_loaders import PyPDFLoader  # Loaders live in langchain-community
import pymupdf4llm
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

console = Console()

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## 2. Extract Text from a Single PDF

In this step, we’ll load one PDF file and convert its pages into plain text (or Markdown) using three different methods:

- **PyPDFLoader** (LangChain): A straightforward loader that splits the PDF into page-level `Document` objects.  
- **PyMuPDF4LLM**: An extractor built on PyMuPDF that generates Markdown-formatted text, with optional page-wise chunking (`page_chunks=True`).  
- **Docling**: A layout-aware parser whose LangChain loader exports either one Markdown document per file (`ExportType.MARKDOWN`) or individual chunks (`ExportType.DOC_CHUNKS`).

You will see how to:

1. Read the PDF from disk.  
2. Extract every page’s text into a structured format.  
3. Time each method to compare performance.  
4. Preview a specific page for verification.
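
Step 3 in the list above repeats the same start/stop pattern for every loader, so it can be factored into a small helper. The sketch below is not part of the original notebook: `timed` is a hypothetical name, and it uses `time.perf_counter` rather than `time.time`, which is usually the safer choice for measuring short intervals.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be a call such as loader.load
pages, seconds = timed(lambda n: [f"page {i}" for i in range(n)], 21)
print(f"Loaded {len(pages)} pages in {seconds:.2f} seconds")
```

With a helper like this, each loader call in the cells that follow could be wrapped in one line instead of a separate `start`/`end` pair.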

### 📁 Set Up Paths and Choose One PDF for Testing

In [3]:
# Create the "data/sample_pdf" folder if it doesn't exist
SAMPLE_PDF_DIR = Path("data/sample_pdf")
os.makedirs(SAMPLE_PDF_DIR, exist_ok=True)

# URLs of the PDFs to test
urls = [
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/sample_pdf/2312.10997.pdf",
    "https://raw.githubusercontent.com/ovaccarelli/LLM-RAG/main/data/sample_pdf/2312.10997_page13.pdf",
]

# Download each PDF unless it is already on disk
for url in urls:
    name = url.split("/")[-1]
    target = SAMPLE_PDF_DIR / name
    if not target.is_file():
        wget.download(url, str(target))
console.print("PDF files are ready.", style="bold green")
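
The download cell above follows a "fetch only if missing" pattern. As a rough sketch, the same idea can be wrapped in a reusable function. The name `fetch_missing` and the injectable `download` parameter are assumptions made for illustration; the default uses the standard library's `urlretrieve` instead of `wget`, so the sketch runs without extra packages.

```python
from pathlib import Path
from typing import Callable, Iterable, List
from urllib.request import urlretrieve


def fetch_missing(
    urls: Iterable[str],
    dest_dir: Path,
    download: Callable[[str, str], object] = urlretrieve,
) -> List[Path]:
    """Download each URL into dest_dir unless the file already exists.

    Returns the local path for every URL, whether it was downloaded or cached.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        target = dest_dir / url.rsplit("/", 1)[-1]  # File name is the last URL segment
        if not target.is_file():
            download(url, str(target))
        paths.append(target)
    return paths
```

Calling `fetch_missing(urls, SAMPLE_PDF_DIR)` would then return the local paths, and a second call would skip files that are already present. Passing `download=wget.download` should also work, since that function accepts the URL and output path in the same order.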

### PyPDFLoader

In [4]:
pdf_path = SAMPLE_PDF_DIR / "2312.10997.pdf"  # Pick one PDF for testing

# Load the PDF with PyPDFLoader
start = time.time()
loader = PyPDFLoader(str(pdf_path))
docs_pypdf = loader.load()  # Returns a list of Document objects, one per page (page 1 is index 0)
end = time.time()

print(f"Using file: {pdf_path.name}")
print(f"🕒 PyPDFLoader loaded {len(docs_pypdf)} pages in {end - start:.2f} seconds")

Using file: 2312.10997.pdf
🕒 PyPDFLoader loaded 21 pages in 1.21 seconds


In [15]:
# --- Preview the PDF contents ---
# Pages are indexed starting from 0

page_to_print = 0  # Index of the page to preview
max_num_characters = 7000  # Maximum number of characters to print

# Preview the chosen page:

if 0 <= page_to_print < len(docs_pypdf):
    content = docs_pypdf[page_to_print].page_content
    print(f"--- 📄 Page {page_to_print + 1} / {len(docs_pypdf)} ---\n")
    print(content[:max_num_characters])
else:
    print(f"Page index {page_to_print} is out of range (valid: 0 to {len(docs_pypdf) - 1})")

--- 📄 Page 1 / 21 ---

1
Retrieval-Augmented Generation for Large
Language Models: A Survey
Yunfan Gaoa, Yun Xiongb, Xinyu Gao b, Kangxiang Jia b, Jinliu Pan b, Yuxi Bic, Yi Dai a, Jiawei Sun a, Meng
Wangc, and Haofen Wang a,c
aShanghai Research Institute for Intelligent Autonomous Systems, Tongji University
bShanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
cCollege of Design and Innovation, Tongji University
Abstract—Large Language Models (LLMs) showcase impres-
sive capabilities but encounter challenges like hallucination,
outdated knowledge, and non-transparent, untraceable reasoning
processes. Retrieval-Augmented Generation (RAG) has emerged
as a promising solution by incorporating knowledge from external
databases. This enhances the accuracy and credibility of the
generation, particularly for knowledge-intensive tasks, and allows
for continuous knowledge updates and integration of domain-
specific information. RAG synergistically merges LLMs’ i
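
The output above keeps the PDF's physical line breaks, so words split at the end of a line stay hyphenated (`impres-` / `sive`). A minimal cleanup sketch, assuming plain-text input like the page content shown here, is given below. `dehyphenate` is a hypothetical helper, not something PyPDFLoader provides.

```python
import re


def dehyphenate(text: str) -> str:
    """Rejoin words hyphenated across line breaks and unwrap single newlines."""
    # "impres-\nsive" -> "impressive"
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn remaining single newlines into spaces; keep blank lines as paragraph breaks
    return re.sub(r"(?<!\n)\n(?!\n)", " ", joined)


sample = (
    "showcase impres-\nsive capabilities but encounter challenges like hallucination,\n"
    "outdated knowledge"
)
print(dehyphenate(sample))
```

One caveat: the rule cannot tell a line-break hyphen from a genuine compound, so `domain-` / `specific` would come out as `domainspecific`. A dictionary check would be needed to handle those cases.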

### PyMuPDF4LLM

In [18]:
# Load the PDF with PyMuPDF4LLM
start = time.time()
docs_pymupdf = pymupdf4llm.to_markdown(str(pdf_path), page_chunks=True)  # Returns a list of dicts, one per page
end = time.time()

print(f"Using file: {pdf_path.name}")
print(f"🕒 PyMuPDF4LLM extracted {len(docs_pymupdf)} pages in {end - start:.2f} seconds\n")

Using file: 2312.10997.pdf
🕒 PyMuPDF4LLM extracted 21 pages in 5.07 seconds



In [23]:
# --- Preview the PDF contents ---
# Pages are indexed starting from 0

page_to_print = 0  # Index of the page to preview
max_num_characters = 4000  # Maximum number of characters to print

# Preview the chosen page:

if 0 <= page_to_print < len(docs_pymupdf):
    md = docs_pymupdf[page_to_print]["text"]
    print(f"--- 📄 Page {page_to_print + 1} / {len(docs_pymupdf)} ---\n")
    print(md[:max_num_characters])
else:
    print(f"Page index {page_to_print} is out of range (valid: 0 to {len(docs_pymupdf) - 1})")

--- 📄 Page 1 / 21 ---

1

## Retrieval-Augmented Generation for Large Language Models: A Survey
#### Yunfan Gao [a], Yun Xiong [b], Xinyu Gao [b], Kangxiang Jia [b], Jinliu Pan [b], Yuxi Bi [c], Yi Dai [a], Jiawei Sun [a], Meng Wang [c], and Haofen Wang [a,c] a Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University b Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University c College of Design and Innovation, Tongji University


***Abstract*** **—Large Language Models (LLMs) showcase impres-**
**sive capabilities but encounter challenges like hallucination,**
**outdated knowledge, and non-transparent, untraceable reasoning**
**processes. Retrieval-Augmented Generation (RAG) has emerged**
**as a promising solution by incorporating knowledge from external**
**databases. This enhances the accuracy and credibility of the**
**generation, particularly for knowledge-intensive tasks, and allows**
**for continuous knowledge updates and inte
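
With `page_chunks=True`, each page arrives as a separate dict whose Markdown sits under the `"text"` key, as used in the preview cell. To compare against a whole-document export, the pages can be stitched back together. The sketch below is one way to do that: `join_pages` and the HTML-comment page markers are illustrative choices, and the sample dicts stand in for real extractor output.

```python
from typing import Dict, List


def join_pages(page_chunks: List[Dict[str, str]], separator: str = "\n\n") -> str:
    """Concatenate per-page Markdown, prefixing each page with a marker comment."""
    parts = []
    for number, chunk in enumerate(page_chunks, start=1):
        # Markers use 1-based page numbers, matching the preview headers
        parts.append(f"<!-- page {number} -->\n{chunk['text'].strip()}")
    return separator.join(parts)


# Stand-in for the list returned by the extractor; only the "text" key is used
sample_chunks = [
    {"text": "## Title\n\nFirst page.\n"},
    {"text": "Second page.\n"},
]
print(join_pages(sample_chunks))
```

The markers keep page provenance visible after joining, which can help when tracing a retrieved passage back to its page.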

### Docling

In [30]:
pdf_path_docling = SAMPLE_PDF_DIR / "2312.10997.pdf"  # Pick one PDF for testing

# Load the PDF with Docling
start = time.time()
loader_docling = DoclingLoader(str(pdf_path_docling), export_type=ExportType.MARKDOWN)
docs_docling = loader_docling.load()
end = time.time()

print(f"Using file: {pdf_path_docling.name}")
print(f"🕒 Docling loaded {len(docs_docling)} document(s) in {end - start:.2f} seconds")

Using file: 2312.10997.pdf
🕒 Docling loaded 1 document(s) in 60.74 seconds
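
To put the three timings side by side, the short sketch below tabulates the wall-clock values printed in the cell outputs of this notebook. These are single runs on one machine, so treat the ratios as indicative only; Docling's figure may also include one-time setup cost on a first run.

```python
# Wall-clock timings copied from the cell outputs in this notebook (seconds)
timings = {
    "PyPDFLoader": 1.21,
    "PyMuPDF4LLM": 5.07,
    "Docling": 60.74,
}

fastest = min(timings.values())

# Build (name, seconds, slowdown relative to the fastest method), fastest first
rows = sorted(
    ((name, secs, secs / fastest) for name, secs in timings.items()),
    key=lambda row: row[1],
)

print(f"{'Method':<12} {'Seconds':>8} {'Relative':>9}")
for name, secs, factor in rows:
    print(f"{name:<12} {secs:>8.2f} {factor:>8.1f}x")
```

Speed is only one axis: the previews in this section show that the slower methods also return more structured output.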


In [27]:
# --- Preview the PDF contents ---

# Print the full extracted text
for doc in docs_docling:
    print(f"\n--- 📄 PDF Document: {pdf_path_docling.name} ---\n")
    print(doc.page_content)


--- 📄 PDF Document: 2312.10997.pdf ---

## Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao a , Yun Xiong b , Xinyu Gao b , Kangxiang Jia , Jinliu Pan , Yuxi Bi , Yi Dai , Jiawei Sun , Meng b b c a a Wang c , and Haofen Wang a,c a Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University b Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University c College of Design and Innovation, Tongji University

Abstract -Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the generation, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domainspecific information. 
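
Because `ExportType.MARKDOWN` returns the whole paper as a single document, a later step usually has to break it into smaller pieces. One simple option, sketched below under the assumption that headings use `#` prefixes as in the output above, is to split on Markdown headings. `split_by_headings` is a hypothetical helper and the sample string is made up for illustration.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets heading ''."""
    sections: List[Tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has a heading or any non-blank text
            if heading or any(item.strip() for item in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    if heading or any(item.strip() for item in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


sample = "## Title\n\nIntro text.\n\n## Section A\n\nBody text."
for title, text in split_by_headings(sample):
    print(f"{title!r}: {len(text)} characters")
```

This sketch does not special-case fenced code blocks, so a `#` comment inside a fence would be treated as a heading. For a paper like this one that is unlikely to matter.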

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------