# PDF Loaders in LangChain

LangChain's `langchain_community` package includes several PDF loaders. This notebook covers three:
- **PyPDFLoader** - The most commonly used; simple, returns one document per page
- **PyMuPDFLoader** - Faster, a good fit for large PDFs
- **PDFPlumberLoader** - Better for tables and complex layouts

In [1]:
from langchain_community.document_loaders import (
    PyPDFLoader,
    PyMuPDFLoader,
    PDFPlumberLoader
)

## 1. PyPDFLoader (Most Popular)

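The most commonly used PDF loader, built on the `pypdf` library. `load()` returns one `Document` per page, each carrying the page number and file metadata. Note that for the sample PDF used here, the extracted text in the output comes back letter-spaced.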

In [14]:
# Load a PDF file using PyPDFLoader
pdf_path = "../data/AI ML Engineer & Agentic AI Engineer.pdf"  # Replace with your PDF path

loader = PyPDFLoader(pdf_path)

# Load and split by pages (each page becomes a separate document)
docs = loader.load()

print(f"Number of pages: {len(docs)}")
print(f"First page content preview: {docs[0].page_content[:500]}...")
print(f"Metadata: {docs[0].metadata}")

Number of pages: 1
First page content preview: M o h s i n  A l i  A g h a r i y a
B a n g a l o r e ,  K a r n a t a k a  |  9 3 2 7 9 0 0 8 5 5  |  m o h s i n a l i a b i d a l i 3 2 0 @ g m a i l . c o m
A I  /  M L  E n g i n e e r  ( A g e n t i c  A I ,  R A G  S y s t e m s )
PR OFILE
F r e s h e r  A I / M L  E n g i n e e r  w i t h  h a n d s - o n  e x p e r i e n c e  i n  b u i l d i n g  A g e n t i c  A I  s y s t e m s ,  R e t r i e v a l - A u g m e n t e d  G e n e r a t i o n
( R A G )  p i p e l i n e s ,  a n d  L L M ...
Metadata: {'producer': 'Canva', 'creator': 'Canva', 'creationdate': '2025-12-23T14:19:34+00:00', 'title': 'AI ML Engineer & Agentic AI Engineer', 'moddate': '2025-12-23T14:19:34+00:00', 'keywords': 'DAG6VaiTlqc,BAFlxON1WvE,0', 'author': 'Mohsin Ali', 'source': '../data/AI ML Engineer & Agentic AI Engineer.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}
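The letter-spaced preview above (single spaces between letters, double spaces between words) can be cleaned up after loading. The following is a minimal sketch of one heuristic, not part of LangChain: the function name `collapse_letter_spacing` and the `threshold` parameter are hypothetical, and the rule assumes word boundaries always appear as two or more spaces, which may not hold for other PDFs.

```python
import re


def collapse_letter_spacing(text: str, threshold: float = 0.8) -> str:
    """Rejoin letter-spaced lines such as 'H e l l o  W o r l d'.

    Hypothetical helper: a line is rewritten only when at least
    `threshold` of its whitespace-separated tokens are single characters.
    """
    fixed_lines = []
    for line in text.splitlines():
        tokens = line.split()
        single_chars = sum(len(tok) == 1 for tok in tokens)
        if tokens and single_chars / len(tokens) >= threshold:
            # Two or more spaces mark a word boundary;
            # single spaces sit between the letters of one word.
            words = re.split(r" {2,}", line.strip())
            line = " ".join(word.replace(" ", "") for word in words)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


sample = "H e l l o  W o r l d\nPR OFILE"
print(collapse_letter_spacing(sample))
```

In the notebook this could be applied as `collapse_letter_spacing(docs[0].page_content)`. Lines that are not mostly single characters, such as `PR OFILE`, are left untouched, so the heuristic will not repair every spacing defect.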


## 2. PyMuPDFLoader (Faster)

Uses the PyMuPDF library. It is generally faster than PyPDFLoader and returns additional metadata fields, such as `file_path` and `format`.

In [15]:
# Load a PDF file using PyMuPDFLoader (faster)
loader = PyMuPDFLoader(pdf_path)

docs = loader.load()

print(f"Number of pages: {len(docs)}")
print(f"First page content preview: {docs[0].page_content[:500]}...")
print(f"Metadata: {docs[0].metadata}")  # More metadata than PyPDFLoader

Number of pages: 1
First page content preview: Mohsin Ali Aghariya
Bangalore, Karnataka | 9327900855 | mohsinaliabidali320@gmail.com
AI / ML Engineer (Agentic AI, RAG Systems)
PROFILE
Fresher AI/ML Engineer with hands-on experience in building Agentic AI systems, Retrieval-Augmented Generation
(RAG) pipelines, and LLM-based applications. Skilled in Python, LangChain, vector databases, FastAPI and prompt
engineering. Passionate about developing intelligent AI agents for real-world use cases.
TECHNICAL SKILLS
Programming: Python, JavaScript
AI...
Metadata: {'producer': 'Canva', 'creator': 'Canva', 'creationdate': '2025-12-23T14:19:34+00:00', 'source': '../data/AI ML Engineer & Agentic AI Engineer.pdf', 'file_path': '../data/AI ML Engineer & Agentic AI Engineer.pdf', 'total_pages': 1, 'format': 'PDF 1.4', 'title': 'AI ML Engineer & Agentic AI Engineer', 'author': 'Mohsin Ali', 'subject': '', 'keywords': 'DAG6VaiTlqc,BAFlxON1WvE,0', 'moddate': '2025-12-23T14:19:34+00:00', 'trapped': '', 'm
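The speed claim is easy to check on your own files. The sketch below is a small timing harness using only the standard library; `time_call` is a hypothetical helper name, and the lambda is a stand-in workload so the example runs without a PDF.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[[], Any], repeats: int = 3) -> Tuple[Any, float]:
    """Call fn() `repeats` times; return the last result and the best time in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload so the sketch runs anywhere. In the notebook you would
# pass something like `lambda: PyMuPDFLoader(pdf_path).load()` instead.
pages, seconds = time_call(lambda: [str(i) for i in range(1000)])
print(f"{len(pages)} items in {seconds:.6f}s")
```

Taking the best of several runs reduces noise from disk caching and other processes. Results on a one-page PDF say little about behavior on large documents, so it is worth timing a representative file.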

## 3. PDFPlumberLoader (Best for Tables)

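Built on the `pdfplumber` library, which is often better at extracting tables and structured content from PDFs. For the sample PDF, the previewed text matches PyMuPDFLoader's, but the metadata keys are capitalized and the dates stay in raw PDF format.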

In [16]:
# Load a PDF file using PDFPlumberLoader (better for tables)
loader = PDFPlumberLoader(pdf_path)

docs = loader.load()

print(f"Number of pages: {len(docs)}")
print(f"First page content preview: {docs[0].page_content[:500]}...")
print(f"Metadata: {docs[0].metadata}")

Number of pages: 1
First page content preview: Mohsin Ali Aghariya
Bangalore, Karnataka | 9327900855 | mohsinaliabidali320@gmail.com
AI / ML Engineer (Agentic AI, RAG Systems)
PROFILE
Fresher AI/ML Engineer with hands-on experience in building Agentic AI systems, Retrieval-Augmented Generation
(RAG) pipelines, and LLM-based applications. Skilled in Python, LangChain, vector databases, FastAPI and prompt
engineering. Passionate about developing intelligent AI agents for real-world use cases.
TECHNICAL SKILLS
Programming: Python, JavaScript
AI...
Metadata: {'source': '../data/AI ML Engineer & Agentic AI Engineer.pdf', 'file_path': '../data/AI ML Engineer & Agentic AI Engineer.pdf', 'page': 0, 'total_pages': 1, 'Title': 'AI ML Engineer & Agentic AI Engineer', 'Creator': 'Canva', 'Producer': 'Canva', 'CreationDate': "D:20251223141934+00'00'", 'ModDate': "D:20251223141934+00'00'", 'Keywords': 'DAG6VaiTlqc,BAFlxON1WvE,0', 'Author': 'Mohsin Ali'}
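The three metadata outputs above use different key casing (`title` vs `Title`) and different date formats (ISO 8601 vs the raw PDF form `D:YYYYMMDDHHmmSS+HH'mm'`). If downstream code filters on metadata, it can help to normalize them. This is a minimal sketch under stated assumptions: `parse_pdf_date` and `normalize_metadata` are hypothetical helpers, only the date layout seen in the output above is handled, and a date without a timezone offset is assumed to be UTC.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the raw PDF date form shown above, e.g. D:20251223141934+00'00'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'?(\d{2})'?|Z)?$"
)


def parse_pdf_date(value: str) -> str:
    """Convert a raw PDF date string to ISO 8601; return other strings unchanged."""
    match = _PDF_DATE.match(value)
    if match is None:
        return value
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    # Assumption: a missing offset (or 'Z') is treated as UTC.
    offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
    if sign == "-":
        offset = -offset
    stamp = datetime(
        int(year), int(month), int(day),
        int(hour), int(minute), int(second),
        tzinfo=timezone(offset),
    )
    return stamp.isoformat()


def normalize_metadata(metadata: dict) -> dict:
    """Lower-case all keys and convert raw PDF dates to ISO 8601."""
    normalized = {}
    for key, value in metadata.items():
        key = key.lower()
        if key in ("creationdate", "moddate") and isinstance(value, str):
            value = parse_pdf_date(value)
        normalized[key] = value
    return normalized


raw = {"Title": "Example", "CreationDate": "D:20251223141934+00'00'", "page": 0}
print(normalize_metadata(raw))
```

In the notebook, `normalize_metadata(docs[0].metadata)` would bring the PDFPlumberLoader keys in line with the lower-case keys the other two loaders return.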


## 4. Load PDF from URL

In [17]:
# Load PDF directly from a URL
pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"  # Attention Is All You Need paper

loader = PyPDFLoader(pdf_url)
docs = loader.load()

print(f"Loaded {len(docs)} pages from URL")
print(f"First page preview: {docs[0].page_content[:300]}...")

Loaded 15 pages from URL
First page preview: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Par...


## 5. Load Multiple PDFs from Directory

In [19]:
from langchain_community.document_loaders import DirectoryLoader

# Load all PDFs from a directory
pdf_directory = "./pdfs"  # Replace with your directory path

loader = DirectoryLoader(
    pdf_directory,
    glob="**/*.pdf",  # Match all PDF files recursively
    loader_cls=PyPDFLoader  # Use PyPDFLoader for each PDF
)

# Uncomment once pdf_directory exists and contains PDF files
# docs = loader.load()
# print(f"Loaded {len(docs)} pages from all PDFs")
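Because each page becomes its own document, a directory load returns one flat list covering every file. A quick way to see how many pages came from each PDF is to count by the `source` metadata key, which appears in all the loader outputs above. This is a sketch: `pages_per_source` is a hypothetical helper, and the `SimpleNamespace` objects stand in for LangChain `Document` instances so the example runs without any PDFs.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs) -> Counter:
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Stand-ins for Document objects: only `.metadata` is used by the helper.
fake_docs = [
    SimpleNamespace(page_content="", metadata={"source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="", metadata={"source": "a.pdf", "page": 1}),
    SimpleNamespace(page_content="", metadata={"source": "b.pdf", "page": 0}),
]

for source, count in pages_per_source(fake_docs).items():
    print(f"{source}: {count} page(s)")
```

With real data, pass the list returned by `loader.load()` in place of `fake_docs`.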

## Comparison Table

| Loader | Speed | Table Support | Installation |
|--------|-------|---------------|--------------|
| PyPDFLoader | Medium | Basic | `pip install pypdf` |
| PyMuPDFLoader | Fast | Basic | `pip install pymupdf` |
| PDFPlumberLoader | Slow | Excellent | `pip install pdfplumber` |
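Each loader raises an import error at load time if its backing package is missing, so it can be useful to check up front which ones are available. The sketch below uses only the standard library. The module names are assumptions based on how these packages are normally imported (`pymupdf` is also importable as `fitz` in older releases), and `available_loaders` is a hypothetical helper.

```python
from importlib.util import find_spec

# Loader name -> importable module names for its backing package.
# Assumption: PyMuPDF is importable as `pymupdf` or, in older releases, `fitz`.
REQUIRED_MODULES = {
    "PyPDFLoader": ("pypdf",),
    "PyMuPDFLoader": ("pymupdf", "fitz"),
    "PDFPlumberLoader": ("pdfplumber",),
}


def available_loaders() -> dict:
    """Report, per loader, whether any of its backing modules can be found."""
    return {
        loader: any(find_spec(name) is not None for name in modules)
        for loader, modules in REQUIRED_MODULES.items()
    }


for loader, installed in available_loaders().items():
    status = "installed" if installed else "missing"
    print(f"{loader}: {status}")
```

`find_spec` only looks the module up without importing it, so the check is cheap and has no side effects.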