# LangChain PDF Loaders 

LangChain provides several PDF loaders to help you extract text and metadata from PDF documents. Below are examples and descriptions of popular loaders:

---

## 1. PyPDFLoader

**PyPDFLoader** uses the `pypdf` library (the successor to `PyPDF2`) to load and parse PDF files.

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")
documents = loader.load()
```

- **Pros:** Simple, widely used, good for basic text extraction.
- **Cons:** May struggle with complex layouts or scanned PDFs.
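
One rough way to notice that a PDF is scanned is to check how many pages come back with almost no extracted text. The sketch below is a heuristic only: the function names, the `min_chars` cutoff, and the `threshold` ratio are all illustrative assumptions, and the input is assumed to be a list of page strings (for example, `[doc.page_content for doc in documents]`).

```python
def near_empty_pages(page_texts, min_chars=20):
    """Return indices of pages with fewer than min_chars non-whitespace characters."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]


def likely_scanned(page_texts, min_chars=20, threshold=0.5):
    """Guess that a PDF is scanned when at least `threshold` of its pages are near-empty."""
    if not page_texts:
        return False
    empty_ratio = len(near_empty_pages(page_texts, min_chars)) / len(page_texts)
    return empty_ratio >= threshold


# Hypothetical page texts standing in for loader output.
pages = ["", "A full paragraph of extracted text goes here.", "  \n "]
print(near_empty_pages(pages))
print(likely_scanned(pages))
```

If most pages are flagged, an OCR-capable loader is probably a better starting point than plain text extraction.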

---

## 2. PyMuPDFLoader

**PyMuPDFLoader** leverages the `PyMuPDF` (fitz) library for more advanced PDF parsing.

```python
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("example.pdf")
documents = loader.load()
```

- **Pros:** Handles images and complex layouts, and extracts metadata.
- **Cons:** Slightly heavier dependency.

---

## 3. UnstructuredPDFLoader

**UnstructuredPDFLoader** uses the `unstructured` library for robust document parsing, including scanned PDFs and documents with complex formatting.

```python
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("example.pdf")
documents = loader.load()
```

- **Pros:** Best for messy, scanned, or highly formatted PDFs.
- **Cons:** Requires additional dependencies and setup.

---

## Loader Selection Guide

| Loader                | Best For                       | Handles Images | Handles Scanned PDFs | Dependency Size |
|-----------------------|-------------------------------|:--------------:|:-------------------:|:---------------:|
| PyPDFLoader           | Simple, text-based PDFs        | No             | No                  | Small           |
| PyMuPDFLoader         | Complex layouts, images        | Yes            | No                  | Medium          |
| UnstructuredPDFLoader | Scanned, messy, formatted PDFs | Yes            | Yes                 | Large           |
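
The table above can be read as a simple decision rule. The following is a minimal sketch of that rule, not part of LangChain: the function name `suggest_loader` and its two boolean flags are hypothetical, and it returns only the loader's name so that it runs without any PDF library installed.

```python
def suggest_loader(is_scanned: bool, has_complex_layout: bool) -> str:
    """Map two properties of a PDF to a loader name, following the table above."""
    if is_scanned:
        # Only UnstructuredPDFLoader is listed as handling scanned PDFs.
        return "UnstructuredPDFLoader"
    if has_complex_layout:
        # Complex layouts and images without scanning: PyMuPDFLoader.
        return "PyMuPDFLoader"
    # Simple, text-based PDFs: the smallest dependency is enough.
    return "PyPDFLoader"


print(suggest_loader(is_scanned=False, has_complex_layout=True))
```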

---

## Usage Tips

- Choose the loader based on your PDF's complexity and content type.
- For best results, test with a sample document before processing large batches.
- All loaders return a list of `Document` objects containing extracted text and metadata.

---

**References:**
- [LangChain Documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
- [pypdf](https://pypdf.readthedocs.io/)
- [PyMuPDF](https://pymupdf.readthedocs.io/)
- [Unstructured](https://github.com/Unstructured-IO/unstructured)

**PyPDFLoader Example:**

```python
from langchain_community.document_loaders import PyPDFLoader

try:
    pypdf_loader = PyPDFLoader("data/pdf_files/Generative-AI-and-LLMs-for-Dummies-Snowflake.pdf")
    # pypdf_loader = PyPDFLoader("data/pdfs/Databricks-Big-Book-Of-GenAI-FINAL.pdf")
    pypdf_documents = pypdf_loader.load()
    print(f"Loaded {len(pypdf_documents)} documents using PyPDFLoader.")
    print("Sample document content:")
    print(pypdf_documents[1])  # Print the second document (index 1)
except Exception as e:
    print(f"Error loading PDF with PyPDFLoader: {e}")
```

Loaded 52 documents using PyPDFLoader.
Sample document content:
page_content='These materials are © 2024 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.' metadata={'producer': 'iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version); modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)', 'creator': 'PyPDF', 'creationdate': '2024-01-08T17:12:43+05:30', 'author': 'David Baum', 'moddate': '2024-01-08T20:10:49+05:30', 'title': 'Generative AI and LLMs For Dummies®, Snowflake Special Edition', 'source': 'data/pdf_files/Generative-AI-and-LLMs-for-Dummies-Snowflake.pdf', 'total_pages': 52, 'page': 1, 'page_label': 'C2'}
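
The metadata printed above stores its dates as ISO 8601 strings, and the document at index 1 reports `page: 1`, which suggests zero-based page numbering. As a small sketch of working with those values, the snippet below copies a few fields from that output into a plain dictionary (a stand-in for `Document.metadata`) and parses them with the standard library.

```python
from datetime import datetime

# Values copied from the sample output above; a plain dict stands in for Document.metadata.
metadata = {
    "creationdate": "2024-01-08T17:12:43+05:30",
    "moddate": "2024-01-08T20:10:49+05:30",
    "total_pages": 52,
    "page": 1,
}

created = datetime.fromisoformat(metadata["creationdate"])
modified = datetime.fromisoformat(metadata["moddate"])

# Time between creation and last modification of the PDF.
edit_window = modified - created

# Convert the (apparently zero-based) page index into a 1-based page number for display.
human_page = metadata["page"] + 1

print(f"Edited for {edit_window} after creation")
print(f"Page {human_page} of {metadata['total_pages']}")
```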


**PyMuPDFLoader Example:**

```python
from langchain_community.document_loaders import PyMuPDFLoader

try:
    pymupdf_loader = PyMuPDFLoader("data/pdf_files/Generative-AI-and-LLMs-for-Dummies-Snowflake.pdf")
    # pymupdf_loader = PyMuPDFLoader("data/pdfs/Databricks-Big-Book-Of-GenAI-FINAL.pdf")
    pymupdf_documents = pymupdf_loader.load()
    print(f"\nLoaded {len(pymupdf_documents)} documents using PyMuPDFLoader.")
    print("Sample document content:")
    print(pymupdf_documents[19])  # Print the document at index 19
except Exception as e:
    print(f"Error loading PDF with PyMuPDFLoader: {e}")
```


Loaded 52 documents using PyMuPDFLoader.
Sample document content:
page_content='14      Generative AI and LLMs For Dummies, Snowflake Special Edition
These materials are © 2024 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Using task-specific and  
domain-specific LLMs
Bing and Bard are examples of applications developed utiliz-
ing their respective foundation LLMs. These applications have a 
user interface and have undergone additional specialized training, 
enhancing their capabilities for specific tasks. For example, Bard 
offers chatbot access to Google’s full suite of products — includ-
ing YouTube, Google Drive, Google Flights, and others — to assist 
users in a wide variety of tasks. Google users can link their personal 
Gmail, Google Docs, and other account data to allow Bard to analyze 
and manage their personal information. For example, you can ask 
Bard to plan an upcoming trip based on suggestions from a recent 
email 
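
The sample output above contains words split across line breaks, such as `utiliz-` followed by `ing`. Below is a minimal sketch of one way to rejoin them with a regular expression. The helper name `dehyphenate` is hypothetical, and the approach assumes that a hyphen at the end of a line always marks a split word, so it will also merge genuine compounds (for example `task-specific`) when they happen to break across lines.

```python
import re


def dehyphenate(text: str) -> str:
    """Join words that were split with a hyphen at a line break."""
    # Match: word character, hyphen, optional spaces, newline, optional spaces, word character.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


sample = "applications developed utiliz-\ning their respective foundation LLMs"
print(dehyphenate(sample))
```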

## Common PDF Issues and Fixes

When working with PDF loaders in LangChain, you may encounter several common issues:

1. **Encrypted PDFs:**  
    - Some PDFs are password-protected.  
    - Solution: Supply the password when loading; for example, `PyPDFLoader` accepts a `password` argument.

2. **Corrupted PDFs:**  
    - Loaders may fail if a PDF is corrupted.  
    - Solution: Try opening the PDF in a viewer to check its integrity.

3. **Complex Layouts:**  
    - PDFs with tables, images, or unusual formatting may not extract text cleanly.  
    - Solution: Use a layout-aware or OCR-capable loader (such as UnstructuredPDFLoader) for scanned or complex documents.

4. **Missing Dependencies:**  
    - Required libraries (e.g., PyMuPDF, pypdf) may not be installed or up-to-date.  
    - Solution: Ensure all dependencies are installed and updated.

5. **Large Files:**  
    - Very large PDFs can be slow or fail to process.  
    - Solution: Split large PDFs into smaller sections before loading.
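
Two of the issues above (corrupted files and missing dependencies) can be caught before any loader runs. The following is a sketch of such pre-flight checks using only the standard library. The function names are hypothetical, the header check only looks for the `%PDF-` marker near the start of the file (it does not prove the file is intact), and the module names `pypdf`, `fitz` (the import name used by PyMuPDF), and `unstructured` are assumed to match the packages you installed.

```python
import importlib.util
import tempfile
from pathlib import Path


def has_pdf_header(path) -> bool:
    """Return True if the '%PDF-' marker appears in the first 1024 bytes of the file."""
    with open(path, "rb") as handle:
        return b"%PDF-" in handle.read(1024)


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


# Demonstrate the header check on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "ok.pdf"
    good.write_bytes(b"%PDF-1.7\n")
    bad = Path(tmp) / "bad.pdf"
    bad.write_bytes(b"<html>not a pdf</html>")
    header_ok = has_pdf_header(good)
    header_bad = has_pdf_header(bad)

print(header_ok, header_bad)
print("Missing:", missing_modules(["pypdf", "fitz", "unstructured"]))
```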

**Tip:**  
Use `TokenTextSplitter` from `langchain.text_splitter` to break large text documents into manageable chunks for processing with language models.
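
To illustrate what chunk size and overlap mean, here is a minimal sliding-window sketch. It is not `TokenTextSplitter`: it counts whitespace-separated words rather than tokens, and the function name `chunk_words` is hypothetical.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into chunks of up to chunk_size words, sharing chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e", chunk_size=3, chunk_overlap=1))
```

Each chunk repeats the last `chunk_overlap` words of the previous one, which helps preserve context that would otherwise be cut at a chunk boundary.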

```python
import re
import unicodedata
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

class CustomPDFLoader:
    """Load a PDF with PyPDFLoader, clean each page, and split it into chunks."""

    def __init__(self, file_path, chunk_size=None, chunk_overlap=20):
        self.loader = PyPDFLoader(file_path)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Fall back to 1000 characters per chunk and 20 characters of overlap.
        # An explicit chunk_overlap of 0 is respected.
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size if chunk_size else 1000,
            chunk_overlap=chunk_overlap if chunk_overlap is not None else 20,
            separators=["\n\n", "\n", " ", ""],
            length_function=len
        )

    def _normalize_unicode(self, text):
        # Decompose accented characters (NFKD), then drop anything that is still non-ASCII
        return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

    def _clean_text(self, text):
        text = self._normalize_unicode(text)
        # Remove URLs
        text = re.sub(r'http[s]?://\S+', '', text)
        # Remove email addresses
        text = re.sub(r'\S+@\S+', '', text)
        # Replace special characters (everything except letters, digits, and basic punctuation)
        text = re.sub(r'[^a-zA-Z0-9.,;:!?\'\"()\-\s]', ' ', text)
        # Collapse runs of the same punctuation mark
        text = re.sub(r'([.,;:!?\'\"()\-])\1+', r'\1', text)
        # Remove leading/trailing punctuation
        text = re.sub(r'^[.,;:!?\'\"()\-]+|[.,;:!?\'\"()\-]+$', '', text)
        # Remove standalone punctuation
        text = re.sub(r'\s[.,;:!?\'\"()\-]+\s', ' ', text)
        # Collapse all whitespace (spaces, tabs, newlines) into single spaces
        text = re.sub(r'\s+', ' ', text).strip()
        # Drop the text entirely if it is only a number (e.g. a bare page number)
        text = re.sub(r'^\d+$', '', text)
        # Remove isolated numbers; the second pass catches numbers
        # that sat next to one removed in the first pass
        text = re.sub(r'\s\d+\s', ' ', text)
        text = re.sub(r'\s\d+$', '', text)
        text = re.sub(r'\s\d+\s', ' ', text)
        # Normalize punctuation spacing (note: this also splits decimals such as "3.5")
        text = re.sub(r'\s*([.,!?;:])\s*', r'\1 ', text)
        # Final: remove leading/trailing spaces and redundant spaces
        text = text.strip()
        text = re.sub(r' +', ' ', text)
        return text

    # Word-based alternative chunker; load() uses the recursive splitter instead
    def _create_chunks(self, text):
        if not self.chunk_size:
            return [text]

        words = text.split()
        # Guard against a zero or negative step when overlap >= chunk size
        step = max(1, self.chunk_size - (self.chunk_overlap or 0))
        chunks = []
        for i in range(0, len(words), step):
            chunk = ' '.join(words[i:i + self.chunk_size])
            chunks.append(chunk)
        return chunks

    def _load_and_clean(self):
        documents = self.loader.load()
        processed_docs = []
        for page_num, page in enumerate(documents):
            cleaned_text = self._clean_text(page.page_content)
            if cleaned_text:  # Only process non-empty text
                chunks = self.text_splitter.create_documents(
                    [cleaned_text],
                    metadatas=[{
                        **page.metadata,  # retain original metadata
                        "page": page_num + 1,  # 1-based page number
                        "total_pages": len(documents),
                        "chunk_method": "recursive_splitter",
                        "char_count": len(cleaned_text)  # length of the cleaned page, not the chunk
                    }]
                )
                processed_docs.extend(chunks)
        return processed_docs

    def load(self):
        return self._load_and_clean()
```

```python
# Test the updated loader
pdf_path = "data/pdf_files/Generative-AI-and-LLMs-for-Dummies-Snowflake.pdf"
loader = CustomPDFLoader(pdf_path, chunk_size=1000, chunk_overlap=100)
docs = loader.load()

# Show metadata and sample content
for i, doc in enumerate(docs):
    print(f"\n--- Document Chunk {i+1} Metadata ---")
    for key, value in doc.metadata.items():
        print(f"{key}: {value}")
    print(f"Content Preview: {doc.page_content[:200]}...\n")  # Print first 200 characters of content
print(f"Loaded {len(docs)} text chunks from {pdf_path}")
print(docs[0])  # Print a sample cleaned chunk
```


--- Document Chunk 1 Metadata ---
producer: iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version); modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)
creator: PyPDF
creationdate: 2024-01-08T17:12:43+05:30
author: David Baum
moddate: 2024-01-08T20:10:49+05:30
title: Generative AI and LLMs For Dummies®, Snowflake Special Edition
source: data/pdf_files/Generative-AI-and-LLMs-for-Dummies-Snowflake.pdf
total_pages: 52
page: 2
page_label: C2
chunk_method: recursive_splitter
char_count: 117
Content Preview: These materials are John Wiley Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited...


--- Document Chunk 2 Metadata ---
producer: iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version); modified using iTextSharp™ 5.5.0 ©2000-2013 iText Group NV (AGPL-version)
creator: PyPDF
creationdate: 2024-01-08T17:12:43+05:30
author: David Baum
moddate: 2024-01-08T20:10:49+05:30
title: Generative AI and LLMs For Dummies®, Snowflake S