# PDF Loading and Preprocessing Techniques with LangChain

## Overview
This notebook introduces PDF document loading with PyMuPDF and LangChain: discovering PDF files in a directory tree, extracting their text page by page, and keeping the metadata needed for later retrieval and analysis. It also defines helper functions for chunking, embedding, and displaying results; those steps are left commented out in the final cell and are covered in later notebooks.

## Key Features:
- Recursive PDF document discovery
- Flexible directory-based document loading
- Metadata preservation during extraction
- Preparation for semantic vectorization
- Scalable document processing approach

## Technologies Used:
- PyMuPDF
- LangChain Document Loaders
- File system traversal
- Metadata extraction
- Document preprocessing utilities

## Use Cases:
- Medical document intelligence
- Legal document analysis
- Financial record processing
- Academic research document management
- Compliance and regulatory document review

## Activities Covered in This Notebook

1. **PDF Document Discovery**  
    - Implementing recursive directory scanning
    - Identifying and filtering PDF files
    - Creating a comprehensive document collection

2. **PDF Text Extraction**  
    - Utilizing PyMuPDF for precise text extraction
    - Preserving document structure and formatting
    - Handling complex PDF layouts and encodings

3. **Metadata Management**  
    - Extracting document-level metadata
    - Preserving source information
    - Preparing metadata for future processing stages

4. **Document Loading Strategies**  
    - Exploring different loading approaches
    - Managing large document collections
    - Implementing efficient loading mechanisms

5. **Error Handling and Robustness**  
    - Implementing basic error management
    - Handling potential loading exceptions
    - Ensuring consistent document extraction

## What's Next?

This notebook provides a foundational understanding of PDF document loading techniques. In upcoming notebooks, we will explore advanced topics, including:

- **Text Splitting and Chunking**: Breaking documents into semantic chunks
- **Embedding Generation**: Converting text to numerical representations
- **Vector Store Creation**: Indexing documents for semantic search
- **Advanced Retrieval Techniques**: Implementing sophisticated information retrieval methods
- **Metadata Enrichment**: Adding contextual information to document chunks


Stay tuned for more detailed discussions and hands-on examples!

> **Sidenote:** Ensure that you have selected the kernel as the conda environment named `langchain`, as instructed in Lab Guide 1. This is crucial for running the code in this notebook without any issues.


In [1]:
# !pip install langchain_community


In [2]:
# Import required libraries
import os
import warnings
import tiktoken
import faiss
from dotenv import load_dotenv

# Document loading, splitting, embedding, and vector store libraries
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

### `load_pdf_documents`

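The `load_pdf_documents` function recursively loads PDF documents from a specified directory. It finds every PDF file in the directory and its subdirectories, extracts the text with `PyMuPDFLoader`, and returns a single list of LangChain `Document` objects. By default `PyMuPDFLoader` produces one `Document` per page, so the length of the list is the total page count, not the number of files.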

#### **Function Signature**
```python
def load_pdf_documents(directory):
```

#### **Parameters**
- `directory` (str): The path to the directory containing PDF files. This can include subdirectories, as the method performs recursive scanning.

#### **Returns**
- `list`: A list of LangChain `Document` objects, one per PDF page. Each holds the page text in `page_content` and details such as the source path in `metadata`.

#### **Workflow**
1. **Recursive File Discovery**:
    - The function uses `os.walk()` to traverse the directory and its subdirectories.
    - It identifies all files with a `.pdf` extension and stores their paths in a list called `pdfs`.

2. **PDF Loading**:
    - For each PDF file in the `pdfs` list, the function initializes a `PyMuPDFLoader` instance.
    - The loader's `load()` call returns the file's pages as `Document` objects, which are added to the `docs` list.

3. **Return Loaded Documents**:
    - The function returns the `docs` list, which contains the pages of all discovered and loaded PDF files.
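
The discovery step can also be written with `pathlib`. The following is a minimal sketch rather than the notebook's implementation: `find_pdfs` is a hypothetical helper, the case-insensitive suffix check and the sorted output are choices made for this sketch, and the demo files are empty placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory):
    """Return sorted paths of all PDF files under `directory` (hypothetical helper)."""
    root = Path(directory)
    # rglob("*") visits every nested entry; the suffix comparison ignores case
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a throwaway directory tree (file contents are irrelevant here)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    for name in ("a.pdf", "nested/b.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in find_pdfs(tmp)]

print(found)
```

Sorting the paths keeps the loading order stable between runs, which can make page counts and later chunk indices easier to compare.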

#### **Example Usage**
```python
# Load all PDF documents from the specified directory
documents = load_pdf_documents("../dataset/health_docs")

# Check the number of loaded pages (one Document per page)
print(f"Total pages loaded: {len(documents)}")
```

#### **Key Features**
- **Recursive Scanning**: Ensures all PDF files in nested directories are discovered.
- **Flexible Loading**: Uses `PyMuPDFLoader` for robust PDF content extraction.
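- **Scalability**: Loads every page into memory in a single pass, which is simple and works well for small to medium collections; very large collections may need lazy or batched loading.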
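
The overview lists basic error handling, so here is one possible shape for it. This is a sketch under assumptions, not part of the notebook: `load_all` and `fake_loader` are hypothetical names, and the loader is passed in as a callable so the example runs without any PDF files. In the notebook, a callable such as `lambda p: PyMuPDFLoader(p).load()` could be passed instead.

```python
def load_all(paths, load_one):
    """Load every path with `load_one`, collecting failures instead of raising."""
    docs, failures = [], []
    for path in paths:
        try:
            docs.extend(load_one(path))
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            failures.append((path, str(exc)))
    return docs, failures


# Stand-in loader for the demo (assumption: a real loader returns a list of pages)
def fake_loader(path):
    if "broken" in path:
        raise ValueError("cannot open file")
    return [f"page 1 of {path}", f"page 2 of {path}"]


docs, failures = load_all(["a.pdf", "broken.pdf", "c.pdf"], fake_loader)
print(len(docs), failures)
```

Returning the failures alongside the documents lets the caller decide whether to log them, retry, or stop.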



In [3]:
def load_pdf_documents(directory):
    """
    Recursively load PDF documents from a directory.
    
    Args:
        directory (str): Path to the directory containing PDF files
    
    Returns:
        list: Loaded Document objects, one per PDF page
    """
    pdfs = []
    docs = []
    
    # Find all PDF files in the directory tree (extension match ignores case)
    for root, _, files in os.walk(directory):
        pdfs.extend([os.path.join(root, file) for file in files if file.lower().endswith(".pdf")])
    
    # Load each PDF document
    for pdf in pdfs:
        loader = PyMuPDFLoader(pdf)
        docs.extend(loader.load())
    
    return docs

### `chunk_documents`

The `chunk_documents` function splits large documents into smaller, manageable chunks. This is useful when processing lengthy documents for semantic search, embedding generation, and information retrieval.

#### **Function Signature**
```python
def chunk_documents(docs, chunk_size=1000, chunk_overlap=100):
```

#### **Parameters**
- `docs` (list): A list of documents to be split into chunks. Each document is expected to have a `page_content` attribute containing the text.
- `chunk_size` (int): The maximum size of each chunk in terms of characters. Default is 1000.
- `chunk_overlap` (int): The number of overlapping characters between consecutive chunks. Default is 100.

#### **Returns**
- `list`: A list of document chunks, where each chunk is a smaller portion of the original document.

#### **Workflow**
1. **Initialize Text Splitter**:
    - The function uses LangChain's `RecursiveCharacterTextSplitter` to handle the splitting.
    - The splitter is configured with the specified `chunk_size` and `chunk_overlap`.

2. **Split Documents**:
    - The splitter's `split_documents` method is applied to the input `docs`.
    - This produces a list of smaller chunks, each at most `chunk_size` characters long, with up to `chunk_overlap` characters shared between consecutive chunks.

3. **Return Chunks**:
    - The function returns the list of chunks, which can be used for downstream tasks such as embedding generation or vector store creation.
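
To make the roles of `chunk_size` and `chunk_overlap` concrete, the sketch below uses a plain fixed-width sliding window. It is a deliberate simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries. `sliding_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, so text that straddles a boundary appears in both chunks.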

#### **Example Usage**
```python
# Split documents into chunks
chunks = chunk_documents(docs, chunk_size=500, chunk_overlap=50)

# Check the number of chunks created
print(f"Total document chunks: {len(chunks)}")
```



In [4]:
def chunk_documents(docs, chunk_size=1000, chunk_overlap=100):
    """
    Split documents into smaller chunks.
    
    Args:
        docs (list): List of documents to chunk
        chunk_size (int): Size of each document chunk
        chunk_overlap (int): Overlap between chunks
    
    Returns:
        list: List of document chunks
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap
    )
    return text_splitter.split_documents(docs)

### `print_retrieved_docs`

The `print_retrieved_docs` function displays retrieved documents in a clean, readable format. This is useful for reviewing the results of a retrieval step, such as a vector store query or semantic search.

#### **Function Signature**
```python
def print_retrieved_docs(retrieved_docs, max_length=500):
```

#### **Parameters**
- `retrieved_docs` (list): A list of retrieved documents. Each document is expected to expose `metadata` (a dict) and `page_content` (a string) as attributes.
- `max_length` (int): The maximum length of the document content to display. If the content exceeds this length, it will be truncated. Default is 500.

#### **Workflow**
1. **Print Summary**:
    - The function begins by printing the total number of retrieved documents and a separator line for clarity.

2. **Iterate Through Documents**:
    - For each document in the `retrieved_docs` list, the function:
        - Displays the document's index, starting at 1.
        - Prints the `score` and `source` metadata values, falling back to "N/A" for a missing score and "Unknown" for a missing source.

3. **Truncate Content**:
    - If the document's `page_content` is longer than `max_length` characters, the function keeps the first `max_length` characters and appends `... [truncated]`.

4. **Display Content**:
    - The truncated or full content of the document is printed, followed by a separator line for readability.

#### **Example Usage**
```python
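from langchain_core.documents import Document

# Example list of retrieved documents.
# The function uses attribute access (doc.metadata, doc.page_content),
# so the items must be Document objects rather than plain dicts.
retrieved_docs = [
    Document(page_content="This is the content of document 1.", metadata={"score": 0.95, "source": "doc1.pdf"}),
    Document(page_content="This is the content of document 2.", metadata={"score": 0.85, "source": "doc2.pdf"}),
]

# Print the retrieved documents
print_retrieved_docs(retrieved_docs)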
```
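
`print_retrieved_docs` reads `score` from each document's metadata. A plain similarity search typically returns documents without that key, in which case the function prints `N/A`. One way to fill it in, sketched here under assumptions, is to take the `(document, score)` pairs returned by the vector store's `similarity_search_with_score` method and copy each score into the metadata. `attach_scores` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for LangChain `Document` objects so that the example runs on its own.

```python
from types import SimpleNamespace


def attach_scores(doc_score_pairs):
    """Copy each score into its document's metadata and return the documents."""
    docs = []
    for doc, score in doc_score_pairs:
        doc.metadata["score"] = float(score)
        docs.append(doc)
    return docs


# Stand-ins with the same two attributes that print_retrieved_docs reads
pairs = [
    (SimpleNamespace(page_content="first", metadata={"source": "doc1.pdf"}), 0.12),
    (SimpleNamespace(page_content="second", metadata={"source": "doc2.pdf"}), 0.47),
]
scored = attach_scores(pairs)
print([d.metadata for d in scored])
```

With an L2 index the score is a distance, so a lower value means a closer match.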




In [5]:
def print_retrieved_docs(retrieved_docs, max_length=500):
    """
    Print retrieved documents in a clean, readable format.
    
    Args:
        retrieved_docs (list): List of retrieved documents
        max_length (int): Maximum length of content to display
    """
    print("\n--- Retrieved Documents ---")
    print(f"Total documents retrieved: {len(retrieved_docs)}")
    print("-" * 50)
    
    for i, doc in enumerate(retrieved_docs, 1):
        print(f"\nDocument {i}:")
        print(f"Score: {doc.metadata.get('score', 'N/A')}")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
        
        # Truncate content if it's too long
        content = doc.page_content
        if len(content) > max_length:
            content = content[:max_length] + "... [truncated]"
        
        print("\nContent:")
        print(content)
        print("-" * 50)

### `create_vector_store`

The `create_vector_store` function is designed to create a vector store from document chunks. This vector store is used for efficient semantic search and retrieval tasks by embedding the document chunks into a vector space.

#### **Function Signature**
```python
def create_vector_store(chunks, embedding_model='nomic-embed-text', base_url='http://localhost:11434'):
```

#### **Parameters**
- `chunks` (list): A list of document chunks to be embedded and stored in the vector store.
- `embedding_model` (str): The name of the embedding model to be used for generating vector embeddings. Default is `'nomic-embed-text'`.
- `base_url` (str): The base URL for the embedding model service. Default is `'http://localhost:11434'`.

#### **Returns**
- `FAISS`: A FAISS-based vector store containing the embedded document chunks.

#### **Workflow**
1. **Initialize Embeddings**:
    - The function initializes an `OllamaEmbeddings` instance using the specified `embedding_model` and `base_url`.
    - A sample query (`"Hello World"`) is embedded to determine the vector size.

2. **Create FAISS Index**:
    - A FAISS index is created using `faiss.IndexFlatL2` with the determined vector size.
    - The FAISS index is wrapped in a `FAISS` object, which includes:
        - The embedding function.
        - An in-memory document store (`InMemoryDocstore`).
        - A mapping of index IDs to document store IDs.

3. **Add Documents to Vector Store**:
    - The document chunks are embedded and added to the FAISS vector store.

4. **Return Vector Store**:
    - The function returns the FAISS vector store, which can be used for semantic search and retrieval.
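
A flat L2 index performs an exhaustive search: it compares the query vector with every stored vector and returns the closest ones. The sketch below reproduces that idea in plain Python so the ranking is easy to follow. It is an illustration only, not FAISS code; `flat_l2_search` is a hypothetical helper, and the two-dimensional vectors are made up for the demo.

```python
def flat_l2_search(vectors, query, k):
    """Return the k nearest (index, squared L2 distance) pairs by brute force."""
    scored = [
        (i, sum((a - b) ** 2 for a, b in zip(vec, query)))
        for i, vec in enumerate(vectors)
    ]
    # Smaller distance means a closer match
    return sorted(scored, key=lambda pair: pair[1])[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = flat_l2_search(vectors, query=[1.0, 1.0], k=2)
print(nearest)  # [(1, 1.0), (0, 2.0)]
```

Brute-force search is exact but scales linearly with the number of stored vectors, which is usually acceptable for a collection the size of this notebook's dataset.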

#### **Example Usage**
```python
# Create a vector store from document chunks
vector_store = create_vector_store(chunks)

# Example query
question = "What are the benefits of regular exercise?"
retrieved_docs = vector_store.search(query=question, k=5, search_type="similarity")

# Display retrieved documents
print_retrieved_docs(retrieved_docs)
```



In [6]:
def create_vector_store(chunks, embedding_model='nomic-embed-text', base_url='http://localhost:11434'):
    """
    Create a vector store from document chunks.
    
    Args:
        chunks (list): List of document chunks
        embedding_model (str): Name of the embedding model
        base_url (str): Base URL for Ollama embeddings
    
    Returns:
        FAISS: Vector store with embedded documents
    """
    # Initialize embeddings
    embeddings = OllamaEmbeddings(model=embedding_model, base_url=base_url)
    
    # Embed a sample query to determine the embedding dimension
    vector = embeddings.embed_query("Hello World")
    
    # Create FAISS index
    index = faiss.IndexFlatL2(len(vector))
    vector_store = FAISS(
        embedding_function=embeddings,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    
    # Add documents to vector store
    vector_store.add_documents(documents=chunks)
    
    return vector_store

In [7]:
if __name__ == "__main__":
    # Orchestrate document processing and vector store creation
    # Suppress warnings (optional)
    warnings.filterwarnings('ignore')
    
    # Step 1 : Load PDF documents
    docs = load_pdf_documents("../dataset/health_docs")
    
    # Optional: Check document count and content
    print(f"Total Pages loaded: {len(docs)}")
    
   
    
    # Step 2 : Chunk documents
    # chunks = chunk_documents(docs)
    # print(f"Total document chunks: {len(chunks)}")
    
    # Optional: Tokenization check
    # encoding = tiktoken.encoding_for_model("gpt-4o-mini")
    # token_lengths = [len(encoding.encode(chunk.page_content)) for chunk in chunks[:3]]
    # print(f"Token lengths of first 3 chunks: {token_lengths}")
    
    # Step 3 : Create vector store, Embeddings
    # vector_store = create_vector_store(chunks)
    
    # Example retrieval
    # question = "What nutritional supplements support muscle protein synthesis?"
    # retrieved_docs = vector_store.search(query=question, k=5, search_type="similarity")

    # print_retrieved_docs(retrieved_docs)
    
    # Optional: Save vector store
    # db_name = "../health_docs"
    # vector_store.save_local(db_name)
    
    # Step 4 : Load the vector store from the saved location
    
    # Step 5 : Configure retriever with search parameters
    
    # Step 6 : Build the retrieval chain
     
    # Step 7 : RAG-based retrieval and generation

Total Pages loaded: 38
