# PDF-Based Retrieval Augmented Generation (RAG): Intelligent Document Querying

## Overview
This notebook implements a Retrieval Augmented Generation (RAG) pipeline over PDF documents. It indexes document chunks in a FAISS vector store, retrieves the chunks most relevant to a question, and passes them to an Ollama-hosted language model so that the answer is grounded in the documents rather than in the model's general knowledge.

## Key Features
- Semantic document retrieval
- Context-aware response generation
- PDF-based knowledge querying
- Intelligent information extraction
- Flexible RAG pipeline construction

## Technologies Used
- Ollama Language Models
- FAISS Vector Store
- Semantic Retrieval
- Prompt Engineering
- Context-Based Generation
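
To make the FAISS and semantic retrieval entries above more concrete, here is a minimal pure-Python sketch of what an exhaustive (flat) L2 index does conceptually: compare the query vector with every stored vector and keep the k closest. The function name `flat_l2_search` and the toy three-dimensional vectors are illustrative assumptions. Real embeddings come from the embedding model, and FAISS performs this search far more efficiently.

```python
import heapq

def flat_l2_search(query, vectors, k=2):
    """Return (index, squared_l2_distance) pairs for the k stored vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and one stored vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((idx, dist))
    # Smaller distance means more similar, so keep the k smallest
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy stand-ins for embeddings (hypothetical values, not real model output)
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
hits = flat_l2_search([1.0, 0.0, 0.0], stored, k=2)
print(hits)  # nearest vectors first
```

Because lower distance means higher similarity here, results are ordered from smallest to largest distance.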

## Use Cases
- Intelligent document querying
- Medical research information extraction
- Technical documentation analysis
- Contextual question answering
- Knowledge base exploration

## Activities Covered in This Notebook

1. **Vector Store Retrieval**  
    - Loading pre-indexed document vectors
    - Configuring semantic search parameters
    - Retrieving most relevant document chunks

2. **RAG Pipeline Construction**  
    - Designing context-aware prompt templates
    - Integrating retriever with language model
    - Creating flexible generation pipeline

3. **Intelligent Querying**  
    - Performing semantic search
    - Retrieving contextually relevant documents
    - Generating informed responses

4. **Response Generation**  
    - Using retrieved context to ground LLM responses
    - Implementing fallback mechanisms
    - Ensuring response relevance

5. **Error Handling and Robustness**  
    - Managing retrieval and generation exceptions
    - Providing clear user feedback
    - Ensuring pipeline reliability
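
The flow outlined above (retrieve, build a context-aware prompt, generate, fall back when nothing relevant is found, and report errors clearly) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the LangChain pipeline itself: `answer_with_fallback`, `FALLBACK_MESSAGE`, `PROMPT_TEMPLATE`, and the stand-in `fake_retrieve` and `fake_generate` callables are hypothetical names introduced only for this sketch.

```python
FALLBACK_MESSAGE = "I cannot answer this from the available documents."

PROMPT_TEMPLATE = (
    "Answer the question based strictly on the following context:\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def answer_with_fallback(question, retrieve, generate):
    """Retrieve chunks, build a grounded prompt, and generate an answer.

    retrieve: callable taking a question and returning a list of text chunks
    generate: callable taking a prompt string and returning the model's reply
    """
    try:
        chunks = retrieve(question)
    except Exception as exc:
        return f"Retrieval failed: {exc}"

    # Fallback: with no retrieved context, skip the model call entirely
    if not chunks:
        return FALLBACK_MESSAGE

    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)
    try:
        return generate(prompt)
    except Exception as exc:
        return f"Generation failed: {exc}"

# Stand-ins so the sketch runs without a vector store or a model server
def fake_retrieve(question):
    return ["Example chunk about amino acids."] if "amino" in question else []

def fake_generate(prompt):
    return f"(model reply to a prompt of {len(prompt)} characters)"

print(answer_with_fallback("What is an amino acid?", fake_retrieve, fake_generate))
print(answer_with_fallback("Unrelated question", fake_retrieve, fake_generate))
```

Returning the fallback message before calling the model is one design choice among several. The notebook's own prompt template takes a different route and asks the model itself to state when the context is insufficient.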

## What's Next?

This notebook provides a foundational implementation of RAG techniques. For more advanced, practical examples, please refer to the LangChain documentation.



In [1]:
# Import required libraries
import os
import warnings
import faiss
# import tiktoken  # only needed for the optional tokenization check in the main block

# Document loading, chunking, embedding, and vector store libraries
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

def load_pdf_documents(directory):
    """
    Load PDF documents from a specified directory.
    
    Args:
        directory (str): Path to the directory containing PDF files
    
    Returns:
        list: List of loaded documents
    """
    pdfs = []
    docs = []
    
    # Find all PDF files in the specified directory
    for root, _, files in os.walk(directory):
        pdfs.extend([os.path.join(root, file) for file in files if file.endswith(".pdf")])
    
    # Load each PDF document
    for pdf in pdfs:
        loader = PyMuPDFLoader(pdf)
        docs.extend(loader.load())
    
    return docs

def chunk_documents(docs, chunk_size=1000, chunk_overlap=100):
    """
    Split documents into smaller chunks.
    
    Args:
        docs (list): List of documents to chunk
        chunk_size (int): Size of each document chunk
        chunk_overlap (int): Overlap between chunks
    
    Returns:
        list: List of document chunks
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap
    )
    return text_splitter.split_documents(docs)

def create_vector_store(chunks, embedding_model='nomic-embed-text', base_url='http://localhost:11434'):
    """
    Create a vector store from document chunks.
    
    Args:
        chunks (list): List of document chunks
        embedding_model (str): Name of the embedding model
        base_url (str): Base URL for Ollama embeddings
    
    Returns:
        FAISS: Vector store with embedded documents
    """
    # Initialize embeddings
    embeddings = OllamaEmbeddings(model=embedding_model, base_url=base_url)
    
    # Embed a sample string only to determine the embedding dimension
    vector = embeddings.embed_query("Hello World")
    
    # Create FAISS index
    index = faiss.IndexFlatL2(len(vector))
    vector_store = FAISS(
        embedding_function=embeddings,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    
    # Add documents to vector store
    vector_store.add_documents(documents=chunks)
    
    return vector_store

def print_retrieved_docs(retrieved_docs, max_length=500):
    """
    Print retrieved documents in a clean, readable format.
    
    Args:
        retrieved_docs (list): List of retrieved documents
        max_length (int): Maximum length of content to display
    """
    print("\n--- Retrieved Documents ---")
    print(f"Total documents retrieved: {len(retrieved_docs)}")
    print("-" * 50)
    
    for i, doc in enumerate(retrieved_docs, 1):
        print(f"\nDocument {i}:")
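        # A plain similarity search does not store a score in metadata,
        # so this prints "N/A" unless the caller adds one beforehand
        print(f"Score: {doc.metadata.get('score', 'N/A')}")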
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")
        
        # Truncate content if it's too long
        content = doc.page_content
        if len(content) > max_length:
            content = content[:max_length] + "... [truncated]"
        
        print("\nContent:")
        print(content)
        print("-" * 50)

if __name__ == "__main__":
    # Orchestrate document loading, chunking, and vector store creation.
    # Suppress warnings (optional)
    warnings.filterwarnings('ignore')
    
    # Load PDF documents
    docs = load_pdf_documents("../dataset/health_docs")
    
    # Optional: Check document count and content
    print(f"Total documents loaded: {len(docs)}")
    
    # Chunk documents
    chunks = chunk_documents(docs)
    print(f"Total document chunks: {len(chunks)}")
    
    # Optional: Tokenization check
    # encoding = tiktoken.encoding_for_model("gpt-4o-mini")
    # token_lengths = [len(encoding.encode(chunk.page_content)) for chunk in chunks[:3]]
    # print(f"Token lengths of first 3 chunks: {token_lengths}")
    
    # Create vector store
    vector_store = create_vector_store(chunks)
    
    # Example retrieval
    question = "What nutritional supplements support muscle protein synthesis?"
    retrieved_docs = vector_store.search(query=question, k=5, search_type="similarity")

    print_retrieved_docs(retrieved_docs)
    
    # Optional: Save vector store
    db_name = "../health_docs"
    vector_store.save_local(db_name)
    print(f"Vector store saved to {db_name}.")

Total documents loaded: 38
Total document chunks: 201

--- Retrieved Documents ---
Total documents retrieved: 5
--------------------------------------------------

Document 1:
Score: N/A
Source: ../dataset/health_docs/dietary supplements.pdf

Content:
supplements mean products that are concentrated sources of vitamins, minerals, or other
substances with a nutritional or physiological effect (e.g., amino acids, essential fatty acids,
probiotics, plants, and herbal extracts) intended to supplement the regular diet. Dietary
supplements are produced in the form of capsules, tablets, pills, and other similar forms,
designed to be taken in measured small unit quantities [1,2]. Dietary supplements, despite
their route of administration and drug-like... [truncated]
--------------------------------------------------

Document 2:
Score: N/A
Source: ../dataset/health_docs/health supplements.pdf

Content:
women consuming isoflavone supplements (59) and, given the clear evidence of 
estrogenicity, 
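
The `Score: N/A` lines in the output above arise because a plain similarity search returns documents without attaching a score to their metadata. One possible way to surface scores is sketched below. It assumes the vector store exposes LangChain's `similarity_search_with_score` method, which the FAISS wrapper provides, and copies each returned score into the document's metadata before printing. The helper name `search_with_scores` is hypothetical. With an L2 index the score is a distance, so lower values indicate closer matches.

```python
def search_with_scores(vector_store, question, k=5):
    """Run a scored similarity search and store each score in doc.metadata['score']."""
    # Each result is a (Document, score) pair
    results = vector_store.similarity_search_with_score(question, k=k)
    docs = []
    for doc, score in results:
        # Convert to a plain float in case the backend returns a NumPy scalar
        doc.metadata["score"] = float(score)
        docs.append(doc)
    return docs
```

Under these assumptions, `print_retrieved_docs(search_with_scores(vector_store, question))` would print a numeric score for each chunk instead of `N/A`.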

In [2]:
# Imports needed to load the saved index and build the RAG pipeline
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Build RAG pipeline
def build_rag_pipeline(retriever, llm, template):
    """
    Build a retrieval-augmented generation (RAG) pipeline.
    
    Args:
        retriever (Retriever): The retriever for fetching relevant documents
        llm (ChatOllama): The language model for generation
        template (str): The prompt template for the LLM
    
    Returns:
        Runnable: The RAG pipeline
    """
    # Initialize chat prompt from template
    prompt = ChatPromptTemplate.from_template(template)

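    # Join retrieved chunks into one context string, separated by blank lines
    # so that adjacent chunks do not run together in the prompt
    def format_docs(docs):
        return "\n\n".join([doc.page_content for doc in docs])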

    # Build the RAG pipeline
    rag_pipeline = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return rag_pipeline

def main():
    # Define database name and embedding model
    db_name = "../health_docs"  # Update with your actual path
    embedding_model = "nomic-embed-text"  # or "nomic-embed-text"

    # Load vector store
    embeddings = OllamaEmbeddings(model=embedding_model)
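    # The saved store includes pickled data, so loading requires an explicit opt-in;
    # only load index files that you created yourself or otherwise trust
    vector_store = FAISS.load_local(
        db_name, 
        embeddings=embeddings, 
        allow_dangerous_deserialization=True
    )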
    
    # Configure retriever
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}  # retrieve the 5 most similar chunks
    )

    # Initialize the language model
    ollama_model = ChatOllama(
        base_url="http://localhost:11434",
        model='llama3.2:1b',
        temperature=0.5,
        num_predict=512
    )
    
    # Define the prompt template
    template = """You are an expert in health and nutrition. 
    Answer the question based strictly on the following context:

    Context:
    {context}

    Question: {question}

    If the context does not provide sufficient information, clearly state that you cannot provide a comprehensive answer based on the available information."""

    # Build the RAG pipeline
    rag_pipeline = build_rag_pipeline(retriever, ollama_model, template)
    
    # Example RAG-based retrieval and generation
    question = "What nutritional supplements support muscle protein synthesis?"
    try:
        print("\n--- Query ---")
        print(f"Question: {question}")
        
        # Retrieve documents first
        retrieved_docs = retriever.invoke(question)
        
        # Print retrieved documents (optional)
        print("\n--- Retrieved Documents ---")
        for i, doc in enumerate(retrieved_docs, 1):
            print(f"\nDocument {i}:")
            print(doc.page_content[:500] + "...")  # Print first 500 characters
        
        # Invoke RAG pipeline
        result = rag_pipeline.invoke(question)
        
        print("\n--- Generated Answer ---")
        print(result)
    except Exception as e:
        print(f"Error during RAG process: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    main()



--- Query ---
Question: What nutritional supplements support muscle protein synthesis?

--- Retrieved Documents ---

Document 1:
supplements mean products that are concentrated sources of vitamins, minerals, or other
substances with a nutritional or physiological effect (e.g., amino acids, essential fatty acids,
probiotics, plants, and herbal extracts) intended to supplement the regular diet. Dietary
supplements are produced in the form of capsules, tablets, pills, and other similar forms,
designed to be taken in measured small unit quantities [1,2]. Dietary supplements, despite
their route of administration and drug-like...

Document 2:
women consuming isoflavone supplements (59) and, given the clear evidence of 
estrogenicity, there is a likelihood of increased risk of estrogen sensitive cancers in 
consumers of these products.
WEIGHT-LOSS, SPORTS, AND BODYBUILDING SUPPLEMENTS
As more and more of the world population becomes overweight and obese, there is a huge 
market for weight-l