PDF Summarization using Retrieval-Augmented Generation (RAG)

This notebook demonstrates a complete PDF document summarization pipeline using the Retrieval-Augmented Generation (RAG) approach.

🔧 Technology Stack

Model: google/flan-t5-large

Framework: LangChain

Vector Database: ChromaDB

Embedding Model: Sentence Transformers (all-MiniLM-L6-v2)

Pipeline Design: LCEL (LangChain Expression Language)

🎯 Objective

To generate accurate and context-aware summaries of PDF documents by retrieving relevant document chunks from a vector database and passing them to a large language model.

🚫 Explicit Exclusion

This implementation does not use OpenAI APIs and relies entirely on open-source models and tools.

In [1]:
# Document loading and processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Embeddings and vector database
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Prompt and LCEL utilities
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document
# LLM integration
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline





In [2]:
"""
Loads the PDF document and converts each page into a
LangChain Document object.

Each Document contains:
- page_content (text)
- metadata (page number, source)
"""

PDF_PATH = r"sample.pdf"  

loader = PyPDFLoader(PDF_PATH)
documents = loader.load()

print(f"Total pages loaded: {len(documents)}")


Total pages loaded: 28


In [3]:
"""
Initializes an open-source embedding model.

This model converts text chunks into numerical vectors
that capture semantic meaning.
"""
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)



In [4]:
"""
Splits large document text into smaller overlapping chunks.

Why chunking is required:
- LLMs have context length limits
- Vector search works better on smaller chunks
- Improves retrieval accuracy in RAG systems
"""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,      # measured in characters, not tokens; kept small for FLAN-T5's short input window
    chunk_overlap=50
)

split_docs = text_splitter.split_documents(documents)
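
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping character windows. It is a simplified illustration only: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences, and the helper name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Simplified stand-in (hypothetical helper), not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each window starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.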

In [5]:
"""
Stores embeddings inside ChromaDB.

ChromaDB enables:
- Fast similarity search
- Persistent storage
- Efficient retrieval for RAG pipelines
"""
vectorstore = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

In [6]:
"""
Creates a retriever that fetches the most relevant
document chunks based on vector similarity.
"""

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
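
Conceptually, the retriever embeds the query and returns the k chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity on toy 2-D vectors. It is an assumption-laden illustration: ChromaDB uses its own index, its distance metric may differ, and the names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    # Highest similarity first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy "embeddings" standing in for real 384-dimensional vectors.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [1.0, 0.1]
print(top_k(toy_query, toy_docs, k=2))
```

With these toy values the query points almost along the first axis, so the first vector and the diagonal vector rank highest.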

In [7]:
"""
Loads the open-source FLAN-T5-Large model.

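Why FLAN-T5-Large?
- Open weights (Apache 2.0 license)
- Much lighter (about 780M parameters) than 7B-class models such as Mistral or LLaMA
- Small enough to run on CPU, as done here
- Instruction-tuned, so it can follow summarization prompts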
"""
model_id = "google/flan-t5-large"


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=300
)

llm = HuggingFacePipeline(pipeline=pipe)



Device set to use cpu


In [8]:
"""
Prompt template used to guide the LLM.

The LLM receives retrieved document chunks
as 'context' and produces a structured summary.
"""

summary_prompt = PromptTemplate(
    input_variables=["context"],
    template="""
You are an expert document analyst.

Based on the following document excerpts,
generate a concise yet comprehensive summary.
Focus on:
- Main themes
- Key points
- Important conclusions

Document Content:
{context}

Final Summary:
"""
)
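
A rough budget check can show whether the filled-in prompt is likely to fit the model's input window. The sketch below is back-of-the-envelope arithmetic under stated assumptions: about 4 characters per token for English text, roughly 220 characters of instruction text in the template, and a 512-token input window for T5-family models. All three figures are assumptions, and the real count should come from the tokenizer.

```python
CHARS_PER_TOKEN = 4      # assumption: rough rule of thumb for English text
MODEL_INPUT_TOKENS = 512  # assumption: input window commonly cited for T5-family models


def estimate_tokens(num_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate a token count from a character count (rounded up)."""
    return -(-num_chars // chars_per_token)  # ceiling division


chunk_size = 400      # characters per chunk, as set in the splitter
k = 4                 # chunks returned by the retriever
template_chars = 220  # assumption: approximate length of the instruction text

context_tokens = estimate_tokens(chunk_size * k)
prompt_tokens = context_tokens + estimate_tokens(template_chars)

print(f"Estimated context tokens: {context_tokens}")
print(f"Estimated prompt tokens:  {prompt_tokens}")
print(f"Fits in input window:     {prompt_tokens <= MODEL_INPUT_TOKENS}")
```

Under these assumptions the margin is small, so raising chunk_size or k would likely push the prompt past the window.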


In [9]:
"""
LCEL-based RAG pipeline.

This replaces:
- LLMChain
- StuffDocumentsChain
- RetrievalQA

Pipeline Flow:
User Query
→ Retriever
→ Format retrieved documents
→ Prompt
→ LLM
"""

def format_docs(docs: list[Document]) -> str:
    """
    Converts a list of Documents into a single
    formatted string for the prompt context.
    """
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {
        "context": retriever | format_docs,
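        # Forwarded unchanged; summary_prompt only references {context},
        # so this key is currently ignored by the prompt.
        "question": RunnablePassthrough()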
    }
    | summary_prompt
    | llm
)
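
The pipe syntax can be read as ordinary function composition. The sketch below mirrors the flow with plain Python functions. Every function here is a hypothetical stand-in (no LangChain objects are involved), so it only illustrates the data flow, not the library's internals.

```python
def fake_retriever(query: str) -> list[str]:
    """Hypothetical stand-in for the retriever: returns fixed chunks."""
    return ["First retrieved chunk.", "Second retrieved chunk."]


def join_chunks(chunks: list[str]) -> str:
    """Plays the role of format_docs: one string, blank line between chunks."""
    return "\n\n".join(chunks)


def fill_prompt(inputs: dict) -> str:
    """Hypothetical stand-in for the prompt template: only 'context' is used."""
    return f"Document Content:\n{inputs['context']}\n\nFinal Summary:"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the LLM: reports the prompt size it received."""
    return f"[summary generated from a {len(prompt)}-character prompt]"


def run_chain(query: str) -> str:
    # Step 1: the dict stage sends the same input to every branch.
    inputs = {
        "context": join_chunks(fake_retriever(query)),  # retriever | format_docs
        "question": query,                              # RunnablePassthrough()
    }
    # Step 2: each later stage consumes the previous stage's output.
    return fake_llm(fill_prompt(inputs))


print(run_chain("Summarize the document"))
```

Seen this way, the query string is used twice: once to drive retrieval and once as a pass-through value available to the prompt.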



In [10]:
"""
Executes the RAG pipeline to summarize the document.
"""

query = "Summarize the document"

summary = rag_chain.invoke(query)

print("===== DOCUMENT SUMMARY =====\n")
print(summary)


===== DOCUMENT SUMMARY =====

Retrieval-Augmented Generation addresses these limitations by integrating information retrieval mechanisms with language generation. In a RAG pipeline, documents are embedded into vector space using embedding models. These embeddings are stored in a vector database, enabling similarity-based retrieval. Retrieved documents are injected into the prompt context, allowing the language model to produce Retrieval-Augmented Generation addresses these limitations by integrating information retrieval mechanisms with language generation. In a RAG pipeline, documents are embedded into vector space using embedding models. These embeddings are stored in a vector database, enabling similarity-based retrieval. Retrieved documents are injected into the prompt context, allowing the language model to produce Retrieval-Augmented Generation addresses these limitations by integrating information retrieval mechanisms with language generation. In a RAG pipeline, documents are em

In [11]:
"""
Debug / Transparency step:
View which document chunks were retrieved
and used for summarization.
"""

docs = retriever.invoke(query)

for i, doc in enumerate(docs[:2], 1):
    print(f"\n--- Chunk {i} ---")
    print(doc.page_content[:500])



--- Chunk 1 ---
Retrieval-Augmented Generation addresses these limitations by integrating information retrieval mechanisms with language generation. In a RAG pipeline, documents are embedded into vector space using embedding models. These embeddings are stored in a vector database, enabling similarity-based retrieval. Retrieved documents are injected into the prompt context, allowing the language model to produce

--- Chunk 2 ---
Retrieval-Augmented Generation addresses these limitations by integrating information retrieval mechanisms with language generation. In a RAG pipeline, documents are embedded into vector space using embedding models. These embeddings are stored in a vector database, enabling similarity-based retrieval. Retrieved documents are injected into the prompt context, allowing the language model to produce
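
The two chunks shown above are identical, and the generated summary repeats the same passage. One possible explanation (an assumption, not verified here) is that the source PDF repeats this text or that the persisted collection in ./chroma_db already held the same chunks from an earlier run. Either way, a small order-preserving filter can drop duplicate texts before they reach the prompt. The helper name dedupe_texts is hypothetical.

```python
def dedupe_texts(texts: list[str]) -> list[str]:
    """Remove duplicate texts while keeping the first occurrence of each.

    Texts are compared after collapsing whitespace, so chunks that differ
    only in line breaks or spacing count as duplicates.
    """
    seen = set()
    unique = []
    for text in texts:
        key = " ".join(text.split())  # normalize whitespace for comparison
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


retrieved = [
    "RAG combines retrieval with generation.",
    "RAG combines retrieval  with generation.",  # same text, extra space
    "Embeddings are stored in a vector database.",
]
for text in dedupe_texts(retrieved):
    print(text)
```

A filter like this could be applied to the page_content values inside format_docs before they are joined, so the model sees each passage only once.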
