# Assignment 1: PDF Summarization using RAG with Open Source LLM

## Objective
This notebook demonstrates:
- Processing a PDF document and extracting its text
- Creating embeddings and storing them in a ChromaDB vector database
- Using RAG (Retrieval Augmented Generation) with LangChain
- Using the Groq API with an open-source Llama model (instead of OpenAI)

## Requirements
- langchain
- langchain-core
- langchain-community
- langchain-text-splitters
- langchain-groq
- chromadb
- pypdf
- sentence-transformers
- python-dotenv

## Step 1: Import Required Libraries

In [1]:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq

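# RetrievalQA has moved between langchain versions, so try the common locations
try:
    from langchain.chains import RetrievalQA
except ImportError:
    try:
        from langchain.chains.retrieval_qa import RetrievalQA
    except ImportError:
        RetrievalQA = None

# PromptTemplate location may vary between langchain and langchain_core
try:
    from langchain.prompts import PromptTemplate
except ImportError:
    try:
        from langchain_core.prompts import PromptTemplate
    except ImportError:
        PromptTemplate = None

# Fail here with a clear message rather than later with a cryptic "NoneType" error
if RetrievalQA is None or PromptTemplate is None:
    raise ImportError("Could not import RetrievalQA or PromptTemplate; check the installed langchain version")

print("✅ Libraries imported successfully")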

✅ Libraries imported successfully


## Step 2: Load Environment Variables

We load the Groq API key (GROQ_API_KEY) from the .env file in the parent directory.

In [2]:
# Load environment variables from parent directory
env_path = os.path.join('..', '.env')
load_dotenv(env_path)

# Get GROQ API key
groq_api_key = os.getenv('GROQ_API_KEY')

if groq_api_key:
    print("✅ GROQ API key loaded successfully")
else:
    print("❌ GROQ API key not found. Please check your .env file")

✅ GROQ API key loaded successfully
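
As a possible hardening of this step, the lookup could fail fast instead of only printing a message, so that later cells never run with a missing key. The sketch below is illustrative only: `require_env` and `DEMO_API_KEY` are hypothetical names that do not appear in the notebook, and the value used is a placeholder rather than a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in the shell."
        )
    return value


# Demonstration with a placeholder variable (hypothetical name, not a real key).
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this would be called as `require_env('GROQ_API_KEY')` after `load_dotenv(env_path)`, assuming raising an exception is preferred over a printed warning.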


## Step 3: Load and Process PDF Document

We'll load the PDF from the dataset folder and extract its content.

In [3]:
# Define PDF path
pdf_path = os.path.join('dataset', 'sample_document.pdf')

# Load PDF
loader = PyPDFLoader(pdf_path)
documents = loader.load()

print(f"✅ Loaded {len(documents)} page(s) from PDF")
print("\nFirst 500 characters of content:")
print(documents[0].page_content[:500])

✅ Loaded 1 page(s) from PDF

First 500 characters of content:
Artificial Intelligence and Machine Learning
Introduction:
Artificial Intelligence (AI) is revolutionizing how we interact with technology.
Machine Learning, a subset of AI, enables computers to learn from data without
explicit programming.
Key Concepts:
1. Supervised Learning: Training models with labeled data to make predictions.
2. Unsupervised Learning: Finding patterns in unlabeled data.
3. Deep Learning: Using neural networks with multiple layers for complex tasks.
4. Natural Language Proc


## Step 4: Split Text into Chunks

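For better retrieval, we split the document into chunks of at most 500 characters, with up to 50 characters of overlap between neighbouring chunks so that context is not lost at chunk boundaries.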

In [4]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Size of each chunk
    chunk_overlap=50,  # Overlap between chunks
    length_function=len
)

# Split documents
chunks = text_splitter.split_documents(documents)

print(f"✅ Split document into {len(chunks)} chunks")
print("\nExample chunk:")
print(chunks[0].page_content)

✅ Split document into 3 chunks

Example chunk:
Artificial Intelligence and Machine Learning
Introduction:
Artificial Intelligence (AI) is revolutionizing how we interact with technology.
Machine Learning, a subset of AI, enables computers to learn from data without
explicit programming.
Key Concepts:
1. Supervised Learning: Training models with labeled data to make predictions.
2. Unsupervised Learning: Finding patterns in unlabeled data.
3. Deep Learning: Using neural networks with multiple layers for complex tasks.
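
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks rather than at fixed offsets, so its chunk boundaries will differ. The function name `sliding_window_chunks` and the sample string are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return windows


# Small values so the overlap is easy to see by eye.
demo_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each printed chunk begins with the last three characters of the previous one, which is the same idea the splitter applies at a larger scale with `chunk_size=500` and `chunk_overlap=50`.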


## Step 5: Create Embeddings and Store in ChromaDB

We embed each chunk with an open-source HuggingFace model (all-MiniLM-L6-v2) and store the resulting vectors in a ChromaDB vector database persisted to ./chroma_db.

In [5]:
# Initialize embedding model (using open-source HuggingFace model)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

print("✅ Embedding model loaded")

# Create ChromaDB vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

print(f"✅ Created ChromaDB with {len(chunks)} document chunks")
print("   Vector store saved to: ./chroma_db")

✅ Embedding model loaded
✅ Created ChromaDB with 3 document chunks
   Vector store saved to: ./chroma_db
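
Conceptually, a vector store answers a query by comparing the query's embedding against the stored chunk embeddings and returning the closest ones. The sketch below shows that idea with cosine similarity on hand-made three-dimensional vectors. The vectors, the chunk names, and the `top_k` helper are all hypothetical; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and the distance metric ChromaDB actually applies depends on how the collection is configured, so this should be read as an illustration rather than a description of ChromaDB's internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" (made up for illustration, not produced by a real model).
toy_vectors = {
    "chunk_about_concepts": [0.9, 0.1, 0.0],
    "chunk_about_applications": [0.1, 0.9, 0.1],
    "chunk_about_challenges": [0.0, 0.2, 0.9],
}
toy_query = [0.2, 0.8, 0.0]
print(top_k(toy_query, toy_vectors, k=2))
```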


## Step 6: Initialize Open Source LLM (Groq with Llama)

We use the Groq API with the open-source Llama 3.3 70B model (llama-3.3-70b-versatile) instead of OpenAI.

In [16]:
# Initialize Groq LLM with the Llama 3.3 text model (vision model decommissioned)
llm = ChatGroq(
    groq_api_key=groq_api_key,
    model_name="llama-3.3-70b-versatile",  # Llama 3.3 70B text model
    temperature=0.3  # Low temperature for more focused, repeatable answers
)

print("✅ Groq LLM with Llama model initialized")

✅ Groq LLM with Llama model initialized


## Step 7: Create RAG Chain with Custom Prompt

We set up a Retrieval QA chain that uses RAG to answer questions based on the PDF content.

In [17]:
# Create custom prompt template for summarization
prompt_template = """
Use the following context from the document to answer the question.
If you cannot find the answer in the context, say "I cannot find this information in the document."

Context: {context}

Question: {question}

Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Create retriever from vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}  # Retrieve top 3 most relevant chunks
)

# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)

print("✅ RAG chain created successfully")

✅ RAG chain created successfully
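
The "stuff" chain type places all retrieved chunks into a single prompt. The sketch below approximates that step in plain Python so the flow from retrieved text to final prompt is visible. It is not LangChain's implementation: `build_stuff_prompt`, the blank-line separator between chunks, and the sample strings are assumptions made for illustration.

```python
# Same wording as the notebook's prompt template, reproduced here so the sketch is self-contained.
PROMPT_TEXT = (
    "Use the following context from the document to answer the question.\n"
    'If you cannot find the answer in the context, say "I cannot find this information in the document."\n'
    "\n"
    "Context: {context}\n"
    "\n"
    "Question: {question}\n"
    "\n"
    "Answer:\n"
)


def build_stuff_prompt(chunk_texts: list[str], question: str) -> str:
    """Join the retrieved chunks into one context string and fill in the template."""
    context = "\n\n".join(chunk_texts)  # assumed separator between chunks
    return PROMPT_TEXT.format(context=context, question=question)


# Hypothetical retrieved chunks, shortened for readability.
sample_chunks = [
    "Machine Learning, a subset of AI, enables computers to learn from data.",
    "Deep Learning: Using neural networks with multiple layers for complex tasks.",
]
demo_prompt = build_stuff_prompt(sample_chunks, "What is Deep Learning?")
print(demo_prompt)
```

Because every retrieved chunk is included in full, the prompt grows with `k` and with `chunk_size`, which is worth keeping in mind for larger documents.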


## Step 8: Generate Document Summary

Now let's use our RAG system to summarize the PDF document.

In [18]:
# Query for summarization
query = "Provide a comprehensive summary of this document, including main topics, key concepts, and conclusions."

# Get response from RAG chain
result = qa_chain.invoke({"query": query})

print("="*80)
print("DOCUMENT SUMMARY")
print("="*80)
print(result['result'])
print("\n" + "="*80)
print(f"Sources used: {len(result['source_documents'])} document chunks")

DOCUMENT SUMMARY
The document provides an overview of Artificial Intelligence (AI) and Machine Learning (ML), highlighting their potential to revolutionize various aspects of technology and society. The main topics covered include:

1. **Introduction to AI and ML**: AI is introduced as a revolutionary technology, with ML as a subset that enables computers to learn from data without explicit programming.
2. **Key Concepts**: The document outlines four key concepts:
   - **Supervised Learning**: Training models with labeled data for predictions.
   - **Unsupervised Learning**: Finding patterns in unlabeled data.
   - **Deep Learning**: Utilizing neural networks with multiple layers for complex tasks.
   - **Natural Language Processing**: Enabling computers to understand human language.
3. **Applications**: AI and ML have various applications across industries, including:
   - **Healthcare**: Disease diagnosis and drug discovery.
   - **Finance**: Fraud detection and algorithmic trading.


## Step 9: Ask Specific Questions

Let's test the RAG system with specific questions about the document.

In [19]:
# Example questions
questions = [
    "What are the key concepts discussed in this document?",
    "What are the applications of AI mentioned?",
    "What challenges does AI face according to the document?"
]

for i, question in enumerate(questions, 1):
    print(f"\n{'='*80}")
    print(f"Question {i}: {question}")
    print('='*80)
    
    result = qa_chain.invoke({"query": question})
    print(f"\nAnswer:\n{result['result']}")
    print(f"\nRelevant chunks used: {len(result['source_documents'])}")


Question 1: What are the key concepts discussed in this document?

Answer:
The key concepts discussed in this document are:

1. Supervised Learning: Training models with labeled data to make predictions.
2. Unsupervised Learning: Finding patterns in unlabeled data.
3. Deep Learning: Using neural networks with multiple layers for complex tasks.
4. Natural Language Processing: Enabling computers to understand human language.

Relevant chunks used: 3

Question 2: What are the applications of AI mentioned?

Answer:
The applications of AI mentioned are:

1

## Step 10: View Retrieved Context

Let's examine what content was retrieved from the vector database.

In [20]:
# Perform a similarity search
test_query = "What are the applications of AI?"
relevant_docs = vectorstore.similarity_search(test_query, k=3)

print(f"\nTop {len(relevant_docs)} relevant chunks for query: '{test_query}'\n")

for i, doc in enumerate(relevant_docs, 1):
    print(f"Chunk {i}:")
    print("-" * 80)
    print(doc.page_content)
    print("\n")


Top 3 relevant chunks for query: 'What are the applications of AI?'

Chunk 1:
--------------------------------------------------------------------------------
Artificial Intelligence and Machine Learning
Introduction:
Artificial Intelligence (AI) is revolutionizing how we interact with technology.
Machine Learning, a subset of AI, enables computers to learn from data without
explicit programming.
Key Concepts:
1. Supervised Learning: Training models with labeled data to make predictions.
2. Unsupervised Learning: Finding patterns in unlabeled data.
3. Deep Learning: Using neural networks with multiple layers for complex tasks.


Chunk 2:
--------------------------------------------------------------------------------
are working to develop more transparent and responsible AI systems.
Conclusion:
AI and ML continue to transform industries and create new opportunities.
As these technologies evolve, they promise to solve complex problems and
enhance human capabilities in unprecedented wa

## Summary

### What We Accomplished:

1. **PDF Processing**: Loaded and extracted text from a PDF document
2. **Text Chunking**: Split the document into manageable chunks for better retrieval
3. **Embeddings**: Created vector embeddings using HuggingFace's sentence transformers (open-source)
4. **Vector Database**: Stored embeddings in ChromaDB for efficient similarity search
5. **Open Source LLM**: Used Groq API with Llama 3.3 model instead of OpenAI
6. **RAG Implementation**: Built a Retrieval Augmented Generation system using LangChain
7. **Summarization**: Generated summaries and answered questions about the PDF content

### Key Technologies Used:
- **LangChain**: Framework for LLM applications
- **ChromaDB**: Open-source vector database
- **Groq + Llama**: Open-source Llama model served through the Groq API (alternative to OpenAI)
- **HuggingFace Embeddings**: Open-source embedding model
- **RAG**: Retrieval Augmented Generation pattern